nearest_neighbors

Source code: sensai/nearest_neighbors.py

class Neighbor(value: PandasNamedTuple, distance: float)

Bases: object

class NeighborProvider(df_indexed_by_id: pandas.DataFrame)

Bases: ABC

abstract iter_potential_neighbors(value: PandasNamedTuple) → Iterable[PandasNamedTuple]

class AllNeighborsProvider(df_indexed_by_id: pandas.DataFrame)

Bases: NeighborProvider

iter_potential_neighbors(value)

class TimerangeNeighborsProvider(df_indexed_by_id: pandas.DataFrame, timestamps_column='timestamps', past_time_range_days=120, future_time_range_days=120)

Bases: NeighborProvider

iter_potential_neighbors(value: PandasNamedTuple)
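
The class has no description on this page, but its parameter names suggest that candidate neighbors are restricted to a time window around the query's timestamp. A minimal sketch of that idea in plain pandas follows; the helper `rows_in_time_range` is hypothetical (not part of the library), and the inclusive window bounds are an assumption, so the actual filtering may differ.

```python
import pandas as pd


def rows_in_time_range(df, timestamp, timestamps_column="timestamps",
                       past_time_range_days=120, future_time_range_days=120):
    """Hypothetical helper: keep rows whose timestamp lies within the window."""
    lower = timestamp - pd.Timedelta(days=past_time_range_days)
    upper = timestamp + pd.Timedelta(days=future_time_range_days)
    # Inclusive bounds are an assumption of this sketch
    return df[df[timestamps_column].between(lower, upper)]


df = pd.DataFrame(
    {"timestamps": pd.to_datetime(["2019-01-01", "2020-06-01", "2021-06-01"])},
    index=["a", "b", "c"],
)
# Only row "b" lies within 120 days of the query timestamp
nearby = rows_in_time_range(df, pd.Timestamp("2020-05-01"))
```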

class AbstractKnnFinder

Bases: ABC

abstract find_neighbors(named_tuple: PandasNamedTuple, n_neighbors=20) → List[Neighbor]

class CachingKNearestNeighboursFinder(cache: DistanceMetricCache, distance_metric: DistanceMetric, neighbor_provider: NeighborProvider)

Bases: AbstractKnnFinder

A nearest-neighbor finder that caches distance computations in order to speed up repeated neighbor searches for the same data point: for each data point, a pandas.Series of distances to all provided potential neighbors is kept in the cache. If the distance metric is of the composite type LinearCombinationDistanceMetric, its component distance metrics are cached individually, so that the weights of the linear combination can be varied without recomputing any distances.
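
The benefit of caching component metrics can be illustrated with a small, self-contained sketch. It does not use the library's classes; `ComponentSeriesCache` and `combined_distances` are hypothetical names, and metrics are represented as plain functions, which is an assumption made for brevity.

```python
import pandas as pd


class ComponentSeriesCache:
    """Hypothetical cache: one distance Series per (metric name, query id)."""

    def __init__(self):
        self._series = {}
        self.computations = 0  # counts how often a Series is actually computed

    def get(self, metric_name, metric_fn, query_id, query, candidates):
        key = (metric_name, query_id)
        if key not in self._series:
            self.computations += 1
            self._series[key] = pd.Series(
                {cid: metric_fn(query, c) for cid, c in candidates.items()}
            )
        return self._series[key]


def combined_distances(cache, weighted_metrics, query_id, query, candidates):
    """Weighted sum of the cached component Series, sorted ascending."""
    total = None
    for weight, name, fn in weighted_metrics:
        part = weight * cache.get(name, fn, query_id, query, candidates)
        total = part if total is None else total + part
    return total.sort_values()


def dx(p, q):
    return abs(p[0] - q[0])


def dy(p, q):
    return abs(p[1] - q[1])


candidates = {"a": (0.0, 0.0), "b": (3.0, 1.0), "c": (1.0, 5.0)}
query = (1.0, 1.0)
cache = ComponentSeriesCache()

# Two different weightings of the same components for the same query point
first = combined_distances(cache, [(1.0, "dx", dx), (0.0, "dy", dy)], "q", query, candidates)
second = combined_distances(cache, [(0.0, "dx", dx), (1.0, "dy", dy)], "q", query, candidates)
```

Changing the weights alters the neighbor ranking, but each component Series is computed only once; the second call is served entirely from the cache.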

log = <Logger sensai.nearest_neighbors.CachingKNearestNeighboursFinder (WARNING)>

class DistanceMetricCache

Bases: object

A cache for distance metrics which identifies equivalent distance metrics by their string representations. The cache can be passed (consecutively) to multiple KNN models in order to speed up computations for the same test data points. If the cache is reused, it is assumed that the neighbor provider remains the same.
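
Identifying metrics by their string representations can be sketched as follows. Both classes are hypothetical stand-ins written for this illustration and do not reflect the library's internals.

```python
class ScaledAbsDistance:
    """Hypothetical metric whose string representation lists all parameters."""

    def __init__(self, scale):
        self.scale = scale

    def __str__(self):
        return f"ScaledAbsDistance[scale={self.scale}]"

    def distance(self, a, b):
        return self.scale * abs(a - b)


class StringKeyedMetricCache:
    """Hypothetical cache: one entry per distinct string representation."""

    def __init__(self):
        self._entries = {}

    def get_entry(self, metric):
        # setdefault returns the existing entry if the key is already present
        return self._entries.setdefault(str(metric), {})


cache = StringKeyedMetricCache()
entry_a = cache.get_entry(ScaledAbsDistance(2.0))
entry_b = cache.get_entry(ScaledAbsDistance(2.0))  # separate object, same string
entry_c = cache.get_entry(ScaledAbsDistance(3.0))  # different parameter, new entry
```

Under this scheme, a metric whose string representation omitted a parameter would share an entry with differently configured instances, so the representation needs to capture everything that affects the computed distances.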

log = <Logger sensai.nearest_neighbors.CachingKNearestNeighboursFinder.DistanceMetricCache (WARNING)>

get_cached_metric(distance_metric)

class CachedSeriesDistanceMetric(distance_metric)

Bases: object

Provides caching for a wrapped distance metric: the series of all distances to the provided potential neighbors is retained in a cache.

get_distance_series(named_tuple: PandasNamedTuple, potential_neighbor_values)

find_neighbors(named_tuple: PandasNamedTuple, n_neighbors=20) → List[Neighbor]

class KNearestNeighboursFinder(distance_metric: DistanceMetric, neighbor_provider: NeighborProvider)

Bases: AbstractKnnFinder

find_neighbors(named_tuple: PandasNamedTuple, n_neighbors=20) → List[Neighbor]

class KNearestNeighboursClassificationModel(num_neighbors: int, distance_metric: sensai.distance_metric.DistanceMetric, neighbor_provider_factory: typing.Callable[[pandas.DataFrame], sensai.nearest_neighbors.NeighborProvider] = <class 'sensai.nearest_neighbors.AllNeighborsProvider'>, distance_based_weighting=False, distance_epsilon=0.001, distance_metric_cache: typing.Optional[sensai.nearest_neighbors.CachingKNearestNeighboursFinder.DistanceMetricCache] = None, **kwargs)

Bases: VectorClassificationModel

Parameters:

num_neighbors – the number of nearest neighbors to consider
distance_metric – the distance metric to use
neighbor_provider_factory – a factory with which a neighbor provider can be constructed using data
distance_based_weighting – whether to weight neighbors by the inverse of their distance; if False, every neighbor has an equal vote (democratic vote)
distance_epsilon – a distance that is added to all distances for distance-based weighting (in order to avoid zero distances)
distance_metric_cache – a cache for distance metrics, used to speed up repeated computations of the neighbors of the same data point by keeping series of distances cached (particularly for composite distance metrics); see class CachingKNearestNeighboursFinder
kwargs – parameters to pass on to super-classes
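
The interplay of distance_based_weighting and distance_epsilon can be sketched as follows. This is a minimal illustration, assuming the weight of a neighbor is 1 / (distance + distance_epsilon); the function `vote` and the (label, distance) pair representation are hypothetical and not the library's implementation.

```python
from collections import defaultdict


def vote(neighbors, distance_based_weighting=False, distance_epsilon=0.001):
    """neighbors: list of (label, distance); returns label -> normalized weight."""
    weights = defaultdict(float)
    for label, distance in neighbors:
        if distance_based_weighting:
            # Assumed weighting: the epsilon keeps the denominator positive at distance 0
            weights[label] += 1.0 / (distance + distance_epsilon)
        else:
            weights[label] += 1.0  # every neighbor counts equally
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


neighbors = [("spam", 0.0), ("ham", 0.5), ("ham", 1.0)]
uniform = vote(neighbors)
weighted = vote(neighbors, distance_based_weighting=True)
```

With equal votes, the two "ham" neighbors win; with inverse-distance weighting, the single "spam" neighbor at distance 0 dominates.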

find_neighbors(named_tuple)

class KNearestNeighboursRegressionModel(num_neighbors: int, distance_metric: sensai.distance_metric.DistanceMetric, neighbor_provider_factory: typing.Callable[[pandas.DataFrame], sensai.nearest_neighbors.NeighborProvider] = <class 'sensai.nearest_neighbors.AllNeighborsProvider'>, distance_based_weighting=False, distance_epsilon=0.001, distance_metric_cache: typing.Optional[sensai.nearest_neighbors.CachingKNearestNeighboursFinder.DistanceMetricCache] = None, **kwargs)

Bases: VectorRegressionModel

Parameters:

num_neighbors – the number of nearest neighbors to consider
distance_metric – the distance metric to use
neighbor_provider_factory – a factory with which a neighbor provider can be constructed using data
distance_based_weighting – whether to weight neighbors by the inverse of their distance; if False, all neighbors have equal weight
distance_epsilon – a distance that is added to all distances for distance-based weighting (in order to avoid zero distances)
distance_metric_cache – a cache for distance metrics, used to speed up repeated computations of the neighbors of the same data point by keeping series of distances cached (particularly for composite distance metrics); see class CachingKNearestNeighboursFinder
kwargs – parameters to pass on to super-classes
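
For regression, the same parameters can be read as controlling a weighted mean of the neighbors' target values. The sketch below assumes weights of 1 / (distance + distance_epsilon); the function `predict_value` and the (target, distance) pair representation are hypothetical and may differ from how the library aggregates neighbors.

```python
def predict_value(neighbors, distance_based_weighting=False, distance_epsilon=0.001):
    """neighbors: list of (target value, distance); returns the (weighted) mean."""
    if distance_based_weighting:
        # Assumed weighting: inverse of the epsilon-shifted distance
        weights = [1.0 / (distance + distance_epsilon) for _, distance in neighbors]
    else:
        weights = [1.0 for _ in neighbors]
    weighted_sum = sum(w * target for w, (target, _) in zip(weights, neighbors))
    return weighted_sum / sum(weights)


neighbors = [(10.0, 0.0), (20.0, 1.0), (30.0, 1.0)]
plain_mean = predict_value(neighbors)
weighted_mean = predict_value(neighbors, distance_based_weighting=True)
```

The unweighted prediction is the plain mean of the three targets, whereas the weighted prediction stays close to the target of the neighbor at distance 0.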

class FeatureGeneratorNeighbors(num_neighbors: int, neighbor_attributes: typing.List[str], distance_metric: sensai.distance_metric.DistanceMetric, neighbor_provider_factory: typing.Callable[[pandas.DataFrame], sensai.nearest_neighbors.NeighborProvider] = <class 'sensai.nearest_neighbors.AllNeighborsProvider'>, cache: typing.Optional[sensai.util.cache.KeyValueCache] = None, categorical_feature_names: typing.Sequence[str] = (), normalisation_rules: typing.Sequence[sensai.data_transformation.dft.DFTNormalisation.Rule] = ())

Bases: FeatureGeneratorFromNamedTuples

Generates features based on nearest neighbors. For each neighbor, a set of features is added to the output data frame. Each feature has the name “n{0-based neighbor index}_{feature name}”, where the feature names are configurable at construction. The feature “distance”, which indicates the distance of the neighbor to the data point, is always present.

Parameters:

num_neighbors – the number of neighbors for which to generate features
neighbor_attributes – the attributes of the neighbor’s named tuple to include as features (in addition to “distance”)
distance_metric – the distance metric defining which neighbors are near
neighbor_provider_factory – a factory for creating the neighbor provider
cache – an optional key-value cache in which feature values are stored by data point identifier (as given by the DataFrame’s index)
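
The feature naming scheme described above can be sketched as follows. The helper `neighbor_feature_row` and the `Row` type are hypothetical; `Row` merely stands in for a pandas named tuple, and neighbors are given as (value, distance) pairs for illustration.

```python
from collections import namedtuple

# Stand-in for a pandas named tuple (hypothetical columns)
Row = namedtuple("Row", ["Index", "price", "area"])


def neighbor_feature_row(neighbors, neighbor_attributes):
    """Builds one output row with names of the form n{index}_{feature name}."""
    row = {}
    for i, (value, distance) in enumerate(neighbors):
        row[f"n{i}_distance"] = distance  # the distance feature is always present
        for attribute in neighbor_attributes:
            row[f"n{i}_{attribute}"] = getattr(value, attribute)
    return row


neighbors = [(Row("id7", 250.0, 80), 0.4), (Row("id3", 310.0, 95), 1.2)]
features = neighbor_feature_row(neighbors, ["price"])
```

Here only “price” is requested, so each neighbor contributes two columns: its distance and its price.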