vectoriser

class Vectoriser(f: Callable[[T], Union[float, numpy.ndarray, list]], transformer=None, is_fitted=False)[source]#

Bases: Generic[T], ToStringMixin

A vectoriser represents a method for converting instances of some type T into vectors, i.e. one-dimensional (numeric) arrays, or, in the special case of a vector with a single element, scalars.

Parameters:

f – the function which maps from an instance of T to an array/list/scalar
transformer – an optional transformer (e.g. instance of one of the classes in sklearn.preprocessing) which can be used to transform/normalise the generated arrays
is_fitted – whether the vectoriser (and therefore the given transformer) is assumed to be fitted already
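
As a rough illustration of the idea (not sensAI's implementation), the sketch below uses a hypothetical stand-in class, MiniVectoriser, built only on the standard library. It shows how the output of f (a scalar or a list) can be normalised to a one-dimensional vector and then optionally passed through a transformer. The MinMaxScaler1D helper and its fit/transform method names are assumptions made for this example.

```python
from typing import Callable, Generic, Iterable, List, Sequence, TypeVar, Union

T = TypeVar("T")


class MinMaxScaler1D:
    """Hypothetical transformer: rescales every vector component to [0, 1]."""

    def fit(self, vectors: Sequence[Sequence[float]]) -> None:
        columns = list(zip(*vectors))
        self.lo = [min(c) for c in columns]
        self.hi = [max(c) for c in columns]

    def transform(self, vector: Sequence[float]) -> List[float]:
        return [
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(vector, self.lo, self.hi)
        ]


class MiniVectoriser(Generic[T]):
    """Hypothetical stand-in mimicking the behaviour described above."""

    def __init__(self, f: Callable[[T], Union[float, list]], transformer=None):
        self.f = f
        self.transformer = transformer

    def _to_vector(self, item: T) -> List[float]:
        value = self.f(item)
        # A scalar is treated as a vector with a single element.
        if isinstance(value, (int, float)):
            return [float(value)]
        return [float(x) for x in value]

    def fit(self, items: Iterable[T]) -> None:
        # Only the transformer needs fitting; f itself is stateless here.
        if self.transformer is not None:
            self.transformer.fit([self._to_vector(i) for i in items])

    def apply(self, item: T, transform: bool = True) -> List[float]:
        vector = self._to_vector(item)
        if transform and self.transformer is not None:
            vector = self.transformer.transform(vector)
        return vector


words = ["a", "abc", "abcde"]
length_vectoriser = MiniVectoriser(len, transformer=MinMaxScaler1D())
length_vectoriser.fit(words)
raw = length_vectoriser.apply("abc", transform=False)  # untransformed vector
scaled = length_vectoriser.apply("abc")  # vector after min-max scaling
```

In this sketch, transform=False returns the raw output of f as a vector, while the default applies the fitted transformer on top of it.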

log = <Logger sensai.vectoriser.Vectoriser (WARNING)>#

is_fitted() → bool[source]#

set_name(name)[source]#

get_name()[source]#

Returns:

the name of this vectoriser, which may be a default name if the name has not been set. Note that feature generators created by a FeatureGeneratorFactory always get the name with which the generator factory was registered.

fit(items: Iterable[T])[source]#

apply(item: T, transform=True) → numpy.array[source]#

Parameters:

item – the item to be vectorised
transform – whether to apply this instance’s transformer (if any)

Returns:

a vector

apply_multi(items: Iterable[T], transform=True, use_cache=False, verbose=False) → List[numpy.array][source]#

Applies this vectoriser to multiple items at once. Especially for cases where this vectoriser uses a transformer, this method is significantly faster than calling apply repeatedly.

Parameters:

items – the items to be vectorised
transform – whether to apply this instance’s transformer (if any)
use_cache – whether to cache the results of the value function f given at construction (keeping track of outputs for each input object id), which can significantly speed up computation in cases where an item appears more than once in the collection of items
verbose – whether to generate log messages

Returns:

a list of vectors
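
The caching described for use_cache can be pictured with the following standard-library sketch. It is only an illustration of caching keyed by object id, not the library's code; the helper apply_with_id_cache and the example function expensive are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, TypeVar

T = TypeVar("T")


def apply_with_id_cache(f: Callable[[T], List[float]], items: Iterable[T]) -> List[List[float]]:
    """Applies f to each item, reusing results for objects already seen (by id)."""
    items = list(items)  # keep references alive so that ids remain unique
    cache: Dict[int, List[float]] = {}
    results = []
    for item in items:
        key = id(item)
        if key not in cache:
            cache[key] = f(item)  # f is evaluated only once per object
        results.append(cache[key])
    return results


calls = []


def expensive(item: dict) -> List[float]:
    """Hypothetical value function that records each invocation."""
    calls.append(item["name"])
    return [float(len(item["name"]))]


a, b = {"name": "alpha"}, {"name": "be"}
vectors = apply_with_id_cache(expensive, [a, b, a, a])
```

Because the cache key is the object id, only repeated occurrences of the same object benefit; two distinct objects that merely compare equal would each trigger a call to f in this sketch.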

class ResultType(value)[source]#

Bases: Enum

An enumeration.

SCALAR = 0#

LIST = 1#

NUMPY_ARRAY = 2#

classmethod from_value(y)[source]#

class EmptyVectoriser[source]#

Bases: Vectoriser

class ItemIdentifierProvider[source]#

Bases: Generic[T], ABC

Provides identifiers for sequence items.

abstract get_identifier(item: T) → Hashable[source]#

class SequenceVectoriser(vectorisers: Union[Sequence[Vectoriser[T]], Vectoriser[T]], fitting_mode: FittingMode = FittingMode.UNIQUE, unique_id_provider: Optional[ItemIdentifierProvider] = None, refit_vectorisers: bool = True)[source]#

Bases: Generic[T], ToStringMixin

Supports the application of Vectorisers to sequences of objects of some type T, where each object of type T is mapped to a vector (1D array) by the vectorisers. A SequenceVectoriser is fitted by fitting the underlying Vectorisers. Because the sequences of T may overlap, the default fitting mode (UNIQUE) fits the vectorisers on the set of unique instances of T rather than on every occurrence.

Parameters:

vectorisers – zero or more vectorisers that are to be applied. If more than one vectoriser is supplied, vectors are generated from input instances of type T by concatenating the results of the vectorisers in the order the vectorisers are given.
fitting_mode – the fitting mode for vectorisers. If NONE, no fitting takes place. If UNIQUE, fit vectorisers on the unique set of items of type T. By default, uniqueness is determined based on Python object identity. If a custom mechanism for determining an item’s identity is desired, pass unique_id_provider. If CONCAT, fit vectorisers based on all items of type T, concatenating them to a single sequence.
unique_id_provider – an object used to determine item identities when using fitting mode UNIQUE.
refit_vectorisers – whether any vectorisers that have previously been fitted shall be fitted once more when this sequence vectoriser is fitted. Set this to false if you are reusing vectorisers that are also part of another sequence vectoriser that will be fitted/has been fitted before this sequence vectoriser. This can be useful, in particular, in encoder-decoders where the target features are partly the same as the history sequence features, and we want to reuse the latter and their fitted transformers for the target features.
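
The concatenation behaviour described for the vectorisers parameter can be sketched as follows. This is a simplified standard-library illustration in which plain functions stand in for vectorisers; vectorise_item, vectorise_sequence, length and vowel_flags are hypothetical names.

```python
from typing import Callable, List, Sequence


def vectorise_item(item: str, fs: Sequence[Callable[[str], List[float]]]) -> List[float]:
    """Concatenates the outputs of all functions, in the order given."""
    vector: List[float] = []
    for f in fs:
        vector.extend(f(item))
    return vector


def vectorise_sequence(seq: Sequence[str], fs: Sequence[Callable[[str], List[float]]]) -> List[List[float]]:
    """Maps each item of the sequence to one concatenated vector."""
    return [vectorise_item(item, fs) for item in seq]


def length(s: str) -> List[float]:
    return [float(len(s))]


def vowel_flags(s: str) -> List[float]:
    # Two components: does the word start / end with a vowel?
    return [float(s[0] in "aeiou"), float(s[-1] in "aeiou")]


vectors = vectorise_sequence(["apple", "kiwi"], [length, vowel_flags])
```

Each item yields a vector whose first component comes from length and whose remaining two components come from vowel_flags, so the per-item dimension is the sum of the individual output dimensions.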

log = <Logger sensai.vectoriser.SequenceVectoriser (WARNING)>#

class FittingMode(value)[source]#

Bases: Enum

Determines how the individual vectorisers are fitted based on several sequences of objects of type T that are given. If NONE, no fitting is performed; otherwise the mode determines how a single sequence of objects of type T for fitting is obtained from the collection of sequences: either by forming the set of unique objects from the sequences (UNIQUE) or by concatenating all sequences into a single sequence (CONCAT).

NONE = 'none'#

UNIQUE = 'unique'#

CONCAT = 'concat'#
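
The difference between the modes can be illustrated with a small standard-library sketch. It is an assumption-laden illustration rather than the library's code: Mode is a hypothetical mirror of the enum above, and items_for_fitting with its get_identifier argument is a hypothetical helper standing in for the role of an ItemIdentifierProvider.

```python
from enum import Enum
from typing import Callable, Hashable, Iterable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


class Mode(Enum):
    """Hypothetical mirror of the three fitting modes."""
    NONE = "none"
    UNIQUE = "unique"
    CONCAT = "concat"


def items_for_fitting(
    sequences: Iterable[Sequence[T]],
    mode: Mode,
    get_identifier: Optional[Callable[[T], Hashable]] = None,
) -> Optional[List[T]]:
    """Collects the items on which vectorisers would be fitted."""
    if mode is Mode.NONE:
        return None  # no fitting takes place
    if mode is Mode.CONCAT:
        # every occurrence counts, including repeated items
        return [item for seq in sequences for item in seq]
    # UNIQUE: identity is the Python object id unless a custom key is given
    key = get_identifier if get_identifier is not None else id
    unique = {}
    for seq in sequences:
        for item in seq:
            unique.setdefault(key(item), item)
    return list(unique.values())


x, y, z = {"id": 1}, {"id": 2}, {"id": 3}
y_copy = {"id": 2}  # distinct object that represents the same item as y
sequences = [[x, y], [y, z, y_copy]]

n_concat = len(items_for_fitting(sequences, Mode.CONCAT))
n_unique = len(items_for_fitting(sequences, Mode.UNIQUE))
n_unique_by_key = len(items_for_fitting(sequences, Mode.UNIQUE, lambda d: d["id"]))
```

With object identity, y and y_copy count as two items; supplying an identifier function merges them, which is the situation a custom identifier provider is meant to address.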

fit(data: Iterable[Sequence[T]])[source]#

apply(seq: Sequence[T], transform=True) → List[numpy.array][source]#

Applies vectorisation to the given sequence of objects

Parameters:

seq – the sequence to vectorise
transform – whether to apply any post-vectorisation transformers

Returns:

a list of vectors, one for each item in the sequence

apply_multi(sequences: Iterable[Sequence[T]], use_cache=False, verbose=False) → Tuple[List[List[numpy.array]], List[int]][source]#

Applies this vectoriser to multiple sequences of objects of type T, where each sequence is mapped to a sequence of 1D arrays. This method can be significantly faster than multiple applications of apply, especially in cases where the vectorisers use transformers.

Parameters:

sequences – the sequences to vectorise
use_cache – whether to apply caching of the value functions of contained vectorisers (keeping track of outputs for each input object id), which can significantly speed up computation in cases where the given sequences contain individual items more than once
verbose – whether to generate log messages

Returns:

a pair (vl, l) where vl is a list of lists of vectors/arrays and l is a list of integers containing the lengths of the sequences

apply_multi_with_padding(sequences: Sequence[Sequence[T]], use_cache=False, verbose=False) → Tuple[List[List[numpy.array]], List[int]][source]#

Applies this vectoriser to multiple sequences of objects of type T, where each sequence is mapped to a sequence of 1D arrays. Sequences are allowed to vary in length. For shorter sequences, 0-vectors are appended until the maximum sequence length is reached (padding).

Parameters:

sequences – the sequences to vectorise
use_cache – whether to apply caching of the value functions of contained vectorisers (keeping track of outputs for each input object id), which can significantly speed up computation in cases where the given sequences contain individual items more than once
verbose – whether to generate log messages

Returns:

a pair (vl, l) where vl is a list of lists of vectors/arrays, each list having the same length, and l is a list of integers containing the original unpadded lengths of the sequences
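
The padding scheme described above can be sketched with plain lists. This is a minimal illustration that operates on already-vectorised sequences and assumes all vectors share the same dimension; pad_vector_sequences is a hypothetical helper, not part of the library.

```python
from typing import List, Sequence, Tuple


def pad_vector_sequences(
    vector_seqs: Sequence[Sequence[List[float]]],
) -> Tuple[List[List[List[float]]], List[int]]:
    """Appends zero vectors to shorter sequences; returns (padded, original lengths)."""
    lengths = [len(seq) for seq in vector_seqs]
    max_length = max(lengths, default=0)
    # Vector dimension is taken from the first non-empty sequence.
    dim = next((len(seq[0]) for seq in vector_seqs if len(seq) > 0), 0)
    padded = [
        list(seq) + [[0.0] * dim for _ in range(max_length - len(seq))]
        for seq in vector_seqs
    ]
    return padded, lengths


vl, l = pad_vector_sequences([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0]],
])
```

After padding, every inner list in vl has the same length, while l preserves the original lengths so that downstream code can mask out the padded positions.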

get_vector_dim(seq: Sequence[T])[source]#

Determines the dimensionality of generated vectors by applying the vectoriser to the given sequence

Parameters:

seq – the sequence

Returns:

the number of dimensions in generated output vectors (per item)

class VectoriserRegistry[source]#

Bases: object

get_available_vectorisers()[source]#

register_factory(name: Hashable, factory: Callable[[Callable], Vectoriser], additional_names: Optional[Iterable[Hashable]] = None)[source]#

Registers a vectoriser factory, which can subsequently be referenced via its name (or any of the additional names).

Parameters:

name – the name (which can, in particular, be a string or an enum item)
factory – the factory, which takes the default transformer factory as an argument
additional_names – (optional) additional names under which to register the factory

get_vectoriser(name: Hashable, default_transformer_factory: Callable) → Vectoriser[source]#

Creates a vectoriser from a name, which must have been previously registered.

Parameters:

name – the name (which can, in particular, be a string or an enum item)
default_transformer_factory – the default transformer factory

Returns:

a new vectoriser instance
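
The registry pattern described by register_factory and get_vectoriser can be sketched as follows. This is a hypothetical stand-in using only the standard library: MiniRegistry, SimpleVectoriser and IdentityTransformer are illustrative names, and raising ValueError for duplicate or unknown names is an assumption of this sketch rather than documented behaviour.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Optional


class IdentityTransformer:
    """Hypothetical transformer that leaves vectors unchanged."""

    def transform(self, vector: List[float]) -> List[float]:
        return vector


class SimpleVectoriser:
    """Hypothetical vectoriser: a value function plus a transformer."""

    def __init__(self, f: Callable, transformer):
        self.f = f
        self.transformer = transformer


class MiniRegistry:
    """Hypothetical registry mapping names to vectoriser factories."""

    def __init__(self) -> None:
        self._factories: Dict[Hashable, Callable[[Callable], SimpleVectoriser]] = {}

    def register_factory(
        self,
        name: Hashable,
        factory: Callable[[Callable], SimpleVectoriser],
        additional_names: Optional[Iterable[Hashable]] = None,
    ) -> None:
        for n in [name, *(additional_names or [])]:
            if n in self._factories:  # assumption: duplicates are rejected
                raise ValueError(f"Factory already registered for {n!r}")
            self._factories[n] = factory

    def get_available_vectorisers(self) -> List[Hashable]:
        return list(self._factories)

    def get_vectoriser(self, name: Hashable, default_transformer_factory: Callable) -> SimpleVectoriser:
        if name not in self._factories:  # assumption: unknown names are an error
            raise ValueError(f"No factory registered for {name!r}")
        # The factory receives the default transformer factory as its argument.
        return self._factories[name](default_transformer_factory)


registry = MiniRegistry()
registry.register_factory(
    "length",
    lambda transformer_factory: SimpleVectoriser(len, transformer_factory()),
    additional_names=["len"],
)
v1 = registry.get_vectoriser("length", IdentityTransformer)
v2 = registry.get_vectoriser("len", IdentityTransformer)
```

Each lookup calls the factory again, so v1 and v2 are separate instances, each with its own transformer, even though both names refer to the same factory.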

get_vectorisers(names: List[Hashable], default_transformer_factory: Callable) → List[Vectoriser][source]#