vectoriser
Source code: sensai/vectoriser.py
- class Vectoriser(f: Callable[[T], Union[float, ndarray, list]], transformer=None, is_fitted=False)
Bases: Generic[T], ToStringMixin
A vectoriser represents a method for converting instances of some type T into vectors, i.e. one-dimensional (numeric) arrays, or (in the special case of a 1D vector) scalars.
- Parameters:
f – the function which maps from an instance of T to an array/list/scalar
transformer – an optional transformer (e.g. instance of one of the classes in sklearn.preprocessing) which can be used to transform/normalise the generated arrays
is_fitted – whether the vectoriser (and therefore the given transformer) is assumed to be fitted already
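As a rough illustration of the contract described above, the following sketch mimics the conversion step in plain Python. It does not use sensai itself: `to_vector` and the `Point` item type are hypothetical stand-ins, and the result is a plain list rather than the array type indicated by the signatures in this module.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Point:  # hypothetical item type T, for illustration only
    x: float
    y: float


def to_vector(f: Callable, item) -> List[float]:
    """Apply f to item and normalise a scalar or list result to a flat list of floats."""
    value = f(item)
    if isinstance(value, (int, float)):
        return [float(value)]  # a scalar is treated as a one-element vector
    return [float(v) for v in value]


# f may return a scalar ...
norm_vec = to_vector(lambda p: (p.x ** 2 + p.y ** 2) ** 0.5, Point(3.0, 4.0))
# ... or a list of values
coords_vec = to_vector(lambda p: [p.x, p.y], Point(3.0, 4.0))
```

In this sketch both kinds of return value end up as one-dimensional sequences, which is the property the class description relies on.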
- log = <Logger sensai.vectoriser.Vectoriser (WARNING)>
- get_name()
- Returns:
the name of this feature generator, which may be a default name if the name has not been set. Note that feature generators created by a FeatureGeneratorFactory always get the name with which the generator factory was registered.
- apply(item: T, transform=True) -> array
- Parameters:
item – the item to be vectorised
transform – whether to apply this instance’s transformer (if any)
- Returns:
a vector
- apply_multi(items: Iterable[T], transform=True, use_cache=False, verbose=False) -> List[array]
Applies this vectoriser to multiple items at once. Especially for cases where this vectoriser uses a transformer, this method is significantly faster than calling apply repeatedly.
- Parameters:
items – the items to be vectorised
transform – whether to apply this instance’s transformer (if any)
use_cache – whether to cache the results of the value function f given at construction (keeping track of outputs for each input object id), which can significantly speed up computation in cases where an item appears more than once in the collection of items
verbose – whether to generate log messages
- Returns:
a list of vectors
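The caching behaviour described for use_cache can be sketched as follows. This is a minimal stand-in and not the library's implementation: `apply_multi_cached` and `expensive` are hypothetical names, and the only assumption taken from the text above is that results are tracked per input object id.

```python
from typing import Callable, Dict, Iterable, List


def apply_multi_cached(f: Callable, items: Iterable) -> List:
    """Evaluate f once per distinct object (keyed by id) and reuse the result."""
    cache: Dict[int, object] = {}
    results = []
    for item in items:
        key = id(item)  # object identity, not equality
        if key not in cache:
            cache[key] = f(item)
        results.append(cache[key])
    return results


calls = []


def expensive(item):
    """Hypothetical value function; records each call so the saving is visible."""
    calls.append(item)
    return [float(len(item))]


a, b = "alpha", "be"
vectors = apply_multi_cached(expensive, [a, b, a, a])
```

Here the value function runs twice although four items are processed. Keying by `id` is only safe while the objects stay alive, which holds in this sketch because the input list keeps references to them.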
- class EmptyVectoriser
Bases: Vectoriser
- class ItemIdentifierProvider(*args, **kwds)
Bases: Generic[T], ABC
Provides identifiers for sequence items.
- class SequenceVectoriser(vectorisers: Union[Sequence[Vectoriser[T]], Vectoriser[T]], fitting_mode: FittingMode = FittingMode.UNIQUE, unique_id_provider: Optional[ItemIdentifierProvider] = None, refit_vectorisers: bool = True)
Bases: Generic[T], ToStringMixin
Supports the application of Vectorisers to sequences of objects of some type T, where each object of type T is mapped to a vector (1D array) by the vectorisers. A SequenceVectoriser is fitted by fitting the underlying Vectorisers. In order to obtain the instances of T that are used for training, we take into consideration the fact that the sequences of T may overlap and thus training is performed on the set of unique instances.
- Parameters:
vectorisers – zero or more vectorisers that are to be applied. If more than one vectoriser is supplied, vectors are generated from input instances of type T by concatenating the results of the vectorisers in the order the vectorisers are given.
fitting_mode – the fitting mode for vectorisers. If NONE, no fitting takes place. If UNIQUE, the vectorisers are fitted on the set of unique items of type T. By default, uniqueness is determined based on Python object identity; if a custom mechanism for determining an item’s identity is desired, pass unique_id_provider. If CONCAT, the vectorisers are fitted on all items of type T, with the sequences concatenated into a single sequence.
unique_id_provider – an object used to determine item identities when using fitting mode UNIQUE.
refit_vectorisers – whether any vectorisers that have previously been fitted shall be fitted once more when this sequence vectoriser is fitted. Set this to false if you are reusing vectorisers that are also part of another sequence vectoriser that will be fitted/has been fitted before this sequence vectoriser. This can be useful, in particular, in encoder-decoders where the target features are partly the same as the history sequence features, and we want to reuse the latter and their fitted transformers for the target features.
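The concatenation rule stated for the vectorisers parameter can be illustrated with plain functions. This sketch does not use sensai: `vectorise_item`, `vectorise_sequence`, `length` and `vowel_counts` are hypothetical helpers, and plain lists stand in for arrays.

```python
from typing import Callable, List, Sequence


def vectorise_item(item, fs: Sequence[Callable]) -> List[float]:
    """Concatenate the outputs of all value functions, in the order given."""
    vec: List[float] = []
    for f in fs:
        out = f(item)
        # a scalar contributes one component, a list contributes all of its entries
        vec.extend(out if isinstance(out, (list, tuple)) else [out])
    return vec


def vectorise_sequence(seq: Sequence, fs: Sequence[Callable]) -> List[List[float]]:
    """Map every item of the sequence to its concatenated vector."""
    return [vectorise_item(item, fs) for item in seq]


def length(s: str) -> float:
    return float(len(s))


def vowel_counts(s: str) -> List[float]:
    return [float(s.count(v)) for v in "ae"]


result = vectorise_sequence(["tea", "bee"], [length, vowel_counts])
```

Each item yields one vector whose first component comes from `length` and whose remaining components come from `vowel_counts`, matching the order in which the functions were supplied.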
- log = <Logger sensai.vectoriser.SequenceVectoriser (WARNING)>
- class FittingMode(value)
Bases: Enum
Determines how the individual vectorisers are fitted based on several given sequences of objects of type T. If NONE, no fitting is performed; otherwise, the mode determines how a single sequence of objects of type T for fitting is obtained from the collection of sequences: either by forming the set of unique objects from the sequences (UNIQUE) or by concatenating the sequences (CONCAT).
- NONE = 'none'
- UNIQUE = 'unique'
- CONCAT = 'concat'
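To make the difference between the modes concrete, the sketch below derives the items used for fitting from overlapping sequences. It is an illustration under stated assumptions rather than the library's code: the local `FittingMode` enum merely mirrors the documented values, `items_for_fitting` is a hypothetical helper, and uniqueness is decided by object identity as in the documented default.

```python
from enum import Enum
from typing import List, Sequence


class FittingMode(Enum):  # local stand-in mirroring the documented values
    NONE = "none"
    UNIQUE = "unique"
    CONCAT = "concat"


def items_for_fitting(sequences: Sequence[Sequence], mode: FittingMode) -> List:
    """Collect the items on which vectorisers would be fitted."""
    if mode is FittingMode.NONE:
        return []
    if mode is FittingMode.CONCAT:
        # every occurrence counts, so shared items appear repeatedly
        return [item for seq in sequences for item in seq]
    # UNIQUE: keep the first occurrence of each object, by identity
    seen = set()
    unique = []
    for seq in sequences:
        for item in seq:
            if id(item) not in seen:
                seen.add(id(item))
                unique.append(item)
    return unique


a, b, c = object(), object(), object()
windows = [[a, b], [b, c]]  # overlapping sequences sharing the item b
unique_items = items_for_fitting(windows, FittingMode.UNIQUE)
concat_items = items_for_fitting(windows, FittingMode.CONCAT)
```

With overlapping windows, UNIQUE yields three items whereas CONCAT yields four, so a transformer fitted under CONCAT would weight the shared item twice.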
- apply(seq: Sequence[T], transform=True) -> List[array]
Applies vectorisation to the given sequence of objects.
- Parameters:
seq – the sequence to vectorise
transform – whether to apply any post-vectorisation transformers
- Returns:
a list of vectors, one for each item in the sequence
- apply_multi(sequences: Iterable[Sequence[T]], use_cache=False, verbose=False) -> Tuple[List[List[array]], List[int]]
Applies this vectoriser to multiple sequences of objects of type T, where each sequence is mapped to a sequence of 1D arrays. This method can be significantly faster than multiple applications of apply, especially in cases where the vectorisers use transformers.
- Parameters:
sequences – the sequences to vectorise
use_cache – whether to apply caching of the value functions of contained vectorisers (keeping track of outputs for each input object id), which can significantly speed up computation in cases where the given sequences contain individual items more than once
verbose – whether to generate log messages
- Returns:
a pair (vl, l) where vl is a list of lists of vectors/arrays and l is a list of integers containing the lengths of the sequences
- apply_multi_with_padding(sequences: Sequence[Sequence[T]], use_cache=False, verbose=False) -> Tuple[List[List[array]], List[int]]
Applies this vectoriser to multiple sequences of objects of type T, where each sequence is mapped to a sequence of 1D arrays. Sequences are allowed to vary in length: for shorter sequences, zero vectors are appended until the maximum sequence length is reached (padding).
- Parameters:
sequences – the sequences to vectorise
use_cache – whether to apply caching of the value functions of contained vectorisers (keeping track of outputs for each input object id), which can significantly speed up computation in cases where the given sequences contain individual items more than once
verbose – whether to generate log messages
- Returns:
a pair (vl, l) where vl is a list of lists of vectors/arrays, each list having the same length, and l is a list of integers containing the original unpadded lengths of the sequences
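The padding step and the shape of the returned pair can be sketched like this. The helper `pad_sequences` is hypothetical and works on already vectorised sequences given as plain lists; it assumes that all vectors share one dimension, which is taken from the first non-empty sequence.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def pad_sequences(vector_seqs: Sequence[Sequence[Vector]]) -> Tuple[List[List[Vector]], List[int]]:
    """Append zero vectors to shorter sequences and report the original lengths."""
    lengths = [len(seq) for seq in vector_seqs]
    max_len = max(lengths, default=0)
    # assumption: all vectors have the same dimension as the first one found
    dim = next((len(seq[0]) for seq in vector_seqs if seq), 0)
    padded = [
        list(seq) + [[0.0] * dim for _ in range(max_len - len(seq))]
        for seq in vector_seqs
    ]
    return padded, lengths


padded, lengths = pad_sequences([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0]],
])
```

The returned lengths let downstream code tell real vectors from padding, for example when masking the padded positions.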
- class VectoriserRegistry
Bases: object
- register_factory(name: Hashable, factory: Callable[[Callable], Vectoriser], additional_names: Optional[Iterable[Hashable]] = None)
Registers a vectoriser factory, which can subsequently be referenced via its name.
- Parameters:
name – the name (which can, in particular, be a string or an enum item)
factory – the factory, which takes the default transformer factory as an argument
additional_names – (optional) additional names under which to register the factory
- get_vectoriser(name: Hashable, default_transformer_factory: Callable) -> Vectoriser
Creates a vectoriser from a name, which must have been previously registered.
- Parameters:
name – the name (which can, in particular, be a string or an enum item)
default_transformer_factory – the default transformer factory
- Returns:
a new vectoriser instance
- get_vectorisers(names: List[Hashable], default_transformer_factory: Callable) -> List[Vectoriser]
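The registration and lookup pattern documented for the registry can be sketched with a small stand-in class. `SimpleRegistry` is hypothetical and is not the sensai implementation; in particular, the dictionary returned by the example factory only stands in for a vectoriser, and calling the default transformer factory without arguments is an assumption.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Optional


class SimpleRegistry:
    """Hypothetical stand-in mapping names to factories that build vectoriser-like objects."""

    def __init__(self) -> None:
        self._factories: Dict[Hashable, Callable] = {}

    def register_factory(self, name: Hashable, factory: Callable,
                         additional_names: Optional[Iterable[Hashable]] = None) -> None:
        # the same factory is stored under the main name and every alias
        for n in [name, *(additional_names or [])]:
            self._factories[n] = factory

    def get_vectoriser(self, name: Hashable, default_transformer_factory: Callable):
        # the factory receives the default transformer factory and builds a new instance
        return self._factories[name](default_transformer_factory)

    def get_vectorisers(self, names: List[Hashable], default_transformer_factory: Callable) -> list:
        return [self.get_vectoriser(n, default_transformer_factory) for n in names]


registry = SimpleRegistry()
registry.register_factory(
    "length",
    lambda tf: {"f": len, "transformer": tf()},  # dict stands in for a vectoriser
    additional_names=["len"],
)
v = registry.get_vectoriser("len", lambda: "scaler")
both = registry.get_vectorisers(["length", "len"], lambda: "scaler")
```

Passing the transformer factory at lookup time rather than at registration time lets one registered factory be combined with different default transformers.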