feature_generator

Source code: sensai/featuregen/feature_generator.py

exception DuplicateColumnNamesException[source]#: Bases: Exception

class FeatureGenerator(categorical_feature_names: Optional[Union[Sequence[str], str]] = None, normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None, add_categorical_default_rules: bool = True)[source]#

Bases: ToStringMixin, ABC

Base class for feature generators that create a new DataFrame containing feature values from an input DataFrame

Parameters:

categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical. Like normalisation_rules, this is only relevant if a DFTNormalisation object consuming normalisation rules is instantiated and used within a data processing pipeline. It does not affect feature generation.
add_categorical_default_rules – If True, normalisation rules for categorical features (which are unsupported by normalisation) and their corresponding one-hot encoded features (with “_<index>” appended) will be added. It does not affect feature generation.
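
The requirement that a categorical_feature_names regex match only this generator's own features can be illustrated with the standard library alone. This is a minimal sketch, not sensai code: the column names are made up, and the helper `is_categorical` assumes that a feature name counts as categorical only if the regex matches the whole name.

```python
import re

# Hypothetical column names: the first list is produced by "our" feature
# generator, the second by some other generator in the same pipeline.
own_columns = ["weekday", "weekday_is_weekend", "hour"]
foreign_columns = ["weekday_of_birth"]


def is_categorical(name: str, regex: str) -> bool:
    # Assumption of this sketch: the entire feature name has to match.
    return re.fullmatch(regex, name) is not None


# A narrow regex marks exactly one of our own columns as categorical.
narrow = r"weekday"
own_matches = [c for c in own_columns if is_categorical(c, narrow)]
foreign_matches = [c for c in foreign_columns if is_categorical(c, narrow)]

# A regex that is too broad also captures a column of the other generator,
# which is the situation the parameter description warns about.
broad = r"weekday.*"
broad_foreign_matches = [c for c in foreign_columns if is_categorical(c, broad)]
```

Passing an explicit sequence of column names instead of a regex avoids the problem whenever the generated column names are known in advance.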

get_name() → str[source]#

Returns: the name of this feature generator, which may be a default name if the name has not been set. Note that feature generators created by a FeatureGeneratorFactory always get the name with which the generator factory was registered.

set_name(name: str) → None[source]#

get_names() → List[str][source]#

Returns: the list of names of feature generators; will be a list with a single name for a regular feature generator

info()[source]#

get_normalisation_rules(include_generated_categorical_rules=True) → List[Rule][source]#

get_categorical_feature_name_regex() → Optional[str][source]#

is_categorical_feature(feature_name)[source]#

get_generated_column_names() → Optional[List[str]][source]#

Returns: Column names of the data frame generated by the most recent call of the feature generator’s ‘generate’ method. Returns None if generate was never called.

to_dft()[source]#

fit(x: pandas.DataFrame, y: Optional[pandas.DataFrame] = None, ctx=None)[source]#

Fits the feature generator based on the given data

Parameters:

x – the input/features data frame for the learning problem
y – the corresponding output data frame for the learning problem (which will typically contain regression or classification target columns)
ctx – a context object whose functionality may be required for feature generation; this is typically the model instance that this feature generator is to generate inputs for

is_fitted()[source]#

generate(df: pandas.DataFrame, ctx=None) → pandas.DataFrame[source]#

Generates features for the data points in the given data frame

Parameters:

df – the input data frame for which to generate features
ctx – a context object whose functionality may be required for feature generation; this is typically the model instance that this feature generator is to generate inputs for

Returns:

a data frame containing the generated features, which uses the same index as df

fit_generate(x: pandas.DataFrame, y: Optional[pandas.DataFrame] = None, ctx=None) → pandas.DataFrame[source]#

Fits the feature generator and subsequently generates features for the data points in the given data frame

Parameters:

x – the input data frame for the learning problem and for which to generate features
y – the corresponding output data frame for the learning problem (which will typically contain regression or classification target columns)
ctx – a context object whose functionality may be required for feature generation; this is typically the model instance that this feature generator is to generate inputs for

Returns:

a data frame containing the generated features, which uses the same index as x (and y)
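
The fit / generate / fit_generate contract described above can be sketched with a small stand-in class. `MeanCenteringSketch` is hypothetical and does not derive from any sensai class; it only mirrors the documented method signatures to show that state is learned in fit, applied in generate, and that the output keeps the index of the input.

```python
import pandas as pd


class MeanCenteringSketch:
    """Hypothetical stand-in (not a sensai class) mirroring fit/generate."""

    def __init__(self, column: str):
        self.column = column
        self._mean = None

    def fit(self, x: pd.DataFrame, y=None, ctx=None) -> None:
        # Learn state from the training inputs only.
        self._mean = x[self.column].mean()

    def is_fitted(self) -> bool:
        return self._mean is not None

    def generate(self, df: pd.DataFrame, ctx=None) -> pd.DataFrame:
        if not self.is_fitted():
            raise RuntimeError("fit must be called before generate")
        # Build the result on the input's index so that rows stay aligned.
        centred = df[self.column] - self._mean
        return pd.DataFrame({f"{self.column}_centred": centred}, index=df.index)

    def fit_generate(self, x: pd.DataFrame, y=None, ctx=None) -> pd.DataFrame:
        self.fit(x, y, ctx)
        return self.generate(x, ctx)


train = pd.DataFrame({"price": [10.0, 20.0, 30.0]}, index=["a", "b", "c"])
unseen = pd.DataFrame({"price": [40.0]}, index=["z"])

fg = MeanCenteringSketch("price")
train_features = fg.fit_generate(train)  # fits on train, then generates
unseen_features = fg.generate(unseen)    # reuses the mean learned from train
```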

flattened(columns_to_flatten: Optional[List[str]] = None, normalisation_rules=(), normalisation_rule_template: Optional[RuleTemplate] = None, keep_other_columns=True) → ChainedFeatureGenerator[source]#

Returns a new feature generator which returns flattened versions of one or more of the vector-valued columns generated by this feature generator.

Parameters:

columns_to_flatten – the list of columns to flatten; if None, flatten all columns
normalisation_rules – a list of normalisation rules which apply to the flattened columns
normalisation_rule_template – a normalisation rule template which applies to all generated flattened columns
keep_other_columns – if True, any additional columns that are not to be flattened are to be retained by the returned feature generator; if False, additional columns are to be discarded

Returns:

a feature generator which generates the flattened columns

concat(*others: FeatureGenerator) → MultiFeatureGenerator[source]#

Concatenates this feature generator with one or more other feature generators in order to produce a feature generator that jointly generates all features

Parameters: others – other feature generators
Returns: a MultiFeatureGenerator

chain(*others: FeatureGenerator) → ChainedFeatureGenerator[source]#

Chains this feature generator with one or more other feature generators such that each feature generator receives as input the output of the preceding feature generator. The resulting feature generator produces the features of the last element in the chain.

Parameters: others – other feature generators
Returns: a ChainedFeatureGenerator
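
The difference between concat and chain can be shown with plain functions on data frames. This is a conceptual sketch only: `concat_sketch`, `chain_sketch` and the two toy generators are hypothetical helpers, not the sensai implementation.

```python
import pandas as pd


def upper_text(df: pd.DataFrame) -> pd.DataFrame:
    # Toy generator 1: upper-cases the "text" column.
    return pd.DataFrame({"text": df["text"].str.upper()}, index=df.index)


def text_length(df: pd.DataFrame) -> pd.DataFrame:
    # Toy generator 2: computes the length of the "text" column.
    return pd.DataFrame({"length": df["text"].str.len()}, index=df.index)


def concat_sketch(*gens):
    # Every generator sees the same input; the outputs are joined column-wise.
    return lambda df: pd.concat([g(df) for g in gens], axis=1)


def chain_sketch(*gens):
    # Every generator sees the output of its predecessor;
    # only the output of the last generator is returned.
    def run(df):
        for g in gens:
            df = g(df)
        return df
    return run


df = pd.DataFrame({"text": ["ab", "cde"]})
joint = concat_sketch(upper_text, text_length)(df)
chained = chain_sketch(upper_text, text_length)(df)
```

Here `joint` holds both the "text" and the "length" column, whereas `chained` holds only "length", computed on the upper-cased text.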

class RuleBasedFeatureGenerator(categorical_feature_names: Optional[Union[Sequence[str], str]] = None, normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None, add_categorical_default_rules: bool = True)[source]#

Bases: FeatureGenerator, ABC

A feature generator which does not require fitting

Parameters:

categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical. Like normalisation_rules, this is only relevant if a DFTNormalisation object consuming normalisation rules is instantiated and used within a data processing pipeline. It does not affect feature generation.
add_categorical_default_rules – If True, normalisation rules for categorical features (which are unsupported by normalisation) and their corresponding one-hot encoded features (with “_<index>” appended) will be added. It does not affect feature generation.

fit(x, y=None, ctx=None)[source]#

Fits the feature generator based on the given data

Parameters:

x – the input/features data frame for the learning problem
y – the corresponding output data frame for the learning problem (which will typically contain regression or classification target columns)
ctx – a context object whose functionality may be required for feature generation; this is typically the model instance that this feature generator is to generate inputs for

is_fitted()[source]#

class MultiFeatureGenerator(*feature_generators: Union[FeatureGenerator, List[FeatureGenerator]])[source]#

Bases: FeatureGenerator

Wrapper for multiple feature generators. Calling generate here applies all given feature generators independently and returns the concatenation of their outputs

Parameters:

categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical. Like normalisation_rules, this is only relevant if a DFTNormalisation object consuming normalisation rules is instantiated and used within a data processing pipeline. It does not affect feature generation.
add_categorical_default_rules – If True, normalisation rules for categorical features (which are unsupported by normalisation) and their corresponding one-hot encoded features (with “_<index>” appended) will be added. It does not affect feature generation.

fit_generate(x: pandas.DataFrame, y: Optional[pandas.DataFrame] = None, ctx=None) → pandas.DataFrame[source]#

Fits the feature generator and subsequently generates features for the data points in the given data frame

Parameters:

x – the input data frame for the learning problem and for which to generate features
y – the corresponding output data frame for the learning problem (which will typically contain regression or classification target columns)
ctx – a context object whose functionality may be required for feature generation; this is typically the model instance that this feature generator is to generate inputs for

Returns:

a data frame containing the generated features, which uses the same index as x (and y)

is_fitted()[source]#

info()[source]#

get_names() → list[source]#

Returns: the list of names of feature generators; will be a list with a single name for a regular feature generator

class FeatureGeneratorFromNamedTuples(cache: Optional[KeyValueCache] = None, categorical_feature_names: Sequence[str] = (), normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None)[source]#

Bases: FeatureGenerator, ABC

Generates feature values for one data point at a time, creating a dictionary with feature values from each named tuple

Parameters:

categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical. Like normalisation_rules, this is only relevant if a DFTNormalisation object consuming normalisation rules is instantiated and used within a data processing pipeline. It does not affect feature generation.
add_categorical_default_rules – If True, normalisation rules for categorical features (which are unsupported by normalisation) and their corresponding one-hot encoded features (with “_<index>” appended) will be added. It does not affect feature generation.

class FeatureGeneratorTakeColumns(columns: Optional[Union[str, List[str]]] = None, except_columns: Sequence[str] = (), categorical_feature_names: Optional[Union[Sequence[str], str]] = (), normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None, verify_column_names=True)[source]#

Bases: RuleBasedFeatureGenerator

Parameters:

columns – name of the column or list of names of columns to be taken. If None, all columns will be taken.
except_columns – list of names of columns to not take if present in the input df
categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical.
verify_column_names – if True and columns to take were specified, will raise an error in case said columns are missing during feature generation. If False, will log on info level instead
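
The interplay of columns, except_columns and verify_column_names can be sketched in plain pandas. The helper `take_columns_sketch` is hypothetical and not part of sensai; in particular, the text does not say what happens to missing columns once they have been logged, so skipping them is purely an assumption of this sketch.

```python
import logging

import pandas as pd

log = logging.getLogger(__name__)


def take_columns_sketch(df, columns=None, except_columns=(), verify_column_names=True):
    """Hypothetical helper mimicking the documented column selection."""
    if columns is None:
        selected = list(df.columns)  # None means: take all columns
    else:
        selected = [columns] if isinstance(columns, str) else list(columns)
        missing = [c for c in selected if c not in df.columns]
        if missing:
            if verify_column_names:
                raise ValueError(f"Columns missing from input: {missing}")
            log.info("Columns missing from input: %s", missing)
            # Assumption of this sketch: missing columns are simply skipped.
            selected = [c for c in selected if c in df.columns]
    # Excluded columns are dropped only if they are present.
    return df[[c for c in selected if c not in except_columns]]


df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
all_but_c = take_columns_sketch(df, except_columns=["c"])
lenient = take_columns_sketch(df, columns=["a", "zzz"], verify_column_names=False)
```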

info()[source]#

class FeatureGeneratorFlattenColumns(columns: Optional[Union[Sequence[str], str]] = None, categorical_feature_names: Sequence[str] = (), normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None)[source]#

Bases: RuleBasedFeatureGenerator

Instances of this class take columns containing vectors and create a data frame whose columns contain the entries of these vectors.

For example, if columns “vec1”, “vec2” contain vectors of dimensions dim1, dim2, a data frame with dim1+dim2 new columns will be created. It will contain the columns “vec1_<i1>”, “vec2_<i2>” with i1 ranging over 0, …, dim1-1 and i2 ranging over 0, …, dim2-1.

Parameters:

columns – name of the column or list of names of columns to be flattened. If None, all columns will be flattened.
categorical_feature_names –
normalisation_rules –
normalisation_rule_template –
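
The flattening of vector-valued columns into “<column>_<i>” columns can be reproduced in a few lines of pandas. This is a minimal sketch of the idea; `flatten_columns_sketch` is a hypothetical helper, not the class's actual implementation, and it assumes that all vectors within a column have the same length.

```python
import pandas as pd


def flatten_columns_sketch(df: pd.DataFrame, columns) -> pd.DataFrame:
    parts = []
    for col in columns:
        # One row per data point, one column per vector entry.
        values = pd.DataFrame(df[col].tolist(), index=df.index)
        values.columns = [f"{col}_{i}" for i in range(values.shape[1])]
        parts.append(values)
    return pd.concat(parts, axis=1)


df = pd.DataFrame({
    "vec1": [[1, 2], [3, 4]],          # dim1 = 2
    "vec2": [[5, 6, 7], [8, 9, 10]],   # dim2 = 3
})
flat = flatten_columns_sketch(df, ["vec1", "vec2"])
```

The result has dim1+dim2 = 5 columns, with indices starting at 0 for each source column.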

info()[source]#

class FeatureGeneratorFromColumnGenerator(column_gen: ColumnGenerator, take_input_column_if_present=False, is_categorical=False, normalisation_rule_template: RuleTemplate = None)[source]#

Bases: RuleBasedFeatureGenerator

Implements a feature generator via a column generator

Parameters:

column_gen – the underlying column generator
take_input_column_if_present – if True, then if a column whose name corresponds to the column to generate exists in the input data, simply copy it to generate the output (without using the column generator); if False, always apply column_gen to generate the output
is_categorical – whether the resulting column is categorical
normalisation_rule_template – template for a DFTNormalisation for the resulting column. This should only be provided if is_categorical is False

log = <Logger sensai.featuregen.feature_generator.FeatureGeneratorFromColumnGenerator (WARNING)>#

info()[source]#

class ChainedFeatureGenerator(*feature_generators: Union[FeatureGenerator, List[FeatureGenerator]])[source]#

Bases: FeatureGenerator

Chains feature generators such that they are executed one after another. The output of generator i>=1 is the input of generator i+1 in the generator sequence.

Parameters: feature_generators – feature generators to apply in order; the properties of the last feature generator determine the relevant meta-data such as categorical feature names and normalisation rules

is_fitted()[source]#

info()[source]#

fit_generate(x: pandas.DataFrame, y: Optional[pandas.DataFrame] = None, ctx=None) → pandas.DataFrame[source]#

Fits the feature generator and subsequently generates features for the data points in the given data frame

Parameters:

x – the input data frame for the learning problem and for which to generate features
y – the corresponding output data frame for the learning problem (which will typically contain regression or classification target columns)
ctx – a context object whose functionality may be required for feature generation; this is typically the model instance that this feature generator is to generate inputs for

Returns:

a data frame containing the generated features, which uses the same index as x (and y)

class FeatureGeneratorTargetDistribution(columns: Union[str, Sequence[str]], target_column: str, target_column_bins: Optional[Union[Sequence[float], int, pandas.IntervalIndex]], target_column_in_features_df=False, flatten=True)[source]#

Bases: FeatureGenerator

A feature generator which, for a column T (typically the categorical target column of a classification problem or the continuous target column of a regression problem),

can ensure that T takes on a limited set of values t_1, …, t_n by allowing the user to apply binning using given bin boundaries,
computes, for each value c of a categorical column C, the conditional empirical distribution P(T | C=c) in the training data during the training phase,
generates, for each requested column C and value c in the column, n features ‘<C>_<T>_distribution_<t_i>’ = P(T=t_i | C=c) if flatten=True or one feature ‘<C>_<T>_distribution’ = [P(T=t_1 | C=c), …, P(T=t_n | C=c)] if flatten=False.

Being probability values, the features generated by this feature generator are already normalised.
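
The conditional empirical distribution P(T | C=c) and the resulting flattened features can be computed by hand with pandas. This is a sketch of the underlying idea on made-up data, not the generator's implementation; the column names “colour” and “label” are hypothetical, and the feature names follow the ‘<C>_<T>_distribution_<t_i>’ pattern described above.

```python
import pandas as pd

train = pd.DataFrame({
    "colour": ["red", "red", "red", "blue", "blue"],  # categorical column C
    "label": ["yes", "no", "yes", "no", "no"],        # target column T
})

# One row per value c of C, one column per value t_i of T.
# normalize="index" turns the counts into P(T=t_i | C=c), so each row sums to 1.
cond = pd.crosstab(train["colour"], train["label"], normalize="index")
cond.columns = [f"colour_label_distribution_{t}" for t in cond.columns]

# flatten=True style: every data point receives the row belonging to its value of C.
features = train[["colour"]].join(cond, on="colour").drop(columns="colour")
```

For the three “red” rows, two of which carry the label “yes”, the feature colour_label_distribution_yes is 2/3.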

Parameters:

columns – the categorical columns for which to generate distribution features
target_column – the column the distributions over which will make up the features. If target_column_bins is not None, this column will be discretised before computing the conditional distributions
target_column_bins – if not None, specifies the binning to apply via pandas.cut (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html). Note that if a value should match no bin, NaN will be generated. To avoid this when specifying bin boundaries in a list, -inf and +inf should be used as the first and last entries.
target_column_in_features_df – if True, when fitting will look for target_column in the features data frame (x) instead of in the target data frame (y)
flatten – whether to generate a separate scalar feature per distribution value rather than one feature with all of the distribution’s values
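
The note on target_column_bins can be checked directly with pandas.cut. In this small sketch with made-up values, a bounded list of bin edges leaves values outside the range without a bin, while adding -inf and +inf as the outermost edges assigns every value to a bin.

```python
import math

import pandas as pd

target = pd.Series([-5.0, 0.5, 3.0, 100.0])

# Bounded edges: -5.0 and 100.0 fall outside (0, 10] and become NaN.
bounded = pd.cut(target, bins=[0, 1, 10])

# Open-ended edges: every value falls into some bin.
unbounded = pd.cut(target, bins=[-math.inf, 0, 1, 10, math.inf])

n_unmatched_bounded = int(bounded.isna().sum())
n_unmatched_unbounded = int(unbounded.isna().sum())
```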

info()[source]#

class FeatureGeneratorFromVectorModel(vector_model: VectorModel, target_feature_generator: FeatureGenerator, categorical_feature_names: Sequence[str] = (), normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: RuleTemplate = None, input_feature_generator: FeatureGenerator = None, use_target_feature_generator_for_training=False)[source]#

Bases: FeatureGenerator

Provides a feature via predictions of a given model

Parameters:

vector_model – the model whose predictions are used to generate features
target_feature_generator – generator for target to be predicted
categorical_feature_names –
normalisation_rules –
normalisation_rule_template –
input_feature_generator – optional feature generator to be applied to the input of vector_model’s fit and predict
use_target_feature_generator_for_training – if False, this generator will always apply the model to generate features. If True, this generator will use target_feature_generator to generate features, bypassing the model. This is useful for the case where the model which is to receive the generated features shall be trained on the original targets rather than the predictions thereof.

info()[source]#

class FeatureGeneratorMapColumn(input_col_name: str, feature_col_name: str, categorical_feature_names: Optional[Union[Sequence[str], str]] = None, normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None, add_categorical_default_rules=True)[source]#

Bases: RuleBasedFeatureGenerator, ABC

Creates a single feature from a single input column by applying a function to each element of the input column

Parameters:

categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical. Like normalisation_rules, this is only relevant if a DFTNormalisation object consuming normalisation rules is instantiated and used within a data processing pipeline. It does not affect feature generation.
add_categorical_default_rules – If True, normalisation rules for categorical features (which are unsupported by normalisation) and their corresponding one-hot encoded features (with “_<index>” appended) will be added. It does not affect feature generation.

class FeatureGeneratorMapColumnDict(input_col_name: str, categorical_feature_names: Optional[Union[Sequence[str], str]] = None, normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None, add_categorical_default_rules=True)[source]#

Bases: RuleBasedFeatureGenerator, ABC

Creates an arbitrary number of features from a single input column by applying a function to each element of the input column

Parameters:

categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical. Like normalisation_rules, this is only relevant if a DFTNormalisation object consuming normalisation rules is instantiated and used within a data processing pipeline. It does not affect feature generation.
add_categorical_default_rules – If True, normalisation rules for categorical features (which are unsupported by normalisation) and their corresponding one-hot encoded features (with “_<index>” appended) will be added. It does not affect feature generation.
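
The element-wise mapping performed by FeatureGeneratorMapColumn (one feature per element) and FeatureGeneratorMapColumnDict (several features per element) can be approximated in plain pandas. The sketch below does not subclass either class; the timestamp column and the two mapping functions are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"timestamp": ["2024-01-05 13:45", "2024-01-06 08:10"]})


# Single feature: the function returns one value per input element.
def to_hour(s: str) -> int:
    return int(s[11:13])


single = pd.DataFrame({"hour": df["timestamp"].map(to_hour)}, index=df.index)


# Several features: the function returns a dict per input element,
# whose keys become the names of the generated columns.
def to_features(s: str) -> dict:
    return {"hour": int(s[11:13]), "minute": int(s[14:16])}


multi = pd.DataFrame(df["timestamp"].map(to_features).tolist(), index=df.index)
```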

class FeatureGeneratorNAMarker(columns: List[str], value_a=0, value_na=1)[source]#

Bases: RuleBasedFeatureGenerator

Creates features indicating whether another feature is N/A (not available). It can be practical to use this feature generator in conjunction with DFTFillNA for models that cannot handle missing values.

Note: When changing the default values used, use only values that are considered to be normalised when using this feature generator in a context where DFTNormalisation is used (no normalisation is applied to features generated by this feature generator).

Parameters:

columns – the columns for which to generate N/A marker features
value_a – the feature value if the input feature is available
value_na – the feature value if the input feature is not available
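
The marker features can be sketched with pandas as follows. `na_marker_sketch` is a hypothetical helper, not the class's implementation, and the “_na” suffix for the generated column names is an assumption of this sketch. The final line shows the kind of filling that would typically accompany the markers.

```python
import pandas as pd


def na_marker_sketch(df: pd.DataFrame, columns, value_a=0, value_na=1) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    for col in columns:
        # Assumed naming scheme: "<column>_na".
        out[f"{col}_na"] = df[col].isna().map({True: value_na, False: value_a})
    return out


df = pd.DataFrame({"age": [31.0, None, 47.0]})
markers = na_marker_sketch(df, ["age"])

# The model then receives the filled value plus the marker, so the
# information that the value was missing is not lost by the filling.
filled = df.fillna(0.0).join(markers)
```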

flattened_feature_generator(fgen: FeatureGenerator, columns_to_flatten: Optional[List[str]] = None, keep_other_columns=True, normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None)[source]#

Returns a flattening version of the input feature generator.

Parameters:

fgen – the feature generator which generates columns that are to be flattened
columns_to_flatten – list of names of output columns to be flattened; if None, flatten all columns
keep_other_columns – whether any additional columns that are not to be flattened are to be retained by the returned feature generator
normalisation_rules – additional normalisation rules for the flattened output columns
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all flattened output columns

Returns:

FeatureGenerator instance that will generate flattened versions of the specified columns and leave all other output columns as is.

Example:

>>> from sensai.featuregen import FeatureGeneratorTakeColumns, flattened_feature_generator
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({"foo": [[1, 2], [3, 4]], "bar": ["a", "b"]})
>>> fgen = flattened_feature_generator(FeatureGeneratorTakeColumns(), columns_to_flatten=["foo"])
>>> fgen.generate(df)
   foo_0  foo_1 bar
0      1      2   a
1      3      4   b

class FeatureGeneratorFromDFT(dft: DataFrameTransformer, categorical_feature_names: Optional[Union[Sequence[str], str]] = None, normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None, add_categorical_default_rules=True)[source]#

Bases: FeatureGenerator

Parameters:

categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical. Like normalisation_rules, this is only relevant if a DFTNormalisation object consuming normalisation rules is instantiated and used within a data processing pipeline. It does not affect feature generation.
add_categorical_default_rules – If True, normalisation rules for categorical features (which are unsupported by normalisation) and their corresponding one-hot encoded features (with “_<index>” appended) will be added. It does not affect feature generation.