feature_generator#
Source code: sensai/featuregen/feature_generator.py
- class FeatureGenerator(categorical_feature_names: Optional[Union[Sequence[str], str]] = None, normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None, add_categorical_default_rules: bool = True)[source]#
Bases: ToStringMixin, ABC
Base class for feature generators that create a new DataFrame containing feature values from an input DataFrame
- Parameters:
categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical. Like normalisation_rules, this is only relevant if a DFTNormalisation object consuming normalisation rules is instantiated and used within a data processing pipeline. It does not affect feature generation.
add_categorical_default_rules – If True, normalisation rules for categorical features (which are unsupported by normalisation) and their corresponding one-hot encoded features (with “_<index>” appended) will be added. It does not affect feature generation.
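The two ways of specifying categorical_feature_names (a sequence of names or a regex) can be illustrated with a small stdlib-only sketch. It is not sensai's implementation: the helper `resolve_categorical` is hypothetical, and the use of full-string matching is an assumption, since the documentation above does not say how the regex is applied.

```python
import re
from typing import List, Optional, Sequence, Union


def resolve_categorical(
    generated_columns: Sequence[str],
    categorical_feature_names: Optional[Union[Sequence[str], str]],
) -> List[str]:
    """Hypothetical helper: pick the generated columns that count as categorical."""
    if categorical_feature_names is None:
        return []
    if isinstance(categorical_feature_names, str):
        # A string is interpreted as a regex (assumption: full-string match).
        pattern = re.compile(categorical_feature_names)
        return [c for c in generated_columns if pattern.fullmatch(c)]
    # Otherwise it is an explicit sequence of column names.
    return [c for c in generated_columns if c in categorical_feature_names]


generated = ["weekday", "weekday_is_holiday", "price"]
by_regex = resolve_categorical(generated, r"weekday.*")
by_names = resolve_categorical(generated, ["weekday"])
too_broad = resolve_categorical(generated, r".*")  # also captures "price"
```

The last call shows why the regex should be specific: an overly broad pattern would also mark numeric columns (here "price"), or columns of other feature generators, as categorical.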
- get_name() str [source]#
- Returns:
the name of this feature generator, which may be a default name if the name has not been set. Note that feature generators created by a FeatureGeneratorFactory always get the name with which the generator factory was registered.
- get_names() List[str] [source]#
- Returns:
the list of names of feature generators; will be a list with a single name for a regular feature generator
- get_generated_column_names() Optional[List[str]] [source]#
- Returns:
Column names of the data frame generated by the most recent call of the feature generator’s ‘generate’ method. Returns None if generate was never called.
- fit(x: DataFrame, y: Optional[DataFrame] = None, ctx=None)[source]#
Fits the feature generator based on the given data
- Parameters:
x – the input/features data frame for the learning problem
y – the corresponding output data frame for the learning problem (which will typically contain regression or classification target columns)
ctx – a context object whose functionality may be required for feature generation; this is typically the model instance that this feature generator is to generate inputs for
- generate(df: DataFrame, ctx=None) DataFrame [source]#
Generates features for the data points in the given data frame
- Parameters:
df – the input data frame for which to generate features
ctx – a context object whose functionality may be required for feature generation; this is typically the model instance that this feature generator is to generate inputs for
- Returns:
a data frame containing the generated features, which uses the same index as the input data frame df
- fit_generate(x: DataFrame, y: Optional[DataFrame] = None, ctx=None) DataFrame [source]#
Fits the feature generator and subsequently generates features for the data points in the given data frame
- Parameters:
x – the input data frame for the learning problem and for which to generate features
y – the corresponding output data frame for the learning problem (which will typically contain regression or classification target columns)
ctx – a context object whose functionality may be required for feature generation; this is typically the model instance that this feature generator is to generate inputs for
- Returns:
a data frame containing the generated features, which uses the same index as x (and y)
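The relationship between fit, generate and fit_generate can be sketched in plain Python. This is a simplified stand-in and not sensai code: the class `MeanCenteringGenerator`, its attributes and the list-of-dicts row representation are hypothetical. It only illustrates that fitting learns state from the training data, generating applies that state to any data, and fit_generate does both in sequence.

```python
from typing import Dict, List, Optional

Row = Dict[str, float]


class MeanCenteringGenerator:
    """Hypothetical stand-in for a feature generator that requires fitting."""

    def __init__(self, column: str):
        self.column = column
        self.mean: Optional[float] = None  # state learned during fit

    def fit(self, x: List[Row], y=None, ctx=None) -> None:
        # Learn state from the training data only.
        values = [row[self.column] for row in x]
        self.mean = sum(values) / len(values)

    def generate(self, df: List[Row], ctx=None) -> List[Row]:
        if self.mean is None:
            raise RuntimeError("generate was called before fit")
        # One output row per input row, in the same order.
        return [{f"{self.column}_centered": row[self.column] - self.mean} for row in df]

    def fit_generate(self, x: List[Row], y=None, ctx=None) -> List[Row]:
        # Fit first, then generate features for the same data.
        self.fit(x, y, ctx)
        return self.generate(x, ctx)


gen = MeanCenteringGenerator("a")
train_features = gen.fit_generate([{"a": 1.0}, {"a": 3.0}])
test_features = gen.generate([{"a": 5.0}])  # reuses the mean learned on the training data
```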
- flattened(columns_to_flatten: Optional[List[str]] = None, normalisation_rules=(), normalisation_rule_template: Optional[RuleTemplate] = None, keep_other_columns=True) ChainedFeatureGenerator [source]#
Returns a new feature generator which returns flattened versions of one or more of the vector-valued columns generated by this feature generator.
- Parameters:
columns_to_flatten – the list of columns to flatten; if None, flatten all columns
normalisation_rules – a list of normalisation rules which apply to the flattened columns
normalisation_rule_template – a normalisation rule template which applies to all generated flattened columns
keep_other_columns – if True, any additional columns that are not to be flattened are to be retained by the returned feature generator; if False, additional columns are to be discarded
- Returns:
a feature generator which generates the flattened columns
- concat(*others: FeatureGenerator) MultiFeatureGenerator [source]#
Concatenates this feature generator with one or more other feature generators in order to produce a feature generator that jointly generates all features
- Parameters:
others – other feature generators
- Returns:
- chain(*others: FeatureGenerator) ChainedFeatureGenerator [source]#
Chains this feature generator with one or more other feature generators such that each feature generator receives as input the output of the preceding feature generator. The resulting feature generator produces the features of the last element in the chain.
- Parameters:
others – other feature generators
- Returns:
- class RuleBasedFeatureGenerator(categorical_feature_names: Optional[Union[Sequence[str], str]] = None, normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None, add_categorical_default_rules: bool = True)[source]#
Bases: FeatureGenerator, ABC
A feature generator which does not require fitting
- Parameters:
categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical. Like normalisation_rules, this is only relevant if a DFTNormalisation object consuming normalisation rules is instantiated and used within a data processing pipeline. It does not affect feature generation.
add_categorical_default_rules – If True, normalisation rules for categorical features (which are unsupported by normalisation) and their corresponding one-hot encoded features (with “_<index>” appended) will be added. It does not affect feature generation.
- fit(x, y=None, ctx=None)[source]#
Fits the feature generator based on the given data
- Parameters:
x – the input/features data frame for the learning problem
y – the corresponding output data frame for the learning problem (which will typically contain regression or classification target columns)
ctx – a context object whose functionality may be required for feature generation; this is typically the model instance that this feature generator is to generate inputs for
- class MultiFeatureGenerator(*feature_generators: Union[FeatureGenerator, List[FeatureGenerator]])[source]#
Bases: FeatureGenerator
Wrapper for multiple feature generators. Calling generate here applies all given feature generators independently and returns the concatenation of their outputs
- Parameters:
categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical. Like normalisation_rules, this is only relevant if a DFTNormalisation object consuming normalisation rules is instantiated and used within a data processing pipeline. It does not affect feature generation.
add_categorical_default_rules – If True, normalisation rules for categorical features (which are unsupported by normalisation) and their corresponding one-hot encoded features (with “_<index>” appended) will be added. It does not affect feature generation.
- fit_generate(x: DataFrame, y: Optional[DataFrame] = None, ctx=None) DataFrame [source]#
Fits the feature generator and subsequently generates features for the data points in the given data frame
- Parameters:
x – the input data frame for the learning problem and for which to generate features
y – the corresponding output data frame for the learning problem (which will typically contain regression or classification target columns)
ctx – a context object whose functionality may be required for feature generation; this is typically the model instance that this feature generator is to generate inputs for
- Returns:
a data frame containing the generated features, which uses the same index as x (and y)
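The idea behind MultiFeatureGenerator (every generator is applied independently to the same input and the outputs are concatenated) can be sketched without sensai or pandas. The names `concat_generators`, `squares` and `negated` are hypothetical, and tables are modelled as plain dictionaries mapping column names to value lists.

```python
from typing import Callable, Dict

Table = Dict[str, list]  # column name -> list of values (one per row)
Generator = Callable[[Table], Table]


def concat_generators(*generators: Generator) -> Generator:
    """Apply every generator to the same input and join their output columns."""
    def generate(table: Table) -> Table:
        result: Table = {}
        for gen in generators:
            # Each generator sees the original input, not another generator's output.
            result.update(gen(table))
        return result
    return generate


def squares(table: Table) -> Table:
    return {"x_squared": [v * v for v in table["x"]]}


def negated(table: Table) -> Table:
    return {"x_negated": [-v for v in table["x"]]}


combined = concat_generators(squares, negated)
features = combined({"x": [1, 2, 3]})
```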
- class FeatureGeneratorFromNamedTuples(cache: Optional[KeyValueCache] = None, categorical_feature_names: Sequence[str] = (), normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None)[source]#
Bases: FeatureGenerator, ABC
Generates feature values for one data point at a time, creating a dictionary with feature values from each named tuple
- Parameters:
categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical. Like normalisation_rules, this is only relevant if a DFTNormalisation object consuming normalisation rules is instantiated and used within a data processing pipeline. It does not affect feature generation.
add_categorical_default_rules – If True, normalisation rules for categorical features (which are unsupported by normalisation) and their corresponding one-hot encoded features (with “_<index>” appended) will be added. It does not affect feature generation.
- class FeatureGeneratorTakeColumns(columns: Optional[Union[str, List[str]]] = None, except_columns: Sequence[str] = (), categorical_feature_names: Optional[Union[Sequence[str], str]] = (), normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None, verify_column_names=True)[source]#
Bases: RuleBasedFeatureGenerator
- Parameters:
columns – name of the column or list of names of columns to be taken. If None, all columns will be taken.
except_columns – list of names of columns to not take if present in the input df
categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical.
verify_column_names – if True and columns to take were specified, will raise an error in case said columns are missing during feature generation. If False, will log on info level instead
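The interplay of columns, except_columns and verify_column_names can be illustrated with a minimal stdlib-only sketch. The function `take_columns` is hypothetical and operates on a dictionary of columns rather than a DataFrame. The exception type (KeyError) and the choice to skip missing columns when verify_column_names is False are assumptions, as the documentation above only states that an error is raised or an info-level message is logged.

```python
import logging
from typing import Dict, List, Optional, Sequence, Union

Table = Dict[str, list]  # column name -> list of values
log = logging.getLogger(__name__)


def take_columns(
    table: Table,
    columns: Optional[Union[str, List[str]]] = None,
    except_columns: Sequence[str] = (),
    verify_column_names: bool = True,
) -> Table:
    if columns is None:
        selected = list(table)  # take all columns
    else:
        requested = [columns] if isinstance(columns, str) else list(columns)
        missing = [c for c in requested if c not in table]
        if missing:
            if verify_column_names:
                # Assumption: the concrete error type is not documented.
                raise KeyError(f"Columns missing from input: {missing}")
            log.info("Columns missing from input: %s", missing)
        # Assumption: missing columns are skipped when not verifying.
        selected = [c for c in requested if c in table]
    return {c: table[c] for c in selected if c not in except_columns}


data = {"a": [1], "b": [2], "c": [3]}
all_columns = take_columns(data)
without_b = take_columns(data, except_columns=["b"])
only_a = take_columns(data, ["a", "z"], verify_column_names=False)
```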
- class FeatureGeneratorFlattenColumns(columns: Optional[Union[Sequence[str], str]] = None, categorical_feature_names: Sequence[str] = (), normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None)[source]#
Bases: RuleBasedFeatureGenerator
Instances of this class take columns with vectors and create a data frame with columns containing the entries of these vectors.
For example, if columns “vec1”, “vec2” contain vectors of dimensions dim1, dim2, a data frame with dim1+dim2 new columns will be created. It will contain the columns “vec1_<i1>”, “vec2_<i2>” with i1 ranging over 0, …, dim1-1 and i2 ranging over 0, …, dim2-1.
- Parameters:
columns – name of the column or list of names of columns to be flattened. If None, all columns will be flattened.
categorical_feature_names –
normalisation_rules –
normalisation_rule_template –
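The column naming scheme for flattening can be made concrete with a small sketch. It is not sensai's implementation: `flatten_columns` is a hypothetical helper on plain dictionaries, and it assumes that all vectors within a column have the same length.

```python
from typing import Dict, Optional, Sequence, Union

Table = Dict[str, list]  # column name -> list of values (one per row)


def flatten_columns(table: Table, columns: Optional[Union[Sequence[str], str]] = None) -> Table:
    """Turn each vector-valued column "name" into scalar columns "name_0", "name_1", ..."""
    if columns is None:
        to_flatten = list(table)  # flatten all columns
    elif isinstance(columns, str):
        to_flatten = [columns]
    else:
        to_flatten = list(columns)
    result: Table = {}
    for name in to_flatten:
        vectors = table[name]
        dim = len(vectors[0])  # assumption: constant vector length per column
        for i in range(dim):
            result[f"{name}_{i}"] = [vec[i] for vec in vectors]
    return result


flat = flatten_columns({"vec1": [[1, 2], [3, 4]], "vec2": [[5], [6]]})
```

Here dim1 = 2 and dim2 = 1, so three scalar columns are produced: vec1_0, vec1_1 and vec2_0.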
- class FeatureGeneratorFromColumnGenerator(column_gen: ColumnGenerator, take_input_column_if_present=False, is_categorical=False, normalisation_rule_template: RuleTemplate = None)[source]#
Bases: RuleBasedFeatureGenerator
Implements a feature generator via a column generator
- Parameters:
column_gen – the underlying column generator
take_input_column_if_present – if True, then if a column whose name corresponds to the column to generate exists in the input data, simply copy it to generate the output (without using the column generator); if False, always apply column_gen to generate the output
is_categorical – whether the resulting column is categorical
normalisation_rule_template – template for a DFTNormalisation for the resulting column. This should only be provided if is_categorical is False
- log = <Logger sensai.featuregen.feature_generator.FeatureGeneratorFromColumnGenerator (WARNING)>#
- class ChainedFeatureGenerator(*feature_generators: Union[FeatureGenerator, List[FeatureGenerator]])[source]#
Bases: FeatureGenerator
Chains feature generators such that they are executed one after another. The output of generator i>=1 is the input of generator i+1 in the generator sequence.
- Parameters:
feature_generators – feature generators to apply in order; the properties of the last feature generator determine the relevant meta-data such as categorical feature names and normalisation rules
- fit_generate(x: DataFrame, y: Optional[DataFrame] = None, ctx=None) DataFrame [source]#
Fits the feature generator and subsequently generates features for the data points in the given data frame
- Parameters:
x – the input data frame for the learning problem and for which to generate features
y – the corresponding output data frame for the learning problem (which will typically contain regression or classification target columns)
ctx – a context object whose functionality may be required for feature generation; this is typically the model instance that this feature generator is to generate inputs for
- Returns:
a data frame containing the generated features, which uses the same index as x (and y)
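Chaining differs from concatenation in that each generator consumes the output of its predecessor. A minimal sketch of this idea, using hypothetical helper names and plain dictionaries instead of DataFrames, could look as follows.

```python
from functools import reduce
from typing import Callable, Dict

Table = Dict[str, list]  # column name -> list of values (one per row)
Generator = Callable[[Table], Table]


def chain_generators(*generators: Generator) -> Generator:
    """Feed the output of each generator into the next one; return the last output."""
    def generate(table: Table) -> Table:
        return reduce(lambda current, gen: gen(current), generators, table)
    return generate


def take_x(table: Table) -> Table:
    # First stage: keep only column "x".
    return {"x": list(table["x"])}


def double_all(table: Table) -> Table:
    # Second stage: operates on whatever the previous stage produced.
    return {f"{name}_doubled": [2 * v for v in values] for name, values in table.items()}


chained = chain_generators(take_x, double_all)
features = chained({"x": [1, 2], "unused": [7, 8]})
```

Only the features of the last stage appear in the result; the column "unused" never reaches the second stage because the first stage drops it.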
- class FeatureGeneratorTargetDistribution(columns: Union[str, Sequence[str]], target_column: str, target_column_bins: Optional[Union[Sequence[float], int, IntervalIndex]], target_column_in_features_df=False, flatten=True)[source]#
Bases: FeatureGenerator
A feature generator which, for a column T (typically the categorical target column of a classification problem or the continuous target column of a regression problem):
can ensure that T takes on a limited set of values t_1, …, t_n by allowing the user to apply binning using given bin boundaries;
computes for each value c of a categorical column C the conditional empirical distribution P(T | C=c) in the training data during the training phase;
generates, for each requested column C and value c in the column, n features ‘<C>_<T>_distribution_<t_i>’ = P(T=t_i | C=c) if flatten=True or one feature ‘<C>_<T>_distribution’ = [P(T=t_1 | C=c), …, P(T=t_n | C=c)] if flatten=False.
Being probability values, the features generated by this feature generator are already normalised.
- Parameters:
columns – the categorical columns for which to generate distribution features
target_column – the column the distributions over which will make up the features. If target_column_bins is not None, this column will be discretised before computing the conditional distributions
target_column_bins – if not None, specifies the binning to apply via pandas.cut (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html). Note that if a value matches no bin, NaN will be generated. To avoid this when specifying bin boundaries in a list, -inf and +inf should be used as the first and last entries.
target_column_in_features_df – if True, when fitting will look for target_column in the features data frame (X) instead of in the target data frame (Y)
flatten – whether to generate a separate scalar feature per distribution value rather than one feature with all of the distribution’s values
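The conditional empirical distribution P(T | C=c) and the resulting feature names can be sketched with the standard library. This is an illustration under stated assumptions, not sensai's implementation: the functions `fit_conditional_distribution` and `distribution_features` are hypothetical, target values are ordered by sorting, binning is omitted, and categories not seen during fitting are not handled.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def fit_conditional_distribution(
    categories: Sequence[Hashable], targets: Sequence[Hashable]
) -> Tuple[List[Hashable], Dict[Hashable, List[float]]]:
    """Estimate P(T=t | C=c) from co-occurrence counts in the training data."""
    target_values = sorted(set(targets))
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for c, t in zip(categories, targets):
        counts[c][t] += 1
    distributions = {
        c: [counter[t] / sum(counter.values()) for t in target_values]
        for c, counter in counts.items()
    }
    return target_values, distributions


def distribution_features(column, target, c, target_values, distributions, flatten=True):
    """Build the feature(s) for one value c, following the naming scheme above."""
    probs = distributions[c]
    if flatten:
        return {f"{column}_{target}_distribution_{t}": p for t, p in zip(target_values, probs)}
    return {f"{column}_{target}_distribution": probs}


colour = ["red", "red", "blue", "red"]
label = ["yes", "no", "yes", "yes"]
values, dists = fit_conditional_distribution(colour, label)
blue_flat = distribution_features("colour", "label", "blue", values, dists)
red_vector = distribution_features("colour", "label", "red", values, dists, flatten=False)
```

Each distribution sums to 1, which is why the generated features need no further normalisation.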
- class FeatureGeneratorFromVectorModel(vector_model: VectorModel, target_feature_generator: FeatureGenerator, categorical_feature_names: Sequence[str] = (), normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: RuleTemplate = None, input_feature_generator: FeatureGenerator = None, use_target_feature_generator_for_training=False)[source]#
Bases: FeatureGenerator
Provides a feature via predictions of a given model
- Parameters:
vector_model – the model whose predictions are used to generate the features
target_feature_generator – generator for target to be predicted
categorical_feature_names –
normalisation_rules –
normalisation_rule_template –
input_feature_generator – optional feature generator to be applied to the input of vector_model’s fit and predict
use_target_feature_generator_for_training – if False, this generator will always apply the model to generate features. If True, this generator will use target_feature_generator to generate features, bypassing the model. This is useful for the case where the model which is to receive the generated features shall be trained on the original targets rather than the predictions thereof.
- class FeatureGeneratorMapColumn(input_col_name: str, feature_col_name: str, categorical_feature_names: Optional[Union[Sequence[str], str]] = None, normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None, add_categorical_default_rules=True)[source]#
Bases: RuleBasedFeatureGenerator, ABC
Creates a single feature from a single input column by applying a function to each element of the input column
- Parameters:
categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical. Like normalisation_rules, this is only relevant if a DFTNormalisation object consuming normalisation rules is instantiated and used within a data processing pipeline. It does not affect feature generation.
add_categorical_default_rules – If True, normalisation rules for categorical features (which are unsupported by normalisation) and their corresponding one-hot encoded features (with “_<index>” appended) will be added. It does not affect feature generation.
- class FeatureGeneratorMapColumnDict(input_col_name: str, categorical_feature_names: Optional[Union[Sequence[str], str]] = None, normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None, add_categorical_default_rules=True)[source]#
Bases: RuleBasedFeatureGenerator, ABC
Creates an arbitrary number of features from a single input column by applying a function to each element of the input column
- Parameters:
categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical. Like normalisation_rules, this is only relevant if a DFTNormalisation object consuming normalisation rules is instantiated and used within a data processing pipeline. It does not affect feature generation.
add_categorical_default_rules – If True, normalisation rules for categorical features (which are unsupported by normalisation) and their corresponding one-hot encoded features (with “_<index>” appended) will be added. It does not affect feature generation.
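The idea of deriving several features from each element of a single input column can be sketched as follows. The sketch does not use sensai: `map_column_to_features` and `split_timestamp` are hypothetical names, the name of the method that subclasses actually have to implement is not stated above and is therefore not shown, and it is assumed that the mapping function returns the same keys for every element.

```python
from typing import Any, Callable, Dict

Table = Dict[str, list]  # column name -> list of values (one per row)


def map_column_to_features(
    table: Table, input_col_name: str, fn: Callable[[Any], Dict[str, Any]]
) -> Table:
    """Apply fn to each element of one column; each returned key becomes a feature column."""
    result: Table = {}
    for value in table[input_col_name]:
        for name, feature_value in fn(value).items():
            result.setdefault(name, []).append(feature_value)
    return result


def split_timestamp(ts: str) -> Dict[str, int]:
    # Hypothetical mapping function: one timestamp string -> two features.
    date, time = ts.split(" ")
    return {"month": int(date.split("-")[1]), "hour": int(time.split(":")[0])}


features = map_column_to_features(
    {"ts": ["2024-03-01 09:30", "2024-11-15 18:05"]}, "ts", split_timestamp
)
```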
- class FeatureGeneratorNAMarker(columns: List[str], value_a=0, value_na=1)[source]#
Bases: RuleBasedFeatureGenerator
Creates features indicating whether another feature is N/A (not available). It can be practical to use this feature generator in conjunction with DFTFillNA for models that cannot handle missing values.
Note: When changing the default values used, use only values that are considered to be normalised when using this feature generator in a context where DFTNormalisation is used (no normalisation is applied to features generated by this feature generator).
- Parameters:
columns – the columns for which to generate N/A marker features
value_a – the feature value if the input feature is available
value_na – the feature value if the input feature is not available
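The effect of value_a and value_na can be shown with a small stdlib-only sketch. The helper names are hypothetical, the output column name "<column>_na" is an assumption (the naming used by sensai is not stated above), and missing values are taken to be None or NaN.

```python
import math
from typing import Any, Dict, List

Table = Dict[str, list]  # column name -> list of values (one per row)


def is_missing(value: Any) -> bool:
    # Treat None and float NaN as "not available".
    return value is None or (isinstance(value, float) and math.isnan(value))


def na_markers(table: Table, columns: List[str], value_a=0, value_na=1) -> Table:
    """Create one marker column per input column (hypothetical naming "<column>_na")."""
    return {
        f"{col}_na": [value_na if is_missing(v) else value_a for v in table[col]]
        for col in columns
    }


markers = na_markers({"age": [31, None, float("nan")], "name": ["a", "b", "c"]}, ["age"])
```

With the defaults 0 and 1, the marker values already lie in a normalised range, which matches the note above about choosing normalised values.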
- flattened_feature_generator(fgen: FeatureGenerator, columns_to_flatten: Optional[List[str]] = None, keep_other_columns=True, normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None)[source]#
Return a flattening version of the input feature generator.
- Parameters:
fgen – the feature generator which generates columns that are to be flattened
columns_to_flatten – list of names of output columns to be flattened; if None, flatten all columns
keep_other_columns – whether any additional columns that are not to be flattened are to be retained by the returned feature generator
normalisation_rules – additional normalisation rules for the flattened output columns
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all flattened output columns
- Returns:
FeatureGenerator instance that will generate flattened versions of the specified columns and leave all other output columns as is.
- Example:
>>> from sensai.featuregen import FeatureGeneratorTakeColumns, flattened_feature_generator
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({"foo": [[1, 2], [3, 4]], "bar": ["a", "b"]})
>>> fgen = flattened_feature_generator(FeatureGeneratorTakeColumns(), columns_to_flatten=["foo"])
>>> fgen.generate(df)
   foo_0  foo_1 bar
0      1      2   a
1      3      4   b
- class FeatureGeneratorFromDFT(dft: DataFrameTransformer, categorical_feature_names: Optional[Union[Sequence[str], str]] = None, normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None, add_categorical_default_rules=True)[source]#
Bases: FeatureGenerator
- Parameters:
categorical_feature_names – either a sequence of column names or a regex that is to match all categorical feature names. The regex must match only the features generated by this feature generator, i.e. it should not match feature names generated by other feature generators. It will be ensured that the respective columns in the generated data frames have dtype ‘category’. Furthermore, the presence of this meta-information can later be leveraged for further transformations, e.g. one-hot encoding.
normalisation_rules – Rules to be used by DFTNormalisation (e.g. for constructing an input transformer for a model). These rules are only relevant if a DFTNormalisation object consuming them is instantiated and used within a data processing pipeline. They do not affect feature generation.
normalisation_rule_template – This parameter can be supplied instead of normalisation_rules for the case where there shall be a single rule that applies to all columns generated by this feature generator that were not labeled as categorical. Like normalisation_rules, this is only relevant if a DFTNormalisation object consuming normalisation rules is instantiated and used within a data processing pipeline. It does not affect feature generation.
add_categorical_default_rules – If True, normalisation rules for categorical features (which are unsupported by normalisation) and their corresponding one-hot encoded features (with “_<index>” appended) will be added. It does not affect feature generation.