dft
Source code: sensai/data_transformation/dft.py
- class DataFrameTransformer
Bases: ABC, ToStringMixin
Base class for data frame transformers, i.e. objects which can transform one data frame into another (possibly in-place, i.e. by applying the transformation to the original data frame). A data frame transformer may require being fitted using training data.
- get_name() -> str
- Returns:
the name of this data frame transformer, which may be a default name if the name has not been set.
- to_feature_generator(categorical_feature_names: Optional[Union[Sequence[str], str]] = None, normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None, add_categorical_default_rules=True)
- class DFTFromFeatureGenerator(fgen: FeatureGenerator)
Bases: DataFrameTransformer
- class InvertibleDataFrameTransformer
Bases: DataFrameTransformer, ABC
- get_inverse() -> InverseDataFrameTransformer
- Returns:
a transformer whose (forward) transformation is the inverse transformation of this DFT
- class RuleBasedDataFrameTransformer
Bases: DataFrameTransformer, ABC
Base class for transformers whose logic is entirely based on rules and does not need to be fitted to data
- class InverseDataFrameTransformer(invertible_dft: InvertibleDataFrameTransformer)
- class DataFrameTransformerChain(*data_frame_transformers: Union[DataFrameTransformer, List[DataFrameTransformer]])
Bases: DataFrameTransformer
Supports the application of a chain of data frame transformers. During fit and apply each transformer in the chain receives the transformed output of its predecessor.
- find_first_transformer_by_type(cls) -> Optional[DataFrameTransformer]
- append(t: DataFrameTransformer)
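The chaining behaviour can be illustrated without sensAI. The following is a minimal sketch in plain pandas, not the library's implementation: the classes `DropNegative`, `Centre` and `Chain` are hypothetical stand-ins, and the method names `fit`, `apply` and `fit_apply` are assumptions chosen for illustration. It shows that each step is fitted on, and applied to, the output of its predecessor.

```python
import pandas as pd


class DropNegative:
    """Hypothetical rule-based step: needs no fitting."""

    def fit(self, df: pd.DataFrame) -> None:
        pass

    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        return df[df["x"] >= 0]


class Centre:
    """Hypothetical fitted step: subtracts the mean seen during fit."""

    def fit(self, df: pd.DataFrame) -> None:
        self.mean = df["x"].mean()

    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        out["x"] = out["x"] - self.mean
        return out


class Chain:
    """Hypothetical chain: every step sees the output of the step before it."""

    def __init__(self, *steps):
        self.steps = list(steps)

    def fit_apply(self, df: pd.DataFrame) -> pd.DataFrame:
        for step in self.steps:
            step.fit(df)         # fitted on the predecessor's output
            df = step.apply(df)  # its own output is passed on to the next step
        return df


df = pd.DataFrame({"x": [-10.0, 1.0, 3.0]})
result = Chain(DropNegative(), Centre()).fit_apply(df)
```

Because the negative row is removed before `Centre` is fitted, the mean is computed from the remaining values only, which is the point of passing transformed output along the chain.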
- class DFTRenameColumns(columns_map: Dict[str, str])
Bases: RuleBasedDataFrameTransformer
- Parameters:
columns_map – dictionary mapping old column names to new names
- class DFTConditionalRowFilterOnColumn(column: str, condition: Callable[[Any], bool])
Bases: RuleBasedDataFrameTransformer
Filters a data frame by applying a boolean function to one of the columns and retaining only the rows for which the function returns True
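In plain pandas, this kind of cell-wise filtering might be sketched as follows. It illustrates the behaviour described above and is not sensAI's implementation; the data frame and the function `is_adult` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "age": [17, 42, 65]})


def is_adult(age) -> bool:
    # Boolean function evaluated on a single cell of the column
    return age >= 18


# Keep only the rows for which the function returns True
filtered = df[df["age"].apply(is_adult)]
```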
- class DFTInSetComparisonRowFilterOnColumn(column: str, set_to_keep: Set)
Bases: RuleBasedDataFrameTransformer
Filters a data frame on the selected column and retains only the rows for which the value is in set_to_keep
- class DFTNotInSetComparisonRowFilterOnColumn(column: str, set_to_drop: Set)
Bases: RuleBasedDataFrameTransformer
Filters a data frame on the selected column and retains only the rows for which the value is not in set_to_drop
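The two set-based filters correspond to a membership test and its negation. A minimal pandas sketch of the equivalent operations, with made-up data and not taken from the sensAI source:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Berlin", "Paris", "Rome", "Paris"], "value": [1, 2, 3, 4]})
cities = {"Paris", "Rome"}

# Rows whose value is in the set are retained
kept = df[df["city"].isin(cities)]

# Rows whose value is in the set are dropped
dropped = df[~df["city"].isin(cities)]
```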
- class DFTVectorizedConditionalRowFilterOnColumn(column: str, vectorized_condition: Callable[[Series], Sequence[bool]])
Bases: RuleBasedDataFrameTransformer
Filters a data frame by applying a vectorized condition on the selected column and retaining only the rows for which it returns True
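A vectorized condition receives the whole column at once and returns one boolean per row. The sketch below shows the idea in plain pandas; the function `is_warm` and the data are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"temp": [12.5, 30.1, 18.0]})


def is_warm(series: pd.Series) -> pd.Series:
    # A single comparison on the entire Series instead of one call per cell
    return series > 15.0


warm = df[is_warm(df["temp"])]
```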
- class DFTRowFilter(condition: Callable[[Any], bool])
Bases: RuleBasedDataFrameTransformer
Filters a data frame by applying a condition function to each row and retaining only the rows for which it returns True
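When the condition depends on several columns, it has to see the entire row. A plain pandas sketch of such a row-wise filter, with a hypothetical condition `is_consistent`:

```python
import pandas as pd

df = pd.DataFrame({"low": [1, 5, 3], "high": [4, 2, 9]})


def is_consistent(row) -> bool:
    # The condition is evaluated once per row and can access every column
    return row["low"] <= row["high"]


consistent = df[df.apply(is_consistent, axis=1)]
```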
- class DFTModifyColumn(column: str, column_transform: Union[Callable, ufunc])
Bases: RuleBasedDataFrameTransformer
Modifies a column specified by ‘column’ using ‘column_transform’
- Parameters:
column – the name of the column to be modified
column_transform – a function operating on single cells or a Numpy ufunc that applies to an entire Series
- class DFTModifyColumnVectorized(column: str, column_transform: Callable[[ndarray], Union[Sequence, Series, ndarray]])
Bases: RuleBasedDataFrameTransformer
Modifies a column specified by ‘column’ using ‘column_transform’. This transformer can be used to utilise Numpy vectorisation for performance optimisation.
- Parameters:
column – the name of the column to be modified
column_transform – a function that takes a Numpy array and from which the returned value will be assigned to the column as a whole
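The difference between the cell-wise, ufunc and vectorised variants can be sketched in plain pandas and NumPy. This is an illustration under assumed data, not sensAI code; `clip_to_range` is a hypothetical transform.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [1.0, 10.0, 100.0]})

# Cell-wise: the Python function is called once per value
doubled = df["price"].apply(lambda v: v * 2)

# NumPy ufunc: a single call on the entire Series
logged = np.log10(df["price"])


def clip_to_range(values: np.ndarray) -> np.ndarray:
    # Vectorised function operating on the underlying NumPy array
    return np.clip(values, 5.0, 50.0)


# The returned array is assigned to the column as a whole
df["price"] = clip_to_range(df["price"].to_numpy())
```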
- class DFTOneHotEncoder(columns: Optional[Union[Sequence[str], str]], categories: Optional[Union[List[ndarray], Dict[str, ndarray]]] = None, inplace=False, ignore_unknown=False, array_valued_result=False)
Bases: DataFrameTransformer
One hot encode categorical variables
- Parameters:
columns – list of names or regex matching names of columns that are each to be replaced by a list of one-hot encoded columns (or by an array-valued column if array_valued_result=True); if None, no columns are one-hot encoded
categories – numpy arrays containing the possible values of each of the specified columns (for case where sequence is specified in ‘columns’) or dictionary mapping column name to array of possible categories for the column name. If None, the possible values will be inferred from the columns
inplace – whether to perform the transformation in-place
ignore_unknown – if True and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. If False, an unknown category will raise an error.
array_valued_result – whether to replace the input columns by columns of the same name containing arrays as values instead of creating a separate column per original value
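To make the parameters concrete, the following NumPy sketch encodes one column against a fixed list of categories. It mimics the described behaviour for an unknown category when it is ignored (all zeros) and shows both result layouts. The column naming scheme `colour_<category>` is an assumption for illustration and may differ from what sensAI produces.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "green", "red"]})
categories = np.array(["blue", "red"])  # "green" is deliberately not a known category

values = np.array(df["colour"].tolist())
# Compare every value with every known category: shape (n_rows, n_categories)
one_hot = (values[:, None] == categories[None, :]).astype(int)

# Layout 1: a separate column per category
encoded = pd.DataFrame(one_hot, columns=[f"colour_{c}" for c in categories], index=df.index)

# Layout 2: a single column of the same name whose cells are arrays
array_valued = pd.Series(list(one_hot), index=df.index, name="colour")
```

The row for the unknown value "green" contains only zeros in both layouts.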
- class DFTColumnFilter(keep: Optional[Union[Sequence[str], str]] = None, drop: Optional[Union[Sequence[str], str]] = None)
Bases: RuleBasedDataFrameTransformer
A DataFrame transformer that filters columns by retaining or dropping specified columns
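In plain pandas, retaining and dropping columns might look as follows. This is a sketch of the equivalent operations with made-up column names, not the transformer's source.

```python
import pandas as pd

df = pd.DataFrame({"id": [1], "x": [2], "y": [3]})

# Retain only the listed columns
kept = df[["x", "y"]]

# Drop the listed columns and keep everything else
dropped = df.drop(columns=["id"])
```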
- class DFTKeepColumns(keep: Optional[Union[Sequence[str], str]] = None, drop: Optional[Union[Sequence[str], str]] = None)
Bases: DFTColumnFilter
- class DFTNormalisation(rules: Sequence[Rule], default_transformer_factory: Optional[Callable[[], SkLearnTransformerProtocol]] = None, require_all_handled: bool = True, inplace: bool = False)
Bases: DataFrameTransformer
Applies normalisation/scaling to a data frame by applying a set of transformation rules, where each rule defines a set of columns to which it applies (learning a single transformer based on the values of all applicable columns). DFTNormalisation ignores N/A values during fitting and application.
- Parameters:
rules – the set of rules; rules (i.e., their transformers) are always fitted and applied in the given order. A convenient way to obtain a set of rules in the sensai.vector_model.VectorModel context is from a sensai.featuregen.FeatureCollector or sensai.featuregen.MultiFeatureGenerator. Generally, it is often a good idea to associate rules (or a rule template) with a feature generator. Then the rules can be obtained from it using get_normalisation_rules.
default_transformer_factory – a factory for the creation of transformer instances (which implement the API used by sklearn.preprocessing, e.g. StandardScaler) that shall be used to create a transformer for all rules that do not specify a particular transformer. The default transformer will only be applied to columns matched by such rules; unmatched columns will not be transformed. Use SkLearnTransformerFactoryFactory to conveniently create a factory.
require_all_handled – whether to raise an exception if any column is not matched by a rule
inplace – whether to apply data frame transformations in-place
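The idea of a rule that learns a single transformation from all columns it matches, while ignoring N/A values, can be sketched with NumPy. This is not sensAI code: the column names are made up, standard scaling is written out by hand, and matching column names with `fullmatch` is an assumption made for the example.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sensor_a": [1.0, 2.0, np.nan],
    "sensor_b": [3.0, 4.0, 5.0],
    "label": [0.0, 1.0, 0.0],
})

# A rule is identified by a regex that selects the columns it applies to
pattern = re.compile(r"sensor_.*")
matched = [c for c in df.columns if pattern.fullmatch(c)]

# One transformation is learned from the pooled values of all matched columns; NaN is ignored
pooled = df[matched].to_numpy().ravel()
mean, std = np.nanmean(pooled), np.nanstd(pooled)

normalised = df.copy()
normalised[matched] = (df[matched] - mean) / std  # NaN cells simply stay NaN

# Columns not matched by any rule; with a strict setting these would be reported as an error
unhandled = [c for c in df.columns if c not in matched]
```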
- class RuleTemplate(skip: bool = False, unsupported: bool = False, transformer: Optional[SkLearnTransformerProtocol] = None, transformer_factory: Optional[Callable[[], SkLearnTransformerProtocol]] = None, independent_columns: Optional[bool] = None, array_valued: bool = False, fit: bool = True)
Bases: object
A template from which a rule which matches multiple columns can be created. This is useful for the generation of rules which shall apply to all the (numerical) columns generated by a FeatureGenerator without specifically naming them.
Use the parameters as follows:
If the relevant features are already normalised, pass skip=True.
If the relevant features cannot be normalised (e.g. because they are categorical), pass unsupported=True.
If the relevant features shall be normalised, the other parameters apply. Passing no parameters, i.e. RuleTemplate(), is an option if both of the following hold:
a default transformer factory is specified in the DFTNormalisation instance and its application is suitable for the relevant set of features (otherwise, specify either transformer_factory or transformer);
the resulting rule will match only a single column (otherwise, independent_columns must be set to True or False).
- Parameters:
skip – flag indicating whether no transformation shall be performed on the matched columns (e.g. because they are already normalised).
unsupported – flag indicating whether normalisation of matched columns is unsupported (shall trigger an exception if attempted). Useful e.g. for preventing intermediate features that need further processing (like columns containing strings) from making their way into the final dataframe that will be normalised and used for training a model.
transformer – a transformer instance (following the sklearn.preprocessing interface, e.g. StandardScaler) to apply to the matching column(s) for the case where a transformation is necessary (skip=False, unsupported=False). If None is given, either transformer_factory or the containing DFTNormalisation instance’s default factory will be used when the normaliser is fitted. NOTE: Using a transformer_factory is usually preferred. Use an instance only if you want the same transformer instance to be used in multiple places, e.g. when sharing it across several feature generators or models that use the same type of column with associated rule/rule template (disabling fit where appropriate).
transformer_factory – a factory for the generation of the transformer instance, which will only be applied if transformer is not given; if neither transformer nor transformer_factory is given, the containing DFTNormalisation instance’s default factory will be used. See SkLearnTransformerFactoryFactory for convenient construction options.
array_valued – whether the column values are not scalars but arrays (of some fixed but arbitrary length). It is assumed that all entries in such arrays are to be normalised in the same way, i.e. the same transformation will be applied to each entry in the array. Only a single matching column is supported for array_valued=True, i.e. the rule must apply to at most one column.
fit – whether the rule’s transformer shall be fitted. One use case for setting this to False is if a transformer instance is provided (instead of a factory), which is already fitted.
independent_columns – whether, for the case where the rule matches multiple columns, the columns are independent and a separate transformation is to be learned for each of them (rather than using the same transformation for all columns and learning the transformation from the data of all columns). This parameter must be specified for rules matching more than one column; None is acceptable for rules matching a single column, in which case None, True, and False all have the same effect.
- class Rule(regex: Optional[Union[str, Pattern]], skip: bool = False, unsupported: bool = False, transformer: Optional[SkLearnTransformerProtocol] = None, transformer_factory: Optional[Callable[[], SkLearnTransformerProtocol]] = None, array_valued: bool = False, fit: bool = True, independent_columns: Optional[bool] = None)
Bases: ToStringMixin
Use the parameters as follows:
If the relevant features are already normalised, pass skip=True.
If the relevant features cannot be normalised (e.g. because they are categorical), pass unsupported=True.
If the relevant features shall be normalised, the other parameters apply. Passing no parameters other than regex, i.e. Rule(regex), is an option if both of the following hold:
a default transformer factory is specified in the DFTNormalisation instance and its application is suitable for the relevant set of features (otherwise, specify either transformer_factory or transformer);
the resulting rule will match only a single column (otherwise, independent_columns must be set to True or False).
- Parameters:
regex – a regular expression defining the column(s) the rule applies to. If it matches multiple columns, these columns will be normalised in the same way (using the same normalisation process for each column) unless independent_columns=True. If None, the rule is a placeholder rule and the regex must be set later via set_regex or the rule will not be applicable.
skip – flag indicating whether no transformation shall be performed on the matched columns (e.g. because they are already normalised).
unsupported – flag indicating whether normalisation of matched columns is unsupported (shall trigger an exception if attempted). Useful e.g. for preventing intermediate features that need further processing (like columns containing strings) from making their way into the final dataframe that will be normalised and used for training a model.
transformer – a transformer instance (following the sklearn.preprocessing interface, e.g. StandardScaler) to apply to the matching column(s) for the case where a transformation is necessary (skip=False, unsupported=False). If None is given, either transformer_factory or the containing DFTNormalisation instance’s default factory will be used when the normaliser is fitted. NOTE: Using a transformer_factory is usually preferred. Use an instance only if you want the same transformer instance to be used in multiple places, e.g. when sharing it across several feature generators or models that use the same type of column with associated rule/rule template (disabling fit where appropriate).
transformer_factory – a factory for the generation of the transformer instance, which will only be applied if transformer is not given; if neither transformer nor transformer_factory is given, the containing DFTNormalisation instance’s default factory will be used. See SkLearnTransformerFactoryFactory for convenient construction options.
array_valued – whether the column values are not scalars but arrays (of some fixed but arbitrary length). It is assumed that all entries in such arrays are to be normalised in the same way, i.e. the same transformation will be applied to each entry in the array. Only a single matching column is supported for array_valued=True, i.e. the regex must match at most one column.
fit – whether the rule’s transformer shall be fitted. One use case for setting this to False is if a transformer instance is provided (instead of a factory), which is already fitted.
independent_columns – whether, for the case where the rule matches multiple columns, the columns are independent and a separate transformation is to be learned for each of them (rather than using the same transformation for all columns and learning the transformation from the data of all columns). This parameter must be specified for rules matching more than one column; None is acceptable for rules matching a single column, in which case None, True, and False all have the same effect.
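The effect of independent_columns is easiest to see on a small example. The sketch below uses hand-written min-max scaling in pandas rather than sensAI or scikit-learn, with made-up data, and contrasts a shared transformation (learned from all matched columns together) with independent transformations (one per column).

```python
import pandas as pd

df = pd.DataFrame({"x": [0.0, 2.0], "y": [10.0, 30.0]})

# Shared: a single minimum and maximum are taken over both columns together
lo, hi = df.to_numpy().min(), df.to_numpy().max()
shared = (df - lo) / (hi - lo)

# Independent: each column is scaled with its own minimum and maximum
independent = (df - df.min()) / (df.max() - df.min())
```

With the shared transformation the relative scale between `x` and `y` is preserved, whereas the independent variant maps each column to the full range from 0 to 1.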
- class DFTFromColumnGenerators(column_generators: Sequence[ColumnGenerator], inplace=False)
Bases: RuleBasedDataFrameTransformer
Extends a data frame with columns generated from ColumnGenerator instances
- class DFTCountEntries(column_for_entry_count: str, column_name_for_resulting_counts: str = 'counts')
Bases: RuleBasedDataFrameTransformer
Transforms a data frame, based on one of its columns, into a new data frame containing two columns that indicate the counts of unique values in the input column. It is the “DataFrame output version” of pd.Series.value_counts. Each row of the output holds a unique value of the input column and the number of times it appears in the input column.
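A plain pandas sketch of the described output shape, built on `Series.value_counts`. The data is made up, and the exact column layout of the real transformer is assumed from the description above rather than taken from its source.

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "pear", "apple", "apple", "pear", "fig"]})

# value_counts returns a Series indexed by unique value, sorted by descending count
counts = df["fruit"].value_counts()

# Two-column data frame: the unique value and how often it occurs
counts_df = pd.DataFrame({"fruit": counts.index, "counts": counts.values})
```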
- class DFTSkLearnTransformer(sklearn_transformer: SkLearnTransformerProtocol, columns: Optional[List[str]] = None, inplace=False, array_valued=False)
Bases: InvertibleDataFrameTransformer
Applies a transformer from sklearn.preprocessing to (a subset of) the columns of a data frame. If multiple columns are transformed, they are transformed independently (i.e. each column uses a separately trained transformation).
- Parameters:
sklearn_transformer – the transformer instance (from sklearn.preprocessing) to use (which will be fitted & applied)
columns – the set of column names to which the transformation shall apply; if None, apply it to all columns
inplace – whether to apply the transformation in-place
array_valued – whether to apply transformation not to scalar-valued columns but to one or more array-valued columns, where the values of all arrays within a column (which may vary in length) are to be transformed in the same way. If multiple columns are transformed, then the arrays belonging to a single row must all have the same length.
- class DFTSortColumns
Bases: RuleBasedDataFrameTransformer
Sorts a data frame’s columns in ascending order
- class DFTFillNA(fill_value, inplace: bool = False)
Bases: RuleBasedDataFrameTransformer
Fills NA/NaN values with the given value
- class DFTCastCategoricalColumns(columns: Optional[List[str]] = None, dtype=float)
Bases: RuleBasedDataFrameTransformer
Casts columns with dtype category to the given type. This can be useful in cases where categorical columns are not accepted by the model but the column values are actually numeric, in which case the cast to a numeric value yields an acceptable label encoding.
- Parameters:
columns – the columns to convert; if None, convert all that have dtype category
dtype – the data type to which categorical columns are to be converted
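The cast can be illustrated in plain pandas. This is a sketch with made-up data, not the transformer's implementation: columns of dtype category are selected and converted to float, which works here because the category values are numeric.

```python
import pandas as pd

df = pd.DataFrame({
    "rating": pd.Categorical([3, 1, 3, 2]),
    "score": [0.1, 0.2, 0.3, 0.4],
})

# Select all columns that have dtype category
categorical_columns = list(df.select_dtypes(include="category").columns)

converted = df.copy()
for column in categorical_columns:
    # Numeric category values become ordinary floats
    converted[column] = converted[column].astype(float)
```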
- class DFTDropNA(axis=0, inplace=False)
Bases: RuleBasedDataFrameTransformer
Drops rows or columns containing NA/NaN values
- Parameters:
axis – 0 to drop rows, 1 to drop columns containing an N/A value
inplace – whether to perform the operation in-place on the input data frame
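The meaning of the axis parameter matches that of `DataFrame.dropna` in pandas, which the following sketch with made-up data demonstrates.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# axis=0: drop every row that contains an N/A value
rows_dropped = df.dropna(axis=0)

# axis=1: drop every column that contains an N/A value
columns_dropped = df.dropna(axis=1)
```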