dft#

Source code: sensai/data_transformation/dft.py

class DataFrameTransformer[source]#

Bases: ABC, ToStringMixin

Base class for data frame transformers, i.e. objects which can transform one data frame into another (possibly applying the transformation to the original data frame - in-place transformation). A data frame transformer may require being fitted using training data.

get_name() → str[source]#

Returns: the name of this data frame transformer, which may be a default name if no name has been set.

set_name(name: str)[source]#

with_name(name: str)[source]#

apply(df: pandas.DataFrame) → pandas.DataFrame[source]#

info()[source]#

fit(df: pandas.DataFrame)[source]#

is_fitted()[source]#

fit_apply(df: pandas.DataFrame) → pandas.DataFrame[source]#

to_feature_generator(categorical_feature_names: Optional[Union[Sequence[str], str]] = None, normalisation_rules: Sequence[Rule] = (), normalisation_rule_template: Optional[RuleTemplate] = None, add_categorical_default_rules=True)[source]#

chain(*others: DataFrameTransformer) → DataFrameTransformerChain[source]#
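The fit/apply protocol listed above can be pictured with a small stand-in class. This is a minimal sketch in plain pandas, not sensai's implementation: the class MeanCentringTransformer and its internals are hypothetical, and only the method names fit, is_fitted, apply and fit_apply mirror the ones documented here.

```python
import pandas as pd


class MeanCentringTransformer:
    """Hypothetical stand-in illustrating the fit/apply protocol (not a sensai class)."""

    def __init__(self, column: str):
        self.column = column
        self._mean = None  # learned during fit

    def fit(self, df: pd.DataFrame) -> None:
        self._mean = df[self.column].mean()

    def is_fitted(self) -> bool:
        return self._mean is not None

    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        if not self.is_fitted():
            raise RuntimeError("transformer must be fitted before it is applied")
        result = df.copy()  # not in-place: the input frame is left untouched
        result[self.column] = result[self.column] - self._mean
        return result

    def fit_apply(self, df: pd.DataFrame) -> pd.DataFrame:
        self.fit(df)
        return self.apply(df)


train = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"x": [4.0]})

t = MeanCentringTransformer("x")
train_out = t.fit_apply(train)  # the mean (2.0) is learned from the training data
test_out = t.apply(test)        # the learned mean is reused on new data
```

The point of the split is that parameters learned from training data are reused unchanged when the transformer is later applied to other data.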

class DFTFromFeatureGenerator(fgen: FeatureGenerator)[source]#

Bases: DataFrameTransformer

class InvertibleDataFrameTransformer[source]#

Bases: DataFrameTransformer, ABC

abstract apply_inverse(df: pandas.DataFrame) → pandas.DataFrame[source]#

get_inverse() → InverseDataFrameTransformer[source]#

Returns: a transformer whose (forward) transformation is the inverse transformation of this transformer

class RuleBasedDataFrameTransformer[source]#

Bases: DataFrameTransformer, ABC

Base class for transformers whose logic is entirely based on rules and does not need to be fitted to data

fit(df: pandas.DataFrame)[source]#

is_fitted()[source]#

class InverseDataFrameTransformer(invertible_dft: InvertibleDataFrameTransformer)[source]#

Bases: RuleBasedDataFrameTransformer

class DataFrameTransformerChain(*data_frame_transformers: Union[DataFrameTransformer, List[DataFrameTransformer]])[source]#

Bases: DataFrameTransformer

Supports the application of a chain of data frame transformers. During fit and apply each transformer in the chain receives the transformed output of its predecessor.

is_fitted()[source]#

get_names() → List[str][source]#

Returns: the list of names of all contained data frame transformers

info()[source]#

find_first_transformer_by_type(cls) → Optional[DataFrameTransformer][source]#

append(t: DataFrameTransformer)[source]#
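The chaining behaviour described for DataFrameTransformerChain (each transformer sees the output of its predecessor, during fitting as well as application) can be sketched as follows. The classes ClipUpper and MaxScaler and the two helper functions are hypothetical stand-ins written for this illustration; they are not part of sensai.

```python
import pandas as pd


class ClipUpper:
    """Hypothetical rule-based step: needs no fitting."""

    def __init__(self, column: str, upper: float):
        self.column, self.upper = column, upper

    def fit(self, df: pd.DataFrame) -> None:
        pass

    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.assign(**{self.column: df[self.column].clip(upper=self.upper)})


class MaxScaler:
    """Hypothetical fitted step: learns the column maximum."""

    def __init__(self, column: str):
        self.column = column
        self.max_ = None

    def fit(self, df: pd.DataFrame) -> None:
        self.max_ = df[self.column].max()

    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.assign(**{self.column: df[self.column] / self.max_})


def chain_fit_apply(transformers, df: pd.DataFrame) -> pd.DataFrame:
    # each step is fitted on, and applied to, the output of the previous step
    for t in transformers:
        t.fit(df)
        df = t.apply(df)
    return df


def chain_apply(transformers, df: pd.DataFrame) -> pd.DataFrame:
    for t in transformers:
        df = t.apply(df)
    return df


chain = [ClipUpper("x", upper=10.0), MaxScaler("x")]
train_out = chain_fit_apply(chain, pd.DataFrame({"x": [2.0, 5.0, 100.0]}))
new_out = chain_apply(chain, pd.DataFrame({"x": [4.0, 50.0]}))
```

Because the scaler is fitted on the clipped data, it learns a maximum of 10.0 rather than the raw maximum of 100.0.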

class DFTRenameColumns(columns_map: Dict[str, str])[source]#

Bases: RuleBasedDataFrameTransformer

Parameters: columns_map – dictionary mapping old column names to new names

class DFTConditionalRowFilterOnColumn(column: str, condition: Callable[[Any], bool])[source]#

Bases: RuleBasedDataFrameTransformer

Filters a data frame by applying a boolean function to one of the columns and retaining only the rows for which the function returns True

class DFTInSetComparisonRowFilterOnColumn(column: str, set_to_keep: Set)[source]#

Bases: RuleBasedDataFrameTransformer

Filters a data frame on the selected column and retains only the rows for which the value is in set_to_keep

info()[source]#

class DFTNotInSetComparisonRowFilterOnColumn(column: str, set_to_drop: Set)[source]#

Bases: RuleBasedDataFrameTransformer

Filters a data frame on the selected column and retains only the rows for which the value is not in set_to_drop

info()[source]#

class DFTVectorizedConditionalRowFilterOnColumn(column: str, vectorized_condition: Callable[[pandas.Series], Sequence[bool]])[source]#

Bases: RuleBasedDataFrameTransformer

Filters a data frame by applying a vectorized condition on the selected column and retaining only the rows for which it returns True

info()[source]#

class DFTRowFilter(condition: Callable[[Any], bool])[source]#

Bases: RuleBasedDataFrameTransformer

Filters a data frame by applying a condition function to each row and retaining only the rows for which it returns True
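For orientation, the row filters above correspond roughly to the following plain pandas expressions. This is a sketch of the described semantics, not the library's code; the data frame and the conditions are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Berlin", "Paris", "Rome", "Paris"],
    "temp": [12, 25, 31, 8],
})

# DFTConditionalRowFilterOnColumn: the condition is evaluated once per cell
warm_cellwise = df[df["temp"].apply(lambda v: v > 10)]

# DFTInSetComparisonRowFilterOnColumn / DFTNotInSetComparisonRowFilterOnColumn
in_set = df[df["city"].isin({"Paris", "Rome"})]
not_in_set = df[~df["city"].isin({"Paris"})]

# DFTVectorizedConditionalRowFilterOnColumn: the condition receives the whole Series
warm_vectorized = df[df["temp"] > 10]

# DFTRowFilter: the condition receives each row
warm_paris = df[df.apply(lambda row: bool(row["city"] == "Paris" and row["temp"] > 10), axis=1)]
```

The cell-wise and vectorized variants select the same rows here; the vectorized form avoids a Python-level call per cell and is usually the faster choice on large frames.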

class DFTModifyColumn(column: str, column_transform: Union[Callable, numpy.ufunc])[source]#

Bases: RuleBasedDataFrameTransformer

Modifies a column specified by ‘column’ using ‘column_transform’

Parameters:

column – the name of the column to be modified
column_transform – a function operating on single cells or a Numpy ufunc that applies to an entire Series

class DFTModifyColumnVectorized(column: str, column_transform: Callable[[numpy.ndarray], Union[Sequence, pandas.Series, numpy.ndarray]])[source]#

Bases: RuleBasedDataFrameTransformer

Modifies a column specified by ‘column’ using ‘column_transform’. This transformer can be used to utilise NumPy vectorisation for performance optimisation.

Parameters:

column – the name of the column to be modified
column_transform – a function that takes a NumPy array; its return value is assigned to the column as a whole
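The difference between the kinds of column_transform accepted by DFTModifyColumn and DFTModifyColumnVectorized can be illustrated with plain pandas and NumPy. This is a sketch under the assumption that the described behaviour amounts to the operations below; the function centre is a hypothetical example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 4.0, 16.0]})

# cell-wise function: called once per value
doubled = df["x"].apply(lambda v: v * 2)

# NumPy ufunc: applied to the entire Series at once
roots = np.sqrt(df["x"])


# vectorised function: receives the column as a NumPy array and
# returns the new values for the whole column
def centre(values: np.ndarray) -> np.ndarray:
    return values - values.mean()


df_centred = df.copy()
df_centred["x"] = centre(df["x"].values)
```

A function such as centre cannot be written cell-wise, because it needs the whole column (here: its mean) to compute each new value.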

class DFTOneHotEncoder(columns: Optional[Union[Sequence[str], str]], categories: Optional[Union[List[numpy.ndarray], Dict[str, numpy.ndarray]]] = None, inplace=False, ignore_unknown=False, array_valued_result=False)[source]#

Bases: DataFrameTransformer

One-hot encodes categorical variables

Parameters:

columns – list of names or regex matching names of columns that are to be replaced by a list of one-hot encoded columns each (or by an array-valued column if array_valued_result=True); if None, no columns are one-hot encoded
categories – numpy arrays containing the possible values of each of the specified columns (for case where sequence is specified in ‘columns’) or dictionary mapping column name to array of possible categories for the column name. If None, the possible values will be inferred from the columns
inplace – whether to perform the transformation in-place
ignore_unknown – if True and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. If False, an unknown category will raise an error.
array_valued_result – whether to replace the input columns by columns of the same name containing arrays as values instead of creating a separate column per original value

info()[source]#
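The effect of explicit categories, of ignore_unknown=True and of array_valued_result=True can be sketched in plain pandas. This only illustrates the described behaviour; the column naming scheme (colour_red etc.) and the use of Python lists for the array-valued column are assumptions made for the example, not sensai's conventions.

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "purple"]})
categories = ["red", "green", "blue"]  # known categories; "purple" is unknown

# One 0/1 column per known category. A value outside the known categories
# matches none of them, so its row is all zeros (the ignore_unknown=True case).
one_hot = pd.DataFrame(
    {f"colour_{c}": (df["colour"] == c).astype(float) for c in categories},
    index=df.index,
)

# Array-valued alternative: a single column of the original name whose
# entries hold the complete encoding of each row.
array_valued = pd.DataFrame({"colour": one_hot.values.tolist()}, index=df.index)
```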

class DFTColumnFilter(keep: Optional[Union[Sequence[str], str]] = None, drop: Optional[Union[Sequence[str], str]] = None)[source]#

Bases: RuleBasedDataFrameTransformer

A DataFrame transformer that filters columns by retaining or dropping specified columns

info()[source]#

class DFTKeepColumns(keep: Optional[Union[Sequence[str], str]] = None, drop: Optional[Union[Sequence[str], str]] = None)[source]#

Bases: DFTColumnFilter

class DFTDRowFilterOnIndex(keep: Optional[Set] = None, drop: Optional[Set] = None)[source]#

Bases: RuleBasedDataFrameTransformer

class DFTNormalisation(rules: Sequence[Rule], default_transformer_factory: Optional[Callable[[], SkLearnTransformerProtocol]] = None, require_all_handled: bool = True, inplace: bool = False)[source]#

Bases: DataFrameTransformer

Applies normalisation/scaling to a data frame by applying a set of transformation rules, where each rule defines a set of columns to which it applies (learning a single transformer based on the values of all applicable columns). DFTNormalisation ignores N/A values during fitting and application.

Parameters:

rules – the set of rules; rules (i.e., their transformers) are always fitted and applied in the given order. A convenient way to obtain a set of rules in the sensai.vector_model.VectorModel context is from a sensai.featuregen.FeatureCollector or sensai.featuregen.MultiFeatureGenerator. Generally, it is often a good idea to associate rules (or a rule template) with a feature generator. Then the rules can be obtained from it using get_normalisation_rules.
default_transformer_factory – a factory for the creation of transformer instances (which implement the API used by sklearn.preprocessing, e.g. StandardScaler) that shall be used to create a transformer for all rules that do not specify a particular transformer. The default transformer will only be applied to columns matched by such rules; unmatched columns will not be transformed. Use SkLearnTransformerFactoryFactory to conveniently create a factory.
require_all_handled – whether to raise an exception if any column is not matched by a rule
inplace – whether to apply data frame transformations in-place

class RuleTemplate(skip: bool = False, unsupported: bool = False, transformer: Optional[SkLearnTransformerProtocol] = None, transformer_factory: Optional[Callable[[], SkLearnTransformerProtocol]] = None, independent_columns: Optional[bool] = None, array_valued: bool = False, fit: bool = True)[source]#

Bases: object

A template from which a rule which matches multiple columns can be created. This is useful for the generation of rules which shall apply to all the (numerical) columns generated by a FeatureGenerator without specifically naming them.

Use the parameters as follows:

- If the relevant features are already normalised, pass skip=True.
- If the relevant features cannot be normalised (e.g. because they are categorical), pass unsupported=True.
- If the relevant features shall be normalised, the other parameters apply. Passing no parameters, i.e. RuleTemplate(), is an option if both of the following hold:
  - A default transformer factory is specified in the DFTNormalisation instance and its application is suitable for the relevant set of features. Otherwise, specify either transformer_factory or transformer.
  - The resulting rule will match only a single column. Otherwise, independent_columns must be set to True or False.

Parameters:

skip – flag indicating whether no transformation shall be performed on the matched columns (e.g. because they are already normalised).
unsupported – flag indicating whether normalisation of matched columns is unsupported (shall trigger an exception if attempted). Useful e.g. for preventing intermediate features that need further processing (like columns containing strings) from making their way into the final dataframe that will be normalised and used for training a model.
transformer – a transformer instance (following the sklearn.preprocessing interface, e.g. StandardScaler) to apply to the matching column(s) for the case where a transformation is necessary (skip=False, unsupported=False). If None is given, either transformer_factory or the containing DFTNormalisation instance’s default factory will be used when the normaliser is fitted. NOTE: Using a transformer_factory is usually preferred. Use an instance only if you want the same transformer instance to be used in multiple places - e.g. sharing it across several feature generators or models that use the same type of column with associated rule/rule template (disabling fit where appropriate).
transformer_factory – a factory for the generation of the transformer instance, which will only be applied if transformer is not given; if neither transformer nor transformer_factory are given, the containing DFTNormalisation instance’s default factory will be used. See SkLearnTransformerFactoryFactory for convenient construction options.
array_valued – whether the column values are not scalars but arrays (of some fixed but arbitrary length). It is assumed that all entries in such arrays are to be normalised in the same way, i.e. the same transformation will be applied to each entry in the array. Only a single matching column is supported for array_valued=True, i.e. the rule must apply to at most one column.
fit – whether the rule’s transformer shall be fitted. One use case for setting this to False is if a transformer instance is provided (instead of a factory), which is already fitted.
independent_columns – whether, for the case where the rule matches multiple columns, the columns are independent and a separate transformation is to be learned for each of them (rather than using the same transformation for all columns and learning the transformation from the data of all columns). This parameter must be specified for rules matching more than one column. None is acceptable for rules matching a single column, in which case None, True, and False all have the same effect.
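The effect of independent_columns on a rule that matches several columns can be illustrated with a simple max-abs scaling in plain pandas. This is a sketch of the described semantics only; max-abs scaling stands in for whatever sklearn-style transformer a rule would actually use.

```python
import pandas as pd

df = pd.DataFrame({"a": [0.0, 10.0], "b": [0.0, 100.0]})

# independent_columns=False: a single scale is learned from the values
# of all matched columns taken together
joint_scale = df[["a", "b"]].abs().values.max()
joint = df / joint_scale

# independent_columns=True: a separate scale is learned for each column
per_column_scale = df.abs().max()
independent = df / per_column_scale
```

With the joint transformation the relative magnitudes of the columns are preserved (a ends up ten times smaller than b); with independent transformations each column is mapped to the same range.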

to_rule(regex: Optional[Union[str, Pattern]])[source]#

Convert the template to a rule for all columns matching the regex

Parameters: regex – a regular expression defining the column(s) the rule applies to
Returns: the resulting Rule

to_placeholder_rule()[source]#

class Rule(regex: Optional[Union[str, Pattern]], skip: bool = False, unsupported: bool = False, transformer: Optional[SkLearnTransformerProtocol] = None, transformer_factory: Optional[Callable[[], SkLearnTransformerProtocol]] = None, array_valued: bool = False, fit: bool = True, independent_columns: Optional[bool] = None)[source]#

Bases: ToStringMixin

Use the parameters as follows:

- If the relevant features are already normalised, pass skip=True.
- If the relevant features cannot be normalised (e.g. because they are categorical), pass unsupported=True.
- If the relevant features shall be normalised, the other parameters apply. Passing no parameters other than regex, i.e. Rule(regex), is an option if both of the following hold:
  - A default transformer factory is specified in the DFTNormalisation instance and its application is suitable for the relevant set of features. Otherwise, specify either transformer_factory or transformer.
  - The resulting rule will match only a single column. Otherwise, independent_columns must be set to True or False.

Parameters:

regex – a regular expression defining the column(s) the rule applies to. If it matches multiple columns, these columns will be normalised in the same way (using the same normalisation process for each column) unless independent_columns=True. If None, the rule is a placeholder rule and the regex must be set later via set_regex or the rule will not be applicable.
skip – flag indicating whether no transformation shall be performed on the matched columns (e.g. because they are already normalised).
unsupported – flag indicating whether normalisation of matched columns is unsupported (shall trigger an exception if attempted). Useful e.g. for preventing intermediate features that need further processing (like columns containing strings) from making their way into the final dataframe that will be normalised and used for training a model.
transformer – a transformer instance (following the sklearn.preprocessing interface, e.g. StandardScaler) to apply to the matching column(s) for the case where a transformation is necessary (skip=False, unsupported=False). If None is given, either transformer_factory or the containing DFTNormalisation instance’s default factory will be used when the normaliser is fitted. NOTE: Using a transformer_factory is usually preferred. Use an instance only if you want the same transformer instance to be used in multiple places - e.g. sharing it across several feature generators or models that use the same type of column with associated rule/rule template (disabling fit where appropriate).
transformer_factory – a factory for the generation of the transformer instance, which will only be applied if transformer is not given; if neither transformer nor transformer_factory are given, the containing DFTNormalisation instance’s default factory will be used. See SkLearnTransformerFactoryFactory for convenient construction options.
array_valued – whether the column values are not scalars but arrays (of some fixed but arbitrary length). It is assumed that all entries in such arrays are to be normalised in the same way, i.e. the same transformation will be applied to each entry in the array. Only a single matching column is supported for array_valued=True, i.e. the regex must match at most one column.
fit – whether the rule’s transformer shall be fitted. One use case for setting this to False is if a transformer instance is provided (instead of a factory), which is already fitted.
independent_columns – whether, for the case where the rule matches multiple columns, the columns are independent and a separate transformation is to be learned for each of them (rather than using the same transformation for all columns and learning the transformation from the data of all columns). This parameter must be specified for rules matching more than one column. None is acceptable for rules matching a single column, in which case None, True, and False all have the same effect.

set_regex(regex: str)[source]#

matches(column: str)[source]#

matching_columns(columns: Sequence[str]) → List[str][source]#

info()[source]#

find_rule(col_name: str) → Rule[source]#
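How a rule's regex selects columns can be sketched with the standard re module. The sketch assumes that a column is matched when the regex matches the entire column name; this page does not state whether matching is full or partial, so treat that as an assumption. The function below is a hypothetical stand-alone helper, not the library method of the same name.

```python
import re
from typing import List, Sequence


def matching_columns(regex: str, columns: Sequence[str]) -> List[str]:
    """Return the columns whose full name is matched by the regex (assumed semantics)."""
    pattern = re.compile(regex)
    return [c for c in columns if pattern.fullmatch(c) is not None]


columns = ["temp_min", "temp_max", "humidity", "temperature_unit"]

temp_columns = matching_columns(r"temp_.*", columns)
exact_only = matching_columns(r"temp", columns)
```

Under full-match semantics the pattern temp on its own selects nothing here, since no column is named exactly temp; a pattern intended to cover a family of columns needs an explicit wildcard such as temp_.*.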

class DFTFromColumnGenerators(column_generators: Sequence[ColumnGenerator], inplace=False)[source]#

Bases: RuleBasedDataFrameTransformer

Extends a data frame with columns generated from ColumnGenerator instances

info()[source]#

class DFTCountEntries(column_for_entry_count: str, column_name_for_resulting_counts: str = 'counts')[source]#

Bases: RuleBasedDataFrameTransformer

Transforms a data frame, based on one of its columns, into a new data frame containing two columns that indicate the counts of unique values in the input column. It is the “DataFrame output version” of pd.Series.value_counts. Each row of the output data frame holds a unique value of the input column and the number of times it appears in the input column.

info()[source]#
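The output shape described for DFTCountEntries can be reproduced in plain pandas as follows. This is a sketch of the described result, not the library's code; the column name fruit is made up, and counts is the documented default for column_name_for_resulting_counts.

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "pear", "apple"]})

# value_counts yields a Series indexed by the unique values ...
value_counts = df["fruit"].value_counts()

# ... which is turned into a two-column data frame: value and count
counts = pd.DataFrame({"fruit": value_counts.index, "counts": value_counts.values})
```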

class DFTAggregationOnColumn(column_for_aggregation: str, aggregation: Callable)[source]#

Bases: RuleBasedDataFrameTransformer

class DFTRoundFloats(decimals=0)[source]#

Bases: RuleBasedDataFrameTransformer

info()[source]#

class DFTSkLearnTransformer(sklearn_transformer: SkLearnTransformerProtocol, columns: Optional[List[str]] = None, inplace=False, array_valued=False)[source]#

Bases: InvertibleDataFrameTransformer

Applies a transformer from sklearn.preprocessing to (a subset of) the columns of a data frame. If multiple columns are transformed, they are transformed independently (i.e. each column uses a separately trained transformation).

Parameters:

sklearn_transformer – the transformer instance (from sklearn.preprocessing) to use (which will be fitted & applied)
columns – the set of column names to which the transformation shall apply; if None, apply it to all columns
inplace – whether to apply the transformation in-place
array_valued – whether to apply transformation not to scalar-valued columns but to one or more array-valued columns, where the values of all arrays within a column (which may vary in length) are to be transformed in the same way. If multiple columns are transformed, then the arrays belonging to a single row must all have the same length.

apply_inverse(df)[source]#

info()[source]#
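Two properties described for DFTSkLearnTransformer, namely that each column gets its own separately fitted transformation and that the transformation can be inverted via apply_inverse, can be sketched without scikit-learn. The centring and range scaling below are a stand-in chosen for this example; it is not the transformer the class would use.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 3.0], "b": [10.0, 30.0]})

# "fit": parameters are learned separately for each column
means = df.mean()
ranges = df.max() - df.min()

# "apply": each column is transformed with its own parameters
transformed = (df - means) / ranges

# "apply_inverse": the forward transformation is undone
restored = transformed * ranges + means
```

Although the two columns live on different scales, they end up with identical transformed values, which is what per-column fitting implies.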

class DFTSortColumns[source]#

Bases: RuleBasedDataFrameTransformer

Sorts a data frame’s columns in ascending order

class DFTFillNA(fill_value, inplace: bool = False)[source]#

Bases: RuleBasedDataFrameTransformer

Fills NA/NaN values with the given value

class DFTCastCategoricalColumns(columns: Optional[List[str]] = None, dtype=float)[source]#

Bases: RuleBasedDataFrameTransformer

Casts columns with dtype category to the given type. This can be useful in cases where categorical columns are not accepted by the model but the column values are actually numeric, in which case the cast to a numeric value yields an acceptable label encoding.

Parameters:

columns – the columns to convert; if None, convert all that have dtype category
dtype – the data type to which categorical columns are to be converted
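The cast described here corresponds to the following plain pandas operations. This is a sketch of the case columns=None with the default dtype; the data frame is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "rating": pd.Series([1, 3, 2, 3], dtype="category"),
    "name": ["a", "b", "c", "d"],
})

# columns=None: pick all columns that have dtype category
categorical_columns = df.select_dtypes(include="category").columns.tolist()

converted = df.copy()
for column in categorical_columns:
    # the category values are numeric, so the cast yields a usable numeric encoding
    converted[column] = converted[column].astype(float)
```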

class DFTDropNA(axis=0, inplace=False)[source]#

Bases: RuleBasedDataFrameTransformer

Drops rows or columns containing NA/NaN values

Parameters:

axis – 0 to drop rows, 1 to drop columns containing an N/A value
inplace – whether to perform the operation in-place on the input data frame
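The meaning of the axis parameter matches that of pandas' own DataFrame.dropna, as the following short sketch shows on a made-up data frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

rows_dropped = df.dropna(axis=0)     # drops the row containing the N/A value
columns_dropped = df.dropna(axis=1)  # drops column "a", which contains the N/A value
```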