Models with Modular Data Pipelines#

%load_ext autoreload
%autoreload 2

import sys; sys.path.append("../../src")
import sensai
import sensai.xgboost
import sensai.torch
import pandas as pd
import torch

VectorModel#

The backbone of supervised learning implementations is the VectorModel abstraction. It is so named because, in computer science, a vector corresponds to an array of data, and vector models map such vectors to the desired outputs, i.e. regression targets or classes.

It is important to note that this does not limit vector models to tabular data, because the data within a vector can take arbitrary forms (in contrast to vectors as they are defined in mathematics). Every element of an input vector could itself be arbitrarily complex, and could, in the most general sense, be any kind of object.

The VectorModel Class Hierarchy#

VectorModel is an abstract base class. From it, abstract base classes for classification (VectorClassificationModel) and regression (VectorRegressionModel) are derived. Furthermore, base classes for rule-based models (RuleBasedVectorClassificationModel, RuleBasedVectorRegressionModel) facilitate the implementation of models that do not require learning.

These base classes are, in turn, specialised in order to provide direct access to model implementations based on widely used machine learning libraries such as scikit-learn, XGBoost, PyTorch, etc. Use your IDE’s hierarchy view to inspect them.

DataFrame-Based Interfaces#

Vector models use pandas DataFrames as the fundamental input and output data structures. Every row in a data frame corresponds to a vector of data, and an entire data frame can thus be viewed as a dataset or batch of data. Data frames are a good base representation for input data because

  • they provide rudimentary meta-data in the form of column names, avoiding ambiguity.

  • they can contain arbitrarily complex data, yet in the simplest of cases, they can directly be mapped to a data matrix (2D array) of features that simple models can directly process.

The fit and predict methods of VectorModel take data frames as input, and predict furthermore returns its predictions as a data frame. Note that the DataFrame-based interface does not limit the scope of the models that can be applied: one of the key principles of vector models is that they may define arbitrary model-specific transformations of the data originally contained in a data frame (e.g. a conversion from complex objects in data frames to one or more tensors for neural networks), as we shall see below.
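To make the idea of a model-specific conversion concrete, here is a minimal sketch in plain pandas and NumPy. It is not sensAI's own conversion code; the column name `signal` and the helper `to_model_input` are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Each row is one "vector"; its single column holds an arbitrary object (here: a list).
df = pd.DataFrame(
    {"signal": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]},
    index=["sample_a", "sample_b"],
)


def to_model_input(frame: pd.DataFrame) -> np.ndarray:
    """Stack the per-row sequences into an array of shape (n_rows, n_values)."""
    return np.stack([np.asarray(seq, dtype=float) for seq in frame["signal"]])


model_input = to_model_input(df)
print(model_input.shape)  # (2, 3)
```

The data frame remains the common interface, while the array (or tensor) representation stays an internal detail of the model that needs it.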

Here’s the particularly simple Iris dataset for flower species classification, where the features are measurements of petals and sepals:

dataset = sensai.data.dataset.DataSetClassificationIris()
io_data = dataset.load_io_data()
io_data.to_df().sample(8)
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  class
25   5.0                3.0               1.6                0.2               setosa
49   5.0                3.3               1.4                0.2               setosa
106  4.9                2.5               4.5                1.7               virginica
119  6.0                2.2               5.0                1.5               virginica
5    5.4                3.9               1.7                0.4               setosa
62   6.0                2.2               4.0                1.0               versicolor
8    4.4                2.9               1.4                0.2               setosa
121  5.6                2.8               4.9                2.0               virginica

Here, io_data is an instance of InputOutputData, which contains two data frames, inputs and outputs. The to_df method merges the two data frames into one for easier visualisation.

Let’s split the dataset and apply a model to it:

# load and split a dataset
splitter = sensai.data.DataSplitterFractional(0.8)
train_io_data, test_io_data = splitter.split(io_data)

# train a model
model = sensai.sklearn.classification.SkLearnRandomForestVectorClassificationModel(
    n_estimators=15)
model.fit_input_output_data(train_io_data)

# make predictions
predictions = model.predict(test_io_data.inputs)

The fit_input_output_data method is just a convenience method to pass an InputOutputData instance instead of two data frames. It is equivalent to

model.fit(train_io_data.inputs, train_io_data.outputs)

where the two data frames containing inputs and outputs are passed separately.

Now let’s compare the ground truth to some of the predictions:

pd.concat((test_io_data.outputs, predictions), axis=1).sample(8)
     class       class
99   versicolor  versicolor
87   versicolor  versicolor
58   versicolor  versicolor
102  virginica   virginica
144  virginica   virginica
116  virginica   virginica
20   setosa      setosa
124  virginica   virginica

Implementing Custom Models#

It is straightforward to implement your own model. Simply subclass the appropriate base class depending on the type of model you want to implement.

For example, let us implement a simple classifier where we always return the a priori probability of each class in the training data, ignoring the input data for predictions. For this case, we inherit from VectorClassificationModel and implement the two abstract methods it defines.

class PriorProbabilityVectorClassificationModel(sensai.VectorClassificationModel):
    def _fit_classifier(self, x: pd.DataFrame, y: pd.DataFrame):
        # store the relative frequency of each class label in the training outputs
        self._prior_probabilities = y.iloc[:, 0].value_counts(normalize=True).to_dict()

    def _predict_class_probabilities(self, x: pd.DataFrame) -> pd.DataFrame:
        # return the same prior probabilities for every row, ignoring the input values
        values = [self._prior_probabilities[cls] for cls in self.get_class_labels()]
        return pd.DataFrame([values] * len(x), columns=self.get_class_labels(), index=x.index)
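To see what these two methods compute, the same logic can be traced in plain pandas on toy data. This sketch does not use sensAI; the toy frames are made up, and the sorted label order stands in for whatever `get_class_labels()` returns, which is an assumption.

```python
import pandas as pd

# Toy training outputs: one column holding the class label of each row.
y_train = pd.DataFrame({"class": ["setosa", "setosa", "setosa", "virginica"]})

# Relative class frequencies, computed as in _fit_classifier.
priors = y_train.iloc[:, 0].value_counts(normalize=True).to_dict()

# Assumption for this sketch: class labels are kept in sorted order.
class_labels = sorted(priors)

# Toy inputs; their values are ignored, only the index and row count matter.
x_new = pd.DataFrame({"petal length (cm)": [1.4, 5.0]}, index=[10, 11])

# One identical row of probabilities per input row, as in _predict_class_probabilities.
values = [priors[label] for label in class_labels]
probabilities = pd.DataFrame([values] * len(x_new), columns=class_labels, index=x_new.index)
print(probabilities)
```

Both rows receive 0.75 for `setosa` and 0.25 for `virginica`, regardless of the input values.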

Adapting a model implementation from another machine learning library typically takes just a few lines of code. For models that adhere to the scikit-learn interfaces for learning and prediction, there are abstract base classes that make the adaptation particularly straightforward.

Configuration#

Apart from the parameters passed at construction, which are specific to the type of model in question, all vector models can be flexibly configured via methods that can be called post-construction. These methods all have the with_ prefix, indicating that they return the instance itself (akin to the builder pattern), allowing calls to be chained in a single statement.

The most relevant such methods are:

  • with_name to name the model (for reporting purposes)

  • with_raw_input_transformers for adding initial input transformations

  • with_feature_generator and with_feature_collector for specifying how to generate features from the input data

  • with_feature_transformers for specifying how the generated features shall be transformed

The last three items are essential for defining modular input pipelines and will be addressed in detail below.
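The chaining behaviour of the with_ methods follows from each method returning the instance itself. A minimal sketch of that pattern, using a hypothetical stand-in class (`ConfigurableModel` is not part of sensAI):

```python
class ConfigurableModel:
    """Hypothetical stand-in for a model, illustrating chainable configuration."""

    def __init__(self) -> None:
        self.name = None
        self.feature_transformers = []

    def with_name(self, name: str) -> "ConfigurableModel":
        self.name = name
        return self  # returning the instance is what enables chaining

    def with_feature_transformers(self, *transformers) -> "ConfigurableModel":
        self.feature_transformers = list(transformers)
        return self


# Construction and configuration in a single statement.
model_sketch = ConfigurableModel().with_name("demo").with_feature_transformers(str.lower, str.strip)
print(model_sketch.name, len(model_sketch.feature_transformers))
```

Because the methods mutate and return the same object, the configuration calls can be issued in one chained statement or separately after construction.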

All configured options are fully reflected in the model’s string representation, which can be pretty-printed with the pprint method.

str(model.with_name("RandomForest"))
'SkLearnRandomForestVectorClassificationModel[featureGenerator=None, rawInputTransformerChain=DataFrameTransformerChain[dataFrameTransformers=[]], featureTransformerChain=DataFrameTransformerChain[dataFrameTransformers=[]], fitArgs={}, useBalancedClassWeights=False, useLabelEncoding=False, name=RandomForest, model=RandomForestClassifier(n_estimators=15, random_state=42)]'
model.pprint()
SkLearnRandomForestVectorClassificationModel[
    featureGenerator=None, 
    rawInputTransformerChain=DataFrameTransformerChain[dataFrameTransformers=[]], 
    featureTransformerChain=DataFrameTransformerChain[dataFrameTransformers=[]], 
    fitArgs={}, 
    useBalancedClassWeights=False, 
    useLabelEncoding=False, 
    name=RandomForest, 
    model=RandomForestClassifier(
        n_estimators=15, 
        random_state=42)]

Modular Pipelines#

A key principle of sensAI’s vector models is that data pipelines

  • can be strongly associated with a model. This is critically important if several heterogeneous models are to be applied to the same use case. Typically, every model has different requirements regarding the data it can process and the representation it requires to process it optimally.

  • are to be modular, meaning that a pipeline can be composed from reusable and user-definable components.

An input pipeline typically serves the purpose of answering the following questions:

  • How shall the data be pre-processed?

    It might be necessary to process the data before we can use it and extract data from it. We may need to filter or clean the data; we may need to establish a usable representation from raw data (e.g. convert a string-based representation of a date into a proper data structure); or we may need to infer/impute missing data.

    The relevant abstraction for this task is DataFrameTransformer, which, as the name suggests, can arbitrarily transform a data frame. All non-abstract class implementations have the prefix DFT in sensAI and thus are easily discovered through auto-completion.

    A VectorModel can be configured to apply a pre-processing transformation via method with_raw_input_transformers.

  • What is the data used by the model?

    The relevant abstraction is FeatureGenerator. Via FeatureGenerator instances, a model can define which set of features is to be used. Moreover, these instances can hold meta-data on the respective features, which can be leveraged for downstream representation. In sensAI, the class names of all feature generator implementations use the prefix FeatureGenerator.

    A VectorModel can be configured to answer this question via method with_feature_generator (or with_feature_collector).

  • How does that data need to be represented?

    Different models can require different representations of the same data. For example, some models might require all features to be numeric, thus requiring categorical features to be encoded, while others might work better with the original representation. Furthermore, some models might work better with numerical features normalised or scaled in a certain way while it makes no difference to others. We can address these requirements by adding model-specific transformations.

    The relevant abstraction is, once again, DataFrameTransformer.

    A VectorModel can be configured to apply a transformation to its features via method with_feature_transformers.

The three pipeline stages are applied in the order presented above, and all components are optional: if a model does not define any raw input transformers, the original data remains unmodified; if a model defines no feature generator, the set of features is given by the full input data frame; and so on.
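The ordering and optionality of the stages can be sketched with plain functions on data frames. This is a conceptual illustration only, not sensAI's implementation; `compute_inputs` and all column names are hypothetical.

```python
from typing import Callable, List, Optional

import pandas as pd

FrameFn = Callable[[pd.DataFrame], pd.DataFrame]


def compute_inputs(
    df: pd.DataFrame,
    raw_input_transformers: Optional[List[FrameFn]] = None,
    feature_generator: Optional[FrameFn] = None,
    feature_transformers: Optional[List[FrameFn]] = None,
) -> pd.DataFrame:
    """Hypothetical helper: apply the three stages in order, skipping absent ones."""
    for transform in raw_input_transformers or []:  # stage 1: pre-processing
        df = transform(df)
    if feature_generator is not None:  # stage 2: which data is used
        df = feature_generator(df)
    for transform in feature_transformers or []:  # stage 3: representation
        df = transform(df)
    return df


raw = pd.DataFrame({"id": [1, 2], "height_cm": [180.0, 160.0], "note": ["a", "b"]})

features = compute_inputs(
    raw,
    raw_input_transformers=[lambda d: d.drop(columns=["note"])],
    feature_generator=lambda d: d[["height_cm"]],
    feature_transformers=[lambda d: d / 100.0],
)
print(features)  # a single column with the heights in metres
print(compute_inputs(raw).equals(raw))  # True: without components, the data is unchanged
```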

Example Dataset: Titanic Survival#

Let us consider the well-known Titanic Survival data set as an example.

Every data point holds data on a passenger. The data set has the following potentially predictive columns,

  • Pclass: the passenger ticket class as an integer (1=first, 2=second, 3=third)

  • Sex: the passenger’s sex (male or female)

  • Age: the passenger’s age in years (integer); this feature is frequently missing

  • SibSp: the number of siblings and spouses of the passenger

  • Parch: the number of parents and children of the passenger

  • Fare: the fare price paid

  • Embarked: the port of embarkation (C=Cherbourg, Q=Queenstown, S=Southampton); this feature is missing for two passengers

and some further meta-data columns (Name, Ticket, Cabin).

The goal is to predict the column ‘Survived’ indicating whether the passenger survived (1) or not (0).

dataset = sensai.data.dataset.DataSetClassificationTitanicSurvival()
io_data = dataset.load_io_data()
io_data.to_df().iloc[:5]
PassengerId  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  Survived
1            3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         0
2            1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         1
3            3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         1
4            1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S         1
5            3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S         0

Let us define an evaluation object for this data set, which will allow us to evaluate model performance later.

evaluator_params = sensai.evaluation.ClassificationEvaluatorParams(fractional_split_test_fraction=0.2)
titanic_evaluation = sensai.evaluation.ClassificationModelEvaluation(io_data, evaluator_params=evaluator_params)

Raw Input Transformers#

We shall now add pipeline components to an XGBoost model, as it can straightforwardly deal with missing data.

The dataset doesn’t really require any pre-processing, but we could

  • get rid of the useless meta-data columns,

  • convert the passenger class feature into a string to ensure that it is not treated as a numerical feature.

class DFTTitanicDropMetaDataColumns(sensai.data_transformation.DFTColumnFilter):
    def __init__(self):
        super().__init__(drop=[dataset.COL_NAME, dataset.COL_CABIN, dataset.COL_TICKET])
        
class DFTTitanicTransformPassengerClass(sensai.data_transformation.DFTModifyColumn):
    def __init__(self):
        super().__init__(
            column=dataset.COL_PASSENGER_CLASS, 
            column_transform=lambda n: {1: "first", 2: "second", 3: "third"}[n])

xgb_model = sensai.xgboost.XGBGradientBoostedVectorClassificationModel() \
    .with_raw_input_transformers(
        DFTTitanicDropMetaDataColumns(),
        DFTTitanicTransformPassengerClass())

Our model uses two data frame transformers which apply the aforementioned pre-processing tasks. We have opted to define classes for each transformation to facilitate reusing the transformations for other models.

We can apply the transformers using the model’s compute_model_inputs method. Since neither transformer requires fitting, we can directly apply them.

xgb_model.compute_model_inputs(io_data.inputs).iloc[:5]
PassengerId  Pclass  Sex     Age   SibSp  Parch  Fare     Embarked
1            third   male    22.0  1      0      7.2500   S
2            first   female  38.0  1      0      71.2833  C
3            third   female  26.0  0      0      7.9250   S
4            first   female  35.0  1      0      53.1000  S
5            third   male    35.0  0      0      8.0500   S

The model’s input pipeline now transforms the data as desired.

Feature Generation and Transformation#

Feature generators serve to define how features can be generated from the input data. They additionally hold meta-data on the generated features, which can be leveraged for downstream transformation. Specifically,

  • we can define which features are categorical,

  • we can define rules for normalisation or scaling of numerical features.

Simple Feature Pipelines for Titanic Survival#

In the Titanic Survival data set, the features are already fully prepared, so we do not need to actually generate anything; we can simply take the feature values as they are present in the original data frame and add only the necessary meta-data. The base class for this purpose is FeatureGeneratorTakeColumns, which allows us to take over columns directly from the input data. We could use a single feature generator for all features as follows:

class FeatureGeneratorTitanicAll(sensai.featuregen.FeatureGeneratorTakeColumns):
    def __init__(self):
        super().__init__(
            columns=None,  # take all columns
            categorical_feature_names=[dataset.COL_SEX, dataset.COL_PASSENGER_CLASS, dataset.COL_PORT_EMBARKED],
            normalisation_rule_template=sensai.featuregen.DFTNormalisation.RuleTemplate(
                transformer_factory=sensai.data_transformation.SkLearnTransformerFactoryFactory.StandardScaler(),
                independent_columns=True))

We have supplied meta-data regarding both

  • the subset of features that are categorical

  • the normalisation rule to be applied to the numerical features (if normalisation is applied with DFTNormalisation).

Our XGBoost model does not require normalisation, but we still want to apply a transformation to some of the features: categorical features shall be one-hot encoded. To achieve this, we add the feature generator as well as a DFT that applies the one-hot encoding:

feature_generator = FeatureGeneratorTitanicAll()

xgb_model.with_feature_generator(feature_generator) \
    .with_feature_transformers(
        sensai.data_transformation.DFTOneHotEncoder(
            feature_generator.get_categorical_feature_name_regex(), 
            ignore_unknown=True)) \
    .pprint()
XGBGradientBoostedVectorClassificationModel[
    featureGenerator=FeatureGeneratorTitanicAll[
        columns=None, 
        exceptColumns=(), 
        verifyColumnNames=True, 
        name=FeatureGeneratorTitanicAll-139706259416304], 
    rawInputTransformerChain=DataFrameTransformerChain[
        dataFrameTransformers=[
            DFTTitanicDropMetaDataColumns[
                keep=None, 
                drop=[Name, Cabin, Ticket]], 
            DFTTitanicTransformPassengerClass[
                column=Pclass, 
                columnTransform=<function DFTTitanicTransformPassengerClass.__init__.<locals>.<lambda> at 0x7f0fe5f8e670>]]], 
    featureTransformerChain=DataFrameTransformerChain[
        dataFrameTransformers=[
            DFTOneHotEncoder[
                oneHotEncoders=None, 
                inplace=False, 
                arrayValuedResult=False, 
                handleUnknown=ignore, 
                columns=(Sex|Pclass|Embarked)]]], 
    fitArgs={}, 
    useBalancedClassWeights=False, 
    useLabelEncoding=True, 
    featureGeneratorNames=[FeatureGeneratorTitanicAll-139706259416304], 
    modelConstructor=XGBClassifier(random_state=42)]

When using more than one feature generator, the feature generators need to be combined into a MultiFeatureGenerator. To facilitate this and to furthermore simplify the creation of downstream transformers, we can instead use the FeatureCollector abstraction. The above model can be equivalently defined as follows:

feature_collector = sensai.featuregen.FeatureCollector(feature_generator)  # can pass more than one feature generator

xgb_model.with_feature_collector(feature_collector, shared=True) \
    .with_feature_transformers(feature_collector.create_feature_transformer_one_hot_encoder(ignore_unknown=True));

Either way, the model’s feature pipeline is now fully configured. The full pipeline now requires fitting (since the feature transformation is learnt from the training data). Let’s fit the model and then take another look at the inputs that the XGBoost model now actually receives.

xgb_model.fit_input_output_data(io_data)
xgb_model.compute_model_inputs(io_data.inputs).iloc[:5]
PassengerId  Age   SibSp  Parch  Fare     Pclass_0  Pclass_1  Pclass_2  Sex_0  Sex_1  Embarked_0  Embarked_1  Embarked_2
1            22.0  1      0      7.2500   0.0       0.0       1.0       0.0    1.0    0.0         0.0         1.0
2            38.0  1      0      71.2833  1.0       0.0       0.0       1.0    0.0    1.0         0.0         0.0
3            26.0  0      0      7.9250   0.0       0.0       1.0       1.0    0.0    0.0         0.0         1.0
4            35.0  1      0      53.1000  1.0       0.0       0.0       1.0    0.0    0.0         0.0         1.0
5            35.0  0      0      8.0500   0.0       0.0       1.0       0.0    1.0    0.0         0.0         1.0
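The indicator columns in the output above can be reproduced conceptually in plain pandas. The sketch below is not sensAI's encoder: the helper `one_hot` is hypothetical, the sorted category order is inferred from the output (e.g. `female` maps to `Sex_0`), and unseen categories are assumed to be encoded as all zeros, which is how scikit-learn's `OneHotEncoder` behaves with `handle_unknown="ignore"`.

```python
import pandas as pd

train = pd.DataFrame({"Sex": ["male", "female", "female"]})
new = pd.DataFrame({"Sex": ["female", "unknown"]})

# "Fitting": remember the categories seen in the training data, in sorted order.
categories = sorted(train["Sex"].unique())  # ['female', 'male']


def one_hot(frame: pd.DataFrame, column: str, categories: list) -> pd.DataFrame:
    """Replace `column` with one indicator column per known category."""
    encoded = pd.DataFrame(
        {f"{column}_{i}": (frame[column] == cat).astype(float) for i, cat in enumerate(categories)},
        index=frame.index,
    )
    return pd.concat([frame.drop(columns=[column]), encoded], axis=1)


print(one_hot(train, "Sex", categories))
print(one_hot(new, "Sex", categories))  # the unseen value yields zeros in all indicator columns
```

Because the categories are learnt from the training data, the encoding is a fitted transformation, which is why the pipeline has to be fitted before it can be applied.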

The pipeline works as expected. We are now ready to evaluate the model:

result = titanic_evaluation.perform_simple_evaluation(xgb_model)
result.get_data_frame()
predictedVar  accuracy  balancedAccuracy  precision  recall    F1
Survived      0.826816  0.817746          0.75       0.784615  0.766917
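The relationship between the reported metrics can be checked with a short calculation. The confusion-matrix counts below are not stated in the source; they are inferred from the reported values under the assumptions that the test set contains 179 passengers and that "survived" (1) is the positive class.

```python
# Inferred counts (an assumption, see above): true/false positives and negatives.
tp, fp, fn, tn = 51, 17, 14, 97

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_accuracy = (recall + specificity) / 2  # mean of the per-class recalls
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"{accuracy:.6f} {balanced_accuracy:.6f} {precision:.2f} {recall:.6f} {f1:.6f}")
# 0.826816 0.817746 0.75 0.784615 0.766917
```

Balanced accuracy is lower than accuracy here because the model recognises the minority class (survivors) less reliably than the majority class.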

In the demonstration above, the model’s full definition was spread out across several cells to incrementally explain the pipeline construction. In practice we want to keep the model definition monolithic. Let us (re-)define the XGBoost model as well as a second model with a slightly different pipeline that additionally applies normalisation and compare the two models.

xgb_model = sensai.xgboost.XGBGradientBoostedVectorClassificationModel() \
    .with_raw_input_transformers(
        DFTTitanicDropMetaDataColumns(),
        DFTTitanicTransformPassengerClass()) \
    .with_name("XGBoost") \
    .with_feature_collector(feature_collector, shared=True) \
    .with_feature_transformers(
        feature_collector.create_feature_transformer_one_hot_encoder(ignore_unknown=True))

torch_mlp_model = sensai.torch.models.MultiLayerPerceptronVectorClassificationModel(
        hid_activation_function=torch.relu,
        hidden_dims=[10, 10, 4],
        cuda=False,
        p_dropout=0.25,
        nn_optimiser_params=sensai.torch.NNOptimiserParams(early_stopping_epochs=10)) \
    .with_name("MLP") \
    .with_raw_input_transformers(
        DFTTitanicDropMetaDataColumns(),
        DFTTitanicTransformPassengerClass()) \
    .with_feature_collector(feature_collector, shared=True) \
    .with_feature_transformers(
        sensai.data_transformation.DFTColumnFilter(drop=[dataset.COL_PORT_EMBARKED, dataset.COL_AGE_YEARS]),
        feature_collector.create_feature_transformer_one_hot_encoder(ignore_unknown=True),
        feature_collector.create_dft_normalisation())

titanic_evaluation.compare_models([xgb_model, torch_mlp_model]).results_df
Received value in input tensor 0 which is likely to not be correctly normalised: maximum abs. value in tensor is 6.937743
model_name  accuracy  balancedAccuracy  precision  recall    F1
XGBoost     0.826816  0.817746          0.75       0.784615  0.766917
MLP         0.821229  0.790216          0.80       0.676923  0.733333

Notice that the model definitions are purely declarative: We define each model and the respective feature pipeline by injecting appropriate pipeline components.

For the multi-layer perceptron model, we notably added some additional feature transformers:

  • Since this type of model cannot cope with missing feature values, we added a component that drops the age and port columns, which are sometimes undefined.

  • Since neural networks work best with normalised feature representations, we added the normalisation component, which uses a standard scaler (as defined in the feature generator).
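What these two additional steps do to the data can be sketched in plain pandas. This is an illustration on made-up values, not the sensAI components themselves; it assumes standard scaling means subtracting the training mean and dividing by the training standard deviation, as scikit-learn's `StandardScaler` does.

```python
import pandas as pd

train = pd.DataFrame({"Fare": [10.0, 20.0, 30.0], "Age": [22.0, float("nan"), 35.0]})
test = pd.DataFrame({"Fare": [40.0], "Age": [float("nan")]})

# Step 1: drop the column that contains missing values.
train_f = train.drop(columns=["Age"])
test_f = test.drop(columns=["Age"])

# Step 2: "fit" the scaler on the training data only (per-column mean and standard deviation).
mean = train_f.mean()
std = train_f.std(ddof=0)  # population standard deviation, as used by StandardScaler

# Apply the same learnt parameters to both training and test data.
train_scaled = (train_f - mean) / std
test_scaled = (test_f - mean) / std

print(train_scaled["Fare"].round(3).tolist())  # [-1.225, 0.0, 1.225]
print(round(test_scaled.loc[0, "Fare"], 3))    # 2.449
```

Values outside the range seen during training, such as the test fare here, map to scaled values of larger magnitude, since the scaling parameters are fixed after fitting.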