Models with Modular Data Pipelines#
%load_ext autoreload
%autoreload 2
import sys; sys.path.append("../../src")
import sensai
import sensai.xgboost
import sensai.torch
import pandas as pd
import torch
VectorModel#
The backbone of supervised learning implementations is the VectorModel abstraction. It is so named because, in computer science, a vector corresponds to an array of data, and vector models map such vectors to the desired outputs, i.e. regression targets or classes.
It is important to note that this does not limit vector models to tabular data, because the data within a vector can take arbitrary forms (in contrast to vectors as they are defined in mathematics). Every element of an input vector could itself be arbitrarily complex and could, in the most general sense, be any kind of object.
The VectorModel Class Hierarchy#
VectorModel is an abstract base class. From it, abstract base classes for classification (VectorClassificationModel) and regression (VectorRegressionModel) are derived. Furthermore, we provide base classes for rule-based models, facilitating the implementation of models that do not require learning (RuleBasedVectorClassificationModel, RuleBasedVectorRegressionModel).
These base classes are, in turn, specialised in order to provide direct access to model implementations based on widely used machine learning libraries such as scikit-learn, XGBoost, PyTorch, etc. Use your IDE’s hierarchy view to inspect them.
DataFrame-Based Interfaces#
Vector models use pandas DataFrames as the fundamental input and output data structures. Every row in a data frame corresponds to a vector of data, and an entire data frame can thus be viewed as a dataset or batch of data. Data frames are a good base representation for input data because
they provide rudimentary meta-data in the form of column names, avoiding ambiguity.
they can contain arbitrarily complex data, yet in the simplest of cases, they can directly be mapped to a data matrix (2D array) of features that simple models can directly process.
The fit and predict methods of VectorModel take data frames as input, and the latter furthermore returns its predictions as a data frame.
It is important to note that the DataFrame-based interface does not limit the scope of the models that can be applied, as one of the key principles of vector models is that they may define arbitrary model-specific transformations of the data originally contained in a data frame (e.g. a conversion from complex objects in data frames to one or more tensors for neural networks), as we shall see below.
Here’s the particularly simple Iris dataset for flower species classification, where the features are measurements of petals and sepals:
dataset = sensai.data.dataset.DataSetClassificationIris()
io_data = dataset.load_io_data()
io_data.to_df().sample(8)
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | class |
|---|---|---|---|---|---|
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 11 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
| 76 | 6.8 | 2.8 | 4.8 | 1.4 | versicolor |
| 19 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
| 94 | 5.6 | 2.7 | 4.2 | 1.3 | versicolor |
| 58 | 6.6 | 2.9 | 4.6 | 1.3 | versicolor |
| 120 | 6.9 | 3.2 | 5.7 | 2.3 | virginica |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
Here, io_data is an instance of InputOutputData, which contains two data frames, inputs and outputs. The to_df method merges the two data frames into one for easier visualisation.
Let’s split the dataset and apply a model to it:
# load and split a dataset
splitter = sensai.data.DataSplitterFractional(0.8)
train_io_data, test_io_data = splitter.split(io_data)
# train a model
model = sensai.sklearn.classification.SkLearnRandomForestVectorClassificationModel(
n_estimators=15)
model.fit_input_output_data(train_io_data)
# make predictions
predictions = model.predict(test_io_data.inputs)
The fit_input_output_data method is just a convenience method to pass an InputOutputData instance instead of two data frames. It is equivalent to
model.fit(train_io_data.inputs, train_io_data.outputs)
where the two data frames containing inputs and outputs are passed separately.
Now let’s compare the ground truth to some of the predictions:
pd.concat((test_io_data.outputs, predictions), axis=1).sample(8)
| | class | class |
|---|---|---|
| 71 | versicolor | versicolor |
| 149 | virginica | virginica |
| 48 | setosa | setosa |
| 102 | virginica | virginica |
| 41 | setosa | setosa |
| 99 | versicolor | versicolor |
| 20 | setosa | setosa |
| 103 | virginica | virginica |
Implementing Custom Models#
It is straightforward to implement your own model. Simply subclass the appropriate base class depending on the type of model you want to implement.
For example, let us implement a simple classifier where we always return the a priori probability of each class in the training data, ignoring the input data for predictions. For this case, we inherit from VectorClassificationModel and implement the two abstract methods it defines.
class PriorProbabilityVectorClassificationModel(sensai.VectorClassificationModel):
    def _fit_classifier(self, x: pd.DataFrame, y: pd.DataFrame):
        # store the relative frequency of each class in the training data
        self._prior_probabilities = y.iloc[:, 0].value_counts(normalize=True).to_dict()

    def _predict_class_probabilities(self, x: pd.DataFrame) -> pd.DataFrame:
        # return the same prior probabilities for every input row
        values = [self._prior_probabilities[cls] for cls in self.get_class_labels()]
        return pd.DataFrame([values] * len(x), columns=self.get_class_labels(), index=x.index)
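The custom model can then be fitted and applied just like the built-in models; for instance, reusing the Iris split from above:
prior_model = PriorProbabilityVectorClassificationModel()
prior_model.fit_input_output_data(train_io_data)
prior_model.predict(test_io_data.inputs).iloc[:3]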
Adapting a model implementation from another machine learning library typically requires just a few lines of code. For models that adhere to the scikit-learn interfaces for learning and prediction, there are abstract base classes that make the adaptation particularly straightforward.
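As a rough sketch of what such an adaptation can look like (a hypothetical example: the module path sensai.sklearn.sklearn_base, the base class name AbstractSkLearnVectorClassificationModel and its constructor signature are assumptions to be verified against the library):
import sklearn.naive_bayes

# Hypothetical sketch: we assume an abstract base class whose constructor receives the
# scikit-learn model constructor together with its keyword arguments; check
# sensai.sklearn.sklearn_base for the exact class name and signature before use.
class GaussianNBVectorClassificationModel(
        sensai.sklearn.sklearn_base.AbstractSkLearnVectorClassificationModel):
    def __init__(self, **model_args):
        super().__init__(sklearn.naive_bayes.GaussianNB, **model_args)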
Configuration#
Apart from the parameters passed at construction, which are specific to the type of model in question, all vector models can be flexibly configured via methods that can be called post-construction.
These methods all have the with_ prefix, indicating that they return the instance itself (akin to the builder pattern), allowing calls to be chained in a single statement.
The most relevant such methods are:
with_name to name the model (for reporting purposes)
with_raw_input_transformers for adding an initial input transformation
with_feature_generator and with_feature_collector for specifying how to generate features from the input data
with_feature_transformers for specifying how the generated features shall be transformed
The latter three points are essential for defining modular input pipelines and will be addressed in detail below.
All configured options are fully reflected in the model’s string representation, which can be pretty-printed with the pprint method.
str(model.with_name("RandomForest"))
'SkLearnRandomForestVectorClassificationModel[featureGenerator=None, rawInputTransformerChain=DataFrameTransformerChain[dataFrameTransformers=[]], featureTransformerChain=DataFrameTransformerChain[dataFrameTransformers=[]], fitArgs={}, useBalancedClassWeights=False, useLabelEncoding=False, name=RandomForest, model=RandomForestClassifier(n_estimators=15, random_state=42)]'
model.pprint()
SkLearnRandomForestVectorClassificationModel[
featureGenerator=None,
rawInputTransformerChain=DataFrameTransformerChain[dataFrameTransformers=[]],
featureTransformerChain=DataFrameTransformerChain[dataFrameTransformers=[]],
fitArgs={},
useBalancedClassWeights=False,
useLabelEncoding=False,
name=RandomForest,
model=RandomForestClassifier(
n_estimators=15,
random_state=42)]
Modular Pipelines#
A key principle of sensAI’s vector models is that data pipelines
can be strongly associated with a model. This is critically important if several heterogeneous models shall be applied to the same use case. Typically, every model has different requirements regarding the data it can process and the representation it requires to process it optimally.
are to be modular, meaning that a pipeline can be composed from reusable and user-definable components.
An input pipeline typically serves the purpose of answering the following questions:
How shall the data be pre-processed?
It might be necessary to process the data before we can use it and extract data from it. We may need to filter or clean the data; we may need to establish a usable representation from raw data (e.g. convert a string-based representation of a date into a proper data structure); or we may need to infer/impute missing data.
The relevant abstraction for this task is DataFrameTransformer, which, as the name suggests, can arbitrarily transform a data frame. All non-abstract class implementations have the prefix DFT in sensAI and thus are easily discovered through auto-completion.
A VectorModel can be configured to apply a pre-processing transformation via the method with_raw_input_transformers.
What is the data used by the model?
The relevant abstraction is FeatureGenerator. Via FeatureGenerator instances, a model can define which set of features is to be used. Moreover, these instances can hold meta-data on the respective features, which can be leveraged for downstream representation. In sensAI, the class names of all feature generator implementations use the prefix FeatureGenerator.
A VectorModel can be configured to answer this question via the method with_feature_generator (or with_feature_collector).
How does that data need to be represented?
Different models can require different representations of the same data. For example, some models might require all features to be numeric, thus requiring categorical features to be encoded, while others might work better with the original representation. Furthermore, some models might work better with numerical features normalised or scaled in a certain way while it makes no difference to others. We can address these requirements by adding model-specific transformations.
The relevant abstraction is, once again, DataFrameTransformer.
A VectorModel can be configured to apply a transformation to its features via the method with_feature_transformers.
The three pipeline stages are applied in the order presented above, and all components are optional, i.e. if a model does not define any raw input transformers, then the original data remains unmodified. If a model defines no feature generator, then the set of features is given by the full input data frame, etc.
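To make the first stage more concrete, here is a minimal sketch of a custom pre-processing transformer. It assumes that a base class for rule-based transformers (RuleBasedDataFrameTransformer, requiring no fitting and only an _apply implementation) exists in sensai.data_transformation; the class and column names used here are purely illustrative, so check the DataFrameTransformer interface in the library before relying on this.
class DFTFillNA(sensai.data_transformation.RuleBasedDataFrameTransformer):
    """Hypothetical example: fills missing values in a given column with a constant."""
    def __init__(self, column: str, fill_value):
        super().__init__()
        self.column = column
        self.fill_value = fill_value

    def _apply(self, df: pd.DataFrame) -> pd.DataFrame:
        # return a modified copy, leaving the original data frame untouched
        df = df.copy()
        df[self.column] = df[self.column].fillna(self.fill_value)
        return df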
Example Dataset: Titanic Survival#
Let us consider the well-known Titanic Survival data set as an example.
Every data point holds data on a passenger. The data set has the following potentially predictive columns,
Pclass: the passenger ticket class as an integer (1=first, 2=second, 3=third)
Sex: the passenger’s sex (male or female)
Age: the passenger’s age in years (integer); this feature is frequently missing
SibSp: the number of siblings and spouses of the passenger
Parch: the number of parents and children of the passenger
Fare: the fare price paid
Embarked: the port of embarkation (C=Cherbourg, Q=Queenstown, S=Southampton); this feature is missing for two passengers
and some further meta-data columns (Name, Cabin).
The goal is to predict the column ‘Survived’ indicating whether the passenger survived (1) or not (0).
dataset = sensai.data.dataset.DataSetClassificationTitanicSurvival()
io_data = dataset.load_io_data()
io_data.to_df().iloc[:5]
| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 |
| 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
| 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 1 |
| 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 |
| 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 |
Let us define an evaluation object for this data set, which will allow us to evaluate model performance later.
evaluator_params = sensai.evaluation.ClassificationEvaluatorParams(fractional_split_test_fraction=0.2)
titanic_evaluation = sensai.evaluation.ClassificationModelEvaluation(io_data, evaluator_params=evaluator_params)
Raw Input Transformers#
We shall now add pipeline components to an XGBoost model, as it can straightforwardly deal with missing data.
The dataset doesn’t really require any pre-processing, but we could
get rid of the useless meta-data columns,
convert the passenger class feature into a string to ensure that it is not treated as a numerical feature
class DFTTitanicDropMetaDataColumns(sensai.data_transformation.DFTColumnFilter):
def __init__(self):
super().__init__(drop=[dataset.COL_NAME, dataset.COL_CABIN, dataset.COL_TICKET])
class DFTTitanicTransformPassengerClass(sensai.data_transformation.DFTModifyColumn):
def __init__(self):
super().__init__(
column=dataset.COL_PASSENGER_CLASS,
column_transform=lambda n: {1: "first", 2: "second", 3: "third"}[n])
xgb_model = sensai.xgboost.XGBGradientBoostedVectorClassificationModel() \
.with_raw_input_transformers(
DFTTitanicDropMetaDataColumns(),
DFTTitanicTransformPassengerClass())
Our model uses two data frame transformers which apply the aforementioned pre-processing tasks. We have opted to define classes for each transformation to facilitate reusing the transformations for other models.
We can apply the transformers using the model’s compute_model_inputs method. Since neither transformer requires fitting, we can apply them directly.
xgb_model.compute_model_inputs(io_data.inputs).iloc[:5]
| PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 1 | third | male | 22.0 | 1 | 0 | 7.2500 | S |
| 2 | first | female | 38.0 | 1 | 0 | 71.2833 | C |
| 3 | third | female | 26.0 | 0 | 0 | 7.9250 | S |
| 4 | first | female | 35.0 | 1 | 0 | 53.1000 | S |
| 5 | third | male | 35.0 | 0 | 0 | 8.0500 | S |
The model’s input pipeline now transforms the data as desired.
Feature Generation and Transformation#
Feature generators serve to define how features can be generated from the input data. They additionally hold meta-data on the generated features, which can be leveraged for downstream transformation. Specifically,
we can define which features are categorical,
we can define rules for normalisation or scaling of numerical features.
Simple Feature Pipelines for Titanic Survival#
In the Titanic Survival data set, the features are already fully prepared, so we do not need to actually generate anything;
we can simply take the feature values as they are present in the original data frame and add only the necessary meta-data.
The base class for this purpose is FeatureGeneratorTakeColumns, which allows us to take over columns directly from the input data.
We could use a single feature generator for all features as follows:
class FeatureGeneratorTitanicAll(sensai.featuregen.FeatureGeneratorTakeColumns):
def __init__(self):
super().__init__(
columns=None, # take all columns
categorical_feature_names=[dataset.COL_SEX, dataset.COL_PASSENGER_CLASS, dataset.COL_PORT_EMBARKED],
normalisation_rule_template=sensai.featuregen.DFTNormalisation.RuleTemplate(
transformer_factory=sensai.data_transformation.SkLearnTransformerFactoryFactory.StandardScaler(),
independent_columns=True))
We have supplied meta-data regarding both
the subset of features that are categorical,
the normalisation rule to be applied to the numerical features (if normalisation is applied with DFTNormalisation).
Our XGBoost model does not require normalisation, but we still want to apply a transformation to some of the features: Categorical features shall be one-hot encoded. To achieve this, we add the feature generator as well as a DFT that applies the one-hot encoding:
feature_generator = FeatureGeneratorTitanicAll()
xgb_model.with_feature_generator(feature_generator) \
.with_feature_transformers(
sensai.data_transformation.DFTOneHotEncoder(
feature_generator.get_categorical_feature_name_regex(),
ignore_unknown=True)) \
.pprint()
XGBGradientBoostedVectorClassificationModel[
featureGenerator=FeatureGeneratorTitanicAll[
columns=None,
exceptColumns=(),
verifyColumnNames=True,
name=FeatureGeneratorTitanicAll-140670605186720],
rawInputTransformerChain=DataFrameTransformerChain[
dataFrameTransformers=[
DFTTitanicDropMetaDataColumns[
keep=None,
drop=[Name, Cabin, Ticket]],
DFTTitanicTransformPassengerClass[
column=Pclass,
columnTransform=<function DFTTitanicTransformPassengerClass.__init__.<locals>.<lambda> at 0x7ff06d765430>]]],
featureTransformerChain=DataFrameTransformerChain[
dataFrameTransformers=[
DFTOneHotEncoder[
oneHotEncoders=None,
inplace=False,
arrayValuedResult=False,
handleUnknown=ignore,
columns=(Sex|Pclass|Embarked)]]],
fitArgs={},
useBalancedClassWeights=False,
useLabelEncoding=True,
featureGeneratorNames=[FeatureGeneratorTitanicAll-140670605186720],
modelConstructor=XGBClassifier(random_state=42)]
When using more than one feature generator, the feature generators need to be combined into a MultiFeatureGenerator. To facilitate this and to furthermore simplify the creation of downstream transformers, we can instead use the FeatureCollector abstraction. The above model can be equivalently defined as follows:
feature_collector = sensai.featuregen.FeatureCollector(feature_generator) # can pass more than one feature generator
xgb_model.with_feature_collector(feature_collector, shared=True) \
.with_feature_transformers(feature_collector.create_feature_transformer_one_hot_encoder(ignore_unknown=True));
Either way, the model’s feature pipeline is now fully configured. The full pipeline now requires fitting (since the feature transformation is learnt from the training data). Let’s fit the model and then take another look at the inputs that the XGBoost model now actually receives.
xgb_model.fit_input_output_data(io_data)
xgb_model.compute_model_inputs(io_data.inputs).iloc[:5]
| PassengerId | Age | SibSp | Parch | Fare | Pclass_0 | Pclass_1 | Pclass_2 | Sex_0 | Sex_1 | Embarked_0 | Embarked_1 | Embarked_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 22.0 | 1 | 0 | 7.2500 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 2 | 38.0 | 1 | 0 | 71.2833 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 26.0 | 0 | 0 | 7.9250 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 35.0 | 1 | 0 | 53.1000 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 5 | 35.0 | 0 | 0 | 8.0500 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
The pipeline works as expected. We are now ready to evaluate the model:
result = titanic_evaluation.perform_simple_evaluation(xgb_model)
result.get_data_frame()
| predictedVar | accuracy | balancedAccuracy | precision | recall | F1 |
|---|---|---|---|---|---|
| Survived | 0.826816 | 0.817746 | 0.75 | 0.784615 | 0.766917 |
In the demonstration above, the model’s full definition was spread out across several cells to incrementally explain the pipeline construction. In practice we want to keep the model definition monolithic. Let us (re-)define the XGBoost model as well as a second model with a slightly different pipeline that additionally applies normalisation and compare the two models.
xgb_model = sensai.xgboost.XGBGradientBoostedVectorClassificationModel() \
.with_raw_input_transformers(
DFTTitanicDropMetaDataColumns(),
DFTTitanicTransformPassengerClass()) \
.with_name("XGBoost") \
.with_feature_collector(feature_collector, shared=True) \
.with_feature_transformers(
feature_collector.create_feature_transformer_one_hot_encoder(ignore_unknown=True))
torch_mlp_model = sensai.torch.models.MultiLayerPerceptronVectorClassificationModel(
hid_activation_function=torch.relu,
hidden_dims=[10, 10, 4],
cuda=False,
p_dropout=0.25,
nn_optimiser_params=sensai.torch.NNOptimiserParams(early_stopping_epochs=10)) \
.with_name("MLP") \
.with_raw_input_transformers(
DFTTitanicDropMetaDataColumns(),
DFTTitanicTransformPassengerClass()) \
.with_feature_collector(feature_collector, shared=True) \
.with_feature_transformers(
sensai.data_transformation.DFTColumnFilter(drop=[dataset.COL_PORT_EMBARKED, dataset.COL_AGE_YEARS]),
feature_collector.create_feature_transformer_one_hot_encoder(ignore_unknown=True),
feature_collector.create_dft_normalisation())
titanic_evaluation.compare_models([xgb_model, torch_mlp_model]).results_df
| model_name | accuracy | balancedAccuracy | precision | recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.826816 | 0.817746 | 0.75 | 0.784615 | 0.766917 |
| MLP | 0.821229 | 0.790216 | 0.80 | 0.676923 | 0.733333 |
Notice that the model definitions are purely declarative: We define each model and the respective feature pipeline by injecting appropriate pipeline components.
For the multi-layer perceptron model, we notably added some additional feature transformers:
Since this type of model cannot cope with missing feature values, we added a component that drops the age and port columns, which are sometimes undefined.
Since neural networks work best with normalised feature representations, we added the normalisation component, which uses a standard scaler (as defined in the feature generator).