Models with Modular Data Pipelines#

%load_ext autoreload
%autoreload 2

import sys; sys.path.append("../../src")
import sensai
import sensai.xgboost
import sensai.torch
import pandas as pd
import torch

VectorModel#

The backbone of supervised learning implementations is the VectorModel abstraction. It is so named because, in computer science, a vector corresponds to an array of data, and vector models map such vectors to the desired outputs, i.e. regression targets or classes.

It is important to note that this does not limit vector models to tabular data, because the data within a vector can take arbitrary forms (in contrast to vectors as they are defined in mathematics). Every element of an input vector could itself be arbitrarily complex, and could, in the most general sense, be any kind of object.

The VectorModel Class Hierarchy#

VectorModel is an abstract base class. From it, abstract base classes for classification (VectorClassificationModel) and regression (VectorRegressionModel) are derived. We furthermore provide base classes for rule-based models, facilitating the implementation of models that do not require learning (RuleBasedVectorClassificationModel, RuleBasedVectorRegressionModel).

These base classes are, in turn, specialised in order to provide direct access to model implementations based on widely used machine learning libraries such as scikit-learn, XGBoost, PyTorch, etc. Use your IDE’s hierarchy view to inspect them.

DataFrame-Based Interfaces#

Vector models use pandas DataFrames as the fundamental input and output data structures. Every row in a data frame corresponds to a vector of data, and an entire data frame can thus be viewed as a dataset or batch of data. Data frames are a good base representation for input data because

  • they provide rudimentary meta-data in the form of column names, avoiding ambiguity.

  • they can contain arbitrarily complex data, yet in the simplest of cases, they can directly be mapped to a data matrix (2D array) of features that simple models can directly process.

The fit and predict methods of VectorModel take data frames as input, and the latter furthermore returns its predictions as a data frame. It is important to note that the DataFrame-based interface does not limit the scope of the models that can be applied, as one of the key principles of vector models is that they may define arbitrary model-specific transformations of the data originally contained in a data frame (e.g. a conversion from complex objects in data frames to one or more tensors for neural networks), as we shall see below.

Here’s the particularly simple Iris dataset for flower species classification, where the features are measurements of petals and sepals:

dataset = sensai.data.dataset.DataSetClassificationIris()
io_data = dataset.load_io_data()
io_data.to_df().sample(8)
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) class
88 5.6 3.0 4.1 1.3 versicolor
130 7.4 2.8 6.1 1.9 virginica
14 5.8 4.0 1.2 0.2 setosa
108 6.7 2.5 5.8 1.8 virginica
106 4.9 2.5 4.5 1.7 virginica
42 4.4 3.2 1.3 0.2 setosa
34 4.9 3.1 1.5 0.2 setosa
71 6.1 2.8 4.0 1.3 versicolor

Here, io_data is an instance of InputOutputData, which contains two data frames, inputs and outputs. The to_df method merges the two data frames into one for easier visualisation.

Let’s split the dataset and apply a model to it:

# load and split a dataset
splitter = sensai.data.DataSplitterFractional(0.8)
train_io_data, test_io_data = splitter.split(io_data)

# train a model
model = sensai.sklearn.classification.SkLearnRandomForestVectorClassificationModel(
    n_estimators=15)
model.fit_input_output_data(train_io_data)

# make predictions
predictions = model.predict(test_io_data.inputs)

The fit_input_output_data method is just a convenience method to pass an InputOutputData instance instead of two data frames. It is equivalent to

model.fit(train_io_data.inputs, train_io_data.outputs)

where the two data frames containing inputs and outputs are passed separately.

Now let’s compare the ground truth to some of the predictions:

pd.concat((test_io_data.outputs, predictions), axis=1).sample(8)
class class
116 virginica virginica
74 versicolor versicolor
144 virginica virginica
87 versicolor versicolor
90 versicolor versicolor
129 virginica virginica
99 versicolor versicolor
57 versicolor versicolor

Implementing Custom Models#

It is straightforward to implement your own model. Simply subclass the appropriate base class depending on the type of model you want to implement.

For example, let us implement a simple classifier where we always return the a priori probability of each class in the training data, ignoring the input data for predictions. For this case, we inherit from VectorClassificationModel and implement the two abstract methods it defines.

class PriorProbabilityVectorClassificationModel(sensai.VectorClassificationModel):
    def _fit_classifier(self, x: pd.DataFrame, y: pd.DataFrame):
        # store the relative frequency of each class observed in the training data
        self._prior_probabilities = y.iloc[:, 0].value_counts(normalize=True).to_dict()

    def _predict_class_probabilities(self, x: pd.DataFrame) -> pd.DataFrame:
        # return the same prior distribution for every input row
        values = [self._prior_probabilities[cls] for cls in self.get_class_labels()]
        return pd.DataFrame([values] * len(x), columns=self.get_class_labels(), index=x.index)
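We can apply this model like any other. For instance, we can fit it on the Iris training data from above; since it ignores its inputs, its predicted class probabilities are identical for every data point:

prior_model = PriorProbabilityVectorClassificationModel()
prior_model.fit_input_output_data(train_io_data)
prior_model.predict(test_io_data.inputs).iloc[:5]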

Adapting a model implementation from another machine learning library is typically just a few lines. For models that adhere to the scikit-learn interfaces for learning and prediction, there are abstract base classes that make the adaptation particularly straightforward.
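For illustration, here is a minimal sketch of such an adaptation for scikit-learn's ExtraTreesClassifier. It assumes the base class sensai.sklearn.sklearn_base.AbstractSkLearnVectorClassificationModel, whose constructor receives the scikit-learn estimator constructor along with its parameters; check the exact class name and signature against the sensAI version you are using.

from sklearn.ensemble import ExtraTreesClassifier
from sensai.sklearn.sklearn_base import AbstractSkLearnVectorClassificationModel  # assumed import path

class SkLearnExtraTreesVectorClassificationModel(AbstractSkLearnVectorClassificationModel):
    def __init__(self, n_estimators: int = 100, random_state: int = 42, **model_args):
        # pass the scikit-learn estimator constructor and its parameters to the base class
        super().__init__(ExtraTreesClassifier, n_estimators=n_estimators, random_state=random_state, **model_args)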

Configuration#

Apart from the parameters passed at construction, which are specific to the type of model in question, all vector models can be flexibly configured via methods that can be called post-construction. These methods all have the with_ prefix, indicating that they return the instance itself (akin to the builder pattern), allowing calls to be chained in a single statement.

The most relevant such methods are:

  • with_name to name the model (for reporting purposes)

  • with_raw_input_transformers for adding an initial input transformation

  • with_feature_generator and with_feature_collector for specifying how to generate features from the input data

  • with_feature_transformers for specifying how the generated features shall be transformed

The latter three points are essential for defining modular input pipelines and will be addressed in detail below.

All configured options are fully reflected in the model’s string representation, which can be pretty-printed with the pprint method.

str(model.with_name("RandomForest"))
"SkLearnRandomForestVectorClassificationModel[featureGenerator=None, rawInputTransformerChain=DataFrameTransformerChain[dataFrameTransformers=[]], featureTransformerChain=DataFrameTransformerChain[dataFrameTransformers=[]], fitArgs={}, useBalancedClassWeights=False, useLabelEncoding=False, name='RandomForest', model='RandomForestClassifier(n_estimators=15, random_state=42)']"
model.pprint()
SkLearnRandomForestVectorClassificationModel[
    featureGenerator=None, 
    rawInputTransformerChain=DataFrameTransformerChain[dataFrameTransformers=[]], 
    featureTransformerChain=DataFrameTransformerChain[dataFrameTransformers=[]], 
    fitArgs={}, 
    useBalancedClassWeights=False, 
    useLabelEncoding=False, 
    name='RandomForest', 
    model='RandomForestClassifier(n_estimators=15, random_state=42)']

Modular Pipelines#

A key principle of sensAI’s vector models is that data pipelines

  • can be strongly associated with a model. This is critically important if several heterogeneous models shall be applied to the same use case. Typically, every model has different requirements regarding the data it can process and the representation it requires to process it optimally.

  • are to be modular, meaning that a pipeline can be composed from reusable and user-definable components.

An input pipeline typically serves the purpose of answering the following questions:

  • How shall the data be pre-processed?

    It might be necessary to process the data before we can use it and extract meaningful features from it. We may need to filter or clean the data; we may need to establish a usable representation from raw data (e.g. convert a string-based representation of a date into a proper data structure); or we may need to infer/impute missing data.

    The relevant abstraction for this task is DataFrameTransformer, which, as the name suggests, can arbitrarily transform a data frame. All non-abstract class implementations have the prefix DFT in sensAI and thus are easily discovered through auto-completion.

    A VectorModel can be configured to apply a pre-processing transformation via method with_raw_input_transformers.

  • What is the data used by the model?

    The relevant abstraction is FeatureGenerator. Via FeatureGenerator instances, a model can define which set of features is to be used. Moreover, these instances can hold meta-data on the respective features, which can be leveraged for downstream representation. In sensAI, the class names of all feature generator implementations use the prefix FeatureGenerator.

    A VectorModel can be configured to answer this question via method with_feature_generator (or with_feature_collector).

  • How does that data need to be represented?

    Different models can require different representations of the same data. For example, some models might require all features to be numeric, thus requiring categorical features to be encoded, while others might work better with the original representation. Furthermore, some models might work better with numerical features normalised or scaled in a certain way while it makes no difference to others. We can address these requirements by adding model-specific transformations.

    The relevant abstraction is, once again, DataFrameTransformer.

    A VectorModel can be configured to apply a transformation to its features via method with_feature_transformers.

The three pipeline stages are applied in the order presented above, and all components are optional, i.e. if a model does not define any raw input transformers, then the original data remains unmodified. If a model defines no feature generator, then the set of features is given by the full input data frame, etc.

Data Frame Transformers (DFTs)#

As the name suggests, a data frame transformer (DFT) is a simple concept: it transforms a data frame into a new data frame. The transformation can either be pre-defined or be learnt from data. A common case is for the new data frame to contain a modified representation of the same data.

The package sensai.data_transformation contains a multitude of concrete transformers that can directly be applied as well as base classes for custom transformer implementations.

As an example, consider a data frame containing a column with string representations of points in time:

data = {'ts': ['2024-09-08 12:31:18', '2022-09-10 18:31:12', None, '2022-09-12 07:55:05']}
raw_df = pd.DataFrame(data)

We can define a data frame transformer that will convert the string representations into proper Timestamp objects for downstream processing:

class DFTStringToTimestamp(sensai.data_transformation.RuleBasedDataFrameTransformer):
    def _apply(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        df["ts"] = pd.to_datetime(df["ts"])
        return df

Because this transformation does not require any learning, we have derived it from RuleBasedDataFrameTransformer and the only method to be implemented is the private method that applies the transformation.

Applying the transformation is straightforward:

dft_string_to_timestamp = DFTStringToTimestamp()
transformed_df = dft_string_to_timestamp.apply(raw_df)
transformed_df
ts
0 2024-09-08 12:31:18
1 2022-09-10 18:31:12
2 NaT
3 2022-09-12 07:55:05

DFTs can be chained via DataFrameTransformerChain, which will apply transformations sequentially. A DataFrameTransformerChain is itself a DataFrameTransformer, allowing for the definition of complex pipelines.

dft_chain = sensai.data_transformation.DataFrameTransformerChain(
    DFTStringToTimestamp(), sensai.data_transformation.DFTDropNA())
transformed_df = dft_chain.apply(raw_df)
transformed_df
ts
0 2024-09-08 12:31:18
1 2022-09-10 18:31:12
3 2022-09-12 07:55:05

A DataFrameTransformerChain can also be created by using the chain method of DataFrameTransformer:

DFTStringToTimestamp().chain(sensai.data_transformation.DFTDropNA()).apply(raw_df)
ts
0 2024-09-08 12:31:18
1 2022-09-10 18:31:12
3 2022-09-12 07:55:05

Feature Generators#

Feature generators serve two main functions:

  1. They define how features can be generated from the input data frame.

  2. They hold meta-data on the generated features, which can be leveraged for downstream transformation. Specifically,

    • we can define which features are categorical,

    • we can define rules for normalisation or scaling of numerical features.

The basic functionality of a feature generator is to create, from an input data frame, a data frame with the same index that contains one or more columns, each column representing a feature that the model shall use.

A Simple, Rule-Based Feature Generator#

Let’s consider a simple example: Suppose we have the transformed data frame with timestamps from above and want the model to use the hour of the day as a feature. Since this feature generator does not require learning, we can define a RuleBasedFeatureGenerator as follows:

class FeatureGeneratorHourOfDay(sensai.featuregen.RuleBasedFeatureGenerator):
    def _generate(self, df: pd.DataFrame, ctx=None) -> pd.DataFrame:
        hour_series = df["ts"].apply(lambda t: t.hour)
        return pd.DataFrame({"hour_of_day": hour_series}, index=df.index)
    
FeatureGeneratorHourOfDay().generate(transformed_df)
hour_of_day
0 12
1 18
3 7

This is a very simple example in which the feature generation mechanism need not be learned from data. If we require the generator to adapt itself to the training data, we can instead derive our class from FeatureGenerator and implement the method _fit accordingly.
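A minimal sketch of such a learned feature generator could, for example, centre the hour of day around the mean observed during training. The signature of _fit shown here (receiving the training inputs and, optionally, the targets) is an assumption; consult the FeatureGenerator base class for the exact interface.

class FeatureGeneratorHourOfDayCentred(sensai.featuregen.FeatureGenerator):
    def _fit(self, x: pd.DataFrame, y: pd.DataFrame = None, ctx=None):
        # learn the mean hour of day from the training data
        self._mean_hour = x["ts"].apply(lambda t: t.hour).mean()

    def _generate(self, df: pd.DataFrame, ctx=None) -> pd.DataFrame:
        # centre the hour of day around the mean learned during fitting
        hour_series = df["ts"].apply(lambda t: t.hour) - self._mean_hour
        return pd.DataFrame({"hour_of_day_centred": hour_series}, index=df.index)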

Making Use of Base Classes#

sensAI provides a wide variety of base classes that simplify the definition of feature generators, including

  • :py:class:sensai.featuregen.feature_generator.FeatureGeneratorTakeColumns, which simply takes over columns from the input data frame without modifying them

  • :py:class:sensai.featuregen.feature_generator.FeatureGeneratorMapColumn, which maps the values of an input column to a new feature column

  • :py:class:sensai.featuregen.feature_generator.FeatureGeneratorFlattenColumns, which generates features by flattening one or more vector-valued columns in the input

  • :py:class:sensai.featuregen.feature_generator.FeatureGeneratorMapColumnDict, which maps an input column to several feature columns, i.e. mapping each input value to a dictionary of output values

  • :py:class:sensai.featuregen.feature_generator.FeatureGeneratorFromVectorModel, which generates features by applying a (regression or classification) model to the input data frame

  • :py:class:sensai.featuregen.feature_generator.FeatureGeneratorFromDFT, which generates features by applying a given data frame transformer to the input

  • :py:class:sensai.featuregen.feature_generator.FeatureGeneratorFromColumnGenerator, which uses the concept of a ColumnGenerator to implement a feature generator and specifically supports index-based caching mechanisms for feature generation

  • :py:class:sensai.featuregen.feature_generator.FeatureGeneratorTargetDistribution, which computes conditional distributions of the (optionally discretised) target variable given one or more categorical features in the input data

As a simple example, let us use FeatureGeneratorMapColumn to implement a second feature based on the timestamp column from the earlier data frame: the day of the week.

class FeatureGeneratorDayOfWeek(sensai.featuregen.feature_generator.FeatureGeneratorMapColumn):
    def __init__(self):
        super().__init__(input_col_name="ts", feature_col_name="day_of_week")

    def _create_value(self, ts: pd.Timestamp):
        return ts.day_of_week
    
FeatureGeneratorDayOfWeek().generate(transformed_df)
day_of_week
0 6
1 5
3 0

Adding Meta-Data for Downstream Transformations#

Suppose we want to apply the “hour of day” feature generator within a neural network, where we might want to normalise the features. We can do this by extending the implementation with a constructor that defines a normalisation rule template, which is to be applied to the generated column.

In general, we can define specific normalisation rules for every feature generated by a feature generator, but if the same rule shall apply to all columns (or if there is only one), the use of a normalisation rule template avoids the unnecessary repetition of column names.

from sensai.data_transformation.dft import DFTNormalisation
from sensai.data_transformation.sklearn_transformer import SkLearnTransformerFactoryFactory


class FeatureGeneratorHourOfDay(sensai.featuregen.RuleBasedFeatureGenerator):
    def __init__(self):
        super().__init__(normalisation_rule_template=DFTNormalisation.RuleTemplate(
            transformer_factory=SkLearnTransformerFactoryFactory.ManualScaler(scale=1/23)))

    def _generate(self, df: pd.DataFrame, ctx=None) -> pd.DataFrame:
        hour_series = df["ts"].apply(lambda t: t.hour)
        return pd.DataFrame({"hour_of_day": hour_series}, index=df.index)
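If, conversely, different generated columns were to receive different normalisation rules, explicit rules could be passed instead of a template. The following sketch assumes a constructor parameter normalisation_rules and a rule class DFTNormalisation.Rule that is keyed by a regular expression matching the generated column names; verify both against the sensAI API before use.

class FeatureGeneratorHourOfDayExplicitRule(sensai.featuregen.RuleBasedFeatureGenerator):
    def __init__(self):
        # hypothetical variant: an explicit rule naming the generated column instead of a rule template
        super().__init__(normalisation_rules=[
            DFTNormalisation.Rule(r"hour_of_day",
                transformer_factory=SkLearnTransformerFactoryFactory.ManualScaler(scale=1/23))])

    def _generate(self, df: pd.DataFrame, ctx=None) -> pd.DataFrame:
        hour_series = df["ts"].apply(lambda t: t.hour)
        return pd.DataFrame({"hour_of_day": hour_series}, index=df.index)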

As already explained above, the normalisation rules and categorical feature data are only meta-data; the normalisation rule template we specified does not affect the actual generation of the feature. Why? Because we want to be able to use the same feature generator with different types of models, and each model shall decide whether to use that information in order to apply the transformations it needs to operate optimally.

In the present case, the normalisation rule provides information for the data frame transformer class DFTNormalisation, and here’s how it could be applied manually:

fg_hour = FeatureGeneratorHourOfDay()
DFTNormalisation(rules=fg_hour.get_normalisation_rules()).fit_apply(fg_hour.generate(transformed_df))
hour_of_day
0 0.521739
1 0.782609
3 0.304348

Similarly, we might want to use one-hot encoding for the “day of the week” feature, as this is essentially categorical information; higher integer values do not indicate “more of something” but are entirely different categories. We can similarly extend the earlier definition of the feature generator and declare that the feature is categorical.

from sensai.data_transformation.dft import DFTOneHotEncoder


class FeatureGeneratorDayOfWeek(sensai.featuregen.feature_generator.FeatureGeneratorMapColumn):
    def __init__(self):
        super().__init__(input_col_name="ts", feature_col_name="day_of_week", categorical_feature_names=["day_of_week"])

    def _create_value(self, ts: pd.Timestamp):
        return ts.day_of_week
    

fg_day = FeatureGeneratorDayOfWeek()
DFTOneHotEncoder(fg_day.get_categorical_feature_name_regex()).fit_apply(fg_day.generate(transformed_df))
day_of_week_0 day_of_week_1 day_of_week_2
0 0.0 0.0 1.0
1 0.0 1.0 0.0
3 1.0 0.0 0.0

Applying these downstream transformations for normalisation and one-hot encoding is, of course, cumbersome, which is why sensAI offers a convenient concept to simplify such applications: feature collectors.

Feature Collectors#

Feature collectors facilitate the combination of several feature generators into a single MultiFeatureGenerator that generates the full feature data frame. They furthermore enable the convenient creation of downstream feature transformers for scaling/normalisation and the encoding of categorical features.

feature_collector = sensai.featuregen.FeatureCollector(
    FeatureGeneratorHourOfDay(), FeatureGeneratorDayOfWeek())

multi_feature_gen = feature_collector.create_multi_feature_generator()
features_df = multi_feature_gen.generate(transformed_df)
features_df
hour_of_day day_of_week
0 12 6
1 18 5
3 7 0

Now let’s create the downstream transformers using the feature collector.

dft_one_hot_encoder = feature_collector.create_dft_one_hot_encoder()
dft_normalisation = feature_collector.create_dft_normalisation()

feature_transformer = dft_one_hot_encoder.chain(dft_normalisation)
feature_transformer.fit_apply(features_df)
hour_of_day day_of_week_0 day_of_week_1 day_of_week_2
0 0.521739 0.0 0.0 1.0
1 0.782609 0.0 1.0 0.0
3 0.304348 1.0 0.0 0.0

Fully Defining a Vector Model’s Input Pipeline#

Now let’s put it all together and define the full input pipeline of a model based on the above definitions.

feature_collector = sensai.featuregen.FeatureCollector(
    FeatureGeneratorHourOfDay(), FeatureGeneratorDayOfWeek())

mlp_model = sensai.sklearn.sklearn_classification.SkLearnMLPVectorClassificationModel() \
    .with_raw_input_transformers(DFTStringToTimestamp(), sensai.data_transformation.DFTDropNA()) \
    .with_feature_collector(feature_collector) \
    .with_feature_transformers(feature_collector.create_dft_one_hot_encoder(), feature_collector.create_dft_normalisation())

This declaration makes the model perform the full set of transformations that we considered earlier. Recall the original input data:

raw_df
ts
0 2024-09-08 12:31:18
1 2022-09-10 18:31:12
2 None
3 2022-09-12 07:55:05

Let’s fit the model’s preprocessors (i.e. its input pipeline), using some dummy classification targets y for the fit call to be applicable.

y = pd.DataFrame({"target": [True, True, True, False]})
mlp_model.fit(raw_df, y, fit_preprocessors=True, fit_model=False)

Every vector model supports the method compute_model_inputs to run the input pipeline and generate the data that would actually be passed to the underlying model:

mlp_model.compute_model_inputs(raw_df)
hour_of_day day_of_week_0 day_of_week_1 day_of_week_2
0 0.521739 0.0 0.0 1.0
1 0.782609 0.0 1.0 0.0
3 0.304348 1.0 0.0 0.0

Feature Registries#

When experimenting with different models, features and their representation play a central role. As we add new features or different representations of the information we have, we require a concise and explicit way of defining the set of features a model shall use. The use of a registry enables this: using FeatureGeneratorRegistry, we can refer to features by name or by other hashable identifiers.

Especially in larger projects, the use of an enum comprising the set of features is recommended. Let us define a registry containing the two features we considered above:

from enum import Enum

class FeatureName(Enum):
    HOUR_OF_DAY = "hour_of_day"
    DAY_OF_WEEK = "day_of_week"

registry = sensai.featuregen.FeatureGeneratorRegistry()
registry.register_factory(FeatureName.HOUR_OF_DAY, lambda: FeatureGeneratorHourOfDay())
registry.register_factory(FeatureName.DAY_OF_WEEK, lambda: FeatureGeneratorDayOfWeek())

Using this registry, we can obtain a feature collector for use in a model as follows:

feature_collector = registry.collect_features(FeatureName.HOUR_OF_DAY, FeatureName.DAY_OF_WEEK)
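The collector obtained from the registry can then be injected into a model just like a manually constructed one, e.g. reusing the input pipeline defined earlier (all model and transformer classes are the ones introduced above):

mlp_model_from_registry = sensai.sklearn.sklearn_classification.SkLearnMLPVectorClassificationModel() \
    .with_raw_input_transformers(DFTStringToTimestamp(), sensai.data_transformation.DFTDropNA()) \
    .with_feature_collector(feature_collector) \
    .with_feature_transformers(
        feature_collector.create_dft_one_hot_encoder(),
        feature_collector.create_dft_normalisation())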

Example Use Case: Titanic Survival#

The Titanic Survival Data Set#

Let us consider the well-known Titanic Survival data set as an example.

Every data point holds data on a passenger. The data set has the following potentially predictive columns,

  • Pclass: the passenger ticket class as an integer (1=first, 2=second, 3=third)

  • Sex: the passenger’s sex (male or female)

  • Age: the passenger’s age in years (integer); this feature is frequently missing

  • SibSp: the number of siblings and spouses of the passenger

  • Parch: the number of parents and children of the passenger

  • Fare: the fare price paid

  • Embarked: the port of embarkation (C=Cherbourg, Q=Queenstown, S=Southampton); this feature is missing for two passengers

and some further meta-data columns (Name, Cabin).

The goal is to predict the column ‘Survived’ indicating whether the passenger survived (1) or not (0).

dataset = sensai.data.dataset.DataSetClassificationTitanicSurvival()
io_data = dataset.load_io_data()
io_data.to_df().iloc[:5]
Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Survived
PassengerId
1 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 0
2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 1
3 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 1
4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 1
5 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 0

Raw Input Transformers (Pre-Processing)#

We shall now add pipeline components to an XGBoost model, as it can straightforwardly deal with missing data.

The dataset doesn’t really require any pre-processing, but we could (just for demonstration purposes)

  • get rid of the useless meta-data columns,

  • convert the passenger class feature into a string to ensure that it is not treated as a numerical feature.

class DFTTitanicDropMetaDataColumns(sensai.data_transformation.DFTColumnFilter):
    def __init__(self):
        super().__init__(drop=[dataset.COL_NAME, dataset.COL_CABIN, dataset.COL_TICKET])
        
class DFTTitanicTransformPassengerClass(sensai.data_transformation.DFTModifyColumn):
    def __init__(self):
        super().__init__(
            column=dataset.COL_PASSENGER_CLASS, 
            column_transform=lambda n: {1: "first", 2: "second", 3: "third"}[n])

Let’s try applying them:

DFTTitanicDropMetaDataColumns().chain(DFTTitanicTransformPassengerClass()).apply(io_data.inputs).iloc[:5]
Pclass Sex Age SibSp Parch Fare Embarked
PassengerId
1 third male 22.0 1 0 7.2500 S
2 first female 38.0 1 0 71.2833 C
3 third female 26.0 0 0 7.9250 S
4 first female 35.0 1 0 53.1000 S
5 third male 35.0 0 0 8.0500 S

Feature Generators for Titanic Survival#

In the Titanic Survival data set, the features are already fully prepared, so we do not need to actually generate anything; we can simply take the feature values as they are present in the original data frame and add only the necessary meta-data. As mentioned above, the base class for this purpose is FeatureGeneratorTakeColumns, which allows us to take over columns directly from the input data. We could use a single feature generator for all features as follows:

class FeatureGeneratorTitanicAll(sensai.featuregen.FeatureGeneratorTakeColumns):
    def __init__(self):
        super().__init__(
            columns=None,  # take all columns
            categorical_feature_names=[dataset.COL_SEX, dataset.COL_PASSENGER_CLASS, dataset.COL_PORT_EMBARKED],
            normalisation_rule_template=sensai.featuregen.DFTNormalisation.RuleTemplate(
                transformer_factory=sensai.data_transformation.SkLearnTransformerFactoryFactory.MaxAbsScaler(),
                independent_columns=True))

We have supplied meta-data regarding both

  • the subset of features that are categorical

  • the normalisation rule to be applied to the numerical features (if normalisation is applied with DFTNormalisation).
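As before, this meta-data can be inspected (and used by downstream transformers) via the corresponding accessor methods:

fg_titanic = FeatureGeneratorTitanicAll()
fg_titanic.get_categorical_feature_name_regex()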

Defining Models with Customised Pipelines#

Let us now define two models, an XGBoost model as well as a torch-based multi-layer perceptron (MLP) model based on the raw input transformers and the feature generator defined above.

feature_collector = sensai.featuregen.FeatureCollector(FeatureGeneratorTitanicAll())

xgb_model = sensai.xgboost.XGBGradientBoostedVectorClassificationModel() \
    .with_raw_input_transformers(
        DFTTitanicDropMetaDataColumns(),
        DFTTitanicTransformPassengerClass()) \
    .with_name("XGBoost") \
    .with_feature_collector(feature_collector, shared=True) \
    .with_feature_transformers(
        feature_collector.create_feature_transformer_one_hot_encoder(ignore_unknown=True))

torch_mlp_model = sensai.torch.models.MultiLayerPerceptronVectorClassificationModel(
        hid_activation_function=torch.relu,
        hidden_dims=[10, 10, 4],
        cuda=False,
        p_dropout=0.25,
        nn_optimiser_params=sensai.torch.NNOptimiserParams(early_stopping_epochs=10)) \
    .with_name("MLP") \
    .with_raw_input_transformers(
        DFTTitanicDropMetaDataColumns(),
        DFTTitanicTransformPassengerClass()) \
    .with_feature_collector(feature_collector, shared=True) \
    .with_feature_transformers(
        sensai.data_transformation.DFTColumnFilter(drop=[dataset.COL_PORT_EMBARKED, dataset.COL_AGE_YEARS]),
        feature_collector.create_feature_transformer_one_hot_encoder(ignore_unknown=True),
        feature_collector.create_dft_normalisation())

Notice that the model definitions are purely declarative: We define each model and the respective feature pipeline by injecting appropriate pipeline components.

Both models use one-hot encoding of categorical features. For the multi-layer perceptron model, we notably added some additional feature transformers:

  • Since this type of model cannot cope with missing feature values, we added a component that drops the age and port columns, which are sometimes undefined.

  • Since neural networks work best with normalised feature representations, we added the normalisation component, which uses the scaler defined in the feature generator (a max-abs scaler).

Evaluating Models#

We define an evaluation object for the data set and subsequently apply it in order to compare the two models:

evaluator_params = sensai.evaluation.ClassificationEvaluatorParams(fractional_split_test_fraction=0.2)
titanic_evaluation = sensai.evaluation.ClassificationModelEvaluation(io_data, evaluator_params=evaluator_params)

titanic_evaluation.compare_models([xgb_model, torch_mlp_model]).results_df
accuracy balancedAccuracy precision recall F1
model_name
XGBoost 0.821229 0.810054 0.746269 0.769231 0.757576
MLP 0.815642 0.775911 0.820000 0.630769 0.713043

More Complex Use Cases#

For some more complex example applications, see the examples folder in the sensAI repository.