feature_importance#
Source code: sensai/feature_importance.py
- class FeatureImportance(feature_importance_dict: Union[Dict[str, float], Dict[str, Dict[str, float]]])[source]#
Bases:
object
- get_sorted_tuples(predicted_var_name=None, reverse=False) List[Tuple[str, float]] [source]#
- Parameters:
predicted_var_name – the predicted variable name for which to retrieve the sorted feature importance values
reverse – whether to reverse the order (i.e. descending order of importance values, where the most important feature comes first, rather than ascending order)
- Returns:
a sorted list of tuples (feature name, feature importance)
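The sorting behaviour can be sketched in plain Python (an illustrative stand-in, not sensai’s actual implementation): the importance dictionary is turned into (name, value) tuples sorted by value, ascending by default and descending with reverse=True.

```python
from typing import Dict, List, Tuple


def sorted_importance_tuples(importances: Dict[str, float],
                             reverse: bool = False) -> List[Tuple[str, float]]:
    # Mirrors get_sorted_tuples: sort by importance value;
    # reverse=True puts the most important feature first.
    return sorted(importances.items(), key=lambda t: t[1], reverse=reverse)


imp = {"age": 0.5, "income": 0.3, "zip": 0.2}
print(sorted_importance_tuples(imp, reverse=True))
# most important first: [('age', 0.5), ('income', 0.3), ('zip', 0.2)]
```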
- class FeatureImportanceProvider[source]#
Bases:
ABC
Interface for models that can provide feature importance values
- abstract get_feature_importance_dict() Union[Dict[str, float], Dict[str, Dict[str, float]]] [source]#
Gets the feature importance values
- Returns:
either a dictionary mapping feature names to importance values or (for models predicting multiple variables (independently)) a dictionary which maps predicted variable names to such dictionaries
- get_feature_importance() FeatureImportance [source]#
- plot_feature_importance(feature_importance_dict: Dict[str, float], subtitle: Optional[str] = None, sort=True) Figure [source]#
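A custom model can participate in this mechanism by implementing the provider interface. The sketch below uses a minimal stand-in for the ABC (the real class lives in sensai.feature_importance; MyModel and its values are purely illustrative):

```python
from abc import ABC, abstractmethod
from typing import Dict, Union


class FeatureImportanceProvider(ABC):
    """Minimal stand-in for sensai's interface: models expose a dict of
    feature importance values (or, for multi-variable models, a dict of dicts)."""

    @abstractmethod
    def get_feature_importance_dict(self) -> Union[Dict[str, float],
                                                   Dict[str, Dict[str, float]]]:
        ...


class MyModel(FeatureImportanceProvider):
    # hypothetical model with fixed importance values for illustration
    def get_feature_importance_dict(self) -> Dict[str, float]:
        return {"age": 0.7, "income": 0.3}


print(MyModel().get_feature_importance_dict())
# {'age': 0.7, 'income': 0.3}
```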
- class AggregatedFeatureImportance(*items: ~typing.Union[~sensai.feature_importance.FeatureImportanceProvider, ~typing.Dict[str, float], ~typing.Dict[str, ~typing.Dict[str, float]]], feature_agg_reg_ex: ~typing.Sequence[str] = (), agg_fn=<function mean>)[source]#
Bases:
object
Aggregates feature importance values (e.g. from models implementing FeatureImportanceProvider, such as sklearn’s RandomForest models and compatible models from lightgbm, etc.)
- Parameters:
items – (optional) initial list of feature importance providers or dictionaries to aggregate; further values can be added via method add
feature_agg_reg_ex – a sequence of regular expressions describing which feature names to sum as one. Each regex must contain exactly one group. If a regex matches a feature name, the feature importance will be summed under the key of the matched group instead of the full feature name. For example, the regex r"(\w+)_\d+$" will cause "foo_1" and "foo_2" to be summed under "foo" and similarly "bar_1" and "bar_2" to be summed under "bar".
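The regex-based grouping can be illustrated with a small pure-Python sketch (the function below is a simplified stand-in for the class’s internal logic, using sum as the aggregation for grouped features):

```python
import re
from collections import defaultdict
from typing import Dict, Sequence


def aggregate_by_regex(importances: Dict[str, float],
                       patterns: Sequence[str]) -> Dict[str, float]:
    # Features whose name matches a pattern are keyed by the single
    # capture group; all values under the same key are summed.
    grouped = defaultdict(float)
    for name, value in importances.items():
        key = name
        for pattern in patterns:
            m = re.match(pattern, name)
            if m:
                key = m.group(1)
                break
        grouped[key] += value
    return dict(grouped)


imp = {"foo_1": 0.2, "foo_2": 0.3, "bar_1": 0.1, "other": 0.4}
print(aggregate_by_regex(imp, [r"(\w+)_\d+$"]))
# foo_1/foo_2 collapse into 'foo', bar_1 into 'bar'; 'other' is unchanged
```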
- add(feature_importance: Union[FeatureImportanceProvider, Dict[str, float], Dict[str, Dict[str, float]]])[source]#
Adds the feature importance values from the given dictionary
- Parameters:
feature_importance – the dictionary obtained via a model’s get_feature_importance_dict method, or a FeatureImportanceProvider from which such a dictionary can be obtained
- get_aggregated_feature_importance_dict() Union[Dict[str, float], Dict[str, Dict[str, float]]] [source]#
- get_aggregated_feature_importance() FeatureImportance [source]#
- compute_permutation_feature_importance_dict(model, io_data: InputOutputData, scoring, num_repeats: int, random_state, exclude_input_preprocessors=False, num_jobs=None)[source]#
- class AggregatedPermutationFeatureImportance(aggregated_feature_importance: AggregatedFeatureImportance, scoring, num_repeats=5, random_seed=42, exclude_model_input_preprocessors=False, num_jobs: Optional[int] = None)[source]#
Bases:
ToStringMixin
- Parameters:
aggregated_feature_importance – the object in which to aggregate the feature importance (to which no feature importance values should have yet been added)
scoring – the scoring method; see https://scikit-learn.org/stable/modules/model_evaluation.html; e.g. “r2” for regression or “accuracy” for classification
num_repeats – the number of data permutations to apply for each model
random_seed – the random seed for shuffling the data
exclude_model_input_preprocessors – whether to exclude model input preprocessors, such that the feature importance will be reported on the transformed inputs that are actually fed to the model rather than the original inputs. Enabling this can, for example, help save time in cases where the input preprocessors discard many of the raw input columns, but it may not be a good idea if the preprocessors generate multiple columns from the original input columns.
num_jobs – the number of jobs to run in parallel. Each separate model-data permutation feature importance computation is parallelised over the columns. None means 1 unless in a joblib.parallel_backend context; -1 means using all processors.
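The underlying idea of permutation feature importance can be sketched without sklearn or sensai: a column’s importance is the mean drop in score when that column’s values are shuffled, breaking its relationship to the target. All names below are illustrative, not sensai’s API:

```python
import random


def permutation_importance(predict, X, y, score, num_repeats=5, seed=42):
    """Simplified sketch: importance of column j = mean drop in score
    after shuffling column j (averaged over num_repeats shuffles)."""
    rng = random.Random(seed)
    base_score = score(predict(X), y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(num_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # permute only column j
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base_score - score(predict(X_perm), y))
        importances.append(sum(drops) / num_repeats)
    return importances


# Toy model that only uses column 0, so column 1 should get importance 0
X = [[i % 2, i % 3] for i in range(20)]
y = [row[0] for row in X]
predict = lambda rows: [row[0] for row in rows]
accuracy = lambda preds, truth: sum(p == t for p, t in zip(preds, truth)) / len(truth)
print(permutation_importance(predict, X, y, accuracy))
```

Note how this motivates num_repeats and random_seed in the constructor above: the score drop is a random quantity, so it is averaged over several seeded shuffles.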
- add(model: VectorModel, io_data: InputOutputData)[source]#
- add_cross_validation_data(cross_val_data: VectorModelCrossValidationData)[source]#
- get_feature_importance() FeatureImportance [source]#