io_data#
Source code: sensai/data/io_data.py
- class BaseInputOutputData(inputs: T, outputs: T)[source]#
Bases: Generic[T], ABC
- Parameters:
inputs – expected to have shape and __len__
outputs – expected to have shape and __len__
- abstract filter_indices(indices: Sequence[int]) BaseInputOutputData [source]#
- class InputOutputArrays(inputs: ndarray, outputs: ndarray)[source]#
Bases: BaseInputOutputData[ndarray]
- Parameters:
inputs – expected to have shape and __len__
outputs – expected to have shape and __len__
- filter_indices(indices: Sequence[int]) InputOutputArrays [source]#
- class InputOutputData(inputs: DataFrame, outputs: DataFrame, weights: Optional[Union[Series, DataPointWeighting]] = None)[source]#
Bases: BaseInputOutputData[DataFrame], ToStringMixin
Holds input and output data for learning problems
- Parameters:
inputs – expected to have shape and __len__
outputs – expected to have shape and __len__
- classmethod from_data_frame(df: DataFrame, *output_columns: str) InputOutputData [source]#
- Parameters:
df – a data frame containing both input and output columns
output_columns – the output column name(s)
- Returns:
an InputOutputData instance with inputs and outputs separated
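The column separation performed by from_data_frame can be illustrated with plain pandas (a sketch of the documented behavior, not the library's implementation; the frame and column names are made up for illustration):

```python
import pandas as pd

# A single data frame containing both input and output columns
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [0.1, 0.2, 0.3],
    "target": [10.0, 20.0, 30.0],
})

# from_data_frame(df, "target") conceptually separates the given output
# column(s) from all remaining columns, which become the inputs
output_columns = ["target"]
inputs = df[[c for c in df.columns if c not in output_columns]]
outputs = df[output_columns]

print(list(inputs.columns))   # remaining columns become the inputs
print(list(outputs.columns))  # the named columns become the outputs
```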
- to_data_frame(add_weights: bool = False, weights_col_name: str = 'weights') DataFrame [source]#
- Parameters:
add_weights – whether to add the weights as a column (provided that weights are present)
weights_col_name – the column name to use for weights if add_weights is True
- Returns:
a data frame containing both the inputs and outputs (and optionally the weights)
- filter_indices(indices: Sequence[int]) InputOutputData [source]#
- filter_index(index_elements: Sequence[any]) InputOutputData [source]#
- property input_dim#
- property output_dim#
- apply_weighting(weighting: DataPointWeighting)[source]#
- class DataSplitterFractional(fractional_size_of_first_set: float, shuffle=True, random_seed=42)[source]#
Bases: DataSplitter
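The semantics suggested by the constructor arguments can be sketched as follows (an assumption based on the parameter names, not the library's actual implementation): optionally shuffle the indices with a fixed seed, then cut at the requested fraction.

```python
import numpy as np

def fractional_split_indices(n: int, fractional_size_of_first_set: float,
                             shuffle: bool = True, random_seed: int = 42):
    """Sketch: split indices 0..n-1 into two sets at the given fraction."""
    indices = np.arange(n)
    if shuffle:
        # a fixed seed makes the split reproducible
        np.random.RandomState(random_seed).shuffle(indices)
    cut = round(n * fractional_size_of_first_set)
    return indices[:cut], indices[cut:]

first, second = fractional_split_indices(10, 0.8)
print(len(first), len(second))  # 8 2
```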
- class DataSplitterFromDataFrameSplitter(data_frame_splitter: DataFrameSplitter, fractional_size_of_first_set: float, apply_to_input=True)[source]#
Bases: DataSplitter[InputOutputData]
Creates a DataSplitter from a DataFrameSplitter, which can be applied either to the input or the output data. It supports only InputOutputData, not other subclasses of BaseInputOutputData.
- Parameters:
data_frame_splitter – the splitter to apply
fractional_size_of_first_set – the desired fractional size of the first set when applying the splitter
apply_to_input – if True, apply the splitter to the input data frame; if False, apply it to the output data frame
- split(data: InputOutputData) Tuple[InputOutputData, InputOutputData] [source]#
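Conceptually, split computes split indices on one of the two data frames and then applies the same indices to both, keeping inputs and outputs aligned. A hypothetical pandas sketch (the hard-coded indices stand in for a DataFrameSplitter's compute_split_indices result):

```python
import pandas as pd

inputs = pd.DataFrame({"x": range(6)})
outputs = pd.DataFrame({"y": [v * 2 for v in range(6)]})

# stand-in for data_frame_splitter.compute_split_indices(inputs, 0.5)
indices_a, indices_b = [0, 1, 2], [3, 4, 5]

# the same row indices are applied to inputs and outputs alike,
# so each resulting pair stays consistent
first = (inputs.iloc[indices_a], outputs.iloc[indices_a])
second = (inputs.iloc[indices_b], outputs.iloc[indices_b])
print(len(first[0]), len(second[0]))  # 3 3
```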
- class DataSplitterFromSkLearnSplitter(sklearn_splitter)[source]#
Bases: DataSplitter
- Parameters:
sklearn_splitter – an instance of one of the splitter classes from sklearn.model_selection, see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection
- class DataSplitterStratifiedShuffleSplit(fractional_size_of_first_set: float, random_seed=42)[source]#
Bases: DataSplitterFromSkLearnSplitter
- Parameters:
fractional_size_of_first_set – the desired fractional size of the first set
random_seed – the random seed to use for the stratified shuffle split
- static is_applicable(io_data: InputOutputData)[source]#
- class DataFrameSplitter[source]#
Bases: ABC
- abstract compute_split_indices(df: DataFrame, fractional_size_of_first_set: float) Tuple[Sequence[int], Sequence[int]] [source]#
- class DataFrameSplitterFractional(shuffle=False, random_seed=42)[source]#
Bases: DataFrameSplitter
- class DataFrameSplitterColumnEquivalenceClass(column: str, shuffle=True, random_seed=42)[source]#
Bases: DataFrameSplitter
Performs a split that keeps together data points/rows that have the same value in a given column, i.e. with respect to that column, the items having the same values are viewed as a unit; they form an equivalence class, and all data points belonging to the same class are either in the first set or the second set.
The split is performed at the level of unique items in the column, i.e. the given fraction of equivalence classes will end up in the first set and the rest in the second set.
The list of unique items in the column can be shuffled before applying the split. If no shuffling is applied, the original order in the data frame is maintained; if, in addition, the items were grouped by equivalence class in the original data frame, the split corresponds to a fractional split without shuffling where the split boundary is adjusted so as not to separate an equivalence class.
- Parameters:
column – the column which defines the equivalence classes (groups of data points/rows that must not be separated)
shuffle – whether to shuffle the list of unique values in the given column before applying the split
random_seed – the random seed to use for shuffling (only relevant if shuffle is True)
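The group-preserving behavior described above can be sketched with plain pandas (a hypothetical illustration of the documented semantics, not the library's implementation): the split operates on the unique column values, so rows sharing a value always land in the same set.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"group": ["a", "a", "b", "b", "c", "c"], "x": range(6)})

# split at the level of unique values, not individual rows
unique_values = list(pd.unique(df["group"]))
rng = np.random.RandomState(42)
rng.shuffle(unique_values)

cut = round(len(unique_values) * 2 / 3)  # fractional size of the first set
first_classes = set(unique_values[:cut])

first = df[df["group"].isin(first_classes)]
second = df[~df["group"].isin(first_classes)]

# each equivalence class ends up entirely in one of the two sets
assert set(first["group"]).isdisjoint(set(second["group"]))
```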
- class DataPointWeightingRegressionTargetIntervalTotalWeight(intervals_weights: Sequence[Tuple[float, float]])[source]#
Bases: DataPointWeighting
Based on relative weights specified for intervals of the regression target, will weight individual data point weights such that the sum of weights of data points within each interval satisfies the user-specified relative weight, while ensuring that the total weight of all data points is still equal to the number of data points.
For example, if one specifies intervals_weights as [(0.5, 1), (inf, 2)], then the data points with target values up to 0.5 will get 1/3 of the weight and the remaining data points will get 2/3 of the weight. So if there are 100 data points and 50 of them are in the first interval (up to 0.5), then these 50 data points will each get weight 1/3*100/50=2/3 and the remaining 50 data points will each get weight 2/3*100/50=4/3. The sum of all weights is the number of data points, i.e. 100.
Example:
>>> targets = [0.1, 0.2, 0.5, 0.7, 0.8, 0.6]
>>> x = pd.DataFrame({"foo": np.zeros(len(targets))})
>>> y = pd.DataFrame({"target": targets})
>>> weighting = DataPointWeightingRegressionTargetIntervalTotalWeight([(0.5, 1), (1.0, 2)])
>>> weights = weighting.compute_weights(x, y)
>>> assert(np.isclose(weights.sum(), len(y)))
>>> weights.tolist()
[0.6666666666666666, 0.6666666666666666, 0.6666666666666666, 1.3333333333333333, 1.3333333333333333, 1.3333333333333333]
- Parameters:
intervals_weights – a sequence of tuples (upper_bound, rel_total_weight), where upper_bound is the upper bound of the interval (lower_bound, upper_bound]; the lower bound is the upper bound of the preceding interval, or -inf for the first interval. rel_total_weight specifies the relative total weight of all data points within the interval.
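The arithmetic of the 100-point example above can be reproduced with plain NumPy (a sketch of the weighting scheme, not the library's implementation):

```python
import numpy as np

# 50 points in the first interval (target <= 0.5), 50 in the second
targets = np.concatenate([np.full(50, 0.3), np.full(50, 0.8)])
intervals = [(0.5, 1.0), (np.inf, 2.0)]  # (upper_bound, rel_total_weight)

total_rel = sum(w for _, w in intervals)
weights = np.empty_like(targets)
lower = -np.inf
for upper, rel in intervals:
    mask = (targets > lower) & (targets <= upper)
    # the interval's share of the total weight, spread evenly over its points
    weights[mask] = (rel / total_rel) * len(targets) / mask.sum()
    lower = upper

print(weights[0], weights[-1])  # 2/3 and 4/3, as in the example above
assert np.isclose(weights.sum(), len(targets))
```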