pandas
Source code: sensai/util/pandas.py
- class DataFrameColumnChangeTracker(initial_df: DataFrame)
Bases: object
A simple class for keeping track of changes in columns between an initial data frame and some other data frame (usually the result of some transformations performed on the initial one).
Example:
>>> from sensai.util.pandas import DataFrameColumnChangeTracker
>>> import pandas as pd
>>> df = pd.DataFrame({"bar": [1, 2]})
>>> columnChangeTracker = DataFrameColumnChangeTracker(df)
>>> df["foo"] = [4, 5]
>>> columnChangeTracker.track_change(df)
>>> columnChangeTracker.get_removed_columns()
set()
>>> columnChangeTracker.get_added_columns()
{'foo'}
- extract_array(df: DataFrame, dtype=None)
Extracts an array from a data frame. Each row is expected to correspond to a data point and each column to a “channel”. Moreover, all entries are expected to be arrays of the same shape (or scalars, or sequences of the same length). We will refer to that shape as tensorShape.
The output will be of shape (N_rows, N_columns, *tensorShape). Thus, N_rows can be interpreted as the dataset length (or the batch size, if a single batch is passed) and N_columns as the number of channels. Empty dimensions are stripped: if the data frame has only one column, the array will have shape (N_rows, *tensorShape). For example, a set of images with three channels could equally be passed as a data frame of the form
R | G | B
channel | channel | channel
channel | channel | channel
… | … | …
or as a data frame of the form
image
RGB-array
RGB-array
…
In both cases, the returned array will have shape (N_images, 3, width, height).
- Parameters:
df – data frame where each entry is an array of shape tensorShape
dtype – if not None, convert the array’s data type to this type (string or numpy dtype)
- Returns:
array of shape (N_rows, N_columns, *tensorShape) with stripped empty dimensions
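The shape convention described above can be illustrated with a small sketch. It does not call sensai; stack_frame_entries is a hypothetical helper written only to mimic the documented behaviour with NumPy and pandas, and the image sizes are arbitrary assumptions. The actual implementation may differ, for instance in how it treats other length-1 dimensions.

```python
import numpy as np
import pandas as pd


def stack_frame_entries(df: pd.DataFrame, dtype=None) -> np.ndarray:
    """Hypothetical stand-in that follows the documented shape convention."""
    # Stack the entries of each row along a new "channel" axis
    rows = [
        np.stack([np.asarray(entry) for entry in row])
        for row in df.itertuples(index=False, name=None)
    ]
    # Stack the rows: (N_rows, N_columns, *tensorShape)
    arr = np.stack(rows)
    # A single column carries no channel information, so drop that axis
    if arr.shape[1] == 1:
        arr = arr[:, 0]
    if dtype is not None:
        arr = arr.astype(dtype)
    return arr


n_images, width, height = 4, 5, 6  # arbitrary sizes for illustration
rng = np.random.default_rng(0)

# Variant 1: one column per channel, each entry has shape (width, height)
df_channels = pd.DataFrame(
    {c: [rng.random((width, height)) for _ in range(n_images)] for c in "RGB"}
)

# Variant 2: a single column whose entries have shape (3, width, height)
df_images = pd.DataFrame(
    {"image": [rng.random((3, width, height)) for _ in range(n_images)]}
)

shape_channels = stack_frame_entries(df_channels).shape
shape_images = stack_frame_entries(df_images, dtype="float32").shape
print(shape_channels, shape_images)
```

Both variants yield the same shape, (N_images, 3, width, height), which is the point of the convention: the channel axis can come either from the columns or from the entries themselves.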
- remove_duplicate_index_entries(df: DataFrame)
Removes successive duplicate index entries by keeping only the first occurrence for every duplicate index element.
- Parameters:
df – the data frame, which is assumed to have a sorted index
- Returns:
the (modified) data frame with duplicate index entries removed
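For a sorted index, where duplicates are necessarily adjacent, the same effect can be sketched with plain pandas. This is an illustration using Index.duplicated, not the function's own code, and the example data is made up.

```python
import pandas as pd

# Sorted index with repeated entries (illustrative data)
df = pd.DataFrame(
    {"value": [10, 11, 20, 30, 31, 32]},
    index=[1, 1, 2, 3, 3, 3],
)

# Keep only the first row for each index value
deduplicated = df[~df.index.duplicated(keep="first")]
print(deduplicated)
```

Note that Index.duplicated flags every repeated value, not only successive ones, so the two approaches agree only under the stated assumption that the index is sorted.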