pandas


class DataFrameColumnChangeTracker(initial_df: DataFrame)

Bases: object

A simple class for keeping track of changes in columns between an initial data frame and some other data frame (usually the result of some transformations performed on the initial one).

Example:

>>> from sensai.util.pandas import DataFrameColumnChangeTracker
>>> import pandas as pd
>>> df = pd.DataFrame({"bar": [1, 2]})
>>> columnChangeTracker = DataFrameColumnChangeTracker(df)
>>> df["foo"] = [4, 5]
>>> columnChangeTracker.track_change(df)
>>> columnChangeTracker.get_removed_columns()
set()
>>> columnChangeTracker.get_added_columns()
{'foo'}

track_change(changed_df: DataFrame)

get_removed_columns()

get_added_columns()

Returns the columns in the last entry of the history that were not present in the first one.
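
As a rough illustration of what "added" and "removed" mean here, the comparison boils down to a set difference between the column names of the initial and the changed data frame. The sketch below uses plain pandas only; the variable names are hypothetical, and this is not necessarily how the class works internally.

```python
import pandas as pd

initial_df = pd.DataFrame({"bar": [1, 2], "baz": [3, 4]})
# Drop one column and add another to simulate a transformation
changed_df = initial_df.drop(columns=["baz"]).assign(foo=[4, 5])

# Hypothetical stand-ins for the tracker's bookkeeping (not sensAI code)
initial_columns = set(initial_df.columns)
final_columns = set(changed_df.columns)

added_columns = final_columns - initial_columns    # present only at the end
removed_columns = initial_columns - final_columns  # present only at the start

print(added_columns)    # {'foo'}
print(removed_columns)  # {'baz'}
```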

column_change_string()

Returns a string representation of the change.

assert_change_was_tracked()

extract_array(df: DataFrame, dtype=None)

Extracts an array from a data frame. It is expected that each row corresponds to a data point and each column corresponds to a “channel”. Moreover, all entries are expected to be arrays of the same shape (or scalars or sequences of the same length). We will refer to that shape as tensorShape.

The output will be of shape (N_rows, N_columns, *tensorShape). Thus, N_rows can be interpreted as the dataset length (or batch size, if a single batch is passed) and N_columns can be interpreted as the number of channels. Empty dimensions will be stripped; thus, if the data frame has only one column, the array will have shape (N_rows, *tensorShape). For example, an image with three channels could equally be passed as a data frame of the type

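+---------+---------+---------+
| R       | G       | B       |
+=========+=========+=========+
| channel | channel | channel |
+---------+---------+---------+
| channel | channel | channel |
+---------+---------+---------+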

or as a data frame of the type

+-----------+
| image     |
+===========+
| RGB-array |
+-----------+
| RGB-array |
+-----------+

In both cases, the returned array will have shape (N_images, 3, width, height).

Parameters:
  • df – data frame where each entry is an array of shape tensorShape

  • dtype – if not None, convert the array’s data type to this type (string or numpy dtype)

Returns:

array of shape (N_rows, N_columns, *tensorShape) with stripped empty dimensions
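
The shape convention described above can be sketched with NumPy alone. The helper `stack_frame` below is hypothetical and is not the sensAI implementation; it only mirrors the documented behaviour for the two image layouts, assuming that the single-column case drops the column axis.

```python
import numpy as np
import pandas as pd


def stack_frame(df: pd.DataFrame, dtype=None) -> np.ndarray:
    """Hypothetical helper illustrating the documented output shape."""
    # Stack the entries of each row into (N_columns, *tensorShape) ...
    rows = [np.stack([np.asarray(entry) for entry in row]) for row in df.to_numpy()]
    # ... then stack all rows into (N_rows, N_columns, *tensorShape)
    arr = np.stack(rows)
    if arr.shape[1] == 1:
        # Single column: drop the column axis, leaving (N_rows, *tensorShape)
        arr = arr[:, 0]
    if dtype is not None:
        arr = arr.astype(dtype)
    return arr


width, height = 4, 5

# Layout 1: one column per channel, each entry of shape (width, height)
per_channel_df = pd.DataFrame({
    c: [np.zeros((width, height)), np.ones((width, height))] for c in ("R", "G", "B")
})

# Layout 2: a single column holding arrays of shape (3, width, height)
image_df = pd.DataFrame({
    "image": [np.zeros((3, width, height)), np.ones((3, width, height))]
})

per_channel_arr = stack_frame(per_channel_df)
image_arr = stack_frame(image_df, dtype="float32")

print(per_channel_arr.shape)  # (2, 3, 4, 5)
print(image_arr.shape)        # (2, 3, 4, 5)
```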

remove_duplicate_index_entries(df: DataFrame)

Removes successive duplicate index entries by keeping only the first occurrence for every duplicate index element.

Parameters:

df – the data frame, which is assumed to have a sorted index

Returns:

the (modified) data frame with duplicate index entries removed
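
For a sorted index, successive duplicates are the same as duplicates in general, so the described result can be reproduced with plain pandas. The snippet below is a sketch under that assumption and does not call the sensAI function itself; the example data is made up for illustration.

```python
import pandas as pd

# Sorted index in which the entry 1 occurs twice in succession
df = pd.DataFrame({"value": [10, 11, 12, 13]}, index=[0, 1, 1, 2])

# Keep only the first row for every repeated index entry
deduplicated = df[~df.index.duplicated(keep="first")]

print(list(deduplicated.index))     # [0, 1, 2]
print(list(deduplicated["value"]))  # [10, 11, 13]
```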