pandas


class DataFrameColumnChangeTracker(initial_df: DataFrame)

Bases: object

A simple class for keeping track of changes in columns between an initial data frame and some other data frame (usually the result of some transformations performed on the initial one).

Example:

>>> from sensai.util.pandas import DataFrameColumnChangeTracker
>>> import pandas as pd
>>> df = pd.DataFrame({"bar": [1, 2]})
>>> columnChangeTracker = DataFrameColumnChangeTracker(df)
>>> df["foo"] = [4, 5]
>>> columnChangeTracker.track_change(df)
>>> columnChangeTracker.get_removed_columns()
set()
>>> columnChangeTracker.get_added_columns()
{'foo'}
track_change(changed_df: DataFrame)
get_removed_columns()
get_added_columns()

Returns the columns in the last entry of the history that were not present in the first one.
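The added and removed columns can be understood as set differences between the column names of the initial and the changed data frame. The following is a minimal sketch of that idea only, not the tracker's actual implementation; the helper `column_changes` is a hypothetical name introduced for illustration.

```python
import pandas as pd


def column_changes(initial_df: pd.DataFrame, changed_df: pd.DataFrame):
    """Hypothetical helper: returns (added, removed) column names as sets."""
    initial_cols = set(initial_df.columns)
    changed_cols = set(changed_df.columns)
    added = changed_cols - initial_cols      # present now, absent initially
    removed = initial_cols - changed_cols    # present initially, absent now
    return added, removed


initial = pd.DataFrame({"bar": [1, 2], "baz": [3, 4]})
# Drop one column and add another to produce both kinds of change
changed = initial.drop(columns=["baz"]).assign(foo=[4, 5])
added, removed = column_changes(initial, changed)
```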

column_change_string()

Returns a string representation of the change

assert_change_was_tracked()
extract_array(df: DataFrame, dtype=None)

Extracts an array from a data frame. It is expected that each row corresponds to a data point and each column corresponds to a “channel”. Moreover, all entries are expected to be arrays of the same shape (or scalars or sequences of the same length). We will refer to that shape as tensorShape.

The output will be of shape (N_rows, N_columns, *tensorShape). Thus, N_rows can be interpreted as dataset length (or batch size, if a single batch is passed) and N_columns can be interpreted as number of channels. Empty dimensions will be stripped, thus if the data frame has only one column, the array will have shape (N_rows, *tensorShape). E.g. an image with three channels could equally be passed as data frame of the type

| R       | G       | B       |
|---------|---------|---------|
| channel | channel | channel |
| channel | channel | channel |

or as data frame of type

| image     |
|-----------|
| RGB-array |
| RGB-array |

In both cases the returned array will have shape (N_images, 3, width, height).

Parameters:
  • df – data frame where each entry is an array of shape tensorShape

  • dtype – if not None, convert the array’s data type to this type (string or numpy dtype)

Returns:

array of shape (N_rows, N_columns, *tensorShape) with stripped empty dimensions
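The shape convention described above can be illustrated with a small re-implementation. This is a sketch under stated assumptions, not the library's code: `extract_array_sketch` is a hypothetical name, and it only drops the column axis when the data frame has a single column, which may differ from how the actual function strips empty dimensions in other cases.

```python
import numpy as np
import pandas as pd


def extract_array_sketch(df: pd.DataFrame, dtype=None) -> np.ndarray:
    """Hypothetical re-implementation, for illustration only."""
    # Stack the entries of each row into shape (N_columns, *tensorShape)
    rows = [
        np.stack([np.asarray(entry) for entry in row])
        for row in df.itertuples(index=False, name=None)
    ]
    # Stack all rows into shape (N_rows, N_columns, *tensorShape)
    arr = np.stack(rows)
    # Assumption: only the column axis is dropped for a single-column frame
    if arr.shape[1] == 1:
        arr = arr[:, 0]
    return arr if dtype is None else arr.astype(dtype)


width, height = 4, 5
channel = np.zeros((width, height))
# Variant 1: one column per colour channel
per_channel_df = pd.DataFrame(
    {"R": [channel, channel], "G": [channel, channel], "B": [channel, channel]}
)
# Variant 2: a single column holding complete RGB arrays
rgb = np.zeros((3, width, height))
image_df = pd.DataFrame({"image": [rgb, rgb]})

a = extract_array_sketch(per_channel_df)
b = extract_array_sketch(image_df, dtype="float32")
```

Both `a` and `b` describe two images with three channels each, so they end up with the same shape even though the data frames are laid out differently.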

remove_duplicate_index_entries(df: DataFrame)

Removes successive duplicate index entries by keeping only the first occurrence for every duplicate index element.

Parameters:

df – the data frame, which is assumed to have a sorted index

Returns:

the (modified) data frame with duplicate index entries removed
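For a sorted index, keeping the first occurrence of each index value can be expressed with plain pandas via `Index.duplicated`. The snippet below is a sketch of the described behaviour, not necessarily how the function is implemented; the example data is made up.

```python
import pandas as pd

# Sorted index with repeated entries 1 and 3
df = pd.DataFrame({"value": [10, 11, 20, 30, 31]}, index=[1, 1, 2, 3, 3])

# duplicated(keep="first") marks every repeat after the first occurrence
deduplicated = df[~df.index.duplicated(keep="first")]
```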

query_data_frame(df: DataFrame, sql: str)

Queries the given data frame with the given condition specified in SQL syntax.

NOTE: Requires duckdb to be installed.

Parameters:
  • df – the data frame to query

  • sql – an SQL query starting with the WHERE clause (excluding the ‘where’ keyword itself)

Returns:

the filtered/transformed data frame

class SeriesInterpolation

Bases: ABC

interpolate(series: Series, inplace: bool = False) -> Optional[Series]
interpolate_all_with_combined_index(series_list: List[Series]) -> List[Series]

Interpolates the given series using the combined index of all series.

Parameters:

series_list – the list of series to interpolate

Returns:

a list of corresponding interpolated series, each having the same index
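The idea of a combined index can be sketched as follows: take the union of all indices, reindex every series onto it, and interpolate the gaps that reindexing introduces. In the class itself the interpolation method is left to the subclass; this sketch hard-codes linear interpolation on the index values as an assumption, and the function name is hypothetical.

```python
from typing import List

import pandas as pd


def interpolate_all_with_combined_index_sketch(
    series_list: List[pd.Series],
) -> List[pd.Series]:
    """Hypothetical sketch using linear interpolation on the index values."""
    # Build the union of all indices
    combined_index = series_list[0].index
    for s in series_list[1:]:
        combined_index = combined_index.union(s.index)
    # Reindex each series and fill only the gaps between valid observations
    return [
        s.reindex(combined_index).interpolate(method="index", limit_area="inside")
        for s in series_list
    ]


s1 = pd.Series([0.0, 4.0], index=[0, 4])
s2 = pd.Series([10.0, 30.0], index=[1, 3])
r1, r2 = interpolate_all_with_combined_index_sketch([s1, s2])
```

Here `r1` receives interpolated values at index positions 1 and 3, whereas `r2` keeps N/A values at positions 0 and 4 because they lie outside its observed range.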

class SeriesInterpolationLinearIndex(ffill: bool = False, bfill: bool = False)

Bases: SeriesInterpolation

Parameters:
  • ffill – whether to fill any N/A values at the end of the series with the last valid observation

  • bfill – whether to fill any N/A values at the start of the series with the first valid observation
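The effect of the two flags can be illustrated with plain pandas. This is a sketch assuming that interior gaps are filled linearly with respect to the index values and that leading and trailing N/A values are touched only when the respective flag is set; `interpolate_linear_index` is a hypothetical function, not the class's implementation.

```python
import pandas as pd


def interpolate_linear_index(series: pd.Series, ffill: bool = False, bfill: bool = False) -> pd.Series:
    """Hypothetical sketch of linear interpolation with optional edge filling."""
    # Fill only gaps that are surrounded by valid observations
    result = series.interpolate(method="index", limit_area="inside")
    if ffill:
        # Trailing N/A values take the last valid observation
        result = result.ffill()
    if bfill:
        # Leading N/A values take the first valid observation
        result = result.bfill()
    return result


nan = float("nan")
# Unevenly spaced index: the gap at 2 lies one third of the way from 1 to 4
s = pd.Series([nan, 1.0, nan, 4.0, nan], index=[0, 1, 2, 4, 5])

plain = interpolate_linear_index(s)
filled = interpolate_linear_index(s, ffill=True, bfill=True)
```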

class SeriesInterpolationRepeatPreceding(bfill: bool = False)

Bases: SeriesInterpolation

Parameters:

bfill – whether to fill any N/A values at the start of the series with the first valid observation
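Judging by the class name, each N/A value is replaced by the preceding valid observation, which corresponds to a forward fill in pandas. The sketch below rests on that assumption; `repeat_preceding` is a hypothetical function used for illustration only.

```python
import pandas as pd


def repeat_preceding(series: pd.Series, bfill: bool = False) -> pd.Series:
    """Hypothetical sketch: repeat the preceding valid value for each N/A."""
    result = series.ffill()
    if bfill:
        # Leading N/A values have no predecessor, so use the first valid value
        result = result.bfill()
    return result


nan = float("nan")
s = pd.Series([nan, 1.0, nan, 2.0])

without_bfill = repeat_preceding(s)
with_bfill = repeat_preceding(s, bfill=True)
```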

average_series(series_list: List[Series], interpolation: SeriesInterpolation) -> Series