pandas

class DataFrameColumnChangeTracker(initial_df: pandas.DataFrame)

Bases: object

A simple class for keeping track of changes in columns between an initial data frame and some other data frame (usually the result of some transformations performed on the initial one).

Example:

>>> from sensai.util.pandas import DataFrameColumnChangeTracker
>>> import pandas as pd

>>> df = pd.DataFrame({"bar": [1, 2]})
>>> columnChangeTracker = DataFrameColumnChangeTracker(df)
>>> df["foo"] = [4, 5]
>>> columnChangeTracker.track_change(df)
>>> columnChangeTracker.get_removed_columns()
set()
>>> columnChangeTracker.get_added_columns()
{'foo'}

track_change(changed_df: pandas.DataFrame)

get_removed_columns()

get_added_columns()

Returns the columns in the last entry of the history that were not present in the first one.

column_change_string()

Returns a string representation of the change.

assert_change_was_tracked()
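
The added and removed column sets shown in the example correspond to plain set differences over column names. A minimal sketch of that idea using only pandas, assuming the comparison is between the initial frame and the most recently tracked frame; the helper `column_changes` is hypothetical and not part of sensai:

```python
import pandas as pd


def column_changes(initial_df: pd.DataFrame, changed_df: pd.DataFrame):
    """Hypothetical helper: returns (removed, added) sets of column names."""
    initial_cols = set(initial_df.columns)
    changed_cols = set(changed_df.columns)
    removed = initial_cols - changed_cols  # present initially, gone afterwards
    added = changed_cols - initial_cols    # absent initially, present afterwards
    return removed, added


initial = pd.DataFrame({"bar": [1, 2], "baz": [3, 4]})
# Drop one column and add another to produce both kinds of change
changed = initial.drop(columns=["baz"]).assign(foo=[4, 5])
removed, added = column_changes(initial, changed)
```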

extract_array(df: pandas.DataFrame, dtype=None)

Extracts an array from a data frame. It is expected that each row corresponds to a data point and each column corresponds to a “channel”. Moreover, all entries are expected to be arrays of the same shape (or scalars or sequences of the same length). We will refer to that shape as tensorShape.

The output will be of shape (N_rows, N_columns, *tensorShape). Thus, N_rows can be interpreted as dataset length (or batch size, if a single batch is passed) and N_columns can be interpreted as number of channels. Empty dimensions will be stripped, thus if the data frame has only one column, the array will have shape (N_rows, *tensorShape). E.g. an image with three channels could equally be passed as data frame of the type

R	G	B
channel	channel	channel
channel	channel	channel
…	…	…

or as data frame of type

image
RGB-array
RGB-array
…

In both cases, the returned array will have shape (N_images, 3, width, height).

Parameters:

df – data frame where each entry is an array of shape tensorShape
dtype – if not None, convert the array’s data type to this type (string or numpy dtype)

Returns:

array of shape (N_rows, N_columns, *tensorShape) with stripped empty dimensions
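
The shape rule described above can be illustrated with plain numpy. This is a sketch for illustration only, not the sensai implementation: `extract_array_sketch` and `object_column` are hypothetical names, and the sketch strips only a length-1 column axis, which is a simplifying assumption about how empty dimensions are removed.

```python
import numpy as np
import pandas as pd


def object_column(arrays):
    """Packs a list of arrays into a 1-D object array, one array per cell."""
    col = np.empty(len(arrays), dtype=object)
    for i, a in enumerate(arrays):
        col[i] = a
    return col


def extract_array_sketch(df, dtype=None):
    """Hypothetical re-implementation of the documented shape rule."""
    # Each column becomes an array of shape (N_rows, *tensorShape)
    per_column = [np.stack([np.asarray(v) for v in df[name]]) for name in df.columns]
    # Stacking the columns on axis 1 gives (N_rows, N_columns, *tensorShape)
    arr = np.stack(per_column, axis=1)
    if arr.shape[1] == 1:
        # Assumption: only the length-1 column axis is stripped in this sketch
        arr = arr.squeeze(axis=1)
    if dtype is not None:
        arr = arr.astype(dtype)
    return arr


n_images, width, height = 2, 4, 5
# Variant 1: one column per channel, each cell a (width, height) array
df_channels = pd.DataFrame(
    {c: object_column([np.zeros((width, height)) for _ in range(n_images)]) for c in "RGB"}
)
# Variant 2: a single column, each cell a (3, width, height) array
df_images = pd.DataFrame(
    {"image": object_column([np.zeros((3, width, height)) for _ in range(n_images)])}
)
shape_channels = extract_array_sketch(df_channels).shape
converted = extract_array_sketch(df_images, dtype="float32")
shape_images = converted.shape
```

Both variants end up with the same shape, which is the equivalence the two tables above describe.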

remove_duplicate_index_entries(df: pandas.DataFrame)

Removes successive duplicate index entries by keeping only the first occurrence for every duplicate index element.

Parameters:

df – the data frame, which is assumed to have a sorted index

Returns:

the (modified) data frame with duplicate index entries removed
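
For a sorted index, successive duplicates are the same as duplicates in general, so the documented behaviour can be approximated with pandas' own `Index.duplicated`. This is a sketch of the effect, not necessarily how the function is implemented:

```python
import pandas as pd

# Sorted index with repeated entries 1 and 3
df = pd.DataFrame({"value": [10, 11, 20, 30, 31]}, index=[1, 1, 2, 3, 3])

# Mark every occurrence after the first as a duplicate, then keep the rest
keep = ~df.index.duplicated(keep="first")
deduplicated = df[keep]
```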

query_data_frame(df: pandas.DataFrame, sql: str)

Queries the given data frame with the given condition specified in SQL syntax.

NOTE: Requires duckdb to be installed.

Parameters:

df – the data frame to query
sql – an SQL query starting with the WHERE clause (excluding the ‘where’ keyword itself)

Returns:

the filtered/transformed data frame

class SeriesInterpolation

Bases: ABC

interpolate(series: pandas.Series, inplace: bool = False) → Optional[pandas.Series]

interpolate_all_with_combined_index(series_list: List[pandas.Series]) → List[pandas.Series]

Interpolates the given series using the combined index of all series.

Parameters:

series_list – the list of series to interpolate

Returns:

a list of corresponding interpolated series, each having the same index
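
The notion of a combined index can be sketched with plain pandas: take the union of all indices and reindex every series onto it. The resulting N/A gaps are what a concrete interpolation would then fill. The variable names are illustrative, and this shows only the alignment step under the assumption that "combined index" means the index union:

```python
import pandas as pd

s1 = pd.Series([0.0, 10.0], index=[0, 10])
s2 = pd.Series([1.0, 3.0], index=[5, 15])

# Union of both indices: every position that occurs in any series
combined_index = s1.index.union(s2.index)

# Reindexing introduces NaN wherever a series has no observation
aligned = [s.reindex(combined_index) for s in (s1, s2)]
```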

class SeriesInterpolationLinearIndex(ffill: bool = False, bfill: bool = False)

Bases: SeriesInterpolation

Parameters:

ffill – whether to fill any N/A values at the end of the series with the last valid observation
bfill – whether to fill any N/A values at the start of the series with the first valid observation
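
The effect of the two flags can be sketched with pandas' built-in index-based linear interpolation. The function `interpolate_linear_index` is hypothetical and only mirrors the parameter descriptions above; it assumes that, without the flags, leading and trailing N/A values are left untouched:

```python
import numpy as np
import pandas as pd


def interpolate_linear_index(series, ffill=False, bfill=False):
    """Hypothetical sketch, not the sensai implementation."""
    # Linear interpolation weighted by index distance, interior gaps only
    result = series.interpolate(method="index", limit_area="inside")
    if ffill:
        result = result.ffill()  # fill trailing N/A with the last valid value
    if bfill:
        result = result.bfill()  # fill leading N/A with the first valid value
    return result


s = pd.Series([np.nan, 0.0, np.nan, 4.0, np.nan], index=[0, 1, 2, 5, 6])
inner_only = interpolate_linear_index(s)
filled = interpolate_linear_index(s, ffill=True, bfill=True)
```

At index 2, which lies a quarter of the way from index 1 to index 5, the interpolated value is 1.0; the values at indices 0 and 6 are only filled when the corresponding flag is set.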

class SeriesInterpolationRepeatPreceding(bfill: bool = False)

Bases: SeriesInterpolation

Parameters:

bfill – whether to fill any N/A values at the start of the series with the first valid observation
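
Judging by the class name, gaps are filled by repeating the preceding valid value. Under that assumption, the behaviour resembles a pandas forward fill, with an optional backward fill for leading N/A values. A sketch with illustrative variable names:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 2.0], index=[0, 1, 2, 3, 4])

# Repeat the preceding valid value; the leading N/A has no predecessor and stays
repeated = s.ffill()

# With bfill-like behaviour, the leading N/A takes the first valid observation
repeated_bfill = s.ffill().bfill()
```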

average_series(series_list: List[pandas.Series], interpolation: SeriesInterpolation) → pandas.Series
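
The function carries no description here. Its signature suggests that the series are first brought onto a common index via the given interpolation and then averaged point-wise; that reading is an assumption. A plain-pandas sketch of such an average, using index-based linear interpolation as a stand-in for the `interpolation` argument:

```python
import pandas as pd

s1 = pd.Series([0.0, 10.0], index=[0, 10])
s2 = pd.Series([2.0, 2.0, 6.0], index=[0, 5, 10])

# Align both series on the union of their indices and fill the gaps linearly
combined = s1.index.union(s2.index)
interpolated = [s.reindex(combined).interpolate(method="index") for s in (s1, s2)]

# Point-wise mean across the aligned series
average = pd.concat(interpolated, axis=1).mean(axis=1)
```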