pandas
Source code: sensai/util/pandas.py
- class DataFrameColumnChangeTracker(initial_df: DataFrame)
Bases: object
A simple class for keeping track of changes in columns between an initial data frame and some other data frame (usually the result of some transformations performed on the initial one).
Example:
>>> from sensai.util.pandas import DataFrameColumnChangeTracker
>>> import pandas as pd
>>> df = pd.DataFrame({"bar": [1, 2]})
>>> columnChangeTracker = DataFrameColumnChangeTracker(df)
>>> df["foo"] = [4, 5]
>>> columnChangeTracker.track_change(df)
>>> columnChangeTracker.get_removed_columns()
set()
>>> columnChangeTracker.get_added_columns()
{'foo'}
- extract_array(df: DataFrame, dtype=None)
Extracts an array from a data frame. Each row is expected to correspond to a data point and each column to a “channel”. Moreover, all entries are expected to be arrays of the same shape (or scalars, or sequences of the same length). We refer to that shape as tensorShape.
The output has shape (N_rows, N_columns, *tensorShape). Thus, N_rows can be interpreted as the dataset length (or the batch size, if a single batch is passed) and N_columns as the number of channels. Empty dimensions are stripped, so if the data frame has only one column, the array has shape (N_rows, *tensorShape). For example, an image with three channels could equally be passed as a data frame of the type
R        G        B
channel  channel  channel
channel  channel  channel
…        …        …
or as data frame of type
image
RGB-array
RGB-array
…
In both cases, the returned array will have shape (N_images, 3, width, height).
- Parameters:
df – data frame where each entry is an array of shape tensorShape
dtype – if not None, convert the array’s data type to this type (string or numpy dtype)
- Returns:
array of shape (N_rows, N_columns, *tensorShape) with stripped empty dimensions
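The shape rule described above can be illustrated with a minimal sketch built on NumPy and pandas only. It is not the sensAI implementation: `extract_array_sketch` and `object_column` are hypothetical names, and the choice to strip only size-1 axes after the row axis is an assumption made for this illustration.

```python
import numpy as np
import pandas as pd


def object_column(arrays):
    """Hypothetical helper: stores each array as one cell of an object column."""
    col = np.empty(len(arrays), dtype=object)
    for i, a in enumerate(arrays):
        col[i] = a
    return col


def extract_array_sketch(df: pd.DataFrame, dtype=None) -> np.ndarray:
    # Each row becomes an array of shape (N_columns, *tensorShape).
    rows = [
        np.stack([np.asarray(entry) for entry in row])
        for row in df.itertuples(index=False, name=None)
    ]
    # Stacking the rows gives (N_rows, N_columns, *tensorShape).
    arr = np.stack(rows)
    # Assumption: only size-1 axes after the row axis are stripped.
    empty_axes = tuple(ax for ax in range(1, arr.ndim) if arr.shape[ax] == 1)
    if empty_axes:
        arr = np.squeeze(arr, axis=empty_axes)
    if dtype is not None:
        arr = arr.astype(dtype)
    return arr


width, height = 4, 5

# Variant 1: one column per colour channel, each cell of shape (width, height).
per_channel = pd.DataFrame({
    name: object_column([np.zeros((width, height)), np.ones((width, height))])
    for name in ("R", "G", "B")
})

# Variant 2: a single column whose cells are full (3, width, height) arrays.
single_column = pd.DataFrame({
    "image": object_column([np.zeros((3, width, height)), np.ones((3, width, height))])
})

shape_per_channel = extract_array_sketch(per_channel).shape
shape_single_column = extract_array_sketch(single_column).shape
```

Both variants hold two images, so under the assumptions above both shapes come out as (2, 3, 4, 5).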
- remove_duplicate_index_entries(df: DataFrame)
Removes successive duplicate index entries by keeping only the first occurrence for every duplicate index element.
- Parameters:
df – the data frame, which is assumed to have a sorted index
- Returns:
the (modified) data frame with duplicate index entries removed
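For a sorted index, successive duplicates are the same as duplicates in general, so the described behaviour can be approximated with plain pandas. The following is a sketch under that assumption, not the library's own code; `remove_duplicate_index_entries_sketch` is a hypothetical name.

```python
import pandas as pd


def remove_duplicate_index_entries_sketch(df: pd.DataFrame) -> pd.DataFrame:
    # Index.duplicated(keep="first") marks every repeat after the first occurrence;
    # negating the mask keeps only the first row for each index value.
    return df[~df.index.duplicated(keep="first")]


df = pd.DataFrame({"value": [10, 11, 12, 13]}, index=[1, 1, 2, 3])
deduplicated = remove_duplicate_index_entries_sketch(df)
```

Here the second row with index 1 (value 11) is dropped, while the rows with values 10, 12 and 13 remain.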
- query_data_frame(df: DataFrame, sql: str)
Queries the given data frame with the given condition specified in SQL syntax.
NOTE: Requires duckdb to be installed.
- Parameters:
df – the data frame to query
sql – an SQL query starting with the WHERE clause (excluding the ‘where’ keyword itself)
- Returns:
the filtered/transformed data frame
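One way such a condition-only query could be realised with duckdb is to prepend a `SELECT * FROM ... WHERE` prefix to the given condition. This is a sketch under that assumption and may differ from what sensAI actually does; `query_data_frame_sketch` is a hypothetical name and the virtual table name `df` is chosen arbitrarily.

```python
import pandas as pd


def query_data_frame_sketch(df: pd.DataFrame, sql: str) -> pd.DataFrame:
    # duckdb is imported lazily because it is an optional dependency.
    import duckdb

    # Register the data frame under the virtual table name "df" and
    # apply the caller's condition as the WHERE clause.
    relation = duckdb.query_df(df, "df", f"SELECT * FROM df WHERE {sql}")
    return relation.df()
```

With this sketch, a call such as `query_data_frame_sketch(frame, "a >= 2")` would return the rows of `frame` whose column `a` is at least 2.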
- class SeriesInterpolationLinearIndex(ffill: bool = False, bfill: bool = False)
Bases: SeriesInterpolation
- Parameters:
ffill – whether to fill any N/A values at the end of the series with the last valid observation
bfill – whether to fill any N/A values at the start of the series with the first valid observation
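The effect of the two flags can be illustrated with plain pandas. The class name suggests linear interpolation with respect to the index values; that reading, the function name `interpolate_linear_index_sketch`, and the use of `Series.interpolate(method="index")` are assumptions of this sketch rather than a description of the class internals.

```python
import numpy as np
import pandas as pd


def interpolate_linear_index_sketch(series: pd.Series, ffill: bool = False,
                                    bfill: bool = False) -> pd.Series:
    # Interpolate linearly in the index values, touching only gaps that are
    # surrounded by valid observations (limit_area="inside").
    result = series.interpolate(method="index", limit_area="inside")
    if ffill:
        # Fill trailing N/A values with the last valid observation.
        result = result.ffill()
    if bfill:
        # Fill leading N/A values with the first valid observation.
        result = result.bfill()
    return result


s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan], index=[0.0, 1.0, 3.0, 4.0, 6.0])
inner_only = interpolate_linear_index_sketch(s)
filled = interpolate_linear_index_sketch(s, ffill=True, bfill=True)
```

In `inner_only`, the gap at index 3.0 is filled relative to its neighbours at 1.0 and 4.0, while the first and last entries stay N/A; in `filled`, they take the values 1.0 and 3.0 respectively.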
- class SeriesInterpolationRepeatPreceding(bfill: bool = False)
Bases: SeriesInterpolation
- Parameters:
bfill – whether to fill any N/A values at the start of the series with the first valid observation
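Judging by the class name, gaps are filled by repeating the preceding valid value, which leaves N/A values at the start untouched unless bfill is set. The following pandas-only sketch illustrates that reading; it is an assumption, not the library's code, and `repeat_preceding_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def repeat_preceding_sketch(series: pd.Series, bfill: bool = False) -> pd.Series:
    # Repeat the last valid observation forward into each gap.
    result = series.ffill()
    if bfill:
        # Leading N/A values have no predecessor; fill them with the first valid value.
        result = result.bfill()
    return result


s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])
forward_only = repeat_preceding_sketch(s)
with_bfill = repeat_preceding_sketch(s, bfill=True)
```

Without bfill the leading entry remains N/A; with bfill it takes the first valid observation, 1.0.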
- average_series(series_list: List[Series], interpolation: SeriesInterpolation) → Series