graphtoolbox.data.dataset

Classes

DataClass(path_train, path_test, folder_config)

DataClass handles the loading, preprocessing, and temporal segmentation of graph-based datasets used for machine learning and graph neural networks.

GraphDataset(data, period[, scalers_feat, ...])

GraphDataset organizes time-dependent node features and targets into graph-structured tensors compatible with PyTorch Geometric.

class graphtoolbox.data.dataset.DataClass(path_train: str, path_test: str, folder_config: str, data_kwargs: Dict | None = None, **kwargs)

Bases: object

DataClass handles the loading, preprocessing, and temporal segmentation of graph-based datasets used for machine learning and graph neural networks.

This class automates several common data preparation steps for graph-based time series:
  • Reading training and test datasets (CSV or Parquet format).

  • Creating lagged versions of numerical features.

  • Splitting data into train, validation, and test sets based on time boundaries.

  • Encoding categorical variables into dummy features.

  • Ensuring consistent node indexing across splits.

df_train_original

Original training DataFrame as loaded from disk.

Type:

pandas.DataFrame

df_test_original

Original test DataFrame as loaded from disk.

Type:

pandas.DataFrame

df_train, df_val, df_test

Preprocessed train/validation/test sets ready for model input.

Type:

pandas.DataFrame

node_var

Column name identifying graph nodes.

Type:

str

nodes

Sorted array of unique node identifiers.

Type:

numpy.ndarray

features_to_lag

Dictionary describing temporal lags to apply on selected features. Format: {'feature': (min_lag, max_lag)}.

Type:

dict or None
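The lag format can be illustrated with a short pandas sketch. This is not the library's implementation: the helper name add_lags, the column names node_id and date, and the rule that an interval is invalid when it starts below 1 or is reversed are all assumptions made for the example. Only the <feature>_l<k> naming comes from the documentation.

```python
import pandas as pd


def add_lags(df, features_to_lag, node_var="node_id", date_col="date"):
    """Return a copy of df with <feature>_l<k> columns for k in [min_lag, max_lag]."""
    out = df.sort_values([node_var, date_col]).copy()
    for feature, (min_lag, max_lag) in features_to_lag.items():
        # Assumption: an interval is invalid if it starts below 1 or is reversed.
        if min_lag < 1 or max_lag < min_lag:
            raise ValueError(
                f"invalid lag interval for {feature!r}: ({min_lag}, {max_lag})"
            )
        for k in range(min_lag, max_lag + 1):
            # Shift within each node so values never leak from one node to another.
            out[f"{feature}_l{k}"] = out.groupby(node_var)[feature].shift(k)
    return out


toy = pd.DataFrame({
    "node_id": ["a", "a", "a", "b", "b", "b"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "temperature": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
lagged = add_lags(toy, {"temperature": (1, 2)})
print(lagged)
```

The first rows of each node are NaN in the lagged columns because no earlier observation exists for that node.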

dummies

Mapping of categorical features to be one-hot encoded.

Type:

dict or None
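One-hot encoding across splits can be sketched with pandas.get_dummies. The structure of the dummies mapping is not documented here, so the example encodes a single hypothetical column, day_type, and aligns the test columns to the training columns so both splits share the same layout. The column names and the alignment step are assumptions.

```python
import pandas as pd

train = pd.DataFrame({"node_id": ["a", "b"], "day_type": ["weekday", "weekend"]})
test = pd.DataFrame({"node_id": ["a", "b"], "day_type": ["weekday", "weekday"]})

# Encode each split independently.
train_enc = pd.get_dummies(train, columns=["day_type"], dtype=int)
test_enc = pd.get_dummies(test, columns=["day_type"], dtype=int)

# Categories absent from the test split would otherwise produce missing columns,
# so reindex on the training layout and fill the gaps with zeros.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```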

day_inf_train, day_sup_train, day_inf_val, day_sup_val, day_inf_test, day_sup_test

Date boundaries for temporal splits.

Type:

str
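A time-based split driven by such boundaries might look like the following sketch. Whether the library treats each boundary as inclusive or exclusive is not stated here; the half-open interval [day_inf, day_sup) and the helper name split_by_date are assumptions chosen so that adjacent splits cannot overlap.

```python
import pandas as pd


def split_by_date(df, day_inf, day_sup, date_col="date"):
    """Keep rows with day_inf <= date < day_sup (half-open interval, an assumption)."""
    dates = pd.to_datetime(df[date_col])
    mask = (dates >= pd.Timestamp(day_inf)) & (dates < pd.Timestamp(day_sup))
    return df.loc[mask].copy()


toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "load": range(10),
})
df_train = split_by_date(toy, "2024-01-01", "2024-01-07")
df_val = split_by_date(toy, "2024-01-07", "2024-01-09")
df_test = split_by_date(toy, "2024-01-09", "2024-01-11")
print(len(df_train), len(df_val), len(df_test))
```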

folder_config

Path to the configuration folder containing data preprocessing parameters.

Type:

str

data_kwargs

Loaded data-related configuration options.

Type:

dict

Parameters:
  • path_train (str) – Path to the training dataset (CSV or Parquet file).

  • path_test (str) – Path to the test dataset (CSV or Parquet file).

  • folder_config (str) – Path to the folder containing configuration files (used by load_kwargs).

  • data_kwargs (dict, optional) – Custom dictionary of preprocessing arguments. If not provided, it is loaded from the configuration folder.

  • col0 (bool, optional) – Whether to treat the first column as the index column. Default is False.

  • csv (bool, optional) – Whether the files are CSVs (if False, Parquet is assumed). Default is True.

  • node_var (str, optional) – Name of the column identifying nodes. If not provided, retrieved from data_kwargs.

  • features_to_lag (dict, optional) – Temporal lags to compute, e.g. {'temperature': (1, 3)} to add columns temperature_l1, temperature_l2, temperature_l3.

  • get_dummies (bool, optional) – Whether to apply one-hot encoding on categorical variables. Default is True.

  • **kwargs – Additional keyword arguments passed to internal preprocessing utilities.

Raises:
  • AssertionError – If mandatory columns ('date', node variable, lagged features) are missing.

  • ValueError – If invalid lag intervals are specified.

Examples

>>> from graphtoolbox.data.dataset import DataClass
>>> data = DataClass(
...     path_train="data/train.csv",
...     path_test="data/test.csv",
...     folder_config="config/",
... )
>>> data.df_train.shape
(12000, 45)
>>> list(data.df_train.columns[:5])
['node_id', 'date', 'feature1', 'feature1_l1', 'feature1_l2']

class graphtoolbox.data.dataset.GraphDataset(data, period: str, scalers_feat=None, scalers_target=None, dataset_kwargs: Dict | None = None, out_channels: int = 1, **kwargs)

Bases: object

GraphDataset organizes time-dependent node features and targets into graph-structured tensors compatible with PyTorch Geometric.

This class acts as the bridge between tabular time series data and graph neural network inputs. It handles:
  • feature and target extraction from the preprocessed DataClass object,

  • normalization per node using train-based MinMax scaling,

  • construction of temporal tensors (node × time × features),

  • association with graph topology (edge_index, edge_weight),

  • packaging of graph snapshots as torch_geometric.data.Data objects.
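The idea behind train-based, per-node MinMax scaling can be sketched in plain NumPy. This is an illustration only: the class is documented as storing fitted scalers in per-node dictionaries, whereas the hypothetical helpers below (fit_minmax_per_node, apply_minmax) keep the statistics in arrays. What matters is that the minimum and range come from the training split and are reused unchanged on later splits.

```python
import numpy as np


def fit_minmax_per_node(X_train):
    """Return per-node (min, span) arrays of shape [num_nodes, 1, num_features]."""
    lo = np.nanmin(X_train, axis=1, keepdims=True)
    hi = np.nanmax(X_train, axis=1, keepdims=True)
    # Guard against constant features, which would otherwise divide by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    return lo, span


def apply_minmax(X, lo, span):
    """Scale X with statistics fitted elsewhere (typically on the training split)."""
    return (X - lo) / span


# Toy arrays laid out as [num_nodes, T, num_features].
X_train = np.array([[[0.0], [5.0], [10.0]],
                    [[100.0], [150.0], [200.0]]])
X_val = np.array([[[20.0]],
                  [[150.0]]])

lo, span = fit_minmax_per_node(X_train)
X_train_scaled = apply_minmax(X_train, lo, span)
X_val_scaled = apply_minmax(X_val, lo, span)
print(X_val_scaled.ravel())
```

Validation values can fall outside [0, 1] when they exceed the range seen during training, as the first node shows here.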

Parameters:
  • data (DataClass) – Preprocessed data container including train/val/test splits.

  • period (str) – Dataset split to use, one of {'train', 'val', 'test'}.

  • scalers_feat (dict, optional) – Dictionary of fitted feature scalers per node (from training phase). Required for validation and test datasets.

  • scalers_target (dict, optional) – Dictionary of fitted target scalers per node (from training phase).

  • dataset_kwargs (dict, optional) – Dataset-level configuration (loaded via load_kwargs if not provided). Must include keys like 'features_base' and 'target_base'.

  • out_channels (int, default 1) – Number of temporal steps grouped per graph sample (sliding window width).

  • **kwargs – Additional options such as:

      - graph_folder (str): path to saved adjacency matrices.
      - adj_matrix (str): graph construction algorithm (default: 'space').
      - get_dummies (bool): whether to expand categorical dummy variables.

dataframe

Subset of data corresponding to the specified period.

Type:

pandas.DataFrame

features_base

List of input feature column names.

Type:

list of str

feature_groups

Optional mapping of feature groups for grouped GNN inputs.

Type:

dict or None

target_base

Name of the prediction target column.

Type:

str

X_scaled

Normalized feature tensor of shape [num_nodes, T, num_features].

Type:

torch.Tensor
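The [num_nodes, T, num_features] layout can be illustrated by stacking a long-format DataFrame with pandas and NumPy. The helper to_node_time_feature and the column names are hypothetical, and the sketch assumes at most one row per node and date. Dates missing for a node become NaN rows.

```python
import numpy as np
import pandas as pd


def to_node_time_feature(df, nodes, features, node_var="node_id", date_col="date"):
    """Stack a long DataFrame into an array of shape [num_nodes, T, num_features]."""
    dates = pd.Index(df[date_col].unique()).sort_values()
    blocks = []
    for node in nodes:
        # Align every node on the same date axis; absent dates turn into NaN rows.
        per_node = df[df[node_var] == node].set_index(date_col).reindex(dates)
        blocks.append(per_node[features].to_numpy(dtype=float))
    return np.stack(blocks)


toy = pd.DataFrame({
    "node_id": ["a", "a", "a", "b", "b"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03",
                            "2024-01-01", "2024-01-03"]),
    "temperature": [1.0, 2.0, 3.0, 10.0, 30.0],
    "wind": [0.1, 0.2, 0.3, 1.0, 3.0],
})
nodes = np.sort(toy["node_id"].unique())
X = to_node_time_feature(toy, nodes, ["temperature", "wind"])
print(X.shape)
```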

Y_scaled

Normalized target tensor of shape [num_nodes, T, 1].

Type:

torch.Tensor

mask_X, mask_Y

Boolean masks indicating valid (non-NaN) temporal positions.

Type:

torch.BoolTensor
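A validity mask of this kind can be derived directly from the NaN pattern. The sketch below uses NumPy and assumes that a time step is valid only when every feature is present, which yields a mask of shape [num_nodes, T]. The shape and rule actually used by the class are not documented here.

```python
import numpy as np

# Toy feature array laid out as [num_nodes, T, num_features].
X = np.array([[[1.0, 0.1], [2.0, np.nan], [3.0, 0.3]],
              [[np.nan, np.nan], [20.0, 2.0], [30.0, 3.0]]])

# Assumption: a time step counts as valid only if every feature is present.
mask_X = ~np.isnan(X).any(axis=-1)

# Replace NaNs so downstream arithmetic stays finite; the mask records where they were.
X_filled = np.nan_to_num(X, nan=0.0)
print(mask_X)
```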

edge_index

Graph connectivity in COO format for PyTorch Geometric.

Type:

torch.LongTensor

edge_weight

Edge weights (typically similarities).

Type:

torch.FloatTensor
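COO format stores one column per edge: the first row of edge_index holds source nodes and the second holds target nodes, with edge_weight aligned to the same columns. The NumPy sketch below derives both from a small dense adjacency matrix. The matrix values are invented, and how the class loads its adjacency matrices is not shown here.

```python
import numpy as np

# Hypothetical symmetric similarity matrix for three nodes.
adj = np.array([[0.0, 0.8, 0.0],
                [0.8, 0.0, 0.3],
                [0.0, 0.3, 0.0]])

# Every nonzero entry (i, j) becomes one directed edge i -> j.
rows, cols = np.nonzero(adj)
edge_index = np.stack([rows, cols]).astype(np.int64)   # shape [2, E]
edge_weight = adj[rows, cols].astype(np.float32)       # shape [E]
print(edge_index)
print(edge_weight)
```

Arrays with these dtypes map to torch.LongTensor and torch.FloatTensor when converted with torch.from_numpy.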

pyg_data

List of graph snapshots ready for batching or iteration.

Type:

list[torch_geometric.data.Data]

num_nodes

Number of graph nodes.

Type:

int

num_node_features

Number of input features per node.

Type:

int

Raises:
  • AssertionError – If expected columns or scalers are missing.

  • FileNotFoundError – If the adjacency matrix file is missing.

Examples

>>> from graphtoolbox.data.dataset import GraphDataset
>>> dataset = GraphDataset(data=data, period='train', out_channels=3)
>>> len(dataset)
120  # number of temporal graph snapshots
>>> sample = dataset[0]
>>> sample.x.shape, sample.y.shape
(torch.Size([N, F]), torch.Size([N, 3]))
>>> sample.edge_index.shape
torch.Size([2, E])