graphtoolbox.data.dataset

Classes

DataClass(path_train, path_test, folder_config)

DataClass handles the loading, preprocessing, and temporal segmentation of graph-based datasets used for machine learning and graph neural networks.

GraphBuilder(graph_dataset_train, ...)

Constructs graph representations from tabular or temporal datasets, combining feature reduction and graph construction algorithms.

GraphDataset(data, period[, scalers_feat, ...])

GraphDataset organizes time-dependent node features and targets into graph-structured tensors compatible with PyTorch Geometric.

class graphtoolbox.data.dataset.DataClass(path_train: str, path_test: str, folder_config: str, data_kwargs: Dict | None = None, **kwargs)

Bases: object

DataClass handles the loading, preprocessing, and temporal segmentation of graph-based datasets used for machine learning and graph neural networks.

This class automates several common data preparation steps for graph-based time series:

  • Reading training and test datasets (CSV or Parquet format).

  • Creating lagged versions of numerical features.

  • Splitting data into train, validation, and test sets based on time boundaries.

  • Encoding categorical variables into dummy features.

  • Ensuring consistent node indexing across splits.
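The lagging step can be illustrated with a minimal pandas sketch. This is not the library's implementation: the helper name add_lags, the toy column names, and the choice to shift each feature within its own node are assumptions made for illustration. Only the {'feature': (min_lag, max_lag)} format and the _l<k> column suffix come from the documentation on this page.

```python
import pandas as pd

def add_lags(df, node_var, features_to_lag):
    """Append <feature>_l<k> columns, shifting each feature within its own node.

    Hypothetical helper: mirrors the documented features_to_lag format only.
    """
    out = df.sort_values([node_var, "date"]).copy()
    for feature, (min_lag, max_lag) in features_to_lag.items():
        if min_lag < 1 or max_lag < min_lag:
            raise ValueError(
                f"invalid lag interval for {feature!r}: ({min_lag}, {max_lag})"
            )
        for k in range(min_lag, max_lag + 1):
            # groupby(...).shift(k) keeps one node's history from leaking into another
            out[f"{feature}_l{k}"] = out.groupby(node_var)[feature].shift(k)
    return out

toy = pd.DataFrame({
    "node_id": ["a", "a", "a", "b", "b", "b"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "temperature": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
lagged = add_lags(toy, "node_id", {"temperature": (1, 2)})
```

The first rows of each node hold NaN in the lagged columns, because no earlier observation exists for that node.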

df_train_original

Original training DataFrame as loaded from disk.

Type:

pandas.DataFrame

df_test_original

Original test DataFrame as loaded from disk.

Type:

pandas.DataFrame

df_train, df_val, df_test

Preprocessed train/validation/test sets ready for model input.

Type:

pandas.DataFrame

node_var

Column name identifying graph nodes.

Type:

str

nodes

Sorted array of unique node identifiers.

Type:

numpy.ndarray

features_to_lag

Dictionary describing temporal lags to apply on selected features. Format: {'feature': (min_lag, max_lag)}.

Type:

dict or None

dummies

Mapping of categorical features to be one-hot encoded.

Type:

dict or None

day_inf_train, day_sup_train, day_inf_val, day_sup_val, day_inf_test, day_sup_test

Date boundaries for temporal splits.

Type:

str
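
As a rough illustration of how such date boundaries could drive a temporal split, the sketch below filters a DataFrame on its 'date' column. The helper split_by_dates, the bounds dictionary, and the treatment of both boundaries as inclusive are assumptions; the page does not state whether DataClass includes or excludes the boundary days.

```python
import pandas as pd

def split_by_dates(df, bounds):
    """Return one DataFrame per split, using inclusive (day_inf, day_sup) bounds.

    Hypothetical helper; inclusive bounds are an assumption.
    """
    dates = pd.to_datetime(df["date"])
    parts = {}
    for name, (day_inf, day_sup) in bounds.items():
        mask = (dates >= pd.Timestamp(day_inf)) & (dates <= pd.Timestamp(day_sup))
        parts[name] = df.loc[mask].reset_index(drop=True)
    return parts

frame = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "load": range(10),
})
bounds = {
    "train": ("2024-01-01", "2024-01-06"),
    "val": ("2024-01-07", "2024-01-08"),
    "test": ("2024-01-09", "2024-01-10"),
}
splits = split_by_dates(frame, bounds)
```

Keeping the three intervals disjoint and ordered in time avoids leaking future observations into training.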

folder_config

Path to the configuration folder containing data preprocessing parameters.

Type:

str

data_kwargs

Loaded data-related configuration options.

Type:

dict

Parameters:
  • path_train (str) – Path to the training dataset (CSV or Parquet file).

  • path_test (str) – Path to the test dataset (CSV or Parquet file).

  • folder_config (str) – Path to the folder containing configuration files (used by load_kwargs).

  • data_kwargs (dict, optional) – Custom dictionary of preprocessing arguments. If not provided, it is loaded from the configuration folder.

  • col0 (bool, optional) – Whether to treat the first column as the index column. Default is False.

  • csv (bool, optional) – Whether the files are CSVs (if False, Parquet is assumed). Default is True.

  • node_var (str, optional) – Name of the column identifying nodes. If not provided, retrieved from data_kwargs.

  • features_to_lag (dict, optional) – Temporal lags to compute, e.g. {'temperature': (1, 3)} to add columns temperature_l1, temperature_l2, temperature_l3.

  • get_dummies (bool, optional) – Whether to apply one-hot encoding on categorical variables. Default is True.

  • **kwargs – Additional keyword arguments passed to internal preprocessing utilities.

Raises:
  • AssertionError – If mandatory columns ('date', node variable, lagged features) are missing.

  • ValueError – If invalid lag intervals are specified.

Examples

>>> data = DataClass(
...     path_train="data/train.csv",
...     path_test="data/test.csv",
...     folder_config="config/",
... )
>>> data.df_train.shape
(12000, 45)
>>> list(data.df_train.columns[:5])
['node_id', 'date', 'feature1', 'feature1_l1', 'feature1_l2']

class graphtoolbox.data.dataset.GraphBuilder(graph_dataset_train, graph_dataset_val, graph_dataset_test, **kwargs)

Bases: object

Constructs graph representations from tabular or temporal datasets, combining feature reduction and graph construction algorithms.

This class provides a unified interface for transforming time series or feature datasets into adjacency matrices suitable for graph neural networks. It can:

  • Reduce temporal or feature signals (e.g., via SVD or RESITER)

  • Build graphs based on spatial distance, correlation, precision matrices, GL-3SR, or dynamic time warping (DTW)

  • Optionally reuse previously computed signals or graphs from disk
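
To make the correlation-based option more concrete, here is a minimal NumPy sketch that turns per-node time series into a thresholded adjacency matrix. It is an illustration under stated assumptions, not GraphBuilder's code: the function correlation_graph, the use of absolute Pearson correlation, and the meaning of threshold are all hypothetical.

```python
import numpy as np

def correlation_graph(signal, threshold=0.5):
    """Build a symmetric (N, N) weight matrix from a (num_nodes, T) signal.

    Hypothetical helper: weights are absolute Pearson correlations,
    zeroed below `threshold`, with no self-loops.
    """
    corr = np.corrcoef(signal)      # pairwise correlation between node series
    W = np.abs(corr)
    W[W < threshold] = 0.0          # sparsify weak links
    np.fill_diagonal(W, 0.0)        # drop self-loops
    return W

t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
signal = np.stack([
    np.sin(t),              # node 0
    2.0 * np.sin(t) + 1.0,  # node 1: perfectly correlated with node 0
    np.cos(t),              # node 2: uncorrelated with node 0 over a full period
])
W = correlation_graph(signal, threshold=0.5)
```

Nodes 0 and 1 end up strongly connected, while node 2 is isolated because its correlation with the others falls below the threshold.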

Parameters:
  • graph_dataset_train (Dataset) – Dataset containing training graph data (with features, nodes, etc.).

  • graph_dataset_val (Dataset) – Dataset for validation.

  • graph_dataset_test (Dataset) – Dataset for testing.

  • model_vgae (object, optional) – Pre-trained VGAE (Variational Graph AutoEncoder) model to initialize the graph builder.

  • load_graph (bool, default=False) – If True, load a previously saved adjacency matrix instead of recomputing it.

  • load_signal (bool, default=False) – If True, load a pre-computed reduced signal representation from disk.

  • reduce_method (str, default='svd') – Method to reduce the signal before graph construction. Options are 'svd' or 'resiter'.

  • folder_config (str, optional) – Path to a configuration folder (used to load positional data and parameters via load_kwargs).

  • **kwargs – Additional keyword arguments (e.g., algorithm hyperparameters or model options).

model_vgae

VGAE model instance, if provided.

Type:

object or None

load_graph

Whether an existing graph should be loaded instead of generated.

Type:

bool

load_signal

Whether to reuse a pre-computed reduced signal.

Type:

bool

reduce_method

Signal reduction strategy used by reduce_signal().

Type:

str

folder_config

Folder path containing saved positional or configuration data.

Type:

str or None

df_pos

Positional data for nodes (longitude, latitude) loaded from configuration.

Type:

pandas.DataFrame or None

graph_dataset_train

Dataset used for training.

Type:

Dataset

graph_dataset_val

Dataset used for validation.

Type:

Dataset

graph_dataset_test

Dataset used for testing.

Type:

Dataset

dataframe

The raw DataFrame from the training dataset.

Type:

pandas.DataFrame

data

The training dataset’s data container.

Type:

DataFrame-like

Notes

Unless load_graph=True, build_graph() calls reduce_signal() before constructing the adjacency matrix; with load_graph=True, a previously saved matrix is loaded instead. The resulting graph can be fed into GNNs (e.g., GCN, GraphSAGE).

Examples

>>> gb = GraphBuilder(train_set, val_set, test_set, reduce_method='svd')
>>> W = gb.build_graph(algo='space', threshold=0.1)
>>> W.shape
torch.Size([N, N])

build_graph(algo, **kwargs)

Build or load an adjacency matrix using a specified graph construction algorithm.

Parameters:
  • algo (str) – Graph construction method. Options: 'space', 'correlation', 'precision', 'gl3sr', or 'dtw'.

  • **kwargs (dict) – Algorithm-specific hyperparameters (e.g., threshold, alpha, beta).

Returns:

Adjacency matrix of shape (N, N).

Return type:

torch.Tensor

Raises:

NotImplementedError – If the specified algorithm is not supported.
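
For intuition about the 'space' option, the sketch below builds a distance-based adjacency matrix from node longitudes and latitudes. Everything here is an assumption made for illustration: the helpers haversine_km and space_graph, the Gaussian kernel, the median-distance bandwidth, and the interpretation of threshold as a cutoff on kernel weights. The library may compute spatial graphs differently.

```python
import numpy as np

def haversine_km(lon, lat):
    """Pairwise great-circle distances in km for 1-D arrays of degrees."""
    lon, lat = np.radians(lon), np.radians(lat)
    dlon = lon[:, None] - lon[None, :]
    dlat = lat[:, None] - lat[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def space_graph(lon, lat, threshold=0.1):
    """Gaussian-kernel weights on distance, zeroed below `threshold` (assumed semantics)."""
    d = haversine_km(np.asarray(lon, dtype=float), np.asarray(lat, dtype=float))
    sigma = np.median(d[d > 0])               # bandwidth: median non-zero distance
    W = np.exp(-(d ** 2) / (2 * sigma ** 2))
    W[W < threshold] = 0.0
    np.fill_diagonal(W, 0.0)                  # no self-loops
    return W

# Two nearby points around Paris and one near Marseille
lon = [2.35, 2.40, 5.37]
lat = [48.85, 48.90, 43.30]
W_space = space_graph(lon, lat, threshold=0.1)
```

Nearby nodes receive weights close to 1, and the weight decays smoothly as the distance grows.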

reduce_signal(**kwargs)

Compute or load a reduced signal representation from the dataset.

Parameters:

**kwargs (dict) – Method-specific parameters (e.g., k_max, model_base, num_epochs).

Returns:

Reduced feature matrix of shape (num_nodes, num_features).

Return type:

np.ndarray
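
A truncated SVD is one common way to obtain such a reduced representation. The following sketch is hypothetical: the function reduce_svd, the per-node centering, and the reading of k_max as the number of retained components are assumptions, not documented behavior of reduce_signal().

```python
import numpy as np

def reduce_svd(signal, k_max=2):
    """Project a (num_nodes, T) signal onto its leading singular directions.

    Hypothetical helper; returns an array of shape (num_nodes, k).
    """
    centered = signal - signal.mean(axis=1, keepdims=True)  # remove each node's mean
    U, s, _ = np.linalg.svd(centered, full_matrices=False)
    k = min(k_max, len(s))
    return U[:, :k] * s[:k]   # node coordinates scaled by singular values

rng = np.random.default_rng(0)
raw_signal = rng.normal(size=(5, 50))   # 5 nodes, 50 time steps
reduced = reduce_svd(raw_signal, k_max=2)
```

Each row of the result summarizes one node's time series in k numbers, which can then be compared across nodes to build a graph.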

class graphtoolbox.data.dataset.GraphDataset(data, period: str, scalers_feat=None, scalers_target=None, dataset_kwargs: Dict | None = None, out_channels: int = 1, **kwargs)

Bases: object

GraphDataset organizes time-dependent node features and targets into graph-structured tensors compatible with PyTorch Geometric.

This class acts as the bridge between tabular time series data and graph neural network inputs. It handles:

  • feature and target extraction from the preprocessed DataClass object,

  • normalization per node using train-based MinMax scaling,

  • construction of temporal tensors (node × time × features),

  • association with graph topology (edge_index, edge_weight),

  • packaging of graph snapshots as torch_geometric.data.Data objects.
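
The train-based, per-node MinMax scaling can be sketched in NumPy as follows. The helpers fit_minmax and apply_minmax are hypothetical and only illustrate the idea of fitting statistics on the training tensor and reusing them for validation or test data; the library itself stores fitted scalers in scalers_feat and scalers_target.

```python
import numpy as np

def fit_minmax(X_train):
    """Fit per-node, per-feature min and range on a (num_nodes, T, F) tensor."""
    lo = np.nanmin(X_train, axis=1, keepdims=True)
    hi = np.nanmax(X_train, axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero for constant series
    return lo, span

def apply_minmax(X, lo, span):
    """Scale any split with statistics fitted on the training split."""
    return (X - lo) / span

X_train = np.arange(24, dtype=float).reshape(2, 4, 3)   # 2 nodes, 4 steps, 3 features
X_val = X_train + 3.0                                    # later, slightly larger values
lo, span = fit_minmax(X_train)
X_train_scaled = apply_minmax(X_train, lo, span)
X_val_scaled = apply_minmax(X_val, lo, span)
```

Because the validation data is scaled with training statistics, its values may fall outside [0, 1]; that is expected and avoids leaking information from later periods.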

Parameters:
  • data (DataClass) – Preprocessed data container including train/val/test splits.

  • period (str) – Dataset split to use, one of {'train', 'val', 'test'}.

  • scalers_feat (dict, optional) – Dictionary of fitted feature scalers per node (from training phase). Required for validation and test datasets.

  • scalers_target (dict, optional) – Dictionary of fitted target scalers per node (from training phase).

  • dataset_kwargs (dict, optional) – Dataset-level configuration (loaded via load_kwargs if not provided). Must include keys like 'features_base' and 'target_base'.

  • out_channels (int, default 1) – Number of temporal steps grouped per graph sample (sliding window width).

  • **kwargs – Additional options such as:

      - graph_folder (str): path to saved adjacency matrices.

      - adj_matrix (str): graph construction algorithm (default: 'space').

      - get_dummies (bool): whether to expand categorical dummy variables.
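
One way to picture the role of out_channels is a sliding window over the time axis, where each sample pairs the features at one step with the next out_channels target values. The sketch below is an assumption about the windowing (stride of 1, window starting at the feature step); the helper make_snapshots is hypothetical and returns plain NumPy pairs rather than torch_geometric.data.Data objects.

```python
import numpy as np

def make_snapshots(X, Y, out_channels=1):
    """Slice (N, T, F) features and (N, T, 1) targets into per-step samples.

    Hypothetical helper: returns a list of (x, y) with
    x of shape (N, F) and y of shape (N, out_channels).
    """
    T = X.shape[1]
    snapshots = []
    for t in range(T - out_channels + 1):
        x = X[:, t, :]                       # features at step t
        y = Y[:, t:t + out_channels, 0]      # targets for steps t .. t+out_channels-1
        snapshots.append((x, y))
    return snapshots

N, T, F = 4, 10, 2
X = np.zeros((N, T, F))
Y = np.arange(N * T, dtype=float).reshape(N, T, 1)
snapshots = make_snapshots(X, Y, out_channels=3)
```

With T time steps and a window of width out_channels, this scheme yields T - out_channels + 1 samples.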

dataframe

Subset of data corresponding to the specified period.

Type:

pandas.DataFrame

features_base

List of input feature column names.

Type:

list of str

feature_groups

Optional mapping of feature groups for grouped GNN inputs.

Type:

dict or None

target_base

Name of the prediction target column.

Type:

str

X_scaled

Normalized feature tensor of shape [num_nodes, T, num_features].

Type:

torch.Tensor

Y_scaled

Normalized target tensor of shape [num_nodes, T, 1].

Type:

torch.Tensor

mask_X, mask_Y

Boolean masks indicating valid (non-NaN) temporal positions.

Type:

torch.BoolTensor

edge_index

Graph connectivity in COO format for PyTorch Geometric.

Type:

torch.LongTensor

edge_weight

Edge weights (typically similarities).

Type:

torch.FloatTensor

pyg_data

List of graph snapshots ready for batching or iteration.

Type:

list[torch_geometric.data.Data]

num_nodes

Number of graph nodes.

Type:

int

num_node_features

Number of input features per node.

Type:

int

Raises:
  • AssertionError – If expected columns or scalers are missing.

  • FileNotFoundError – If the adjacency matrix file is missing.

Examples

>>> dataset = GraphDataset(data=data, period='train', out_channels=3)
>>> len(dataset)
120  # number of temporal graph snapshots
>>> sample = dataset[0]
>>> sample.x.shape, sample.y.shape
(torch.Size([N, F]), torch.Size([N, 3]))
>>> sample.edge_index.shape
torch.Size([2, E])
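
The relationship between a dense adjacency matrix and the edge_index / edge_weight pair can be sketched in NumPy. The helper dense_to_coo is hypothetical and shown only to clarify the COO layout; in a PyTorch Geometric workflow, torch_geometric.utils.dense_to_sparse performs a similar conversion on tensors.

```python
import numpy as np

def dense_to_coo(W):
    """Convert a dense (N, N) weight matrix into COO arrays.

    Returns edge_index of shape (2, E) and edge_weight of shape (E,).
    """
    rows, cols = np.nonzero(W)                             # positions of non-zero weights
    edge_index = np.stack([rows, cols]).astype(np.int64)   # row 0: sources, row 1: targets
    edge_weight = W[rows, cols].astype(np.float32)
    return edge_index, edge_weight

W_dense = np.array([
    [0.0, 0.8, 0.0],
    [0.8, 0.0, 0.3],
    [0.0, 0.3, 0.0],
])
edge_index, edge_weight = dense_to_coo(W_dense)
```

An undirected edge appears twice in edge_index, once per direction, so a symmetric matrix with two connected pairs yields E = 4.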