graphtoolbox.data.dataset

Classes

`DataClass`(path_train, path_test, folder_config)	DataClass handles the loading, preprocessing, and temporal segmentation of graph-based datasets used for machine learning and graph neural networks.
`GraphBuilder`(graph_dataset_train, ...)	Constructs graph representations from tabular or temporal datasets, combining feature reduction and graph construction algorithms.
`GraphDataset`(data, period[, scalers_feat, ...])	GraphDataset organizes time-dependent node features and targets into graph-structured tensors compatible with PyTorch Geometric.

class graphtoolbox.data.dataset.DataClass(path_train: str, path_test: str, folder_config: str, data_kwargs: Dict | None = None, **kwargs)

Bases: object

DataClass handles the loading, preprocessing, and temporal segmentation of graph-based datasets used for machine learning and graph neural networks.

This class automates several common data preparation steps for graph-based time series:

- Reading training and test datasets (CSV or Parquet format).
- Creating lagged versions of numerical features.
- Splitting data into train, validation, and test sets based on time boundaries.
- Encoding categorical variables into dummy features.
- Ensuring consistent node indexing across splits.
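The preparation steps listed above can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: the helper names `add_lags` and `split_by_date`, the toy column names, and the inclusive date bounds are assumptions. Only the `{'feature': (min_lag, max_lag)}` format and the `_l<k>` column suffix come from this page.

```python
import pandas as pd

def add_lags(df, features_to_lag, node_var="node_id", date_col="date"):
    """Add lagged copies of selected columns, computed separately per node.

    Hypothetical helper: features_to_lag maps a column name to an inclusive
    (min_lag, max_lag) range, e.g. {"temperature": (1, 3)}.
    """
    out = df.sort_values([node_var, date_col]).copy()
    for feature, (min_lag, max_lag) in features_to_lag.items():
        if min_lag < 1 or max_lag < min_lag:
            raise ValueError(f"invalid lag interval for {feature!r}")
        for lag in range(min_lag, max_lag + 1):
            # Shift within each node so values never leak across nodes.
            out[f"{feature}_l{lag}"] = out.groupby(node_var)[feature].shift(lag)
    return out

def split_by_date(df, day_inf, day_sup, date_col="date"):
    """Keep rows whose date lies in [day_inf, day_sup] (bounds assumed inclusive)."""
    dates = pd.to_datetime(df[date_col])
    mask = (dates >= pd.Timestamp(day_inf)) & (dates <= pd.Timestamp(day_sup))
    return df.loc[mask]

# Toy data: two nodes observed on four consecutive days.
toy = pd.DataFrame({
    "node_id": ["a"] * 4 + ["b"] * 4,
    "date": list(pd.date_range("2024-01-01", periods=4)) * 2,
    "temperature": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    "season": ["winter"] * 8,
})

lagged = add_lags(toy, {"temperature": (1, 2)})
train = split_by_date(lagged, "2024-01-01", "2024-01-03")
encoded = pd.get_dummies(train, columns=["season"])  # one-hot encode categoricals
nodes = sorted(lagged["node_id"].unique())           # consistent node ordering
```

Shifting inside a `groupby` keeps each node's history separate, so the first rows of every node hold missing values for the lagged columns.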

df_train_original

Original training DataFrame as loaded from disk.

Type: pandas.DataFrame

df_test_original

Original test DataFrame as loaded from disk.

Type: pandas.DataFrame

df_train, df_val, df_test

Preprocessed train/validation/test sets ready for model input.

Type: pandas.DataFrame

node_var

Column name identifying graph nodes.

Type: str

nodes

Sorted array of unique node identifiers.

Type: numpy.ndarray

features_to_lag

Dictionary describing temporal lags to apply on selected features. Format: {'feature': (min_lag, max_lag)}.

Type: dict or None

dummies

Mapping of categorical features to be one-hot encoded.

Type: dict or None

day_inf_train, day_sup_train, day_inf_val, day_sup_val, day_inf_test, day_sup_test

Date boundaries for temporal splits.

Type: str

folder_config

Path to the configuration folder containing data preprocessing parameters.

Type: str

data_kwargs

Loaded data-related configuration options.

Type: dict

Parameters:

path_train (str) – Path to the training dataset (CSV or Parquet file).
path_test (str) – Path to the test dataset (CSV or Parquet file).
folder_config (str) – Path to the folder containing configuration files (used by load_kwargs).
data_kwargs (dict, optional) – Custom dictionary of preprocessing arguments. If not provided, it is loaded from the configuration folder.
col0 (bool, optional) – Whether to treat the first column as the index column. Default is False.
csv (bool, optional) – Whether the files are CSVs (if False, Parquet is assumed). Default is True.
node_var (str, optional) – Name of the column identifying nodes. If not provided, retrieved from data_kwargs.
features_to_lag (dict, optional) – Temporal lags to compute, e.g. {'temperature': (1, 3)} to add columns temperature_l1, temperature_l2, temperature_l3.
get_dummies (bool, optional) – Whether to apply one-hot encoding on categorical variables. Default is True.
**kwargs – Additional keyword arguments passed to internal preprocessing utilities.

Raises:

AssertionError – If mandatory columns (‘date’, node variable, lagged features) are missing.
ValueError – If invalid lag intervals are specified.

Examples

>>> from graphtoolbox.data.dataset import DataClass
>>> data = DataClass(
...     path_train="data/train.csv",
...     path_test="data/test.csv",
...     folder_config="config/",
... )
>>> data.df_train.shape
(12000, 45)
>>> list(data.df_train.columns[:5])
['node_id', 'date', 'feature1', 'feature1_l1', 'feature1_l2']

class graphtoolbox.data.dataset.GraphBuilder(graph_dataset_train, graph_dataset_val, graph_dataset_test, **kwargs)

Bases: object

Constructs graph representations from tabular or temporal datasets, combining feature reduction and graph construction algorithms.

This class provides a unified interface for transforming time series or feature datasets into adjacency matrices suitable for graph neural networks. It can:

- Reduce temporal or feature signals (e.g., via SVD or RESITER).
- Build graphs based on spatial distance, correlation, precision matrices, GL-3SR, or dynamic time warping (DTW).
- Optionally reuse previously computed signals or graphs from disk.
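As an illustration of one of the graph construction families named above, the following sketch builds a correlation-based adjacency matrix with NumPy. It is not the library's code: the function name `correlation_adjacency` and the thresholding rule (keep absolute correlations at or above `threshold`, zero the diagonal) are assumptions.

```python
import numpy as np

def correlation_adjacency(signal, threshold=0.5):
    """Build a symmetric (N, N) adjacency from per-node time series.

    signal: array of shape (num_nodes, T). Hypothetical helper; the rule
    "keep |corr| >= threshold, remove self-loops" is an assumption.
    """
    corr = np.corrcoef(signal)      # each row is one node's series
    adj = np.abs(corr)
    adj[adj < threshold] = 0.0
    np.fill_diagonal(adj, 0.0)      # no self-loops
    return adj

rng = np.random.default_rng(0)
base = rng.normal(size=200)
signal = np.stack([
    base,                                 # node 0
    base + 0.1 * rng.normal(size=200),    # node 1: nearly identical to node 0
    rng.normal(size=200),                 # node 2: independent noise
])
W = correlation_adjacency(signal, threshold=0.5)
```

Nodes 0 and 1 end up strongly connected, while the independent node 2 is isolated by the threshold.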

Parameters:

graph_dataset_train (Dataset) – Dataset containing training graph data (with features, nodes, etc.).
graph_dataset_val (Dataset) – Dataset for validation.
graph_dataset_test (Dataset) – Dataset for testing.
model_vgae (object, optional) – Pre-trained VGAE (Variational Graph AutoEncoder) model to initialize the graph builder.
load_graph (bool, default=False) – If True, load a previously saved adjacency matrix instead of recomputing it.
load_signal (bool, default=False) – If True, load a pre-computed reduced signal representation from disk.
reduce_method (str, default='svd') – Method to reduce the signal before graph construction. Options are 'svd' or 'resiter'.
folder_config (str, optional) – Path to a configuration folder (used to load positional data and parameters via load_kwargs).
**kwargs – Additional keyword arguments (e.g., algorithm hyperparameters or model options).

model_vgae

VGAE model instance, if provided.

Type: object or None

load_graph

Whether an existing graph should be loaded instead of generated.

Type: bool

load_signal

Whether to reuse a pre-computed reduced signal.

Type: bool

reduce_method

Signal reduction strategy used by reduce_signal().

Type: str

folder_config

Folder path containing saved positional or configuration data.

Type: str or None

df_pos

Positional data for nodes (longitude, latitude) loaded from configuration.

Type: pandas.DataFrame or None

graph_dataset_train

Dataset used for training.

Type: Dataset

graph_dataset_val

Dataset used for validation.

Type: Dataset

graph_dataset_test

Dataset used for testing.

Type: Dataset

dataframe

The raw DataFrame from the training dataset.

Type: pandas.DataFrame

data

The training dataset’s data container.

Type: DataFrame-like

Notes

Unless load_graph=True, build_graph() calls reduce_signal() before constructing the adjacency matrix. The resulting graph can be fed into GNNs (e.g., GCN, GraphSAGE).

Examples

>>> from graphtoolbox.data.dataset import GraphBuilder
>>> gb = GraphBuilder(train_set, val_set, test_set, reduce_method='svd')
>>> W = gb.build_graph(algo='space', threshold=0.1)
>>> W.shape
torch.Size([N, N])

build_graph(algo, **kwargs)

Build or load an adjacency matrix using a specified graph construction algorithm.

Parameters:

algo (str) – Graph construction method. Options: 'space', 'correlation', 'precision', 'gl3sr', or 'dtw'.
**kwargs (dict) – Algorithm-specific hyperparameters (e.g., threshold, alpha, beta).

Returns:

Adjacency matrix of shape (N, N).

Return type:

torch.Tensor

Raises:

NotImplementedError – If the specified algorithm is not supported.
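The dispatch behaviour described above (select a builder by name, raise NotImplementedError otherwise) might look like the sketch below. Everything here is hypothetical: the registry `BUILDERS`, the function names, and the Gaussian-kernel form of the 'space' graph with Euclidean distance are assumptions chosen for illustration, not the library's actual formulas.

```python
import numpy as np

def space_graph(positions, threshold=0.1, sigma=1.0):
    """Gaussian-kernel adjacency from node coordinates of shape (N, 2).

    Assumed form: W_ij = exp(-d_ij^2 / sigma^2), with weights below
    `threshold` set to zero. Euclidean distance is used for simplicity.
    """
    diff = positions[:, None, :] - positions[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    adj = np.exp(-dist2 / sigma ** 2)
    adj[adj < threshold] = 0.0
    np.fill_diagonal(adj, 0.0)
    return adj

# Hypothetical registry mapping `algo` names to builder functions.
BUILDERS = {"space": space_graph}

def build_graph_sketch(algo, data, **kwargs):
    """Dispatch on `algo`; unknown names raise NotImplementedError."""
    try:
        builder = BUILDERS[algo]
    except KeyError:
        raise NotImplementedError(f"unsupported algo: {algo!r}") from None
    return builder(data, **kwargs)

positions = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0]])
W = build_graph_sketch("space", positions, threshold=0.1, sigma=1.0)
```

A registry keeps the set of supported algorithms in one place, so adding a method means adding one entry rather than another branch.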

reduce_signal(**kwargs)

Compute or load a reduced signal representation from the dataset.

Parameters:

**kwargs (dict) – Method-specific parameters (e.g., k_max, model_base, num_epochs).

Returns:

Reduced feature matrix of shape (num_nodes, num_features).

Return type:

np.ndarray
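For intuition, an SVD-based reduction of a node-by-time signal can be sketched as follows. This is an assumption-laden illustration rather than the method's source: the function name `reduce_signal_svd`, the per-node centering, and the reading of `k_max` as the number of retained components are all hypothetical.

```python
import numpy as np

def reduce_signal_svd(signal, k_max=2):
    """Project each node's series onto the leading right singular vectors.

    signal: (num_nodes, T) array. Returns an array of shape (num_nodes, k)
    with k = min(k_max, number of singular values). Hypothetical helper.
    """
    centered = signal - signal.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    k = min(k_max, len(s))
    # U[:, :k] * s[:k] is the same matrix as centered @ Vt[:k].T
    return U[:, :k] * s[:k]

rng = np.random.default_rng(0)
signal = rng.normal(size=(5, 50))      # 5 nodes, 50 time steps
reduced = reduce_signal_svd(signal, k_max=2)
```

The result has one row per node, matching the (num_nodes, num_features) shape documented above.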

class graphtoolbox.data.dataset.GraphDataset(data, period: str, scalers_feat=None, scalers_target=None, dataset_kwargs: Dict | None = None, out_channels: int = 1, **kwargs)

Bases: object

GraphDataset organizes time-dependent node features and targets into graph-structured tensors compatible with PyTorch Geometric.

This class acts as the bridge between tabular time series data and graph neural network inputs. It handles:

- feature and target extraction from the preprocessed DataClass object,
- normalization per node using train-based MinMax scaling,
- construction of temporal tensors (node × time × features),
- association with graph topology (edge_index, edge_weight),
- packaging of graph snapshots as torch_geometric.data.Data objects.
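The train-based, per-node MinMax scaling mentioned above can be sketched on a (node × time × features) array. The helpers `fit_minmax` and `apply_minmax` are hypothetical and use NumPy instead of the library's scaler objects. The point they illustrate is that statistics are fitted on the training split only and then reused for the other splits.

```python
import numpy as np

def fit_minmax(X_train):
    """Per-node, per-feature min and range from a [num_nodes, T, F] array."""
    lo = np.nanmin(X_train, axis=1, keepdims=True)
    span = np.nanmax(X_train, axis=1, keepdims=True) - lo
    span[span == 0] = 1.0          # avoid division by zero on constant features
    return lo, span

def apply_minmax(X, lo, span):
    """Scale any split with statistics fitted on the training split."""
    return (X - lo) / span

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(3, 20, 2))   # 3 nodes, 20 steps, 2 features
X_val = rng.uniform(0, 12, size=(3, 5, 2))

lo, span = fit_minmax(X_train)
X_train_scaled = apply_minmax(X_train, lo, span)
X_val_scaled = apply_minmax(X_val, lo, span)     # may fall outside [0, 1]
mask_X = ~np.isnan(X_train_scaled)               # True where a value is present
```

Reusing the training statistics is why fitted scalers are required when building validation and test datasets.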

Parameters:

data (DataClass) – Preprocessed data container including train/val/test splits.
period (str) – Dataset split to use, one of {'train', 'val', 'test'}.
scalers_feat (dict, optional) – Dictionary of fitted feature scalers per node (from training phase). Required for validation and test datasets.
scalers_target (dict, optional) – Dictionary of fitted target scalers per node (from training phase).
dataset_kwargs (dict, optional) – Dataset-level configuration (loaded via load_kwargs if not provided). Must include keys like 'features_base' and 'target_base'.
out_channels (int, default=1) – Number of temporal steps grouped per graph sample (sliding window width).
**kwargs – Additional options such as:
- graph_folder (str): path to saved adjacency matrices.
- adj_matrix (str): graph construction algorithm (default: 'space').
- get_dummies (bool): whether to expand categorical dummy variables.
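To make the role of out_channels concrete, the sketch below groups time steps into samples whose targets span out_channels steps. The helper `make_snapshots`, the choice to pair the features at the window start with the following targets, and the default non-overlapping stride are assumptions made for illustration. The actual windowing may differ.

```python
import numpy as np

def make_snapshots(X, Y, out_channels=1, stride=None):
    """Group time steps into (x, y) samples.

    X: [num_nodes, T, F], Y: [num_nodes, T]. Each sample pairs the features
    at the window start with the next `out_channels` targets. Both this
    pairing and the default stride are assumptions.
    """
    stride = out_channels if stride is None else stride
    T = X.shape[1]
    samples = []
    for t in range(0, T - out_channels + 1, stride):
        samples.append((X[:, t, :], Y[:, t:t + out_channels]))
    return samples

X = np.zeros((4, 12, 5))                            # 4 nodes, 12 steps, 5 features
Y = np.arange(4 * 12, dtype=float).reshape(4, 12)   # distinct value per (node, step)
samples = make_snapshots(X, Y, out_channels=3)
x0, y0 = samples[0]
```

Each sample then has features of shape [N, F] and targets of shape [N, out_channels], the same layout as the shapes shown in the class example.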

dataframe

Subset of data corresponding to the specified period.

Type: pandas.DataFrame

features_base

List of input feature column names.

Type: list of str

feature_groups

Optional mapping of feature groups for grouped GNN inputs.

Type: dict or None

target_base

Name of the prediction target column.

Type: str

X_scaled

Normalized feature tensor of shape [num_nodes, T, num_features].

Type: torch.Tensor

Y_scaled

Normalized target tensor of shape [num_nodes, T, 1].

Type: torch.Tensor

mask_X, mask_Y

Boolean masks indicating valid (non-NaN) temporal positions.

Type: torch.BoolTensor

edge_index

Graph connectivity in COO format for PyTorch Geometric.

Type: torch.LongTensor

edge_weight

Edge weights (typically similarities).

Type: torch.FloatTensor

pyg_data

List of graph snapshots ready for batching or iteration.

Type: list[torch_geometric.data.Data]

num_nodes

Number of graph nodes.

Type: int

num_node_features

Number of input features per node.

Type: int
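The edge_index and edge_weight attributes store the graph in COO form. As a rough sketch of what that layout means, a dense adjacency matrix can be converted with NumPy as shown below. The helper `dense_to_coo` is hypothetical and is not part of this module.

```python
import numpy as np

def dense_to_coo(adj):
    """Convert a dense (N, N) adjacency into COO arrays.

    Returns edge_index of shape (2, E) and edge_weight of shape (E,),
    listing only the nonzero entries.
    """
    rows, cols = np.nonzero(adj)
    edge_index = np.stack([rows, cols]).astype(np.int64)
    edge_weight = adj[rows, cols].astype(np.float32)
    return edge_index, edge_weight

adj = np.array([
    [0.0, 0.8, 0.0],
    [0.8, 0.0, 0.3],
    [0.0, 0.3, 0.0],
])
edge_index, edge_weight = dense_to_coo(adj)
# With PyTorch installed, torch.from_numpy(edge_index) gives an int64 tensor
# suitable for the edge_index argument of torch_geometric.data.Data.
```

An undirected edge appears twice, once per direction, which is the usual convention for symmetric adjacency matrices in COO form.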

Raises:

AssertionError – If expected columns or scalers are missing.
FileNotFoundError – If the adjacency matrix file is missing.

Examples

>>> from graphtoolbox.data.dataset import GraphDataset
>>> dataset = GraphDataset(data=data, period='train', out_channels=3)
>>> len(dataset)  # number of temporal graph snapshots
120
>>> sample = dataset[0]
>>> sample.x.shape, sample.y.shape
(torch.Size([N, F]), torch.Size([N, 3]))
>>> sample.edge_index.shape
torch.Size([2, E])