graphtoolbox.data.dataset
Classes
DataClass | DataClass handles the loading, preprocessing, and temporal segmentation of graph-based datasets used for machine learning and graph neural networks.
GraphBuilder | Constructs graph representations from tabular or temporal datasets, combining feature reduction and graph construction algorithms.
GraphDataset | GraphDataset organizes time-dependent node features and targets into graph-structured tensors compatible with PyTorch Geometric.
- class graphtoolbox.data.dataset.DataClass(path_train: str, path_test: str, folder_config: str, data_kwargs: Dict | None = None, **kwargs)
Bases:
object
DataClass handles the loading, preprocessing, and temporal segmentation of graph-based datasets used for machine learning and graph neural networks.
This class automates several common data preparation steps for graph-based time series:
- Reading training and test datasets (CSV or Parquet format).
- Creating lagged versions of numerical features.
- Splitting data into train, validation, and test sets based on time boundaries.
- Encoding categorical variables into dummy features.
- Ensuring consistent node indexing across splits.
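The lagging and time-based splitting steps can be illustrated with a minimal pandas sketch. This is not the toolbox's implementation: the helper names `add_lags` and `split_by_date`, the default column names, the inclusive date bounds, and the rule that a lag interval must satisfy `1 <= min_lag <= max_lag` are all assumptions made for illustration. Only the `{'feature': (min_lag, max_lag)}` format and the `<feature>_l<k>` column naming come from this page.

```python
import pandas as pd


def add_lags(df, features_to_lag, node_var="node_id", date_col="date"):
    """Add <feature>_l<k> columns, shifting within each node's own series."""
    out = df.sort_values([node_var, date_col]).reset_index(drop=True)
    for feature, (min_lag, max_lag) in features_to_lag.items():
        # Assumed validity rule: 1 <= min_lag <= max_lag.
        if min_lag < 1 or max_lag < min_lag:
            raise ValueError(f"invalid lag interval for {feature!r}: {(min_lag, max_lag)}")
        for k in range(min_lag, max_lag + 1):
            # Grouping by node keeps one node's history from leaking into another's.
            out[f"{feature}_l{k}"] = out.groupby(node_var)[feature].shift(k)
    return out


def split_by_date(df, day_inf, day_sup, date_col="date"):
    """Keep rows with day_inf <= date <= day_sup (inclusive bounds are an assumption)."""
    dates = pd.to_datetime(df[date_col])
    return df[(dates >= pd.Timestamp(day_inf)) & (dates <= pd.Timestamp(day_sup))]


frame = pd.DataFrame({
    "node_id": ["a", "a", "a", "b", "b", "b"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "temperature": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
lagged = add_lags(frame, {"temperature": (1, 2)})
train = split_by_date(lagged, "2024-01-01", "2024-01-02")
```

The first `k` rows of each node have no history, so their `_l<k>` values are NaN. How the toolbox treats those rows is not stated here.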
- df_train_original
Original training DataFrame as loaded from disk.
- Type:
pandas.DataFrame
- df_test_original
Original test DataFrame as loaded from disk.
- Type:
pandas.DataFrame
- df_train, df_val, df_test
Preprocessed train/validation/test sets ready for model input.
- Type:
pandas.DataFrame
- node_var
Column name identifying graph nodes.
- Type:
str
- nodes
Sorted array of unique node identifiers.
- Type:
numpy.ndarray
- features_to_lag
Dictionary describing temporal lags to apply to selected features. Format: {'feature': (min_lag, max_lag)}.
- Type:
dict or None
- dummies
Mapping of categorical features to be one-hot encoded.
- Type:
dict or None
- day_inf_train, day_sup_train, day_inf_val, day_sup_val, day_inf_test, day_sup_test
Date boundaries for temporal splits.
- Type:
str
- folder_config
Path to the configuration folder containing data preprocessing parameters.
- Type:
str
- data_kwargs
Loaded data-related configuration options.
- Type:
dict
- Parameters:
path_train (str) – Path to the training dataset (CSV or Parquet file).
path_test (str) – Path to the test dataset (CSV or Parquet file).
folder_config (str) – Path to the folder containing configuration files (used by load_kwargs).
data_kwargs (dict, optional) – Custom dictionary of preprocessing arguments. If not provided, it is loaded from the configuration folder.
col0 (bool, optional) – Whether to treat the first column as the index column. Default is False.
csv (bool, optional) – Whether the files are CSVs (if False, Parquet is assumed). Default is True.
node_var (str, optional) – Name of the column identifying nodes. If not provided, retrieved from data_kwargs.
features_to_lag (dict, optional) – Temporal lags to compute, e.g. {'temperature': (1, 3)} to add columns temperature_l1, temperature_l2, temperature_l3.
get_dummies (bool, optional) – Whether to apply one-hot encoding on categorical variables. Default is True.
**kwargs – Additional keyword arguments passed to internal preprocessing utilities.
- Raises:
AssertionError – If mandatory columns (‘date’, node variable, lagged features) are missing.
ValueError – If invalid lag intervals are specified.
Examples
>>> from graphtoolbox.data.dataset import DataClass
>>> data = DataClass(
...     path_train="data/train.csv",
...     path_test="data/test.csv",
...     folder_config="config/",
... )
>>> data.df_train.shape
(12000, 45)
>>> list(data.df_train.columns[:5])
['node_id', 'date', 'feature1', 'feature1_l1', 'feature1_l2']
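Two of the preparation steps listed for DataClass, one-hot encoding and consistent node indexing, can be sketched as follows. The helpers `encode_dummies` and `common_nodes` are hypothetical names, and aligning the test columns to the training columns is one possible policy, not necessarily what the toolbox does.

```python
import numpy as np
import pandas as pd


def encode_dummies(train, test, columns):
    """One-hot encode `columns`, giving the test split the train split's columns."""
    train_enc = pd.get_dummies(train, columns=columns, dtype=float)
    test_enc = pd.get_dummies(test, columns=columns, dtype=float)
    # Categories absent from the test split become all-zero columns;
    # categories seen only in the test split are dropped by this policy.
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0.0)
    return train_enc, test_enc


def common_nodes(node_var, **splits):
    """Return the sorted node array, checking that every split has the same nodes."""
    reference = None
    for name, frame in splits.items():
        nodes = np.array(sorted(frame[node_var].unique()))
        if reference is None:
            reference = nodes
        elif not np.array_equal(nodes, reference):
            raise ValueError(f"split {name!r} has a different node set")
    return reference


train = pd.DataFrame({"node_id": ["b", "a"], "kind": ["x", "y"]})
test = pd.DataFrame({"node_id": ["a", "b"], "kind": ["x", "x"]})
train_enc, test_enc = encode_dummies(train, test, ["kind"])
nodes = common_nodes("node_id", train=train_enc, test=test_enc)
```

Encoding each split separately without the `reindex` step would leave the test frame without a `kind_y` column, which is the kind of mismatch a shared column layout avoids.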
- class graphtoolbox.data.dataset.GraphBuilder(graph_dataset_train, graph_dataset_val, graph_dataset_test, **kwargs)
Bases:
object
Constructs graph representations from tabular or temporal datasets, combining feature reduction and graph construction algorithms.
This class provides a unified interface for transforming time series or feature datasets into adjacency matrices suitable for graph neural networks. It can:
- Reduce temporal or feature signals (e.g., via SVD or RESITER)
- Build graphs based on spatial distance, correlation, precision matrices, GL-3SR, or dynamic time warping (DTW)
- Optionally reuse previously computed signals or graphs from disk
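The reduce-then-build pipeline can be illustrated with NumPy alone. The sketch below reads "reduce via SVD" as a truncated low-rank approximation and builds a thresholded absolute-correlation graph from the result. Both choices, along with the function names `reduce_signal_svd` and `correlation_graph` and the `rank` parameter, are assumptions and may differ from what GraphBuilder does internally.

```python
import numpy as np


def reduce_signal_svd(signal, rank):
    """Rank-`rank` approximation of a centered (T, N) signal, one column per node."""
    centered = signal - signal.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Keep only the `rank` largest singular components.
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def correlation_graph(signal, threshold):
    """Symmetric (N, N) weights: |correlation| between node columns, thresholded."""
    weights = np.abs(np.corrcoef(signal, rowvar=False))
    np.fill_diagonal(weights, 0.0)        # no self-loops
    weights[weights < threshold] = 0.0    # drop weak edges
    return weights


rng = np.random.default_rng(0)
signal = rng.normal(size=(50, 4))         # 50 time steps, 4 nodes
reduced = reduce_signal_svd(signal, rank=2)
W = correlation_graph(reduced, threshold=0.3)
```

Reducing before computing correlations removes the weakest components of the signal, so the resulting graph reflects the dominant shared patterns rather than noise.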
- Parameters:
graph_dataset_train (Dataset) – Dataset containing training graph data (with features, nodes, etc.).
graph_dataset_val (Dataset) – Dataset for validation.
graph_dataset_test (Dataset) – Dataset for testing.
model_vgae (object, optional) – Pre-trained VGAE (Variational Graph AutoEncoder) model to initialize the graph builder.
load_graph (bool, default=False) – If True, load a previously saved adjacency matrix instead of recomputing it.
load_signal (bool, default=False) – If True, load a pre-computed reduced signal representation from disk.
reduce_method (str, default='svd') – Method to reduce the signal before graph construction. Options are 'svd' or 'resiter'.
folder_config (str, optional) – Path to a configuration folder (used to load positional data and parameters via load_kwargs).
**kwargs – Additional keyword arguments (e.g., algorithm hyperparameters or model options).
- model_vgae
VGAE model instance, if provided.
- Type:
object or None
- load_graph
Whether an existing graph should be loaded instead of generated.
- Type:
bool
- load_signal
Whether to reuse a pre-computed reduced signal.
- Type:
bool
- reduce_method
Signal reduction strategy used by reduce_signal().
- Type:
str
- folder_config
Folder path containing saved positional or configuration data.
- Type:
str or None
- df_pos
Positional data for nodes (longitude, latitude) loaded from configuration.
- Type:
pandas.DataFrame or None
- graph_dataset_train
Dataset used for training.
- Type:
Dataset
- graph_dataset_val
Dataset used for validation.
- Type:
Dataset
- graph_dataset_test
Dataset used for testing.
- Type:
Dataset
- dataframe
The raw DataFrame from the training dataset.
- Type:
pandas.DataFrame
- data
The training dataset’s data container.
- Type:
DataFrame-like
Notes
The build_graph() method always calls reduce_signal() before constructing an adjacency matrix, unless load_graph=True. The resulting graph can be fed into GNNs (e.g., GCN, GraphSAGE).
Examples
>>> from graphtoolbox.data.dataset import GraphBuilder
>>> gb = GraphBuilder(train_set, val_set, test_set, reduce_method='svd')
>>> W = gb.build_graph(algo='space', threshold=0.1)
>>> W.shape
torch.Size([N, N])
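The load-or-recompute behaviour described in the Notes can be sketched as a small caching helper. Everything here is hypothetical: the function `build_or_load`, the `.npy` file format, and the `compute` callable (which stands in for signal reduction followed by graph construction) are illustrative assumptions, not the toolbox's storage layout.

```python
import os
import tempfile

import numpy as np


def build_or_load(path, compute, load_graph=False):
    """Return a saved adjacency matrix if load_graph is True, else compute and save it."""
    if load_graph:
        # np.load raises FileNotFoundError when the file does not exist.
        return np.load(path)
    adjacency = compute()
    np.save(path, adjacency)
    return adjacency


calls = []


def compute_graph():
    """Stand-in for the expensive reduce-and-build step; records each call."""
    calls.append(1)
    return np.eye(3)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "adjacency.npy")
    first = build_or_load(path, compute_graph)                    # computes and saves
    second = build_or_load(path, compute_graph, load_graph=True)  # reads from disk
```

The second call never invokes `compute_graph`, which mirrors the statement that the reduction step is skipped when a saved graph is loaded.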
- build_graph(algo, **kwargs)
Build or load an adjacency matrix using a specified graph construction algorithm.
- Parameters:
algo (str) – Graph construction method. Options: 'space', 'correlation', 'precision', 'gl3sr', or 'dtw'.
**kwargs (dict) – Algorithm-specific hyperparameters (e.g., threshold, alpha, beta).
- Returns:
Adjacency matrix of shape (N, N).
- Return type:
torch.Tensor
- Raises:
NotImplementedError – If the specified algorithm is not supported.
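A name-based dispatch like the one `build_graph` documents might look like the sketch below. It is an illustration only: the registry `BUILDERS`, the helper `space_adjacency`, and the Gaussian-kernel weighting of pairwise distances are assumptions, since this page does not specify how the 'space' algorithm turns positions and a threshold into weights. The sketch also returns a NumPy array rather than a `torch.Tensor`, and treats longitude and latitude as planar coordinates, which is only a rough approximation over small areas.

```python
import numpy as np


def space_adjacency(positions, threshold=0.1):
    """Gaussian-kernel weights from pairwise distances between (N, 2) positions."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    positive = dist[dist > 0]
    scale = positive.mean() if positive.size else 1.0  # typical inter-node distance
    weights = np.exp(-((dist / scale) ** 2))
    np.fill_diagonal(weights, 0.0)        # no self-loops
    weights[weights < threshold] = 0.0    # sparsify: drop weak (distant) edges
    return weights


# Hypothetical registry; only 'space' is sketched here.
BUILDERS = {"space": space_adjacency}


def build_graph(algo, positions, **kwargs):
    """Dispatch on the algorithm name, rejecting unknown names."""
    try:
        builder = BUILDERS[algo]
    except KeyError:
        raise NotImplementedError(f"unsupported graph algorithm: {algo!r}") from None
    return builder(positions, **kwargs)


positions = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
W = build_graph("space", positions, threshold=0.5)
```

With these three points, only the two nearby nodes stay connected at `threshold=0.5`; the distant node ends up isolated.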
- class graphtoolbox.data.dataset.GraphDataset(data, period: str, scalers_feat=None, scalers_target=None, dataset_kwargs: Dict | None = None, out_channels: int = 1, **kwargs)
Bases:
object
GraphDataset organizes time-dependent node features and targets into graph-structured tensors compatible with PyTorch Geometric.
This class acts as the bridge between tabular time series data and graph neural network inputs. It handles:
- feature and target extraction from the preprocessed DataClass object,
- normalization per node using train-based MinMax scaling,
- construction of temporal tensors (node × time × features),
- association with graph topology (edge_index, edge_weight),
- packaging of graph snapshots as torch_geometric.data.Data objects.
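The tensor construction and per-node, train-based min-max scaling can be sketched with pandas and NumPy. The helper names (`to_node_time_array`, `fit_minmax`, `apply_minmax`), the default column names, and the use of plain arrays instead of fitted scaler objects are assumptions made to keep the example self-contained.

```python
import numpy as np
import pandas as pd


def to_node_time_array(df, features, node_var="node_id", date_col="date"):
    """Stack one (node x time) pivot per feature into a (N, T, F) array."""
    layers = [
        df.pivot(index=node_var, columns=date_col, values=f).to_numpy(dtype=float)
        for f in features
    ]
    return np.stack(layers, axis=-1)


def fit_minmax(x_train):
    """Per-node, per-feature minimum and maximum over time, ignoring NaNs."""
    lo = np.nanmin(x_train, axis=1, keepdims=True)
    hi = np.nanmax(x_train, axis=1, keepdims=True)
    return lo, hi


def apply_minmax(x, lo, hi):
    """Scale with statistics fitted on the training split only."""
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant series
    return (x - lo) / span


train = pd.DataFrame({
    "node_id": ["a", "a", "a", "b", "b", "b"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "load": [0.0, 5.0, 10.0, 100.0, 150.0, 200.0],
})
val = pd.DataFrame({
    "node_id": ["a", "b"],
    "date": pd.to_datetime(["2024-01-04", "2024-01-04"]),
    "load": [20.0, 150.0],
})

x_train = to_node_time_array(train, ["load"])
lo, hi = fit_minmax(x_train)
x_train_scaled = apply_minmax(x_train, lo, hi)
x_val_scaled = apply_minmax(to_node_time_array(val, ["load"]), lo, hi)
```

Because the bounds come from the training split, validation values can fall outside [0, 1], as node `a` does here. This is also why fitted scalers have to be passed in for the validation and test periods. The sketch relies on every split containing the same nodes so that row order matches across arrays.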
- Parameters:
data (DataClass) – Preprocessed data container including train/val/test splits.
period (str) – Dataset split to use, one of {'train', 'val', 'test'}.
scalers_feat (dict, optional) – Dictionary of fitted feature scalers per node (from training phase). Required for validation and test datasets.
scalers_target (dict, optional) – Dictionary of fitted target scalers per node (from training phase).
dataset_kwargs (dict, optional) – Dataset-level configuration (loaded via load_kwargs if not provided). Must include keys like 'features_base' and 'target_base'.
out_channels (int, default 1) – Number of temporal steps grouped per graph sample (sliding window width).
**kwargs – Additional options such as:
- graph_folder (str): path to saved adjacency matrices.
- adj_matrix (str): graph construction algorithm (default: 'space').
- get_dummies (bool): whether to expand categorical dummy variables.
- dataframe
Subset of data corresponding to the specified period.
- Type:
pandas.DataFrame
- features_base
List of input feature column names.
- Type:
list of str
- feature_groups
Optional mapping of feature groups for grouped GNN inputs.
- Type:
dict or None
- target_base
Name of the prediction target column.
- Type:
str
- X_scaled
Normalized feature tensor of shape [num_nodes, T, num_features].
- Type:
torch.Tensor
- Y_scaled
Normalized target tensor of shape [num_nodes, T, 1].
- Type:
torch.Tensor
- mask_X, mask_Y
Boolean masks indicating valid (non-NaN) temporal positions.
- Type:
torch.BoolTensor
- edge_index
Graph connectivity in COO format for PyTorch Geometric.
- Type:
torch.LongTensor
- edge_weight
Edge weights (typically similarities).
- Type:
torch.FloatTensor
- pyg_data
List of graph snapshots ready for batching or iteration.
- Type:
list[torch_geometric.data.Data]
- num_nodes
Number of graph nodes.
- Type:
int
- num_node_features
Number of input features per node.
- Type:
int
- Raises:
AssertionError – If expected columns or scalers are missing.
FileNotFoundError – If the adjacency matrix file is missing.
Examples
>>> from graphtoolbox.data.dataset import GraphDataset
>>> dataset = GraphDataset(data=data, period='train', out_channels=3)
>>> len(dataset)  # number of temporal graph snapshots
120
>>> sample = dataset[0]
>>> sample.x.shape, sample.y.shape
(torch.Size([N, F]), torch.Size([N, 3]))
>>> sample.edge_index.shape
torch.Size([2, E])
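The `[2, E]` shape of `edge_index` follows from the COO layout, in which each column holds one (source, target) pair. The NumPy sketch below converts a dense adjacency matrix to that layout and builds a validity mask from NaN positions. The helper names `dense_to_coo` and `valid_mask` are hypothetical, and collapsing the mask over the feature axis is an assumption about what "valid temporal positions" means. In a PyTorch workflow the arrays could be wrapped with `torch.from_numpy`, and PyTorch Geometric offers `torch_geometric.utils.dense_to_sparse` for the same conversion.

```python
import numpy as np


def dense_to_coo(adjacency):
    """Convert a dense (N, N) matrix into COO edge_index (2, E) and edge_weight (E,)."""
    rows, cols = np.nonzero(adjacency)
    edge_index = np.stack([rows, cols]).astype(np.int64)
    edge_weight = adjacency[rows, cols].astype(np.float32)
    return edge_index, edge_weight


def valid_mask(x):
    """True at each (node, time) position where no feature is NaN."""
    return ~np.isnan(x).any(axis=-1)


adjacency = np.array([
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.2],
    [0.0, 0.2, 0.0],
])
edge_index, edge_weight = dense_to_coo(adjacency)

x = np.array([[[1.0], [np.nan]], [[2.0], [3.0]]])  # shape (N=2, T=2, F=1)
mask = valid_mask(x)
```

An undirected edge appears twice in `edge_index`, once per direction, so the symmetric matrix above with two undirected edges yields `E = 4`.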