graphtoolbox.data.dataset
Classes
- DataClass handles the loading, preprocessing, and temporal segmentation of graph-based datasets used for machine learning and graph neural networks.
- GraphDataset organizes time-dependent node features and targets into graph-structured tensors compatible with PyTorch Geometric.
- class graphtoolbox.data.dataset.DataClass(path_train: str, path_test: str, folder_config: str, data_kwargs: Dict | None = None, **kwargs)
Bases: object
DataClass handles the loading, preprocessing, and temporal segmentation of graph-based datasets used for machine learning and graph neural networks.
This class automates several common data preparation steps for graph-based time series:
- Reading training and test datasets (CSV or Parquet format).
- Creating lagged versions of numerical features.
- Splitting data into train, validation, and test sets based on time boundaries.
- Encoding categorical variables into dummy features.
- Ensuring consistent node indexing across splits.
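The lagging step can be pictured with a minimal pandas sketch. It is not DataClass's own code: the helper `add_lags`, the column names `node_id` and `date`, and the rule for what counts as an invalid interval are assumptions made for illustration. Only the `{'feature': (min_lag, max_lag)}` format and the `_l<k>` column suffix come from this page.

```python
import pandas as pd


def add_lags(df, features_to_lag, node_var="node_id", date_var="date"):
    """Append <feature>_l<k> columns, lagging each feature within each node.

    Hypothetical helper: names and validation rules are assumptions.
    """
    out = df.sort_values([node_var, date_var]).copy()
    for feature, (min_lag, max_lag) in features_to_lag.items():
        if min_lag < 0 or max_lag < min_lag:
            raise ValueError(f"invalid lag interval for {feature!r}")
        for lag in range(min_lag, max_lag + 1):
            # Shifting inside the groupby keeps one node's history
            # from leaking into the next node's rows.
            out[f"{feature}_l{lag}"] = out.groupby(node_var)[feature].shift(lag)
    return out


df = pd.DataFrame({
    "node_id": ["a", "a", "a", "b", "b", "b"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "temperature": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
lagged = add_lags(df, {"temperature": (1, 2)})
```

The first rows of each node hold NaN in the lagged columns because no earlier observation exists for them.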
- df_train_original
Original training DataFrame as loaded from disk.
- Type:
pandas.DataFrame
- df_test_original
Original test DataFrame as loaded from disk.
- Type:
pandas.DataFrame
- df_train, df_val, df_test
Preprocessed train/validation/test sets ready for model input.
- Type:
pandas.DataFrame
- node_var
Column name identifying graph nodes.
- Type:
str
- nodes
Sorted array of unique node identifiers.
- Type:
numpy.ndarray
- features_to_lag
Dictionary describing temporal lags to apply on selected features. Format: {'feature': (min_lag, max_lag)}.
- Type:
dict or None
- dummies
Mapping of categorical features to be one-hot encoded.
- Type:
dict or None
- day_inf_train, day_sup_train, day_inf_val, day_sup_val, day_inf_test, day_sup_test
Date boundaries for temporal splits.
- Type:
str
- folder_config
Path to the configuration folder containing data preprocessing parameters.
- Type:
str
- data_kwargs
Loaded data-related configuration options.
- Type:
dict
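As a rough illustration of how the day_inf_* and day_sup_* boundaries could carve one table into three splits, consider the sketch below. The helper `split_by_dates`, the `date` column name, and the choice of an inclusive lower bound with an exclusive upper bound are assumptions; this page does not say how DataClass treats the endpoints.

```python
import pandas as pd


def split_by_dates(df, bounds, date_var="date"):
    """Return one DataFrame per split; bounds maps split name -> (day_inf, day_sup).

    Hypothetical helper, not part of graphtoolbox.
    """
    dates = pd.to_datetime(df[date_var])
    splits = {}
    for name, (day_inf, day_sup) in bounds.items():
        # Assumption: day_inf is inclusive, day_sup is exclusive.
        mask = (dates >= pd.Timestamp(day_inf)) & (dates < pd.Timestamp(day_sup))
        splits[name] = df.loc[mask].reset_index(drop=True)
    return splits


df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "load": range(10),
})
bounds = {
    "train": ("2024-01-01", "2024-01-07"),
    "val": ("2024-01-07", "2024-01-09"),
    "test": ("2024-01-09", "2024-01-11"),
}
splits = split_by_dates(df, bounds)
```

Half-open intervals make adjacent splits share a boundary string without sharing any rows, which is one reason to prefer them when the splits must not overlap in time.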
- Parameters:
path_train (str) – Path to the training dataset (CSV or Parquet file).
path_test (str) – Path to the test dataset (CSV or Parquet file).
folder_config (str) – Path to the folder containing configuration files (used by load_kwargs).
data_kwargs (dict, optional) – Custom dictionary of preprocessing arguments. If not provided, it is loaded from the configuration folder.
col0 (bool, optional) – Whether to treat the first column as the index column. Default is False.
csv (bool, optional) – Whether the files are CSVs (if False, Parquet is assumed). Default is True.
node_var (str, optional) – Name of the column identifying nodes. If not provided, retrieved from data_kwargs.
features_to_lag (dict, optional) – Temporal lags to compute, e.g. {'temperature': (1, 3)} to add columns temperature_l1, temperature_l2, temperature_l3.
get_dummies (bool, optional) – Whether to apply one-hot encoding on categorical variables. Default is True.
**kwargs – Additional keyword arguments passed to internal preprocessing utilities.
- Raises:
AssertionError – If mandatory columns (‘date’, node variable, lagged features) are missing.
ValueError – If invalid lag intervals are specified.
Examples
>>> data = DataClass(
...     path_train="data/train.csv",
...     path_test="data/test.csv",
...     folder_config="config/",
... )
>>> data.df_train.shape
(12000, 45)
>>> list(data.df_train.columns[:5])
['node_id', 'date', 'feature1', 'feature1_l1', 'feature1_l2']
- class graphtoolbox.data.dataset.GraphDataset(data, period: str, scalers_feat=None, scalers_target=None, dataset_kwargs: Dict | None = None, out_channels: int = 1, **kwargs)
Bases: object
GraphDataset organizes time-dependent node features and targets into graph-structured tensors compatible with PyTorch Geometric.
This class acts as the bridge between tabular time series data and graph neural network inputs. It handles:
- feature and target extraction from the preprocessed DataClass object,
- normalization per node using train-based MinMax scaling,
- construction of temporal tensors (node × time × features),
- association with graph topology (edge_index, edge_weight),
- packaging of graph snapshots as torch_geometric.data.Data objects.
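Per-node, train-based MinMax scaling on a node × time × features array might look like the following NumPy sketch. The functions `fit_minmax` and `apply_minmax` are hypothetical and stand in for whatever scaler objects GraphDataset stores in scalers_feat; the handling of constant features is also an assumption.

```python
import numpy as np


def fit_minmax(x_train):
    """Per-node, per-feature min and max of a [num_nodes, T, num_features] array."""
    lo = np.nanmin(x_train, axis=1, keepdims=True)  # reduce over the time axis only
    hi = np.nanmax(x_train, axis=1, keepdims=True)
    return lo, hi


def apply_minmax(x, lo, hi):
    """Scale with training statistics; constant features map to 0 (assumption)."""
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (x - lo) / span


rng = np.random.default_rng(0)
x_train = rng.normal(size=(4, 50, 3))  # 4 nodes, 50 time steps, 3 features
x_val = rng.normal(size=(4, 10, 3))

lo, hi = fit_minmax(x_train)
x_train_scaled = apply_minmax(x_train, lo, hi)
x_val_scaled = apply_minmax(x_val, lo, hi)  # reuses train statistics
```

Reusing the training minimum and maximum on the validation and test periods avoids leaking information from later data, at the cost that scaled values there can fall outside [0, 1].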
- Parameters:
data (DataClass) – Preprocessed data container including train/val/test splits.
period (str) – Dataset split to use, one of {'train', 'val', 'test'}.
scalers_feat (dict, optional) – Dictionary of fitted feature scalers per node (from training phase). Required for validation and test datasets.
scalers_target (dict, optional) – Dictionary of fitted target scalers per node (from training phase).
dataset_kwargs (dict, optional) – Dataset-level configuration (loaded via load_kwargs if not provided). Must include keys like 'features_base' and 'target_base'.
out_channels (int, default 1) – Number of temporal steps grouped per graph sample (sliding window width).
**kwargs – Additional options such as:
- graph_folder (str): path to saved adjacency matrices.
- adj_matrix (str): graph construction algorithm (default: 'space').
- get_dummies (bool): whether to expand categorical dummy variables.
- dataframe
Subset of data corresponding to the specified period.
- Type:
pandas.DataFrame
- features_base
List of input feature column names.
- Type:
list of str
- feature_groups
Optional mapping of feature groups for grouped GNN inputs.
- Type:
dict or None
- target_base
Name of the prediction target column.
- Type:
str
- X_scaled
Normalized feature tensor of shape [num_nodes, T, num_features].
- Type:
torch.Tensor
- Y_scaled
Normalized target tensor of shape [num_nodes, T, 1].
- Type:
torch.Tensor
- mask_X, mask_Y
Boolean masks indicating valid (non-NaN) temporal positions.
- Type:
torch.BoolTensor
- edge_index
Graph connectivity in COO format for PyTorch Geometric.
- Type:
torch.LongTensor
- edge_weight
Edge weights (typically similarities).
- Type:
torch.FloatTensor
- pyg_data
List of graph snapshots ready for batching or iteration.
- Type:
list[torch_geometric.data.Data]
- num_nodes
Number of graph nodes.
- Type:
int
- num_node_features
Number of input features per node.
- Type:
int
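To make the COO layout of edge_index and edge_weight concrete, the sketch below converts a small dense adjacency matrix with NumPy. The helper `dense_to_coo` is hypothetical and says nothing about how GraphDataset reads its saved adjacency matrices; PyTorch Geometric offers `torch_geometric.utils.dense_to_sparse` for a comparable conversion on tensors.

```python
import numpy as np


def dense_to_coo(adj):
    """Return (edge_index, edge_weight) for the non-zero entries of a square matrix.

    Hypothetical helper for illustration only.
    """
    adj = np.asarray(adj, dtype=float)
    if adj.ndim != 2 or adj.shape[0] != adj.shape[1]:
        raise ValueError("adjacency matrix must be square")
    src, dst = np.nonzero(adj)                          # row and column of each edge
    edge_index = np.stack([src, dst]).astype(np.int64)  # shape [2, E]
    edge_weight = adj[src, dst].astype(np.float32)      # shape [E]
    return edge_index, edge_weight


adj = np.array([
    [0.0, 0.8, 0.0],
    [0.8, 0.0, 0.3],
    [0.0, 0.3, 0.0],
])
edge_index, edge_weight = dense_to_coo(adj)
```

Each column of `edge_index` is one directed edge, so a symmetric matrix yields every undirected edge twice. The arrays can be passed to `torch.from_numpy` to obtain the LongTensor and FloatTensor types listed above.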
- Raises:
AssertionError – If expected columns or scalers are missing.
FileNotFoundError – If the adjacency matrix file is missing.
Examples
>>> dataset = GraphDataset(data=data, period='train', out_channels=3)
>>> len(dataset)
120  # number of temporal graph snapshots
>>> sample = dataset[0]
>>> sample.x.shape, sample.y.shape
(torch.Size([N, F]), torch.Size([N, 3]))
>>> sample.edge_index.shape
torch.Size([2, E])
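The [N, 3] target shape in the example follows from grouping out_channels consecutive time steps per sample. A minimal NumPy sketch of such windowing is given here. The helper `window_targets` and its `stride` parameter are assumptions: this page does not state whether GraphDataset uses overlapping or disjoint windows.

```python
import numpy as np


def window_targets(y, width, stride=1):
    """Cut a [num_nodes, T] target array into windows of shape [num_nodes, width].

    Hypothetical helper; stride=1 (overlapping windows) is an assumption.
    """
    num_nodes, num_steps = y.shape
    if width < 1 or width > num_steps:
        raise ValueError("width must be between 1 and the number of time steps")
    starts = range(0, num_steps - width + 1, stride)
    return [y[:, s:s + width] for s in starts]


y = np.arange(2 * 6, dtype=float).reshape(2, 6)  # 2 nodes, 6 time steps
windows = window_targets(y, width=3)
```

With `stride=1` the six time steps give four overlapping windows; setting `stride=width` would instead give two disjoint ones.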