Synthetic Data Pipeline
To make machine learning experiments on time-series data scalable and easy to run, SimbaML offers multiple pipelines that cover data preprocessing, training, and evaluation of ML models. The synthetic data pipelines and mixed data pipelines are run by passing them the corresponding config files.
Configure Synthetic Data Pipeline
All machine learning pipelines provided by SimbaML are configured through config files. In the config file, users choose which models are trained, set their hyperparameters, select the evaluation metrics, and specify how the data is split and preprocessed. The following TOML example configures the synthetic data pipeline:
# Define the models to be used for training
[[models]]
id = "KerasDenseNeuralNetworkTransferLearning"
normalizer = true
[models.training_params]
epochs = 10
patience = 5
batch_size = 32
validation_split = 0.2
verbose = 0
[models.archictecture_params]
units = [32, 32]
activation = "relu"

[[models]]
id = "PytorchLightningDenseNeuralNetwork"
normalizer = true
[models.training_params]
epochs = 10
patience = 5
batch_size = 32
validation_split = 0.2
verbose = 0
[models.archictecture_params]
units = [32, 32]
activation = "relu"

[[models]]
id = "DecisionTreeRegressor"
[models.model_params]
criterion = "squared_error" # "squared_error", "absolute_error"
splitter = "best" # "best", "random"

[[models]]
id = "LinearRegressor"
[models.model_params]
fit_intercept = true
n_jobs = 1 # TODO: What is good value here
positive = false

[[models]]
id = "NearestNeighborsRegressor"
[models.model_params]
n_neighbors = 5
weights = "uniform" # "uniform", "distance"

[[models]]
id = "RandomForestRegressor"
[models.model_params]
n_estimators = 100
max_depth = 10
min_samples_split = 2
min_samples_leaf = 1
min_weight_fraction_leaf = 0.0

[[models]]
id = "SVMRegressor"
kernel = "linear" # "linear", "poly", "rbf", "sigmoid", "precomputed"

# Define the metrics to be used for evaluating the model
metrics = [
    "r_square",
    "mean_absolute_error",
    "mean_squared_error",
    "mean_absolute_percentage_error",
    "root_mean_squared_error",
    "normalized_root_mean_squared_error",
]

# Define data configurations
[data]
synthetic = "path/to/synthetic/data"
test_split = 0.2 # [0, 1]
split_axis = "vertical" # "vertical", "horizontal"

# Define the time series configurations
[data.time_series]
input_features = ["Infected", "Recovered"]
output_features = ["Infected", "Recovered"]
input_length = 1
output_length = 1

# Define the plugins you wrote to use your own models
plugins = [
    "tests.example_plugin",
    "simba_ml.prediction.time_series.models.keras",
    "simba_ml.prediction.time_series.models.pytorch_lightning"
]

# If you include this section, the results will be logged to wandb
# make sure to specify the right project and entity
# [logging]
# project = "your-wandb-project"
# entity = "your-wandb-entity"
Start Synthetic Data Pipeline
$ simba_ml start-prediction synthetic_data --config-path synthetic_data_pipeline.toml
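When the pipeline is started from a script, for example to run several config files in sequence, the same command can be assembled in Python. This is a minimal sketch that assumes the `simba_ml` executable is on the `PATH`. The helpers `build_command` and `run_pipeline` are hypothetical names, not SimbaML functions. Only the command line is built and printed here; `run_pipeline` is defined but not called.

```python
import shlex
import subprocess


def build_command(config_path: str) -> list[str]:
    """Assemble the CLI call for the synthetic data pipeline (hypothetical helper)."""
    return [
        "simba_ml",
        "start-prediction",
        "synthetic_data",
        "--config-path",
        config_path,
    ]


def run_pipeline(config_path: str) -> int:
    """Run the pipeline as a subprocess and return its exit code.

    Assumes the simba_ml executable is installed and on the PATH.
    """
    completed = subprocess.run(build_command(config_path), check=False)
    return completed.returncode


command = build_command("synthetic_data_pipeline.toml")
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems when a config path contains spaces.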