User Guide

ML generation in pyMAISE is done in five steps:

  1. pyMAISE initialization,

  2. pre-processing,

  3. Model initialization,

  4. hyper-parameter tuning,

  5. post-processing.

Each step and their respective classes are discussed below.

pyMAISE Initialization

To access pyMAISE in your Python code you must import the library:

import pyMAISE as mai

We shorten pyMAISE to mai for readability and convenience.

Defining Global Settings

Every pyMAISE job requires initialization through the Settings class. These global settings and their defaults include:

  • verbosity = 0: how much pyMAISE outputs to the terminal,

  • random_state = None: the global random state used by ML pseudo random algorithms during training,

  • test_size = 0.3: the fraction of data used for model testing,

  • num_configs_saved = 5: the number of hyper-parameter configurations saved, only the top num_configs_saved are returned.

To initialize pyMAISE with default global variables you run:

global_settings = mai.settings.init()

However, these defaults can be changed by passing any changed parameters you want in a dictionary. For example, if we wanted a verbosity of 1 and a random_state of 42 we define a dictionary,

settings_changes = {
    "verbosity": 1,
    "random_state": 42,
}

We can pass that to the initialization function:

global_settings = mai.settings.init(settings_changes)

Pre-processing

In this section we’ll load and scale our data. Additionally, we can generate correlation matricies and retrieve properties such as xscaler and yscaler (discussed in the Data Scaling section).

Loading Data

With pyMAISE initialized with settings.init we can load our data into the PreProcessor class. For personal data in one data file, initialize the PreProcessor with

preprocessor = mai.PreProcessor(
   "path/to/data.csv",
   slice(0, x),
   slice(x, y)
)

where x defines the beginning of the outputs and y defines the end +1 position of the outputs in the data file. For data with inputs and outputs in seperate files use

preprocessor = mai.PreProcessor(
    [
        "path/to/inputs.csv",
        "path/to/outputs.csv"
    ]
)

If you wish to load the benchmark specific pre-processors run the corresponding load function:

  • MITR: mai.load_MITR()

  • Reactor physics: mai.load_xs()

  • Fuel performance: mai.load_fp()

  • Heat conduction: mai.load_heat()

  • BWR: mai.load_BWR()

Data Scaling

The performance of many ML models depends on the scaling of the data. pyMAISE offers three scaling options: none, min-max, and standard scaling. Their respective functions are

# No scaling
data = preprocessor.data_split()

# Min-max scaling
data = preprocessor.min_max_scale()

# Standard scaling
data = preprocessor.std_scale()

All three methods return a tuple of training and testing data, xtrain, xtest, ytrain, ytest, and both min_max_scale and std_scale can scale input and/or output data depending on how scale_x and scale_y are defined. Both min_max_scale and std_scale use scaling objects from sklearn.preprocessing that can be retrieved with the xscaler and yscaler properties. To min-max scale only the inputs run

data = preprocessor.min_max_scale(scale_y=False)

Generating a Correlation Matrix

To better understand the correlation between the inputs and the outputs we can plot a correlation matrix using preprocessor.correlation_matrix(). You can toggle the colorbar or annotations using the colorbar and anotations parameters. Additionally, you can pass your own matplotlib figure or axis.

Pre-processing Properties

Using the preprocessor object we can retrieve several useful objects:

  • preprocessor.data: is the raw data,

  • preprocessor.inputs: is only the input raw data,

  • preprocessor.outputs: is only the output raw data,

  • preprocessor.xscaler: scaler object used to scale the inputs,

  • preprocessor.yscaler: scaler object used to scale the outputs.

Model Initialization

pyMAISE supports both classical ML methods and dense sequantial neural networks. Here is a list of the support ML methods and their names in pyMAISE:

  • Linear regression: linear,

  • Lasso regression: lasso,

  • Support vector regression: svr,

  • Decision tree regression: dtree,

  • Random forest regression: rforest,

  • K-nearest neighbors regression: knn,

  • Sequential dense neural networks: nn.

Caution

Support vector regression is only supported for 1-dimensional outputs.

The classical models use scikit-learn model wrappers and the neural networks are based on Keras. For more information on the models themselves refer to the Machine Learning Models section.

To initialize each of your desired models, specify their pyMAISE specific names in the models list in a dicitonary. Then for each of the models define a dictionary of hyper-parameters you’d like to change. Only parameters with values different from their scikit-learn or Keras defaults must be specified. These parameters will define the values that remain constant throughout tuning or the initial guess for random and Bayesian search. Refer to the Model Dictionary Templates section for a dictionary of parameters for each model. As an example, lets initialize linear, lasso, and rforest with 200 estimators:

model_settings = {
   "models": ["linear", "lasso", "rforest"],
   "rforest": {
       "n_estimators": 200,
   }
}

We can then initialize the Tuning class in pyMAISE with our data tuple and the model_settings:

tuning = mai.Tuning(data=data, model_settings=model_settings)

Hyper-parameter Tuning

With all the models of interest initialized in tuning, we can begin hyper-parameter tuning. pyMAISE supports three types of search methods: grid, random, or Bayesian search. Each of the search functions require the definition of parameter search spaces in dictionaries for each model type. The function then pulls the parameter space for each model and passes it to the search function.

Convergence Plots

For each of the search methods you can plot a convergence plot using the convergence_plot function; however, this is more appealing for Bayesian search as it shows how the kernel converges to the optimal hyper-parameter configuration with each step. To plot a specific model such as nn run

tuning.convergence_plot(model_types="nn")

Post-processing

With our top num_configs_saved models we can pass these to the PostProcessor class for model comparison and testing. To do so we provide the data and configuration(s):

postprocessor = mai.PostProcessor(
   data=data,
   models_list=[random_search_configs, bayesian_search_configs],
)

Additionally, we can pass a dictionary similar to model_settings of updated model settings to the new_model_settings parameter. With our PostProcessor initialized we can begin evaluating our models.

Performance Metrics

The performance metrics we’ll use to assess and compare each of the models are

  • r-squared: \(R^2 = 1 - \frac{\sum_{i = 1}^{n}(y_i - \hat{y_i})^2}{\sum_{i = 1}^{n}(y_i - \bar{y_i})^2}\),

  • mean absolute error: \(MAE = \frac{1}{n}\sum_{i = 1}^{n}|y_i - \hat{y_i}|\),

  • mean squared error: \(MSE = \frac{1}{n}\sum_{i = 1}^n(y_i - \hat{y_i})^2\),

  • root mean squared error: \(RMSE = \sqrt{\frac{1}{n}\sum_{i = 1}^n(y_i - \hat{y_i})^2}\),

where \(y\) is the actual outcome, \(\hat{y}\) is the model predicted outcome, \(\bar\) is the average outcome, and \(n\) is the number of observations. These metrics are computed for both the training and testing data sets and are computed through the metrics function in the PostProcessor. You can choose how the DataFrame is sorted, whether the features are averaged or only the metrics for one feature is computed, and which models to show. With this information you can compare the performance of each of your models on your data set.

Diagonal Validation Plots

After computing the performance metrics of each model, you can create diagonal validation plots that show the models predictions versus the actual result. This is done through the diagonal_validation_plot function in the PostProcessor and you can choose the model and the label to plot. Additionally, you can pass the yscaler from the PreProcessor to get representative numbers of the output.

Validation Plots

Similar to the diagonal validation plots, you can also plot validation plots that show the absolute relative error of the model predictions to the correct result. This is done through the validation_plot function in the PostProcessor. This function has the same capabilities as diagonal_validation_plot.

Neural Network Learning Plots

The final plotting capability of the PostProcessor is the neural network learning curves. These curves show the loss per epoch which informs the training of the neural network. With these curves you can determine if the neural network is overfit to the training data. Neural network learning curves are created through the nn_learning_plot function and you can choose with neural network model to plot.

Other Post-processing Functions

Finally, the PreProcessor is equipped with several additional methods to model analysis. These include

  • get_params: get the parameter configurations from a specific model,

  • get_model: get the model wrapper,

  • get_predictions: get the training and testing predictions from a specific model.