This article walks users through the One AI configuration settings for machine learning models in the One AI graphical user interface (GUI). This article is intended to be a guide for configuring the more advanced Global Settings, but is not intended to be a full explanation of the machine learning concepts associated with each setting. See linked articles for complete explanations of each topic.
Introduction
This configuration is geared towards advanced users who are comfortable with machine learning basics and building models in One AI. If you do not wish to perform manual configurations, or do not have the know-how, that’s totally fine! One AI has default settings based on best practices and will try an intelligent subset of the available options and select the combination that results in the best model performance and fit.
Configuring a machine learning model can be a critical step in harnessing the true potential of machine learning. One AI does an excellent job assisting in building models, but sometimes it doesn’t get everything quite right. By delving into settings such as global settings, dimensionality reduction, estimator configuration, upsampling, and per column interventions, users can tailor their models to extract maximum insights from their data. This customization does more than fine-tune performance: it also improves the relevance and accuracy of predictions, which helps in finding actionable insights.
Where to Find Advanced Configuration Settings
Navigation: Select One AI from the top navigation bar and Edit on a model
The first step in creating a machine learning model is data framing. This is accomplished by either leveraging a Recipe or a Data Destination. Once data framing is complete, the One AI Configuration settings become available. This section is also available in existing machine learning models for making edits.
Each advanced configuration in One AI has a default setting or settings to minimize the amount of work necessary to produce a working model. However, the “Override” option is provided in case you would like to configure the setting(s) yourself.
Global Settings
Global settings refer to the overarching configuration parameters and hyperparameters that affect the entire model. This means these settings are not specific to individual columns or layers but rather influence the behavior and performance of the model as a whole. Please see the Per Column Interventions section if you are interested in configuring on a per-column basis. While these settings impact the whole model, they do not impact every model on your site and can be unique for each model.
The goal of the global settings overrides is to give the user full control over model configuration, particularly setting the rules for which columns get automatically dropped from the model and which methods are used for preprocessing the data the model consumes.
The global settings are listed below.
Continuous strategy is how One AI handles numeric and date variables. Date variables are first transformed into continuous variables (number of days from the sample date).
Default strategy:
- Linear scaling (also known as normalization)
Additional strategy option:
- Binning
Default scale type:
- Standard
Additional scale type options:
- MinMax
- Robust
Default bin type:
- Auto (number of bins will be determined with algorithmic methods)
Additional bin type options:
- This field accepts a single integer corresponding to the number of bins you would like to divide the continuous data into, as clarified if you hover over the information icon
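The scale types and the integer bin option can be illustrated with a minimal sketch. It assumes that Standard, MinMax, and Robust follow their common textbook definitions and that an integer bin count produces equal-width bins; One AI's internal implementation may differ. The function names and the `tenure_days` column are hypothetical.

```python
import statistics

def standard_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def minmax_scale(values):
    """Rescale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (assumed bin rule)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical continuous column: tenure in days.
tenure_days = [30, 90, 365, 730, 1825, 3650]
binned = equal_width_bins(tenure_days, 3)
```

Robust scaling is less sensitive to outliers than Standard or MinMax because the median and interquartile range ignore the extremes of the column.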
Category size threshold is the criterion used for handling categorical variables, particularly those with a large number of unique category groupings or levels. In One AI machine learning models, this is specifically how small each category grouping within the categorical variable can be before it gets combined into a one hot encoded “Other” category.
Default category size threshold:
- 0.05 (5%)
- Category groupings must make up at least 5% of the total or they will be combined into a one hot encoded grouping column called “_other”.
Additional category size threshold options:
- This field accepts a single float value between 0 and 1 corresponding to the minimum fraction of a categorical column's values that a category grouping must account for in order to be one hot encoded as its own column.
- For example, if each category grouping must make up at least 1/10 of the values, the input value would be 0.1
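The grouping rule described above can be sketched in a few lines. This is an illustration under the assumption that a category is kept when its share of rows is at least the threshold; the function name and the `departments` column are hypothetical.

```python
from collections import Counter

def group_rare_categories(values, threshold=0.05, other_label="_other"):
    """Replace categories whose share of rows is below threshold with other_label."""
    counts = Counter(values)
    total = len(values)
    keep = {cat for cat, n in counts.items() if n / total >= threshold}
    return [v if v in keep else other_label for v in values]

# Hypothetical categorical column with two small groupings (3% and 1%).
departments = ["Sales"] * 60 + ["Engineering"] * 36 + ["Legal"] * 3 + ["Facilities"] * 1
grouped = group_rare_categories(departments)
# Counter(grouped) -> Sales: 60, Engineering: 36, _other: 4
```

Raising the threshold folds more groupings into “_other” and produces fewer one hot encoded columns.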
During a machine learning model run, a correlation check is performed to determine how correlated each column is with the target column. One AI will also perform a correlation check in order to test how correlated each predictor variable is with every other predictor variable. We will discuss that more in the General Correlation Threshold section.
Default correlation type:
- A Cramer correlation type test is performed on categorical variables
- A Pearson correlation type test is performed on continuous variables
Additional correlation type options:
- None - no correlation check will be performed
- We do not recommend changing this setting to “cramers” or “pearson” because Cramers should only be used to test for correlation between categorical variables and Pearson should only be used to test for correlation between continuous variables. Most models have both types of variables, so if you select either Cramers or Pearson, the test will also be applied to variables it is not suited for and will produce unreliable results.
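To show why the Cramer test only makes sense for categorical data, here is a minimal sketch of Cramér's V computed from a contingency table. It uses the uncorrected textbook formula; whether One AI applies a bias correction is not stated in this article, and the function name is hypothetical.

```python
import math
from collections import Counter

def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two categorical columns (0 = none, 1 = perfect)."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    joint = Counter(zip(xs, ys))
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n          # count expected if x and y were independent
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0

perfect = cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"])
unrelated = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
```

The statistic depends only on how often category pairs co-occur, so applying it to a continuous column with mostly unique values would not give a meaningful result.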
One AI runs a correlation test to check how correlated each predictor variable is with every other predictor variable using a Cramer or Pearson correlation test depending on the variable type as explained above. The general correlation threshold is how correlated the two or more predictor variables must be in order for the less performant variable(s) to be automatically dropped.
This test is used to detect if two or more predictor variables are too correlated - too similar - to exist in the same model. For example, date of birth and age are often both available to models but typically shouldn’t both be selected by the model because they are effectively the same thing and highly correlated.
Default general correlation threshold:
- 0.65
Additional general correlation threshold options:
- This field accepts a single float between 0 and 1 corresponding to the minimum correlation value necessary to consider two columns correlated. A value closer to 1 indicates that the columns are very correlated, while a value closer to 0 indicates a lack of correlation.
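The threshold check for continuous columns can be sketched as follows. This is a simplified illustration that assumes the absolute value of the Pearson coefficient is compared against the threshold; it only flags the pairs and does not model how One AI decides which column is less performant. All names and data are hypothetical.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two continuous columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlated_pairs(columns, threshold=0.65):
    """Return (column_a, column_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical predictors: age and birth year carry the same information.
columns = {
    "age": [25, 32, 41, 50, 58],
    "birth_year": [1999, 1992, 1983, 1974, 1966],
    "commute_km": [12, 3, 30, 8, 15],
}
flagged = correlated_pairs(columns)
```

Here only the age and birth year pair is flagged, which mirrors the date of birth and age example above.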
When a machine learning model is run, a check for data leakage is performed as part of the data cleaning and preparation process prior to training. Data leakage happens when the training data contains information about the target, but similar data will not be available when the model is used for prediction on new, unseen data. In simpler terms, it’s a "cheating" column or variable that predicts the outcome too well to be plausible. Common examples are found in columns that are retroactively nulled when employees terminate.
In One AI, leakage is identified by generating a random forest model against the target using only that feature and then measuring the performance using an ROC-AUC score. If the column has an ROC-AUC score higher than the threshold, the column will be automatically dropped.
Default leakage performance threshold:
- 0.85
Additional leakage performance threshold options:
- This field accepts a float between 0 and 1 that controls how aggressively One AI will label a column as leakage. A smaller value increases the likelihood that a column will be dropped.
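The leakage check can be approximated with a short sketch. As a simplification, it scores the raw feature values directly with a rank-based ROC-AUC instead of fitting a single-feature random forest as One AI does, so it is a stand-in for the idea rather than the actual procedure. The function names and example columns are hypothetical.

```python
def roc_auc(labels, scores):
    """Probability that a random positive row scores higher than a random negative row."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def is_leakage(labels, scores, threshold=0.85):
    """Flag the column when its single-feature ROC-AUC is higher than the threshold."""
    return roc_auc(labels, scores) > threshold

# Hypothetical target: 1 = employee terminated.
terminated = [1, 1, 1, 0, 0, 0, 0, 0]
# A field that is retroactively nulled at termination predicts the target perfectly.
manager_is_null = [1, 1, 1, 0, 0, 0, 0, 0]
# An ordinary feature with no strong relationship to the target.
team_size = [4, 9, 6, 5, 8, 3, 7, 6]

leaky = is_leakage(terminated, manager_is_null)
ordinary = is_leakage(terminated, team_size)
```

A perfect score of 1.0 for a single column is the "too good to be plausible" signal described above.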
SHAP is a powerful way to understand how each feature contributes to the prediction of a machine learning model. These values provide a measure of the importance of each feature by considering its impact on predictions across all possible combinations with other features.
If you would like to generate the SHAP Beeswarm chart or SHAP averages in your results summary or if you intend to visualize data from your machine learning model with metrics in Explore and storyboards, you must turn on Generate SHAP Values.
Default generate SHAP values behavior:
- SHAP values are not generated by default
Additional generate SHAP values options:
- You can generate SHAP values by checking the override box and the additional checkbox before the model is run and deployed.
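To make "all possible combinations with other features" concrete, the following sketch computes exact Shapley values for a toy three-feature model by enumerating every coalition. It is an illustration only: real SHAP implementations use much faster approximations, and the toy model, feature names, and numbers are all hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, features):
    """Exact Shapley value of each feature, averaging its marginal contribution."""
    n = len(features)
    result = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                known = frozenset(coalition)
                total += weight * (value_fn(known | {f}) - value_fn(known))
        result[f] = total
    return result

def toy_prediction(known):
    """Toy model: predicted attrition risk given which features are known."""
    risk = 0.10                        # baseline prediction
    if "overtime" in known:
        risk += 0.20
    if "low_pay" in known:
        risk += 0.10
    if {"overtime", "low_pay"} <= known:
        risk += 0.06                   # interaction, shared equally by both features
    return risk

phi = shapley_values(toy_prediction, ["overtime", "low_pay", "tenure"])
```

The values sum to the difference between the full prediction and the baseline, and a feature the model never uses (here, tenure) receives a value of zero.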
The null drop threshold refers to a threshold used to determine whether to drop features in a dataset based on the proportion of missing values they contain. In One AI, this is specifically what percentage of a column can be null data before that column is automatically dropped.
Default null drop threshold:
- 0.05 (5%)
- Columns that are 5% or more null will be automatically dropped from the model.
Additional null drop threshold options:
- This field accepts a single float value between 0 and 1 corresponding to the fraction of null values at or above which a column is dropped.
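The null drop rule can be sketched as below, assuming nulls are represented as `None` and that a column is dropped when its null fraction is at or above the threshold, as the default description states. The function names and the example table are hypothetical.

```python
def null_fraction(values):
    """Fraction of entries in a column that are null (None)."""
    return sum(v is None for v in values) / len(values)

def drop_null_heavy_columns(table, threshold=0.05):
    """Keep only the columns whose null fraction is below the threshold."""
    return {name: col for name, col in table.items() if null_fraction(col) < threshold}

# Hypothetical 20-row dataset.
table = {
    "salary": [50_000] * 20,                       # 0% null: kept
    "bonus": [1_000] * 19 + [None],                # 5% null: dropped
    "exit_reason": [None] * 15 + ["Resigned"] * 5  # 75% null: dropped
}
kept = drop_null_heavy_columns(table)
```

With the default of 0.05, even a single null in a 20-row column is enough to drop it, so the threshold is usually worth reviewing for sparse datasets.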
If you set your continuous strategy type to bin, then by default, null rows are set aside: no bin is created for them in any continuous feature. If you apply a null indicator override, a separate category is created for the nulls to be placed into.
Default null indicator behavior:
- No bin is created for null rows
Additional null indicator behavior:
- This field accepts a free text value that is inserted into continuous features that get binned, indicating scalars that were originally null but had to be filled to apply bin edges.
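One way to picture the null indicator is the sketch below. It assumes that, without an override, nulls simply stay outside the bins, and that with an override the free text value becomes the label for rows that were originally null. The bin edges, function name, and label are hypothetical and not taken from One AI.

```python
import bisect

def bin_with_null_indicator(values, edges, null_label=None):
    """Map each value to a bin index; nulls get null_label (None means set aside)."""
    binned = []
    for v in values:
        if v is None:
            binned.append(null_label)
        else:
            # bisect_right returns how many edges are <= v, which is the bin index.
            binned.append(bisect.bisect_right(edges, v))
    return binned

tenure_days = [30, None, 400, 2000]
edges = [365, 1825]  # hypothetical bin edges: under 1 year, 1 to 5 years, 5+ years

default_bins = bin_with_null_indicator(tenure_days, edges)
with_indicator = bin_with_null_indicator(tenure_days, edges, null_label="missing")
```

With the indicator in place, "missing" behaves like any other category, so the model can learn from the fact that a value was absent.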
The random state parameter is used to initialize the random number generator used when generating random numbers or shuffling data. For a given random state, One AI will make the same choices for algorithms requiring a random value; this allows you, for instance, to reproduce the same train test split each time the model is run. To preserve determinism in model creation, each random state will produce the same results with the same data, but it may be useful to change the random state between runs to confirm that the results are similar and not due to random chance.
Default random state value:
- 43
Additional random state value options:
- This field accepts a single integer between 0 and 4,294,967,295 corresponding to a desired random seed value. This value should only be changed if a different train and test split is desired.
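The effect of a fixed random state on the train test split can be sketched with Python's standard library. This is an illustration of seeding in general, not One AI's splitting code; the function name and the 20% test fraction are assumptions.

```python
import random

def train_test_split(rows, test_fraction=0.2, random_state=43):
    """Shuffle rows with a seeded generator and split them into train and test sets."""
    rng = random.Random(random_state)  # the same seed always yields the same shuffle
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))
train_a, test_a = train_test_split(rows, random_state=43)
train_b, test_b = train_test_split(rows, random_state=43)
```

Running the split twice with the same seed returns identical partitions, while passing a different `random_state` reshuffles the rows and generally gives a different split.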
SUSPICIOUS PERFORMANCE THRESHOLD
The suspicious performance threshold is a less stringent version of the process described above for the data leakage performance threshold using the same detection test to inform users of possible leakage. Variables deemed suspicious are not automatically dropped from the model like those that meet the data leakage threshold, but instead are flagged with a purple label in the EDA report and should be validated to ensure the column is not allowing the model to “cheat”.
Default suspicious performance threshold:
- 0.7
Additional suspicious performance threshold options:
- This field accepts a single float value between 0 and 1 that controls how aggressively One AI will label a column as suspicious. A smaller value increases the likelihood that a column will be labeled suspicious.
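How the two thresholds work together can be sketched as a simple three-way label. This assumes both checks use a strict "higher than" comparison on the same single-feature ROC-AUC score; the function name and return labels are hypothetical.

```python
def classify_column(auc, leakage_threshold=0.85, suspicious_threshold=0.7):
    """Label a column from its single-feature ROC-AUC score."""
    if auc > leakage_threshold:
        return "leakage"      # automatically dropped
    if auc > suspicious_threshold:
        return "suspicious"   # kept, but flagged for review in the EDA report
    return "ok"

labels = {score: classify_column(score) for score in (0.92, 0.78, 0.61)}
```

Because suspicious columns stay in the model, the suspicious threshold should sit below the leakage threshold; otherwise no column could ever be labeled suspicious.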
Dimensionality Reduction
Dimensionality reduction aims to give users control over the process of reducing the number of features or predictor variables in a dataset so that the model can focus on the most relevant features only. A typical One AI ML Model is presented with anywhere from 50 to 500 features, but for most, selecting between 5 and 15 features is the sweet spot, after which the performance and interpretability of the model degrade. One AI optimizes dimensionality through the use of filter and wrapper methods, applied in that order.
Default Filter Methods:
- Filter method: mutual_info
- Filter num features: 5, 10, & 15 are tried and the best result is chosen
Default Wrapper Methods:
- Wrapper method: recursive feature elimination (rfe)
- Wrapper min features: 5
Additional dimensionality reduction options:
- No selection - dimensionality reduction will be disabled.
Additional filter methods:
- Chi2 (chi-squared test)
- F-test (ANOVA)
Additional filter num features options:
- This field accepts a comma-separated list of values that you want the model to try. This is effectively the maximum number of features that the model will select, so it must be equal to or greater than the wrapper min features.
Additional wrapper min features options:
- This field accepts a comma-separated list of values that you want the model to try. This is effectively the minimum number of features that the model will select, so it must be equal to or less than the filter num features.
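The filter step can be illustrated with a minimal mutual information ranking. The sketch assumes discrete features and a discrete target and keeps the top `k` features by score; the wrapper step (recursive feature elimination) is not shown. The function names and example data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def filter_top_k(features, target, k):
    """Rank features by mutual information with the target and keep the best k."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

target = [1, 1, 0, 0, 1, 0, 1, 0]
features = {
    "shift": [0, 0, 0, 0, 1, 1, 1, 1],     # independent of the target
    "remote": [1, 0, 1, 0, 1, 0, 1, 0],    # partially related
    "overtime": [1, 1, 0, 0, 1, 0, 1, 0],  # identical to the target
}
selected = filter_top_k(features, target, k=2)
```

In this picture, the filter num features value plays the role of `k`: it caps how many features survive the filter, and the wrapper method then prunes from that set down to no fewer than the wrapper min features.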
Estimator Configuration
One AI uses several machine learning estimators to make predictions. These estimators have configurable input parameters in the GUI. A list of the available estimators as well as the default settings can be found in the following help article:
One AI Machine Learning Algorithms and Settings
You will notice the black information icon next to each parameter, which indicates what values are accepted. In many cases, a list of values is accepted, even though only a single value is used for each instance of the estimator. This functionality exists so that a user can configure several parameter values at once within a single One AI run and see which performs the best. For example, if you entered two values for learning rate, 0.5 and 1, on an AdaBoostClassifier estimator, two models would be evaluated: one with a learning rate of 0.5 and another with a learning rate of 1. The scores would be compared, with One AI selecting the best performing model.
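The way a list of values expands into separate candidate models can be sketched as a small grid expansion. This is an illustration of the idea only: the function name is hypothetical, and the parameter names are examples rather than a list of what the One AI GUI exposes for AdaBoostClassifier.

```python
from itertools import product

def expand_grid(param_lists):
    """Turn {parameter: [values]} into one settings dict per combination."""
    names = list(param_lists)
    value_lists = [param_lists[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

# Two learning rates and a single value for a second, hypothetical parameter.
adaboost_grid = {"learning_rate": [0.5, 1.0], "n_estimators": [50]}
candidates = expand_grid(adaboost_grid)
# -> [{'learning_rate': 0.5, 'n_estimators': 50},
#     {'learning_rate': 1.0, 'n_estimators': 50}]
```

Each additional list multiplies the number of candidates, so entering three values for each of two parameters would produce nine models to evaluate for that estimator.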
Default classification estimators:
- AdaBoostClassifier
- LightGBMClassifier
- LogisticRegression
- RandomForestClassifier
Default regression estimators:
- ElasticNet
- Lasso
- LightGBMRegressor
- LinearRegression
- RandomForestRegressor
Additional classification estimators:
- DecisionTreeClassifier
- KNeighborsClassifier
- SVC
Additional regression estimators:
- AdaBoostRegressor
- DecisionTreeRegressor
- GaussianProcessRegressor
- HuberRegressor
- Ridge
- SGDRegressor
- SVR
Upsampling
Upsampling is used to address imbalance in datasets and refers to the technique of creating synthetic or duplicate records in order to balance class labels in the target column.
Default upsampling method:
- SMOTE
Default upsampling ratio:
- auto (1.0)
No upsampling option:
- If users want One AI to try no upsampling alongside other methods, they can check the box under none as well as the boxes for the other methods they would like tried; whichever option provides the best fit and performance will be used.
- If the user wants to disable upsampling, they should check the box under none and select no other methods or ratios.
Additional upsampling method options:
- Users can manually select which method(s) they want One AI to try:
  - ADASYN
  - SMOTE
  - SMOTENC
Additional upsampling ratio options:
- This field accepts a comma separated list of single float values or strings of text.
- Users can enter the ratio(s) they want One AI to try, either as a decimal value for the ratio of minority class to majority class or as an accepted text string.
  - 1.0 - minority class will be upsampled to the size of the majority class; 50/50 ratio (equal number)
  - 0.5 - minority class will be upsampled to half the size of the majority class
    - Ex: a ratio of 0.5 will result in generating enough synthetic examples such that num_minority / num_majority = 0.5
  - auto - upsample all classes except the majority class
  - majority - same as auto
  - minority - upsample only the minority class
  - not minority - upsample all classes but the minority class
  - not majority - upsample all classes but the majority class
  - all - upsample all classes
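The arithmetic behind a numeric ratio can be sketched as follows for a two-class target. It follows the num_minority / num_majority definition given above; the function name and class counts are hypothetical.

```python
def synthetic_rows_needed(n_minority, n_majority, ratio=1.0):
    """Rows to add so that num_minority / num_majority reaches the requested ratio."""
    target_minority = int(ratio * n_majority)
    # Never remove rows: if the ratio is already met, nothing is generated.
    return max(target_minority - n_minority, 0)

# Hypothetical dataset: 100 terminations and 900 active employees.
full_balance = synthetic_rows_needed(100, 900, ratio=1.0)   # 800
half_balance = synthetic_rows_needed(100, 900, ratio=0.5)   # 350
already_met = synthetic_rows_needed(100, 900, ratio=0.1)    # 0
```

Lower ratios add fewer synthetic rows, which can be useful when a fully balanced dataset would be dominated by generated records.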
Per Column Interventions
The configuration options described so far are applied across the entire model dataset. The Per Column Interventions configurations allow you to configure your data on a per-column basis, involving actions such as preprocessing, cleaning, transforming, or handling missing values.
There is no default behavior for per column interventions because they must be manually configured for each column desired.
First, users should add the columns they wish to configure a per column intervention for by clicking the “Column Interventions” dropdown, selecting a column from the Add Column Intervention dropdown menu and clicking the blue “+”. You may add as many columns as you like. Use the red trash icon to delete a column.
Per column intervention options:
- Only use specified columns - check this box to use only the added columns for evaluating model performance. It is much easier to manually select the model columns from the One AI query builder core attributes step, but you can also do it here.
- Droppable - modify the column’s droppability
  - Droppable - column can be dropped by One AI but won’t necessarily be dropped. It will go through the feature selection process and model build, and if One AI finds that it’s a good feature that leads to better performance and fit than other features, it will be selected. If not, it will not be selected. It will also be dropped if it does not meet other global settings for the model. This is the default treatment for all columns included in the model.
  - Not droppable - column cannot be dropped from the model by One AI. This doesn’t necessarily mean the column will be selected for a given model, but it will always be considered, regardless of whether it meets the requirements from the global settings of the model.
  - Always - column will always be dropped by One AI; it will not be tested in any way.
- Null Filling - configure how you want One AI to handle nulls for the selected column(s)
  - Null fill strategies (also known as imputation strategies):
    - Mean - the mean of a continuous column will be used to fill the null values. This cannot be used on a categorical column.
    - Median - the median of a continuous column will be used to fill the null values. This cannot be used on a categorical column.
    - Mode - the mode (or highest occurrence) will be used to fill the null values.
    - Bfill - the null values will be filled starting from the last value, working toward the first value; any null will be filled with the next populated value.
    - Ffill - opposite of bfill; the null values will be filled starting from the first value, working towards the last value; any null will be filled with the previous populated value.
    - Pad - same as ffill.
    - Custom - use a custom fill value to fill all null values.
  - Custom fill value - check the “Override” box and enter any free text value (number, word, date, etc.) in the designated field; this value will be used to fill the nulls.
- Type-Specific Interventions
  - None - no type-specific intervention will be performed.
  - Categorical - accepts free text; multiple values should be separated by commas.
    - Exclude - enter the name(s) of values to exclude from being considered as features.
    - Select - enter the name(s) of values that the model will then force select as features.
  - Continuous
    - Force Select - users can force this variable to be selected as a feature by clicking the force select checkbox.
    - Continuous strategy - users can change the continuous strategy for this specific column to bin or scale. Scale is the default if this was not changed in the global settings. If changed to bin, users can also change the bin type here by inputting free text.
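The null fill strategies listed above can be sketched on a plain Python list, with `None` standing in for null. This is an illustration of what each strategy name conventionally means, not One AI's implementation; the function name and example column are hypothetical.

```python
import statistics

def fill_nulls(values, strategy, custom=None):
    """Fill None entries using one of the named strategies."""
    present = [v for v in values if v is not None]
    if strategy in ("mean", "median", "mode"):
        fill = getattr(statistics, strategy)(present)
        return [fill if v is None else v for v in values]
    if strategy == "custom":
        return [custom if v is None else v for v in values]
    if strategy in ("ffill", "pad"):
        filled, last = [], None
        for v in values:
            last = v if v is not None else last  # carry the previous populated value
            filled.append(last)
        return filled
    if strategy == "bfill":
        # Backward fill is a forward fill over the reversed column.
        return fill_nulls(values[::-1], "ffill")[::-1]
    raise ValueError(f"unknown strategy: {strategy}")

salary_band = [10, None, 20, 20, None, 40]
forward = fill_nulls(salary_band, "ffill")   # [10, 10, 20, 20, 20, 40]
backward = fill_nulls(salary_band, "bfill")  # [10, 20, 20, 20, 40, 40]
```

Note that a forward fill cannot fill a null in the first row and a backward fill cannot fill a null in the last row, because there is no neighboring populated value to copy in that direction.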
Conclusion
One AI has many advanced configuration options - including global settings, dimensionality reduction, estimator configuration, upsampling, and per column interventions - that users can take advantage of in order to enhance model performance and have full control over how their models run. Remember, if you do not wish to perform manual configurations or do not have the know-how, One AI has default settings based on best practices and will try an intelligent subset of the available options and select the combination that results in the best model performance and fit.
As we continue to develop One AI, more features will be available through this advanced interface, and additional functionality will be explained in this article.