The document located here demonstrates how to build a basic model using OneAI. We strongly recommend that you be comfortable with the basics before attempting the more advanced configurations.

This document is a guide to configuring the more advanced Global Settings. It is not a full explanation of the machine learning concepts associated with each setting.

Once you select your Dataset ID, Sample Date and Estimator Target, you will notice several configurable options in the dropdown menus underneath the initial configuration.

For each configuration, you will notice an “Override” checkbox.

Each advanced configuration in OneAI has a default setting to minimize the amount of work necessary to produce a working model. The “Override” checkbox is provided in case you would like to configure a setting on your own.

GLOBAL SETTINGS

The configurations listed under the Global dropdown menu affect all columns at once. If you are interested in setting configurations on a per-column basis, please see the Per Column Interventions section below.

The Global settings are listed below:

Continuous Strategy:

Check the “Override” box, then select either “bin” or “scale” from the dropdown menu.

The “bin” selection will divide your continuous data into the number of bins specified in “Bin Type.”

The black information icon clarifies what type of input is accepted for any free field. In this case Bin Type accepts a single integer corresponding to the number of bins you would like to divide the continuous data into.

If you instead select “scale” from the dropdown menu, you have the option to select which scaling type to use.
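As a rough illustration of the difference between the two strategies, the sketch below bins a list of values into equal-width bins and, separately, rescales the same values to the range 0 to 1. Equal-width binning and min-max scaling are assumptions chosen for illustration; this document does not specify which binning rule or scaling types OneAI applies internally, and the function names are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would otherwise land one past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [22, 25, 31, 38, 44, 52, 60]
binned = equal_width_bins(ages, n_bins=3)   # each age becomes a bin index
scaled = min_max_scale(ages)                # each age becomes a float in [0, 1]
```

Binning replaces each value with a coarse group label, while scaling keeps the ordering and relative spacing of the original values.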

Category Size Threshold:

This field accepts a single float corresponding to the minimum fraction of a categorical column's populated values that a category must account for in order to be one-hot encoded into its own column. For example, if a category must account for at least 1/10 of the populated values, the input value would be 0.1.

Any values that don't meet the minimum will be one-hot encoded into a shared "_other" column.
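The following is a minimal sketch of this behavior, assuming the threshold is compared against each value's share of the rows in the column. The function name and the sample data are hypothetical and are not taken from OneAI.

```python
from collections import Counter


def one_hot_with_other(values, threshold):
    """One-hot encode a categorical column, pooling rare values into '_other'.

    A value keeps its own column only if it makes up at least `threshold`
    (a fraction between 0 and 1) of the rows.
    """
    counts = Counter(values)
    total = len(values)
    kept = sorted(v for v, c in counts.items() if c / total >= threshold)
    columns = kept + ["_other"]
    rows = []
    for v in values:
        label = v if v in kept else "_other"
        rows.append({col: int(col == label) for col in columns})
    return columns, rows


# "sales" is 60% of rows, "eng" is 30%, "legal" is 10%.
departments = ["sales"] * 6 + ["eng"] * 3 + ["legal"]
columns, rows = one_hot_with_other(departments, threshold=0.2)
```

With a threshold of 0.2, "legal" falls below the minimum, so its row is flagged in the "_other" column rather than receiving a column of its own.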

Cheating Value Mean Threshold:

This field accepts a float between 0 and 1 that controls how aggressively OneAI will label a column as cheating. The cheating value mean threshold limits how much the estimator target can vary within a column before that column is discarded as too predictive. A smaller value increases the likelihood that a column will be dropped.

Correlation Type:

During a OneAI run, a correlation check is performed to determine whether each column is too highly correlated with the target column. You can select which type of correlation check is performed using the dropdown menu, or select “None” if you don’t want a correlation check to be performed.

General Correlation Threshold:

This field accepts a single float corresponding to the minimum correlation value necessary to consider two columns correlated. A value closer to 1 indicates strong correlation, while a value closer to 0 indicates little or no correlation.
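To make the threshold concrete, here is a minimal sketch that computes a Pearson correlation by hand and compares its absolute value to a threshold. Pearson correlation and the use of the absolute value are assumptions for illustration; the correlation types actually offered in the dropdown, and how OneAI applies the threshold, are not described in this document. The column names and data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def is_correlated(xs, ys, threshold):
    """Treat two columns as correlated when |r| meets the threshold (assumed rule)."""
    return abs(pearson(xs, ys)) >= threshold


tenure = [1, 2, 3, 4, 5, 6]
salary = [40, 44, 47, 53, 55, 61]   # rises steadily with tenure
badge_id = [7, 3, 9, 1, 8, 4]       # no clear relationship to tenure

strongly_related = is_correlated(tenure, salary, threshold=0.8)
weakly_related = is_correlated(tenure, badge_id, threshold=0.8)
```

Raising the threshold toward 1 means fewer column pairs are treated as correlated; lowering it toward 0 means more are.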

Lime Disabled:

By default, LIME explanations are turned off. If you would like to include LIME explanations in your Results Summary report, check the “Override” box and then leave the “Lime Disabled” box unchecked as demonstrated below. If you choose to have LIME explanations, the results summary will include an additional column with per-record explanations.

Null Drop Threshold:

This field accepts a single float corresponding to the maximum fraction of null values allowed in a column. Any column with more nulls than the allowed threshold will be dropped. For example, if the Null Drop Threshold is set to 0.3, any column that has more than 30% nulls will be dropped.
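The rule can be sketched as follows, assuming a column is dropped only when its fraction of nulls is strictly greater than the threshold. The function name and the sample table are hypothetical; `None` stands in for a null value.

```python
def columns_to_drop(table, null_drop_threshold):
    """Return the names of columns whose fraction of nulls exceeds the threshold.

    `table` maps column name -> list of values, with None standing in for null.
    """
    dropped = []
    for name, values in table.items():
        null_fraction = sum(v is None for v in values) / len(values)
        if null_fraction > null_drop_threshold:
            dropped.append(name)
    return dropped


table = {
    "age":    [34, None, 29, 41, 38, 50, 27, 33, 45, 39],             # 10% null
    "bonus":  [None, None, 500, None, None, 250, None, 100, 0, 75],   # 50% null
    "region": ["N", "S", None, "E", None, "W", None, "N", "S", "E"],  # 30% null
}

dropped = columns_to_drop(table, null_drop_threshold=0.3)
```

Here only "bonus" is dropped: "region" sits exactly at 30% nulls, which does not exceed the threshold.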

Null Indicator:

This field accepts a free-text value that is inserted into binned continuous features to mark values that were originally null but had to be filled so that bin edges could be applied.

Random State:

This field accepts a single integer corresponding to a desired random seed value. Model creation is deterministic: a given random seed will always produce the same results with the same data. It may still be useful to change the random state between runs to confirm that the results are similar and not due to random chance.
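The effect of a fixed seed can be illustrated with Python's standard `random` module. This is a generic sketch of seeded determinism, not OneAI's internal code; the function name and the train/test split are hypothetical stand-ins for any randomized step in a run.

```python
import random


def shuffled_split(rows, seed, test_fraction=0.25):
    """Shuffle rows with a seeded generator and split them into train and test."""
    rng = random.Random(seed)       # local generator, so global state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))
train_a, test_a = shuffled_split(rows, seed=42)
train_b, test_b = shuffled_split(rows, seed=42)   # same seed: identical split
train_c, test_c = shuffled_split(rows, seed=7)    # different seed: usually a different split
```

Comparing results across a few different seeds is a simple way to check that a model's score does not depend on one lucky split.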

Shift Analysis:

Shift analysis measures how much the distribution of a column changes between two given time periods. The float between 0 and 1 controls the amount of shift that is allowed before a column is dropped for being unstable. A higher value indicates more tolerance of shifted features, while a lower value will result in a higher likelihood of dropping a column.
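This document does not state which shift metric OneAI uses. Purely as an illustration of the idea, the sketch below measures shift with the total variation distance between the value distributions of two periods, which ranges from 0 (identical) to 1 (no overlap), and flags a column when that distance exceeds the threshold. The metric, function names, and data are all assumptions.

```python
from collections import Counter


def total_variation_distance(period_a, period_b):
    """Half the summed absolute difference between two value distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    freq_a, freq_b = Counter(period_a), Counter(period_b)
    categories = set(freq_a) | set(freq_b)
    return 0.5 * sum(
        abs(freq_a[c] / len(period_a) - freq_b[c] / len(period_b))
        for c in categories
    )


def is_unstable(period_a, period_b, shift_threshold):
    """Flag a column whose measured shift exceeds the allowed threshold."""
    return total_variation_distance(period_a, period_b) > shift_threshold


last_year = ["full-time"] * 8 + ["contract"] * 2   # 80% / 20%
this_year = ["full-time"] * 5 + ["contract"] * 5   # 50% / 50%

shift = total_variation_distance(last_year, this_year)
```

In this example the shift is about 0.3, so the column would be kept with a tolerant threshold such as 0.5 and flagged as unstable with a strict threshold such as 0.1.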

DIMENSIONALITY REDUCTION CONFIGURATIONS

This configuration automatically selects a number of features from your dataset according to a feature selection test. Use the “Feature Selection Test” dropdown to choose your preferred methods for feature selection. You may select multiple tests and you will see them populate in the field as demonstrated below:

When you have selected the desired feature selection tests, enter a comma-separated list of integers in the “K Features Selected” field, corresponding to the number of features you would like selected. Each number will be applied and considered during model performance evaluation. The string “all” is also an accepted value, indicating that you would like to include all features for consideration.

All combinations of tests and k features will be used during model performance evaluation. For example, using the configuration above would result in 8 separate model evaluations: a model will be created with features selected via f_classif using 5 features, f_classif using 10 features, and so on, then mutual_info_classif using 5 features, mutual_info_classif using 10 features, and so on. Selecting “all” effectively disables feature selection for that evaluation.
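The way the combinations multiply can be sketched with `itertools.product`. The test names and the k values 5 and 10 come from the example in the text; the remaining k values (20 and "all") are assumptions added to reach four values, since the original configuration screenshot is not reproduced here.

```python
from itertools import product

feature_selection_tests = ["f_classif", "mutual_info_classif"]
k_features_selected = [5, 10, 20, "all"]   # 20 and "all" are illustrative values

# Every test is paired with every k, giving len(tests) * len(ks) evaluations.
evaluations = list(product(feature_selection_tests, k_features_selected))

for test_name, k in evaluations:
    print(f"select features with {test_name}, k={k}")
```

Two tests and four k values give eight evaluations, so each additional test or k value increases the run time accordingly.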

ESTIMATOR CONFIGURATION

OneAI uses several machine learning estimators to make predictions. These estimators have configurable input parameters in the GUI. For each configuration listed below, there is a link to the sklearn or CatBoost documentation that describes the purpose of each input parameter.

You will notice the black information icon next to each parameter indicates what values are accepted. In many cases, a list of values is accepted, even though only a single value of each parameter is used for any one instance of the estimator. This functionality exists so that a user can configure several parameter values at once within a single OneAI run and see which estimator performs the best. For example, if you used the following configuration with two values for learning rate, two models will be evaluated: one with an AdaBoostClassifier estimator that has a learning rate of 0.5 and another with a learning rate of 1. The scores would be compared, with OneAI selecting the best performing model.
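The expansion of value lists into individual candidate models can be sketched as below. This is a generic illustration of a parameter grid, not OneAI's implementation; the function name is hypothetical, and `n_estimators` is included only to show how a single-valued parameter combines with a multi-valued one.

```python
from itertools import product


def expand_parameter_grid(grid):
    """Turn {name: [values]} into one {name: value} dict per combination."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


# Two learning rates, as in the example above; n_estimators is an illustrative addition.
ada_boost_grid = {"learning_rate": [0.5, 1.0], "n_estimators": [50]}
candidates = expand_parameter_grid(ada_boost_grid)

for params in candidates:
    print(params)   # each dict describes one estimator instance to evaluate
```

Each dictionary corresponds to one estimator instance, so listing two learning rates yields two models to score and compare.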

CLASSIFICATION ESTIMATORS:

AdaBoostClassifier:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

CatBoostClassifier:

https://catboost.ai/docs/concepts/python-reference_parameters-list.html

DecisionTreeClassifier:

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

KNeighborsClassifier:

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

LogisticRegression:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

RandomForestClassifier:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

SVC:

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

REGRESSION ESTIMATORS:

ElasticNet:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html

Lasso:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

UPSAMPLING

Upsampling creates synthetic records in order to balance class labels in the target column. This is useful when your classifier target is severely unbalanced, which may affect model fitting. As with feature selection, every combination of upsampling method and ratio will be applied and considered individually.

If you would like no upsampling to be considered along with the other upsampling combinations, check the “none” box demonstrated below:

If you do not check “none,” then all of the evaluated runs will be completed using upsampling.

Use the dropdown menu to select the appropriate upsampling method for your dataset. Be aware that the various methods have different data requirements, so you may receive a run error if you select an upsampling type that is not compatible with your data.

The “ratio” form can be populated with a float indicating the desired class balance ratio, or with the strings “auto” or “not majority.” In most cases, the “auto” option will default to “not majority.”
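To show what a float ratio means in practice, here is a minimal sketch for a binary target. It assumes the ratio is the minority count divided by the majority count after resampling, and it uses simple random duplication with replacement rather than generating synthetic records, so it is a simplified stand-in for the methods in the dropdown. The function name and data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, ratio=1.0, seed=0):
    """Duplicate minority-class rows until minority/majority count reaches `ratio`.

    Binary targets only. Rows are resampled with replacement, not synthesized.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    (majority, n_major), (minority, n_minor) = counts.most_common(2)
    target = int(n_major * ratio)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    extra = [rng.choice(minority_rows) for _ in range(max(0, target - n_minor))]
    return rows + extra, labels + [minority] * len(extra)


rows = [{"id": i} for i in range(10)]
labels = [0] * 8 + [1] * 2           # 8 majority records, 2 minority records
new_rows, new_labels = random_oversample(rows, labels, ratio=1.0)
```

A ratio of 1.0 brings the minority class up to the same count as the majority class, while a ratio of 0.5 would stop at half the majority count.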

PER COLUMN INTERVENTIONS

The configuration options described so far are applied across the entire dataset. The Per Column Interventions configurations allow you to configure your data on a per-column basis.

To use Per Column Interventions, you will add a series of columns in the steps described below. If you would like to use ONLY the added columns for evaluating model performance, select the “Only Use Specified Columns” checkbox.

To add a column, click the “Column Interventions” dropdown, and then select a column from the Add Column Intervention dropdown menu. Click the blue “+” to add that column. You may add as many columns as you like. Use the red trash icon to delete a column.

Once you have added the columns desired, you can now configure those columns by clicking on the column name dropdown to see the options.

Droppable:

OneAI will perform various correlation checks, cheating tests, column validations and other analysis to determine whether or not to drop a column. You can use the “Droppable” configuration to determine how OneAI handles dropping the column.

The “Droppable” dropdown menu allows you to select one of three options for the specific column:

is droppable: the column can be dropped by OneAI but won’t necessarily be dropped.

not droppable: the column cannot be dropped by OneAI. This doesn’t necessarily mean the column will be selected for a given model, but it will always be considered.

always: the column will always be dropped by OneAI.

Null Fill:

If you would like to configure how to fill nulls in your selected column, click the “Null Fill” dropdown tab. This gives you the option to configure a specific Strategy or you can provide a Custom Fill Value that will populate the nulls with a value you supply.

To select a null fill strategy, click the “Override” box for “Strategy” and select one of the options from the dropdown menu:

mean – the mean of a continuous column will be used to fill the null values. This cannot be used on a categorical column.

median – the median of a continuous column will be used to fill the null values. This cannot be used on a categorical column.

mode – the mode (or highest occurrence) will be used to fill the null values.

bfill – The null values will be filled starting from the last value, working toward the first value. Any null will be filled with the next populated value.

ffill – This is the opposite of bfill. The null values will be filled starting from the first value, working towards the last value. Any null will be filled with the previous populated value.

pad – Another term for ffill.

custom – Use a custom fill value. Configuring the custom value is described below.

To add a custom fill value, simply check the “Override” box and enter any free text value in the form below. This is the value that will be used to fill the nulls.
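The strategies above can be sketched in plain Python as follows. This is an illustrative reading of each strategy's description, not OneAI's implementation; the function name is hypothetical and `None` stands in for a null value.

```python
from statistics import mean, median, mode


def fill_nulls(values, strategy, custom_value=None):
    """Return a copy of `values` with None entries filled using `strategy`."""
    present = [v for v in values if v is not None]
    if strategy in ("mean", "median", "mode"):
        fill = {"mean": mean, "median": median, "mode": mode}[strategy](present)
        return [fill if v is None else v for v in values]
    if strategy == "custom":
        return [custom_value if v is None else v for v in values]
    if strategy in ("ffill", "pad"):
        ordered = values
    elif strategy == "bfill":
        ordered = values[::-1]          # a backward fill is a forward fill in reverse
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    filled, last = [], None
    for v in ordered:
        last = v if v is not None else last
        filled.append(last)
    return filled if strategy != "bfill" else filled[::-1]


scores = [None, 4, None, None, 10, 4, None]
forward = fill_nulls(scores, "ffill")
backward = fill_nulls(scores, "bfill")
```

Note that in this sketch a forward fill leaves a leading null unfilled and a backward fill leaves a trailing null unfilled, because there is no previous or next populated value to copy.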

As we continue to develop OneAI, more features will be available through this advanced interface, and additional functionality will be explained in this document.
