This article covers the methods One AI uses for preprocessing (cleaning and preparing data), the default settings, and how to override those settings.
Data cleaning and preparation, also known as preprocessing, happens before any machine learning occurs in One AI. The Exploratory Data Analysis (EDA) report is your window into what happens when preprocessing is performed. One AI does not consider treatments such as upsampling to be part of preprocessing because they are performed as part of the subsequent machine learning process.
Scaling (normalization) is a mathematical transformation that shifts the range of a continuous (numeric) variable so multiple features will be on the same scale and won't get incorrectly weighted by the algorithm.
By default, One AI performs linear scaling on numeric variables, which uses a combination of subtraction and division to replace the original value with a number between -1 and +1 or between 0 and 1. Scaled variables can be identified by the tag Scaled in the EDA report in the Variable Status section.
To change the scaling type or bin the data instead, override this setting by navigating to Data > Augmentations > Edit (for a specific augmentation) > One AI Configuration > Global > Continuous Strategy.
Strategy: the default is scale
Bin Type: the default is standard
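The linear scaling described above can be illustrated with a minimal sketch. It assumes a min-max formula, which is one common way to combine subtraction and division; the exact formula One AI applies is not stated here, and the function name `scale_linear` is hypothetical.

```python
def scale_linear(values, low=0.0, high=1.0):
    """Min-max scale numeric values into the range [low, high].

    Assumption: a min-max formula. The exact formula One AI uses is not
    documented in this article.
    """
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # A constant column has no spread; map every value to the midpoint.
        return [(low + high) / 2 for _ in values]
    # Subtract the minimum, divide by the range, then stretch to [low, high].
    return [low + (v - v_min) / span * (high - low) for v in values]


salaries = [40000, 55000, 70000, 100000]
unit = scale_linear(salaries)            # values between 0 and 1
signed = scale_linear(salaries, -1, 1)   # values between -1 and +1
print(unit)
print(signed)
```

After scaling, a salary column and a tenure column occupy the same range, so neither dominates purely because of its units.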
Variables formatted as dates are converted to the difference from the sample date and are then scaled as a numeric variable would be. This equates to more recent dates being a higher value and earlier dates being a lower value.
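As a rough illustration of the date conversion, the sketch below turns dates into a signed difference from a sample date. Measuring the difference in days is an assumption, as is the helper name `days_from_sample_date`; the resulting numbers would then be scaled like any other numeric variable.

```python
from datetime import date


def days_from_sample_date(value, sample_date):
    """Signed difference in days; more recent dates give higher values."""
    return (value - sample_date).days


sample_date = date(2024, 1, 1)
hire_dates = [date(2015, 6, 1), date(2020, 6, 1), date(2023, 12, 1)]
offsets = [days_from_sample_date(d, sample_date) for d in hire_dates]
print(offsets)  # earlier dates are more negative, recent dates closer to 0
```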
One Hot Encoding
One Hot Encoding involves splitting out each value in a categorical variable into its own binary column. Machine learning algorithms require this to be able to interpret categorical data without ordinality. One Hot Encoded variables can be identified by the tag One Hot Encoded in the EDA report in the Variable Status section. The following article explains One Hot Encoding in more detail: https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/.
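A minimal sketch of the idea in plain Python follows. The `is_` column prefix and the function name are illustrative choices, not One AI naming.

```python
def one_hot_encode(values):
    """Split a categorical column into one binary column per distinct value."""
    categories = sorted(set(values))
    return {
        f"is_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


departments = ["Sales", "HR", "Sales", "Engineering"]
encoded = one_hot_encode(departments)
for column, flags in encoded.items():
    print(column, flags)
```

Each row has exactly one 1 across the new columns, so no ordering between departments is implied.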
Variables with too high a percentage of NULL or missing values (the threshold is 5% by default) are dropped and not used by the model. This can be identified in the EDA report in the Variable Status section. Variables dropped due to too many NULL values are assigned a tag of Missing.
To prevent variables from being dropped due to missing values, NULL filling can be applied to the data. There are a variety of methods for NULL filling available in One AI. View Refining a Machine Learning Model in One AI to learn more.
To override the 5% default, navigate to Data > Augmentations > Edit (for a specific augmentation) > One AI Configuration > Global > Null Drop Threshold. Check the Override box and enter a decimal value (the default is 0.05).
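The effect of the Null Drop Threshold can be sketched as follows. The function name and the table layout are hypothetical, and treating the threshold as a strict "greater than" comparison is an assumption; the article does not say how a share of exactly 0.05 is handled.

```python
def columns_to_drop(table, null_drop_threshold=0.05):
    """Return the names of columns whose share of missing values is too high.

    Assumption: a column is dropped when its missing share is strictly
    greater than the threshold.
    """
    dropped = []
    for name, values in table.items():
        missing_share = sum(v is None for v in values) / len(values)
        if missing_share > null_drop_threshold:
            dropped.append(name)
    return dropped


table = {
    "age": [30] * 40,                      # 0% missing
    "bonus": [None] * 6 + [1000] * 34,     # 15% missing
    "tenure": [None] + [5] * 39,           # 2.5% missing
}
print(columns_to_drop(table))         # default threshold of 0.05
print(columns_to_drop(table, 0.20))   # a more permissive override
```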
Data Leakage Detection
Leakage in a variable means that the target for the model (the thing the model is predicting) is likely “leaking” data into the variable. In simple terms, it’s a variable that predicts the outcome too well to be plausible. An example is a flag in the data that indicates whether someone terminated in the future when termination is also what the model is predicting. Like a student, a model isn’t actually learning if you give it the answer up front.
Leakage is identified in One AI by generating a random forest model against the target using only that feature and then measuring the performance using an ROC-AUC score. By default, a score above 0.85 is considered leakage, and the variable will be dropped and labeled as Leakage on the EDA report. A score above 0.70 but not above 0.85 is considered suspicious; the variable will not be dropped but will be labeled as Suspicious on the EDA report.
To override the defaults, navigate to Data > Augmentations > Edit (for a specific augmentation) > One AI Configuration > Global. Check the Override box and enter a decimal value for either of the following settings:
Leakage Performance Threshold: the default is 0.85
Suspicious Performance Threshold: the default is 0.7
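The two thresholds can be illustrated with a small sketch. To stay dependency-free it does not fit a random forest; the `scores` lists stand in for the predicted probabilities that a single-feature model would produce, and the ROC-AUC is computed directly from its pairwise definition. The function names and the "OK" label are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def leakage_status(auc, leakage_threshold=0.85, suspicious_threshold=0.70):
    """Map a single-feature ROC-AUC score to a status label."""
    if auc > leakage_threshold:
        return "Leakage"      # variable would be dropped
    if auc > suspicious_threshold:
        return "Suspicious"   # variable kept but flagged
    return "OK"               # hypothetical label for everything else


terminated = [0, 0, 0, 0, 1, 1, 1, 1]
# Stand-in scores for a feature that separates the classes perfectly.
leaky_scores = [0.1, 0.2, 0.1, 0.3, 0.9, 0.8, 0.95, 0.7]
# Stand-in scores for a feature with only a weak relationship to the target.
weak_scores = [0.2, 0.6, 0.4, 0.8, 0.5, 0.7, 0.9, 0.3]

for scores in (leaky_scores, weak_scores):
    auc = roc_auc(terminated, scores)
    print(round(auc, 3), leakage_status(auc))
```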
Correlated Feature Reduction
Correlated features are attributes containing data that is more or less the same. An example is "date of birth" and "age". Selecting both as features doesn’t affect the performance of the model, but it makes the results more difficult to interpret. It’s not useful to know that both "date of birth" and "age" are important since they’re effectively the same thing.
When correlated variables are detected during preprocessing in One AI, all but one will be ignored. Correlated variables can be identified by the tag Correlated in the EDA report in the Variable Status section.
By default, correlations are determined in One AI by using Cramer’s V with a threshold of 0.65. To override the defaults, navigate to Data > Augmentations > Edit (for a specific augmentation) > One AI Configuration > Global. Check the Override box and select or enter a value for either of the following settings:
Correlation Type: the default is cramers
General Correlation Threshold: the default is 0.65
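For reference, Cramér's V between two categorical variables can be computed from the chi-squared statistic of their contingency table. The sketch below uses the plain (uncorrected) formula and banded example values; whether One AI applies a bias correction, and how it handles continuous variables before comparing them, is not stated here.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    pair_counts = Counter(zip(xs, ys))
    chi2 = 0.0
    for x, x_n in x_counts.items():
        for y, y_n in y_counts.items():
            expected = x_n * y_n / n          # count expected if independent
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    if k == 0:
        return 0.0  # a single-valued variable cannot be associated with anything
    return sqrt(chi2 / (n * k))


age_band = ["20s", "30s", "40s", "20s", "30s", "40s"]
birth_decade = ["1990s", "1980s", "1970s", "1990s", "1980s", "1970s"]
v = cramers_v(age_band, birth_decade)
print(round(v, 3), "correlated" if v > 0.65 else "not correlated")
```

Here the two variables carry the same information, so the score sits well above the 0.65 default and only one of them would be kept.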
There is an optimal number of features to consider for a machine learning model, after which the performance of the model degrades. On the surface this might seem counterintuitive since it’s logical to think that more features would be better. As dimensionality increases, however, the amount of data and computational power needed increases exponentially. This is known as the “curse of dimensionality”. This article explains the principle in more detail: https://towardsdatascience.com/curse-of-dimensionality-a-curse-to-machine-learning-c122ee33bfeb.
One AI optimizes dimensionality through the use of filter and wrapper methods. First, the number of features is limited to a certain number using filter methods (by default it tries 5, 10, and 15). This treatment looks at the general characteristics of the features and chooses the ones deemed to be the best using a univariate test. One AI then applies a wrapper method to the filtered features and further prunes them using recursive feature elimination (a minimum of 5 is the default). A wrapper method leverages a predictive model (random forest specifically) to score each combination of features and choose the best combination. It is more computationally expensive than a filter method, hence the strategy of filtering first. If you haven’t applied any overrides, the end result will be a number of selected features between 5 and 15.
Both method and number can be overridden for either Filter Method or Wrapper Method. To override the defaults, navigate to Data > Augmentations > Edit (for a specific augmentation) > One AI Configuration > Dimensionality Reduction. Check the Override box and select or enter a value for any of the following settings:
No Selection: this setting disables dimensionality reduction
Filter Methods: Method: the default is mutual_info and multiple methods can be selected
Filter Methods: Num Features: by default, 5, 10, and 15 are attempted and the best result is chosen
Wrapper Methods: Min Features: the default is 5 and the number here must be <= the Num Features defined
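The relationship between these settings can be sketched in a few lines. This is only an illustration of the filter step and of the Min Features constraint, not of the wrapper step itself; the function names and the example scores are hypothetical, and reading the constraint as "Min Features must not exceed any Num Features candidate" is an assumption.

```python
def filter_top_k(univariate_scores, k):
    """Filter step: keep the k features with the highest univariate score."""
    ranked = sorted(univariate_scores, key=univariate_scores.get, reverse=True)
    return ranked[:k]


def check_min_features(num_features, min_features):
    """Reject a wrapper minimum that exceeds any filter candidate size."""
    too_small = [k for k in num_features if min_features > k]
    if too_small:
        raise ValueError(
            f"Min Features {min_features} exceeds Num Features {too_small}"
        )


# Made-up univariate scores for 20 candidate features.
scores = {f"feature_{i:02d}": i / 20 for i in range(20)}
num_features, min_features = (5, 10, 15), 5

check_min_features(num_features, min_features)
candidates = {k: filter_top_k(scores, k) for k in num_features}
for k, names in candidates.items():
    print(k, names[:3], "...")
```

With the defaults, the wrapper step would then prune each candidate set down to no fewer than 5 features, which is why the final count falls between 5 and 15.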
Per Column Interventions
One AI allows for configuration of preprocessing for individual variables in addition to global settings. There are a variety of treatments you can apply here. These include the following:
- Force attributes to be included or excluded as features or potential features
- Apply NULL filling to individual features
  - Null Fill: Strategy: possible selections are mean, median, mode, bfill, ffill, pad, custom
  - Null Fill: Custom Fill Value: enter a string or numeric value to replace nulls with
  - Detailed documentation can be found here: Apply NULL filling to individual features
- Type-Specific Interventions
  - Categorical Interventions:
    - Exclude: enter the name(s) of values (multiple values should be separated by commas) to exclude from being considered as features
    - Select: enter the name(s) of values (multiple values should be separated by commas) to force select as features
  - Continuous Interventions:
    - Force Select: force this variable to be selected as a feature
    - Strategy: bin, scale (scale is default behavior)
    - Bin Type: auto
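The Null Fill strategies listed above can be illustrated with a small sketch. The strategy names follow the list, but the behavior shown is an assumption based on their usual meaning (for example, pad is treated as a synonym for ffill, as it is in pandas); the function names are hypothetical.

```python
from statistics import mean, median, mode


def carry_forward(values):
    """Replace each None with the most recent non-missing value before it."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # a leading None stays None
    return filled


def null_fill(values, strategy, custom_fill_value=None):
    """Fill None entries in a column using the named strategy."""
    present = [v for v in values if v is not None]
    if strategy in ("mean", "median", "mode"):
        fill = {"mean": mean, "median": median, "mode": mode}[strategy](present)
        return [fill if v is None else v for v in values]
    if strategy == "custom":
        return [custom_fill_value if v is None else v for v in values]
    if strategy in ("ffill", "pad"):   # assumption: pad behaves like ffill
        return carry_forward(values)
    if strategy == "bfill":
        # Back fill is forward fill applied to the reversed column.
        return carry_forward(values[::-1])[::-1]
    raise ValueError(f"unknown strategy: {strategy}")


ratings = [2, None, 4, 4, None, 6]
for strategy in ("mean", "ffill", "bfill"):
    print(strategy, null_fill(ratings, strategy))
print("custom", null_fill(ratings, "custom", custom_fill_value=0))
```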