What does One AI mean when it marks a column as a “suspicious” variable on the EDA Report? How does this differ from "leakage"? What are the default thresholds?


Suspicious variables are those in which possible data leakage is detected. Data leakage means that information about the model's target (the thing the model is predicting) is likely “leaking” into the variable. In simple terms, it’s a variable that predicts the outcome too well to be plausible. An example would be a flag in the data that indicates whether someone terminated in the future, when termination is also the thing the model is predicting. Like a student, a model isn’t actually learning if you give it the answer up front.

One AI identifies leakage by training a random forest model against the target using only that feature and then measuring its performance with an ROC-AUC score. A score above 0.85 is considered leakage, and the variable is automatically dropped. You will know that a variable was dropped due to leakage because it will have a grey “Leakage” label next to it on the EDA report.
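To make the single-feature test concrete, here is a minimal sketch of the idea in Python using scikit-learn. It is not One AI's actual implementation: the function name `single_feature_auc`, the 70/30 train/test split, the number of trees, and the synthetic data are all assumptions made for illustration. It only shows how a feature that duplicates the target scores far higher than an unrelated one.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def single_feature_auc(feature, target, seed=0):
    """Fit a random forest on ONE feature and return its held-out ROC-AUC.

    Hypothetical helper: the split ratio and forest size are illustrative
    choices, not documented One AI settings.
    """
    X = np.asarray(feature, dtype=float).reshape(-1, 1)  # one column only
    y = np.asarray(target)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    # Predicted probability of the positive class for each held-out row.
    scores = model.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, scores)


# Synthetic example data (made up for this sketch).
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=1000)  # e.g. 1 = terminated
leaky = target.copy()                   # a flag that simply repeats the outcome
noise = rng.normal(size=1000)           # a column unrelated to the outcome

leaky_auc = single_feature_auc(leaky, target)
noise_auc = single_feature_auc(noise, target)
print(f"leaky feature AUC: {leaky_auc:.2f}")
print(f"noise feature AUC: {noise_auc:.2f}")
```

In this sketch the copied flag scores well above the 0.85 leakage threshold, while the random column stays near 0.5, which is what a feature with no real signal looks like.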

The test for detecting suspicious variables is the same as the test for leakage, but the threshold is 0.70 instead of 0.85. These variables are not automatically dropped. Instead, they are flagged on the EDA report with a purple “Suspicious” label and should be reviewed to confirm that no leakage is actually present. If leakage is present, you should exclude the column from the model. If leakage is not present, no action is required.
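The two thresholds can be read as a simple decision rule on the single-feature ROC-AUC score. The sketch below is illustrative only: `classify_feature` and its parameter names are hypothetical, and the handling of a score exactly equal to a threshold (treated here as not exceeding it) is an assumption, since the article only says "above".

```python
from typing import Optional

# Default thresholds as stated in this article.
LEAKAGE_THRESHOLD = 0.85
SUSPICIOUS_THRESHOLD = 0.70


def classify_feature(
    auc: float,
    suspicious_threshold: float = SUSPICIOUS_THRESHOLD,
    leakage_threshold: float = LEAKAGE_THRESHOLD,
) -> Optional[str]:
    """Return the EDA label for a single-feature ROC-AUC score, or None.

    Hypothetical helper for illustration; not part of One AI.
    """
    for name, value in (
        ("suspicious_threshold", suspicious_threshold),
        ("leakage_threshold", leakage_threshold),
    ):
        # Thresholds are decimal numbers between 0 and 1.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")

    if auc > leakage_threshold:
        return "Leakage"      # dropped automatically
    if auc > suspicious_threshold:
        return "Suspicious"   # kept, but should be reviewed
    return None               # no label


print(classify_feature(0.92))  # Leakage
print(classify_feature(0.78))  # Suspicious
print(classify_feature(0.55))  # None
```

Checking the leakage threshold first means a score that exceeds both thresholds is reported once, as leakage, rather than as both labels.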

Both the leakage and suspicious performance thresholds can be changed by entering a decimal number between 0 and 1 in the designated field in the Global section of the One AI configuration on the augmentation page.

Navigation: Data > Augmentations > Edit > One AI Configuration > Global > Category Size Threshold > Override

Global Settings configuration from the Augmentation Screen
