The Exploratory Data Analysis (EDA) report provides valuable information about the data leveraged by machine learning models in One AI. Read more below, or watch the video Exploratory Data Analysis (EDA) Report.
EDA is an approach for analyzing data sets to summarize their main characteristics. This is frequently achieved by employing statistical graphics and various data visualization techniques, allowing us to glean insights from the data that extend beyond formal modeling.
One AI provides a robust EDA report as an additional tool to enable you to analyze the data that you are using in a machine learning model before using that data to make predictions. EDA can also be used on its own to understand a dataset without ever using the data in a predictive model.
Data preparation is automated in the One AI pipeline. One AI detects the type of each variable (attribute column) and applies treatments that strengthen the predictive value of useful variables while ignoring unhelpful ones. Examples of treatments include scaling and one hot encoding, along with checks for data leakage, NULL values, and correlations. The EDA report provides visibility into these treatments, together with general and statistical information about the data.
Understanding how to harness the benefits of this essential tool in One AI enhances your machine learning efforts.
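As a rough illustration of the two treatments named above, scaling and one hot encoding, here is a minimal Python sketch. It is not One AI's implementation: the function names, the sample columns, and the choice of min-max scaling as the linear scaling method are assumptions made for the example.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale numeric values into [low, high] using subtraction and division."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # A constant column has no spread; map every value to the lower bound.
        return [low for _ in values]
    return [low + (v - v_min) / span * (high - low) for v in values]


def one_hot_encode(values):
    """Split a categorical column into one binary column per category."""
    categories = sorted(set(values))
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in categories}


# Hypothetical sample columns.
tenure_years = [1, 3, 5, 9]
department = ["Sales", "HR", "Sales", "IT"]

scaled_01 = min_max_scale(tenure_years)              # values between 0 and 1
scaled_pm1 = min_max_scale(tenure_years, -1.0, 1.0)  # values between -1 and +1
encoded = one_hot_encode(department)                 # one 0/1 column per department
```

After scaling, every continuous variable shares the same range, and after encoding, each category is its own 0/1 column rather than a single column of labels.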
Finding the EDA Report
An EDA report is automatically generated each time a classification or regression machine learning augmentation run is completed. One AI augmentation runs can be found by navigating to Data > Augmentations and clicking on the Runs icon for a classification or regression augmentation.
Any augmentation in a Pending, Deployed, or Deployed and Persisted status should contain an EDA report. Clicking on the row for the run in the right pane will open the EDA report.
The overview section contains high level information about the structure of the data. Variables (columns) are attributes about each individual we’re making predictions for. The number of observations is the number of individuals (rows) we’re making predictions for. Please note that all information contained in the EDA report is based on the Train/Test dataset.
Variable Status Section
The Variable Status section provides information about the handling of every variable that the augmentation included.
The following are explanations of each of the colored labels in the Variable Status section:
- Selected! - This field was used in the predictive model that One AI determined was the best fit. The selected variables are the ones that were used by the model to make its predictions.
- Processed - One AI tried this field out and may or may not have ended up using it in the particular model that it settled on. Processed variables will have an additional orange label that identifies the treatment applied to the variable during the data cleaning stage.
- Scaled - Scaling is a mathematical transformation that shifts the range of a continuous value so multiple features will be on the same scale and thus won't get incorrectly weighted by the algorithm. One AI generally performs linear scaling on continuous variables, which uses a combination of subtraction and division to replace the original value with a number between -1 and +1 or between 0 and 1.
- One Hot Encoded - One Hot Encoding involves splitting out each category in a categorical variable into its own binary column. Doing so allows the machine learning algorithm to treat the values without bias, since they’re now separate variables. The following article explains One Hot Encoding in more detail: https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/.
- Dropped - One AI had objections about this variable and chose to leave it out of the predictive models that it tried out. Dropped variables will have an additional gray label that explains why the field was dropped.
- Leakage - Leakage means that the target for the model (the thing the model is predicting) is likely “leaking” data into the variable. In simple terms, it’s a variable that predicts the outcome too well to be plausible. An example would be a flag in the data that indicates whether someone terminated in the future, when that’s also the thing the model is predicting. Like a student, a model isn’t actually learning if you give it the answer up front. One AI identifies leakage by generating a random forest model against the target using only that feature and then measuring the performance using an ROC-AUC score. A score above 0.85 is considered leakage and the variable will be dropped. To change this threshold, add the following key to the yaml overrides section of the augmentation configuration and modify the decimal value: leakage_drop_perf_threshold: 0.85. The Suspicious status explained below is a less stringent version of the same test.
- Unique - The values in this column were all or nearly all unique, and they were categorical values (not numeric). As a result, they are not useful for making predictions.
- Constant - Constant is the opposite of unique. The values are all or nearly all the same. Variables of this nature are also not useful for making predictions.
- Missing - Missing means the data in this variable failed the NULL drop threshold. Variables containing more than 5% null values are tagged as missing and ignored. This threshold can be adjusted in the augmentation configuration settings.
- Suspicious - Suspicious variables are those in which possible leakage is detected. These variables are not dropped, but rather are flagged and should be validated to ensure that no leakage is actually present. The test performed is the same as described in the Leakage tag above, but the threshold is 0.70 instead of 0.85. To change the threshold above which a variable is flagged as suspicious, add the following key to the yaml overrides section of the augmentation configuration and modify the decimal value: suspicious_note_perf_threshold: 0.7.
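The Leakage and Suspicious checks described above can be approximated with the following Python sketch. It uses scikit-learn and NumPy, which is an assumption about tooling, not a statement about what One AI uses internally. The function names, the cross-validation setup, the number of trees, and the synthetic data are all hypothetical. Only the single-feature random forest, the ROC-AUC score, and the 0.85 and 0.70 thresholds come from the descriptions above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

LEAKAGE_THRESHOLD = 0.85     # default for leakage_drop_perf_threshold
SUSPICIOUS_THRESHOLD = 0.70  # default for suspicious_note_perf_threshold


def single_feature_auc(feature, target, seed=0):
    """ROC-AUC of a random forest trained on one numeric feature only."""
    X = np.asarray(feature, dtype=float).reshape(-1, 1)
    y = np.asarray(target)
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    # Score out-of-fold probabilities so memorizing the training rows
    # does not inflate the result (an assumption of this sketch).
    proba = cross_val_predict(model, X, y, cv=3, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)


def leakage_status(auc):
    """Map a ROC-AUC score to the status labels described above."""
    if auc > LEAKAGE_THRESHOLD:
        return "Leakage"
    if auc > SUSPICIOUS_THRESHOLD:
        return "Suspicious"
    return "OK"


# Synthetic data: one feature that nearly copies the target, one that is pure noise.
rng = np.random.default_rng(42)
target = rng.integers(0, 2, size=300)
leaky = target + rng.normal(0, 0.05, size=300)
noise = rng.normal(0, 1, size=300)

leaky_status = leakage_status(single_feature_auc(leaky, target))
noise_status = leakage_status(single_feature_auc(noise, target))
```

In this sketch the near-copy of the target scores far above 0.85 and would be dropped, while the noise column scores close to 0.5 and passes both checks.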
Variable Analysis Section
Sometimes you'll want more information about the data a variable contains. The Variable Analysis section in the EDA report has you covered in this regard. This section can be found below the Variable Status section or by clicking on any variable in the Variable Status section.
This section initially shows useful, high-level information.
Clicking on Toggle details expands a detailed analysis of the variable.
The variable analysis view contains some interactivity and provides a variety of statistics and visualizations. Insights about the data can be surfaced here with a bit of interaction. Here are some key items to note:
- The contents of the details section will differ based on whether the variable is a date, numeric, or categorical.
- Each tab contains different information. The specific tabs will not be explained in detail here but the information can be related to the shape of the data itself or the preparation/cleaning of the data.
- Tabs with a down arrow icon can be viewed for the entire data set or filtered to any of the individual targets. This is an especially powerful feature for spotting differences in the makeup of the data between the possible targets.
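To illustrate why filtering a variable by target can be revealing, the following Python sketch computes the same summary statistics for the entire data set and for each target value. The row data, the column names, and the choice of statistics are hypothetical and are not taken from the EDA report itself.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical rows: one numeric variable plus the target the model predicts.
rows = [
    {"tenure_years": 1.0, "terminated": "Yes"},
    {"tenure_years": 2.0, "terminated": "Yes"},
    {"tenure_years": 3.0, "terminated": "Yes"},
    {"tenure_years": 6.0, "terminated": "No"},
    {"tenure_years": 8.0, "terminated": "No"},
    {"tenure_years": 10.0, "terminated": "No"},
]


def summarize(values):
    """Basic statistics for one list of numeric values."""
    return {"count": len(values), "mean": mean(values), "median": median(values)}


def summarize_by_target(rows, variable, target):
    """Summarize a variable for the whole data set and for each target value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[target]].append(row[variable])
    summary = {"All": summarize([row[variable] for row in rows])}
    for label, values in groups.items():
        summary[label] = summarize(values)
    return summary


summary = summarize_by_target(rows, "tenure_years", "terminated")
```

Here the overall mean tenure hides a clear gap between the two targets, which is the kind of difference the per-target view is meant to surface.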
Correlation is the degree to which two variables are related in a linear fashion. A heatmap chart highlighting correlations is provided on the EDA report. Please note that correlations are not included on the EDA report by default. The following setting must be added to the yaml overrides for the augmentation configuration prior to running the augmentation for correlations to be included: generate_eda_correlations = True.
Similar to the Variable Analysis section, there are a number of tabs available and each one allows you to view correlations for the entire data set or filtered to any of the individual targets. The tabs in this section correspond with different methods for detecting correlation. The red diagonal line on the chart represents each variable’s correlation with itself, which is always 1.0.
In addition to the heatmap charts, correlations data can be downloaded as flat files. This option can be found by clicking the Download Correlations button at the bottom of the window.
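The following Python sketch shows how a correlation matrix like the one behind the heatmap can be computed. It uses the Pearson coefficient as one common method. Which methods the report's tabs actually use is not specified here, so treat that choice, the function names, and the sample columns as assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: a constant column has zero variance and would divide by zero here.
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict of column name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical numeric columns.
columns = {
    "tenure_years": [1, 2, 3, 4, 5],
    "salary_k": [40, 50, 60, 70, 80],
    "absences": [2, 9, 4, 7, 3],
}

matrix = correlation_matrix(columns)
```

Each variable's correlation with itself is 1.0, which is what the diagonal of the heatmap shows, and the matrix is symmetric because the correlation of A with B equals that of B with A.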
Missing Values and Sample
The Missing values section of the EDA report provides a number of views highlighting NULL values in variables. These views can be helpful in determining areas where additional data preparation would be beneficial. Similar to some of the earlier sections in the EDA report, there are a number of tabs available and each one allows you to view results for the entire data set or filtered to any of the individual targets.
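A minimal Python sketch of this kind of view follows. It computes the share of NULL values per variable for the entire data set and for each target value. The rows, column names, and function names are hypothetical. The 5% threshold is the default described for the Missing status earlier.

```python
MISSING_THRESHOLD = 0.05  # default NULL drop threshold described for the Missing status

# Hypothetical rows; None stands in for a NULL value.
rows = [
    {"manager_rating": None, "salary_k": 40, "terminated": "Yes"},
    {"manager_rating": None, "salary_k": 45, "terminated": "Yes"},
    {"manager_rating": 3, "salary_k": 50, "terminated": "Yes"},
    {"manager_rating": 2, "salary_k": 55, "terminated": "Yes"},
    {"manager_rating": 4, "salary_k": 60, "terminated": "No"},
    {"manager_rating": 5, "salary_k": 65, "terminated": "No"},
    {"manager_rating": 4, "salary_k": 70, "terminated": "No"},
    {"manager_rating": 3, "salary_k": 75, "terminated": "No"},
]


def null_rate(values):
    """Fraction of values that are NULL (None)."""
    return sum(v is None for v in values) / len(values)


def null_rates_by_target(rows, variables, target):
    """NULL rate per variable for the whole data set and for each target value."""
    labels = sorted({row[target] for row in rows})
    report = {}
    for var in variables:
        rates = {"All": null_rate([row[var] for row in rows])}
        for label in labels:
            rates[label] = null_rate([row[var] for row in rows if row[target] == label])
        report[var] = rates
    return report


report = null_rates_by_target(rows, ["manager_rating", "salary_k"], "terminated")
flagged = [var for var, rates in report.items() if rates["All"] > MISSING_THRESHOLD]
```

In this example the NULL values are concentrated in one target group, a pattern that only shows up when the results are filtered by target.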
The Sample section provides a view of the first 5 and last 5 rows in the dataset. This can help you spot obvious issues with the data.
Exploratory Data Analysis is a powerful tool for understanding data and identifying insights, either on its own or as part of understanding a predictive model. The EDA report is an integral part of machine learning in One AI. In fact, there are scenarios where predictive models will be created only to generate an EDA report, from which the desired insights can be obtained.