One AI creates an EDA report each time it runs a Classification or Regression Data Augmentation. EDA stands for Exploratory Data Analysis. Exploratory Data Analysis is all about checking out the data before you try to use it to make a predictive model. During this step of the process, One AI checks through the columns in the data set you gave it, looking for anything suspicious that could interfere with a valid prediction.
For a more robust description, check out Prasad Patil's write-up on Towards Data Science. (Towards Data Science is a great resource in general.)
Here are some examples of the type of information you'll find in the One AI EDA Report:
Variable Summary and Status
At the top of the report you see summary information on the data set and the variables (columns) in it. In this case we passed in a data set with 20 columns. Of those, three were treated as Numeric, six as Categorical, two were Dates, and six were dropped. That adds up to 17. Where are the other three? They are the target, sample date, and dataset id columns, which are special fields required for a Classification or Regression model. The summary information also tells us that there were 5516 observations (rows) in the data. The EDA report is based on the data in the second of the three files that are sent to One AI, which is the training period data.
For more information on why target, sample date, and dataset id are key fields, as well as the three files that are required for a predictive model, check out this article: http://help.onemodel.co/en/articles/3020852-one-ai-data-destination-files
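The column arithmetic above can be checked in a few lines of Python. This is only a sketch using the counts quoted from the example report; the dictionary keys are illustrative names, not One AI field names.

```python
# Counts quoted from the example Variable Summary (key names are illustrative).
type_counts = {"numeric": 3, "categorical": 6, "date": 2, "dropped": 6}

# The special fields required for a Classification or Regression model.
special_fields = ["target", "sample date", "dataset id"]

analyzed_columns = sum(type_counts.values())
total_columns = analyzed_columns + len(special_fields)

print(analyzed_columns)  # 17
print(total_columns)     # 20
```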
Next let's look through the Variable Status section and explain what all the little colored labels mean.
Selected! - This field is used in the predictive model that One AI determined was the best fit.
Processed - One AI tried this field out, but didn't end up using it in the particular model that it settled on.
Dropped - One AI had objections about this field and chose to leave it out of the predictive models that it tried out. Dropped fields will have another label, in grey, that explains why the field was dropped. See below.
Cheating - Cheating fields get thrown out because they are too good. Why would you want to throw out columns that do really really well in the prediction? Because HR data is littered with all sorts of data points that predict what's going to happen to an employee but aren't useful in a predictive model. For example, in this data set I made up a severance pay column. I ONLY populated a value for severance pay for employees who terminated. Not for every one of them, mind you. But quite a few. When One AI sees something like that, it knows that it's too good to be true and throws it out. After all, wouldn't it be really annoying if we came back with a model that said, "HEY! All these folks who have severance pay queued up are going to leave." "Thanks," you would reply, "can I have my money back?"
There is a similar flag to Cheating you might see called Suspicious. Same idea: the data seems too good to be true. But Suspicious columns aren't thrown out; they are just flagged as suspicious.
When you set up the parameters of a new augmentation in One AI, you can manually set thresholds for the Cheating and Suspicious flags.
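How One AI actually scores a column for Cheating or Suspicious isn't described here. As a rough illustration of the idea only, the Python sketch below asks how strongly "this column has a value" lines up with the outcome, and compares that score to two thresholds. The function names, the scoring method, the 0.75 and 0.5 thresholds, and the shift_bonus column are all assumptions made up for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either input never varies."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def leakage_flag(values, target, cheating_threshold=0.75, suspicious_threshold=0.5):
    """Flag a column whose 'is populated' pattern tracks the target too closely.

    Both thresholds are hypothetical defaults, not One AI's.
    """
    populated = [0 if value is None else 1 for value in values]
    score = abs(pearson(populated, target))
    if score >= cheating_threshold:
        return "Cheating"
    if score >= suspicious_threshold:
        return "Suspicious"
    return "Not flagged"


# 1 = terminated, 0 = still employed (made-up data).
terminated = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Severance pay is only populated for (most of) the employees who terminated.
severance_pay = [12000, 8000, 15000, None, None, None, None, None, None, None]

# A made-up column that is populated regardless of the outcome.
shift_bonus = [500, None, 500, None, 500, None, 500, None, 500, None]

print(leakage_flag(severance_pay, terminated))  # Cheating
print(leakage_flag(shift_bonus, terminated))    # Not flagged
```

Raising or lowering the two threshold arguments plays the same role as adjusting the Cheating and Suspicious thresholds when you set up an augmentation.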
Unique - The values in this column were all, or mostly all, unique, and they were categorical values (not numeric). As a result they are not useful in making predictions. In the artificial dataset behind this EDA report there were two such columns. One was the hire cohort (the month and year the person was hired) and the other was Cost Center. Cost Center was basically a random number with "CC" in front of it. Since just about every employee had a different cost center, we couldn't use it to predict whether other employees in that cost center would terminate.
Constant - Constant is the opposite problem from Unique. The values are all, or nearly all, the same. If every one of your employees has a data point that is the same, it won't really help us figure out any future outcomes for that employee. In this example the data set had a column called "Record Count" and every record had a "1.0" in it. So, it got thrown out.
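The Unique and Constant checks can be sketched in Python along these lines. This is a minimal illustration, not One AI's implementation: the function name and the 0.9 and 0.95 cutoffs for "mostly all unique" and "nearly all the same" are assumptions, and the data is made up.

```python
from collections import Counter


def cardinality_flag(values, is_categorical, unique_ratio=0.9, constant_ratio=0.95):
    """Return "Constant", "Unique", or None for a single column.

    The two ratio cutoffs are hypothetical defaults chosen for illustration.
    """
    if not values:
        return None
    counts = Counter(values)
    n = len(values)
    top_share = counts.most_common(1)[0][1] / n  # share held by the most common value
    distinct_share = len(counts) / n             # distinct values per row
    if top_share >= constant_ratio:
        return "Constant"
    # Only categorical columns are penalized for being (nearly) all unique.
    if is_categorical and distinct_share >= unique_ratio:
        return "Unique"
    return None


cost_center = [f"CC{1000 + i * 37}" for i in range(10)]  # every employee differs
record_count = [1.0] * 10                                # every record is 1.0
annual_salary = [50000 + i * 1250 for i in range(10)]    # numeric, all different

print(cardinality_flag(cost_center, is_categorical=True))     # Unique
print(cardinality_flag(record_count, is_categorical=False))   # Constant
print(cardinality_flag(annual_salary, is_categorical=False))  # None
```

Note that the salary column is just as "unique" as Cost Center, but because it is numeric it is not dropped for that reason.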
Scaled - For non-data scientists like me this means, "This got treated like a series of numbers or something similar." In this example data set, Annual Salary was one such value. While there are lots and lots of different numbers in the column, One AI could base predictions on whether that number was, say, higher or lower. Same deal with Birth Date and Hire Date. Even though they are nearly all unique, they could be relatively earlier or later.
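To make "scaled" a little more concrete, here is a small Python sketch. Which scaling method One AI uses isn't stated here; the example assumes standardization (subtract the mean, divide by the standard deviation), which is one common choice, and turns dates into day numbers first. The function name and the sample values are made up.

```python
from datetime import date
from statistics import mean, pstdev


def standardize(numbers):
    """Rescale numbers to mean 0 and standard deviation 1 (assumed method)."""
    mu = mean(numbers)
    sigma = pstdev(numbers)
    if sigma == 0:
        return [0.0 for _ in numbers]
    return [(x - mu) / sigma for x in numbers]


annual_salary = [50000, 60000, 70000]
scaled_salary = standardize(annual_salary)

# Dates become comparable numbers once converted to a day count.
hire_dates = [date(2015, 3, 1), date(2018, 7, 15), date(2021, 1, 4)]
scaled_hire_dates = standardize([d.toordinal() for d in hire_dates])

print(scaled_salary)      # lower salaries come out negative, higher ones positive
print(scaled_hire_dates)  # earlier hires come out negative, later ones positive
```

The exact values matter less than the ordering: after scaling, "higher or lower" and "earlier or later" are preserved, which is what lets a model use the column.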
One Hot Encoded - Again for non-data scientists: this value got treated like a category and, in order to treat it like a category, One AI broke it out into several columns, each of which it could test differently. Take "nine box" in the example above. This has... surprise... nine values, such as "Rockstar" or "Needs Coaching". In order to test whether these values have an impact on the prediction, One AI might, for example, make a separate column for "Rockstar" and put a zero or a one in it... then another column for "Needs Coaching" with a zero or a one... and so on. If you really care, check out this article: https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f
Sometimes you'll want more information about the data in a field and why it was treated the way it was. In this case you can click the red text with the field name in it and it will jump you further down in the EDA report where you can see additional information. For example, if you click on the nine box field, you can see summary data about that column, along with a histogram (Toggle details) that shows you how many of each value were found in the data.
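The zero-or-one columns described above look like this in a minimal Python sketch. The function name and the column naming scheme are assumptions for illustration, and only two of the nine values are used to keep the output short.

```python
def one_hot_encode(values, prefix):
    """Return one 0/1 column per distinct value, keyed by a generated column name."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }


# Three employees; a real nine box column would have up to nine distinct values.
nine_box = ["Rockstar", "Needs Coaching", "Rockstar"]

encoded = one_hot_encode(nine_box, prefix="nine_box")
for name, column in encoded.items():
    print(name, column)
# nine_box_Needs Coaching [0, 1, 0]
# nine_box_Rockstar [1, 0, 1]
```

Each row has a one in exactly one of the new columns, so the model can weigh "Rockstar" separately from "Needs Coaching" without pretending one is a bigger number than the other.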
Different fields will be summarized in different ways. For example, nine box was a categorical value and One AI showed a histogram right away. Performance score, on the other hand, was a number between 0 and 100 and was scaled. So instead the main summary shows us things like the Median, Max, Standard Deviation, etc.
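The two styles of summary can be approximated in Python like this. It is a sketch only: the function name, the choice of statistics, and the sample values are assumptions, and the real report shows more than this.

```python
from collections import Counter
from statistics import median, stdev


def summarize(values):
    """Numeric columns get summary statistics; anything else gets value counts."""
    if all(isinstance(value, (int, float)) for value in values):
        return {
            "min": min(values),
            "median": median(values),
            "max": max(values),
            "std_dev": stdev(values),  # sample standard deviation
        }
    # Value counts are the numbers behind a histogram of a categorical column.
    return dict(Counter(values))


nine_box = ["Rockstar", "Needs Coaching", "Rockstar"]
performance_score = [55, 70, 85, 90]

print(summarize(nine_box))           # {'Rockstar': 2, 'Needs Coaching': 1}
print(summarize(performance_score))  # min, median, max and std_dev for the scores
```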
The EDA report contains a lot of information about the data you fed into your prediction. Below the Variable Analysis you can even look over various correlation matrices. Over time, we will likely include even more data on this page.
For my part, I usually focus on the Variable Status section, looking to see if columns were dropped or used. Often you may want to adjust the model configuration based on what you see there.
For example, I recently used One AI on some NFL data to predict whether the Vikings would run or pass on a given down. This isn't HR data, and I ended up having to make a couple of tweaks. For instance, the data had a number for which Quarter of the game the play occurred in. I adjusted the model to treat this as a category, rather than a scaled number. Also, Down was initially flagged as a Cheating column, before I adjusted the configuration to let One AI use it in the model.
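The configuration change itself happens in One AI and isn't shown here. As a rough Python illustration of what switching Quarter from a scaled number to a category means for the data (the plays below are made up, and the column names are hypothetical):

```python
# Quarter of the game for six made-up plays.
quarters = [1, 2, 3, 4, 4, 1]

# Treated as a scaled number: one column, and the model sees 4 as "more" than 2.
quarter_as_number = [float(q) for q in quarters]

# Treated as a category: one 0/1 column per quarter, with no implied ordering.
quarter_as_category = {
    f"quarter_{q}": [1 if value == q else 0 for value in quarters]
    for q in sorted(set(quarters))
}

print(quarter_as_number)                 # [1.0, 2.0, 3.0, 4.0, 4.0, 1.0]
print(quarter_as_category["quarter_4"])  # [0, 0, 0, 1, 1, 0]
```

With the categorical version, the model can learn that play calling in the fourth quarter is simply different, rather than "four times as much" of whatever happens in the first.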