To configure a new machine learning augmentation in One AI, you must create a data destination with three files.
The Three Required Columns for One AI machine learning augmentations:
Dataset Id: This is a unique id per row in the file. For example, if you are creating an attrition risk model, this would likely be the employee id.
Sample Date: This field denotes the "as of" date for the records in the file. For example, you might provide data from the beginning of 2017, the beginning of 2018, and then use that data to predict attrition risk for 2019.
Classifier Target: This is the column with the value you are trying to predict. For example, it might have a 0 or a 1 to indicate whether the employee terminated. For a regression problem it might have a number in it, like headcount or days to fill. The format does not have to be numeric for classification problems. Instead of a 0 or 1, you might have a value that says "Terminated" or "Didn't Terminate".
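Before uploading, it can help to sanity-check each file for the three required columns. The following is a minimal sketch, not part of One AI itself. The column names (employee_id, sample_date, terminated), the CSV layout, and the ISO date format are all assumptions made for illustration; use whatever names and formats your own data destination defines.

```python
import csv
import io
from datetime import date

# Hypothetical column names for the Dataset Id, Sample Date, and Classifier Target.
REQUIRED_COLUMNS = ("employee_id", "sample_date", "terminated")

# A tiny made-up file; tenure_years stands in for any extra feature column.
SAMPLE_FILE = """employee_id,sample_date,terminated,tenure_years
1001,2017-01-01,0,2.5
1002,2017-01-01,1,0.8
1003,2017-01-01,0,6.1
"""

def check_file(text):
    """Return a list of problems found in one CSV file (empty list if none)."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column: {c}" for c in missing]
    problems = []
    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        dataset_id = row["employee_id"]
        if dataset_id in seen_ids:
            problems.append(f"line {line_no}: duplicate id {dataset_id}")
        seen_ids.add(dataset_id)
        try:
            # Assumes ISO dates (YYYY-MM-DD); adjust to your own format.
            date.fromisoformat(row["sample_date"])
        except ValueError:
            problems.append(f"line {line_no}: bad sample date {row['sample_date']!r}")
    return problems

print(check_file(SAMPLE_FILE))
```

The target column is deliberately not checked for a numeric type, since a classification target may hold labels such as "Terminated" instead of 0 or 1.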
So what does One AI do with the three files, and why does it need three of them? Here's how it works.
The Three Files Required for One AI machine learning augmentations
Typically you will provide two files with historical data and one with the current data. So let's say that file A is data from two years ago, file B is data from one year ago, and file C is the current data. One AI considers the files in alphabetical order, so name them accordingly: the oldest data should have the file name that comes first in the alphabet, and the newest data the file name that comes last.
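To see how a set of file names would be ordered, you can sort them yourself before uploading. This is a small sketch with hypothetical file names; it assumes a plain alphabetical sort, and One AI's exact comparison rules (for example, how it treats upper and lower case) are not covered here.

```python
# Hypothetical file names: a date prefix makes alphabetical order match
# chronological order.
file_names = [
    "attrition_2019-01-01.csv",  # current data (File C)
    "attrition_2017-01-01.csv",  # two periods ago (File A)
    "attrition_2018-01-01.csv",  # one period ago (File B)
]

def assign_roles(names):
    """Map three file names to roles A, B, C by plain alphabetical order."""
    if len(names) != 3:
        raise ValueError("expected exactly three files")
    ordered = sorted(names)
    return dict(zip(("A", "B", "C"), ordered))

roles = assign_roles(file_names)
for role, name in roles.items():
    print(role, name)
```

Descriptive names can sort in the wrong order. For example, "current.csv", "last_year.csv", and "two_years_ago.csv" would put the current data first, so a year or date in the name is usually the safer choice.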
File A (aka Pipeline Test File, aka Data from 2 periods ago)
One AI starts with this data, splits it into train and test sets, and trains several models on it (logistic regression, decision tree, random forest, etc.). These models are not actually used in the final predictions. They are created so that they can be tested against the next period of data (in File B).
While the results from this first round of model training are not used directly, the exercise is valuable because the system learns which model parameters are likely to keep performing well over time.
Over time, all models decay. What was predictive last year won't always be predictive this year. So, One AI has a built-in process to check performance over time and adjust its hyperparameters accordingly. (Hyperparameters are the values that govern the types of models that One AI creates. For example, the max depth of a decision tree.)
File B (aka Training Period, aka Data from 1 period ago)
Like File A, File B has actual results in it (like whether the person terminated or not). File B is first used to test how the hyperparameters selected based on File A perform against a future time period (see above).
The data in File B is then split into train/test sets and One AI trains new models on it. One AI then selects the top performing model based on the data from File B, but also informed by what types of models held up over time (when the models from File A were tested against B).
Detailed information about the selected model is shown on the Modeling tab when you click on a specific run in the list of runs for that augmentation.
File C (Present day data)
One AI then takes the model it selected and creates predictions using the data from File C. File C is usually data from the current point in time. It does not contain actual results. In other words, you don't yet know who will terminate, how many days a req will take to fill, etc.
These predictions are available on the Results Explorer tab when you click on a specific run in the list of runs for that augmentation. These predictions are also fed back into One Model as a data source and can be used in metrics, dimensions and dashboards.
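The three-file flow described above can be illustrated with a deliberately tiny example. This is a simplified sketch and not One AI's actual implementation: it uses a toy nearest-neighbour classifier on one made-up feature (tenure in months), treats the number of neighbours k as the only hyperparameter, and skips the train/test split within each file. All data values and function names are invented for illustration.

```python
from collections import Counter

# Toy rows of (tenure_months, terminated). All values are made up.
file_a = [(6, 1), (8, 0), (11, 1), (14, 1), (24, 0),
          (30, 1), (42, 0), (48, 0), (72, 0)]      # two periods ago
file_b = [(5, 1), (12, 1), (18, 0), (30, 0),
          (38, 0), (62, 0), (10, 1), (84, 0)]      # one period ago
file_c = [7, 26, 54]  # current data: features only, no outcome yet

def knn_predict(train, x, k):
    """Predict the majority label among the k training rows nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    """Share of test rows whose label the model predicts correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

# Step 1: fit each candidate on File A and score it against the later
# period in File B, to see which hyperparameter value holds up over time.
candidates = [1, 3, 5]
scores = {k: accuracy(file_a, file_b, k) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])

# Step 2: refit on File B with that hyperparameter, then predict the
# outcome for each row of the current data in File C.
predictions = [knn_predict(file_b, x, best_k) for x in file_c]
print(scores, best_k, predictions)
```

In this toy data, k=1 is thrown off by a single unusual row in File A, while larger values of k score better against File B. That is the kind of signal the File A to File B comparison is meant to surface before the final model is trained.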