One AI looks for four main data elements in a data set: 1) a labeled column, 2) a data set id, 3) a sample date, and 4) lots of feature columns. This article describes these elements and why they are needed.
Labeled Column

One AI is a predictive analytics engine. As such, you need to give it something to predict. This is often referred to as the labeled column or the target column. The value in this column is what you want your model to predict.
For example, suppose that you are looking to identify employees who are likely to terminate within 6 months. In this case the labeled column would contain values like “Terminated” or “Did Not Terminate.”
The labeled column could have more than just two values in it. For example, let’s say you have a 9-box employee evaluation matrix and you want to create a model that predicts which employees will be classified into which boxes in the future. In that case, your labeled column might include the values 1 through 9.
In One AI configuration, the labeled column will be specified as the classifier_target. For example: classifier_target: oneaidemo.termdata.isfutureterminated_1year
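As a sketch of how a labeled column like this might be built before it ever reaches One AI, here is one way to derive a 6-month termination label with pandas. The frame and column names are illustrative, not anything One AI requires:

```python
import pandas as pd

# Hypothetical employee snapshot; names are made up for illustration.
df = pd.DataFrame({
    "employee_id": ["EMP001", "EMP002", "EMP003"],
    "termination_date": [pd.Timestamp("2019-03-01"), pd.NaT, pd.NaT],
})

snapshot = pd.Timestamp("2019-01-01")
horizon = snapshot + pd.DateOffset(months=6)

# Label: did the employee terminate within 6 months of the snapshot date?
# NaT (still employed) compares False, so it falls into "Did Not Terminate".
df["is_future_terminated"] = (
    df["termination_date"]
    .between(snapshot, horizon)
    .map({True: "Terminated", False: "Did Not Terminate"})
)
```

However you derive it, the key point is that the label describes an outcome that had not yet happened as of the snapshot date.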
Data Set Id
One AI expects to get one row of data for each entity you are trying to predict. Typically this means one row of data per employee. However, it might also be one row of data per shift or per business unit.
One AI requires a unique identifier for each of these rows. For example, if you are predicting employee attrition, this unique identifier might be the employee id.
This is also the column that will be returned in the prediction results when you deploy the model. In other words, if you use employee id as your unique identifier, then when you deploy the model, you will get a dataset back that shows the likelihood that a given employee will leave the company. Those employees will be identified by the employee id you included.
In One AI configuration, the data set id will be specified as the dataset_id. For example: dataset_id: oneaidemo.termdata.id
Privacy note: You do not have to use your real employee ids in the data set. You can mask them in some way. If you are fancy, you might hash the values so that instead of seeing “EMP001” we see “b86d07b605150e168675dbda3d1b30fe”. This way you can reassure your security team if they are concerned about sending any type of PII data out of the organization. On that note, there’s also no reason to include employee names, emails, phone numbers, etc. in your Trailblazer data set. We don’t need any of that sensitive data for making predictions.
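The hashing idea can be sketched with Python’s standard hashlib. The digest in the example above looks like MD5; SHA-256 with a salt is a safer choice, and the salt value here is illustrative. Because the function is deterministic, you can still join the returned predictions back to your real employees on your side:

```python
import hashlib

def mask_id(raw_id: str, salt: str = "org-secret") -> str:
    """One-way mask for an employee id; the salt defeats rainbow-table lookups."""
    return hashlib.sha256((salt + raw_id).encode("utf-8")).hexdigest()

masked = mask_id("EMP001")  # same input always yields the same masked id
```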
Sample Date

In general, predictive models work by training a model, testing it to see how it performs, and then using it to predict unknown results. One AI requires a sample date column to facilitate this train / test / predict process. Typically you will use the sample date column to break your data into three chunks: the first two will be used for model development, the last one for making the actual predictions.
For example, if you were predicting employee attrition, you would pull together your data from three different periods of time. You might have a data set for the beginning of 2017, one for the beginning of 2018, and then a current data set for 2019. The sample date is how One AI will tell them apart. However you specify the sample date, sorting the values in ascending order should rank them from oldest to newest. In this example, the sample dates might simply be the values “2017”, “2018”, and “2019”.
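Splitting a data set into those three chunks by sample date might look like this in pandas (column names and values are illustrative):

```python
import pandas as pd

# Illustrative data set with a sample_date column; values sort oldest to newest.
df = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5, 6],
    "sample_date": ["2017", "2017", "2018", "2018", "2019", "2019"],
})

# Ascending sort order gives oldest -> newest, as One AI expects.
dates = sorted(df["sample_date"].unique())
oldest, middle, current = (df[df["sample_date"] == d] for d in dates)
```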
One AI will start with the first (oldest) sample date dataset. It will do all its data preparation and model generation awesomeness on it. Then it’ll take the results and try them out on the second data set. Interestingly, this is not yet the real model. It goes through all this effort just to check whether the assumptions it is making about how to define the prediction parameters hold up on data it hasn’t seen. This step exists because our ML team is pretty paranoid about overfitting models.
Based on the results of that initial, paranoia-inspired test run, One AI will divide the second data set into a train / test split and go through the whole model generation process again. The resulting model from this step is the one you actually get.
One AI takes this model and uses it to make predictions about the third, current data set. In the employee attrition example, these are your current employees. You don’t yet know if they are going to terminate. When you deploy the model you’ll get these employee records back along with a prediction and supporting details.
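One AI’s internals aren’t published, so the following is only a rough sketch of the train / validate / refit / predict flow described above, using scikit-learn’s LogisticRegression as a stand-in model on synthetic data (every name and number here is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_chunk(n=200):
    """Synthetic stand-in for one sample-date chunk of employee data."""
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

X_2017, y_2017 = make_chunk()
X_2018, y_2018 = make_chunk()
X_2019, _ = make_chunk()  # current chunk: outcomes not yet known

# Stage 1: fit on the oldest chunk, check assumptions on the next one
# (the overfitting guard described above).
probe = LogisticRegression().fit(X_2017, y_2017)
holdout_score = probe.score(X_2018, y_2018)

# Stage 2: train/test split within the second chunk and refit;
# this is the analogue of the model you actually get.
X_tr, X_te, y_tr, y_te = train_test_split(X_2018, y_2018, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

# Finally, score the current (unlabeled) chunk.
risk = model.predict_proba(X_2019)[:, 1]
```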
In One AI configuration, the sample date will be specified as the sample_date_column. For example: sample_date_column: oneaidemo.termdata.calendar
Feature Columns

The last ingredient in your dataset is lots and lots of feature columns. These are the raw materials that One AI will use to try to make predictions.
For example, if you are trying to predict an employee-level outcome like attrition, then you’ll include lots and lots of columns that describe your employees and the context in which they work. Examples might include the date they joined the company, who their manager is, how much they get paid, their pay relative to the market, time since last promotion, etc. The more data you can include the better. One AI will ignore or exclude data that is incomplete, that is found to be a “cheater” value (one that leaks the outcome you are trying to predict), or that is otherwise not helpful in making a prediction.
For perspective, you might include 500 different data points for each employee. One AI will test them all for inclusion in the final model, perhaps only using a dozen or so in making its predictions. That said, you do not have to include 500 features. Just pull together what you can.
You don’t have to call out feature columns in One AI configuration. Any column that is not the labeled column, dataset id, or sample date will be used as a feature column.
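Putting the three named keys from this article together, using the demo values shown above, a minimal configuration looks like this; every other column in oneaidemo.termdata would then be treated as a feature column:

```
classifier_target: oneaidemo.termdata.isfutureterminated_1year
dataset_id: oneaidemo.termdata.id
sample_date_column: oneaidemo.termdata.calendar
```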