Module Type: Functional
Level: Intermediate
Audience: Model creators & managers
Prerequisites: "One AI Exploratory Data Analysis (EDA) Report" & "One AI Results Summary" modules
Introduction
One AI provides a variety of downloadable reports from the Results Summary, each offering insight into a different aspect of the model. To access these downloads, navigate to the One AI tab in the main ribbon menu, click 'Runs' for the model of interest, then click the model status label for the desired run. From there, click 'Results Summary' at the bottom of the window, then click 'Downloads'. You can download a zip file containing all the files discussed in this help article or choose only the files you are interested in. The files are downloaded in CSV format.
Each of these downloadable reports is useful in understanding, evaluating, and improving your machine learning models. By exploring the data provided in these files, you can gain deeper insights into model performance, identify areas for enhancement, and ultimately build more accurate and reliable predictive models. Some of the most important information found in these downloads can also be found in the Exploratory Data Analysis (EDA) or Results Summary reports.
For more information about how these files relate to each other, please refer to the One AI Entity Relationship Diagram (ERD), which is attached at the bottom of this article.
Feature Coefficients & Feature Importances
File Overview
The Feature Coefficients and Feature Importances reports provide insights into the contributions of individual features in the machine learning models. The Feature Coefficients data applies to linear models, where each feature is assigned a coefficient indicating its impact on the model’s predictions. In contrast, the Feature Importances data applies to tree-based models, where each feature is given an importance score based on its contribution to the model's decision-making process. These files show which features most strongly affect the model's predictions, which supports model interpretation, feature selection, and improvements in performance and transparency.
Column Explanations
Feature Coefficients
- run_id: The unique identifier for the model run.
- label_value: The specific label associated with the feature in a classification task. The labels are selected during model configuration in the “Give your prediction target meaningful labels” step. This indicates how much the feature contributes to predicting each class.
- feature_name: The name of the feature selected by the model (e.g., the core attribute or generative attribute name). These are sorted in alphabetical order, not by importance, with generative attributes on top, followed by core attributes.
- feature_coefficient: The coefficient value assigned to the feature, indicating its impact on the model’s predictions. The magnitude represents the strength of the feature’s impact, while the sign indicates the direction of the relationship. Positive coefficients indicate a direct relationship, while negative coefficients indicate an inverse relationship.
Feature Importances
- run_id: A unique identifier for each model run.
- feature_name: The name of the feature selected by the model (e.g., the core attribute or generative attribute name). These are sorted in alphabetical order, not by importance, with generative attributes on top, followed by core attributes.
- feature_importance: The importance score assigned to the feature, reflecting its contribution to the model’s decision-making process. Higher scores indicate greater importance and predictive power.
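Because both files are sorted alphabetically rather than by impact, a common first step is to re-sort them. The following is a minimal sketch using only the Python standard library. The column names come from this article, but the sample rows and the helper name rank_rows are invented for illustration; in practice you would read the downloaded CSV file instead of the inline text.

```python
import csv
import io

# Hypothetical sample rows; real data comes from the Downloads page.
IMPORTANCES_CSV = """run_id,feature_name,feature_importance
r1,age,0.12
r1,span_of_control,0.05
r1,tenure,0.31
"""

COEFFICIENTS_CSV = """run_id,label_value,feature_name,feature_coefficient
r1,Terminate,age,-0.80
r1,Terminate,span_of_control,0.10
r1,Terminate,tenure,0.45
"""


def rank_rows(csv_text, value_column, use_abs=False):
    """Return rows sorted from largest to smallest value in value_column.

    With use_abs=True the sort uses the magnitude, so strongly negative
    coefficients rank alongside strongly positive ones.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    def sort_key(row):
        value = float(row[value_column])
        return abs(value) if use_abs else value

    return sorted(rows, key=sort_key, reverse=True)


top_importances = rank_rows(IMPORTANCES_CSV, "feature_importance")
top_coefficients = rank_rows(COEFFICIENTS_CSV, "feature_coefficient", use_abs=True)

for row in top_coefficients:
    print(row["feature_name"], row["feature_coefficient"])
```

Sorting coefficients by absolute value keeps the sign visible in the output while still ranking by strength of impact.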
Features
File Overview
The Features download provides detailed information about the features used in the machine learning model, including their original names, types, transformations applied, and encoding details. Each row pertains to a feature in a classification model run. This data helps users understand how each feature was processed and transformed, which is useful for interpreting the model's behavior and for replicating the preprocessing steps in future projects.
Column Explanations
- run_id: The unique identifier for the model run.
- base_column_name: The original name of the core attribute’s column or generative attribute in the dataset.
- base_column_type: The original data type of the column (e.g., numeric, categorical).
- feature_name: The name of the feature as used in the model after preprocessing.
- feature_type: The data type of the feature after preprocessing (e.g., numeric, binary/categorical).
- encoded_value: The specific value(s) of the feature if it was encoded (i.e., categories for a one-hot encoded feature).
- was_ordinally_encoded: Indicates whether the feature was ordinally encoded (True/False).
- was_scaled: Indicates whether the feature was scaled (True/False). Applies to numerical features only.
- was_binned: Indicates whether the feature was binned (True/False). Applies to numerical features only.
- was_ohe: Indicates whether the feature was one-hot encoded (True/False). Applies to categorical features only.
- scaling_type: The strategy of scaling applied to the feature, if any (e.g., standard, minmax).
- binning_type: The bin type applied to the feature, if any (e.g., auto).
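To trace model features back to their source columns, the rows can be grouped by base_column_name. This is a sketch under assumptions: the sample rows, the feature naming pattern for one-hot encoded columns (such as department_Sales), and the helper name features_by_base_column are all hypothetical, and only a subset of the file's columns is shown.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows using a subset of the columns described above.
FEATURES_CSV = """run_id,base_column_name,base_column_type,feature_name,feature_type,encoded_value,was_ohe
r1,department,categorical,department_Sales,binary,Sales,True
r1,department,categorical,department_HR,binary,HR,True
r1,tenure,numeric,tenure,numeric,,False
"""


def features_by_base_column(csv_text):
    """Map each original column to the model features derived from it."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["base_column_name"]].append(row["feature_name"])
    return dict(grouped)


groups = features_by_base_column(FEATURES_CSV)
for base_column, feature_names in groups.items():
    print(base_column, "->", feature_names)
```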
Grid Search Metadata
File Overview
The Grid Search Metadata download provides an extensive dataset of information on the grid search process, detailing the parameters, preprocessing steps, models evaluated, and performance metrics. Grid search is a method used to systematically test different algorithms and parameters/hyperparameters (settings) to find the combination that produces the best performance. The metadata includes all the configurations and results from these tests, helping users understand how the final model was chosen and fine-tune their machine learning models effectively.
Column Explanations
Due to the large volume of columns, they have been grouped into categories for clarity.
General Information and Model Details
- data_element_index, model_name, fitted_estimator: Identifiers and names for each grid search element and the specific instance of the model with its parameters.
- gs_params, should_calibrate: Grid search parameters and calibration indicator.
Preprocessing and Feature Engineering
- upsample, filter_method, filter_num_features, wrapper_method, wrapper_min_features, pre_wrap_num_features, dim_reduction_params, post_dim_reduction_dict: Details about upsampling, filtering, feature selection, and dimensionality reduction methods.
- feature_list, feature_importances: List of features used in the model and their importance scores.
- was_ordinally_encoded, was_scaled, was_binned, was_ohe, scaling_type, binning_type: Information on whether features were encoded, scaled, binned, or one-hot encoded, and the types of scaling or binning applied.
Performance and Validation Metrics
- brier_score, fraction_of_positives, mean_predicted_value, weighted_results: Various performance metrics evaluating the model's predictions.
- vif_scores, cross_val_score, cross_val_std: Multicollinearity scores and cross-validation metrics.
- mean_absolute_error, mean_squared_error, median_absolute_error, explained_variance_score, r2_score: Regression performance metrics.
- accuracy_score, precision_score, recall_score, f1_score: Classification performance metrics.
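One way to review the grid search results is to rank the candidates by their cross-validation metrics. The sketch below assumes that a higher cross_val_score is better and that a lower cross_val_std means a more stable candidate; this article does not state which scoring metric is used, so check that assumption against your own file. The sample rows and model names other than LGBMClassifier are hypothetical.

```python
import csv
import io

# Hypothetical sample rows using a subset of the columns described above.
GRID_CSV = """data_element_index,model_name,cross_val_score,cross_val_std
0,LogisticRegression,0.71,0.03
1,LGBMClassifier,0.78,0.05
2,RandomForestClassifier,0.78,0.02
"""


def rank_candidates(csv_text):
    """Sort candidates by mean score (descending), then by std (ascending)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(
        rows,
        key=lambda r: (-float(r["cross_val_score"]), float(r["cross_val_std"])),
    )


ranked = rank_candidates(GRID_CSV)
for row in ranked:
    print(row["model_name"], row["cross_val_score"], row["cross_val_std"])
```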
Labels
File Overview
The Labels download provides detailed performance metrics for each label used in the classification model, with one row per label. It includes precision, recall, F1 scores, and ROC values for different data splits (holdout, validation, and training). This helps users see how well the model predicts each label, identify strengths and weaknesses in its predictions, and guide improvements in accuracy and reliability. Please note that this file is only generated for classification models.
Column Explanations
- run_id: The unique identifier for each model run.
- label_value: The specific label for which the performance metrics are calculated. This could be any set of labels defined by the user, such as "High Performer" vs. "Not High Performer," "Terminate" vs. "Not Terminate," or, if label overrides were not performed during model configuration, simply "0" and "1."
- predictions: The number of predictions made for this label.
- holdout_precision, holdout_recall, holdout_f1: Precision, recall, and F1 scores for the holdout dataset.
- roc: ROC value for the holdout dataset.
- validation_precision, validation_recall, validation_f1: Precision, recall, and F1 scores for the validation dataset. The validation dataset is a subset of data used to tune model parameters and evaluate model performance during training, ensuring the model generalizes well to unseen data.
- train_precision, train_recall, train_f1: Precision, recall, and F1 scores for the training dataset. The training dataset is the subset of data used to train the machine learning model, allowing it to learn patterns and relationships within the data.
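The F1 score is the harmonic mean of precision and recall, and comparing the training and holdout F1 scores per label is a simple way to spot possible overfitting. The sketch below illustrates both ideas. The sample rows, the helper names, and the 0.10 gap threshold are hypothetical choices for illustration, not values defined by One AI.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical rows using columns from the Labels file.
LABEL_ROWS = [
    {"label_value": "Terminate", "train_f1": 0.90, "holdout_f1": 0.62},
    {"label_value": "Not Terminate", "train_f1": 0.93, "holdout_f1": 0.91},
]


def overfit_labels(rows, max_gap=0.10):
    """Return labels whose training F1 exceeds holdout F1 by more than max_gap."""
    return [
        row["label_value"]
        for row in rows
        if row["train_f1"] - row["holdout_f1"] > max_gap
    ]


flagged = overfit_labels(LABEL_ROWS)
print(flagged)
```

A large gap between training and holdout scores suggests the model has learned patterns that do not carry over to unseen data for that label.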
Label Predictions
File Overview
The Label Predictions download provides information about the predicted labels for instances in a classification model. These are effectively the “categorical predictions”. Each row corresponds to a prediction made by the model, indicating which label was assigned to a specific instance. Users would use this file to review and evaluate the predicted outcomes of their classification models, identifying trends and misclassifications to improve model accuracy and inform decision-making processes. It’s great for validating model performance and understanding its predictive capabilities. Please note that this file is only generated for classification models.
Column Explanations
- run_id: The unique identifier for each model run.
- dataset_id: The unique identifier for the instance being predicted (e.g., an employee’s person_id or new hire’s event_person_id).
- label_prediction: The predicted label assigned by the model. This could be any set of labels defined by the user, such as "High Performer" vs. "Not High Performer," "Terminate" vs. "Not Terminate," or, if label overrides were not performed during model configuration, simply "0" and "1."
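Once actual outcomes are known, predicted labels can be compared against them to locate misclassifications. The file itself contains only predictions, so the ACTUALS mapping below is a hypothetical stand-in for outcome data you would supply from elsewhere; the sample rows and the helper name confusion_counts are also invented for illustration.

```python
from collections import Counter

# Hypothetical rows from the Label Predictions file.
PREDICTIONS = [
    {"run_id": "r1", "dataset_id": "p1", "label_prediction": "Terminate"},
    {"run_id": "r1", "dataset_id": "p2", "label_prediction": "Terminate"},
    {"run_id": "r1", "dataset_id": "p3", "label_prediction": "Not Terminate"},
]

# Hypothetical actual outcomes, keyed by dataset_id (not part of the download).
ACTUALS = {"p1": "Terminate", "p2": "Not Terminate", "p3": "Not Terminate"}


def confusion_counts(predictions, actuals):
    """Count (actual, predicted) pairs for instances with a known outcome."""
    counts = Counter()
    for row in predictions:
        actual = actuals.get(row["dataset_id"])
        if actual is not None:
            counts[(actual, row["label_prediction"])] += 1
    return counts


counts = confusion_counts(PREDICTIONS, ACTUALS)
for (actual, predicted), n in counts.items():
    print(f"actual={actual!r} predicted={predicted!r} count={n}")
```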
Regression Predictions
File Overview
The Regression Predictions download provides information about the predicted continuous values for instances in a regression model. Each row corresponds to a prediction made by the model, indicating the continuous value assigned to a specific instance. This helps users evaluate the accuracy of their regression models by comparing predicted values with actual outcomes, thereby identifying errors and improving model performance. Please note that this file is only generated for regression models.
Column Explanations
- run_id: The unique identifier for each model run.
- dataset_id: The unique identifier for the instance being predicted (e.g., an employee’s person_id or job requisition’s req_id).
- regression_prediction: The predicted continuous value assigned by the model (e.g., predicted salary, predicted days to fill).
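Comparing predictions with actual outcomes usually means computing an error metric such as mean absolute error (MAE) or mean squared error (MSE). The sketch below shows the arithmetic. The file contains only predictions, so the ACTUALS mapping is a hypothetical stand-in for outcome data from another source, and the sample values are invented.

```python
# Hypothetical rows from the Regression Predictions file (predicted days to fill).
PREDICTIONS = [
    {"run_id": "r1", "dataset_id": "req1", "regression_prediction": 30.0},
    {"run_id": "r1", "dataset_id": "req2", "regression_prediction": 45.0},
]

# Hypothetical actual outcomes, keyed by dataset_id (not part of the download).
ACTUALS = {"req1": 28.0, "req2": 50.0}


def regression_errors(predictions, actuals):
    """Return (MAE, MSE) over instances that have a known actual value."""
    errors = [
        row["regression_prediction"] - actuals[row["dataset_id"]]
        for row in predictions
        if row["dataset_id"] in actuals
    ]
    if not errors:
        raise ValueError("no predictions could be matched to actual values")
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse


mae, mse = regression_errors(PREDICTIONS, ACTUALS)
print(f"MAE={mae} MSE={mse}")
```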
Prediction Explanations
File Overview
The Prediction Explanations download provides detailed explanations of the predictions made by a machine learning model. It contains one row per prediction, label, and feature combination. Each row explains the feature's role in that prediction using SHAP values and coefficients. SHAP values indicate the contribution of each feature to the prediction, while coefficient explanations provide insights into the feature's impact in linear models.
Column Explanations
- run_id: The unique identifier for the model run.
- dataset_id: The unique identifier for the instance being predicted (e.g., person_id). This id repeats once for each feature selected in the model, because each row holds the measurements for one feature.
- label: The predicted label for the data point (e.g., 'Not High Performer').
- feature: The name of the feature as used in the model after preprocessing.
- coefficient_explanation: The explanation of the feature's coefficient in the model (typically for linear models).
- shap_explanation: The SHAP value indicating the impact of the feature on the prediction.
- feature_value: The actual value of the feature for the given data point.
- shap_base_val: The baseline SHAP value for the model, representing the model's expected output before any feature contributions are added.
- coeff_base_val: The baseline value for the coefficient explanation.
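A typical use of this file is to list the strongest drivers behind each prediction by ranking features on the absolute SHAP value. SHAP values are additive by construction, so the base value plus the sum of a prediction's SHAP values gives the model output being explained; whether that output is a probability or a raw score depends on how the explanations were generated, which this article does not state. The sample rows and helper names below are hypothetical, and the coefficient columns are omitted for brevity.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows using a subset of the columns described above.
EXPLANATIONS_CSV = """run_id,dataset_id,label,feature,shap_explanation,feature_value,shap_base_val
r1,p1,High Performer,age,-0.05,41,0.10
r1,p1,High Performer,rating,0.30,4,0.10
r1,p1,High Performer,tenure,0.20,6,0.10
"""


def read_rows(csv_text):
    return list(csv.DictReader(io.StringIO(csv_text)))


def top_drivers(rows, n=2):
    """Return the n features with the largest |SHAP| per (dataset_id, label)."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["dataset_id"], row["label"])
        grouped[key].append((row["feature"], float(row["shap_explanation"])))
    return {
        key: sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)[:n]
        for key, pairs in grouped.items()
    }


def shap_total(rows, dataset_id, label):
    """Base value plus all SHAP contributions for one prediction."""
    matched = [
        r for r in rows if r["dataset_id"] == dataset_id and r["label"] == label
    ]
    if not matched:
        raise KeyError((dataset_id, label))
    base = float(matched[0]["shap_base_val"])
    return base + sum(float(r["shap_explanation"]) for r in matched)


rows = read_rows(EXPLANATIONS_CSV)
drivers = top_drivers(rows)
total = shap_total(rows, "p1", "High Performer")
print(drivers[("p1", "High Performer")], round(total, 2))
```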
Probability Predictions
File Overview
The Probability Predictions download provides probability predictions from a classification machine learning model for different classes within a dataset. Each row corresponds to a specific prediction (dataset_id and label combination) made by the model, with an associated probability indicating the model's confidence in that prediction. The file allows users to see how the model evaluated different instances and the likelihood it assigned to each possible outcome. Please note that this file is only generated for classification models.
Column Explanations
- run_id: The unique identifier for the specific model run, indicating which iteration of the model produced the predictions in this row.
- dataset_id: The unique identifier for the instance being predicted (e.g., person_id). This id repeats once for each class label, because each row holds the probability for one class label.
- label: The label or class to which the probability in this row applies (e.g., High Performer, Not High Performer).
- probability_prediction: The probability score that the model assigns to the given label, indicating how confident the model is that the instance belongs to this category.
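Because the file is in long format (one row per dataset_id and label), it is often convenient to pivot it so that each instance has all of its label probabilities together. The sketch below does this and then picks the most likely label per instance. The sample rows and helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows from the Probability Predictions file.
PROBABILITY_ROWS = [
    {"dataset_id": "p1", "label": "High Performer", "probability_prediction": "0.7"},
    {"dataset_id": "p1", "label": "Not High Performer", "probability_prediction": "0.3"},
    {"dataset_id": "p2", "label": "High Performer", "probability_prediction": "0.4"},
    {"dataset_id": "p2", "label": "Not High Performer", "probability_prediction": "0.6"},
]


def pivot_probabilities(rows):
    """Turn long rows into {dataset_id: {label: probability}}."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["dataset_id"]][row["label"]] = float(row["probability_prediction"])
    return dict(wide)


def most_likely_label(wide):
    """Pick the label with the highest probability for each instance."""
    return {
        dataset_id: max(probabilities, key=probabilities.get)
        for dataset_id, probabilities in wide.items()
    }


wide = pivot_probabilities(PROBABILITY_ROWS)
best = most_likely_label(wide)
print(best)
```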
Run
File Overview
The Run download provides a comprehensive overview of the model runs, detailing the timeline of the run from initiation to deployment, the individuals involved in each step, the type of model used, the dataset specifics, and various performance metrics for the model on the holdout set. This file serves as a log and performance evaluation tool, allowing users to track model progress and evaluate performance for informed decision-making and regulatory compliance.
Column Explanations
- run_id: The unique identifier for the model run.
- augmentation_name: The display name of the machine learning model.
- run_initiated: The date and time when the model run was initiated. This column is not currently populated, but will be in the future.
- initiated_by: The user who initiated the model run. This column is not currently populated, but will be in the future.
- result_received: The date and time when the model results were received. This is when the model run successfully completed.
- result_finalized: The date and time when the model results were finalized. This is when the model run was deployed. This column is not currently populated, but will be in the future.
- finalized_by: The user who deployed the model. This column is not currently populated, but will be in the future.
- estimator_type: The type of estimator used (e.g., LGBMClassifier).
- dataset_id: The unique identifier for the population the model makes predictions for (e.g., person_id). This is selected during model configuration.
- sample_date: The date when the sample data was taken. This is the population date that is selected during model configuration.
- estimator_target: The target variable that the model aims to predict (the metric that defines what you would like to predict). This is selected during model configuration (e.g., Voluntary Terminations).
- holdout_weighted_precision: Precision score for the holdout set, weighted by the class distribution for classification tasks.
- holdout_weighted_recall: Recall score for the holdout set, weighted by the class distribution for classification tasks.
- holdout_weighted_f1: F1 score for the holdout set, weighted by the class distribution for classification tasks.
- holdout_roc: ROC AUC score for the holdout set. This column is not currently populated, but will be in the future.
- holdout_regression_explained_variance: Explained variance score for regression tasks on the holdout set.
- holdout_regression_mse: Mean Squared Error for regression tasks on the holdout set.
- holdout_brier_score: Brier score for the holdout set, indicating the accuracy of probabilistic predictions.
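To make two of these metrics concrete, the sketch below shows the standard arithmetic behind a binary Brier score (the mean squared difference between predicted probability and actual outcome, where lower is better) and a class-weighted average of per-label scores. This illustrates the general definitions only; the article does not describe One AI's exact computation, and all input values and function names here are hypothetical.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared) / len(squared)


def weighted_average(scores, class_counts):
    """Average per-label scores, weighting each label by its class count."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("class counts must not sum to zero")
    return sum(s * n for s, n in zip(scores, class_counts)) / total


# Hypothetical holdout data: predicted probability of the positive class
# and whether the positive outcome actually occurred (1) or not (0).
brier = brier_score([0.9, 0.2, 0.6], [1, 0, 0])

# Hypothetical per-label precision scores and the number of actual
# instances of each label in the holdout set.
weighted_precision = weighted_average([0.8, 0.9], [30, 70])

print(round(brier, 4), round(weighted_precision, 2))
```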