July 30, 2021 | 5 min read

Dataiku Model Testing

Robust Model Testing

After your Dataiku model is created, the next step is to test it on live data. A lot of work has been put into your model and it’s time to see it in action. Though the initial results may look good and provide expected outputs, only by testing your new model under additional parameters will you see how successful it is.

The metrics for detecting a successful model vary depending on the type of model. Your chosen model will also determine how you should test it. All the steps listed are important for a robust model that is resilient to new data.

Measuring Your New Model

The main metrics depend on whether your model is for classification or for regression.

When it comes to regression models, the mean squared error and R2 are important to consider. The mean squared error is derived by squaring each error (the difference between the actual and the predicted value) and averaging the squares over all observations. The lower this value, the more precise your predictions. R2 (pronounced R-Squared) is the proportion of the observed variation around the mean that your model can explain (or forecast). R2 is at most 1, with a greater value being better. It usually falls between 0 and 1, but it can drop below 0 on new data when the model predicts worse than simply guessing the mean.
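As a minimal sketch of the two regression metrics, both can be computed in a few lines of plain Python. The function names and the sample values below are made up for illustration and are not part of Dataiku.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


def r_squared(y_true, y_pred):
    """1 minus the ratio of unexplained variation to total variation around the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_residual = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_residual / ss_total


# Made-up actual values and predictions, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(f"MSE: {mse:.3f}, R2: {r2:.3f}")  # MSE: 0.125, R2: 0.975
```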

The most straightforward statistic for evaluating classification models is accuracy. Accuracy is an everyday word, but here it has a specific definition: the percentage of observations correctly predicted by the model. Accuracy is easy to understand, but it should be treated with caution, especially when the classes to be predicted are imbalanced. If 95 percent of observations belong to one class, a model that always predicts that class reaches 95 percent accuracy without having learned anything useful.
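The caution about imbalanced classes can be illustrated with a small sketch. The data here is invented: 95 negative observations, 5 positive ones, and a hypothetical "model" that always predicts the negative class.

```python
def accuracy(y_true, y_pred):
    """Fraction of observations where the prediction matches the actual class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Invented, heavily imbalanced data: 95 negatives (0) and 5 positives (1).
actual = [0] * 95 + [1] * 5

# A useless "model" that predicts the majority class every time.
always_negative = [0] * len(actual)

acc = accuracy(actual, always_negative)
positives_found = sum(1 for t, p in zip(actual, always_negative) if t == 1 and p == 1)

print(f"Accuracy: {acc:.2f}")                # Accuracy: 0.95
print(f"Positives found: {positives_found}")  # Positives found: 0
```

The score looks strong, yet not a single positive observation was identified, which is why accuracy alone can mislead.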

Two more metrics for classification are the area under the ROC curve, or AUC, and the logarithmic loss. AUC measures how well the model ranks positive observations above negative ones: 0.5 is no better than random guessing, and 1 is a perfect ranking. Log loss is common in competitions on sites like Kaggle. This metric is used when your classification model produces probabilities of belonging to discrete classes instead of rigid classifications like true and false. Examples of such probabilities could be ‘a 10 percent chance of being true’ or ‘a 75 percent chance of being true’. When your model makes an inaccurate forecast with high confidence, log loss penalizes it more severely.
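The way log loss punishes confident mistakes can be seen in a short sketch of the binary case. The function name, the clipping constant `eps`, and the probabilities are illustrative assumptions, not values from a real model.

```python
import math


def log_loss(y_true, probabilities, eps=1e-15):
    """Average negative log-likelihood for binary labels (1 = true, 0 = false)."""
    total = 0.0
    for t, p in zip(y_true, probabilities):
        # Clip so that log(0) is never evaluated.
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


actual = [1, 1, 0]

# Predicted probability of "true" for each observation (invented values).
cautious = [0.75, 0.60, 0.40]
overconfident = [0.75, 0.01, 0.40]  # second prediction is confidently wrong

loss_cautious = log_loss(actual, cautious)
loss_overconfident = log_loss(actual, overconfident)
print(f"cautious: {loss_cautious:.3f}, overconfident: {loss_overconfident:.3f}")
```

Only one prediction differs between the two lists, but assigning a 1 percent chance to something that turned out to be true raises the loss considerably.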

Your Model Could Be Overfitted

Your newly trained model can sometimes pay such close attention to the training data that it misses the bigger picture. It learns unwanted quirks of that data and then performs poorly on new and unseen data. This is called overfitting.

Regression Overfitting Reduction

We can also think of overfitting as rigidness in the model. The remedy to this rigidness or overfitting is regularization. Though we won’t go into the details of the math, we can say that some of the variables in the model are weighted too heavily. To fix this, we reduce the weight of the variables that aren’t so important, thereby reducing the tunnel vision the model has acquired in its training. With the hundreds of variables that these models can have, how do we find which variables to remove? Using linear regression as an example, we can remove the variables whose coefficients are close to zero but not quite. For example, a coefficient of 0.000589 is negligible (provided the variables are on comparable scales) and we can assume it is just noise.
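One way to picture this is the soft-thresholding step associated with L1 (lasso-style) regularization, which shrinks every coefficient toward zero and sets the smallest ones to exactly zero. The sketch below is only an illustration of that idea: the variable names, coefficient values, and penalty are hypothetical, and a real regularized regression chooses its coefficients during fitting rather than adjusting them afterwards.

```python
def soft_threshold(coefficient, penalty):
    """Shrink a coefficient toward zero; return exactly zero if it is within the penalty."""
    if coefficient > penalty:
        return coefficient - penalty
    if coefficient < -penalty:
        return coefficient + penalty
    return 0.0


# Hypothetical coefficients from an unregularized linear regression.
coefficients = {"age": 2.5, "income": -1.2, "noise_feature": 0.000589}
penalty = 0.01  # hypothetical regularization strength

regularized = {name: soft_threshold(value, penalty) for name, value in coefficients.items()}
kept = [name for name, value in regularized.items() if value != 0.0]

print(regularized)
print("Variables kept:", kept)
```

The negligible coefficient drops out entirely, while the meaningful ones are only slightly reduced.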

In the case of decision trees, for example, we can reduce the depth of the tree to reduce overfitting. A deeper tree means more complexity and more fixation on the training data, so to speak. By reducing the depth of the tree, we reduce complexity and remove splits that provide no value.

In general, the idea is to ‘generalize’ any model by reducing the amount of specificity it gained during the training stage. That is the cure for overfitting.

Testing the Goodness Of Your Model

As hard as it is, you must delay taking your new model for a spin on the test data. After training the model, the next step is to perform checks for overfitting as explained in the previous section. With this done, we can finally run the model on the test data.

The data will not be used in one piece. Instead, it will be split into K pieces called ‘folds’. For each combination of the parameters from the previous section (for example, a depth of 10 in a decision tree, or 10 fewer variables in the regression model), the model is trained on all but one fold and tested on the fold that was held out, and the error is calculated. This is repeated K times, so that every fold is held out once. More technically, this method of validating a model is called K-fold Cross-Validation. The average of these K errors is your cross-validated error for that combination of parameters.
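The procedure can be sketched in plain Python. Everything here is a simplifying assumption made for illustration: the toy model is a one-variable regression through the origin with a single `penalty` parameter, the data is invented, and the folds are assigned by position rather than shuffled.

```python
def k_fold_indices(n_samples, k):
    """Assign indices 0..n_samples-1 to k folds of near-equal size (no shuffling)."""
    return [list(range(start, n_samples, k)) for start in range(k)]


def fit_slope(xs, ys, penalty):
    """Toy model: one-variable ridge regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


def cross_validated_error(xs, ys, k, penalty):
    """Train on k-1 folds, score on the held-out fold, and average the k errors."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k):
        held_out_set = set(held_out)
        train = [i for i in range(len(xs)) if i not in held_out_set]
        slope = fit_slope([xs[i] for i in train], [ys[i] for i in train], penalty)
        mse = sum((ys[i] - slope * xs[i]) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(mse)
    return sum(fold_errors) / len(fold_errors)


# Invented data that follows y = 2x exactly.
xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x for x in xs]

# Candidate parameter values to compare (hypothetical).
candidate_penalties = [0.0, 1.0, 10.0]
cv_errors = {p: cross_validated_error(xs, ys, k=5, penalty=p) for p in candidate_penalties}
best_penalty = min(cv_errors, key=cv_errors.get)

print(cv_errors)
print("Best penalty:", best_penalty)
```

Each candidate gets one averaged error, and the candidate with the lowest value is the one to carry forward.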

Retraining the Model

Now you should have a cross-validated error for each combination of parameters, and the combination with the least error is your best performer. It’s time to re-train the model using those parameters on the full set of data. Where each cross-validation run trained the model on only part of the data, we now have the benefit of the full set of data and the right model and its parameters.

It takes a lot of steps to create a robust model, and K-fold validation is one of the best forms of insurance. It does come with drawbacks, however. The main one is the amount of time it takes to perform all these tasks: retraining on the folds and then once more on the full set of data. An alternative to K-fold validation is to further split the training set to carve out a validation piece. We would end up with the full data split into something like 60-20-20 for training, validation, and testing. Though this approach is less time consuming and more lenient on computing resources, it’s not as robust and thorough as K-fold validation. This approach, however, could be sufficient for some models.
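A 60-20-20 split can be sketched as follows. The function name, the default fractions, and the fixed seed are assumptions chosen for illustration; the rows are simply the numbers 0 to 99.

```python
import random


def train_validation_test_split(rows, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle the rows, then cut them into training, validation, and test pieces."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable

    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test


rows = list(range(100))  # stand-in for 100 observations
train, validation, test = train_validation_test_split(rows)
print(len(train), len(validation), len(test))  # 60 20 20
```

Parameters are tuned against the validation piece, and the test piece is touched only once at the end.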

Optimize Model Performance

Each approach has pros and cons, but a robust model should go through the steps necessary for the best performance.


Ryan Moore

With over 20 years of experience as a Lead Software Architect, Principal Data Scientist, and technical author, Ryan Moore is the Head of Delivery and Solutions at Snow Fox Data and our resident Dataiku Neuron. He provides Data Science architecture and implementation consultation to organizations around the globe.