Summary
This article provides a comprehensive guide to comparing two multi-class classification models built on the UCI Iris Dataset. The focus is on the impact of feature selection and engineering on model outcomes: a base model is built using only the sepal features, and a second model incorporates all features, highlighting the improvement in predictions when petal measurements are added. Traditional metrics and techniques such as confusion matrices are covered first, followed by more in-depth techniques such as double lift plots, SHAP (SHapley Additive exPlanations) plots and partial dependence plots. Together, these techniques demonstrate that adding all features provides measurable improvement over the sepal features alone, and they show how each feature's influence within the models changes as a result.
Introduction
In the field of machine learning, most of the available tutorials focus on the creation of a specific model, moving from start to finish as if the process were a linear path. In reality, the process is more like a hike where you may find yourself exploring different paths only to double back and try something new.
This article explores various ways of comparing two models built on the same dataset, which can be used to assess feature selection, feature engineering or other treatments that may be applied. The UCI Iris Dataset¹ is used in the tutorial due to its popularity and generally straightforward structure, so that the focus can stay on the evaluation methods rather than the model building.
I. Model Build
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import polars as pl
import seaborn as sns
import sklearn.metrics as skm
# Load the Iris dataset and place the features in a polars DataFrame
iris = load_iris()
x = pl.DataFrame(iris.data, schema = ['sepal length (cm)', 'sepal width (cm)',
                                      'petal length (cm)', 'petal width (cm)'])
y = iris.target
x.head()
While not the focus of this article, the following walks through the creation of the two models that will be compared with the evaluation methods discussed later. The Iris dataset was chosen because it's a relatively straightforward dataset: 3 evenly represented classes of 50 samples each and only 4 features. There are no missing values, the units are consistent and at least two of the classes are linearly separable.
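As a quick check of that claim, the class balance can be confirmed directly from the target array; a minimal sketch (numpy is not imported in the snippet above, so it is added here):
import numpy as np

# Confirm the three classes are evenly represented: expect 50 samples of each
print(np.bincount(y))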
While the evaluation methods can be used to assess the impact of any change to a model setup (hyperparameter tuning, feature engineering, etc.), this tutorial focuses on one specific setup. A base model with a subset of features will be trained, and then a second model with all features will be trained in the same way, allowing a straightforward evaluation and analysis.
Model 1: Sepal Length and Width Only
The base model will consist of sepal length and width only. In practice, it's common to build a model with control features to establish a benchmark against which any lift from additional features or tuning can be assessed. Control features are typically straightforward features that are known to be important for making accurate predictions. An example would be the number of petals if you were trying to classify a set of flowers broader than those included in the Iris dataset. As this dataset only has 4 features, we will treat the sepal characteristics as the control features.
First, we will split our data into an 80/20 train/test split and then train the first model on the sepal features only. Given that the focus of this article is evaluation, not building, no hyperparameter tuning or other treatments will be discussed or explored.
sepal_features = ['sepal length (cm)', 'sepal width (cm)']
# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
random_state=35)
# set up a random forest classifier and train it on the subset of sepal features,
# then make predictions on the test set with the sepal features
model_one = RandomForestClassifier(random_state = 35)
model_one.fit(x_train[sepal_features], y_train)
model_one_pred = model_one.predict(x_test[sepal_features])
Model 2: All Features
For model 2, all 4 features will be incorporated and the training will follow the same setup.
# set up a random forest classifier and train it on all features,
# then make predictions on the test set with all features
model_two = RandomForestClassifier(random_state = 35)
model_two.fit(x_train, y_train)
model_two_pred = model_two.predict(x_test)
II. Classic Performance Metrics
First, we will explore the most standard performance metrics for classification algorithms.
Confusion Matrix
A confusion matrix is a table (often presented as a heatmap) that shows the comparison of predicted labels with the actual labels of the dataset. Key components of a confusion matrix are:
- True Positives (TP): These are the cases where the model correctly predicted the positive class.
- True Negatives (TN): These are the cases where the model correctly predicted the negative class.
- False Positives (FP): These are the cases where the model incorrectly predicted the positive class when the actual class was negative. Also known as a Type I error.
- False Negatives (FN): These are the cases where the model incorrectly predicted the negative class when the actual class was positive. Also known as a Type II error.
# Create confusion matrices for each model and then plot as a heatmap
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
cm1 = skm.confusion_matrix(y_test, model_one_pred)
cm2 = skm.confusion_matrix(y_test, model_two_pred)
sns.heatmap(cm1, annot=True, fmt='d', cmap='Blues', ax=ax1)\
.set(title = 'Sepal Feature Confusion Matrix',
xlabel = 'Predicted', ylabel = 'True')
sns.heatmap(cm2, annot=True, fmt='d', cmap='Blues', ax=ax2)\
.set(title = 'All Feature Confusion Matrix',
xlabel = 'Predicted', ylabel = 'True')
plt.tight_layout()
plt.show()
As you can see in the above matrices, there is little improvement in predicting class 0, which already has almost no error. However, there is marked improvement in predicting classes 1 and 2 with the addition of the petal features. From these matrices, you are also able to derive the following key measures:
| Metric | Formula | Description |
| --- | --- | --- |
| Accuracy | (TP + TN) / Sample Count | The proportion of correctly classified instances out of the total number of samples. Note that accuracy is a heavily biased metric if your classes are imbalanced, as it becomes easier to be 'correct' by simply predicting the majority class. For model one: (11 + 5 + 6) / 30 = 0.73 |
| Precision | TP / (TP + FP) | The proportion of true positive predictions over all positive predictions, i.e. how many of your positive predictions were actually real positives. Computed on a per-class basis. For model one, class 0: 11 / (11 + 1) = 0.92 |
| Recall | TP / (TP + FN) | The proportion of true positive predictions out of all positive samples, i.e. of all the possible true positives for the class, how many were found. Computed on a per-class basis. For model one, class 0: 11 / (11 + 0) = 1.0 |
| F1 Score | 2 x (Precision x Recall) / (Precision + Recall) | The harmonic mean of precision and recall, which serves as a more balanced metric. Computed on a per-class basis. For model one, class 0: 2 x ((0.92 x 1.0) / (0.92 + 1.0)) = 0.96 |
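These metrics don't need to be computed by hand; a minimal sketch using the sklearn.metrics module already imported as skm (the exact formatting of classification_report may vary slightly by scikit-learn version):
# Summarize accuracy and per-class precision, recall and F1 for both models
print('Model one accuracy:', skm.accuracy_score(y_test, model_one_pred))
print('Model two accuracy:', skm.accuracy_score(y_test, model_two_pred))
print(skm.classification_report(y_test, model_one_pred))
print(skm.classification_report(y_test, model_two_pred))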
Further Reading:
Ajay Kulkarni, Deri Chong, Feras A. Batarseh, "5 – Foundations of data imbalance and solutions for a data democracy," Data Democracy, Academic Press, 2020, Pages 83-106 (Section 3). https://www.sciencedirect.com/science/article/pii/B9780128183663000058
ROC and AUC-ROC Curves
Similar to the statistics above, ROC and AUC-ROC curves are typically computed on a per-class basis. The ROC curve plots the true positive rate against the false positive rate at various classification thresholds, while AUC measures the area under that curve.
For the purposes of the example, a OneVsRestClassifier was run and the resulting ROC curves compare the given class against all others. An AUC above 0.5 indicates the model performs better than random guessing. Similar to the confusion matrices, it's evident that model one with only sepal features performs near perfectly on class 0 but less well, although still strongly, on classes 1 and 2.
from sklearn.multiclass import OneVsRestClassifier
# Train a one-vs-rest classifier on the sepal features
classifier = OneVsRestClassifier(RandomForestClassifier(random_state = 35))
classifier.fit(x_train[sepal_features], y_train)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
probs = classifier.predict_proba(x_test[sepal_features])
for i in range(3):
    prob = [x[i] for x in probs]
    actuals = [1 if x == i else 0 for x in y_test]
    fpr[i], tpr[i], _ = skm.roc_curve(actuals, prob)
    roc_auc[i] = skm.auc(fpr[i], tpr[i])
# Plot ROC curve for each class
plt.figure(figsize=(8, 6))
colors = ['blue', 'green', 'red']
for i, color in zip(range(3), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2,
             label='ROC curve of class {0} (AUC = {1:0.2f})'.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Multiclass ROC Curve')
plt.legend(loc="lower right")
plt.show()
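The per-class curves can also be reduced to a single summary number per model. A minimal sketch using scikit-learn's roc_auc_score with one-vs-rest macro averaging, applied to the model_one and model_two classifiers trained earlier rather than the separate OneVsRestClassifier above:
# Macro-averaged one-vs-rest AUC for each model on the test set
auc_one = skm.roc_auc_score(y_test, model_one.predict_proba(x_test[sepal_features]),
                            multi_class='ovr', average='macro')
auc_two = skm.roc_auc_score(y_test, model_two.predict_proba(x_test),
                            multi_class='ovr', average='macro')
print('Model one macro AUC:', auc_one)
print('Model two macro AUC:', auc_two)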
Further Reading:
Andrew P. Bradley, "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognition, Volume 30, Issue 7, 1997, Pages 1145-1159. https://www.sciencedirect.com/science/article/pii/S0031320396001422
III. Interpretability Analysis
Beyond raw performance, there are often considerations around interpretability and the importance that features have within the model. In practice, it's important to ensure that any changes to a model don't result in counterintuitive treatment of features or introduce unintentional bias.
SHAP (SHapley Additive exPlanations) Plots
SHAP plots are built on SHAP values, which represent the contribution of each feature to the difference between the actual prediction and the average prediction. These plots can help with a number of different aspects of model comparison and evaluation:
- Feature Impact: Positive SHAP values indicate the feature pushed the model prediction higher, negative values indicate it pushed the prediction lower, and the magnitude reflects the strength of the impact (larger absolute SHAP values mean a higher impact). SHAP values can also reveal interactions between features and their combined effects.
- Debugging: If a change to a model resulted in unanticipated results, the SHAP plot can help look under the hood and see what exactly is driving behavior.
- Comparison: By placing two plots side by side, it’s possible to see how different design choices in a model can change the feature importance and weights.
import shap
# Create SHAP plots for each model
x_test = x_test.to_pandas()  # convert to pandas, since shap works with numpy arrays and pandas DataFrames rather than polars
explainer_one = shap.TreeExplainer(model_one)
shap_values_one = explainer_one.shap_values(x_test[sepal_features])
explainer_two = shap.TreeExplainer(model_two)
shap_values_two = explainer_two.shap_values(x_test)
fig, ax = plt.subplots(figsize=(15, 5))
shap.summary_plot(shap_values_one, x_test[sepal_features], plot_type='bar')
fig, ax = plt.subplots(figsize=(15, 5))
shap.summary_plot(shap_values_two, x_test, plot_type='bar')
While the traditional metrics made it evident that the addition of petal features improved model accuracy, the SHAP plots make it clearer why. As can be seen, the sepal features appear quite important in model one. However, once the petal features are added, the sepal features become almost negligible in impact. More interesting, though, is that sepal length is much more important than sepal width, while petal width and length are of similar importance. In the context of this example, that suggests that distinguishing iris species is primarily driven by the petal measurements.
Further Reading:
Mayer, Michael. "SHAP for additively modeled features in a boosted trees model." arXiv preprint arXiv:2207.14490 (2022). https://arxiv.org/pdf/2207.14490.pdf
Partial Dependence Plots (PDP)
PDPs are similar to SHAP plots in that they aim to show how features impact model predictions. Where they differ, however, is that PDPs show the average effect of a feature on predictions, while SHAP plots show the contribution of each feature to individual predictions. PDPs also require access to the underlying model, since the predictions are made by varying the feature's value while holding all other features constant.
from pdpbox.pdp import PDPIsolate
# Create Partial Dependence Plot for sepal length, model one
pdp_feature = PDPIsolate(model=model_one, df=x.to_pandas(),
model_features=sepal_features,
feature='sepal length (cm)',
feature_name = 'sepal length (cm)')
fig = pdp_feature.plot(engine = 'matplotlib')
plt.show()
all_features = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
# Create Partial Dependence Plot for sepal length, model two
pdp_feature = PDPIsolate(model=model_two, df=x.to_pandas(),
model_features=all_features,
feature='sepal length (cm)',
feature_name = 'sepal length (cm)')
fig = pdp_feature.plot(engine = 'matplotlib')
plt.show()
For sepal length in relation to class 2, it's clear that in model one it has an outsized, almost entirely positive impact on the prediction. Once the petal features are incorporated in model two, sepal length has a much more marginal impact on predictions and can even lower the prediction estimate at lengths under 7.5 cm.
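If you would rather avoid the extra pdpbox dependency, scikit-learn ships a comparable tool. A minimal sketch using PartialDependenceDisplay (available in scikit-learn 1.0+); the target=2 argument selects class 2 to match the discussion above, and the output will not look identical to the pdpbox plots:
from sklearn.inspection import PartialDependenceDisplay

# Partial dependence of the class 2 prediction on sepal length for each model
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
PartialDependenceDisplay.from_estimator(model_one, x_test[sepal_features],
                                        ['sepal length (cm)'], target=2, ax=ax1)
PartialDependenceDisplay.from_estimator(model_two, x_test,
                                        ['sepal length (cm)'], target=2, ax=ax2)
plt.tight_layout()
plt.show()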
Further Reading:
Szepannek, Gero, and Karsten Lübke. "How much do we see? On the explainability of partial dependence plots for credit risk scoring." (2023). https://dbc.wroc.pl/Content/121371/Szepannek_Lubke_How_much_do_we_see.pdf
Double Lift Plots
Double lift plots are an important tool in that they let you directly compare the performance of two models. Lift plots tell you the fraction of positive instances correctly captured at any given percentile of the dataset. Many of the available out-of-the-box packages only work with binary class problems, so for the sake of simplicity the example is set up as a one-vs-rest problem to see how each model predicts class 0 versus any other class.
These can be structured in a number of different ways, including sorting your predictions into deciles to see performance at different predictive strengths (a decile-based sketch follows at the end of this section). Below is an example with the out-of-the-box lift curve function available in scikit-plot, which simply works through the samples in order of predicted probability.
import scikitplot as skplt
# Recast as a binary problem: label 1 = class 0, label 0 = any other class,
# with probability columns ordered as [P(other), P(class 0)]
model_one_probas = [[1 - x[0], x[0]] for x in
                    model_one.predict_proba(x_test[sepal_features])]
model_two_probas = [[1 - x[0], x[0]] for x in model_two.predict_proba(x_test)]
updated_ytest = [1 if x == 0 else 0 for x in y_test]
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 10))
skplt.metrics.plot_lift_curve(updated_ytest,model_one_probas, ax = ax1);
skplt.metrics.plot_lift_curve(updated_ytest,model_two_probas, ax = ax2);
plt.show();
While this set-up is not exactly how it would be done in a true evaluation due to the multiclass framing, it's evident that model two performs better on both the focus class and the other classes compared to model one. In double lift plots, the higher the curve, the better the performance. These curves allow a comparison of the reliability of predictions as well as an assessment of performance at specific probability cutoffs (depending on how the x axis is set up).
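For a more traditional double lift view, records can be sorted by the ratio of the two models' predictions and bucketed into deciles, then each model's average predicted probability is compared with the actual rate in each bucket. A minimal sketch under those assumptions, reusing the class 0 probabilities and a reasonably recent polars version; with only 30 test rows the buckets are tiny, so this is purely illustrative:
# Build a frame of class 0 probabilities from each model alongside the actual outcome
lift_df = pl.DataFrame({
    'pred_one': [x[0] for x in model_one.predict_proba(x_test[sepal_features])],
    'pred_two': [x[0] for x in model_two.predict_proba(x_test)],
    'actual': updated_ytest,
})
# Sort by the ratio of the two predictions and split the sorted rows into deciles
lift_df = lift_df.with_columns((pl.col('pred_one') / (pl.col('pred_two') + 1e-9)).alias('ratio'))
lift_df = lift_df.sort('ratio').with_row_index('row')
lift_df = lift_df.with_columns((pl.col('row') * 10 // lift_df.height).alias('decile'))
# The model whose average prediction tracks the actual rate more closely wins each decile
summary = lift_df.group_by('decile').agg(pl.col('pred_one').mean(),
                                         pl.col('pred_two').mean(),
                                         pl.col('actual').mean()).sort('decile')
print(summary)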
Further Reading:
Casualty Actuarial Society, "And The Winner Is…? How to Pick a Better Model." https://www.casact.org/sites/default/files/presentation/rpm_2016_presentations_pm-lm-4.pdf
Conclusion
In this tutorial, we built two models on the Iris dataset, one using only the sepal features and one using all features, and compared them using a range of evaluation techniques, from confusion matrices and ROC curves to SHAP plots, partial dependence plots and lift plots. This is not an exhaustive list of strategies, but it should be an excellent starting point for model comparison.
Featured Image created with Adobe Firefly.
[1] Fisher, R. A. (1988). Iris. UCI Machine Learning Repository. https://doi.org/10.24432/C56C76.