
Interpretable Machine Learning

with Python
Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a


retrieval system, or transmitted in any form or by any means, without the
prior written permission of the publisher, except in the case of brief
quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the
accuracy of the information presented. However, the information contained
in this book is sold without warranty, either express or implied. Neither the
author, nor Packt Publishing, and its dealers and distributors will be held
liable for any damages caused or alleged to be caused directly or indirectly
by this book.

Packt Publishing has endeavored to provide trademark information about all


of the companies and products mentioned in this book by the appropriate
use of capitals. However, Packt Publishing cannot guarantee the accuracy of
this information.

Early Access Publication: Interpretable Machine Learning with Python

Early Access Production Reference: B18406

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK
ISBN: 978-1-80323-542-4

www.packt.com
Table of Contents
1. Interpretable Machine Learning with Python, Second Edition: Build
Your Own Interpretable Models
2. 1 Interpretation, Interpretability, and Explainability; and Why Does It
All Matter?
I. Join our book community on Discord
II. Technical requirements
III. What is machine learning interpretation?
i. Understanding a simple weight prediction model
IV. Understanding the difference between interpretability and
explainability
i. What is interpretability?
ii. What is explainability?
V. A business case for interpretability
i. Better decisions
ii. More trusted brands
iii. More ethical
iv. More profitable
VI. Summary
VII. Image sources
VIII. Further reading
3. 2 Key Concepts of Interpretability
I. Join our book community on Discord
II. Technical requirements
III. The mission
i. Details about CVD
IV. The approach
V. Preparations
i. Loading the libraries
ii. Understanding and preparing the data
VI. Learning about interpretation method types and scopes
i. Model interpretability method types
ii. Model interpretability scopes
iii. Interpreting individual predictions with logistic regression
VII. Appreciating what hinders machine learning interpretability
i. Non-linearity
ii. Interactivity
iii. Non-monotonicity
VIII. Mission accomplished
IX. Summary
X. Further reading
4. 3 Interpretation Challenges
I. Join our book community on Discord
II. Technical requirements
III. The mission
IV. The approach
V. The preparations
i. Loading the libraries
ii. Understanding and preparing the data
VI. Reviewing traditional model interpretation methods
i. Predicting minutes delayed with various regression methods
ii. Classifying flights as delayed or not delayed with various
classification methods
VII. Understanding limitations of traditional model interpretation
methods
VIII. Studying intrinsically interpretable (white-box) models
i. Generalized Linear Models (GLMs)
ii. Decision trees
iii. RuleFit
iv. Nearest neighbors
v. Naïve Bayes
IX. Recognizing the trade-off between performance and
interpretability
i. Special model properties
ii. Assessing performance
X. Discovering newer interpretable (glass-box) models
i. Explainable Boosting Machine (EBM)
ii. GAMI-Net
XI. Mission accomplished
XII. Summary
XIII. Dataset sources
XIV. Further reading
5. 5 Local Model-Agnostic Interpretation Methods
I. Join our book community on Discord
II. Technical requirements
III. The mission
IV. The approach
V. The preparations
i. Loading the libraries
ii. Understanding and preparing the data
VI. Leveraging SHAP's KernelExplainer for local interpretations with
SHAP values
i. Training a C-SVC model
ii. Computing SHAP values using KernelExplainer
iii. Local interpretation for a group of predictions using decision
plots
iv. Local interpretation for a single prediction at a time using a
force plot
VII. Employing LIME
i. What is LIME?
ii. Local interpretation for a single prediction at a time using
LimeTabularExplainer
VIII. Using LIME for NLP
i. Training a LightGBM model
ii. Local interpretation for a single prediction at a time using
LimeTextExplainer
IX. Trying SHAP for NLP
X. Comparing SHAP with LIME
XI. Mission accomplished
XII. Summary
XIII. Dataset sources
XIV. Further reading
6. 6 Anchor and Counterfactual Explanations
I. Join our book community on Discord
II. Technical requirements
III. The mission
i. Unfair bias in recidivism risk assessments
IV. The approach
V. The preparations
i. Loading the libraries
ii. Understanding and preparing the data
VI. Understanding anchor explanations
i. Preparations for anchor and counterfactual explanations with
alibi
ii. Local interpretations for anchor explanations
VII. Exploring counterfactual explanations
i. Counterfactual explanations guided by prototypes
ii. Counterfactual instances and much more with the What-If
Tool (WIT)
VIII. Mission accomplished
IX. Summary
X. Dataset sources
XI. Further reading
7. 9 Interpretation Methods for Multivariate Forecasting and Sensitivity
Analysis
I. Join our book community on Discord
II. Technical requirements
III. The mission
IV. The approach
V. The preparation
i. Loading the libraries
ii. Understanding and preparing the data
VI. Assessing time series models with traditional interpretation
methods
VII. Generating LSTM attributions with integrated gradients
VIII. Computing global and local attributions with SHAP's
KernelExplainer
i. Why use the KernelExplainer?
ii. Defining a strategy to get it to work with a multivariate time
series model
iii. Laying the groundwork for the permutation approximation
strategy
iv. Computing the SHAP values
IX. Identifying influential features with factor prioritization
i. Computing Morris sensitivity indices
ii. Analyzing the elementary effects
X. Quantifying uncertainty and cost sensitivity with factor fixing
i. Generating and predicting on Saltelli samples
ii. Performing Sobol sensitivity analysis
iii. Incorporating a realistic cost function
XI. Mission accomplished
XII. Summary
XIII. Dataset and image sources
XIV. References
8. 10 Feature Selection and Engineering for Interpretability
I. Join our book community on Discord
II. Technical requirements
III. The mission
IV. The approach
V. The preparations
i. Loading the libraries
ii. Understanding and preparing the data
VI. Understanding the effect of irrelevant features
i. Creating a base model
ii. Evaluating the model
iii. Training the base model at different max depths
VII. Reviewing filter-based feature selection methods
i. Basic filter-based methods
ii. Correlation filter-based methods
iii. Ranking filter-based methods
iv. Comparing filter-based methods
VIII. Exploring embedded feature selection methods
IX. Discovering wrapper, hybrid, and advanced feature selection
methods
i. Wrapper methods
X. Hybrid methods
i. Advanced methods
ii. Evaluating all feature-selected models
XI. Considering feature engineering
XII. Mission accomplished
XIII. Summary
XIV. Dataset sources
XV. Further reading
9. 14 What's Next for Machine Learning Interpretability?
I. Join our book community on Discord
II. Understanding the current landscape of ML interpretability
i. Tying everything together!
ii. Current trends
III. Speculating on the future of ML interpretability
i. A new vision for ML
ii. A multidisciplinary approach
iii. Adequate standardization
iv. Enforcing regulation
v. Seamless machine learning automation with built-in
interpretation
vi. Tighter integration with MLOps engineers
IV. Further reading
Interpretable Machine Learning
with Python, Second Edition: Build
Your Own Interpretable Models
Welcome to Packt Early Access. We’re giving you an exclusive preview
of this book before it goes on sale. It can take many months to write a book,
but our authors have cutting-edge information to share with you today.
Early Access gives you an insight into the latest developments by making
chapter drafts available. The chapters may be a little rough around the edges
right now, but our authors will update them over time.

You can dip in and out of this book or follow along from start to finish; Early
Access is designed to be flexible. We hope you enjoy getting to know more
about the process of writing a Packt book.

1. Chapter 1: Interpretation, Interpretability, and Explainability; and Why Does It All Matter?
2. Chapter 2: Key Concepts of Interpretability
3. Chapter 3: Interpretation Challenges
4. Chapter 4: Global Model-agnostic Interpretation Methods
5. Chapter 5: Local Model-agnostic Interpretation Methods
6. Chapter 6: Anchor and Counterfactual Explanations
7. Chapter 7: Visualizing Convolutional Neural Networks
8. Chapter 8: Understanding NLP Transformers
9. Chapter 9: Interpretation Methods for Multivariate Forecasting and
Sensitivity Analysis
10. Chapter 10: Feature Selection and Engineering for Interpretability
11. Chapter 11: Bias Mitigation and Causal Inference Methods
12. Chapter 12: Monotonic Constraints and Model Tuning for Interpretability
13. Chapter 13: Adversarial Robustness
14. Chapter 14: What's Next for Machine Learning Interpretability?
1 Interpretation, Interpretability, and
Explainability; and Why Does It All Matter?
Join our book community on Discord
https://packt.link/EarlyAccessCommunity

We live in a world whose rules and procedures are ever-increasingly governed by data and
algorithms.

For instance, there are rules as to who gets approved for credit or released on bail, and which
social media posts might get censored. There are also procedures to determine which marketing
tactics are most effective and which chest x-ray features might diagnose a positive case of
pneumonia.

You expect this because it is nothing new!

But not so long ago, rules and procedures such as these used to be hardcoded into software,
textbooks, and paper forms, and humans were the ultimate decision-makers. Often, it was
entirely up to human discretion. Decisions depended on human discretion because rules and
procedures were rigid and, therefore, not always applicable. There were always exceptions, so
a human was needed to make them.

For example, if you applied for a mortgage, your approval depended on an acceptable and
reasonably lengthy credit history. This data, in turn, would produce a credit score using a
scoring algorithm. Then, the bank had rules that determined what score was good enough for
the mortgage you wanted. Your loan officer could follow it or not.

These days, financial institutions train models on thousands of mortgage outcomes, with
dozens of variables. These models can be used to determine the likelihood that you would
default on a mortgage with presumed high accuracy. Even if there is a loan officer to stamp the
approval or denial, it's no longer merely a guideline but an algorithmic decision. How could it
be wrong? How could it be right?

Hold on to that thought because, throughout this book, we will be learning the answers to these
questions and many more!

To interpret decisions made by a machine learning model is to find meaning in them, but
furthermore, to trace them back to their source and the process that transformed them. This chapter
introduces machine learning interpretation and related concepts such as interpretability,
explainability, black-box models, and transparency. This chapter provides definitions for these
terms to avoid ambiguity and underpins the value of machine learning interpretability. These
are the main topics we are going to cover:

What is machine learning interpretation?


Understanding the difference between interpretability and explainability
A business case for interpretability

Let's get started!

Technical requirements
To follow the example in this chapter, you will need Python 3, either running in a Jupyter
environment or in your favorite integrated development environment (IDE) such as
PyCharm, Atom, VSCode, PyDev, or Idle. The example also requires the requests , bs4 ,
pandas , sklearn , matplotlib , and scipy Python libraries. The code for this chapter is
located here: https://github.com/PacktPublishing/Interpretable-Machine-Learning-with-
Python/tree/master/Chapter01.

What is machine learning interpretation?


To interpret something is to explain the meaning of it. In the context of machine learning, that
something is an algorithm. More specifically, that algorithm is a mathematical one that takes
input data and produces an output, much like with any formula.

Let's examine the most basic of models, simple linear regression, illustrated in the following
formula:

ŷ = β₀ + β₁x₁

Once fitted to the data, the meaning of this model is that ŷ predictions are a weighted sum of
the x features with the β coefficients. In this case, there's only one x feature or predictor
variable, and the y variable is typically called the response or target variable. A simple linear
regression formula single-handedly explains the transformation, which is performed on the
input data x1 to produce the output ŷ. The following example can illustrate this concept in
further detail.

Understanding a simple weight prediction model

If you go to this web page maintained by the University of California,


http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights, you can
find a link to download a dataset of 25,000 synthetic records of weights and heights of 18-year-
olds. We won't use the entire dataset but only the sample table on the web page itself with 200
records. We scrape the table from the web page and fit a linear regression model to the data.
The model uses the height to predict the weight.

In other words, x₁ = height and y = weight, so the formula for the linear regression model
would be as follows:

ŷ = β₀ + β₁x₁

where x₁ is the height in inches and ŷ is the predicted weight in pounds.

You can find the code for this example here: https://github.com/PacktPublishing/Interpretable-
Machine-Learning-with-Python/blob/master/Chapter01/WeightPrediction.ipynb.

To run this example, you need to install the following libraries:

requests to fetch the web page


bs4 (Beautiful Soup) to scrape the table from the web page
pandas to load the table into a dataframe
sklearn (scikit-learn) to fit the linear regression model and calculate its error
matplotlib to visualize the model
scipy to test the correlation

You should load all of them first, as follows:


import math
import requests
from bs4 import BeautifulSoup
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

Once the libraries are all loaded, you use requests to fetch the contents of the web page, like
this:
url = \
'http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights'
page = requests.get(url)

Then, take these contents and scrape out just the contents of the table with BeautifulSoup , as
follows:
soup = BeautifulSoup(page.content, 'html.parser')
tbl = soup.find("table",{"class":"wikitable"})

pandas can turn the raw HyperText Markup Language ( HTML ) contents of the table into a
dataframe, as illustrated here:
height_weight_df = pd.read_html(str(tbl))[0]\
[['Height(Inches)','Weight(Pounds)']]
And voilà! We now have a dataframe with Height(Inches) in one column and
Weight(Pounds) in another.
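
As an optional sanity check (a quick sketch using the dataframe we just created), you can confirm the shape and peek at the first few rows:
print(height_weight_df.shape)
print(height_weight_df.head())
This should show (200, 2) and the first five height/weight pairs.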

Now that we have the data, we must transform it so that it conforms to the model's
specifications. sklearn needs it as NumPy arrays with (200,1) dimensions, so we must first
extract the Height(Inches) and Weight(Pounds) pandas Series. Then, we turn them into
(200,) NumPy arrays, and, finally, reshape them into (200,1) dimensions. The following
commands perform all the necessary transformation operations:
num_records = height_weight_df.shape[0]  # number of rows (200)
x = height_weight_df['Height(Inches)'].values.\
reshape(num_records, 1)
y = height_weight_df['Weight(Pounds)'].values.\
reshape(num_records, 1)

Then, you initialize the scikit-learn LinearRegression model and fit it with the training
data, as follows:
model = linear_model.LinearRegression().fit(x,y)

To output the fitted linear regression model formula in scikit-learn, you must extract the
intercept and coefficients. This is the formula that explains how it makes predictions:
print("ŷ =" + str(model.intercept_[0]) + " + " +\
str(model.coef_.T[0][0]) + " x₁")

The following is the output:


ŷ = -106.02770644878132 + 3.432676129271629 x1

This tells us that, on average, for every additional inch of height, the model predicts an additional 3.4 pounds of weight.
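
For example, plugging a hypothetical height of 68 inches into the fitted formula (rounding the coefficients) gives the following back-of-the-envelope prediction:

ŷ ≈ -106.03 + 3.43 × 68 ≈ 127 pounds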

However, explaining how the model works is only one way to explain this linear regression
model, and this is only one side of the story. The model isn't perfect because the actual
outcomes and the predicted outcomes are not the same for the training data. The difference
between both is the error or residuals.

There are many ways of understanding an error in a model. You can use an error function such
as mean_absolute_error to measure the deviation between the predicted values and the
actual values, as illustrated in the following code snippet:
y_pred = model.predict(x)
mae = mean_absolute_error(y, y_pred)
print(mae)

A 7.8 mean absolute error means that, on average, the prediction is 7.8 pounds from the actual
amount, but this might not be intuitive or informative. Visualizing the linear regression model
can shed some light on how accurate these predictions truly are.

This can be done by using a matplotlib scatterplot and overlaying the linear model (in blue)
and the mean absolute error (as two parallel bands in gray), as shown in the following code
snippet:
plt.scatter(x, y, color='black')
plt.plot(x, y_pred, color='blue', linewidth=3)
plt.plot(x, y_pred + mae, color='lightgray')
plt.plot(x, y_pred - mae, color='lightgray')

If you run the preceding snippet, the plot shown here in Figure 1.1 is what you get as the
output:

Figure 1.1 – Linear regression model to predict weight based on height

As you can appreciate from the plot in Figure 1.1, there are many times in which the actuals are
20–25 pounds away from the prediction. Yet the mean absolute error can fool you into thinking
that the error is always closer to 8. This is why it is essential to visualize the error of the model
to understand its distribution. Judging from this graph, we can tell that there are no red flags
that stand out about this distribution, such as residuals being more spread out for one range of
heights than for others. Since it is more or less equally spread out, we say it's homoscedastic.
In the case of linear regression, this is one of many model assumptions you should test for,
along with linearity, normality, independence, and lack of multicollinearity (if there's more than
one feature). These assumptions ensure that you are using the right model for the job. In other
words, the height and weight can be explained with a linear relationship, and it is a good idea
to do so, statistically speaking.
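
As a minimal sketch of what such checks could look like using libraries already required for this chapter (the residuals variable and the Shapiro-Wilk normality test are used here purely for illustration), you could plot the residuals against height and test them for normality:
residuals = (y - y_pred).flatten()
# residuals should scatter evenly around zero across all heights (homoscedasticity)
plt.scatter(x, residuals, color='black')
plt.axhline(0, color='blue', linewidth=2)
plt.xlabel('Height(Inches)')
plt.ylabel('Residual (Pounds)')
plt.show()
# Shapiro-Wilk test for normality of the residuals
from scipy.stats import shapiro
stat, p_shapiro = shapiro(residuals)
print(p_shapiro > 0.05)
A True here would mean there's no evidence against normally distributed residuals; a fuller check would also cover independence and, with more features, multicollinearity.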

With this model, we are trying to establish a linear relationship between x height and y weight.
This association is called a linear correlation. One way to measure this relationship's strength
is with Pearson's correlation coefficient. This statistical method measures the association
between two variables using their covariance divided by their standard deviations. It is a
number between -1 and 1 whereby the closer the number it is to zero, the weaker the
association is. If the number is positive, there is a positive association, and if it's negative, there
is a negative one. In Python, you can compute Pearson's correlation coefficient with the
pearsonr function from scipy , as illustrated here:

corr, pval = pearsonr(x[:,0], y[:,0])


print(corr)

The following is the output:


0.5568647346122992

The number is positive, which is no surprise because as height increases, weight also tends to
increase, but it is also closer to 1 than to 0, denoting that it is strongly correlated. The second
number produced by the pearsonr function is the p-value for testing non-correlation. If we
test that it's less than an error level of 5%, we can say there's sufficient evidence of this
correlation, as illustrated here:
print(pval < 0.05)

It confirms with a True that it is statistically significant.

Understanding how a model performs and in which circumstances can help us explain why it
makes certain predictions, and when it cannot. Let's imagine we are asked to explain why
someone who is 71 inches tall was predicted to have a weight of 134 pounds but instead
weighed 18 pounds more. Judging from what we know about the model, this margin of error is
not unusual even though it's not ideal. However, there are many circumstances in which we
cannot expect this model to be reliable. What if we were asked to predict the weight of a
person who is 56 inches tall with the help of this model? Could we assure the same level of
accuracy? Definitely not, because we fit the model on the data of subjects no shorter than 63
inches. Ditto if we were asked to predict the weight of a 9-year-old, because the training data
was for 18-year-olds.

Despite the acceptable results, this weight prediction model was not a realistic example. If you
wanted to be more accurate but—more importantly—faithful to what can really impact the
weight of an individual, you would need to add more variables. You can add—say—gender,
age, diet, and activity level. This is where it gets interesting because you have to make sure it is
fair to include them, or not to include them. For instance, if gender were included yet most
of our dataset was composed of males, how could you ensure accuracy for females? This is
what is called selection bias. And what if weight had more to do with lifestyle choices and
circumstances such as poverty and pregnancy than gender? If these variables aren't included,
this is called omitted variable bias. And then, does it make sense to include the sensitive
gender variable at the risk of adding bias to the model?

Once you have multiple features that you have vetted for fairness, you can find out and explain
which features impact model performance. We call this feature importance. However, as we
add more variables, we increase the complexity of the model. Paradoxically, this is a problem
for interpretation, and we will explore this in further detail in the following chapters. For now,
the key takeaway should be that model interpretation has a lot to do with explaining the
following:

1. Can we explain that predictions were made fairly?


2. Can we trace the predictions reliably back to something or someone?
3. Can we explain how predictions were made? Can we explain how the model works?

And ultimately, the question we are trying to answer is this:

Can we trust the model?

The three main concepts of interpretable machine learning directly relate to the three preceding
questions and have the acronym of FAT, which stands for fairness, accountability, and
transparency. If you can explain that predictions were made without discernible bias, then
there is fairness. If you can explain why it makes certain predictions, then there's
accountability. And if you can explain how predictions were made and how the model works,
then there's transparency. There are many ethical concerns associated to these concepts, as
shown here in Figure 1.2:
Figure 1.2 – Three main concepts of Interpretable Machine Learning

Some researchers and companies have expanded FAT under a larger umbrella of ethical
AI, thus turning FAT into FATE. However, both concepts very much overlap since interpretable
machine learning is how FAT principles and ethical concerns get implemented in machine
learning. In this book, we will discuss ethics in this context. For instance, Chapter 13,
Adversarial Robustness relates to reliability, safety, and security. Chapter 11, Bias Mitigation
and Causal Inference Methods relates to fairness. That being said, interpretable machine
learning can be leveraged with no ethical aim in mind, and even for unethical reasons.

Understanding the difference between interpretability and


explainability
Something you've probably noticed when reading the first few pages of this book is that the
verbs interpret and explain, as well as the nouns interpretation and explanation, have been
used interchangeably. This is not surprising, considering that to interpret is to explain the
meaning of something. Despite that, the related terms interpretability and explainability should
not be used interchangeably, even though they are often mistaken for synonyms. Most
practitioners don't make any distinction and many academics reverse the definitions provided
in this book.
What is interpretability?

Interpretability is the extent to which humans, including non-subject-matter experts, can


understand the cause and effect, and input and output, of a machine learning model. To say a
model has a high level of interpretability means you can describe in a human-interpretable way
its inference. In other words, why does an input to a model produce a specific output? What are
the requirements and constraints of the input data? What are the confidence bounds of the
predictions? Or, why does one variable have a more substantial effect than another? For
interpretability, detailing how a model works is only relevant to the extent that it can explain its
predictions and justify that it's the right model for the use case.

In this chapter's example, you could explain that there's a linear relationship between human
height and weight, so using linear regression rather than a non-linear model makes sense. You
can prove this statistically because the variables involved don't violate the assumptions of
linear regression. Even when statistics are on our side, you still ought to consult with the
domain knowledge area involved in the use case. In this one, we rest assured, biologically
speaking, because our knowledge of human physiology doesn't contradict the connection
between height and weight.

Beware of complexity

Many machine learning models are inherently harder to understand simply because of the math
involved in the inner workings of the model or the specific model architecture. In addition to
this, many choices are made that can increase complexity and make the models less
interpretable, from dataset selection to feature selection and engineering, to model training and
tuning choices. This complexity makes explaining how it works a challenge. Machine learning
interpretability is a very active area of research, so there's still much debate on its precise
definition. The debate includes whether total transparency is needed to qualify a machine
learning model as sufficiently interpretable.

This book favors the understanding that the definition of interpretability shouldn't necessarily
exclude opaque models, which, for the most part, are complex, as long as the choices made
don't compromise their trustworthiness. This compromise is what is generally called post-hoc
interpretability. After all, much like a complex machine learning model, we can't explain
exactly how a human brain makes a choice, yet we often trust its decision because we can ask a
human for their reasoning. Post-hoc machine learning interpretation is exactly the same thing,
except it's a human explaining the reasoning on behalf of the model. Using this particular
concept of interpretability is advantageous because we can interpret opaque models and not
sacrifice the accuracy of our predictions. We will discuss this in further detail in Chapter 3,
Interpretation Challenges.

When does interpretability matter?

Decision-making systems don't always require interpretability. There are two cases that are
offered as exceptions in research, outlined here:
When incorrect results have no significant consequences. For instance, what if a machine
learning model is trained to find and read the postal code in a package, occasionally
misreads it, and sends it elsewhere? There's little chance of discriminatory bias, and the
cost of misclassification is relatively low. It doesn't occur often enough to magnify the
cost beyond acceptable thresholds.
When there are consequences, but these have been studied sufficiently and validated
enough in the real world to make decisions without human involvement. This is the case
with a traffic-alert and collision-avoidance system (TCAS), which alerts the pilot of
another aircraft that poses a threat of a mid-air collision.

On the other hand, interpretability is needed for these systems to have the following attributes:

Minable for scientific knowledge: Meteorologists have much to learn from a climate
model, but only if it's easy to interpret.
Reliable and safe: The decisions made by a self-driving vehicle must be debuggable so
that its developers can understand points of failure.
Ethical: A translation model might use gender-biased word embeddings that result in
discriminatory translations, but you must be able to find these instances easily to correct
them. However, the system must be designed in such a way that you can be made aware of
a problem before it is released to the public.
Conclusive and consistent: Sometimes, machine learning models may have incomplete
and mutually exclusive objectives—for instance, a cholesterol-control system may not
consider how likely a patient is to adhere to the diet or drug regimen, or there might be a
trade-off between one objective and another, such as safety and non-discrimination.

By explaining the decisions of a model, we can cover gaps in our understanding of the problem
—its incompleteness. One of the most significant issues is that given the high accuracy of our
machine learning solutions, we tend to increase our confidence level to a point where we think
we fully understand the problem. Then, we are misled into thinking our solution covers ALL
OF IT!

At the beginning of this book, we discussed how leveraging data to produce algorithmic rules is
nothing new. However, we used to second-guess these rules, and now we don't. Therefore, a
human used to be accountable, and now it's the algorithm. In this case, the algorithm is a
machine learning model that is accountable for all of the ethical ramifications this entails. This
switch has a lot to do with accuracy. The problem is that although a model may surpass human
accuracy in aggregate, machine learning models have yet to interpret their results like a human
would. Therefore, they don't second-guess their decisions, so as a solution they lack a desirable level
of completeness, and that's why we need to interpret models so that we can cover at least some
of that gap. So, why is machine learning interpretation not already a standard part of the
pipeline? In addition to our bias toward focusing on accuracy alone, one of the biggest
impediments is the daunting concept of black-box models.

What are black-box models?

This is just another term for opaque models. A black box refers to a system in which only the
inputs and outputs are observable, and you cannot see what is transforming the inputs into the
outputs. In the case of machine learning, a black-box model can be opened, but its mechanisms
are not easily understood.

What are white-box models?

These are the opposite of black-box models (see Figure 1.3). They are also known as
transparent because they achieve total or near-total interpretation transparency. We call them
intrinsically interpretable in this book, and we cover them in more detail in Chapter 3,
Interpretation Challenges.

Have a look at a comparison between the models here:

Figure 1.3 – Visual comparison between white- and black-box models

What is explainability?

Explainability encompasses everything interpretability is. The difference is that it goes deeper
on the transparency requirement than interpretability because it demands human-friendly
explanations for a model's inner workings and the model training process, and not just model
inference. Depending on the application, this requirement might extend to various degrees of
model, design, and algorithmic transparency. There are three types of transparency, outlined
here:

Model transparency: Being able to explain how a model is trained step by step. In the
case of our simple weight prediction model, we can explain how the optimization method
called ordinary least squares finds the β coefficient that minimizes errors in the model.
Design transparency: Being able to explain choices made, such as model architecture
and hyperparameters. For instance, we could justify these choices based on the size or
nature of the training data. If we were performing a sales forecast and we knew that our
sales had a seasonality of 12 months, this could be a sound parameter choice. If we had
doubts, we could always use some well-established statistical method to find the right
seasonality (see the sketch after this list).
Algorithmic transparency: Being able to explain automated optimizations such as grid
search for hyperparameters; but note that the ones that can't be reproduced because of
their random nature—such as random search for hyperparameter optimization, early
stopping, and stochastic gradient descent—make the algorithm non-transparent.
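
As a minimal sketch of one such seasonality check, assuming a hypothetical monthly sales series (generated here purely for illustration), you could compare the series against itself shifted by 12 months using the same pearsonr function introduced earlier:
import math
import random
from scipy.stats import pearsonr
# hypothetical monthly sales: a 12-month cycle plus noise (illustration only)
sales = [100 + 10 * math.sin(2 * math.pi * m / 12) + random.gauss(0, 2)
         for m in range(48)]
# correlation between the series and itself lagged by 12 months
corr12, _ = pearsonr(sales[:-12], sales[12:])
print(round(corr12, 2))
A value close to 1 would support a 12-month seasonality; in practice, you might instead inspect an autocorrelation plot or a seasonal decomposition from a time series library.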

Opaque models are called opaque simply because they lack model transparency, but for many
models this is unavoidable, however justified the model choice might be. In many scenarios,
even if you outputted the math involved in—say—training a neural network or a random
forest, it would raise more doubts than generate trust. There are at least a few reasons for this,
outlined here:

Not "statistically grounded": An opaque model training process maps an input to an


optimal output, leaving behind what appears to be an arbitrary trail of parameters. These
parameters are optimized to a cost function but are not grounded in statistical theory.
Uncertainty and non-reproducibility: When you fit a transparent model with the same
data, you always get the same results. On the other hand, opaque models are not equally
reproducible because they use random numbers to initialize their weights or to regularize
or optimize their hyperparameters, or make use of stochastic discrimination (such is the
case for Random Forest).
Overfitting and the curse of dimensionality: Many of these models operate in a high-
dimensional space. This doesn't elicit trust because it's harder to generalize on a larger
number of dimensions. After all, there's more opportunity to overfit a model, the more
dimensions you add.
Human cognition and the curse of dimensionality: Transparent models are often used
for smaller datasets with fewer dimensions, and even when they aren't, they use no more
dimensions than necessary. They also tend to not complicate the
interactions between these dimensions more than necessary. This lack of unnecessary
complexity makes it easier to visualize what the model is doing and its outcomes. Humans
are not very good at understanding many dimensions, so using transparent models tends to
make this much easier to understand.
Occam's razor: This is what is called the principle of simplicity or parsimony. It states
that the simplest solution is usually the right one. Whether true or not, humans also have a
bias for simplicity, and transparent models are known for— if anything—their simplicity.

Why and when does explainability matter?


Trustworthy and ethical decision-making is the main motivation for interpretability.
Explainability has additional motivations such as causality, transferability, and informativeness.
Therefore, there are many use cases in which total or nearly total transparency is valued, and
rightly so. Some of these are outlined here:

Scientific research: Reproducibility is essential to the scientific method. Also, using


statistically grounded optimization methods is especially desirable when causality needs
to be proven.
Clinical trials: These must also produce reproducible findings and be statistically
grounded. In addition to this, given the potential gravity of overfitting, they must use the
fewest dimensions possible and models that don't complicate them.
Consumer product safety testing: Much as with clinical trials, when life-and-death
safety is a concern, simplicity is preferred whenever possible.
Public policy and law: This is a more nuanced discussion, as part of what is called by law
scholars algorithmic governance, and they have distinguished between fishbowl
transparency and reasoned transparency. The former is closer to the rigor required for
consumer product safety testing, and the latter is one where post-hoc interpretability
would suffice. One day, the government could be entirely run by algorithms. When that
happens, it's hard to tell which policies will align with which form of transparency, but
there are many areas of public policy, such as criminal justice, where absolute
transparency is necessary. However, whenever total transparency contradicts privacy or
security objectives, a less rigorous form of transparency would have to make do.
Criminal investigation and regulatory compliance audits: If something goes wrong,
such as an accident at a chemical factory caused by a robot malfunction or a crash by an
autonomous vehicle, an investigator needs to trace the decision trail. This is to "facilitate
the assignment of accountability and legal liability". Even when no accident has
happened, this kind of auditing can be performed when mandated by authorities.
Compliance auditing applies to industries that are regulated, such as financial services,
utilities, transportation, and healthcare. In many cases, fishbowl transparency is preferred.

A business case for interpretability


This section describes several practical business benefits for machine learning interpretability,
such as better decisions, as well as being more trusted, ethical, and profitable.

Better decisions

Typically, machine learning models are trained and then evaluated against the desired metrics.
If they pass quality control against a hold-out dataset, they are deployed. However, once tested
in the real world, that's when things can get wild, as in the following hypothetical scenarios:

A high-frequency trading algorithm could single-handedly crash the stock market.


Hundreds of smart home devices might inexplicably burst into unprompted laughter,
terrifying their users.
License-plate recognition systems could incorrectly read a new kind of license plate and
fine the wrong drivers.
A racially biased surveillance system could incorrectly detect an intruder, and because of
this guards shoot an innocent office worker.
A self-driving car could mistake snow for a pavement, crash into a cliff, and injure
passengers.

Any system is prone to error, so this is not to say that interpretability is a cure-all. However,
focusing on just optimizing metrics can be a recipe for disaster. In the lab, the model might
generalize well, but if you don't know why the model is making the decisions, then you can
miss on an opportunity for improvement. For instance, knowing what the self-driving car
thinks is a road is not enough, but knowing why could help improve the model. If, say, one of
the reasons was that road is light-colored like the snow, this could be dangerous. Checking the
model's assumptions and conclusions can lead to an improvement in the model by introducing
winter road images into the dataset or feeding real-time weather data into the model. Also, if
this doesn't work, maybe an algorithmic fail-safe can stop it from acting on a decision that it's
not entirely confident about.

One of the main reasons why a focus on machine learning interpretability leads to better
decision-making was mentioned earlier when we talked about completeness. If you think a
model is complete, what is the point of making it better? Furthermore, if you don't question the
model's reasoning, then your understanding of the problem must be complete. If this is the
case, perhaps you shouldn't be using machine learning to solve the problem in the first place!
Machine learning creates an algorithm that would otherwise be too complicated to program in
if-else statements, precisely to be used for cases where our understanding of the problem is
incomplete!

It turns out that when we predict or estimate something, especially with a high level of
accuracy, we think we control it. This is what is called the illusion of control bias. We can't
underestimate the complexity of a problem just because, in aggregate, the model gets it right
almost all the time. Even for a human, the difference between snow and concrete pavement can
be blurry and difficult to explain. How would you even begin to describe this difference in such
a way that it is always accurate? A model can learn these differences, but it doesn't make it any
less complex. Examining a model for points of failure and continuously being vigilant for
outliers requires a different outlook, whereby we admit that we can't control the model but we
can try to understand it through interpretation.

The following are some additional decision biases that can adversely impact a model, and serve
as reasons why interpretability can lead to better decision-making:

Conservatism bias: When we get new information, we don't change our prior beliefs.
With this bias, entrenched pre-existing information trumps new information, but models
ought to evolve. Hence, an attitude that values questioning prior assumptions is a healthy
one to have.
Salience bias: Some prominent or more visible things may stand out more than others, but
statistically speaking, they should get the same attention as the others. This bias could inform our
choice of features, so an interpretability mindset can expand our understanding of a
problem to include other less perceived features.
Fundamental attribution error: This bias causes us to attribute outcomes to behavior
rather than circumstances, character rather than situations, nature rather than nurture.
Interpretability asks us to explore deeper and look for the less obvious relationships
between our variables or those that could be missing.

One crucial benefit of model interpretation is locating outliers. These outliers could be a
potential new source of revenue or a liability waiting to happen. Knowing this can help us to
prepare and strategize accordingly.

More trusted brands

Trust is defined as a belief in the reliability, ability, or credibility of something or someone. In


the context of organizations, trust is their reputation; and in the unforgiving court of public
opinion, all it takes is one accident, controversy, or fiasco to lose substantial amounts of public
confidence. This, in turn, can cause investor confidence to wane.

Let's consider what happened to Boeing after the 737 MAX debacle or Facebook after the 2016
presidential election scandal. In both cases, there were short-sighted decisions solely made to
optimize a single metric, be it forecasted plane sales or digital ad sales. These underestimated
known potential points of failure and missed out entirely on very big ones.

And these were examples of, for the most part, decisions made by people. With decisions made
exclusively by machine learning models, this could get worse because it is easy to drop the ball
and keep the accountability in the model's corner. For instance, if you started to see offensive
material in your Facebook feed, Facebook could say it's because its model was trained with
your data such as your comments and likes, so it's really a reflection of what you want to see.
Not their fault—your fault. If the police targeted your neighborhood for aggressive policing
because it uses PredPol, an algorithm that predicts where and when crimes will occur, it could
blame the algorithm. On the other hand, the makers of this algorithm could blame the police
because the software is trained on their police reports. This generates a potentially troubling
feedback loop, not to mention an accountability gap. And if some pranksters or hackers place
strange textured meshes into a highway (see https://arxiv.org/pdf/2101.06784.pdf), this could
cause a Tesla self-driving car to veer into the wrong lane. Is it Tesla's fault for not anticipating
this possibility, or the hackers' fault for throwing a monkey wrench into their model? This
is what is called an adversarial attack, and we discuss this in Chapter 13, Adversarial
Robustness.

It is undoubtedly one of the goals of machine learning interpretability to make models better at
making decisions. But even when they fail, you can show that you tried. Trust is not lost
entirely because of the failure itself but because of the lack of accountability, and even in cases
where it is not fair to accept all the blame, some accountability is better than none. For
instance, in the previous set of examples, Facebook could look for clues as to why offensive
material is shown more often, then commit to finding ways to make it happen less even if this
means making less money. PredPol could find other sources of crime-rate datasets that are
potentially less biased, even if they are smaller. They could also use techniques to mitigate bias
in existing datasets (these are covered in Chapter 11, Bias Mitigation and Causal Inference
Methods). And Tesla could audit its systems for adversarial attacks, even if this delays
shipment of its cars. All of these are interpretability solutions. Once a common practice, they
can lead to an increase in not only public trust—be it from users and customers, but also
internal stakeholders such as employees and investors.

Many public relations AI blunders have occurred over the past couple of years. Due to
trust issues, many AI-driven technologies are losing public support, to the detriment of both
the companies that monetize AI and the users that could benefit from them. Addressing this requires,
in part, a legal framework at a national or global level and, at the organizational end,
more accountability from those that deploy these technologies.

More ethical

There are three schools of thought for ethics: utilitarians focus on consequences, deontologists
are concerned with duty, and teleologicalists are more interested in overall moral character.
This means that there are different ways to examine ethical problems, and there are
useful lessons to draw from all of them. There are cases in which you want to produce the
greatest amount of "good", despite some harm being produced in the process. Other times,
ethical boundaries must be treated as lines in the sand you mustn't cross. And at other times, it's
about developing a righteous disposition, much like many religions aspire to do. Regardless of
the school of ethics we align with, our notion of what is ethical evolves with time because it mirrors
our current values. At this moment, in Western cultures, these values include the following:

Human welfare
Ownership and property
Privacy
Freedom from bias
Universal usability
Trust
Autonomy
Informed consent
Accountability
Courtesy
Environmental sustainability

Ethical transgressions are cases whereby you cross the moral boundaries that these values seek
to uphold, be it by discriminating against someone or polluting their environment, whether it's
against the law or not. Ethical dilemmas occur when you have a choice between options that
lead to transgressions, so you have to choose between one and another.

The first reason machine learning is related to ethics is because technologies and ethical
dilemmas have an intrinsically linked history.

Ever since the first widely adopted tool made by humans, technology has brought progress but also caused harm,
such as accidents, war, and job losses. This is not to say that technology is always bad but that
we lack the foresight to measure and control its consequences over time. In AI's case, it is not
clear what the harmful long-term effects are. What we can anticipate is that there will be a
major loss of jobs and an immense demand for energy to power our data centers, which could
put stress on the environment. There's speculation that AI could create an "algocratic"
surveillance state run by algorithms, infringing on values such as privacy, autonomy, and
ownership. Some readers might point to examples of this already happening.

The second reason is even more consequential than the first. It's that machine learning is a
technological first for humanity: machine learning is a technology that can make decisions for
us, and these decisions can produce individual ethical transgressions that are hard to trace. The
problem with this is that accountability is essential to morality because you have to know who
to blame for human dignity, atonement, closure, or criminal prosecution. However, many
technologies have accountability issues to begin with, because moral responsibility is often
shared in any case. For instance, maybe a car crash was partly due to the driver, the mechanic,
and the car manufacturer. The same can happen with a machine learning model,
except it gets trickier. After all, a model's programming has no programmer because the
"programming" was learned from data, and there are things a model can learn from data that
can result in ethical transgressions. Top among them are biases such as the following:

Sample bias: When your data, the sample, doesn't represent the environment accurately,
also known as the population
Exclusion bias: When you omit features or groups that could otherwise explain a critical
phenomenon with the data
Prejudice bias: When stereotypes influence your data, either directly or indirectly
Measurement bias: When faulty measurements distort your data

Interpretability comes in handy to mitigate bias, as seen in Chapter 11, Bias Mitigation and
Causal Inference Methods, or even place guardrails on the right features, which may be a
source of bias. This is covered in Chapter 12, Monotonic Constraints and Model Tuning for
Interpretability. As explained in this chapter, explanations go a long way in establishing
accountability, which is a moral imperative. Also, by explaining the reasoning behind models,
you can find ethical issues before they cause any harm. But there are even more ways in which
models' potentially worrisome ethical ramifications can be controlled for, and this has less to
do with interpretability and more to do with design. There are frameworks such as human-
centered design, value-sensitive design, and techno moral virtue ethics that can be used to
incorporate ethical considerations into every technological design choice. An article by Kirsten
Martin (https://doi.org/10.1007/s10551-018-3921-3) also proposes a specific framework for
algorithms. This book won't delve into algorithm design aspects too much, but for those readers
interested in the larger umbrella of ethical AI, this article is an excellent place to start. You can
see Martin's algorithm morality model in Figure 1.4 here:
Figure 1.4 – Martin's algorithm morality model

Organizations should take the ethics of algorithmic decision-making seriously because ethical
transgressions have monetary and reputation costs. But also, AI left to its own devices could
undermine the very values that sustain democracy and the economy that allows businesses to
thrive.

More profitable

As seen already in this section, interpretability improves algorithmic decisions, boosting trust
and mitigating ethical transgressions.

When you leverage previously unknown opportunities and mitigate threats such as accidental
failures through better decision-making, you can only improve the bottom line; and if you
increase trust in an AI-powered technology, you can only increase its use and enhance overall
brand reputation, which also has a beneficial impact on profits. On the other hand, as for
ethical transgressions, they can be there by design or by accident, but when they are
discovered, they adversely impact both profits and reputation.
When businesses incorporate interpretability into their machine learning workflows, it's a
virtuous cycle, and it results in higher profitability. In the case of a non-profit or governments,
profits might not be a motive. Still, finances are undoubtedly involved because lawsuits, lousy
decision-making, and tarnished reputations are expensive. Ultimately, technological progress is
contingent not only on the engineering and scientific skills and materials that make it possible
but also on its voluntary adoption by the general public.

Summary
Upon reading this chapter, you should now have a clear understanding of what machine
learning interpretation is and isn't, and recognize the importance of interpretability. In the next
chapter, we will learn what can make machine learning models so challenging to interpret, and
how you would classify interpretation methods in both category and scope.

Image sources
Martin, K. (2019). Ethical Implications and Accountability of Algorithms. Journal of
Business Ethics 160. 835–850. https://doi.org/10.1007/s10551-018-3921-3

Further reading
Lipton, Zachary (2017). The Mythos of Model Interpretability. ICML 2016 Human
Interpretability in Machine Learning Workshop https://doi.org/10.1145/3236386.3241340
Roscher, R., Bohn, B., Duarte, M.F. & Garcke, J. (2020). Explainable Machine Learning
for Scientific Insights and Discoveries. IEEE Access, 8, 42200-42216.
https://dx.doi.org/10.1109/ACCESS.2020.2976199
Doshi-Velez, F. & Kim, B. (2017). Towards A Rigorous Science of Interpretable Machine
Learning. http://arxiv.org/abs/1702.08608
Arrieta, A.B., Diaz-Rodriguez, N., Ser, J.D., Bennetot, A., Tabik, S., Barbado, A., García,
S., Gil-López, S., Molina, D., Benjamins, R., Chatila, R., & Herrera, F. (2020).
Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and
Challenges toward Responsible AI. https://arxiv.org/abs/1910.10045
Coglianese, C. & Lehr, D. (2019). Transparency and algorithmic governance.
Administrative Law Review, 71, 1-4. https://ssrn.com/abstract=3293008
Weller, Adrian. (2019) "Transparency: Motivations and Challenges". arXiv:1708.01870
[Cs]. http://arxiv.org/abs/1708.01870
2 Key Concepts of Interpretability
Join our book community on Discord
https://packt.link/EarlyAccessCommunity

This book covers many model interpretation methods: some produce metrics, other visuals, and some both; some
depict your model broadly and others granularly. In this chapter, we will learn about two methods, feature
importance and decision regions, as well as the taxonomies used to describe these methods. We will also detail
what elements hinder machine learning interpretability as a primer to what lies ahead.

The following are the main topics we are going to cover in this chapter:

Learning about interpretation method types and scopes


Appreciating what hinders machine learning interpretability

Technical requirements
Although we began the book with a "toy example," we will be leveraging real datasets throughout this book to be
used in specific interpretation use cases. These come from many different sources and are often used only once.

To avoid having readers spend a lot of time downloading, loading, and preparing datasets for single examples, there's
a library called mldatasets that takes care of most of this. Instructions on how to install this library are located in
the preface. In addition to mldatasets , this chapter's examples also use the pandas , numpy , statsmodels ,
sklearn , and matplotlib libraries. The code for this chapter is located here:
https://github.com/PacktPublishing/Interpretable-Machine-Learning-with-Python/tree/master/Chapter02.

The mission
Imagine you are an analyst for a national health ministry, and there's a Cardiovascular Diseases (CVDs)
epidemic. The minister has made it a priority to reverse the growth and reduce the case load to a 20-year low. To
this end, a task force has been created to find clues in the data to ascertain the following:

1. What risk factors can be addressed.


2. Whether future cases can be predicted and, if so, how to interpret those predictions on a case-by-case basis.

You are part of this task force!

Details about CVD

Before we dive into the data, we must gather some important details about CVD in order to do the following:

Understand the problem's context and relevance.


Extract domain knowledge information that can inform our data analysis and model interpretation.
Relate an expert-informed background to a dataset's features.
CVDs are a group of disorders, the most common of which is coronary heart disease (also known as Ischaemic
Heart Disease). According to the World Health Organization, CVD is the leading cause of death globally, killing
close to 18 million people annually. Coronary heart disease and strokes (which are, for the most part, a byproduct
of CVD) are the most significant contributors to that. It is estimated that 80% of CVD is made up of modifiable
risk factors. In other words, some of the preventable factors that cause CVD include the following:

Poor diet
Smoking and alcohol consumption habits
Obesity
Lack of physical activity
Poor sleep

There are also non-modifiable risk factors, which are therefore unavoidable, including the following:

Genetic predisposition
Old age
Male (varies with age)

We won't go into more domain-specific details about CVD because it is not required to make sense of the example.
However, it can't be stressed enough how central domain knowledge is to model interpretation. So, if this example
was your job and many lives depended on your analysis, it would be advisable to read the latest scientific research
on the subject or consult with domain experts to inform your interpretations.

The approach
Logistic regression is one common way to rank risk factors in medical use cases. Unlike linear regression, it
doesn't try to predict a continuous value for each of your observations, but it predicts a probability score that an
observation belongs to a particular class. In this case, what we are trying to predict is, given x data for each patient,
what is the y probability, from 0 to 1, that they have cardiovascular disease?
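To make this concrete, here is a minimal sketch (not part of this chapter's notebook) of the logistic, or sigmoid, function that maps any linear combination of the features to a probability between 0 and 1:
import numpy as np
def logistic(z):
    return 1 / (1 + np.exp(-z))
#The more negative z is, the closer the probability is to 0;
#the more positive it is, the closer it is to 1
print(logistic(-2), logistic(0), logistic(2)) #roughly 0.12, 0.5, 0.88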

Preparations
You will find the code for this example here: https://github.com/PacktPublishing/Interpretable-Machine-Learning-
with-Python/blob/master/Chapter02/CVD.ipynb.

Loading the libraries

To run this example, you need to install the following libraries:

mldatasets to load the dataset


pandas and numpy to manipulate it
statsmodels to fit the logistic regression model
sklearn (scikit-learn) to split the data
matplotlib to visualize the interpretations

You should load all of them first:


import math
import mldatasets
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

Understanding and preparing the data


The data to be used in this example should then be loaded into a DataFrame we call cvd_df :
cvd_df = mldatasets.load("cardiovascular-disease")

From this, you should be getting 70,000 records and 12 columns. We can take a peek at what was loaded with
info() :

cvd_df.info()

The preceding command will output the names of each column with its type and how many non-null records it
contains:
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 12 columns):
age 70000 non-null int64
gender 70000 non-null int64
height 70000 non-null int64
weight 70000 non-null float64
ap_hi 70000 non-null int64
ap_lo 70000 non-null int64
cholesterol 70000 non-null int64
gluc 70000 non-null int64
smoke 70000 non-null int64
alco 70000 non-null int64
active 70000 non-null int64
cardio 70000 non-null int64
dtypes: float64(1), int64(11)

The data dictionary

To understand what was loaded, the following is the data dictionary, as described in the source:

age : Of the patient in days (Objective Feature)


height : In centimeters (Objective Feature)
weight : In kg (Objective Feature)
gender : A binary where 1: female, 2: male (Objective Feature)
ap_hi : Systolic blood pressure, which is the arterial pressure exerted when blood is ejected during
ventricular contraction. Normal value: < 120 mmHg (Examination Feature)
ap_lo : Diastolic blood pressure, which is the arterial pressure in between heartbeats. Normal value: < 80
mmHg (Examination Feature)
cholesterol : An ordinal where 1: normal, 2: above normal, 3: well above normal (Examination Feature)
gluc : An ordinal where 1: normal, 2: above normal, 3: well above normal (Examination Feature)
smoke : A binary where 0: non-smoker, 1: smoker (Subjective Feature)
alco : A binary where 0: non-drinker, 1: drinker (Subjective Feature)
active : A binary where 0: non-active, 1: active (Subjective Feature)
cardio : A binary where 0: no CVD, 1: has CVD (Target Feature)

Data preparation

For the sake of interpretability and model performance, there are several data preparation tasks that we can take
care of, but the one that stands out right now is age . Age is not something we usually measure in days. In fact, for
health-related predictions like this one, we might even want to bucket them into age groups since people tend to
age differently. For now, we will convert all ages into years:
cvd_df['age'] = cvd_df['age'] / 365.24

The result is a more understandable column because we expect age values to be between 0 and 120. We took
existing data and transformed it. This is an example of feature engineering, which is when you use domain
knowledge of your data to create features that better represent your problem, thereby improving your models. We
will discuss this further in Chapter 11, Bias Mitigation and Causal Inference Methods. There's value in performing
feature engineering simply to make model outcomes more interpretable as long as this doesn't significantly hurt
model performance. As for the age column, it can't hurt performance because we haven't degraded the data: the
decimal portion of each year still preserves the original day-level precision.
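Although we won't do it in this chapter, if we did want age groups instead of a continuous age, a sketch with pandas could look like the following, where the bin edges are purely illustrative assumptions:
age_group = pd.cut(cvd_df['age'], bins=[0, 40, 50, 60, 120],\
                   labels=['under 40', '40-49', '50-59', '60+'])
print(age_group.value_counts())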

Now we are going to take a peek at the summary statistics for each one of our features using the
describe() method:

cvd_df.describe().transpose()

Figure 2.1 shows the summary statistics outputted by the preceding code. In Figure 2.1, age is looking good
because it ranges between 29 and 65 years, which is not out of the ordinary, but there are some anomalous outliers
for ap_hi and ap_lo . Blood pressure can't be negative, and the highest ever recorded was 370. These records will
have to be dropped because they could lead to poor model performance and interpretability:
cvd_df = cvd_df[(cvd_df['ap_lo'] <= 370) &\
(cvd_df['ap_lo'] > 0)].reset_index(drop=True)
cvd_df = cvd_df[(cvd_df['ap_hi'] <= 370) &\
(cvd_df['ap_hi'] > 0)].reset_index(drop=True)

Figure 2.1 – Summary statistics for the dataset

For good measure, we ought to make sure that ap_hi is always higher than ap_lo , so any record with that
discrepancy should also be dropped:
cvd_df = cvd_df[cvd_df['ap_hi'] >=\
cvd_df['ap_lo']].reset_index(drop=True)

Now, in order to fit a logistic regression model, we must put all objective, examination, and subjective features
together as X and the target feature alone as y. After this, you split the X and y into training and test datasets, but
make sure to include random_state for reproducibility:
y = cvd_df['cardio']
X = cvd_df.drop(['cardio'], axis=1).copy()
X_train, X_test, y_train, y_test =\
train_test_split(X, y, test_size=0.15, random_state=9)

Learning about interpretation method types and scopes


Now that we have prepared our data and split it into training/test datasets, we can fit the model using the training
data and print a summary of the results:
log_model = sm.Logit(y_train, sm.add_constant(X_train))
log_result = log_model.fit()
print(log_result.summary2())

Printing summary2 on the fitted model produces the following output:


Optimization terminated successfully.
Current function value: 0.561557
Iterations 6
Results: Logit
=================================================================
Model: Logit Pseudo R-squared: 0.190
Dependent Variable: cardio AIC: 65618.3485
Date: 2020-06-10 09:10 BIC: 65726.0502
No. Observations: 58404 Log-Likelihood: -32797.
Df Model: 11 LL-Null: -40481.
Df Residuals: 58392 LLR p-value: 0.0000
Converged: 1.0000 Scale: 1.0000
No. Iterations: 6.0000
-----------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
-----------------------------------------------------------------
const -11.1730 0.2504 -44.6182 0.0000 -11.6638 -10.6822
age 0.0510 0.0015 34.7971 0.0000 0.0482 0.0539
gender -0.0227 0.0238 -0.9568 0.3387 -0.0693 0.0238
height -0.0036 0.0014 -2.6028 0.0092 -0.0063 -0.0009
weight 0.0111 0.0007 14.8567 0.0000 0.0096 0.0125
ap_hi 0.0561 0.0010 56.2824 0.0000 0.0541 0.0580
ap_lo 0.0105 0.0016 6.7670 0.0000 0.0075 0.0136
cholesterol 0.4931 0.0169 29.1612 0.0000 0.4600 0.5262
gluc -0.1155 0.0192 -6.0138 0.0000 -0.1532 -0.0779
smoke -0.1306 0.0376 -3.4717 0.0005 -0.2043 -0.0569
alco -0.2050 0.0457 -4.4907 0.0000 -0.2945 -0.1155
active -0.2151 0.0237 -9.0574 0.0000 -0.2616 -0.1685
=================================================================

The preceding summary helps us to understand which X features contributed the most to the y CVD diagnosis
using the model coefficients (labeled Coef. in the table). Much like with linear regression, they act as weights
applied to every predictor. However, the linear combination is passed through a logistic function to produce the
prediction, which makes the interpretation more difficult. We explain this function further in Chapter 3, Interpretation Challenges.

By looking at the coefficients alone, you can only tell that the features with the highest absolute values are cholesterol and active ,
but it's not very intuitive in terms of what this means. A more interpretable way of looking at these values is
revealed once you calculate the exponential of these coefficients:
np.exp(log_result.params).sort_values(ascending=False)

The preceding code outputs the following:


cholesterol 1.637374
ap_hi 1.057676
age 1.052357
weight 1.011129
ap_lo 1.010573
height 0.996389
gender 0.977519
gluc 0.890913
smoke 0.877576
alco 0.814627
active 0.806471
const 0.000014
dtype: float64

Why the exponential? The coefficients are the log odds, which are the logarithms of the odds. Also, odds are the
probability of a positive case over the probability of a negative case, where the positive case is the phenomenon
we are trying to predict. It doesn't necessarily indicate what is favored by anyone. For instance, if we are trying to
predict the odds of rain today, the positive case would be that it rained, regardless of whether you predicted rain or
not. Odds are often expressed as a ratio. The news could say the probability of rain today is 60% or say the odds of
rain are 3:2 or 3/2 = 1.5. In log odds form, this would be roughly 0.405, which is the natural logarithm of 1.5. They are basically
the same thing, but expressed differently. The exponential function is the inverse of the natural logarithm, so it can take any
log odds and return the odds.
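Here's a quick sketch (not part of the chapter's notebook) of that back-and-forth using the rain example, keeping in mind that logistic regression coefficients are natural log odds:
odds = 0.6 / 0.4            #probability of rain over probability of no rain = 1.5
log_odds = np.log(odds)     #roughly 0.405
print(np.exp(log_odds))     #the exponential recovers the odds: 1.5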

Back to our CVD case. Now that we have the odds, we can interpret what it means. For example, what do the odds
mean in the case of cholesterol? It means that the odds of CVD increase by a factor of 1.64 for each additional unit
of cholesterol, provided every other feature stays unchanged. Being able to explain the impact of a feature on the
model in such tangible terms is one of the advantages of an intrinsically interpretable model such as logistic
regression.

Although the odds provide us with useful information, they don't tell us what matters the most and, therefore, by
themselves, cannot be used to measure feature importance. But how could that be? If something has higher odds,
then it must matter more, right? Well, for starters, the features all have different scales, and that makes a huge difference.
If you want to measure how much the odds increase as a feature increases, you have to know by how much
that feature typically varies, because that provides context. For example, we could say that the odds of a specific species
of butterfly living one day more are 0.66 after their first eggs hatch. This statement is meaningless to you unless
you know the lifespan and reproductive cycle of this species.

To provide context to our odds, we can easily calculate the standard deviation of our features using the np.std
function:
np.std(X_train, 0)

The following series is what is outputted by the np.std function:


age 6.757537
gender 0.476697
height 8.186987
weight 14.335173
ap_hi 16.703572
ap_lo 9.547583
cholesterol 0.678878
gluc 0.571231
smoke 0.283629
alco 0.225483
active 0.397215
dtype: float64

As you can tell by the output, binary and ordinal features only typically vary by one at most, but continuous
features, such as weight or ap_hi , can vary 10-20 times more, as evidenced by the standard deviation of the
features.

Another reason why odds cannot be used to measure feature importance is because despite favorable odds,
sometimes features are not statistically significant. They are entangled with other features in such a way they
might appear to be significant, but we can prove that they aren't. This can be seen in the summary table for the
model, under the P>|z| column. This value is called the p-value, and when it's less than 0.05, hypothesis testing
determines that there's strong evidence that it is significant. However, when it's above this number, especially by a
large margin, there's no statistical evidence that it affects the predicted score. Such is the case with gender , at
least in this dataset.
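Since log_result is a fitted statsmodels result, its p-values can be inspected directly. For instance, the following one-liner, which isn't in the chapter's notebook, lists the features lacking strong statistical evidence at the 0.05 level; with this dataset, only gender should appear:
print(log_result.pvalues[log_result.pvalues > 0.05])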

If we are trying to determine which features matter most, one way to approximate this is to multiply the coefficients by
the standard deviations of the features. Incorporating the standard deviations accounts for differences in variance
between features. Hence, it is better if we get gender out of the way too while we are at it:
coefs = log_result.params.drop(labels=['const','gender'])
stdv = np.std(X_train, 0).drop(labels='gender')
abs(coefs * stdv).sort_values(ascending=False)

The preceding code produced this output:


ap_hi 0.936632
age 0.344855
cholesterol 0.334750
weight 0.158651
ap_lo 0.100419
active 0.085436
gluc 0.065982
alco 0.046230
smoke 0.037040
height 0.029620

The preceding table can be interpreted as an approximation of risk factors, ranked from high to low according to the
model. It is also a model-specific feature importance method; in other words, a global model (modular)
interpretation method. There are a lot of new concepts to unpack here, so let's break them down.

Model interpretability method types

There are two model interpretability method types:

Model-specific: When the method can only be used for a specific model class, then it's model-specific. The
method detailed in the previous example can only work with logistic regression because it uses its
coefficients.
Model-agnostic: These are methods that can work with any model class. We cover these in Chapter 4, Global
Model-agnostic Interpretation Methods, and the next two chapters.

Model interpretability scopes

There are several model interpretability scopes:

Global holistic interpretation: You can explain how a trained model makes predictions because you can
comprehend the entire model at once, along with a complete understanding of the data. For
instance, the simple linear regression example in Chapter 1, Interpretation, Interpretability, and
Explainability; and Why Does It All Matter?, can be visualized in a two-dimensional graph. You can
conceptualize this in memory, but this is only possible because the simplicity of the model allows you to do
so, and it's neither very common nor expected.
Global modular interpretation: In the same way that you can explain the role of parts of an internal
combustion engine in the whole process of turning fuel into movement, you can also do so with a model. For
instance, in the CVD risk factor example, our feature importance method tells us that ap_hi (systolic blood
pressure), age , cholesterol , and weight are the parts that impact the whole the most. Feature importance
is only one of many global modular interpretation methods but arguably the most important one. Chapter 4,
Global Model-agnostic Interpretation Methods, goes into more detail on feature importance.
Local single-prediction interpretation: You can explain why a single prediction was made. The next
example will illustrate this concept and Chapter 5, Local Model-agnostic Interpretation Methods will go into
more detail.
Local group-prediction interpretation: The same as single-prediction, except that it applies to groups of
predictions.

Congratulations! You've already determined the risk factors with a global model interpretation method, but the
health minister also wants to know whether the model can be used to interpret individual cases. So, let's look into
that.

Interpreting individual predictions with logistic regression

What if you used the model to predict CVD for the entire test dataset? You could do so like this:
y_pred = log_result.predict(sm.add_constant(X_test)).to_numpy()
print(y_pred)

The resulting array is the probabilities that each test case is positive for CVD:
[0.40629892 0.17003609 0.13405939 ... 0.95575283 0.94095239 0.91455717]

Let's take one of the positive cases; test case #2872:


print(y_pred[2872])

The output is approximately 0.57, so we know it predicted positive for CVD because the score exceeds 0.5.
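If you wanted hard class predictions for the entire test dataset, one simple sketch is to threshold the probabilities at 0.5, which is a common default rather than something prescribed by the model:
y_pred_class = np.where(y_pred > 0.5, 1, 0)
print(y_pred_class[2872]) #outputs 1, a positive CVD prediction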

And these are the details for test case #2872:


print(X_test.iloc[2872])

The following is the output:


age 60.521849
gender 1.000000
height 158.000000
weight 62.000000
ap_hi 130.000000
ap_lo 80.000000
cholesterol 1.000000
gluc 1.000000
smoke 0.000000
alco 0.000000
active 1.000000
Name: 46965, dtype: float64

So, by the looks of the preceding series, we know that the following applies to this individual:

A borderline high ap_hi (systolic blood pressure).


Normal ap_lo (diastolic blood pressure). Having high systolic blood pressure and normal diastolic blood
pressure is what is known as isolated systolic hypertension. It could be causing a positive prediction, but
ap_hi is borderline (130 mmHg being the border), so therefore the condition of isolated systolic
hypertension is borderline.
age is not too old, but among the oldest in the dataset.
cholesterol is normal.
weight also appears to be in the healthy range.

There are also no other risk factors: glucose is normal, no smoking, no alcohol, and no sedentarism, since the
individual is active. It is not clear exactly why the prediction is positive. Are the age and borderline isolated systolic hypertension
enough to tip the scales? It's tough to understand the reasons for the prediction without putting all the predictions
into context, so let's try to do that!

But how do we put everything in context at the same time? We can't possibly visualize how one prediction
compares with the other ten thousand for every single feature and their respective predicted CVD diagnosis.
Unfortunately, humans can't process that level of dimensionality, even if it were possible to visualize a ten-
dimensional hyperplane!

However, we can do it for two features at a time, resulting in a graph that conveys where the decision boundary for
the model lies for those features. On top of that, we can overlay what the predictions were for the test dataset based
on all the features. This is to visualize the discrepancy between the effect of two features and all eleven features.

This graphical interpretation method is what is termed a decision boundary. It draws boundaries for the classes,
leaving areas that belong to one class or another. Such areas are called decision regions. In this case, we have two
classes, so we will see a graph with a single boundary between cardio = 0 and cardio = 1 , only concerning the
two features we are comparing.

We have managed to visualize two features at a time in decision regions, under one big assumption: that if all the
other features are held constant, we can observe these two in isolation. This is also known as the ceteris paribus
assumption and is critical in scientific inquiry, allowing us to control some variables in order to observe others.
One way to do this is to fill the remaining features with values that won't affect the outcome. Using the table of odds we
produced, we can tell whether increasing a feature will increase the odds of CVD. So, in aggregate, a lower value is less
risky for CVD.

For instance, age = 30 is the least risky value of those present in the dataset for age . It can also go in the opposite
direction, so active = 1 is known to be less risky than active = 0 . We can come up with optimal values for the
remainder of the features:

height = 165 .
weight = 57 (optimal for that height ).
ap_hi = 110 .
ap_lo = 70 .
smoke = 0 .
cholesterol = 1 (this means normal).
gender can be coded for male or female, which doesn't matter because the odds for gender ( 0.977519 ) are
so close to 1.

The following filler_feature_values dictionary exemplifies what should be done with the features matching
their index to their least risky values:
filler_feature_values = {"age": 30, "gender": 1, "height": 165,\
"weight": 57, "ap_hi": 110, "ap_lo": 70,\
"cholesterol": 1, "gluc": 1, "smoke": 0,\
"alco":0, "active":1}

The next thing to do is to create a (1,12) shaped NumPy array with test case #2872 so that the plotting function can
highlight it. To this end, we first convert it to NumPy and then prepend the constant of 1 , which must be the first
feature, and then reshape it so that it meets the (1,12) dimensions. The reason for the constant is that in
statsmodels , you must explicitly define the intercept. Because of this, the logistic model has an additional
feature (the intercept term), which always equals 1.
X_highlight = np.reshape(\
np.concatenate(([1], X_test.iloc[2872].to_numpy())), (1, 12))
print(X_highlight)

The following is the output:


[[ 1. 60.52184865 1. 158. 62. 130.
80. 1. 1. 0. 0. 1. ]]

We are good to go now! Let's visualize some decision region plots! We will compare the feature that is thought to
be the highest risk factor, ap_hi , with the following four most important risk factors: age , cholesterol ,
weight , and ap_lo .

The following code will generate the plots in Figure 2.2:


plt.rcParams.update({'font.size': 14})
fig, axarr = plt.subplots(2, 2, figsize=(12,8), sharex=True,\
sharey=False)
mldatasets.create_decision_plot(X_test, y_test, log_result,\
["ap_hi", "age"], None, X_highlight,\
filler_feature_values, ax=axarr.flat[0])
mldatasets.create_decision_plot(X_test, y_test, log_result,\
["ap_hi", "cholesterol"], None, X_highlight,\
filler_feature_values, ax=axarr.flat[1])
mldatasets.create_decision_plot(X_test, y_test, log_result,\
["ap_hi", "ap_lo"], None, X_highlight,\
filler_feature_values, ax=axarr.flat[2])
mldatasets.create_decision_plot(X_test, y_test, log_result,\
["ap_hi", "weight"], None, X_highlight,\
filler_feature_values, ax=axarr.flat[3])
plt.subplots_adjust(top = 1, bottom=0, hspace=0.2, wspace=0.2)
plt.show()
In the plot in Figure 2.2, the circle represents test case #2872. In all the plots but one, this test case is in the
negative (left-side) decision region, representing the cardio = 0 classification. The borderline high ap_hi (systolic
blood pressure) and the relatively high age are barely enough for a positive prediction in the top-left chart. Still, in
any case, for test case #2872, we predicted a 57% score for CVD, so this could very well explain most of it.

Not surprisingly, by themselves, ap_hi and a healthy cholesterol are not enough to tip the scales in favor of a
definitive CVD diagnosis according to the model because it's decidedly in the negative decision region, and neither
is a normal ap_lo (diastolic blood pressure). You can tell from these three charts that although there's some
overlap in the distribution of squares and triangles, there is a tendency for more triangles to gravitate toward the
positive side as the y-axis increases, while fewer squares populate this region:

Figure 2.2 – The decision regions for ap_hi and other top risk factors, with test case #2872

The overlap across the decision boundary is expected because, after all, these squares and triangles are based on
the effects of all features. Still, you expect to find a somewhat consistent pattern. The chart with ap_hi versus
weight doesn't have this pattern vertically as weight increases, which suggests something is missing in this
story… Hold that thought because we are going to investigate that in the next section!
Congratulations! You have completed the second part of the minister's request.

Decision region plotting, a local model interpretation method, provided the health ministry with a tool to
interpret individual case predictions. You could now extend this to explain several cases at a time, or plot all-
important feature combinations to find the ones where the circle is decidedly in the positive decision region. You
can also change some of the filler variables one at a time to see how they make a difference. For instance, what if
you increased the filler age to the median age of 54, or even to the age of test case #2872? Would a borderline high
ap_hi and healthy cholesterol now be enough to tip the scales? We will answer this question later, but first let's
understand what can make machine learning interpretation so difficult.

Appreciating what hinders machine learning interpretability


In the last section, we were wondering why the chart with ap_hi versus weight didn't have a conclusive pattern.
It could very well be that although weight is a risk factor, there are other critical mediating variables that could
explain the increased risk of CVD. A mediating variable is one that influences the strength between the
independent and target (dependent) variable. We probably don't have to think too hard to find what is missing. In
Chapter 1, Interpretation, Interpretability, and Explainability; and Why Does It All Matter?, we performed linear
regression on weight and height because there's a linear relationship between these variables. In the context of
human health, weight is not nearly as meaningful without height , so you need to look at both.

Perhaps if we plot the decision regions for these two variables, we will get some clues. We can plot them with the
following code:
fig, ax = plt.subplots(1,1, figsize=(12,8))
mldatasets.create_decision_plot(X_test, y_test, log_result,\
                                [3, 4], ['height [cm]', 'weight [kg]'],\
                                X_highlight, filler_feature_values,\
                                filler_feature_ranges, ax=ax)
plt.show()

The preceding snippet will generate the plot in Figure 2.3:


Figure 2.3 – The decision regions for weight and height, with test case #2872

No decision boundary was ascertained in Figure 2.3 because if all other variables are held constant (at a less risky
value), no height and weight combination is enough to predict CVD. However, we can tell that there is a pattern
for the orange triangles, mostly located in one ovular area. This provides exciting insight that even though we
expect weight to increase when height increases, the concept of an inherently unhealthy weight is not one that
increases linearly with height .

In fact, for almost two centuries, this relationship has been mathematically understood under the name body mass
index (BMI):
BMI = weight[kg] / (height[m])²

Before we discuss BMI further, you must consider complexity. Dimensionality aside, there are chiefly three things
that introduce complexity that makes interpretation difficult:

1. Non-linearity
2. Interactivity
3. Non-monotonicity

Non-linearity

Linear equations such as y = a + bx are easy to understand. They are additive, so it is easy to separate and quantify
the effects of each of its terms (a and bx) from the outcome of the model (y). Many model classes have linear
equations incorporated in the math. These equations can both be used to fit the data to the model and describe the
model.

However, there are model classes that are inherently non-linear because they introduce non-linearity in their
training. Such is the case for deep learning models because they have non-linear activation functions such as
sigmoid. However, logistic regression is considered a generalized linear model (GLM) because it's additive. In
other words, the outcome is a sum of weighted inputs and parameters. We will discuss GLMs further in Chapter 3,
Interpretation Challenges.

However, even if your model is linear, the relationships between the variables may not be linear, which can lead to
poor performance and interpretability. What you can do in these cases is adopt either of the following approaches:

Use a non-linear model class, which will fit these non-linear feature relationships much better, possibly
improving model performance. Nevertheless, as we will explore in more detail in the next chapter, this can
make it less interpretable.
Use domain knowledge to engineer a feature that can help "linearize" it. For instance, if you had a feature
that increased exponentially against another, you can engineer a new variable with the logarithm of that
feature. In the case of our CVD prediction, we know BMI is a better way to understand weight in the
company of height. Best of all, it's not an arbitrary made-up feature, so it's easier to interpret. We can prove
this point by making a copy of the dataset, engineering the BMI feature in it, training the model with this
extra feature, and performing local model interpretation. The following code snippet does just that:
X2 = cvd_df.drop(['cardio'], axis=1).copy()
X2["bmi"] = X2["weight"] / (X2["height"]/100)**2

To illustrate this new feature, let's plot BMI against both weight and height using the following code:
fig, axs = plt.subplots(1,3, figsize=(15,4))
axs[0].scatter(X2["weight"], X2["bmi"], color='black', s=2)
axs[0].set_xlabel('weight [kg]')
axs[0].set_ylabel('bmi')
axs[1].scatter(X2["height"], X2["weight"], color='black', s=2)
axs[1].set_xlabel('height [cm]')
axs[1].set_ylabel('weight [kg]')
axs[2].scatter(X2["bmi"], X2["height"], color='black', s=2)
axs[2].set_xlabel('bmi')
axs[2].set_ylabel('height [cm]')
plt.subplots_adjust(top = 1, bottom=0, hspace=0.2, wspace=0.3)
plt.show()

Figure 2.4 is produced with the preceding code:


Figure 2.4 – Bivariate comparison between weight, height, and bmi

As you can appreciate from the plots in Figure 2.4, there is a more definite linear relationship between bmi and
weight than between height and weight , or even between bmi and height .
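One quick way to back up what the scatter plots suggest, before we drop any columns, is to compute the pairwise Pearson correlations; this check isn't in the chapter's notebook:
print(X2[['weight', 'height', 'bmi']].corr())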

Let's fit the new model with the extra feature using the following code snippet:
X2 = X2.drop(['weight','height'], axis=1)
X2_train, X2_test,__,_ = train_test_split(X2, y,\
test_size=0.15, random_state=9)
log_model2 = sm.Logit(y_train, sm.add_constant(X2_train))
log_result2 = log_model2.fit()

Now, let's see whether test case #2872 is on the positive decision region when comparing ap_hi to bmi if we
keep age constant at 60:
filler_feature_values2 = {"age": 60, "gender": 1, "ap_hi": 110,\
"ap_lo": 70, "cholesterol": 1, "gluc": 1,\
"smoke": 0, "alco":0, "active":1, "bmi":20 }
X2_highlight = np.reshape(\
np.concatenate(([1],X2_test.iloc[2872].to_numpy())), (1, 11))
fig, ax = plt.subplots(1,1, figsize=(12,8))
mldatasets.create_decision_plot(X2_test, y_test, log_result2,\
["ap_hi", "bmi"], None, X2_highlight,\
filler_feature_values2, ax=ax)
plt.show()

The preceding code plots decision regions in the following Figure 2.5:
Figure 2.5 – The decision regions for ap_hi and bmi, with test case #2872

Figure 2.5 shows that controlling for age , ap_hi and bmi can help explain the positive prediction for CVD
because the circle is in the positive decision region. Please note that there are some likely anomalous bmi outliers
(the highest BMI ever recorded was 204), so there are probably some incorrect weights or heights in the dataset.

WHAT'S THE PROBLEM WITH OUTLIERS?

Outliers can be influential or high-leverage points and therefore affect the model when it is trained with them. Even if they
don't, they can make interpretation more difficult. If they are anomalous, then you should remove them, as we did
with blood pressure at the beginning of this chapter. And sometimes, they can hide in plain sight because they are
only perceived as anomalous in the context of other features. In any case, there are practical reasons why outliers
are problematic, such as making plots like the preceding one "zoom out" to be able to fit them while not letting
you appreciate the decision boundary where it matters. And there are also more profound reasons, such as losing
trust in the data, thereby tainting trust in the models that were trained on that data. This sort of problem is to be
expected with real-world data. Even though we haven't done it in this chapter for the sake of expediency, it's
essential to begin every project by thoroughly exploring the data, treating missing values and outliers, and other
data housekeeping tasks.

Interactivity
When we created bmi , we didn't only linearize a non-linear relationship, but we also created interactions between
two features. bmi is, therefore, an interaction feature, but this was informed by domain knowledge. However,
many model classes do this automatically by permutating all kinds of operations between features. After all,
features have latent relationships between one another, much like height and weight , and ap_hi and ap_lo .
Therefore, automating the process of looking for them is not always a bad thing. In fact, it can even be absolutely
necessary. This is the case for many deep learning problems where the data is unstructured and, therefore, part of
the task of training the model is looking for the latent relationships to make sense of it.

However, for structured data, even though interactions can be significant for model performance, they can hurt
interpretability by adding potentially unnecessary complexity to the model and also finding latent relationships that
don't mean anything (which is called a spurious relationship or correlation).
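To illustrate what automated interaction generation can look like, here is a minimal sketch using scikit-learn's PolynomialFeatures on three of our columns, assuming scikit-learn 1.0 or later; it is only an illustration and not part of this chapter's notebook:
from sklearn.preprocessing import PolynomialFeatures
interactions = PolynomialFeatures(interaction_only=True, include_bias=False)
X_interact = interactions.fit_transform(X2[['ap_hi', 'ap_lo', 'bmi']])
#The generated columns are the original features plus their pairwise products
print(interactions.get_feature_names_out())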

Non-monotonicity

Often, there is a meaningful and consistent relationship between a feature and the target variable. So, we
know that as age increases, the risk of CVD ( cardio ) must increase. There is no point at which you reach a
certain age and the risk drops. Maybe the risk slows down, but it does not drop. We call this monotonicity, and
functions that are monotonic are either always increasing or always decreasing throughout their entire domain.

Please note that all linear relationships are monotonic, but not all monotonic relationships are necessarily linear.
This is because they don't have to be a straight line. A common problem in machine learning is that a model
doesn't know about a monotonic relationship that we expect because of our domain expertise. Then, because of
noise and omissions in the data, the model is trained in such a way that there are ups and downs where you
don't expect them.

Let's propose a hypothetical example. Let's imagine that due to a lack of availability of data for 57-60-year-olds,
and because the few cases we did have for this range were negative for CVD, the model could learn that this is
where you would expect a drop in CVD risk. Some model classes are inherently monotonic, such as logistic
regression, so they can't have this problem, but many others do. We will examine this in more detail in Chapter 12,
Monotonic Constraints and Model Tuning for Interpretability:
Figure 2.6 – A partial dependence plot between a target variable (yhat) and a predictor with monotonic and non-
monotonic models

Figure 2.6 is what is called a Partial Dependence Plot (PDP), from an unrelated example. PDPs are a concept we
will study in further detail in Chapter 4, Global Model-agnostic Interpretation Methods, but what is important to
grasp from it is that the prediction yhat is supposed to decrease as the feature
quantity_indexes_for_real_gdp_by_state increases. As you can tell by the lines, in the monotonic model, it
consistently decreases, but in the non-monotonic one, it has jagged peaks as it decreases, and then increases at the
very end.
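Although Chapter 12 covers this properly, the following minimal sketch hints at how a monotonic constraint can be imposed with a model class that supports them, assuming scikit-learn 1.0 or later; it is not part of this chapter's notebook:
from sklearn.ensemble import HistGradientBoostingClassifier
#1 forces a monotonically increasing effect (here, for age);
#0 leaves a feature unconstrained
constraints = [1 if col == 'age' else 0 for col in X_train.columns]
mono_model = HistGradientBoostingClassifier(monotonic_cst=constraints,\
                                            random_state=9)
mono_model = mono_model.fit(X_train, y_train)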

Mission accomplished
The first part of the mission was to understand risk factors for cardiovascular disease, and you've determined that
the top four risk factors are systolic blood pressure ( ap_hi ), age , cholesterol , and weight according to the
logistic regression model, of which only age is non-modifiable. However, you also realized that systolic blood
pressure ( ap_hi ) is not as meaningful on its own since it relies on diastolic blood pressure ( ap_lo ) for
interpretation. The same goes for weight and height . We learned that the interaction of features plays a crucial
role in interpretation, and so does their relationship with each other and the target variable, whether linear or
monotonic. Furthermore, the data is only a representation of the truth, which can be wrong. After all, we found
anomalies that, left unchecked, can bias our model.

Another source of bias is how the data was collected. After all, you can wonder why the model's top features were
all objective and examination features. Why aren't smoking and drinking larger factors? To verify whether there
was sample bias involved, you would have to compare with other more trustworthy datasets to check whether your
dataset is under-representing drinkers and smokers. Or maybe the bias was introduced by the question that asked
whether they smoked now, and not whether they had ever smoked for an extended period.

Another type of bias that we could address is exclusion bias — our data might be missing information that explains
the truth that the model is trying to depict. For instance, we know through medical research that blood pressure
issues such as isolated systolic hypertension, which increases CVD risk, are caused by underlying conditions such
as diabetes, hyperthyroidism, arterial stiffness, and obesity, to name a few. The only one of these conditions that
we can derive from the data is obesity, and not the other ones. If we want to be able to interpret a model's
predictions well, we need to have all relevant features. Otherwise, there will be gaps we cannot explain. Maybe
once we add them, they won't make much of a difference, but that's what the methods we will learn in Chapter 10,
Feature Selection for Interpretability, are for.

The second part of the mission was to be able to interpret individual model predictions. We can do this well
enough by plotting decision regions. It's a simple method, but it has many limitations, especially in situations
where there are more than a handful of features, and they tend to interact a lot with each other. Chapter 5, Local
Model-Agnostic Interpretation Methods, and Chapter 6, Anchor and Counterfactual Explanations, will cover
better local interpretation methods. However, the decision region plot method helps illustrate many of the concepts
surrounding decision boundaries we will discuss in those chapters.

Summary
After reading this chapter, you should know about two model interpretation methods: feature importance and
decision boundaries. You also learned about model interpretation method types and scopes and the three elements
that impact interpretability in machine learning. We will keep mentioning these fundamental concepts in
subsequent chapters. For a machine learning practitioner, it is paramount to be able to spot them so you can know
what tools to leverage to overcome interpretation challenges. In the next chapter, we will dive deeper into this
topic.

Further reading
Molnar, Christoph. Interpretable Machine Learning. A Guide for Making Black Box Models Explainable,
2019: https://christophm.github.io/interpretable-ml-book/.
Mlxtend documentation. Plotting Decision Regions:
http://rasbt.github.io/mlxtend/user_guide/plotting/plot_decision_regions/.
3 Interpretation Challenges
Join our book community on Discord
https://packt.link/EarlyAccessCommunity

In this chapter, we will discuss the traditional methods used for machine learning interpretation for both regression
and classification. This includes model performance evaluation methods such as RMSE, R-squared, AUC, ROC
curves, and the many metrics derived from confusion matrices. We will then examine the limitations of these
traditional methods and explain what exactly makes "white-box" models intrinsically interpretable and why we
cannot always use white-box models. To answer this question, we'll consider the trade-off between prediction
performance and model interpretability. Finally, we will discover some new "glass-box" models such as EBM and
GAMI-Net that attempt to not compromise in this trade-off.

The following are the main topics that will be covered in this chapter:

Reviewing traditional model interpretation methods


Understanding the limitations of traditional model interpretation methods
Studying intrinsically interpretable (white-box) models
Recognizing the trade-off between performance and interpretability
Discovering newer interpretable (glass-box) models

Technical requirements
From Chapter 2, Key Concepts of Interpretability, onward, we are using a custom mldatasets library to load our
datasets. Instructions on how to install this library are located in the Preface. In addition to mldatasets , this
chapter's examples also use the pandas , numpy , sklearn , rulefit , interpret , statsmodels , matplotlib ,
tensorflow , and gaminet libraries. The code for this chapter is located here: https://github.com/PacktPublishing/Interpretable-
Machine-Learning-with-Python/tree/master/Chapter03.

The mission
Picture yourself, a data science consultant, in a conference room in Fort Worth, Texas, during early January 2019.
In this conference room, executives for one of the world's largest airlines, American Airlines (AA), are briefing
you on their on-time performance (OTP). OTP is a widely accepted key performance indicator for flight
punctuality. It is measured as the percentage of flights that arrived within 15 minutes of the scheduled arrival. It
turns out that AA has achieved an OTP of just over 80% for 3 years in a row, which is already acceptable, and
much better than before, but they are still ninth in the world and fifth in North America. To brag about it next year
in their advertising, they aspire to achieve, at least, number one in North America for 2019, besting their biggest
rivals.

On the financial front, it is estimated that delays cost the airline close to $2 billion, so reducing this by even 25-
35% to be on par with their competitors could produce sizable savings. And it is estimated that it costs
passengers just as much due to tens of millions of lost hours. A reduction in delays would produce happier
customers, which could lead to an increase in ticket sales.
Your task is to create models that can predict delays for domestic flights only. What they hope to gain from the
models is the following:

To understand what factors impacted domestic arrival delays the most in 2018
To anticipate a delay caused by the airline in midair with enough accuracy to mitigate some of these factors in
2019

But not all delays are made equal. The International Air Transport Association (IATA) has over 80 delay codes
ranging from 14 (oversales, booking errors) to 75 (de-icing of aircraft, removal of ice/snow, frost prevention).
Some are preventable, and others unavoidable.

The airline executives told you that the airline is not, for now, interested in predicting delays caused by events out
of their control, such as extreme weather, security events, and air traffic control issues. They are also not interested
in delays caused by late arrivals from previous flights using the same aircraft because this was not the root cause.
Nevertheless, they would like to know the effect of a busy hub on avoidable delays even if this has to do with
congestion because, after all, perhaps there's something they can do with flight scheduling or flight speed, or even
gate selection. And while they understand that international flights occasionally impact domestic flights, they hope
to tackle the sizeable local market first.

Executives have provided you with a dataset from the United States Department of Transportation Bureau of
Transportation Statistics with all 2018 AA domestic flights.

The approach
Upon careful consideration, you have decided to approach this both as a regression problem and a classification
problem. Therefore, you will produce models that predict minutes delayed as well as models that classify whether
flights were delayed by more than 15 minutes or not. For interpretation, using both will enable you to use a wider
variety of methods, and expand your interpretation accordingly. So we will approach this example by taking the
following steps:

1. Predicting minutes delayed with various regression methods


2. Classifying flights as delayed or not delayed with various classification methods

These steps, covered in the Reviewing traditional model interpretation methods section, are followed by conclusions
spread throughout the rest of the sections of this chapter.

The preparations
You will find the code for this example here: https://github.com/PacktPublishing/Interpretable-Machine-Learning-
with-Python/blob/master/Chapter03/FlightDelays.ipynb.

Loading the libraries

To run this example, you need to install the following libraries:

mldatasets to load the dataset


pandas and numpy to manipulate it
sklearn (scikit-learn), rulefit , statsmodels , interpret , and gaminet to fit models and calculate
performance metrics
matplotlib to create visualizations

Load these libraries as seen in the following snippet:


import math
import mldatasets
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler,\
MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics, linear_model, tree, naive_bayes,\
neighbors, ensemble, neural_network, svm
from rulefit import RuleFit
import statsmodels.api as sm
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show
from interpret.perf import ROC
import tensorflow as tf
from gaminet import GAMINet
from gaminet.utils import plot_trajectory, plot_regularization,\
local_visualize, global_visualize_density,\
feature_importance_visualize
import matplotlib.pyplot as plt

Understanding and preparing the data

We then load the data as shown:


aad18_df = mldatasets.load("aa-domestic-delays-2018")

There should be nearly 900,000 records and 23 columns. We can take a peek at what was loaded like this:
print(aad18_df.info())

The following is the output:


RangeIndex: 899527 entries, 0 to 899526
Data columns (total 23 columns):
FL_NUM 899527 non-null int64
ORIGIN 899527 non-null object
DEST 899527 non-null object
PLANNED_DEP_DATETIME 899527 non-null object
CRS_DEP_TIME 899527 non-null int64
DEP_TIME 899527 non-null float64
DEP_DELAY 899527 non-null float64
DEP_AFPH 899527 non-null float64
DEP_RFPH 899527 non-null float64
TAXI_OUT 899527 non-null float64
WHEELS_OFF 899527 non-null float64
: : : :
WEATHER_DELAY 899527 non-null float64
NAS_DELAY 899527 non-null float64
SECURITY_DELAY 899527 non-null float64
LATE_AIRCRAFT_DELAY 899527 non-null float64
dtypes: float64(17), int64(3), object(3)

Everything seems to be in order because all columns are there and there are no null values.

The data dictionary

Let's examine the data dictionary.

General features are as follows:

FL_NUM : Flight number


ORIGIN : Starting airport code (IATA)
DEST : Destination airport code (IATA)

Departure features are as follows:

PLANNED_DEP_DATETIME : The planned date and time of the flight.


CRS_DEP_TIME : The planned departure time.
DEP_TIME : The actual departure time.
DEP_AFPH : The number of actual flights per hour occurring during the interval in between the planned and
actual departure from the origin airport (factoring in 30 minutes of padding). The feature tells you how busy
the origin airport was during takeoff.
DEP_RFPH : The departure relative flights per hour is the ratio of actual flights per hour over the median
amount of flights per hour that occur at the origin airport at that time of day, day of the week, and month of
the year. The feature tells you how relatively busy the origin airport was during takeoff.
TAXI_OUT : The time duration elapsed between the departure from the origin airport gate and wheels off.
WHEELS_OFF : The point in time when the aircraft's wheels leave the ground.

In-flight features are as follows:

CRS_ELAPSED_TIME : The planned amount of time needed for the flight trip.
PCT_ELAPSED_TIME : The ratio of actual flight time over planned flight time to gauge the plane's relative
speed.
DISTANCE : The distance between two airports.

Arrival features:

CRS_ARR_TIME : The planned arrival time.


ARR_AFPH : The number of actual flights per hour occurring during the interval between the planned and
actual arrival time at the destination airport (factoring in 30 minutes of padding). The feature tells you how
busy the destination airport was during landing.
ARR_RFPH : The arrival relative flights per hour is the ratio of actual flights per hour over the median amount
of flights per hour that occur at the destination airport at that time of day, day of the week, and month of the
year. The feature tells you how relatively busy the destination airport was during landing.

Delay features:

DEP_DELAY : The total delay on departure in minutes.


ARR_DELAY : The total delay on arrival in minutes can be subdivided into any or all of the following:

a) CARRIER_DELAY : The delay in minutes caused by circumstances within the airline's control (for example,
maintenance or crew problems, aircraft cleaning, baggage loading, fueling, and so on).

b) WEATHER_DELAY : The delay in minutes caused by significant meteorological conditions (actual or forecasted).

c) NAS_DELAY : The delay in minutes mandated by a national aviation system such as non-extreme weather
conditions, airport operations, heavy traffic volume, and air traffic control.

d) SECURITY_DELAY : The delay in minutes caused by the evacuation of a terminal or concourse, re-boarding of an
aircraft because of a security breach, faulty screening equipment, or long lines above 29 minutes in screening
areas.

e) LATE_AIRCRAFT_DELAY : The delay in minutes caused by a previous flight with the same aircraft that arrived
late.

Data preparation

For starters, PLANNED_DEP_DATETIME must be of datetime data type:


aad18_df['PLANNED_DEP_DATETIME'] =\
pd.to_datetime(aad18_df['PLANNED_DEP_DATETIME'])

The exact day and time of a flight don't matter, but maybe the month and day of the week do because of weather
and seasonal patterns that can only be appreciated at this level of granularity. Also, the executives mentioned
weekends and winters being especially bad for delays. Therefore, we will create features for the month and day of
the week:
aad18_df['DEP_MONTH'] = aad18_df['PLANNED_DEP_DATETIME'].dt.month
aad18_df['DEP_DOW'] = aad18_df['PLANNED_DEP_DATETIME'].dt.dayofweek

We don't need the PLANNED_DEP_DATETIME column so let's drop it like this:


aad18_df = aad18_df.drop(['PLANNED_DEP_DATETIME'], axis=1)

It is essential to record whether the arrival or destination airport is a hub. AA, in 2019, had 10 hubs: Charlotte,
Chicago–O'Hare, Dallas/Fort Worth, Los Angeles, Miami, New York–JFK, New York–LaGuardia, Philadelphia,
Phoenix–Sky Harbor, and Washington–National. Therefore, we can encode which ORIGIN and DEST airports are
AA hubs using their IATA codes, and get rid of columns with codes since they are too specific ( FL_NUM , ORIGIN ,
and DEST ):
#Create list with 10 hubs (with their IATA codes)
hubs = ['CLT', 'ORD', 'DFW', 'LAX', 'MIA', 'JFK', 'LGA', 'PHL',\
'PHX', 'DCA']
#Boolean series for if ORIGIN or DEST are hubs
is_origin_hub = aad18_df['ORIGIN'].isin(hubs)
is_dest_hub = aad18_df['DEST'].isin(hubs)
#Use boolean series to set ORIGIN_HUB and DEST_HUB
aad18_df['ORIGIN_HUB'] = 0
aad18_df.loc[is_origin_hub, 'ORIGIN_HUB'] = 1
aad18_df['DEST_HUB'] = 0
aad18_df.loc[is_dest_hub, 'DEST_HUB'] = 1
#Drop columns with codes
aad18_df = aad18_df.drop(['FL_NUM', 'ORIGIN', 'DEST'], axis=1)

After all these operations, we have a fair number of useful features, but we are yet to determine the target feature.
There are two columns that could serve this purpose. We have ARR_DELAY , which is the total amount of minutes
delayed regardless of the reason, and then there's CARRIER_DELAY , which is just the total amount of those minutes
that can be attributed to the airline. For instance, look at the following sample of flights delayed over 15 minutes
(which is considered late according to the airline's definition):
aad18_df.loc[aad18_df['ARR_DELAY'] > 15,\
['ARR_DELAY','CARRIER_DELAY']].head(10)

The preceding code outputs Figure 3.1:


Figure 3.1 – Sample observations with arrival delays over 15 minutes

Of all the delays in Figure 3.1, one of them (#26) wasn't at all the responsibility of the airline. Four of them were
partially the responsibility of the airline (#8, #16, #33, #40), two of which were over 15 minutes late due to the
airline (#8, #40). The rest of them were entirely the airline's fault. We can tell that although the total delay is useful
information, the airline executives were only interested in delays caused by the airline so ARR_DELAY can be
discarded. Furthermore, there's another more important reason it should be discarded, and it's that if the task at
hand is to predict a delay, we cannot use pretty much the very same delay (minus the portions not due to the
airline) to predict it. This would be like using today's newspaper slightly redacted to predict today's news. For this
very same reason, it is best to remove ARR_DELAY :
aad18_df = aad18_df.drop(['ARR_DELAY'], axis=1)

Finally, we can put the target feature alone as y and all the rest as X . After this, we split y and X into train and
test datasets. Please note that the target feature ( y ) stays the same for regression so we split it into y_train_reg
and y_test_reg . However, for classification, we must make binary versions of these labels denoting whether it's
more than 15 minutes late or not, called y_train_class and y_test_class . Please note that we are setting a
fixed random_state for reproducibility:
rand = 9
np.random.seed(rand)
y = aad18_df['CARRIER_DELAY']
X = aad18_df.drop(['CARRIER_DELAY'], axis=1).copy()
X_train, X_test, y_train_reg, y_test_reg = train_test_split(X,\
y, test_size=0.15, random_state=rand)
y_train_class = y_train_reg.apply(lambda x: 1 if x > 15 else 0)
y_test_class = y_test_reg.apply(lambda x: 1 if x > 15 else 0)

To examine how linearly correlated the features are to the target CARRIER_DELAY , we compute Pearson's
correlation coefficient, turn coefficients to absolute values (because we aren't interested in whether they are
positively or negatively correlated), and sort them in descending order:
corr = aad18_df.corr()
abs(corr['CARRIER_DELAY']).sort_values(ascending=False)

As you can tell from the output, only one feature ( DEP_DELAY ) is highly correlated with the target. The others are much less so:
CARRIER_DELAY 1.000000
DEP_DELAY 0.703935
ARR_RFPH 0.101742
LATE_AIRCRAFT_DELAY 0.083166
DEP_RFPH 0.058659
ARR_AFPH 0.035135
DEP_TIME 0.030941
NAS_DELAY 0.026792
: :
WEATHER_DELAY 0.003002
SECURITY_DELAY 0.000460

However, this is only linearly correlated and on a one-by-one basis. It doesn't mean that they don't have a non-
linear relationship, or that several features interacting together wouldn't impact the target. In the next section, we
will discuss this further.
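One hedged way to look beyond linear, one-by-one correlation is mutual information, which can capture non-linear (though still pairwise) relationships. The following sketch is not part of the chapter's code and can be slow on nearly 900,000 rows:
from sklearn.feature_selection import mutual_info_regression
mi = mutual_info_regression(X_train, y_train_reg, random_state=rand)
mi_s = pd.Series(mi, index=X_train.columns).sort_values(ascending=False)
print(mi_s.head())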

Reviewing traditional model interpretation methods


To explore as many model classes and interpretation methods as possible, we will fit the data to regression and
classification models.

Predicting minutes delayed with various regression methods

To compare and contrast regression methods, we will first create a dictionary named reg_models . Each model is
its own dictionary and the function that creates it in the model attribute. This structure will be used later to store
the fitted model neatly and its metrics. Model classes in this dictionary have been chosen to represent several
model families and to illustrate important concepts that we will discuss later:
reg_models = {
#Generalized Linear Models (GLMs)
'linear':{'model': linear_model.LinearRegression()},
'linear_poly':{'model':
make_pipeline(PolynomialFeatures(degree=2),
linear_model.LinearRegression(fit_intercept=False)) },
'linear_interact':{'model':
make_pipeline(PolynomialFeatures(interaction_only=True),
linear_model.LinearRegression(fit_intercept=False)) },
'ridge':{'model': linear_model.\
RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1]) },
#Trees
'decision_tree':{'model': tree.\
DecisionTreeRegressor(max_depth=7, random_state=rand)},
#RuleFit
'rulefit':{'model': RuleFit(max_rules=150, rfmode='regress',\
random_state=rand)},
#Nearest Neighbors
'knn':{'model': neighbors.KNeighborsRegressor(n_neighbors=7)},
#Ensemble Methods
'random_forest':{'model':ensemble.\
RandomForestRegressor(max_depth=7, random_state=rand)},
#Neural Networks
'mlp':{'model':neural_network.\
MLPRegressor(hidden_layer_sizes=(21,),\
max_iter=500, \
early_stopping=True,\
random_state=rand)}
}

Before we start fitting the data to these models, we will briefly explain them one by one:

linear : Linear regression was the first model class we discussed. For better or for worse, it makes several
assumptions about the data. Chief among them is the assumption that the prediction must be a linear
combination of X features. This, naturally, limits the capacity to discover non-linear relationships and
interactions among the features.
linear_poly : Polynomial regression extends linear regression by adding polynomial features. In this case, as indicated by degree=2 , the polynomial degree is two, so it's quadratic. This means, in addition to having all features in their monomial form (for example, DEP_FPH ), it also has them in a quadratic form (for example, DEP_FPH² ), plus the many interaction terms for all of the 21 features. In other words, for DEP_FPH , there would be interaction terms such as DEP_FPH × DISTANCE , DEP_FPH × DEP_DELAY , and so on for the rest of the features.
linear_interact : This is just like the polynomial regression model but without the quadratic terms. In
other words, only the interactions, as interaction_only=True would suggest. It's useful because there is no
reason to believe any of our features have a relationship that is better fitted with quadratic terms. Still,
perhaps it's the interaction with other features that makes an impact.
ridge : Ridge regression is a variation of linear regression. However, even though the method behind linear regression, called Ordinary Least Squares (OLS), does a pretty good job of reducing the error and fitting the model to the features, it does so without considering overfitting. The problem here is that OLS treats all features equally, so the model becomes more complex as each variable is added. As the word overfitting
suggests, the resulting model fits the training data too well, resulting in the lowest bias but the highest
variance. There's a sweet spot in this trade-off between bias and variance, and one way of getting to this
spot is reducing the complexity added by the introduction of too many features. Linear regression is not
equipped to do so on its own. This is where ridge regression comes along, with our friend regularization. It
does this by shrinking coefficients that don't contribute to the outcome with a penalty term called the L2
norm. In this example, we use a cross-validated version of ridge ( RidgeCV ) that tests several regularization
strengths ( alphas ).
decision_tree : A decision tree is precisely what the name suggests. Imagine a tree-like structure where, at every point where branches subdivide to form more branches, there is a "test" performed on a feature, partitioning the dataset between the branches. When branches stop subdividing, they become leaves, and at every leaf, there's a decision, be it to assign a class for classification or a fixed value for regression. We are limiting this tree to max_depth=7 to prevent overfitting, because the larger the tree, the better it will fit our training data.
rulefit : RuleFit is a regularized linear regression expanded to include feature interactions in the form of rules. The rules are formed by traversing a decision tree, except it discards the leaves and keeps the feature interactions found while traversing the branches toward these leaves. It uses LASSO regression, which, like ridge, uses regularization, but instead of the L2 norm, it uses the L1 norm. The result is that useless features end up with a coefficient of exactly zero rather than merely converging toward zero, as they do with L2. We are limiting the rules to 150 ( max_rules=150 ), and the attribute rfmode='regress' tells RuleFit that this is a regression problem, since it can also be used for classification. Unlike all the other models used here, this isn't a scikit-learn implementation; it was created by Christoph Molnar, adapting the algorithm from the original paper.
knn : k-Nearest Neighbors (kNN) is a simple method based on the locality assumption, which is that data
points that are close to each other are similar. In other words, they must have similar predicted values, and, in
practice, this isn't a bad guess, so it takes data points nearest to the point you want to predict and derives a
prediction based on that. In this case, n_neighbors=7 so k = 7. It's an instance-based machine learning
model, also known as a lazy learner because it simply stores the training data. During inference, it employs
training data to calculate the similarity with points and generate a prediction based on that. This is opposed to
what model-based machine learning techniques, or eager learners, do, which is to use training data to learn
formulas, parameters, coefficients, or bias/weights, which it then leverages to make a prediction during
inference.
random_forest : Imagine not one but hundreds of decision trees trained on random combinations of the features and random samples of the data. Random forest averages the predictions of these randomly generated decision trees rather than picking a single best tree. This concept of training less effective models in parallel and combining them using an averaging process is called bagging. It is an ensemble method because it combines more than one model (usually called weak learners) into a strong learner. In addition to bagging, there are two other ensemble techniques, called boosting and stacking. For bagging, deeper trees are preferable because the averaging process reduces the variance they would otherwise introduce, which is why we are using max_depth=7 .
mlp : A multi-layer perceptron is a "vanilla" feed-forward (sequential) neural network, so it uses non-linear
activation functions ( MLPRegressor uses ReLU by default), stochastic gradient descent, and
backpropagation. In this case, we are using 21 neurons in the first and only hidden layer, hence
hidden_layer_sizes=(21,) , running training for 500 epochs ( max_iter=500 ), and terminating training
when the validation score is not improving ( early_stopping=True ).
If you are unfamiliar with some of these models, don't fret! We will cover them in more detail either later in this chapter or later in the book. Also, please note that some of these models have a random process somewhere. To ensure reproducibility, we have set random_state . You should strive to always set it; otherwise, a different seed will be chosen every single time, which will make your results hard to reproduce.

Now, let's iterate over our dictionary of models ( reg_models ), fit them to the training data, and predict and
compute two metrics based on the quality of these predictions. We'll then save the fitted model, test predictions,
and metrics in the dictionary for later use. Note that rulefit only accepts numpy arrays, so we can't fit it in the
same way. Also, note rulefit and mlp take longer than the rest to train, so this can take a few minutes to run:
for model_name in reg_models.keys():
    if model_name != 'rulefit':
        fitted_model = reg_models[model_name]['model'].\
            fit(X_train, y_train_reg)
    else:
        fitted_model = reg_models[model_name]['model'].\
            fit(X_train.values, y_train_reg.values, X_test.columns)
    y_train_pred = fitted_model.predict(X_train.values)
    y_test_pred = fitted_model.predict(X_test.values)
    reg_models[model_name]['fitted'] = fitted_model
    reg_models[model_name]['preds'] = y_test_pred
    reg_models[model_name]['RMSE_train'] =\
        math.sqrt(metrics.mean_squared_error(y_train_reg, y_train_pred))
    reg_models[model_name]['RMSE_test'] =\
        math.sqrt(metrics.mean_squared_error(y_test_reg, y_test_pred))
    reg_models[model_name]['R2_test'] =\
        metrics.r2_score(y_test_reg, y_test_pred)

We can now convert the dictionary to a DataFrame and display the metrics in a sorted and color-coded fashion:
reg_metrics = pd.DataFrame.from_dict(reg_models,\
    'index')[['RMSE_train', 'RMSE_test', 'R2_test']]
reg_metrics.sort_values(by='RMSE_test').style.format({'RMSE_train':\
    '{:.2f}', 'RMSE_test': '{:.2f}', 'R2_test': '{:.3f}'}).\
    background_gradient(cmap='viridis_r', low=0.1, high=1,\
        subset=['RMSE_train', 'RMSE_test']).\
    background_gradient(cmap='plasma', low=0.3, high=1,\
        subset=['R2_test'])

The preceding code outputs Figure 3.2. Please note that color-coding doesn't work in all Jupyter Notebook
implementations:
Figure 3.2 – Regression metrics for our models

To interpret the metrics in Figure 3.2, we ought to first understand what they mean, both in general and in the
context of this regression exercise:

RMSE: Root Mean Square Error is defined as the standard deviation of the residuals. It's the square root of the sum of squared residuals divided by the number of observations, in this case, flights. It tells you, on average, how far apart the predictions are from the actuals, and as you can probably tell from the color-coding, less is better because you want your predictions to be as close as possible to the actuals in the test (hold-out) dataset. We have also included this metric for the train dataset to see how well the model generalizes. You expect the test error to be higher than the training error, but not by much. If it is, as it is for random_forest , you need to tune some of the hyperparameters. In this case, reducing the trees' maximum depth, increasing the number of trees (also called estimators), and reducing the maximum number of features to use should do the trick. On the other hand, with knn , you can adjust k ( n_neighbors ), but it is expected, because of its lazy-learner nature, to perform much better on the training data.

In any case, these numbers are pretty good because even our worst performing model is below a test RMSE of 10,
and about half of them have a test RMSE of less than 7.5, quite possibly predicting a delay effectively, on average,
since the threshold for a delay is 15 minutes.

Note that linear_poly is the second and linear_interact is the fourth most performant model, significantly
ahead of linear , suggesting that non-linearity and interactivity are important factors to produce better predictive
performance.

R2: R-squared is also known as the coefficient of determination. It's defined as the proportion of the variance in the y (outcome) target that can be explained by the X (predictor) features in the model. It answers the question: what proportion of the total variability is explained by the model? As you can probably tell from the color-coding, more is better. Our models appear to include significant X features, as evidenced by our Pearson's correlation coefficients, so if this R2 value were low, perhaps adding additional features would help, such as flight logs, terminal conditions, and even those things airline executives said they weren't interested in exploring right now, such as knock-on effects and international flights. These could fill in the gaps in the unexplained variance. Both metrics' formulas are summarized below.
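For reference, here is a minimal summary of both metrics in standard notation, where $y_i$ are the actual values, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the actuals, and $n$ the number of observations:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$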

Let's see if we can get good metrics with classification.

Classifying flights as delayed or not delayed with various classification methods


Just as we did with regression, to compare and contrast classification methods, we will first create a dictionary for them named class_models . Each model gets its own nested dictionary, with the estimator that creates it stored under the model key. This structure will later be used to neatly store the fitted model and its metrics. The model classes in this dictionary have been chosen to represent several model families and to illustrate important concepts that we will discuss later. Some of these will look familiar because they are the same methods used in regression but applied to classification:
class_models = {
    #Generalized Linear Models (GLMs)
    'logistic':{'model': linear_model.LogisticRegression()},
    'ridge':{'model': linear_model.\
        RidgeClassifierCV(cv=5,\
            alphas=[1e-3, 1e-2, 1e-1, 1],\
            class_weight='balanced')},
    #Tree
    'decision_tree':{'model': tree.\
        DecisionTreeClassifier(max_depth=7,\
            random_state=rand)},
    #Nearest Neighbors
    'knn':{'model': neighbors.KNeighborsClassifier(n_neighbors=7)},
    #Naive Bayes
    'naive_bayes':{'model': naive_bayes.GaussianNB()},
    #Ensemble Methods
    'gradient_boosting':{'model': ensemble.\
        GradientBoostingClassifier(n_estimators=210)},
    'random_forest':{'model': ensemble.\
        RandomForestClassifier(max_depth=11,\
            class_weight='balanced', random_state=rand)},
    #Neural Networks
    'mlp':{'model': make_pipeline(StandardScaler(),\
        neural_network.MLPClassifier(hidden_layer_sizes=(7,),\
            max_iter=500, early_stopping=True,\
            random_state=rand))}
}

Before we start fitting the data to these models, we will briefly explain them one by one:

logistic : Logistic regression was introduced in Chapter 2, Key Concepts of Interpretability. It has many of the same pros and cons as linear regression. For instance, feature interactions must be added manually. Like other classification models, it returns a probability between 0 and 1: values closer to 1 denote a probable match to the positive class, while values closer to 0 denote an improbable match to the positive class, and therefore a probable match to the negative class. Naturally, 0.5 is the threshold used to decide between classes, but it doesn't have to be. As we will examine later in the book, there are interpretation and performance reasons to adjust the threshold. Note that this is a binary classification problem, so we are only choosing between delayed (positive) and not delayed (negative), but this method can be extended to multi-class classification, in which case it is called multinomial logistic regression.
ridge : Ridge classification leverages the same regularization technique used in ridge regression but applied to classification. It does this by converting the target values to -1 (for the negative class), keeping 1 for the positive class, and then performing ridge regression. At its heart, it's regression in disguise: it predicts values between -1 and 1 and then maps them back to a 0-1 scale. Like RidgeCV for regression, RidgeClassifierCV uses cross-validation. It splits the training data into equal-size sets – in this case, five ( cv=5 ) – and repeatedly fits the model on all but one set, measuring how well it performs on the held-out set and averaging the results across all five. Several regularization strengths ( alphas ) are tested to find the optimal one. As with all regularization techniques, the point is to discourage learning from unnecessary complexity, minimizing the impact of less salient features.
decision_tree : A "vanilla" decision tree, such as this one, is also known as a CART (Classification And Regression Tree) because it can be used for regression or classification tasks. It has the same architecture for both tasks but functions slightly differently in details such as the algorithm used to decide where to "split" a branch. In this case, we are only allowing our tree to have a depth of 7.
knn : kNN can also be applied to classification tasks, except instead of averaging the nearest neighbors' target features (or labels), it chooses the most frequent one (also known as the mode). We are also using a k of 7 for classification ( n_neighbors=7 ).
naive_bayes : Gaussian Naïve Bayes is part of the family of Naïve Bayes classifiers, which are called naïve because they assume that the features are independent of each other given the class, which is usually not the case. This dramatically impedes their capacity to predict unless the assumption holds. They are called Bayes because they are based on Bayes' theorem of conditional probabilities: the probability of a class given the features is proportional to the prior probability of the class times the probability of the features given the class (see the formula sketch after this list). Gaussian Naïve Bayes makes an additional assumption, which is that continuous values have a normal distribution, also known as a Gaussian distribution.
gradient_boosting : Like random forest, gradient boosted trees are also an ensemble method, but one that leverages boosting instead of bagging. Boosting doesn't work in parallel but in sequence, iteratively training weak learners, incorporating their strengths into a stronger learner, and adapting another weak learner to tackle their weaknesses. Although ensembles, and boosting in particular, can be done with any model class, this one uses decision trees. We have limited the number of trees to 210 ( n_estimators=210 ).
random_forest : The same random forest as with regression except it uses classification decision trees and
not regression trees.
mlp : The same multi-layer perceptron as with regression, but the output layer, by default, uses a logistic function to yield probabilities, which are then converted to 1 or 0 based on the 0.5 threshold. Another difference is that we are using seven neurons in the first and only hidden layer ( hidden_layer_sizes=(7,) ) because binary classification tends to require fewer of them to achieve an optimal result.

Please note that some of these models use balanced weights for the classes ( class_weight='balanced' ), which is very important because this happens to be an imbalanced classification task. By that, we mean that the negative class vastly outnumbers the positive class. You can find out what this looks like for our training data:
print(y_train_class[y_train_class==1].shape[0] /
y_train_class.shape[0])

As you can see from the output, the positive class represents only about 6% of the training data. Models that account for this will achieve fairer results. There are different ways of accounting for class imbalance, which we will discuss in further detail in Chapter 11, Bias Mitigation and Causal Inference Methods, but class_weight='balanced' applies a weight inversely proportional to class frequencies, giving the outnumbered positive class a leg up. A minimal sketch of how these weights can be computed follows.
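The following is a minimal sketch, not part of the chapter's own code, showing how scikit-learn derives balanced class weights; it assumes y_train_class from earlier is available:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights are n_samples / (n_classes * class_count) per class,
# so the rare positive class gets a proportionally larger weight
classes = np.unique(y_train_class)
weights = compute_class_weight(class_weight='balanced',
                               classes=classes, y=y_train_class)
print(dict(zip(classes, weights)))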

Training and evaluating the classification models

Now, let's iterate over our dictionary of models ( class_models ), fit them to the training data, and predict both probabilities and the class, except for ridge , which doesn't output probabilities. We'll then compute five metrics based on the quality of these predictions. Lastly, we'll save the fitted model, test predictions, and metrics in the dictionary for later use. You can go get a coffee while you run the next snippet of code because scikit-learn's gradient_boosting takes longer than the rest to train, so this can take a few minutes to run:

for model_name in class_models.keys():
    fitted_model = class_models[model_name]['model'].\
        fit(X_train, y_train_class)
    y_train_pred = fitted_model.predict(X_train.values)
    if model_name == 'ridge':
        y_test_pred = fitted_model.predict(X_test.values)
    else:
        y_test_prob = fitted_model.predict_proba(X_test.values)[:,1]
        y_test_pred = np.where(y_test_prob > 0.5, 1, 0)
        class_models[model_name]['probs'] = y_test_prob
    class_models[model_name]['fitted'] = fitted_model
    class_models[model_name]['preds'] = y_test_pred
    class_models[model_name]['Accuracy_train'] =\
        metrics.accuracy_score(y_train_class, y_train_pred)
    class_models[model_name]['Accuracy_test'] =\
        metrics.accuracy_score(y_test_class, y_test_pred)
    class_models[model_name]['Recall_train'] =\
        metrics.recall_score(y_train_class, y_train_pred)
    class_models[model_name]['Recall_test'] =\
        metrics.recall_score(y_test_class, y_test_pred)
    if model_name != 'ridge':
        class_models[model_name]['ROC_AUC_test'] =\
            metrics.roc_auc_score(y_test_class, y_test_prob)
    else:
        class_models[model_name]['ROC_AUC_test'] = np.nan
    class_models[model_name]['F1_test'] =\
        metrics.f1_score(y_test_class, y_test_pred)
    class_models[model_name]['MCC_test'] =\
        metrics.matthews_corrcoef(y_test_class, y_test_pred)

We can now convert the dictionary to a DataFrame and display the metrics in a sorted and color-coded fashion:
class_metrics = pd.DataFrame.from_dict(class_models,\
    'index')[['Accuracy_train', 'Accuracy_test',\
        'Recall_train', 'Recall_test',\
        'ROC_AUC_test', 'F1_test', 'MCC_test']]
class_metrics.sort_values(by='ROC_AUC_test', ascending=False).style.\
    format(dict(zip(class_metrics.columns, ['{:.3f}']*7))).\
    background_gradient(cmap='plasma', low=1, high=0.1,\
        subset=['Accuracy_train', 'Accuracy_test']).\
    background_gradient(cmap='viridis', low=1, high=0.1,\
        subset=['Recall_train', 'Recall_test',\
            'ROC_AUC_test', 'F1_test', 'MCC_test'])

The preceding code outputs Figure 3.3:

Figure 3.3 – Classification metrics for our models

To interpret the metrics in Figure 3.3, we ought to first understand what they mean, both in general and in the
context of this classification exercise:

Accuracy: Accuracy is the simplest way to measure the effectiveness of a classification task, and it's the
percentage of correct predictions over all predictions. In other words, in a binary classification task, you can
calculate this by adding the number of True Positives (TPs) and True Negatives (TNs) and dividing them by
a tally of all predictions made. As with regression metrics, you can measure accuracy for both train and test to
gauge overfitting.
Recall: Even though accuracy sounds like a great metric, recall is much better in this case, and the reason is that you could have an accuracy of 94%, which sounds pretty good, but it turns out you are always predicting no delay! In other words, even if you get high accuracy, it is meaningless unless you are predicting accurately for the least represented class, delays. We can find this number with recall (also known as sensitivity or true positive rate), which is TP / (TP + FN), and it can be interpreted as how many of the relevant results were returned – in this case, what percentage of the actual delays were predicted. Another good measure involving true positives is precision, which measures how many of our predicted samples are relevant and is TP / (TP + FP) – in this case, what percentage of predicted delays were actual delays. For imbalanced classes, it is recommended to use both, but depending on your preference for minimizing FNs over FPs, you will prefer recall over precision or vice versa. The formulas for all of these metrics are summarized after this list.
ROC-AUC: ROC is an acronym for Receiver Operating Characteristic, a plot that was designed to separate signal from noise. It plots the false positive rate on the x axis and the true positive rate (recall) on the y axis. AUC stands for Area Under the Curve, which is a number between 0 and 1 that assesses the prediction ability of the classifier: 1 being perfect, 0.5 being as good as a coin toss, and anything lower meaning that if we inverted the results of our prediction, we would have a better prediction. To illustrate this, let's generate a ROC curve for our worst-performing model according to the AUC metric, Naïve Bayes:
plt.tick_params(axis='both', which='major')
fpr, tpr, _ = metrics.roc_curve(y_test_class,\
    class_models['naive_bayes']['probs'])
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' %\
    class_models['naive_bayes']['ROC_AUC_test'])
plt.plot([0, 1], [0, 1], 'k--') #coin toss line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.legend(loc="lower right")

The preceding code outputs Figure 3.4. Note that the diagonal line signifies half the area. In other words, the point
where it has coin-toss-like prediction qualities:
Figure 3.4 – ROC curve for Naïve Bayes

F1: The F1-score is also called the harmonic mean of precision and recall because it's calculated like this: 2TP / (2TP + FP + FN). Since it includes both precision and recall, which pertain to the proportion of true positives, it's a good metric choice to use when your dataset is imbalanced and you don't prefer either precision or recall.
MCC: The Matthews correlation coefficient is a metric drawn from biostatistics. It's gaining popularity in the broader data science community because it only produces a high score when the proportions of TPs, FNs, TNs, and FPs are all favorable, and it takes class proportions into account, which makes it optimal for imbalanced classification tasks. Unlike all the other metrics used so far, it doesn't range from 0 to 1 but from -1, complete disagreement, to 1, total agreement between predictions and actuals. The mid-point, 0, is equivalent to a random prediction.
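For reference, here is a minimal summary of the formulas referenced above in standard notation; the MCC formula is the usual binary-classification definition:

$$Recall = \frac{TP}{TP + FN} \qquad Precision = \frac{TP}{TP + FP} \qquad F_1 = \frac{2TP}{2TP + FP + FN}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$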

Our classification metrics are mostly very good, exceeding 96% accuracy and 75% recall. However, even recall isn't everything. For instance, random_forest , due to its class balancing with weights, got the highest recall but did poorly in F1 and MCC, which suggests that its precision is not very good. Ridge classification also had the same setting and had such a poor F1 score that its precision must have been dismal. This doesn't mean this weighting technique is inherently wrong, but it often requires more control. This book will cover techniques to achieve the right balance between fairness and accuracy, accuracy and reliability, reliability and validity, and so on. This is a balancing act that requires many metrics and visualizations. A key takeaway from this exercise should be that a single metric will not tell you the whole story, and interpretation is about telling the most relevant and sufficiently complete story.

Understanding limitations of traditional model interpretation methods


In a nutshell, traditional interpretation methods only cover surface-level questions about your models, such as the following:

In aggregate, do they perform well?
What changes in hyperparameters may impact predictive performance?
What latent patterns can you find between the features and their predictive performance?

These questions are very limiting if you are trying to understand not only whether your model works, but why and how it works.

This gap in understanding can lead to unexpected issues with your model that won't necessarily be immediately
apparent. Let's consider that models, once deployed, are not static but dynamic. They face different challenges than
they did in the "lab" when you were training them. They may face not only performance issues but issues with bias
such as imbalance with underrepresented classes, or security with adversarial attacks. Realizing that the features
have changed in the real-world environment, we might have to add new features instead of merely retraining with
the same feature set. And if there are some troubling assumptions made by your model, you might have to re-
examine the whole pipeline. But how do you recognize that these problems exist in the first place? That's when
you will need a whole new set of interpretation tools that can help you dig deeper and answer more specific
questions about your model. These tools provide interpretations that can truly account for Fairness,
Accountability, and Transparency (FAT), which we discussed in Chapter 1, Interpretation, Interpretability, and
Explainability; and Why Does It All Matter?

Studying intrinsically interpretable (white-box) models


So far, in this chapter, we have already fitted our training data to model classes representing each of these "white-
box" model families. The purpose of this section is to show you exactly why they are intrinsically interpretable.
We'll do so by employing the models that were previously fitted.

Generalized Linear Models (GLMs)

GLMs are a large family of model classes that have a model for every statistical distribution. Just like linear regression assumes that your target feature and residuals have a normal distribution, logistic regression assumes the Bernoulli distribution. There are GLMs for every distribution, such as Poisson regression for the Poisson distribution and multinomial response for the multinomial distribution. You choose which GLM to use based on the distribution of your target variable and whether your data meets the GLM's other assumptions (they vary). In addition to an underlying distribution, what ties GLMs together into a single family is the fact that they all have a linear predictor. In other words, the ŷ target variable (or predictor) can be expressed mathematically as a weighted sum of the X features, where the weights are called β coefficients. This is the simple formula, the linear predictor function, that all GLMs share:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
However, although they share this same formula, they each have a different link function, which provides a link between the linear predictor function and the mean of the GLM's statistical distribution. This can add some non-linearity to the resulting model formula while retaining the linear combination between the β coefficients and the X input data, which can be a source of confusion. Still, the model is linear because of the linear combination.

There are also many variations for specific GLMs. For instance, Polynomial regression is linear regression with
polynomials of its features, and ridge regression is linear regression with L2 regularization. We won't cover all
GLMs in this section because they aren't needed for the example in this chapter, but all have plausible use cases.

Incidentally, there's also a similar concept called Generalized Additive Models (GAMs), which are GLMs that
don't require linear combinations of features and coefficients and instead retain the addition part, but of arbitrary
functions applied on the features. GAMs are also interpretable, but they are not as common, and usually tailored to
specific use cases ad hoc.

Linear regression

In Chapter 1, Interpretation, Interpretability, and Explainability; and Why Does It All Matter?, we covered the formula of simple linear regression, which only has a single X feature. Multiple linear regression extends this to have any number of features, so instead of being:

$$\hat{y} = \beta_0 + \beta_1 x_1$$

it can be:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

with n features, and where $\beta_0$ is the intercept. Thanks to linear algebra, this can be expressed as a simple matrix multiplication:

$$\hat{y} = X\beta$$
The method used to arrive at the optimal β coefficients, OLS, is well studied and understood. Also, in addition to the coefficients, you can extract confidence intervals for each of them. The model's correctness depends on whether the input data meets the assumptions: linearity, normality, independence, (mostly) a lack of multicollinearity, and homoscedasticity. We've discussed linearity quite a bit so far, so we will briefly explain the rest:

Normality is the property that each feature is normally distributed. This can be tested with a Q-Q plot, histogram, or Kolmogorov-Smirnov test, and non-normality can be corrected with non-linear transformations. If a feature isn't normally distributed, it will make its coefficient's confidence intervals invalid.
Independence is when your observations (the rows in your dataset) are independent of each other, like
different and unrelated events. If your observations aren't independent, it could affect your interpretation of
the results. In this chapter's example, if you had multiple rows about the same flight, that could violate this
assumption and make results hard to understand. This can be tested by looking for duplicate flight numbers.
Lack of multicollinearity is desirable because, otherwise, you'd have inaccurate coefficients. Multicollinearity occurs when features are highly correlated with each other. This can be tested with a correlation matrix, the tolerance measure, or the Variance Inflation Factor (VIF), and it can be fixed by removing one feature from each highly correlated pair.
Homoscedasticity was briefly discussed in Chapter 1, Interpretation, Interpretability, and Explainability;
and Why Does It All Matter? and it's when the residuals (the errors) are more or less equal across the
regression line. This can be tested with the Goldfeld-Quandt test, and heteroscedasticity (the lack of
homoscedasticity) can be corrected with non-linear transformations. This assumption is often violated in
practice.

Even though we haven't done it for this chapter's example, if you are going to rely on linear regression heavily, it's always good to test these assumptions before you even begin to fit your data to a linear regression model. This book won't detail how this is done because it's more about model-agnostic and deep learning interpretation methods than about meeting the assumptions of a specific class of models, such as normality and homoscedasticity. However, we covered the characteristics that hinder interpretation the most in Chapter 2, Key Concepts of Interpretability, and we will continue to look for these characteristics: non-linearity, non-monotonicity, and interactivity. We will do this mainly because the linearity of, and correlation between, features are still relevant regardless of the model class used to make predictions, and these are characteristics that can be easily tested for with the methods used for linear regression.
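For instance, here is a minimal sketch, not part of this chapter's exercise, of how you could check the multicollinearity assumption with the Variance Inflation Factor from statsmodels, assuming X_train from earlier:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
import pandas as pd

# Add a constant column because VIF expects an intercept term
X_const = sm.add_constant(X_train)
vif_df = pd.DataFrame({
    'feature': X_const.columns,
    'VIF': [variance_inflation_factor(X_const.values, i)
            for i in range(X_const.shape[1])]
})
# As a rule of thumb, VIF values above 5-10 suggest problematic multicollinearity
print(vif_df.sort_values(by='VIF', ascending=False))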

Interpretation

So how do we interpret a linear regression model? Easy! Just get the coefficients and the intercept. Our scikit-learn
models have these attributes embedded in the fitted model:
coefs_lm = reg_models['linear']['fitted'].coef_
intercept_lm = reg_models['linear']['fitted'].intercept_
print('coefficients:%s' % coefs_lm)
print('intercept:%s' % intercept_lm)

The preceding code outputs the following:


coefficients: [ 4.54955677e-03 -5.25032459e-03 8.94123625e-01 1.25274473e-01 -6.46799581e-04
intercept: -37.860211953237275

So now you know the formula, which looks something like this:
ŷ = -37.86 + 0.0045x₁ - 0.0053x₂ + 0.894x₃ + ...

This formula should provide some intuition on how the model can be interpreted globally. Interpreting each
coefficient in the model can be done for multiple linear regression, just as we did with the simple linear regression
example in Chapter 1, Interpretation, Interpretability, and Explainability; and Why Does It All Matter?. The
coefficients act as weights, but they also tell a story that varies depending on the kind of feature. To make
interpretation more manageable, let's put our coefficients in a DataFrame alongside the names of each feature:
pd.DataFrame({'feature': X_train.columns.tolist(),\
'coef': coefs_lm.tolist()})

The preceding code produces the data frame in Figure 3.5:


Figure 3.5 – Coefficients of linear regression features

Here's how to interpret a feature using the coefficients in Figure 3.5:

Continuous: Like ARR_RFPH , you know that for every one-unit increase (relative flights per hour), it
increases the predicted delay by 0.373844 minutes, if all other features stay the same.
Binary: Like ORIGIN_HUB , you know the difference between the origin airport being a hub or not is expressed by the coefficient -1.029088. In other words, since it's a negative number, the origin airport being a hub reduces the predicted delay by just over 1 minute, if all other features stay the same.
Categorical: We don't have categorical features, but we have ordinal features that could have been, and
actually should have been, categorical features. For instance, DEP_MONTH and DEP_DOW are integers from 1-
12 and 0-6, respectively. If they are treated as ordinals, we are assuming because of the linear nature of linear
regression that an increase or decrease in months has an impact on the outcome. It's the same with the day of
the week. But the impact is tiny. Had we treated them as dummy or one-hot encoded features, we could
measure whether Fridays are more prone to carrier delays than Saturdays and Wednesdays, or Julys than
Octobers and Junes. This couldn't possibly be modeled with them in order, because they have no relation to
this order (yep – it's non-linear!).
So, say, we had a feature called DEP_FRIDAY and another called DEP_JULY . They are treated like binary
features and can tell you precisely what effect a departure being on a Friday or in July has on the model.
Some features were kept as ordinal or continuous on purpose, despite being good candidates for being
categorical, to demonstrate how not making the right adjustments to your features can impact the expressive
power of model interpretation. It would have been good to tell airline executives more about how the day and
time of a departure impacted delays. Also, in some cases – not in this one – an oversight like this can grossly
affect a linear regression model's performance.

The intercept (-37.86) is not a feature, but it does have a meaning, which is if all features were at 0, what would the
prediction be? In practice, this doesn't happen unless your features happen to all have a plausible reason to be 0.
Just as in Chapter 1, Interpretation, Interpretability, and Explainability; and Why Does It All Matter? you wouldn't
have expected anyone to have a height of 0, in this example, you wouldn't expect a flight to have a distance of 0.
However, if you standardized the features so that they had a mean of 0, then you would change the interpretation
of the intercept to be the prediction you expect if all features are their mean value.
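As a minimal sketch (not part of this chapter's exercise), you can verify this by standardizing the features and refitting; with zero-centered features, the intercept becomes the prediction for an "average" flight:
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model

# Standardize features so each has a mean of 0 and unit variance
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)

# Refit linear regression on the standardized features
lm_std = linear_model.LinearRegression().fit(X_train_std, y_train_reg)

# The intercept is now the predicted delay when every feature is at its mean,
# which for OLS equals the mean of the training target
print(lm_std.intercept_, y_train_reg.mean())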

Feature importance

The coefficients can also be leveraged to calculate feature importance. Unfortunately, scikit-learn's linear regressor is ill-equipped to do this because it doesn't output the standard error of the coefficients. All it takes to rank features according to their importance is to divide the βs by their corresponding standard errors. The result is something called the t-statistic:

$$t_j = \frac{\hat{\beta}_j}{SE\left(\hat{\beta}_j\right)}$$

And then you take the absolute value of this and sort the features from high to low. It's easy enough to calculate, but you need the standard error. You could reverse engineer the linear algebra involved to retrieve it using the intercept and the coefficients returned by scikit-learn. However, it's probably a lot easier to fit the linear regression model again, but this time using the statsmodels library, whose summary includes all the statistics, including the standard errors. By the way, statsmodels names its linear regressor OLS , which makes sense because OLS is the name of the mathematical method that fits the data:
linreg_mdl = sm.OLS(y_train_reg, sm.add_constant(X_train))
linreg_mdl = linreg_mdl.fit()
print(linreg_mdl.summary())

There's quite a bit to unpack in the regression summary, and this book won't address all of it. However, the t-statistic can tell you how important features are in relation to each other. There's another, more pertinent, statistical interpretation: if you were to hypothesize that a β coefficient is 0 – in other words, that the feature has no impact on the model – the distance of the t-statistic from 0 helps reject that null hypothesis. This is what the p-value to the right of the t-statistic does. It's no coincidence that the t closest to 0 (for ARR_AFPH ) has the only p-value above 0.05. This puts this feature at a level of insignificance, since everything below 0.05 is statistically significant according to this method of hypothesis testing.

So to rank our features, let's extract the data frame from the statsmodels summary. Then, we drop the const
(the intercept) because this is not a feature. Then, we make a new column with the absolute value of the t-statistic
and sort it accordingly. To demonstrate how the absolute value of the t-statistic and p-value are inversely related,
we are also color-coding these columns:
summary_df = linreg_mdl.summary2().tables[1]
summary_df = summary_df.drop(['const']).reset_index().\
rename(columns={'index':'feature'})
summary_df['t_abs'] = abs(summary_df['t'])
summary_df.sort_values(by='t_abs', ascending=False).style.\
format(dict(zip(summary_df.columns[1:], ['{:.4f}']*7))).\
background_gradient(cmap='plasma_r', low=0, high=0.1,\
subset=['P>|t|' ,'t_abs'])

The preceding code outputs Figure 3.6:

Figure 3.6 – Linear regression summary table sorted by the absolute value of the t-statistic

Something particularly interesting about the feature importance in Figure 3.6 is that different kinds of delays occupy five of the top six positions. Of course, this could be because linear regression is confounding the different
non-linear effects these have, or perhaps there's something here we should look further into. Especially since the
statsmodels summary under the "Warnings" section cautions:

"[2] The condition number is large, 5.69e+04. This might indicate that there are strong multicollinearity or
other numerical problems."

This is odd. Hold that thought. We will examine this further later.

Ridge regression

Ridge regression is part of a sub-family of penalized or regularized regression along with the likes of LASSO and
ElasticNet because, as explained earlier in this chapter, it penalizes using the L2 norm. This sub-family is also
called sparse linear models because, thanks to the regularization, it cuts out some of the noise by making
irrelevant features less relevant. Sparsity in this context means less is more because reduced complexity will lead
to lower variance and improved generalization.

To illustrate this concept, look at the feature importance table (Figure 3.6) we output for linear regression. Something that should be immediately apparent is how the t_abs column starts with every row a different color,
and then a whole bunch of them are the same shade of yellow. Because of the variation in confidence intervals, the
absolute t-value is not something you can take proportionally and say that your top feature is hundreds of times
more relevant than every one of your bottom 10 features. However, it should indicate that there are significantly
more important features than others to the point of irrelevance, and possibly confoundment, hence creating noise.
There's ample research on how a small subset of features tends to have the most substantial effects on the outcome of the model. This is called the bet on sparsity principle. Whether it's true or not for your data, it's always good to test the theory by applying regularization, especially in cases where the data is very wide (many features) or exhibits multicollinearity. These regularized regression techniques can be incorporated into feature selection processes or used to inform your understanding of which features are essential. A minimal LASSO sketch follows to illustrate this.
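The following is a minimal sketch, not part of this chapter's exercise, of how LASSO's L1 penalty zeroes out coefficients, using scikit-learn's cross-validated LASSO on the same training data:
from sklearn.linear_model import LassoCV

# Fit a cross-validated LASSO; the L1 penalty can shrink coefficients to exactly zero
lasso = LassoCV(cv=5, random_state=rand).fit(X_train, y_train_reg)

# Features with a coefficient of exactly zero have effectively been dropped
zeroed = X_train.columns[lasso.coef_ == 0].tolist()
print('alpha chosen:', lasso.alpha_)
print('features zeroed out:', zeroed)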

There is a technique to adapt ridge regression to classification problems. It was briefly discussed before. It
converts the labels to a -1 to 1 scale for training to predict values between -1 and 1, and then turns them back to a
0-1 scale. However, it uses regularized linear regression to fit the data, and can be interpreted in the same way.

Interpretation

Ridge regression can be interpreted in the same way as linear regression, both globally and locally, because once the model has been fitted, there's no difference. The formula is the same, except that the β coefficients are different because they were penalized with a λ parameter, which controls how much shrinkage to apply.

We can quickly compare coefficients by extracting the ridge coefficients from their fitted model and placing them
side by side in a DataFrame with the coefficients of the linear regression:
coefs_ridge = reg_models['ridge']['fitted'].coef_
coef_ridge_df =\
pd.DataFrame({'feature':X_train.columns.values.tolist(),\
'coef_linear': coefs_lm,\
'coef_ridge': coefs_ridge})
coef_ridge_df['coef_regularization'] =\
coef_ridge_df['coef_linear'] - coef_ridge_df['coef_ridge']
coef_ridge_df.style.\
background_gradient(cmap='plasma_r', low=0, high=0.1 ,\
subset=['coef_regularization'])

As you can tell in Figure 3.7 output by the preceding code, the coefficients are always slightly different, but
sometimes they are lower and sometimes higher:

Figure 3.7 – Linear regression coefficients compared to ridge regression coefficients

We didn't save the λ parameter (which scikit-learn calls alpha ) that the ridge regression cross-validation deemed optimal. However, we can run a little experiment of our own to figure out which parameter was best. We do this by iterating through 100 possible alpha values between 10⁰ (1) and 10¹³ (10,000,000,000,000), fitting the data to a Ridge model with each alpha, and then appending the coefficients to an array. We exclude the coefficient at index 8 from the array because it's so much larger than the rest and would make it harder to visualize the effects of shrinkage:
num_alphas = 100
alphas = np.logspace(0, 13, num_alphas)
alphas_coefs = []
for alpha in alphas:
    ridge = linear_model.Ridge(alpha=alpha).fit(X_train, y_train_reg)
    alphas_coefs.append(np.concatenate((ridge.coef_[:8],\
        ridge.coef_[9:])))

Now that we have an array of coefficients, we can plot the progression of coefficients:
plt.gca().invert_xaxis()
plt.tick_params(axis = 'both', which = 'major')
plt.plot(alphas, alphas_coefs)
plt.xscale("log")
plt.xlabel('Alpha')
plt.ylabel('Ridge coefficients')
plt.grid()
plt.show()

The preceding code generates Figure 3.8:

Figure 3.8 – Value of alpha hyperparameters versus the value of ridge regression coefficients

Something to note in Figure 3.8 is that the higher the alpha, the stronger the regularization. This is why when alpha is 10¹², all coefficients have converged to 0, and as alpha becomes smaller, they reach a point where they have all diverged and more or less stabilized. In this case, this point is reached at about 10². Another way of seeing it is that when all coefficients are around 0, the regularization is so strong that all features are made irrelevant, and when they have sufficiently diverged and stabilized, the regularization makes them all relevant, which defeats the purpose. Now, on that note, if we go back to our code, we will find that this is what we chose for the alphas in our RidgeCV : alphas=[1e-3, 1e-2, 1e-1, 1] . As you can tell from the preceding plot, by the time the alphas have reached 1 and below, the coefficients have already stabilized, even though they are still fluctuating slightly. This can explain why our ridge model did not perform better than linear regression. Usually, you would expect a regularized model to perform better than one that isn't regularized – unless your hyperparameters are not right.

INTERPRETATION AND HYPERPARAMETERS

Well-tuned regularization can help cut out the noise and thus increase interpretability, but the alphas chosen for RidgeCV were selected on purpose to convey this point: regularization can only work if you choose hyperparameters correctly. Or, when regularization hyperparameter tuning is automatic, the method must be optimal for your dataset.
Feature importance

This is precisely the same as with linear regression, but again we need the standard error of the coefficients, which
is something that cannot be extracted from the scikit-learn model. You can use the statsmodels
fit_regularized method to this effect.
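As a minimal sketch (an assumption on our part rather than the chapter's own code, with an arbitrary penalty strength), a ridge-like fit can be obtained in statsmodels by setting the elastic net's L1 weight to zero; note that the regularized results object mainly exposes the penalized coefficients:
import statsmodels.api as sm

# Elastic net with L1_wt=0 reduces to a pure L2 (ridge-like) penalty;
# alpha=0.1 is only an illustrative value
ridge_sm = sm.OLS(y_train_reg, sm.add_constant(X_train)).\
    fit_regularized(method='elastic_net', alpha=0.1, L1_wt=0)
print(ridge_sm.params)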

Polynomial regression

Polynomial regression is a special case of linear or logistic regression where the features have been expanded to
have higher degree terms. We have only performed polynomial linear regression in this chapter's exercise, so we
will only discuss this variation. However, it is applied similarly.

A two-feature multiple linear regression would look like this:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

but in polynomial regression, every feature is expanded to have higher-degree terms as well as interactions between all the features. So, if this two-feature example were expanded to a second-degree polynomial, the linear regression formula would look like this:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2$$

It's still linear regression in every way except that it has extra features, higher-degree terms, and interactions. While
you can limit polynomial expansion to only one or a few features, we used PolynomialFeatures , which does this to all features. Therefore, the 21 features were expanded into many more terms. We can extract the coefficients from our fitted model and, using the shape property of the numpy array, return how many coefficients were generated. This amount corresponds to the number of features generated:
reg_models['linear_poly']['fitted'].\
get_params()['linearregression'].coef_.shape[0]

It outputs 253. We can do the same with the version of polynomial regression that used interaction terms only:
reg_models['linear_interact']['fitted'].\
get_params()['linearregression'].coef_.shape[0]

The above code outputs 232. The reality is that most terms in a polynomial generated like this are interactions
between all the features.

Interpretation and Feature Importance

Polynomial regression can be interpreted, both globally and locally, in precisely the same way as linear regression. In this case, it's not practical to understand a formula with 253 linearly combined terms, so it loses what we defined in Chapter 2, Key Concepts of Interpretability, as global holistic interpretation. However, it can still be interpreted in all other scopes and retains many of the properties of linear regression. For instance, since the model is additive, it is easy to separate the effects of the features. You can also use the same many peer-reviewed, tried-and-tested statistical methods that are used for linear regression: the t-statistic, p-value, confidence bounds, and R-squared, as well as the many tests used to assess goodness of fit or a lack thereof, residual analysis, linear correlation, and analysis of variance. This wealth of statistically proven methods for testing and interpreting models isn't something most model classes can count on. Unfortunately, many of them are model-specific to linear regression and its special cases.

Also, we won't do it here because there are so many terms, but you could undoubtedly rank features for polynomial regression in the same way we have for linear regression using the statsmodels library. The challenge is figuring out the order of the features generated by PolynomialFeatures so you can name them accordingly in the feature name column (see the sketch below). Once this is done, you can tell if some second-degree terms or interactions are important, which could tell you whether these features have a non-linear nature or depend highly on other features.
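Here is a minimal sketch of one way to recover those term names, assuming a reasonably recent scikit-learn version that provides get_feature_names_out on PolynomialFeatures :
# Pull the fitted PolynomialFeatures step out of the pipeline
poly_step = reg_models['linear_poly']['fitted'].\
    get_params()['polynomialfeatures']

# Map each generated term (e.g. 'DEP_DELAY DISTANCE') to its coefficient
term_names = poly_step.get_feature_names_out(X_train.columns)
poly_coefs = reg_models['linear_poly']['fitted'].\
    get_params()['linearregression'].coef_
poly_coef_df = pd.DataFrame({'term': term_names, 'coef': poly_coefs})
poly_coef_df['abs_coef'] = poly_coef_df['coef'].abs()
print(poly_coef_df.sort_values(by='abs_coef', ascending=False).head(10))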

Logistic regression

We discussed logistic regression as well as its interpretation and feature importance in Chapter 2, Key Concepts of
Interpretability. We will only expand on that a bit here in the context of this chapter's classification exercise and to
underpin why exactly it is interpretable. The fitted logistic regression model has coefficients and intercepts just as
the linear regression model does:
coefs_log = class_models['logistic']['fitted'].coef_
intercept_log = class_models['logistic']['fitted'].intercept_
print('coefficients:%s' % coefs_log)
print('intercept:%s' % intercept_log)

The preceding code outputs this:


coefficients: [[-6.31114061e-04 -1.48979793e-04 2.01484473e-01 1.32897749e-01 1.31740116e-05
intercept: [-0.20139626]

However, the way these coefficients appear in the formula for a specific prediction ŷ⁽ⁱ⁾ is entirely different:

$$P\left(y^{(i)}=1 \mid x^{(i)}\right) = \frac{1}{1 + e^{-\left(\beta_0 + \beta_1 x_1^{(i)} + \dots + \beta_n x_n^{(i)}\right)}}$$

In other words, the probability that $y^{(i)} = 1$ (a positive case) is expressed by a logistic (sigmoid) function that involves exponentials of the linear combination of the β coefficients and the x features. The presence of the exponentials explains why the coefficients extracted from the model are log-odds: to isolate the coefficients, you have to apply a logarithm to both sides of the equation.

Interpretation

To interpret each coefficient, you do it in precisely the same way as with linear regression, except that for each unit increase in a feature, you increase the odds of getting the positive case by a factor expressed by the exponential of the coefficient – all things being equal (remember the ceteris paribus assumption discussed in Chapter 2, Key Concepts of Interpretability). An exponential, $e^{\beta_j}$, has to be applied to each coefficient because the coefficients express an increase in log-odds and not odds (see the sketch below). Besides incorporating the log-odds into the interpretation, what was said about continuous, binary, and categorical features in linear regression interpretation also applies to logistic regression.
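Here is a minimal sketch of turning the logistic regression coefficients into odds ratios; it reuses coefs_log from above:
# Exponentiate log-odds coefficients to get odds ratios per unit increase
odds_ratios = pd.Series(np.exp(coefs_log.reshape(-1)),\
    index=X_train.columns).sort_values(ascending=False)
# For example, an odds ratio of 1.2 would mean each unit increase multiplies
# the odds of a delay by roughly 1.2, all else being equal
print(odds_ratios.head(10))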

Feature importance

Frustrating as it is, there isn't consensus yet in the statistical community on how to best get feature importance for logistic regression. There's a standardize-all-features-first method, a pseudo-R² method, a one-feature-at-a-time ROC AUC method, a partial chi-squared statistic method, and then the simplest one, which is multiplying the standard deviation of each feature by its coefficient. We won't cover all these methods, but it has to be noted that computing feature importance consistently and reliably is a problem for most model classes, even white-box ones. We will dig deeper into this in Chapter 4, Fundamentals of Feature Importance and Impact. For logistic regression, perhaps the most popular method is achieved by standardizing all the features before training, that is, making sure they are centered at zero and divided by their standard deviation. But we didn't do this because, although it has other benefits, it makes the interpretation of coefficients more difficult. So here we are using the rather crude method leveraged in Chapter 2, Key Concepts of Interpretability, which is to multiply the standard deviation of each feature by its coefficient:
stdv = np.std(X_train, 0)
abs(coefs_log.reshape(21,) * stdv).sort_values(ascending=False)

The preceding code yields the following output:


DEP_DELAY 8.918590
CRS_ELAPSED_TIME 6.034794
DISTANCE 5.309037
LATE_AIRCRAFT_DELAY 4.985519
NAS_DELAY 2.387845
WEATHER_DELAY 2.155292
TAXI_OUT 1.311593
SECURITY_DELAY 0.383242
ARR_AFPH 0.320974
: :
WHEELS_OFF 0.006806
PCT_ELAPSED_TIME 0.003410

It can still approximate the importance of features quite well. And, just like with linear regression, you can tell that delay features rank quite high: all five of them are among the top eight features. Indeed, it's something we should look into. We will discuss more on that as we cover some other white-box methods.

Decision trees

Decision trees have been used for the longest time, even before they were turned into algorithms. They hardly require any mathematical ability to understand, and this low barrier to comprehensibility makes them extremely interpretable in their simplest representations. However, in practice, there are many kinds of decision trees, and most of them are not very interpretable because they use ensemble methods (boosting, bagging, and stacking), or even leverage PCA or some other embedder. Even non-ensembled decision trees can get extremely complicated as they become deeper. Regardless of a decision tree's complexity, it can always be mined for important insights about your data and expected predictions, and decision trees can be fitted to both regression and classification tasks.
CART decision trees

The Classification and Regression Trees (CART) algorithm is the "vanilla", no-frills decision tree of choice in most use cases. And as noted, most decision trees aren't white-box models, but this one is because it can be expressed as a mathematical formula, as well as visualized and printed as a set of rules that subdivide the tree into branches and eventually leaves.

The mathematical formula is:

$$\hat{y} = \hat{f}(x) = \sum_{m=1}^{M} c_m I\left(x \in R_m\right)$$

What this means is that if, according to the identity function I, x is in the subset $R_m$, it returns a 1, and if not, a 0. This binary term is multiplied by $c_m$, the average of the target for all training elements in the subset $R_m$. So if $x_i$ falls in the subset belonging to the leaf node $R_k$, then the prediction is $\hat{y}_i = c_k$. In other words, the prediction is the average of all elements in the subset the observation lands in. This is what happens in regression tasks; in binary classification, there is simply no average $c_m$ to multiply by the I identity function, and the leaf's majority class is predicted instead.

At the heart of every decision tree algorithm, there's a method to generate the $R_m$ subsets. For CART, this is achieved using something called the Gini index, recursively choosing splits that make the two resulting branches as different as possible. A minimal formula sketch follows.
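As a point of reference, here is the Gini impurity in standard notation for a node with K classes, where $p_k$ is the proportion of samples of class k in that node; splits are chosen to minimize the weighted impurity of the resulting branches:

$$Gini = 1 - \sum_{k=1}^{K} p_k^2$$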

Interpretation

A decision tree can be globally and locally interpreted visually. Here, we have established a maximum depth of 2 ( max_depth=2 ) for the plot because we could render all seven layers, but the text would be too small to read. One of the limitations of this method is that it gets complicated to visualize trees with depths above 3 or 4. However, you can always programmatically traverse the branches of the tree and visualize only some of them at a time:
fig, axes = plt.subplots(nrows=1, ncols=1,\
    figsize=(16,8), dpi=600)
tree.plot_tree(class_models['decision_tree']['fitted'],\
    feature_names=X_train.columns.values.tolist(),\
    filled=True, max_depth=2)
fig.show()

The preceding code prints out the tree in Figure 3.9. From the tree, you can tell that the very first branch splits the decision tree based on whether the value of DEP_DELAY is smaller than or equal to 20.5. It tells you the Gini index that informed that decision and the number of samples (just another way of saying observations, data points, or rows) present. You can traverse these branches until they reach a leaf. There is one leaf node visible in this tree, on the far left. This is a classification tree, so you can tell from value = [629167, 0] that all 629,167 samples in this node have been classified as 0 (not delayed):

Figure 3.9 – Our models' plotted decision tree

Another way to visualize the tree, albeit with fewer details such as the Gini index and sample size, is by printing out the decision made at every branch and the class at every node:
text_tree = tree.\
    export_text(class_models['decision_tree']['fitted'],\
        feature_names=X_train.columns.values.tolist())
print(text_tree)

And the preceding code outputs the following:

Figure 3.10 – Our decision tree's structure

There's a lot more that can be done with a decision tree, and scikit-learn provides an API to explore the tree, as sketched below.
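For instance, here is a minimal sketch of traversing the fitted tree's low-level structure via the tree_ attribute; the node arrays shown are part of scikit-learn's tree API:
dt = class_models['decision_tree']['fitted'].tree_
# Each node stores the split feature index, threshold, and child node ids;
# leaves are marked with children set to -1
for i in range(min(dt.node_count, 5)):
    if dt.children_left[i] == -1:
        print(f'node {i}: leaf, value={dt.value[i]}')
    else:
        print(f'node {i}: {X_train.columns[dt.feature[i]]} <= '
              f'{dt.threshold[i]:.2f} -> nodes '
              f'{dt.children_left[i]}, {dt.children_right[i]}')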

Feature importance

Calculating feature importance in a CART decision tree is reasonably straightforward. As you can appreciate from the visualizations, some features appear more often in the decisions than others, but their appearances are weighted by how much they contributed to the overall reduction in the Gini index compared to the previous node. The relative decreases in the Gini index throughout the tree are tallied, and the contribution of each feature is a percentage of this total reduction:
dt_imp_df = pd.DataFrame({'feature':X_train.columns.values.tolist(),
'importance': class_models['decision_tree']['fitted'].\
feature_importances_}).\
sort_values(by='importance', ascending=False)
dt_imp_df

The dt_imp_df data frame output by the preceding code can be appreciated in Figure 3.15.

Figure 3.15 – Our decision tree's feature importance

This last feature importance table, Figure 3.15, increases suspicions about the delay features. They occupy, yet
again, five of the top six positions. Is it possible that all five of them have such an outsized effect on the model?

INTERPRETATION AND DOMAIN EXPERTISE

The target CARRIER_DELAY is also called a dependent variable because it's dependent on all the other
features, the independent variables. Even though a statistical relationship doesn't imply causation, we
want to inform our feature selection based on our understanding of what independent variables could
plausibly affect a dependent one. It makes sense that a departure delay ( DEPARTURE_DELAY ) affects the
arrival delay (which we removed), and therefore, CARRIER_DELAY . Similarly, LATE_AIRCRAFT_DELAY
makes sense as a predictor because it is known before the flight takes off if a previous aircraft was
several minutes late, causing this flight to be at risk of arriving late, but not as a cause of the current
flight (ruling this option out). However, even though the Bureau of Transportation Statistics website
defines delays in such a way that they appear to be discrete categories, some may be determined well
after a flight has departed. For instance, in predicting a delay mid-flight, could we predict based on
WEATHER_DELAY if the bad weather hasn't yet happened? And could we predict based on
SECURITY_DELAY if the security breach hasn't yet occurred? The answers to these questions are that we
probably shouldn't because the rationale for including them is they could serve to rule out
CARRIER_DELAY but this only works if they are discrete categories that pre-date the dependent variable!
Before coming to further conclusions, what you would need to do is talk to the airline executives to
determine the timeline on which each delay category gets consistently set and (hypothetically) is
accessible from the cockpit or the airline's command center. Even if you are forced to remove them from
the models, maybe other data can fill the void in a meaningful way, such as the first 30 minutes of flight logs and/or historical weather patterns. Interpretation is not always directly inferred from the data and the machine learning models, but by working closely with domain experts. But sometimes domain experts can mislead you too. In fact, another insight involves all the time-based metrics and categorical features we engineered at the beginning of the chapter ( DEP_DOW , DEST_HUB , ORIGIN_HUB , and so on).
It turns out they have consistently had little to no effect on the models. Despite the airline executives
hinting at the importance of days of the week, hubs, and congestion, we should have explored the data
further, looking for correlations before engineering the data. But even if we do engineer some useless
features, it also helps to use a white-box model to assess their impact, as we have. In data science,
practitioners often will learn the same way the most performant machine learning models do – by trial
and error!

RuleFit

RuleFit is a model-class family that hybridizes a LASSO linear regression, which provides regularized coefficients for every feature, with decision rules, which it also regularizes using LASSO. These decision rules are extracted by traversing a decision tree, finding interaction effects between features, and assigning coefficients to them based on their impact on the model. The implementation used in this chapter uses gradient boosted decision trees to perform this task.

We haven't covered decision rules explicitly in this chapter, but they are yet another family of intrinsically
interpretable models. They weren't included because, at the time of writing, the only Python library that supports
decision rules, called Bayesian Rule List (BRL) by Skater, is still at an experimental stage. In any case, the
concept behind decision rules is very similar. They extract the feature interactions from a decision tree but don't
discard the leaf node, and instead of assigning coefficients, they use the predictions in the leaf node to construct
the rules. The last rule is a catch-all, like an ELSE statement. Unlike RuleFit, a rule list can only be understood sequentially because it behaves like an IF-THEN-ELSE statement, but that simplicity is also its main advantage.

Interpretation and feature importance

You can put everything you need to know about RuleFit into a single dataframe ( rulefit_df ). Then you remove the rules that have a coefficient of 0 . It has these because in LASSO, unlike ridge, coefficient estimates can shrink to exactly zero. You can sort the dataframe by importance in a descending manner to see what features or feature
interactions (in the form of rules) are most important:
rulefit_df = reg_models['rulefit']['fitted'].get_rules()
rulefit_df = rulefit_df[rulefit_df.coef !=0].\
sort_values(by="importance", ascending=False)
rulefit_df

The rules in the rulefit_df data frame can be seen in Figure 3.16:
Figure 3.16 – RuleFit's rules

There's a type for every RuleFit feature in Figure 3.16. Those that are linear are interpreted as you would any
linear regression coefficient. Those that are type=rule are also to be treated like binary features in a linear
regression model. For instance, if the rule WEATHER_DELAY > 255.0 & DEP_DELAY > 490.5 is true, then the
coefficient -333.579026 is applied to the prediction. The rules capture the interaction effects, so you don't have to
add interaction terms to the model manually or use some non-linear method to find them. Furthermore, it does this
in an easy-to-understand manner. You can use RuleFit to guide your understanding of feature interactions even if
you choose to productionize other models.

Nearest neighbors

Nearest neighbors is a family of models that even includes unsupervised methods. All of them use the closeness
between data points to inform their predictions. Of all these methods, only the supervised kNN and its cousin
Radius Nearest Neighbors are somewhat interpretable.

k-Nearest Neighbors

The idea behind kNN is straightforward. It takes the k closest points to a data point in the training data and uses
their labels ( y_train ) to inform the predictions. If it's a classification task, it's the mode of all the labels, and if it's
a regression task, it's the mean. It's a lazy learner because the "fitted model" is not much more than the training
data and the parameters such as k and the list of classes (if it's classification). It doesn't do much till inference.
That's when it leverages the training data, tapping into it directly rather than extracting parameters, weights/biases,
or coefficients learned by the model as eager learners do.
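To make this concrete, here is a minimal sketch of the prediction step for classification (a simplification, not scikit-learn's actual implementation), assuming NumPy arrays of numeric features and integer labels:
import numpy as np

def knn_predict_one(x, X_arr, y_arr, k=7):
    dists = np.linalg.norm(X_arr - x, axis=1)         # Euclidean distance to every training point
    nearest_idx = np.argsort(dists)[:k]               # positions of the k closest points
    return np.bincount(y_arr[nearest_idx]).argmax()   # mode of the neighbors' labels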

Interpretation
kNN only has local interpretability because there's no fitted model, so you don't have global modular or global holistic interpretability. For classification tasks, you could attempt to get a sense of this using the decision
boundaries and regions we studied in Chapter 2, Key Concepts of Interpretability. Still, it's always based on local
instances.

To interpret a local point from our test dataset, we query the pandas dataframe using its index. We will be using
flight #721043:
print(X_test.loc[721043,:])

The preceding code outputs the following pandas series:


CRS_DEP_TIME 655.000000
DEP_TIME 1055.000000
DEP_DELAY 240.000000
TAXI_OUT 35.000000
WHEELS_OFF 1130.000000
CRS_ARR_TIME 914.000000
CRS_ELAPSED_TIME 259.000000
DISTANCE 1660.000000
WEATHER_DELAY 0.000000
NAS_DELAY 22.000000
SECURITY_DELAY 0.000000
LATE_AIRCRAFT_DELAY 221.000000
DEP_AFPH 90.800000
ARR_AFPH 40.434783
DEP_MONTH 10.000000
DEP_DOW 4.000000
DEP_RFPH 0.890196
ARR_RFPH 1.064073
ORIGIN_HUB 1.000000
DEST_HUB 0.000000
PCT_ELAPSED_TIME 1.084942
Name: 721043, dtype: float64

In the y_test_class labels for flight #721043, we can tell that it was delayed because this code outputs 1:
print(y_test_class[721043])

However, our kNN model predicted that it was not because this code outputs 0:
print(class_models['knn']['preds'][X_test.index.get_loc(721043)])

Please note that the predictions are output as a NumPy array, so we can't access the prediction for flight #721043
using its pandas index (721043). We have to use the sequential location of this index in the test dataset using
get_loc to retrieve it.

To find out why this was the case, we can use kneighbors on our model to find the 7 nearest neighbors of this point. To this end, we have to reshape our data because kneighbors will only accept it in the same shape found in the training set, which is (n, 21) where n is the number of observations (rows). In this case, n=1 because we only want the nearest neighbors for a single data point. And as you can tell from what was output by X_test.loc[721043,:] , the pandas series has a shape of (21,), so we have to reshape it to (1, 21):

print(class_models['knn']['fitted'].\
kneighbors(X_test.loc[721043,:].values.reshape(1,21), 7))

kneighbors outputs two arrays:


(array([[143.3160128 , 173.90740076, 192.66705727, 211.57109221,
243.57211853, 259.61593993, 259.77507391]]),
array([[105172, 571912, 73409, 89450, 77474, 705972, 706911]]))

The first is the distance of each of the seven closest training points to our test data point. And the second is the
location of these data points in the training data:
print(y_train_class.iloc[[105172, 571912, 73409, 89450, 77474,\
705972, 706911]])

The preceding code outputs the following pandas series:


3813 0
229062 1
283316 0
385831 0
581905 1
726784 1
179364 0
Name: CARRIER_DELAY, dtype: int64

We can tell that the prediction reflects the mode because the most common class in the seven nearest points was 0
(Not delayed). You can increase or decrease the k to see if this holds. Incidentally, when using binary
classification, it's recommended to choose an odd-numbered k so that there are no ties. Another important aspect is
the distance metric that was used to select the closest data points. You can easily find out which one it is using:
print(class_models['knn']['fitted'].effective_metric_)

The output is Euclidean, which makes sense for this example. After all, Euclidean is optimal for a real-valued
vector space because most features are continuous. You could also test alternative distance metrics such as
minkowski , seuclidean , or mahalanobis . When most of your features are binary and categorical, you have an integer-valued vector space, so your distances ought to be calculated with metrics suited for this space, such as hamming or canberra .
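For instance, switching metrics is just a hyperparameter change when fitting the model. The following sketch assumes the same k of 7 we queried with kneighbors above, and it is purely to illustrate the API; our features here are mostly continuous, so euclidean remains the better choice:
from sklearn.neighbors import KNeighborsClassifier

# hamming would suit a mostly binary/categorical feature space
knn_hamming_mdl = KNeighborsClassifier(n_neighbors=7, metric='hamming')
knn_hamming_mdl.fit(X_train, y_train_class)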

Feature importance

Feature importance is, after all, a global model interpretation method and kNN has a hyper-local nature, so there's
no way of deriving feature importance from a kNN model.

Naïve Bayes

Like GLMs, Naïve Bayes is a family of model classes with a model tailored to different statistical distributions. However, unlike GLMs' assumption that the target y has the chosen distribution, all Naïve Bayes models assume that your X features have this distribution. More importantly, they are based on Bayes' theorem of conditional probability, so they output a probability and are, therefore, exclusively classifiers. But they treat the probability of each feature impacting the model independently, which is a strong assumption. This is why they are called naïve. There's one for Bernoulli distributions called Bernoulli Naïve Bayes, one for multinomial called Multinomial Naïve Bayes, and, of course, one for Gaussian, which is the most common.

Gaussian Naïve Bayes

Bayes' theorem is defined by this formula:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

In other words, to find the probability of A happening given that B is true, you take the conditional probability of B given that A is true, times the probability of A occurring, divided by the probability of B. In the context of a machine learning classifier, this formula can be rewritten as follows:

$$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)}$$

This is because what we want is the probability of y given that X is true. But our X has more than one feature, so this can be expanded like this:

$$P(y \mid x_1, \dots, x_n) = \frac{P(x_1 \mid y)\,P(x_2 \mid y) \cdots P(x_n \mid y)\,P(y)}{P(x_1)\,P(x_2) \cdots P(x_n)}$$

To compute the predictions $\hat{y}$, we have to consider that we have to calculate and compare the probabilities for each $C_k$ class (the probability of a delay versus the probability of no delay) and choose the class with the highest probability:

$$\hat{y} = \underset{k \in \{1,\dots,K\}}{\arg\max}\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$$
Calculating the probability of each class P(y = Ck) (also known as the class prior) is relatively trivial. In fact, the fitted model has stored this in an attribute called class_prior_ :
print(class_models['naive_bayes']['fitted'].class_prior_)

This outputs the following:


array([0.93871674, 0.06128326])

Naturally, since delays caused by the carrier only occur about 6% of the time, the prior probability of this class is small.

Then the formula has a product of conditional probabilities that each feature belongs to a class, P(xi | y = Ck). Since this is binary, there's no need to calculate the probabilities for both classes because one is the complement of the other. Therefore, we can drop Ck and replace it with a 1, like this:

$$P(y{=}1 \mid x_1, \dots, x_n) \propto P(y{=}1) \prod_{i=1}^{n} P(x_i \mid y{=}1)$$

This is because what we are trying to predict is the probability of a delay. Also, P(xi | y = 1) has its own formula, which differs according to the assumed distribution of the model, in this case, Gaussian:

$$P(x_i \mid y{=}1) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\,\exp\!\left(-\frac{(x_i - \theta_i)^2}{2\sigma_i^2}\right)$$

This formula is called the probability density of the Gaussian distribution.

Interpretation and feature importance

So what are these sigmas (σ) and thetas (θ) in the formula? They are, respectively, the variance and mean of the xi feature when y=1 . The intuition behind this is that features have a different variance and mean in one class versus another, which can inform the classification. This is a binary classification task, but you could calculate σi and θi for both classes. Fortunately, the fitted model has this stored:

print(class_models['naive_bayes']['fitted'].sigma_)

There are two arrays output, the first one corresponding to the negative class and the second to the positive. The
arrays contain the sigmas (variance) for each of the 21 features given the class:
array([[2.50123026e+05, 2.61324730e+05, ..., 1.13475535e-02],
[2.60629652e+05, 2.96009867e+05, ..., 1.38936741e-02]])

You can also extract the thetas (means) from the model:
print(class_models['naive_bayes']['fitted'].theta_)

The preceding code also outputs two arrays, one for each class:
array([[1.30740577e+03, 1.31006271e+03, ..., 9.71131781e-01],
[1.41305545e+03, 1.48087887e+03, ..., 9.83974416e-01]])

These two arrays are all you need to debug and interpret Naïve Bayes results because you can use them to compute
the conditional probability that xi feature given a positive class P(xi | y). You could use this probability to rank the
features by importance on a global level or interpret a specific prediction, on a local level.
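For example, here is a minimal sketch of computing the per-feature Gaussian likelihoods P(xi | y = 1) for the same flight we inspected in the kNN section, using the stored parameters (attribute names as used above; newer scikit-learn versions may expose the variances as var_ instead of sigma_ ):
import numpy as np

nb_mdl = class_models['naive_bayes']['fitted']
x = X_test.loc[721043, :].values                   # same flight as in the kNN example
theta, var = nb_mdl.theta_[1], nb_mdl.sigma_[1]    # positive-class means and variances
likelihoods = np.exp(-((x - theta)**2) / (2*var)) / np.sqrt(2*np.pi*var)
print(likelihoods)                                 # one conditional density per feature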

Naïve Bayes is a fast algorithm with some good use cases, such as spam filtering and recommendation systems,
but the independence assumption hinders its performance for most situations. Speaking of performance, let's
discuss this topic in the context of interpretability.

Recognizing the trade-off between performance and interpretability


We have briefly touched on this topic before, but high performance often requires complexity, and complexity
inhibits interpretability. As studied in Chapter 2, Key Concepts of Interpretability, this complexity comes from
primarily three sources: non-linearity, non-monotonicity, and interactivity. If the model adds any complexity, it is
compounded by the number and nature of features in your dataset, which by itself is a source of complexity.

Special model properties

These special properties can help make a model more interpretable.

The key property: explainability


In Chapter 1, Interpretation, Interpretability, and Explainability; and Why Does It All Matter?, we discussed why
being able to look under the hood of the model and intuitively understand how all its moving parts derive its
predictions in a consistent manner is, mostly, what separates explainability from interpretability. This property is
also called transparency or translucency. A model can be interpretable without this, but only in the same way that we can interpret a person's decisions even though we can't understand what is going on "under the hood." This is often called post-hoc interpretability, and this is the kind of interpretability this book primarily focuses on, with a few
exceptions. That being said, we ought to recognize that if a model is understood by leveraging its mathematical
formula (grounded in statistical and probability theory), as we've done with linear regression and Naïve Bayes, or
by visualizing a human-interpretable structure, as with decision trees, or a set of rules as with RuleFit, it is much
more interpretable than machine learning model classes where none of this is practically possible. White-box
models will always have the upper hand in this regard, and as listed in Chapter 1, Interpretation, Interpretability,
and Explainability; and Why Does It All Matter? there are many use cases in which a white-box model is a must-
have. But even if you don't productionize white-box models, they can always serve a purpose in assisting with
interpretation, if data dimensionality allows. It is a key property because even if a model didn't comply with the other properties, as long as it had explainability, it would still be more interpretable than models without it.

The remedial property: regularization

In this chapter, we've learned that regularization tones down the complexity added by the introduction of too many
features, and this can make the model more interpretable, not to mention more performant. Some models
incorporate regularization into the training algorithm, such as RuleFit and gradient boosted trees; others have the
ability to integrate it, such as multi-layer perceptron, or linear regression, and some cannot include it, such as kNN.
Regularization comes in many forms. Decision trees have a method called pruning, which can help reduce
complexity by removing non-significant branches. Neural networks have a technique called dropout, which
randomly drops neural network nodes from layers during training. Regularization is a remedial property because it
can help even the least interpretable models lessen complexity and thus improve interpretability.
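For example, with scikit-learn you could prune a decision tree via minimal cost-complexity pruning; the following is a sketch where the ccp_alpha value and the max_depth of 7 are purely illustrative:
from sklearn.tree import DecisionTreeClassifier

pruned_dt_mdl = DecisionTreeClassifier(max_depth=7, ccp_alpha=0.001,
                                       random_state=rand)
pruned_dt_mdl.fit(X_train, y_train_class)
print(pruned_dt_mdl.get_n_leaves())   # typically far fewer leaves than an unpruned tree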

Assessing performance

By now, in this chapter, you have already assessed performance on all of the white-box models reviewed in the last
section as well as a few black-box models. Maybe you've already noticed that black-box models have topped most
metrics, and for most use cases, this is generally the case.

Figuring out which model classes are more interpretable is not an exact science, but the following table (Figure
3.17) is sorted by those models with the most desirable properties. That is, they don't introduce non-linearity, non-
monotonicity, and interactivity. Of course, explainability on its own is a property that is a game-changer,
regardless, and regularization can help. There are also cases in which it's hard to assess properties. For instance,
polynomial (linear) regression implements a linear model, but it fits nonlinear relationships, which is why it is
color-coded differently. As you will learn in Chapter 12, Monotonic Constraints and Model Tuning for
Interpretability, some libraries support adding monotonic constraints to gradient boosted trees and neural
networks, which means it's possible to make these monotonic. However, the black-box methods we used in this
chapter do not support monotonic constraints.

The task columns tell you whether they can be used for regression or classification. And the Performance Rank
columns show you how well these models ranked in RMSE (for regression) and ROC AUC (for classification),
where lower ranks are better. Please note that even though we have used only one metric to assess performance for
this chart for simplicity's sake, the discussion about performance should be more nuanced than that. Another thing
to note is that ridge regression did poorly, but this is because we used the wrong hyperparameters, as explained in
the previous section.
Figure 3.17 – A table assessing the interpretability and performance of several white-box and black-box models we have explored in this chapter

Because it's compliant with all five properties, it's easy to tell why linear regression is the gold standard for interpretability. Also, while recognizing that this is anecdotal evidence, it should be immediately apparent that
most of the best ranks are with black-box models. This is no accident! The math behind neural networks and
gradient boosted trees is brutally efficient in achieving the best metrics. Still, as the red dots suggest, they have all
the properties that make a model less interpretable, making their biggest strength (complexity) a potential
weakness.

This is precisely why black-box models are our primary interest in this book, although many of the methods you will learn can also be applied to white-box models. In Part 2, which comprises Chapters 4 through 9, we will learn model-
agnostic and deep-learning-specific methods that assist with interpretation. And in Part 3, which includes
Chapters 10 through 14, we will learn how to tune models and datasets to increase interpretability.

INTERPRETATION AND EXECUTION SPEED

Predictive performance is not the only kind of performance to watch out for. When we have discussed
performance so far in this book, we have not directly addressed the importance of execution speed (also
called computation time). Predictive performance is, generally, inversely proportional to both
interpretability and execution speed. Just as black-box models tend to predict better, white-box models are more interpretable and faster than black-box models, often not only in training but also in inference. This problem used to be a significant deterrent. Even though deep learning methods have
existed for over half a century, they only really took off a decade ago because of resource constraints! So
why is it still relevant? Because data scientists, data engineers, and machine learning engineers are
continually pushing the boundaries by increasing the complexity of their models, the size of datasets, and
the use of hyperparameter tuning to improve predictive performance. They thus require more resources
to train and possibly make them quick at inference. However, a model with slow inference is not practical for many use cases because it might not be cost-effective, or because the use case requires real-time inference that the model has too much latency to achieve. Therefore, there is a trade-off between predictive
performance and execution performance. And while AI researchers push the boundaries for model
interpretability, there will be cases where trade-offs between all three are considered: predictive
performance, execution speed performance, and interpretability (see Figure 3.18). Higher
interpretability, while retaining high predictive performance, might come with a significant loss in
execution speed performance. Such is the case for the glass-box models we review in the next section,
but who knows? Someday we might have our cake and eat it too!

Figure 3.18 – A table comparing white-box, black-box, and glass-box models, or at least what is known so far
about them

Discovering newer interpretable (glass-box) models


Recently, there have been significant efforts in both industry and academia to create new models that have enough complexity to find the sweet spot between underfitting and overfitting, known as the bias-variance trade-off, yet retain an adequate level of explainability.

Many models fit this description, but most of them are meant for specific use cases, haven't been properly tested yet, or haven't released a library or open-sourced their code. However, two general-purpose ones are already gaining
traction, which we will look at now.

Explainable Boosting Machine (EBM)


EBM is part of Microsoft's InterpretML framework, which includes many of the model-agnostic methods we will
use later in the book.

EBM leverages the GAMs we mentioned earlier, which are like linear models but look like this:

$$g(E[y]) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p)$$
Individual functions f1 through fp are fitted to each feature using spline functions. Then a link function g adapts the
GAM to perform different tasks such as classification or regression, or adjust predictions to different statistical
distributions. GAMs are white-box models, so what makes EBM a glass-box model? It incorporates bagging and
gradient boosting, which tend to make models more performant. The boosting is done one feature at a time using a
low learning rate so as not to confound them. It also finds practical pairwise interaction terms automatically, which improves performance while maintaining interpretability:

$$g(E[y]) = \beta_0 + \sum_{j} f_j(x_j) + \sum_{i \neq j} f_{ij}(x_i, x_j)$$
Once fitted, this formula is made up of complicated non-linear formulas, so a global holistic interpretation isn't
likely feasible. However, since the effects of each feature or pairwise interaction terms are additive, they are easily
separable, and global modular interpretation is entirely possible. Local interpretation is equally easy given that a
mathematical formula can assist in debugging any prediction.

One drawback is that EBM can be much slower to train than gradient boosted trees and neural networks because of the one-feature-at-a-time approach, the low learning rate (used so that feature order doesn't matter), and the spline fitting methods.
However, it is parallelizable, so in environments with ample resources and multiple cores or machines, it will be
much quicker. To not have you wait for results for an hour or two, it is best to create abbreviated versions of
X_train and X_test – that is, with fewer columns, representing only the eight features the white-box models found to be most important: DEP_DELAY , LATE_AIRCRAFT_DELAY , PCT_ELAPSED_TIME , WEATHER_DELAY , NAS_DELAY , SECURITY_DELAY , DISTANCE , and CRS_ELAPSED_TIME . These are placed in a feature_samp list, and then the X_train and X_test dataframes are subset to only include these features. We are setting the
sample2_size to 10%, but if you feel you have enough resources to handle it, adjust accordingly:
#Make new abbreviated versions of datasets
feature_samp = ['DEP_DELAY', 'LATE_AIRCRAFT_DELAY',\
'PCT_ELAPSED_TIME', 'DISTANCE', 'WEATHER_DELAY',\
'NAS_DELAY', 'SECURITY_DELAY', 'CRS_ELAPSED_TIME']
X_train_abbrev2 = X_train[feature_samp]
X_test_abbrev2 = X_test[feature_samp]
#For sampling among observations
np.random.seed(rand)
sample2_size = 0.1
sample2_idx = np.random.choice(X_train.shape[0],
math.ceil(X_train.shape[0]*sample2_size), replace=False)

To train your EBM, all you have to do is instantiate an ExplainableBoostingClassifier() and then fit your
model to your training data. Note that we are using sample2_idx to sample a portion of the data so that it takes less
time:
ebm_mdl = ExplainableBoostingClassifier()
ebm_mdl.fit(X_train_abbrev2.iloc[sample2_idx],
y_train_class.iloc[sample2_idx])

Global interpretation

Global interpretation is dead simple. It comes with an explain_global dashboard you can explore. It loads with
the feature importance plot first, and you can select individual features to graph what was learned from each one:
show(ebm_mdl.explain_global())

The preceding code generates a dashboard that looks like Figure 3.19:

Figure 3.19 – EBM's global interpretation dashboard

Local interpretation

Local interpretation uses a dashboard like global does except you choose specific predictions to interpret with
explain_local . In this case, we are selecting #76, which, as you can tell, was incorrectly predicted. But the
LIME-like plot we will study in Chapter 6, Local Model Agnostic Interpretation Methods, helps make sense of it:
ebm_lcl = ebm_mdl.explain_local(X_test_abbrev2.iloc[76:77],\
y_test_class[76:77], name='EBM')
show(ebm_lcl)

Similar to the global dashboard, the preceding code generates another one, depicted in Figure 3.20:

Figure 3.20 – EBM's local interpretation dashboard

Performance

In terms of performance, at least as measured with the ROC AUC, EBM is not far from what was achieved by the top two classification models, and we can only expect it to get better with 10 times more training and testing data!
ebm_perf = ROC(ebm_mdl.predict_proba).\
explain_perf(X_test_abbrev2.iloc[sample_idx],
y_test_class.iloc[sample_idx], name='EBM')
show(ebm_perf)

You can appreciate the performance dashboard produced by the preceding code in Figure 3.21. The performance
dashboard can also compare several models at a time since its explainers are model-agnostic. And there's even a
fourth dashboard that can be used for data exploration:

Figure 3.21 – One of EBM's performance dashboards



GAMI-Net

There's also a newer GAM-based method with similar properties to EBM but trained with neural networks. At the time of writing, this method has yet to gain commercial traction, but it yields good interpretability and performance.

As we previously discussed, interpretability is decreased by each additional feature, especially those that don't significantly impact model performance. In addition to too many features, interpretability is also hindered by the added complexity of non-linearities, non-monotonicity, and interactions. GAMI-Net tackles all these problems by first fitting non-linear subnetworks for each feature in a main effects network and then fitting a pairwise interaction network with subnetworks for each combination of features. The user provides the number of top interactions to keep, which are then fitted to the residuals of the main effects network. See Figure 18 for a diagram.

GAMI-Net has three interpretability constraints built-in:

Sparsity: Only the top features and interactions are kept.
Heredity: A pairwise interaction can only be included if at least one of its parent features is included.
Marginal clarity: Non-orthogonality in interactions is penalized to better approximate marginal clarity.

The GAMI-Net implementation can also enforce monotonic constraints, which we will cover in more detail in
Chapter 12, Monotonic Constraints and Model Tuning for Interpretability.

Before we start, we must create a dictionary called meta_info with details about each feature and the target, such as the type (continuous, categorical, or target) and the scaler used to scale each feature, since the library expects each feature to be scaled independently. All the features in the abbreviated dataset are continuous, so we can leverage a dictionary comprehension to do this easily.

Next, we will create copies of X_train_abbrev2 and X_test_abbrev2 , scale them, and store the scalers in the dictionary. Then, we append info about the target variable to the dictionary and, lastly, convert all the data to NumPy format.
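The following is a sketch of what this preparation could look like; the structure of meta_info (the 'type' , 'scaler' , and target entries) and the derived variable names ( X_train_gami and so on) are assumptions based on the library's examples, not the book's exact code:
from sklearn.preprocessing import MinMaxScaler

# one entry per (continuous) feature, built with a dictionary comprehension
meta_info = {col: {'type': 'continuous'} for col in X_train_abbrev2.columns}

X_train_abbrev2_s = X_train_abbrev2.copy()
X_test_abbrev2_s = X_test_abbrev2.copy()
for col in X_train_abbrev2.columns:
    scaler = MinMaxScaler()   # each feature is scaled independently
    X_train_abbrev2_s[[col]] = scaler.fit_transform(X_train_abbrev2[[col]])
    X_test_abbrev2_s[[col]] = scaler.transform(X_test_abbrev2[[col]])
    meta_info[col]['scaler'] = scaler

# append info about the target variable
meta_info['CARRIER_DELAY'] = {'type': 'target', 'values': [0, 1]}

# lastly, convert everything to NumPy format
X_train_gami = X_train_abbrev2_s.to_numpy().astype(np.float32)
X_test_gami = X_test_abbrev2_s.to_numpy().astype(np.float32)
y_train_gami = y_train_class.to_numpy().reshape(-1, 1)
y_test_gami = y_test_class.to_numpy().reshape(-1, 1)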

Now that we have a meta_info dictionary and the dataset is ready, we can initialize and fit GAMI-Net to the
training data. In addition to meta_info , it has a lot of parameters: interact_num defines how many top interactions it should consider, and task_type whether it's a classification or regression task. Note that GAMI-
Net trains three neural networks, so there are three epochs parameters to fill in ( main_effect_epochs ,
interaction_epochs , tuning_epochs ) but the learning rate ( lr_bp ) and early stopping thresholds
( early_stop_thres ) are entered as lists of three items corresponding to each one. You will also find lists for the architecture of the networks, where each item corresponds to the number of nodes per layer ( interact_arch ,
subnet_arch ). Furthermore, there are additional parameters for batch size, activation function, whether to enforce
heredity constraint, loss threshold to use for early stopping and what percentage of the training data to use for
validation ( val_ratio ). Finally, there are two optional parameters for monotonic constraints
( mono_increasing_list , mono_decreasing_list ) which we won't explain here.
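Here is a sketch of what the initialization and fit might look like; the hyperparameter values are illustrative, the import path is an assumption, and the exact signature should be verified against the gaminet package's documentation:
from gaminet import GAMINet

gami_mdl = GAMINet(meta_info=meta_info, interact_num=10,
                   task_type='Classification',
                   main_effect_epochs=500, interaction_epochs=500,
                   tuning_epochs=500,
                   lr_bp=[0.0001, 0.0001, 0.0001],
                   early_stop_thres=[50, 50, 50],
                   subnet_arch=[20, 20], interact_arch=[20, 20],
                   batch_size=200, heredity=True,
                   loss_threshold=0.01, val_ratio=0.2)
gami_mdl.fit(X_train_gami[sample2_idx], y_train_gami[sample2_idx])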
We can plot the training loss for each epoch across all three trainings with plot_trajectory . Then, with plot_regularization , we can plot the outcome of the regularization for both the main effects and interaction networks. Both plotting functions can save the image to a folder called "results" by default, unless you change the path with the folder parameter.
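A sketch of those calls follows; the summary_logs method and the gaminet.utils imports mirror the library's example notebooks and should be treated as assumptions:
from gaminet.utils import plot_trajectory, plot_regularization

data_dict_logs = gami_mdl.summary_logs(save_dict=False)   # training history
plot_trajectory(data_dict_logs, folder='results',
                name='loss_trajectory', save_png=True)
plot_regularization(data_dict_logs, folder='results',
                    name='regularization', save_png=True)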

The preceding snippet generates the plots in Figure 19.

Figure 19 tells the story of how the three stages sequentially reduce the loss while regularizing to keep as few features and interactions as possible.

Global explanations can be extracted into a dictionary with the global_explain function and then turned into a feature importance plot with feature_importance_visualize , as in the following snippet:
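The following is a sketch under the same assumptions about the library's helper functions:
from gaminet.utils import feature_importance_visualize

data_dict_global = gami_mdl.global_explain(save_dict=False)
feature_importance_visualize(data_dict_global, folder='results',
                             name='feature_importance', save_png=True)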

As you can tell by Figure 20, the most important feature is, by far, DEP_DELAY , and one interaction is among the top six in the plot. We can also use the global_visualize_density function to output partial dependence plots, which we will cover in Chapter 4, Global Model-Agnostic Interpretation Methods.
Let's examine an explanation for a single prediction using local_explain followed by local_visualize . We
are selecting test case #73.
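A sketch of those two calls follows, again assuming the example-notebook API and the X_test_gami / y_test_gami arrays defined earlier:
from gaminet.utils import local_visualize

data_dict_local = gami_mdl.local_explain(X_test_gami[[73]],
                                         y_test_gami[[73]], save_dict=False)
# depending on the version, you may need to index the returned structure
local_visualize(data_dict_local, folder='results', name='local_73',
                save_png=True)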

Figure 21 tells the story of how each feature weighs in on the outcome. Note that DEP_DELAY is over 50 but that there's an intercept that almost cancels it out. The intercept acts as a counterbalance; after all, the dataset is imbalanced toward it being less likely to be a CARRIER_DELAY . But then, all the subsequent features after the intercept are not enough to push the outcome positively.
To determine the predictive performance of the GAMI-Net model all we need to do is get the scores
( y_test_prob ) and predictions ( y_test_pred ) for the test dataset. And then use Scikit-learn's metric functions to
compute them.
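For instance, here is a sketch under the assumption that predict returns positive-class scores for a classification task:
y_test_prob = gami_mdl.predict(X_test_gami).flatten()
y_test_pred = np.where(y_test_prob > 0.5, 1, 0)
print('accuracy: %.3g, roc auc: %.3g, f1: %.3g' %
      (metrics.accuracy_score(y_test_class, y_test_pred),
       metrics.roc_auc_score(y_test_class, y_test_prob),
       metrics.f1_score(y_test_class, y_test_pred)))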
sr_mdl = SkopeRules(n_estimators=200, precision_min=0.2,\
recall_min=0.01, n_jobs=-1, random_state=rand,\
max_depth=7, feature_names=X_train_abbrev2.columns)
sr_mdl.fit(X_train_abbrev2.iloc[sample2_idx],\
y_train_class.iloc[sample2_idx])

In the following code, the probability of each flight being delayed is returned by score_top_rules , and this, in turn, can be used to create the predictions using np.where with the threshold set at 0.5:
sr_y_test_prob = sr_mdl.\
score_top_rules(X_test_abbrev2.iloc[sample_idx])
sr_y_test_pred = np.where(sr_y_test_prob > 0.5, 1, 0)

Global interpretation

The rules_ attribute has a list of tuples with each rule. We can count them as such:
print(len(sr_mdl.rules_))

As you can tell, 1,517 rules were generated, but because of the way the algorithm applies its precision and recall thresholds, not all rules end up being used. This makes inference slower. The rules are sorted by how well they perform. Let's look at the five highest-performing rules generated:
print(sr_mdl.rules_[0:5])

The preceding code prints the following:


[('DEP_DELAY > 39.5 and LATE_AIRCRAFT_DELAY <= 12.5 and WEATHER_DELAY <= 12.0 and NAS_DELAY <= 27
('DEP_DELAY > 39.5 and LATE_AIRCRAFT_DELAY <= 11.5 and WEATHER_DELAY <= 12.0 and NAS_DELAY <= 27
('DEP_DELAY > 39.5 and LATE_AIRCRAFT_DELAY <= 12.5 and WEATHER_DELAY <= 12.5 and NAS_DELAY <= 27
('DEP_DELAY > 39.5 and LATE_AIRCRAFT_DELAY <= 11.5 and WEATHER_DELAY <= 12.0 and NAS_DELAY <= 29
('DEP_DELAY > 39.5 and LATE_AIRCRAFT_DELAY <= 11.5 and WEATHER_DELAY <= 12.0 and NAS_DELAY <= 27

As you go down the list, you can start to understand what matters most to the model, since each rule is a singular IF statement that, if true, indicates the positive class.

Local interpretation

Let's examine one model-specific local prediction method – the prediction for the seventy-sixth flight not being
delayed even though the flight was delayed:
print('actual: %s, predicted: %s' %\
(y_test_class.iloc[76], sr_y_test_pred[76]))

The preceding code prints out the following:


actual: 1, predicted: 0

We can tell why by leveraging the decision function, which tells you the anomaly score for the input sample. This score is the weighted sum of the binary rules, where each weight is the precision of each rule. So, the lower the score, the more likely it is a positive match, and if it's null, it's a definite positive match:
print(sr_mdl.decision_function(X_test_abbrev2.iloc[76:77]))

The result is 18.23, which is not close to 0 or null.

Performance

The performance was not bad considering it was trained on 10% of the training data and evaluated on only 10% of the test data, especially the recall score, which placed among the top three:
print('accuracy: %.3g, recall: %.3g, roc auc: %.3g, f1: %.3g, mcc: %.3g' %\
(metrics.accuracy_score(y_test_class.iloc[sample_idx],\
sr_y_test_pred),
metrics.recall_score(y_test_class.iloc[sample_idx],\
sr_y_test_pred),
metrics.roc_auc_score(y_test_class.iloc[sample_idx],\
sr_y_test_prob),
metrics.f1_score(y_test_class.iloc[sample_idx], sr_y_test_pred),
metrics.matthews_corrcoef(y_test_class.iloc[sample_idx],\
sr_y_test_pred)))

The preceding code yields the following metrics:


accuracy: 0.969, recall: 0.981,
roc auc: 0.989, f1: 0.789, mcc: 0.787

Mission accomplished
The mission was to train models that could predict preventable delays with enough accuracy to be useful, and then,
to understand the factors that impacted these delays, according to these models, to improve OTP. The resulting
regression models all predicted delays, on average, well below the 15-minute threshold according to the RMSE.
And most of the classification models achieved an F1 score well above 50% – one of them reached 98.8%! We
also managed to find factors that impacted delays for all white-box models, some of which performed reasonably
well. So, it seems like it was a resounding success!

Don't celebrate just yet! Despite the high metrics, this mission was a failure. Through interpretation methods, we
realized that the models were accurate mostly for the wrong reasons. This realization helps underpin the mission-
critical lesson that a model can easily be right for the wrong reasons, so the question "why?" is not a question to
be asked only when it performs poorly but always. And using interpretation methods is how we ask that
question.

But if the mission failed, why is this section called Mission accomplished? Good question!

It turns out there was a secret mission. Hint: it's the title of this chapter. The point of it was to learn about common
interpretation challenges through the failure of the overt mission. In case you missed them, here are the
interpretation challenges we stumbled upon:

Traditional model interpretation methods only cover surface-level questions about your models. Note that we
had to resort to model-specific global interpretation methods to discover that the models were right for the
wrong reasons.
Assumptions can derail any machine learning project since this is information that you suppose without
evidence. Note that it is crucial to work closely with domain experts to inform decisions throughout the
machine learning workflow, but sometimes they can also mislead you. Ensure you check for inconsistencies
between the data and what you assume to be the truth about that data. Finding and correcting these problems
is at the heart of what interpretability is about.
Many model classes, even white-box models, have issues with computing feature importance consistently and
reliably.
Incorrect model tuning can lead to a model that performs well enough but is less interpretable. Note that a
regularized model overfits less but is also more interpretable. We will cover methods to address this challenge
in Chapter 12, Monotonic Constraints and Model Tuning for Interpretability. Feature selection and
engineering can also have the same effect, which you can read about in Chapter 10, Feature Selection and
Engineering for Interpretability.
There's a trade-off between predictive performance and interpretability. And this trade-off extends to
execution speed. For these reasons, this book primarily focuses on black-box models, which have the
predictive performance we want and a reasonable execution speed but could use some help on the
interpretability side.

If you learned about these challenges, then congratulations! Mission accomplished!

Summary
After reading this chapter, you should understand some traditional methods for interpretability and what their
limitations are. You learned about intrinsically interpretable models and how to both use them and interpret
them, for both regression and classification. You also studied the performance versus interpretability trade-off
and some models that attempt not to compromise in this trade-off. You also discovered many practical
interpretation challenges involving the roles of feature selection and engineering, hyperparameters, domain
experts, and execution speed. In the next chapter, we will learn more about different interpretation methods to
measure the effect of a feature on a model.

Dataset sources
United States Department of Transportation Bureau of Transportation Statistics. (2018). Airline On-Time
Performance Data. Originally retrieved from https://www.transtats.bts.gov.

Further reading
Friedman, J., & Popescu, B. (2008). Predictive Learning via Rule Ensembles. The Annals of Applied
Statistics, 2(3), 916-954. http://doi.org/10.1214/07-AOAS148
Hastie, T., R. Tibshirani, and M. Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and
Generalizations. Chapman & Hall/Crc Monographs on Statistics & Applied Probability. Taylor & Francis
Thomas, D.R., Hughes, E. & Zumbo, B.D. On Variable Importance in Linear Regression. Social Indicators
Research 45, 253–275 (1998). https://doi.org/10.1023/A:1006954016433
Nori, H., Jenkins, S., Koch, P., & Caruana, R. (2019). InterpretML: A unified framework for machine
learning interpretability. arXiv preprint https://arxiv.org/pdf/1909.09223.pdf
Hastie, T and Tibshirani, R. Generalized additive models: some applications. Journal of the American
Statistical Association, 82(398):371–386, 1987. http://doi.org/10.2307%2F2289439
5 Local Model-Agnostic Interpretation Methods
Join our book community on Discord
https://packt.link/EarlyAccessCommunity

In the previous two chapters, we dealt exclusively with global interpretation methods. This chapter will foray into
local interpretation methods, which are there to explain why a single prediction or a group of predictions was
made. It will cover how to leverage SHapley Additive exPlanations' (SHAP's) KernelExplainer and also,
another method called Local Interpretable Model-agnostic Explanations (LIME) for local interpretations. We
will also explore how to use these methods with both tabular and text data.

These are the main topics we are going to cover in this chapter:

Leveraging SHAP's KernelExplainer for local interpretations with SHAP values


Employing LIME
Using LIME for natural language processing (NLP)
Trying SHAP for NLP
Comparing SHAP with LIME

Technical requirements
This chapter's example uses the mldatasets , pandas , numpy , sklearn , nltk , lightgbm , rulefit ,
matplotlib , seaborn , shap , and lime libraries. Instructions on how to install all of these libraries are in the
preface of the book. The code for this chapter is located here: https://github.com/PacktPublishing/Interpretable-
Machine-Learning-with-Python/tree/master/Chapter06

The mission
Who doesn't love chocolate?! It's a global favorite, with around nine out of ten people loving it and about a billion
people eating it every day. One popular form in which it is consumed is as a chocolate bar. However, even
universally beloved ingredients can be used in ways that aren't universally appealing—so, chocolate bars can range
from the sublime to the mediocre, to downright unpleasant. Often, this is solely determined by the quality of the
cocoa or additional ingredients, and sometimes it becomes an acquired taste once it's combined with exotic flavors.

A French chocolate manufacturer who is obsessed with excellence has reached out to you. They have a problem.
All of their bars have been highly rated by critics, yet critics have very particular taste buds. And some bars they
love have inexplicably mediocre sales, but non-critics seem to like them in focus groups and tastings, so they are
puzzled why sales don't coincide with their market research. They have found a dataset of chocolate bars rated by
knowledgeable lovers of chocolate, and these ratings happen to coincide with their sales. To get an unbiased
opinion, they have sought your expertise.

As for the dataset, members of the Manhattan Chocolate Society have been meeting since 2007 for the sole
purpose of tasting and judging fine chocolate, to educate consumers and inspire chocolate makers to produce
higher-quality chocolate. Since then, they have compiled a dataset of over 2,200 chocolate bars, rated by their
members with the following scale:
4.0 - 5.00 = Outstanding
3.5 - 3.99 = Highly Recommended
3.0 - 3.49 = Recommended
2.0 - 2.99 = Disappointing
1.0 - 1.90 = Unpleasant

These ratings are derived from a rubric that factors in aroma, appearance, texture, flavor, aftertaste, and overall
opinion, and the bars rated are mostly darker chocolate bars since the aim is to appreciate the flavors of cacao. In
addition to the ratings, the Manhattan Chocolate Society dataset includes many characteristics, such as the country
where the cocoa bean was farmed, how many ingredients the bar has, whether it includes salt, and the words used
to describe it.

The goal is to understand why one of the chocolate manufacturers' bars is rated Outstanding yet sells poorly, while
another one, whose sales are impressive, is rated as Disappointing.

The approach
You have decided to use local model interpretation to explain why each bar is rated as it is. To that end, you will
prepare the dataset and then train classification models to predict if chocolate-bar ratings are above or equal to
Highly Recommended, because the client would like all their bars to fall above this threshold. You will need to
train two models: one for tabular data, and another NLP one for the words used to describe the chocolate bars. We
will employ support vector machines (SVMs) and Light Gradient Boosting Machine (LightGBM),
respectively, for these tasks. If you haven't used these black-box models, no worries—we will briefly explain them.
Once you train the models, then comes the fun part: leverage two local model-agnostic interpretation methods to
understand what makes a specific chocolate bar less than Highly Recommended or not. These methods are SHAP
and LIME, which when combined will provide a richer explanation to convey back to your client. Then, we will
compare both methods to understand their strengths and limitations.

The preparations
You will find the code for this example here:
https://github.com/PacktPublishing/Interpretable-Machine-Learning-with-
Python/blob/master/Chapter06/ChocoRatings.ipynb

Loading the libraries

To run this example, you need to install the following libraries:

mldatasets to load the dataset


pandas , numpy , and nltk to manipulate it
sklearn (scikit-learn) and lightgbm to split the data and fit the models
matplotlib , seaborn , shap , and lime to visualize the interpretations

You should load all of them first, as follows:


import math
import mldatasets
import pandas as pd
import numpy as np
import re
import nltk
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn import metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import lime
import lime.lime_tabular
from lime.lime_text import LimeTextExplainer

Understanding and preparing the data

We load the data into a dataframe we call chocolateratings_df , like this:


chocolateratings_df = mldatasets.load("chocolate-bar-ratings_v2")

There should be over 2,200 records and 18 columns. We can verify this was the case simply by inspecting the
contents of the dataframe, like this:
chocolateratings_df

The output shown here in Figure 5.1 corresponds to what we were expecting:

Figure 5.1 – Contents of chocolate-bar dataset

The data dictionary

The data dictionary comprises the following:

company : Categorical; the manufacturer of the chocolate bar (out of over 500 different ones)
company_location : Categorical; country of the manufacturer (66 different countries)
review_date : Continuous; year in which the bar was reviewed (from 2006 to 2020)
country_of_bean_origin : Categorical; country where the cocoa beans were harvested (62 different
countries)
cocoa_percent : Continuous; what percentage of the bar is cocoa
rating : Continuous; rating given by the Manhattan Chocolate Society (possible values: 1-5)
counts_of_ingredients : Continuous; amount of ingredients in the bar
cocoa_butter : Binary; was it made with cocoa butter?
vanilla : Binary; was it made with vanilla?
lecithin : Binary; was it made with lecithin?
salt : Binary; was it made with salt?
sugar : Binary; was it made with sugar?
sweetener_without_sugar : Binary; was it made with sweetener without sugar?
first_taste : Text; word(s) used to describe the first taste
second_taste : Text; word(s) used to describe the second taste
third_taste : Text; word(s) used to describe the third taste
fourth_taste : Text; word(s) used to describe the fourth taste

Now that we have taken a peek at the data, we can quickly prepare this and then work on the modeling and
interpretation!

Data preparation

The first thing we ought to do is set aside the text features so that we can process them separately. We can start by
creating a dataframe called tastes_df with them and then drop them from chocolateratings_df . We can then
take a look at tastes_df using head and tail , as illustrated in the following code snippet:
tastes_df = chocolateratings_df[['first_taste', 'second_taste',
'third_taste', 'fourth_taste']]
chocolateratings_df = chocolateratings_df.\
drop(['first_taste', 'second_taste', 'third_taste',\
'fourth_taste'], axis=1)
tastes_df

The preceding code produces the dataframe shown here in Figure 5.2:
Figure 5.2 – Tastes columns have quite a few null values

Now, let's categorically encode the categorical features. There are too many countries in company_location and
country_of_bean_origin , so let's establish a threshold. Say, if there are fewer than 3.333% (or 74 rows) for any
country, let's bucket it into an Other category and then encode the categories. We can easily do this with the
make_dummies_with_limits function and the process is shown again in the following code snippet:

chocolateratings_df =\
mldatasets.make_dummies_with_limits(chocolateratings_df,\
'company_location', 0.03333)
chocolateratings_df =\
mldatasets.make_dummies_with_limits(chocolateratings_df,\
'country_of_bean_origin', 0.03333)

Now, to process the contents of tastes_df , the following code replaces all the null values with empty strings,
then joins all the columns in tastes_df together, forming a single series. Then, it strips leading and trailing
whitespace. The code is illustrated in the following snippet:
tastes_s = tastes_df.replace(np.nan, '', regex=True).\
agg(' '.join, axis=1).str.strip()

And voilà! You can verify that the result is a pandas series ( tastes_s ) with (mostly) taste-related adjectives by printing it. As expected, this series is the same length as the chocolateratings_df dataframe, as illustrated in its output:
0 cocoa blackberry robust
1 cocoa vegetal savory
2 rich fatty bready
3 fruity melon roasty
4 vegetal nutty
...
2221 muted roasty accessible
2222 fatty mild nuts mild fruit
2223 fatty earthy cocoa
Length: 2224, dtype: object

But let's find out how many of its phrases are unique, with print(np.unique(tastes_s).shape) . Since the
output is (2178,) it means fewer than 50 phrases are duplicated, so tokenizing by phrases would be a bad idea.

There are many approaches you could take here, such as tokenizing by bi-grams (sequences of two words) or even
subwords (dividing words into logical parts). However, even though order matters slightly (because the first words
had to do with the first taste, and so on), our dataset is too small and had too many nulls (especially in
third taste and fourth taste ) to derive meaning from the order. This is why it was a good choice to
concatenate all the "tastes" together, thus removing their discernible division.

Another thing to note is that our words are (mostly) adjectives. We made a small effort to remove adverbs, but
there are still some nouns present, such as "fruit" and "nuts", versus adjectives such as "fruity" and "nutty". We
can't be sure if the chocolate connoisseurs who judged the bars meant something different by using "fruit" rather
than "fruity". However, if we were sure of this, we could have performed stemming or lemmatization to turn all
instances of "fruit", "fruity", and "fruitiness" to a consistent "fru" (stem) or "fruiti" (lemma). We won't concern
ourselves with this because many of our adjectives' variations are not as common in the phrases anyway.
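For illustration only (we don't apply this step here), such a transformation could look like the following sketch with NLTK; the lemmatizer may require downloading the wordnet corpus first:
from nltk.stem import PorterStemmer, WordNetLemmatizer
# nltk.download('wordnet')   # may be needed once for the lemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(w) for w in ['fruit', 'fruity', 'fruitiness']])
print([lemmatizer.lemmatize(w) for w in ['fruit', 'fruity', 'fruitiness']])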

Let's find out the most common words by first tokenizing them with word_tokenize and using FreqDist to
count their frequency. We can then place the resulting tastewords_fdist dictionary into a dataframe
( tastewords_df ). We can save only those words with more than 74 instances as a list ( commontastes_l ). The
code is illustrated in the following snippet:
tastewords_fdist = FreqDist(word for word in
word_tokenize(tastes_s.str.cat(sep=' ')))
tastewords_df = pd.DataFrame.from_dict(tastewords_fdist,\
orient='index').rename(columns={0:'freq'})
commontastes_l = tastewords_df[tastewords_df.freq > 74].\
index.to_list()
print(commontastes_l)

As you can tell from the following output for commontastes_l , the most common words are mostly different
(except for spice and spicy ):
['cocoa', 'rich', 'fatty', 'roasty', 'nutty', 'sweet', 'sandy', 'sour', 'intense', 'mild', 'fruit

Something we can do with this list to enhance our tabular dataset is to turn these common words into binary
features. In other words, there would be a column for each one of these "common tastes" ( commontastes_l ), and
if the "tastes" for the chocolate bar include it, the column would have a 1, otherwise a 0. Fortunately, we can easily
do this with two lines of code. First, we create a new column with our text-tastes series ( tastes_s ). Then, we use
the make_dummies_from_dict function we used in the last chapter to generate the dummy features by looking for
each "common taste" in the contents of our new column, as follows:
chocolateratings_df['tastes'] = tastes_s
chocolateratings_df =\
mldatasets.make_dummies_from_dict(chocolateratings_df,\
'tastes', commontastes_l)

Now that we are done with our feature engineering, we can use info() to examine our dataframe. The output has
all numeric non-null features except for company . There are over 500 companies, so categorical encoding of this
feature would be complicated and, because it would be advisable to bucket most companies as Other , it would
likely introduce bias toward the few companies that are most represented. Therefore, it's better to remove this
column altogether. The output is shown here:
RangeIndex: 2224 entries, 0 to 2223
Data columns (total 46 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company 2224 non-null object
1 review_date 2224 non-null int64
2 cocoa_percent 2224 non-null float64
: : : : :
43 tastes_berry 2224 non-null int64
44 tastes_vanilla 2224 non-null int64
45 tastes_creamy 2224 non-null int64
dtypes: float64(2), int64(30), object(1), uint8(13)

Our last step to prepare the data for modeling starts with initializing rand , a constant to serve as our "random
state" throughout this exercise. Then, we define y as the rating column converted to 1s if greater than or equal
to 3.5, and 0 otherwise. X is everything else (excluding company ). Then, we split X and y into train and test
datasets with train_test_split , as illustrated in the following code snippet:
rand = 9
y = chocolateratings_df['rating'].\
apply(lambda x: 1 if x >= 3.5 else 0)
X = chocolateratings_df.drop(['rating','company'], axis=1).copy()
X_train, X_test, y_train, y_test = train_test_split(X, y,\
test_size=0.33, random_state=rand)

In addition to the tabular test and train datasets, for our NLP models we will need text-only feature datasets that are
consistent with our train_test_split so that we can use the same y labels. To this end, we can do this by
subsetting our tastes series ( tastes_s ), using the index of our X_train and X_test sets to yield NLP specific
versions of the series, as follows:
X_train_nlp = tastes_s[X_train.index]
X_test_nlp = tastes_s[X_test.index]

OK! We are all set now. Let's start modeling and interpreting our models!
Leveraging SHAP's KernelExplainer for local interpretations with SHAP
values
For this section, and for subsequent use, we will train a Support Vector Classifier (SVC) model first.

Training a C-SVC model

SVM is a family of model classes that operate in high-dimensional space to find an optimal hyperplane, where they attempt to separate the classes with the maximum margin between them. Support vectors are the points closest to the decision boundary (the dividing hyperplane) that would change it if they were removed. To find the best hyperplane, they use a cost function called hinge loss and a computationally cheap method for operating in high-dimensional space called the kernel trick, and even though a hyperplane suggests linear separability, SVM is not always limited to a linear kernel.

The scikit-learn implementation we will use is called C-SVC. SVC uses an L2 regularization parameter called C
and, by default, uses a kernel called the radial basis function (RBF), which is decidedly non-linear. For an RBF, a
gamma hyperparameter defines the radius of influence of each training example in the kernel, but in an inversely
proportional fashion. Hence, a low value increases the radius, while a high value decreases it.
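To make that inverse relationship concrete, here is an illustrative sketch of the RBF kernel (not part of the chapter's code; it simply mirrors the standard formula):
import numpy as np   # already loaded earlier in the chapter

def rbf_kernel(x, z, gamma):
    # K(x, z) = exp(-gamma * ||x - z||^2): a larger gamma makes a training
    # example's influence decay faster with distance (a smaller radius)
    return np.exp(-gamma * np.sum((x - z) ** 2))

print(rbf_kernel(np.array([1., 0.]), np.array([0., 1.]), gamma=0.1))   # ~0.82
print(rbf_kernel(np.array([1., 0.]), np.array([0., 1.]), gamma=10.0))  # ~2e-09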

The SVM family includes several variations for classification, as well as regression through Support Vector Regression (SVR). The most significant advantage of SVM models is that they tend to work effectively and efficiently when there are many features compared to the observations, and even when the features outnumber the observations! They also tend to find latent non-linear relationships in the data without overfitting or becoming unstable. However, SVM models don't scale well to larger datasets, and their hyperparameters are hard to tune.

Since we will use seaborn plot styling, which is activated with set() , for some of this chapter's plots, we will
first save the original matplotlib settings ( rcParams ) so that we can restore them later. One thing to note about
SVC is that it doesn't natively produce probabilities because it's a margin-based method rather than a probabilistic one. However, if probability=True , the scikit-learn implementation uses cross-validation and then fits a logistic regression model to the SVC's scores to produce the probabilities. We are also using gamma='auto' , which means it is set to 1/# features, so 1/44 in our case. As
always, it is recommended to set your random_state parameter for reproducibility. Once we fit the model to the
training data, we can use evaluate_class_mdl to evaluate our model's predictive performance, as illustrated in
the following code snippet:
svm_mdl = svm.SVC(probability=True, gamma='auto', random_state=rand)
fitted_svm_mdl = svm_mdl.fit(X_train, y_train)
y_train_svc_pred, y_test_svc_prob, y_test_svc_pred =\
mldatasets.evaluate_class_mdl(fitted_svm_mdl, X_train,\
X_test, y_train, y_test)

The preceding code produces the output shown here in Figure 5.3:
Figure 5.3 – Predictive performance of our SVC model

The performance achieved (see Figure 5.3) is not bad, considering this is a small, imbalanced dataset in an already challenging domain for machine learning models: user ratings. In any case, the Receiver Operating Characteristic (ROC) curve sits above the dotted coin-toss line, so the Area Under the Curve (AUC) is well above 0.5, and the Matthews correlation coefficient (MCC) is safely above 0. More
importantly, precision is substantially higher than recall, and this is very good given the hypothetical cost of
misclassifying a lousy chocolate bar as Highly Recommended. We favor precision over recall because we would
prefer to have fewer false positives than false negatives.

Computing SHAP values using KernelExplainer

Given how computationally intensive calculating SHAP values by brute force can be, the SHAP library takes
many statistically valid shortcuts. As we learned in Chapter 4, Global Model-Agnostic Interpretation Methods,
these shortcuts range from leveraging a decision tree's structure ( TreeExplainer ) to the difference between a neural network's activations and a baseline ( DeepExplainer ), to a neural network's gradients ( GradientExplainer ).
These shortcuts make the explainers significantly less model-agnostic since they are limited to a family of model
classes. However, there is a truly model-agnostic explainer in SHAP, called the KernelExplainer .
KernelExplainer has two shortcuts: it samples a subset of all feature permutations for coalitions and uses a
weighting scheme according to the size of the coalition to compute SHAP values. The first shortcut is a
recommended technique to reduce computation time. The second one is drawn from LIME's weighting scheme,
which we will cover next in this chapter; the authors of SHAP did this so that it remains consistent with Shapley values.
However, for "missing" features in the coalition, it randomly samples from the features' values in a background
training dataset, which violates the dummy property of Shapley values. More importantly, as with permutation
feature importance, if there's multicollinearity, it puts too much weight on unlikely instances. Despite this near-fatal flaw, KernelExplainer has all the other benefits of Shapley values, which is one of its main advantages over LIME.
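For reference, the coalition weighting mentioned above can be sketched in a few lines. This follows the Shapley kernel described in the SHAP paper listed under Further reading, and is illustrative rather than the chapter's own code:
from math import comb   # Python 3.8+

def shapley_kernel_weight(M, z):
    # Weight for a coalition with z of the M features present: coalitions
    # with very few or very many features present get the largest weights
    return (M - 1) / (comb(M, z) * z * (M - z))

print(shapley_kernel_weight(M=44, z=1))   # ~0.0227
print(shapley_kernel_weight(M=44, z=22))  # vanishingly small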

Before we engage with the KernelExplainer , it's important to note that for classification models, it yields a list of
multiple SHAP values. You access these for each class with an index. Confusion may arise if this index is not in
the order you expect because it's in the order provided by the model. So, it is essential to make sure of the order of
the classes in your model by running print(svm_mdl.classes_) .

The output array([0, 1]) tells you that Not Highly Recommended has an index of 0, as you would expect, and
Highly Recommended has an index of 1. We are interested in the SHAP values for the latter because this is what
we are trying to predict.

KernelExplainer takes a predict function for a model ( fitted_svm_mdl.predict_proba ) and some background
training data ( X_train_summary ). The SHAP library strongly suggests additional measures to minimize computation.
One of these is using k-means to summarize the background training data instead of using it whole. Another
method could be using a sample of the training data. In this case, we opted for k-means clustering into 10
centroids. Once we have initialized our explainer, we can use samples of our test dataset ( nsamples=200 ) to come
up with the SHAP values. It uses L1 regularization ( l1_reg ) during the fitting process. What we are telling it here
is to regularize to a point where it only has 20 relevant features. Lastly, we can use a summary_plot to plot our
SHAP values for class 1. The code is illustrated in the following snippet:
np.random.seed(rand)
X_train_summary = shap.kmeans(X_train, 10)
shap_svm_explainer =\
shap.KernelExplainer(fitted_svm_mdl.predict_proba,
X_train_summary)
shap_svm_values_test = shap_svm_explainer.shap_values(X_test,
nsamples=200, l1_reg="num_features(20)")
shap.summary_plot(shap_svm_values_test[1], X_test, plot_type="dot")

The preceding code produces the output shown in Figure 5.4. Even though the point of this chapter is local model
interpretation, it's important to start with the global form of this to make sure outcomes are intuitive. If they aren't,
perhaps something is amiss.
Figure 5.4 – Global model interpretation with SHAP using a summary plot

In Figure 5.4, we can tell that the highest (red) cocoa percentages ( cocoa_percent ) tend to correlate with a
decrease in the likelihood of Highly Recommended, but the middle values (purple) tend to increase it. This finding
makes intuitive sense because the darkest chocolates are more of an acquired taste than less-dark chocolates. The
low values (blue) are scattered throughout, so they show no trend, but this could be because there aren't many of them. On the other hand, review_date suggests that bars reviewed in earlier years were somewhat more likely to be Highly Recommended, but there are significant shades of red and purple on both sides of 0, so it's hard to identify a clear trend. A dependence plot, such as those used in Chapter 4, Global Model-Agnostic Interpretation Methods, would be better for this purpose. Binary features, however, make it very easy to visualize how their high and low values (ones and zeros) impact the model.
For instance, we can tell that the presence of cocoa, creamy, rich, and berry tastes increases the likelihood of the
chocolate being recommended, while sweet, earthy, sour, and fatty tastes do the opposite. Likewise, the odds for
Highly Recommended decrease if the chocolate was manufactured in the US! Sorry, US.

Local interpretation for a group of predictions using decision plots


For local interpretation, you don't have to visualize one point at a time—you can instead interpret several at a time.
The key is providing some context to compare the points adequately, and there can't be so many that you can't
distinguish them. Usually, you would find outliers or only those that meet specific criteria. For this exercise, we
will select only those bars that were produced by your client, as follows:
sample_test_idx = X_test.index.\
get_indexer_for([5,6,7,18,19,21,24,25,27])

One great thing about Shapley is its additivity property, which can be easily demonstrated. If you add all the SHAP
values to the expected value used to compute them, you get a prediction. Of course, this is a classification problem,
so the prediction is a probability; so, to get a Boolean array instead, we have to check if the probability is greater
than 0.5. We can check if this Boolean array matches our model's test dataset predictions ( y_test_svc_pred ) by
running the following code:
expected_value = shap_svm_explainer.expected_value[1]
y_test_shap_pred =\
(shap_svm_values_test[1].sum(1) + expected_value) > 0.5
print(np.array_equal(y_test_shap_pred, y_test_svc_pred))

It should, and it does: the output is True .

SHAP's decision plot comes with a highlight feature that we can use to make false negatives ( FN ) stand out. Now,
let's figure out which of our sample observations are FN , as follows:
FN = (~y_test_shap_pred[sample_test_idx]) &\
(y_test.iloc[sample_test_idx] == 1).to_numpy()

We can now quickly reset our plotting style back to the default matplotlib style, and plot a decision_plot . It
takes the expected_value , the SHAP values, and actual values of those items we wish to plot. Optionally, we can
provide a Boolean array of the items we want to highlight, with dotted lines—in this case, the false negatives ( FN ),
as illustrated in the following code snippet:
shap.decision_plot(expected_value,\
shap_svm_values_test[1][sample_test_idx],\
X_test.iloc[sample_test_idx], highlight=FN)

The plot produced in Figure 5.5 has a single color-coded line for each observation.
Figure 5.5 – Local model interpretation with SHAP for a sample of predictions, highlighting false negatives

The color of each line represents not the value of any feature, but the model output. Since we used
predict_proba in KernelExplainer , this is a probability; otherwise, it would have displayed SHAP values. The value each line has when it strikes the top x axis is the predicted value. The features are sorted in terms of
importance but only among the observations plotted, and you can tell that the lines increase and decrease
horizontally depending on each feature. How much they vary and toward which direction depends on the feature's
contribution to the outcome. The gray line represents the class's expected value, which is like the intercept in a
linear model. In fact, similarly, all lines start at this value, making it best to read the plot from bottom to top.

You can tell that there are three false negatives plotted in Figure 5.5 because they have dotted lines. Using this
plot, we can easily visualize which features made them veer toward the left the most because this is what made
them negative predictions. For instance, we know that the leftmost false negative was to the right of the expected value line until lecithin , and then kept decreasing until company_location_France ; review_date increased its likelihood of Highly Recommended, but it wasn't enough. You can tell that country_of_bean_origin_Other decreased the likelihood of two of the misclassifications. This decision could be
unfair because the country could be one of over 50 countries that didn't get their own feature. Quite possibly,
there's a lot of variation between the beans of these countries grouped together.

Decision plots can also isolate a single observation. When they do this, they print the value of each feature next to the dotted line. Let's plot one for a true positive from the same company (observation #696), as follows:
shap.decision_plot(expected_value, shap_svm_values_test[1][696],\
X_test.iloc[696], highlight=0)

Figure 5.6 here was outputted by the preceding code:

Figure 5.6 – Local model interpretation with SHAP for a single true positive in the sample of predictions

In Figure 5.6, you can see that lecithin and counts_of_ingredients decreased the Highly Recommended
likelihood to a point where it could have jeopardized it. Fortunately, all features above those veered the line
decidedly rightward because company_location_France=1 , cocoa_percent=70 , and tastes_berry=1 are all
favorable.
Local interpretation for a single prediction at a time using a force plot

Your client, the chocolate manufacturer, has two bars they want you to compare. Bar #5 is Outstanding and #24 is
Disappointing. They are both in your test dataset. One way of comparing them is to place their values side by side
in a dataframe to understand how exactly they differ. We will concatenate the rating, the actual label y , and the
y_pred predicted label to these observations' values, as follows:

eval_idxs = (X_test.index==5) | (X_test.index==24)


X_test_eval = X_test[eval_idxs]
eval_compare_df = pd.concat([\
chocolateratings_df.iloc[X_test[eval_idxs].index].rating,\
pd.DataFrame({'y':y_test[eval_idxs]}, index=[5,24]),\
pd.DataFrame({'y_pred':y_test_svc_pred[eval_idxs]},\
index=[24,5]), X_test_eval], axis=1).transpose()
eval_compare_df

The preceding code produces the dataframe shown in Figure 5.7.

Figure 5.7 – Observations #5 and #24 side by side, with feature differences highlighted in yellow
With this dataframe, you can confirm that they aren't misclassifications because y=y_pred . A misclassification
could make model interpretations unreliable to understand why people tend to like one chocolate bar more than
another. Then, you can examine the features to spot the differences—for instance, you can tell that the
review_date is 2 years apart. Also, the beans for the Outstanding bar were from Venezuela, and the
Disappointing beans came from another, lesser-represented country. The Outstanding one had a berry taste, and
the Disappointing one was earthy.

The force plot can tell us a complete story of what weighed in the model's decisions (and, presumably, the
reviewers'), and gives us clues as to what consumers might prefer. Plotting a force_plot requires the expected
value for the class of your interest ( expected_value ), the SHAP values for the observation of your interest, and
this observation's actual values. We will start with observation #5, as illustrated in the following code snippet:
shap.force_plot(expected_value,\
shap_svm_values_test[1][X_test.index==5],\
X_test[X_test.index==5], matplotlib=True)

The preceding code produces the plot shown in Figure 5.8. This force plot depicts how much review_date ,
cocoa_percent , and tastes_berry weigh in the prediction, while the only feature that seems to be weighing in
the opposite direction is counts_of_ingredients .

Figure 5.8 – Force plot for observation #5 (Outstanding)

Let's compare it with a force plot of observation #24, as follows:


shap.force_plot(expected_value,\
shap_svm_values_test[1][X_test.index==24],\
X_test[X_test.index==24], matplotlib=True)

The preceding code produces the plot shown in Figure 5.9. We can easily tell that tastes_earthy and
country_of_bean_origin_Other are considered highly negative attributes by our model. The outcome could be mostly explained by the difference between one chocolate tasting of "berry" and the other of "earthy". Despite our findings, the
beans' origin country needs further investigation. After all, it is possible that the actual country of origin doesn't
correlate with poor ratings.
Figure 5.9 – Force plot for observation #24 (Disappointing)

In this section, we covered the KernelExplainer , which uses some tricks it learned from LIME. But what is
LIME? We will find that out next!

Employing LIME
Until now, the model-agnostic interpretation methods we've covered attempt to reconcile the totality of outputs of a
model with its inputs. For these methods to get a good idea of how and why X becomes y_pred , they need some
data first. Then, they perform simulations with this data, pushing variations of it in and evaluating what comes out
of the model. Sometimes, they even leverage a global surrogate to connect the dots. By using what they learned in
this process, they yield importances, scores, rules, or values that quantify a feature's impact, interactions, or
decisions on a global level. For many methods such as SHAP, these can be observed locally too. However, even
when it can be observed locally, what was quantified globally may not apply locally. For this reason, there should
be another approach that quantifies the local effects of features solely for local interpretation—one such as LIME!

What is LIME?

LIME trains local surrogates to explain a single prediction. To this end, it starts by asking you which data point
you want to interpret. You also provide it with your black-box model and a sample dataset. It then makes
predictions on a perturbed version of the dataset with the model, creating a scheme whereby it samples and weighs
points higher if they are closer to your chosen data point. This area around your point is called a neighborhood.
Then, using the sampled points and black-box predictions in this neighborhood, it trains a weighted intrinsically
interpretable surrogate model. Lastly, it interprets the surrogate model.

There are lots of keywords to unpack here so let's define them, as follows:

Chosen data point: LIME calls the data point, row, or observation you want to interpret an instance. It's just
another word for this concept.
Perturbation: LIME simulates new samples by perturbing each feature, drawing from the training dataset's distribution for categorical features and from a normal distribution for continuous features.
Weighting scheme: LIME uses an exponential smoothing kernel to both define the neighborhood radius and
determine how to weigh the points farthest versus those closest.
Closer: LIME uses Euclidean distance for tabular and image data, and cosine similarity for text. This is hard
to imagine in high-dimensional feature spaces, but you can calculate the distance between points for any
number of dimensions and find which points are closest to the one of interest.
Intrinsically interpretable surrogate model: LIME uses a sparse linear model with weighted ridge
regularization. However, it could use any intrinsically interpretable model as long as the data points can be
weighted. The idea behind this is twofold. It needs a model that can yield reliable intrinsic parameters such as
coefficients that tell it how much each feature impacts the prediction. It also needs to consider data points
closest to the chosen point more because these are more relevant.
Much like with k-Nearest Neighbors (k-NN), the intuition behind LIME is that points in a neighborhood have
commonality because you could expect points close to each other to have similar, if not the same, labels. There are
decision boundaries for classifiers, so this could be a very naive assumption to make when close points are divided
by one.

Similar to another model class in the Nearest Neighbors family, Radius Nearest Neighbors, LIME factors in
distance along a radius and weighs points accordingly, although it does this exponentially. However, LIME is not a
model class but an interpretation method, so the similarities stop there. Instead of "voting" for predictions among
neighbors, it fits a weighted surrogate sparse linear model because it assumes that every complex model is linear
locally, and because it's not a model class, the predictions the surrogate model makes don't matter. In fact, the
surrogate model doesn't even have to fit the data like a glove because all you need from it is the coefficients. Of
course, that being said, it is best if it fits well so that there is higher fidelity in the interpretation.

LIME works for tabular, image, and text data and generally has high local fidelity, meaning that it can approximate
the model predictions quite well on a local level. However, this is contingent on the neighborhood being defined
correctly, which stems from choosing the right kernel width and the assumption of local linearity holding true.
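To make the weighting scheme and the kernel width less abstract, here is a minimal sketch that mirrors the exponential kernel the lime library uses by default (treat the default formula and width as assumptions to verify against your installed version):
import numpy as np   # already loaded earlier in the chapter

def lime_weight(d, kernel_width):
    # Distances (d) from the chosen instance become sample weights, so
    # nearby perturbed points count much more than distant ones
    return np.sqrt(np.exp(-(d ** 2) / kernel_width ** 2))

kernel_width = np.sqrt(44) * 0.75   # default: sqrt(#features) * 0.75
print(lime_weight(np.array([0., 1., 5., 20.]), kernel_width))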

Local interpretation for a single prediction at a time using LimeTabularExplainer

To explain a single prediction, you first instantiate a LimeTabularExplainer by providing it with your sample
dataset in a NumPy 2D array ( X_test.values ), a list with the names of the features ( X_test.columns ), a list
with the indices of the categorical features (only the first three features aren't categorical), and the class names.
Even though only the sample dataset is required, it is recommended that you provide names for your features and
classes so that the interpretation makes sense. For tabular data, telling LIME which features are categorical
( categorical_features ) is important because it treats categorical features differently from continuous ones, and
not specifying this could potentially make for a poor-fitting local surrogate. Another parameter that can greatly
impact the local surrogate is kernel_width . This defines the size of the neighborhood, thus answering the
question of what is considered local. It has a default value, which may or may not yield interpretations that make
sense for your instance. You could tune this parameter on an instance-by-instance basis to optimize your
explanations. The code can be seen in the following snippet:
lime_svm_explainer =\
lime.lime_tabular.LimeTabularExplainer(X_test.values,\
feature_names=X_test.columns,\
categorical_features=list(range(3,44)),\
class_names=['Not Highly Recomm.', 'Highly Recomm.'])

With the instantiated explainer, you can now use explain_instance to fit a local surrogate model to observation
#5. We also will use our model's classifier function ( predict_proba ) and limit our number of features to eight
( num_features=8 ). We can take the "explanation" returned and immediately visualize it with show_in_notebook .
At the same time, the predict_proba parameter makes sure it also includes a plot to show which class is the most
probable, according to the local surrogate model. The code is illustrated in the following snippet:
lime_svm_explainer.\
explain_instance(X_test[X_test.index==5].values[0],\
fitted_svm_mdl.predict_proba,\
num_features=8).show_in_notebook(predict_proba=True)

The preceding code provides the output shown in Figure 5.10. According to the local surrogate, a cocoa_percent
value smaller or equal to 70 is a favorable attribute, as is the berry taste. A lack of sour, sweet, and molasses tastes
also weighs in favorably in this model. However, a lack of rich, creamy, and cocoa tastes does the opposite, but not
enough to push the scales toward Not Highly Recommended.
Figure 5.10 – LIME tabular explanation for observation #5 (Outstanding)

With a small adjustment to the code that produced Figure 5.10, we can produce the same plot but for observation
#24, as follows:
lime_svm_explainer.\
explain_instance(X_test[X_test.index==24].values[0],\
fitted_svm_mdl.predict_proba,\
num_features=8).\
show_in_notebook(predict_proba=True)

Here, in Figure 5.11, we can clearly see why the local surrogate believes that observation #24 is Not Highly
Recommended:

Figure 5.11 – LIME tabular explanation for observation #24 (Disappointing)

Once you compare the explanation of #24 (Figure 5.11) with that of #5 (Figure 5.10), the problems become
evident. A single feature, tastes_berry , is what differentiates both explanations. Of course, we have limited it to
the top eight features, so there's probably much more to it. However, you would expect the top eight features to
include the ones that make the most difference.
According to SHAP, tastes_earthy=1 is what best explains the disappointing nature of the #24 chocolate bar, so LIME's explanation appears to be counterintuitive. So, what happened? It turns out that observations #5 and #24
are relatively similar and, thus, in the same neighborhood. This neighborhood also includes many chocolate bars
with berry tastes, and very few with earthy ones. However, there are not enough earthy ones to consider it a salient
feature, so it attributes the difference between Highly Recommended and Not Highly Recommended to other
features that seem to differentiate more often, at least locally. The reason for this is twofold: the local
neighborhood could be too small, and linear models, given their simplicity, are on the bias end of a bias-variance
trade-off. This bias is only exacerbated by the fact that some features such as tastes_berry can appear relatively
more often than tastes_earthy . There's an approach we can use to fix this, and we'll cover this in the next
section.

Using LIME for NLP


At the beginning of the chapter, we set aside training and test datasets with the cleaned-up contents of all the
"tastes" columns for NLP. We can take a peek at the test dataset for NLP, as follows:
print(X_test_nlp)

This outputs the following:


1194 roasty nutty rich
77 roasty oddly sweet marshmallow
121 balanced cherry choco
411 sweet floral yogurt
1259 creamy burnt nuts woody
...
327 sweet mild molasses bland
1832 intense fruity mild sour
464 roasty sour milk note
2013 nutty fruit sour floral
1190 rich roasty nutty smoke
Length: 734, dtype: object

No machine learning model can ingest the data as text, so we need to turn it into a numerical format—in other
words, vectorize it. There are many techniques we can use to do this. In our case, we are not interested in the
position of words in each phrase, nor the semantics. However, we are interested in their relative occurrence—after
all, that was an issue for us in the last section.

For these reasons, Term Frequency-Inverse Document Frequency (TF-IDF) is the ideal method because it's
meant to evaluate how often a term (each word) appears in a document (each phrase). However, it's weighted
according to its frequency in the entire corpus (all phrases). We can easily vectorize our datasets using the TF-IDF
method with TfidfVectorizer from scikit-learn. However, note that the TF-IDF scores are fitted to the training dataset only so that the transformed train and test datasets have consistent scoring for each term. Have a look at the following code snippet:
vectorizer = TfidfVectorizer(lowercase=False)
X_train_nlp_fit = vectorizer.fit_transform(X_train_nlp)
X_test_nlp_fit = vectorizer.transform(X_test_nlp)
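As an aside, the weighting scheme is easy to see on a tiny, hypothetical corpus that is not part of the chapter's data. A term that appears in every phrase receives a lower weight than rarer terms, as this minimal sketch shows:
from sklearn.feature_extraction.text import TfidfVectorizer  # already imported

toy_corpus = ['sweet berry', 'sweet earthy', 'sweet creamy berry']
toy_vectorizer = TfidfVectorizer(lowercase=False)
toy_fit = toy_vectorizer.fit_transform(toy_corpus)
# For the first phrase, "sweet" (present in every phrase) scores lower
# than "berry" (present in only two of the three phrases)
print(dict(zip(toy_vectorizer.get_feature_names(),
               toy_fit.toarray()[0].round(2))))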

To get an idea of what the TF-IDF score looks like, we can place all the feature names in one column of a
dataframe, and their respective scores for a single observation in another. Note that since the vectorizer produces a
scipy sparse matrix, we have to convert it into a NumPy matrix with todense() and then a NumPy array with
asarray() . We can sort this dataframe in descending order by TF-IDF scores. The code is shown in the following
snippet:
pd.DataFrame({'taste':vectorizer.get_feature_names(),\
'tf-idf': np.asarray(X_test_nlp_fit[X_test_nlp.index==5].\
todense())[0]}).\
sort_values(by='tf-idf', ascending=False)

The preceding code produces the output shown here in Figure 5.12:
Figure 5.12 – The TF-IDF scores for words present in observation #5

As you can tell from Figure 5.12, the TF-IDF scores are normalized values between 0 and 1, and the terms most common in the corpus have a lower value. Interestingly enough, we realize that observation #5 in our tabular
dataset had berry=1 because of raspberry. The categorical encoding method we used searched occurrences of
berry regardless of whether it matched an entire word or not. This isn't a problem because raspberry is a kind of
berry, and raspberry wasn't one of our common tastes with its own binary column.
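A quick way to see that substring behavior (purely an illustration, not the helper's actual implementation) is the following:
print('berry' in 'oily nut caramel raspberry')   # True: matches inside "raspberry"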

Now that we have vectorized our NLP datasets, we can proceed with the modeling.

Training a LightGBM model

LightGBM, like XGBoost, is another very popular and performant gradient-boosting framework that leverages
boosted-tree ensembles and histogram-based split finding. The main differences lie in the split method's
algorithms, which for LightGBM uses sampling with Gradient-based One-Side Sampling (GOSS) and bundling
sparse features with Exclusive Feature Bundling (EFB) versus XGBoost's more rigorous Weighted Quantile
Sketch and Sparsity-aware Split Finding. Another difference lies in how the trees are built, which is depth-first
(top-down) for XGBoost and best-first (across a tree's leaves) for LightGBM. We won't get into the details of how
these algorithms work because that would derail the topic at hand. However, it's important to note that thanks to
GOSS, LightGBM is usually even faster than XGBoost, and though it can lose predictive performance due to
GOSS split approximations, it gains some of it back with its best-first approach. On the other hand, Explainable
Boosting Machine (EBM) makes LightGBM ideal for training on sparse features efficiently and effectively, such
as those in our X_train_nlp_fit sparse matrix! That pretty much sums up why we are using LightGBM for this
exercise.

To train the LightGBM model, we first initialize the model by setting the maximum tree depth ( max_depth ), the
learning rate ( learning_rate ), the number of boosted trees to fit ( n_estimators ), the objective , which is
binary classification, and—last but not least—the random_state for reproducibility. With fit , we train the
model using our vectorized NLP training dataset ( X_train_nlp_fit ) and the same labels used for the SVM
model ( y_train ). Once trained, we can evaluate using the evaluate_class_mdl we used with SVM. The code is
illustrated in the following snippet:
lgb_mdl = lgb.LGBMClassifier(max_depth=13, learning_rate=0.05,\
n_estimators=100, objective='binary', random_state=rand)
fitted_lgb_mdl = lgb_mdl.fit(X_train_nlp_fit, y_train)
y_train_lgb_pred, y_test_lgb_prob, y_test_lgb_pred =\
mldatasets.evaluate_class_mdl(fitted_lgb_mdl, X_train_nlp_fit, X_test_nlp_fit, y_train, y_test)

The preceding code produces Figure 5.13, shown here:


Figure 5.13 – Predictive performance of our LightGBM model

The performance achieved by LightGBM (see Figure 5.13) is slightly lower than for SVM (Figure 5.3) but it's still
pretty good, safely above the coin-toss line. The comments for SVM about favoring precision over recall for this
model also apply here.

Local interpretation for a single prediction at a time using LimeTextExplainer

To interpret any black-box model prediction with LIME, you need to specify a classifier function such as
predict_proba for your model, and it will use this function to make predictions with perturbed data in the
neighborhood of your instance and then train a linear model with it. The instance must be in its numerical form—
in other words, vectorized. However, it would be easier if you could provide any arbitrary text, and it could then
vectorize it on the fly. This is precisely what a pipeline can do for you. With the make_pipeline function from
scikit-learn, you can define a sequence of estimators that transform the data, followed by one that can fit it. In this
case, we just need vectorizer to transform our data, followed by our LightGBM model ( lgb_mdl ) that takes the
transformed data, as illustrated in the following code snippet:
lgb_pipeline = make_pipeline(vectorizer, lgb_mdl)

Initializing a LimeTextExplainer is pretty simple. All parameters are optional, but it's recommended to specify
names for your classes. Just as with LimeTabularExplainer , a kernel_width optional parameter can be critical
because it defines the neighborhood's size, and there's a default that may not be optimal but can be tuned on an
instance-by-instance basis. The code is illustrated here:
lime_lgb_explainer = LimeTextExplainer(\
class_names=['Not Highly Recomm.', 'Highly Recomm.'])

Explaining an instance with LimeTextExplainer is similar to doing it for LimeTabularExplainer . The difference is that we are using a pipeline ( lgb_pipeline ), and the data we are providing (first parameter) is text
since the pipeline can transform it for us. The code is illustrated in the following snippet:
lime_lgb_explainer.\
explain_instance(X_test_nlp[X_test_nlp.index==5].values[0],\
lgb_pipeline.predict_proba, num_features=4).\
show_in_notebook(text=True)

According to the LIME text explainer (see Figure 5.14), the LightGBM model predicts Highly Recommended for
observation #5 because of the word caramel. At least according to the local neighborhood, raspberry is not a
factor.

Figure 5.14 – LIME text explanation for observation #5 (Outstanding)

Now, let's contrast the interpretation for observation #5 with that of #24, as we've done before. We can use the
same code but simply replace 5 with 24, as follows:
lime_lgb_explainer.\
explain_instance(X_test_nlp[X_test_nlp.index==24].values[0], \
lgb_pipeline.predict_proba, num_features=4).\
show_in_notebook(text=True)

According to Figure 5.15, you can tell that observation #24, described as tasting like burnt wood earthy choco is
Not Highly Recommended because of the words earthy and burnt.

Figure 5.15 – LIME text explanation for observation #24 (Disappointing)

Given that we are using a pipeline that can vectorize any arbitrary text, let's have some fun with that! We will first
try a phrase made out of adjectives we suspect that our model favors, then try one with unfavorable adjectives, and
lastly try using words that our model shouldn't be familiar with, as follows:
lime_lgb_explainer.explain_instance('creamy rich complex fruity', \
lgb_pipeline.predict_proba, num_features=4).\
show_in_notebook(text=True)
lime_lgb_explainer.explain_instance('sour bitter roasty molasses',
lgb_pipeline.predict_proba, num_features=4).\
show_in_notebook(text=True)
lime_lgb_explainer.explain_instance('nasty disgusting gross stuff', \
lgb_pipeline.predict_proba, num_features=4).\
show_in_notebook(text=True)

In Figure 5.16, the explanations are spot-on for creamy rich complex fruity and sour bitter roasty molasses
since the model knows these words to be either very favorable or unfavorable. These words are also common
enough to be appreciated on a local level.

You can see the output here:

Figure 5.16 – Arbitrary phrases not in the training or test dataset can be effortlessly explained with LIME, as long
as words are in the corpus

However, you'd be mistaken to think that the prediction of Not Highly Recommended for nasty disgusting gross
stuff has anything to do with the words. The LightGBM model hasn't seen these words before, so the prediction
has more to do with Not Highly Recommended being the majority class, which is a good guess, and the sparse
matrix for this phrase is all zeros. Therefore, LIME likely found few distant points—if any at all—in its
neighborhood, so the zero coefficients of LIME's local surrogate model reflect this.
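We can verify this with the fitted vectorizer. Assuming, as stated above, that none of these words appear anywhere in the tastes corpus, the transformed row stores no non-zero values:
print(vectorizer.transform(['nasty disgusting gross stuff']).nnz)   # expected: 0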
Trying SHAP for NLP
Most of SHAP's explainers will work with tabular data. DeepExplainer can do text but is restricted to deep
learning models, and, as we will cover in Chapter 7, Visualizing Convolutional Neural Networks, three of them do
images, including KernelExplainer . In fact, SHAP's KernelExplainer was designed to be a general-purpose
truly model-agnostic method, but it's not promoted as an option for NLP. It's easy to understand why: it's slow, and NLP models tend to be very complex, with hundreds, if not thousands, of features to boot. In cases such as
this one, where word order is not a factor and you have a few hundred features, but the top 100 are present in most
of your observations, KernelExplainer could work.

In addition to overcoming slowness, there are a couple of technical hurdles you would need to overcome. One of
them is that KernelExplainer is compatible with a pipeline, but it expects a single set of predictions back. But
LightGBM returns two sets, one for each class: Not Highly Recommended and Highly Recommended. To overcome
this problem, we can create a lambda function ( predict_fn ) that includes a predict_proba function, which
returns only those predictions for Highly Recommended. This is illustrated in the following code snippet:
predict_fn = lambda X: lgb_mdl.predict_proba(X)[:,1]

The second technical hurdle has to do with SHAP's incompatibility with SciPy's sparse matrices, and for our explainer we will need sample vectorized test data, which is in this format. To overcome this issue, we can convert our data from SciPy sparse-matrix format to a NumPy matrix and then to a pandas dataframe ( X_test_nlp_samp_df ). To
overcome any slowness, we can use the same kmeans trick we used last time. Other than the adjustments made to
overcome obstacles, the following code is exactly the same as with SHAP performed with the SVM model:
X_test_nlp_samp_df = pd.DataFrame(shap.\
sample(X_test_nlp_fit, 50).todense())
shap_lgb_explainer =\
shap.KernelExplainer(predict_fn,\
shap.kmeans(X_train_nlp_fit.todense(), 10))
shap_lgb_values_test =\
shap_lgb_explainer.shap_values(X_test_nlp_samp_df,\
l1_reg="num_features(20)")
shap.summary_plot(shap_lgb_values_test, X_test_nlp_samp_df,\
plot_type="dot", feature_names=vectorizer.get_feature_names())

By using SHAP's summary plot in Figure 5.17, you can tell that globally the words creamy, rich, cocoa, fruit,
spicy, nutty, and berry have a positive impact on the model toward predicting Highly Recommended. On the other
hand, sweet, sour, earthy, hammy, sandy, and fatty have the opposite effect. These results shouldn't be entirely
unexpected given what we learned with our prior SVM model with the tabular data and local LIME interpretations.
That being said, the SHAP values were derived from samples of a sparse matrix, and they could be missing details
and perhaps even be partially incorrect, especially for underrepresented features. Therefore, we should take the
conclusions with a grain of salt, especially toward the bottom half of the plot. To increase interpretation fidelity it's
best to increase sample size, but given the slowness of KernelExplainer , there's a trade-off to consider.

You can view the output here:


Figure 5.17 – SHAP summary plot for the LightGBM NLP model

Now that we have validated our SHAP values globally, we can use them for local interpretation with a force plot.
Unlike LIME, we cannot use arbitrary data for this. With SHAP, we are limited to those data points we have
previously generated SHAP values for. For instance, let's take the 18th observation from our test dataset sample, as
follows:
print(shap.sample(X_test_nlp, 50).to_list()[18])

The preceding code outputs this phrase:


woody earthy medicinal

It's important to note which words are represented in the 18th observation because the X_test_nlp_samp_df
dataframe contains the vectorized representation. The 18th observation's row in this dataframe is what you use to
generate the force plot, along with the SHAP values for this observation and the expected value for the class, as
illustrated in the following code snippet:
shap.force_plot(shap_lgb_explainer.expected_value,\
shap_lgb_values_test[18,:],\
X_test_nlp_samp_df.iloc[18,:],\
feature_names=vectorizer.get_feature_names())

Figure 5.18 is the force plot for woody earthy medicinal. As you can tell, earthy and woody weigh heavily in a
prediction against Highly Recommended. The word medicinal is not featured in the force plot and instead you get
a lack of creamy and cocoa as negative factors. As you can imagine, medicinal is not a word used often to
describe chocolate bars, so there was only one observation in the sampled dataset that included it. Therefore, its
average marginal contribution across possible coalitions would be greatly diminished.

Figure 5.18 – SHAP force plot for the 18th observation of the sampled test dataset

Let's try another one, as follows:


print(shap.sample(X_test_nlp, 50).to_list()[9])

The 9th observation is the following phrase:

intense spicy floral

Generating a force_plot for this observation is the same as before, except you replace 18 with 9 . If you run
this code, you produce the output shown here in Figure 5.19:

Figure 5.19 – SHAP force plot for the 9th observation of the sampled test dataset

As you can appreciate in Figure 5.19, all words in the phrase are featured in the force plot: floral and spicy
pushing toward Highly Recommended, and intense toward Not Highly Recommended. Now that you know how to perform both tabular and NLP interpretations with SHAP, how does it compare with LIME?

Comparing SHAP with LIME


As you will have noticed by now, both SHAP and LIME have limitations, but they also have strengths. SHAP is grounded in game theory and approximated Shapley values, so its SHAP values mean something. They have great properties such as additivity, efficiency, and substitutability that make them consistent, although, as we saw, KernelExplainer violates the dummy property. They always add up, and no parameter tuning is needed to accomplish this. However, SHAP is more suited for global interpretations, and its truly model-agnostic explainer, KernelExplainer , is painfully slow.
KernelExplainer also deals with missing values by using random ones, which can put too much weight on
unlikely observations.

LIME is speedy, very model-agnostic, and adaptable to all kinds of data. However, it's not grounded on strict and
consistent principles but has the intuition that neighbors are alike. Because of this, it can require tricky parameter
tuning to define the neighborhood size optimally, and even then, it's only suitable for local interpretations.

Mission accomplished
The mission was to understand why one of your client's bars is Outstanding while another one is Disappointing.
Your approach employed the interpretation of machine learning models to arrive at the following conclusions:

According to SHAP on the tabular model, the Outstanding bar owes that rating to its berry taste and its cocoa
percentage of 70%. On the other hand, the unfavorable rating for the Disappointing bar is due mostly to its
earthy flavor and bean country of origin ( Other ). Review date plays a smaller role, but it seems that
chocolate bars reviewed in that period (2013-15) were at an advantage.
LIME confirms that cocoa_percent<=70 is a desirable property, and that, in addition to berry, creamy,
cocoa, and rich are favorable tastes, while sweet, sour, and molasses are unfavorable.
The commonality between both methods using the tabular model is that despite the many non-taste-related
attributes, taste features are among the most salient. Therefore, it's only fitting to interpret the words used to
describe each chocolate bar via an NLP model.
The Outstanding bar was represented by the phrase oily nut caramel raspberry, of which, according to
LimeTextExplainer , caramel is positive and oily is negative. The other two words are neutral. On the other
hand, the Disappointing bar was represented by burnt wood earthy choco, of which burnt and earthy are
unfavorable and the other two are favorable.
The inconsistencies between the tastes in tabular and NLP interpretations are due to the presence of lesser-
represented tastes, including raspberry, which is not as common as berry.
According to SHAP's global explanation of the NLP model, creamy, rich, cocoa, fruit, spicy, nutty, and
berry have a positive impact on the model toward predicting Highly Recommended. On the other hand,
sweet, sour, earthy, hammy, sandy, and fatty have the opposite effect.

With these notions of which chocolate-bar characteristics and tastes are considered less attractive by Manhattan
Chocolate Society members, a client can apply changes to their chocolate-bar formulas to appeal to a broader
audience—that is, if the assumption is correct about that group being representative of their target audience.

It could be argued that it is pretty apparent that words such as earthy and burnt are not favorable words to
associate with chocolate bars, while caramel is. Therefore, we could have reached this conclusion without
machine learning! But first of all, a conclusion not informed by data would have been an opinion, and, secondly,
context is everything. Furthermore, humans can't always be relied upon to place one point objectively in its context
—especially considering it's among thousands of records!

Also, local model interpretation is not only about the explanation for one prediction because it's connected to how
a model makes all predictions but, more importantly, to how it makes predictions for similar points—in other
words, in the local neighborhood! In the next chapter, we will expand on what it means to be in the local
neighborhood by looking at the commonalities (anchors) and inconsistencies (counterfactuals) we can find there.

Summary
After reading this chapter, you should know how to use SHAP's KernelExplainer , as well as its decision and
force plot to conduct local interpretations. You also should know how to do the same with LIME's instance
explainer for both tabular and text data. Lastly, you should understand the strengths and weaknesses of SHAP's
KernelExplainer and LIME. In the next chapter, we will learn how to create even more human-interpretable
explanations of a model's decisions, such as "if X conditions are met, then Y is the outcome".

Dataset sources
Brelinski, Brady (2020). Manhattan Chocolate Society. http://flavorsofcacao.com/mcs_index.xhtml

Further reading
Platt, J. C. (1999). Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized
Likelihood Methods. Advances in Large Margin Classifiers, MIT Press.
https://www.cs.colorado.edu/~mozer/Teaching/syllabi/6622/papers/Platt1999.pdf
Lundberg, S. & Lee, S. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural
Information Processing Systems, 30. https://arxiv.org/abs/1705.07874 (documentation for SHAP:
https://github.com/slundberg/shap)
Ribeiro, M. T., Singh, S. & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of
Any Classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining. http://arxiv.org/abs/1602.04938
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. & Liu, T. (2017). LightGBM: A Highly
Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems vol. 30, pp.
3149-3157. https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree
6 Anchor and Counterfactual Explanations
Join our book community on Discord
https://packt.link/EarlyAccessCommunity

In previous chapters, we have learned how to attribute model decisions to features and their interactions with state-
of-the-art global and local model interpretation methods. However, the decision boundaries are not always easy to
define nor interpret with these methods. Wouldn't it be nice to be able to derive human-interpretable rules from
model interpretation methods? In this chapter, we will cover a few human-interpretable, local, classification-only
model interpretation methods. We will first learn how to use scoped rules called anchors to explain complex
models with statements such as if X conditions are met, then Y is the outcome. Then, we will explore
counterfactual explanations that follow the form if Z conditions aren't met, then Y is not the outcome.

These are the main topics we are going to cover in this chapter:

Understanding anchor explanations


Exploring counterfactual explanations

Technical requirements
This chapter's example uses the mldatasets , pandas , numpy , sklearn , catboost , matplotlib , seaborn ,
alibi , tensorflow , shap , and witwidget libraries. Instructions on how to install all of these libraries are in the
Preface. The code for this chapter is located here: https://github.com/PacktPublishing/Interpretable-Machine-
Learning-with-Python/tree/master/Chapter07.

The mission
In the United States, for the last two decades, private companies and non-profits have been developing criminal
risk assessment tools, most of which employ statistical models. As many states can no longer afford their large
prison populations, these methods have increased in popularity, guiding judges and parole boards through every
step of the prison system. However, they often do more than guide a decision; they effectively make it for justice system decision-makers, who assume it is correct. Worst of all, those decision-makers don't exactly know how an assessment was made. The risk is usually calculated with a white-box model but, in practice, it operates as a black box because it is proprietary. Predictive performance is also relatively low, with median AUC scores for nine tools ranging
between 0.57 and 0.74. Still, validity and biases are rarely examined, especially by the criminal justice institutions
that purchase them.

Even though traditional statistical methods are still the norm for criminal justice models, to improve performance,
some researchers have been proposing leveraging more complex models such as Random Forest with larger
datasets. Far from being science fiction drawn from Minority Report or Black Mirror, in other countries, scoring
people based on their likelihood of engaging in antisocial, or even antipatriotic, behavior with big data and
machine learning is already a reality.

As more and more AI solutions attempt to make life-changing predictions about us with our data, fairness must be
properly assessed, and all its ethical and practical implications must be adequately discussed. Chapter 1,
Interpretation, Interpretability, and Explainability; and Why Does It All Matter?, covered how fairness is an
integral concept for machine learning interpretation. You can evaluate fairness in any model, but fairness is
especially tricky when it involves human behavior. The dynamics between human psychological, neurological, and
sociological factors are extremely complicated. In the context of predicting criminal behavior, it boils down to
what factors are potentially to blame for a crime, because it wouldn't be fair to include anything else in a model,
and how these factors interact.

Quantitative criminologists are still debating the best predictors of criminality and their root causes. They're also
debating whether it is ethical to blame a criminal for these factors to begin with. Thankfully, demographic traits
such as race, gender, and nationality are no longer used in criminal risk assessments. But this doesn't mean that
these methods are no longer racially biased. Scholars recognize the problem and are proposing solutions.

This chapter will examine racial bias in one of the most widely used risk assessment tools. Given this topic's
sensitive and relevant nature, it was essential to provide a modicum of context about criminal risk assessment tools
and how machine learning and fairness connect with all of it. We won't go into much more detail, but it can't be overstated how vital this context is for appreciating how machine learning could perpetuate structural inequality and
unfair biases.

Now, let's introduce you to your mission for this chapter!

Unfair bias in recidivism risk assessments

An investigative journalist is writing an article on how one particular African American defendant was detained
while waiting for trial. A tool called Correctional Offender Management Profiling for Alternative Sanction
(COMPAS) deemed him as being at risk of recidivism. Recidivism is when someone relapses into criminal
behavior. The score so convinced the judge that the defendant had to be detained pretrial that they didn't even
consider any other arguments or testimonies. He was locked up for many months, and, in the trial, was found not
guilty. Over 5 years have passed since the trial, and he hasn't been accused of any crime. You could say the
prediction for recidivism was a false positive.

The journalist has reached out to you because she would like to ascertain with data science whether there was
unfair bias in this particular case. The COMPAS risk assessment is computed using 137 questions
(https://www.documentcloud.org/documents/2702103-Sample-Risk-Assessment-COMPAS-CORE.xhtml). It
includes questions such as the following:

"Based on the screener's observations, is this person a suspected or admitted gang member?"
"How often have you moved in the last 12 months?"
"How often do you have barely enough money to get by?"
Psychometric Likert-scale questions such as "I have never felt sad about things in my life."

Even though race is not one of the questions, many of these questions may correlate with race. Not to mention, in
some cases, they can be more a question of opinion than fact, and thus be prone to bias.

The journalist cannot provide you with the 137 answered questions or the COMPAS model because this data is not
publicly available. However, all defendants' demographic and recidivism data for the same county in Florida is.

The approach
You have decided to do the following:

Train proxy models: You don't have the original features or model, but all is not lost because you have the
COMPAS scores – the labels. And we also have relevant features to the problem we can connect to these
labels with models. By approximating the COMPAS model via these proxies, you can assess the unfairness reflected in its labels. In this chapter, we will train a CatBoost model.
Anchor explanations: Using this method will unearth insights into why the proxy model makes specific
predictions using a series of rules called anchors, which tell you where the decision boundaries lie. The
boundaries are relevant for our mission because we want to know why the defendant has been wrongfully
predicted to recidivate. It's an approximate boundary to the original model, but there's still some truth to it.
Counterfactual explanations: The opposite concept to anchors is about understanding why similar data
points are on the opposite side of the decision boundary, which is particularly notable when discussing topics
of unfairness. We will use an unbiased method to find counterfactuals and then use the What-If Tool (WIT)
to explore counterfactuals and fairness further.

The preparations
You will find the code for this example here: https://github.com/PacktPublishing/Interpretable-Machine-Learning-
with-Python/blob/master/Chapter07/Recidivism_part1.ipynb.

Loading the libraries

To run this example, you need to install the following libraries:

mldatasets to load the dataset


pandas and numpy to manipulate the dataset
sklearn (scikit-learn), and catboost to split the data and fit the models
matplotlib , seaborn , alibi , tensorflow , shap , and witwidget to visualize the interpretations

You should load all of them first:


import math
import mldatasets
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from catboost import CatBoostClassifier
import matplotlib.pyplot as plt
import seaborn as sns
from alibi.utils.mapping import ohe_to_ord, ord_to_ohe
import tensorflow as tf
from alibi.explainers import AnchorTabular, CounterFactualProto
import shap
import witwidget
from witwidget.notebook.visualization import WitWidget,\
WitConfigBuilder

Let's check that TensorFlow has loaded the right version with print(tf.__version__) . It should be 2.0 or above.
We should also disable eager execution and verify that it worked with this command. The output should say that
it's False :
tf.compat.v1.disable_eager_execution()
print('Eager execution enabled:', tf.executing_eagerly())

Understanding and preparing the data

We load the data like this into a dataframe we call recidivism_df :


recidivism_df = mldatasets.load("recidivism-risk", prepare=True)

There should be almost 15,000 records and 23 columns. We can verify this was the case with info() :
recidivism_df.info()

The following output checks out. All features are numeric with no missing values, and categorical features have
already been one-hot encoded for us:
Int64Index: 14788 entries, 0 to 18315
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   age                    14788 non-null  int8
 1   juv_fel_count          14788 non-null  int8
 2   juv_misd_count         14788 non-null  int8
 3   juv_other_count        14788 non-null  int64
 4   priors_count           14788 non-null  int8
 5   is_recid               14788 non-null  int8
 6   sex_Female             14788 non-null  uint8
 7   sex_Male               14788 non-null  uint8
 8   race_African-American  14788 non-null  uint8
 9   race_Asian             14788 non-null  uint8
 :
 13  race_Other             14788 non-null  uint8
 14  c_charge_degree_(F1)   14788 non-null  uint8
 15  c_charge_degree_(F2)   14788 non-null  uint8
 :
 21  c_charge_degree_Other  14788 non-null  uint8
 22  compas_score           14788 non-null  int64
dtypes: int64(2), int8(5), uint8(16)

The data dictionary

There are only nine features, but they become 22 columns because of the categorical encoding:

age : Continuous, the age of the defendant (between 18 and 96).


juv_fel_count : Continuous, the number of juvenile felonies (between 0 and 20).
juv_misd_count : Continuous, the number of juvenile misdemeanors (between 0 and 1).
juv_other_count : Continuous, the number of juvenile convictions that are neither felonies nor
misdemeanors (between 0 and 1).
priors_count : Continuous, the number of prior crimes committed (between 0 and 13).
is_recid : Binary, did the defendant recidivate within 2 years (1 for yes, 0 for no)?
sex : Categorical, the gender of the defendant.
race : Categorical, the race of the defendant.
c_charge_degree : Categorical, the degree of what the defendant is currently being charged with. The United
States classifies criminal offenses as felonies, misdemeanors, and infractions, ordered from most serious to
least. These are subclassified in the form of degrees, which go from 1st (most serious offenses) to 3rd or 5th
(least severe). However, even though this is standard for federal offenses, it is tailored to state law on a state
level. For felonies, Florida (http://www.dc.state.fl.us/pub/scoresheet/cpc_manual.pdf) has a level system that
determines the severity of a crime regardless of the degree, and this goes from 10 (most severe) to 1 (least).
The categories of this feature are prefixed with F for felonies and M for misdemeanors. They are followed by
a number, which is a level for felonies and a degree for misdemeanors.
compas_score : Binary, COMPAS scores defendants as "low," "medium," or "high" risk. In practice,
"medium" is often treated as "high" by decision-makers, so this feature has been converted to binary to reflect
this behavior: 1: high/medium risk, 0: low risk.
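For illustration, this is roughly how that binarization could have been done on the raw data. This is only a sketch: mldatasets already performs the preparation for us when prepare=True , and the score_text column name is an assumption based on the ProPublica release:
# Hypothetical sketch – not needed here because prepare=True already did this;
# 'score_text' is assumed to be the raw COMPAS risk label column
raw_df['compas_score'] =\
  raw_df['score_text'].isin(['Medium', 'High']).astype(int)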

Examining predictive bias with confusion matrices

There are two binary features in the dataset. The first one is the recidivism risk prediction made by COMPAS
( compas_score ). The second one ( is_recid ) is the ground truth because it's what happened within 2 years of the
defendant's arrest. Just as you would with the prediction of any model against its training labels, you can build
confusion matrices with these two features. scikit-learn can produce one with the confusion_matrix function
( cf_matrix ), and we can then create a Seaborn heatmap with it. Instead of plotting the number of True
Negatives (TNs), False Positives (FPs), False Negatives (FNs), and True Positives (TPs), we can plot
percentages with a simple division ( cf_matrix/np.sum(cf_matrix) ). The other parameters of heatmap only
assist with formatting:
cf_matrix = metrics.confusion_matrix(recidivism_df.is_recid,\
recidivism_df.compas_score)
sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True,\
fmt='.2%', cmap='Blues', annot_kws={'size':16})

The preceding code outputs Figure 6.1. The top-right corner is FPs, which account for nearly one-fifth of all predictions, and
together with the FNs in the bottom-left corner, they make up over one-third:

Figure 6.1 – Confusion matrix between the predicted risk of recidivism (compas_score) and the ground truth
(is_recid)

Figure 6.1 tells us that the COMPAS model's predictive performance is not very good, especially if we assume that
criminal justice decision-makers take medium- or high-risk assessments at face value. It also tells us that FPs
and FNs occur at similar rates. Nevertheless, simple visualizations such as the confusion matrix obscure
predictive disparities between subgroups of a population. We can quickly compare disparities between two
subgroups that historically have been treated differently by the United States criminal justice system. To this end,
we first subdivide our dataframe into two dataframes: one for Caucasians ( recidivism_c_df ) and another for
African Americans ( recidivism_aa_df ). Then we can generate confusion matrices for each dataframe and plot
them side by side with the following code:
recidivism_c_df =\
  recidivism_df[recidivism_df['race_Caucasian'] == 1]
recidivism_aa_df =\
  recidivism_df[recidivism_df['race_African-American'] == 1]
_ = mldatasets.\
compare_confusion_matrices(recidivism_c_df.is_recid,\
recidivism_c_df.compas_score,\
recidivism_aa_df.is_recid,\
recidivism_aa_df.compas_score,\
'Caucasian', 'African-American',\
compare_fpr=True)

The preceding snippet generated Figure 6.2. At a glance, you can tell that it's as if the confusion matrix for
Caucasians has been flipped 90 degrees to form the African American confusion matrix, and even then, the Caucasian
matrix remains the less unfair of the two. Pay close attention to the difference between FPs and TNs. For a Caucasian
defendant, a result is more than half as likely to be an FP as a TN, but for an African American defendant, an FP is a
few percentage points more likely than a TN. In other words, a Black defendant who doesn't recidivate is predicted as
at risk of recidivating more than half of the time:

Figure 6.2 – Comparison of the confusion matrices for the predicted risk of recidivism (compas_score) and the
ground truth (is_recid) between African Americans and Caucasians in the dataset

Instead of eyeballing the plots, we can measure the False Positive Rate (FPR), which is the ratio FP/(FP+TN).
Then, we can compute the FPR for both groups and divide one by the other to examine the relative difference. The
higher this ratio between the FPRs, the more unfairness there is, because it means one group is being misclassified
as likely to recidivate more often.
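Since we already have both subsetted dataframes, a minimal sketch of this calculation could look as follows; it simply reuses scikit-learn's confusion_matrix :
def fpr(y_true, y_pred):
    # FPR = FP / (FP + TN)
    tn, fp, _, _ = metrics.confusion_matrix(y_true, y_pred).ravel()
    return fp / (fp + tn)
fpr_c = fpr(recidivism_c_df.is_recid, recidivism_c_df.compas_score)
fpr_aa = fpr(recidivism_aa_df.is_recid, recidivism_aa_df.compas_score)
print('FPR ratio: %.2f' % (fpr_aa / fpr_c))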

Data preparation
Before we move on to the modeling and interpretation, we have one last step.

Since we loaded the data with prepare=True , all we have to do now is train/test split the data. As usual, it is critical to set
your random states so that all your findings are reproducible. We will then set our y to be our target variable
( compas_score ) and set X as every other feature except for is_recid , because this is the ground truth. Lastly,
we split y and X into train and test datasets as we have before:
rand = 9
np.random.seed(rand)
y = recidivism_df['compas_score']
X = recidivism_df.drop(['compas_score', 'is_recid'],\
axis=1).copy()
X_train, X_test, y_train, y_test = train_test_split(X, y,\
test_size=0.2, random_state=rand)

Now, let's get started!

Modeling

First, let's quickly train the model we will use throughout this chapter.

Proxy models are a means to emulate output from a black-box model just like global surrogate models, which
we covered in Chapter 4, Global Model-Agnostic Interpretation Methods. So, are they the same thing? In machine
learning, surrogate and proxy are terms that are often used interchangeably. However, semantically, surrogacy
relates to substitution and proxy relates more to representation. So, we call these proxy models to signal that
we don't have the exact training data: we can only represent the original model because we cannot
substitute it. For the same reason, unlike interpretation with surrogates, which is best served by simpler models, a
proxy is best suited to complex models that can make up for the difference in training data with their expressiveness.

We will train a CatBoost classifier. For those of you who aren't familiar with CatBoost, it's an efficient boosted
tree ensemble method. It's similar to LightGBM, except it uses a technique called Minimal Variance
Sampling (MVS) instead of Gradient-Based One-Side Sampling (GOSS). Unlike LightGBM, it grows trees in a
balanced fashion. It's called CatBoost because it can automatically encode categorical features, and it's particularly
good at tackling overfitting, with unbiased treatment of categorical features and class imbalances. We won't go into
a whole lot of detail, but it was chosen for this exercise for those reasons.

As a tree-based model class, you can specify a maximum depth value for CatBoostClassifier . We are setting a
relatively high learning_rate value and a lower iterations value (the default is 1000). Once we have used
fit on the model, we can evaluate the results with evaluate_class_mdl :

cb_mdl = CatBoostClassifier(iterations=500, learning_rate=0.5,\
                            depth=8)
fitted_cb_mdl = cb_mdl.fit(X_train, y_train, verbose=False)
y_train_cb_pred, y_test_cb_prob, y_test_cb_pred =\
  mldatasets.evaluate_class_mdl(fitted_cb_mdl, X_train,\
                                X_test, y_train, y_test)

You can appreciate the output of evaluate_class_mdl for our CatBoost model in Figure 6.3:
Figure 6.3 – Predictive performance of our CatBoost model

From a fairness standpoint, we care more about FPs than FNs because it's more unfair to put an innocent person in
prison than it is to leave a guilty person on the streets. Therefore, we should aspire to have higher precision than
recall. Figure 6.3 confirms this, along with a healthy ROC curve, ROC-AUC, and MCC.

The predictive performance for the model isn't bad considering it’s a proxy model meant to only approximate the
real thing with different, yet related, data.
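If you want to double-check this beyond the evaluate_class_mdl summary, here is a minimal sketch with scikit-learn, assuming y_test_cb_pred holds the test-set class predictions returned above:
# Precision is expected to come out higher than recall here
print('Precision: %.3f' % metrics.precision_score(y_test, y_test_cb_pred))
print('Recall:    %.3f' % metrics.recall_score(y_test, y_test_cb_pred))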

Getting acquainted with our "instance of interest"

The journalist reached out to you with a case in mind: the African American defendant who was falsely predicted
to recidivate. This case is # 5231 and is your main instance of interest. Since our focus is racial bias, we'd like to
compare it with similar instances but of different races. To that end, we found case #10127 (Caucasian) and
#2726 (Hispanic).
We can take a look at the data for all three. Since we will keep referring to these instances throughout this chapter,
let's first save the indexes of the African American ( idx1 ), Hispanic ( idx2 ), and Caucasian ( idx3 ) cases. Then,
we can subset the test dataset by these indexes. Since we have to make sure that our predictions match, we will
concatenate this subsetted test dataset to the true labels ( y_test ) and the CatBoost predictions
( y_test_cb_pred ):
idx1 = 5231
idx2 = 2726
idx3 = 10127
eval_idxs = X_test.index.isin([idx1, idx2, idx3])
X_test_evals = X_test[eval_idxs]
eval_compare_df = pd.concat([
pd.DataFrame({'y':y_test[eval_idxs]},
index=[idx3, idx2, idx1]),
pd.DataFrame({'y_pred':y_test_cb_pred[eval_idxs]},
index=[idx3, idx2, idx1]),
X_test_evals], axis=1).transpose()
eval_compare_df

The preceding code produces the data frame in Figure 6.4. You can tell that the predictions match the true labels,
and our main instance of interest was the only one predicted as a medium or high risk of recidivism. Besides race,
the only other differences are in c_charge_degree and a minor difference in age :
Figure 6.4 – Observations #5231, #10127, and #2726 side by side with feature differences highlighted

Throughout this chapter, we will pay close attention to these differences to see whether they played a large role in
producing the prediction difference. All the methods we will cover will complete the picture of what can determine
or change the proxy model's decision, and, potentially, the COMPAS model by extension. Now that we have
completed the setup, we will be moving forward with employing the interpretation methods.

Understanding anchor explanations


In Chapter 5, Local Model-Agnostic Interpretation Methods, we learned that LIME trains a local surrogate model
(specifically a weighted sparse linear model) on a perturbed version of your dataset in the neighborhood of
your instance of interest. The result is that you approximate a local decision boundary that can help you interpret
the model's prediction for it.

Like LIME, anchors are also derived from a model-agnostic perturbation-based strategy. However, they are not
about the decision boundary but the decision region. Anchors are also known as scoped rules because they list
some decision rules that apply to your instance and its perturbed neighborhood. This neighborhood is also known
as the perturbation space. An important detail is to what extent the rules apply to it, known as precision.

Imagine the neighborhood around your instance. You would expect the points to have more similar predictions the
closer you get to your instance, right? So, if you had decision rules that defined these predictions, the smaller the
area surrounding your instance, the more precise your rules tend to be. A related concept is coverage, which is the
percentage of your perturbation space to which the anchor's rules apply at a given precision.

Unlike LIME, anchors don't fit a local surrogate model to explain your chosen instance's prediction. Instead, they
explore possible candidate decision rules using an algorithm called Kullback-Leibler divergence Lower and
Upper Confidence Bounds (KL-LUCB), which is derived from a Multi-Armed Bandit (MAB) algorithm.

MABs are a family of reinforcement learning algorithms about maximizing payoff when you have limited
resources to explore all unknown possibilities. The algorithm originated from understanding how casino slot
machine players could maximize their payoff by playing multiple machines. It's called multi-armed bandit because
slot machine players are known as one-armed bandits. Yet players don't know which machine will yield the highest
payoff, can't try all of them at once, and have finite funds. The trick is to learn how to balance exploration (trying
unknown slot machines) with exploitation (using those you already have reasons to prefer).

In the anchors case, each slot machine is a potential decision rule, and the payoff is how much precision it yields.
The KL-LUCB algorithm uses confidence regions based on the Kullback-Leibler divergence between the
distributions to find the decision rule with the highest precision sequentially, yet efficiently.

Preparations for anchor and counterfactual explanations with alibi

Several small steps need to be performed to help the alibi library produce human-friendly explanations. The first
one pertains to the prediction, since the model may output a 1 or a 0, but it's easier to understand a prediction by its
name. To help us with this, we need a list with the class names where the 0 position matches our negative class
name and the 1 matches the positive one:
class_names = ['Low Risk', 'Medium/High Risk']

Next, let's create a numpy array with our main instance of interest and print it out. Please note that the single-
dimension array needs to be expanded ( np.expand_dims ) so that it's understood by alibi :
X_test_eval = np.expand_dims(X_test.values[X_test.\
index.get_loc(idx1)], axis=0)
print(X_test_eval)

The preceding code outputs an array with the 21 features, of which 16 are the result of One-Hot Encoding (OHE):
[[23 0 0 0 2 0 1 1 0 ... 0 1 0 0 0 0]]

A problem with making human-friendly explanations arises when you have OHE categories. To both the machine
learning model and the explainer, each OHE feature is separate from the others. Still, to the human interpreting the
outcomes, they cluster together as categories of their original features.

The alibi library has several utility functions to deal with this problem, such as ohe_to_ord , which takes a one-
hot-encoded instance and puts it in an ordinal format. To use this function, we first define a dictionary
( cat_vars_ohe ) that tells alibi where the categorical variables are in our features and how many categories
each one has. For instance, in our data, gender starts at the 5th index and has two categories, which is why our
cat_vars_ohe dictionary begins with 5: 2 . Once you have this dictionary, ohe_to_ord can take your instance
( X_test_eval ) and output it in ordinal format, where each categorical variable takes up a single feature. This
utility function will prove useful for Alibi's counterfactual explanations, where the explainer will need this
dictionary to map categorical features together:
cat_vars_ohe = {5: 2, 7: 6, 13: 8}
print(ohe_to_ord(X_test_eval, cat_vars_ohe)[0])
The preceding code outputs the following array:
[[23 0 0 0 2 1 0 3]]
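The mapping also works in reverse with ord_to_ohe , which we imported earlier. A rough sketch of the round trip follows; note that ohe_to_ord also returns the ordinal category mapping, which ord_to_ohe expects as its second argument:
# Round trip: one-hot -> ordinal -> one-hot (should recover the original values)
X_ord, cat_vars_ord = ohe_to_ord(X_test_eval, cat_vars_ohe)
X_ohe_back, _ = ord_to_ohe(X_ord, cat_vars_ord)
print(np.array_equal(X_ohe_back, X_test_eval))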

When the data is in this ordinal format, Alibi will need a dictionary that provides names for each category, as well as a list of feature names:
category_map = {
5: ['Female', 'Male'],\
6: ['African-American', 'Asian', 'Caucasian',\
'Hispanic', 'Native American', 'Other'],\
7: ['Felony 1st Degree', 'Felony 2nd Degree',\
'Felony 3rd Degree', 'Felony 7th Degree',\
'Misdemeanor 1st Degree', 'Misdemeanor 2nd Degree',\
'Misdemeanor 3rd Degree', 'Other Charge Degree'] }
feature_names = ['age', 'juv_fel_count', 'juv_misd_count',\
'juv_other_count', 'priors_count',\
'sex', 'race', 'c_charge_degree']

However, Alibi's anchor explanations use the data as it is provided to our models. We are using OHE data, so we
need a category map for that format. Of course, the OHE features are all binary, so they only have two "categories"
each:
category_map_ohe = {5: ['Not Female', 'Female'],\
6: ['Not Male', 'Male'],\
7:['Not African American', 'African American'],\
8:['Not Asian', 'Asian'], 9:['Not Caucasian', 'Caucasian'],\
10:['Not Hispanic', 'Hispanic'],\
11:['Not Native American', 'Native American'],\
12:['Not Other Race', 'Other Race'],\
:
19:['Not Misdemeanor 3rd Deg', 'Misdemeanor 3rd Deg'],\
20:['Not Other Charge Degree', 'Other Charge Degree']}

Local interpretations for anchor explanations

All Alibi explainers require a predict function, so we create a lambda function called predict_cb_fn for our
CatBoost model. Please note that we are using predict_proba for the classifier's probabilities. Then, to initialize
AnchorTabular , we also provide it with our features' names as they are in our OHE dataset and the category map
( category_map_ohe ). Once it has initialized, we fit it with our training data:
predict_cb_fn = lambda x: fitted_cb_mdl.predict_proba(x)
anchor_cb_explainer = AnchorTabular(predict_cb_fn,\
                                    X_train.columns,\
                                    categorical_names=category_map_ohe)
anchor_cb_explainer.fit(X_train.values)

Before we leverage the explainer, it's good practice to check that the anchor "holds." In other words, we should
check that the MAB algorithm found decision rules that help explain the prediction. To verify this, you use the
predictor function to check that the prediction is the same as the one you expect for this instance. Right now, we
are using idx1 , which is the case of the African American defendant:
print('Prediction: %s' % class_names[anchor_cb_explainer.\
predictor(X_test.loc[idx1].values)[0]])

The preceding code outputs the following:


Prediction: Medium/High Risk

We can proceed to use the explain function to generate an explanation for our instance. We can set our precision
threshold to 0.85 , which means we expect the predictions on anchored observations to be the same as our
instance at least 85% of the time. Once we have an explanation, we can print the anchors as well as their precision
and coverage:
anchor_cb_explanation =\
  anchor_cb_explainer.explain(X_test.loc[idx1].values,\
                              threshold=0.85, seed=rand)
print('Anchor: %s' % (' AND '.join(anchor_cb_explanation.anchor)))
print('Precision: %.3f' % anchor_cb_explanation.precision)
print('Coverage: %.3f' % anchor_cb_explanation.coverage)

The following output was generated by the preceding code. You can tell that age , priors_count , and
race_African-American are factors at 86% precision. Impressively, this rule applies to almost a third of all the
perturbation space's instances:
Anchor: age <= 25.00 AND
priors_count > 0.00 AND
race_African-American = African American
Precision: 0.863
Coverage: 0.290

We can try the same code but with a 5% bump in the precision threshold, as sketched below. It produces the same
first three anchors it did with the lower precision threshold but now expands the rule with two more.
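Here's a minimal sketch of that adjusted call, reusing the same explainer and instance but raising the threshold to 0.90 :
anchor_cb_explanation =\
  anchor_cb_explainer.explain(X_test.loc[idx1].values,\
                              threshold=0.90, seed=rand)
print('Anchor: %s' % (' AND '.join(anchor_cb_explanation.anchor)))
print('Precision: %.3f' % anchor_cb_explanation.precision)
print('Coverage: %.3f' % anchor_cb_explanation.coverage)
This outputs the following: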
Anchor: age <= 25.00 AND
priors_count > 0.00 AND
race_African-American = African American AND
c_charge_degree_(M1) = Not Misdemeanor 1st Deg AND
c_charge_degree_(F3) = Not Felony 3rd Level AND
race_Caucasian = Not Caucasian
Precision: 0.903
Coverage: 0.290

Interestingly enough, although precision did increase by a few percentage points, coverage stayed the same, so the
additional anchors apply to a similar subset of perturbations with increased accuracy. At this level of precision, we
may confirm that race is a significant factor because being African American is an anchor but so is not being
Caucasian. Another factor was c_charge_degree . The explanation reveals that being accused of a first-degree
misdemeanor or third-level felony would have been better. Understandably, a seventh-level felony is a more
serious charge than these two.

Another way of understanding why a model made a specific prediction is looking for a similar data point that had
the opposite prediction and figuring out why it made that one. The decision boundary crosses between both points,
so it's helpful to contrast decision explanations from both sides of the boundary. This time we will use idx3 ,
which is the case of the Caucasian defendant. With a threshold of 85%, it outputs the following anchors:
Anchor: priors_count <= 2.00 AND
race_African-American = Not African American AND
c_charge_degree_(M1) = Misdemeanor 1st Deg
Precision: 0.891
Coverage: 0.578

The first anchor is priors_count <= 2.00 , but on the other side of the boundary, the first two anchors were
age <= 25.00 and priors_count > 0.00 . In other words, for an African American defendant aged 25 or under,
any amount of priors is enough to categorize them as having a medium/high risk of recidivism (86% of the
time). On the other hand, for a White person, as long as priors don't exceed two and the current charge is a
first-degree misdemeanor, they will be predicted as low risk (89% of the time and with 58% coverage). These
decision rules not only suggest racial bias by race alone but also by applying double standards on other features.
A double standard is when different rules are applied when, in principle, the situation is the same. In this case, the
different rules for priors_count and the absence of age as a factor for Caucasians constitute double standards.

We can now try the Hispanic defendant ( idx2 ) to observe whether double standards are also to be found with this
instance. We just run the same code as before but replace idx3 with idx2 :
Anchor: priors_count <= 2.00 AND
race_African-American = Not African American AND
juv_fel_count <= 0.00 AND
sex_Male = Male
Precision: 0.851
Coverage: 0.578

The explanations for the Hispanic defendant confirm the double standard with priors_count and show that race
continues to be a strong factor, since not being African American once again appears as an anchor.

For specific model decisions, anchor explanations answer the question "why?" However, we have crossed the
decision boundary looking for answers as to why our point wasn't on that side. By doing so, we have dabbled in the
question "what if?" In the next section, we will expand on this question further.

Exploring counterfactual explanations


Counterfactuals are an integral part of human reasoning. How many of us have muttered the words "If I had done
X instead, my outcome y would have been different"? There's always one or two things that, if done differently,
could lead to the outcomes we prefer!

In machine learning outcomes, you can leverage this way of reasoning to make for extremely human-friendly
explanations where we can explain outcomes in terms of what would need to change to get the opposite outcome
(the counterfactual class). After all, we are often interested in knowing how to make a lousy outcome better. For
instance, how do you get your denied loan application approved or decrease your risk of cardiovascular disease
from high to low? However, hopefully, answers to those questions aren't a huge list of changes. You expect the
smallest amount of changes required to change your outcome.

Regarding fairness, counterfactuals are an important interpretation method, in particular when there are elements
involved that we can't change or shouldn't have to change. For instance, if you perform exactly the same job and
have the same level of experience as your coworker, you expect to have the same salary, right? If you and your
spouse share the same assets and credit history but have different credit scores, you have to wonder why. Does it
have to do with gender, race, age, or even political affiliations? Whether it's a compensation, credit rating, or
recidivism risk model, you'd hope that similar points have similar outcomes.

Finding counterfactuals is not particularly hard. All we have to do is change our instance of interest slightly until it
changes the outcome. And maybe there's an instance already in the dataset just like that!

In fact, you could say that the three instances we examined with anchors in the previous section are close enough
to be counterfactuals of each other, except for the Caucasian and Hispanic cases, which have the same outcome.
But the Caucasian and Hispanic instances were "cherry-picked" by looking for data points with the same criminal
history but different races than the instance of interest. Perhaps by comparing similar points, mostly except for
race, we limited the scope in such a way that we confirm what we hope to confirm, which is that race matters for
the model's decision-making.

This is an example of selection bias. After all, counterfactuals are inherently selective because they focus on a few
feature changes. And even with a few features, there are so many possible permutations that change the outcome,
which means that a single point could have hundreds of counterfactuals. And not all of these will tell a consistent
story. This phenomenon is called the Rashomon effect. It is named after a famous Japanese movie about a murder
mystery. And as we have come to expect from murder mysteries, witnesses have different interpretations of what
happened. But in the same way that it's difficult to rely on a single witness, you cannot rely on a single
counterfactual. Also, in the same way that great detectives are trained to look for clues everywhere in connection
to the scene of a crime (even if it contradicts their instincts), counterfactuals can't be "cherry-picked" because they
conveniently tell the story we want them to tell.

Fortunately, there are algorithmic ways of looking for counterfactual instances in an unbiased manner. Typically,
these involve finding the closest points with different outcomes, but there are different ways of measuring the
distance between points. For starters, there's the L1 distance (also known as the Manhattan distance) and L2
distance (also known as the Euclidean distance), among many others. But there's also the question of normalizing
the distances because not all features have the same scale. Otherwise, they would be biased against features with
smaller scales, such as one-hot-encoded features. There are many normalization schemes to choose from too. You
could use standard deviation, min-max scaling, or even median absolute deviation [9].
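To make this concrete, here's a rough sketch (not the chapter's code) of finding the nearest in-dataset point with the opposite prediction, using an L1 distance normalized by the median absolute deviation; it reuses the X_test , X_test_eval , y_test_cb_pred , and idx1 variables defined earlier:
# Median absolute deviation per feature (fall back to 1 to avoid dividing by 0)
mad = np.median(np.abs(X_test.values -\
                       np.median(X_test.values, axis=0)), axis=0)
mad[mad == 0] = 1
# Normalized L1 distance of every test point to our instance of interest
l1_dists = (np.abs(X_test.values - X_test_eval) / mad).sum(axis=1)
# Only consider points predicted to be in the opposite class
y_pred_arr = np.asarray(y_test_cb_pred)
pos = X_test.index.get_loc(idx1)
opposite = y_pred_arr != y_pred_arr[pos]
nearest_cf = X_test.index[opposite][np.argmin(l1_dists[opposite])]
print(nearest_cf)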

In this section, we will explain and use one advanced counterfactual finding method. Then, we will explore
Google's WIT. It has a simple L1- and L2-based counterfactual finder, which is limited to the dataset but makes up
for it with other useful interpretation features.

Counterfactual explanations guided by prototypes

The most sophisticated counterfactual finding algorithms do the following:

Loss: They leverage a loss function that the optimization minimizes to find the counterfactuals closest to our
instance of interest.
Perturbation: These tend to operate with a perturbation space much like anchors do, changing as few
features as possible. Please note that counterfactuals don't have to be real points in your dataset. That would
be far too limiting. Counterfactuals exist in the realm of the possible, not of the necessarily known.
Distribution: However, they have to be realistic, and therefore, interpretable. For example, a loss function
could help determine that age < 0 alone is enough to make any medium-/high-risk instance low-risk. This is
why counterfactuals should lie close to the statistical distributions of your data, especially class-specific
distributions. They also should not be biased against smaller-scale features, namely categorical variables.
Speed: These run fast enough to be useful in real-world scenarios.

Alibi's Counterfactuals Guided by Prototypes (CounterFactualProto) has all these properties. It has a loss
function that includes both L1 (Lasso) and L2 (Ridge) regularization as a linear combination, much like the elastic
net penalty, but with a weight (β) applied only to the L1 term: βL1 + L2. The clever part of this algorithm is that it
can (optionally) use an autoencoder to understand the distributions. We will leverage one in Chapter 7, Visualizing
Convolutional Neural Networks. However, what's important to note here is that autoencoders, in general, are
neural networks that learn a compressed representation of your training data. This method incorporates loss terms
from the autoencoder, such as one for the nearest prototype. A prototype is the dimensionality-reduced
representation of the counterfactual class.

If an autoencoder is not available, the algorithm uses a tree often used for multidimensional search (k-d trees)
instead. With this tree, the algorithm can efficiently capture the class distributions and also choose the nearest
prototype. Once it has the prototype, the perturbations are guided by it. Incorporating a prototype loss term in the
loss function ensures that the resulting perturbations will be close enough to the prototype that is in-distribution for
the counterfactual class. Many modeling class and interpretation methods overlook the importance of treating
continuous and categorical features differently. CounterFactualProto can use two different distance metrics to
compute the pairwise distances between categories of a categorical variable: Modified Value Difference Metric
(MVDM) and Association-Based Distance Metric (ABDM), and can even combine both. Another way in which
CounterFactualProto ensures meaningful counterfactuals is by limiting perturbed features to predefined ranges.
We can use the minimum and maximum values of features to generate a tuple of arrays ( feature_range ):
feature_range =\
  (X_train.values.min(axis=0).reshape(1,21).astype(np.float32),\
   X_train.values.max(axis=0).reshape(1,21).astype(np.float32))
print(feature_range)

The preceding code outputs two arrays – the first one with the minimum and the second with the maximum of all
features:
(array([[18., 0., ... , 0., 0., 0.]], dtype=float32), array([[96., 20., ... , 1., 1., 1.]]

We can now instantiate an explainer with CounterFactualProto . As arguments, it requires the black-box model's
predict function ( predict_cb_fn ), the shape of the instance you want to explain ( X_test_eval.shape ), the
maximum amount of optimization iterations to perform ( max_iterations ), and the feature range for perturbed
instances ( feature_range ). Many hyperparameters can be tuned, including the β weight to apply to the L1 loss
( beta ) and the θ weight to apply to the prototype loss ( theta ). Also, you must specify whether to use the k-d tree
or not ( use_kdtree ) when the autoencoder model isn't provided. Once the explainer is instantiated, you fit it to
the test dataset. We are specifying the distance metric for categorical features ( d_type ) as the combination of
ABDM and MVDM:
cf_cb_explainer = CounterFactualProto(predict_cb_fn,\
                                      X_test_eval.shape, c_init=1,\
                                      max_iterations=500,\
                                      feature_range=feature_range,\
                                      beta=.01, theta=5, c_steps=2,\
                                      use_kdtree=True)
cf_cb_explainer.fit(X_test.values, d_type='abdm-mvdm')

Creating an explanation with an explainer is similar to how it was with anchors. Just pass the instance
( X_test_eval ) to the explain function. However, outputting the results is not as straightforward, mainly
because of converting the features between one-hot-encoded and ordinal formats, and iterating among the features. The
documentation for Alibi (https://docs.seldon.io/projects/alibi/) has a detailed example of how this is done. We will
instead use a utility function called describe_cf_instance that does this for us using the instance of interest
( X_test_eval ), explanation ( cf_cb_explanation ), class names ( class_names ), one-hot-encoded category
locations ( cat_vars_ohe ), category map ( category_map ), and feature names ( feature_names ):
cf_cb_explanation = cf_cb_explainer.explain(X_test_eval)
mldatasets.describe_cf_instance(X_test_eval, cf_cb_explanation,\
                                class_names, cat_vars_ohe,\
                                category_map, feature_names)

The following output was produced by the preceding code:


Instance Outcomes and Probabilities
-----------------------------------------------
original: Medium/High Risk
[0.46732193 0.53267807]
counterfactual: Low Risk
[0.50025815 0.49974185]
Categorical Feature Counterfactual Perturbations
------------------------------------------------
sex: Male --> Female
race: African-American --> Asian
c_charge_degree: Felony 7th Degree --> Felony 1st Degree
Numerical Feature Counterfactual Perturbations
------------------------------------------------
priors_count: 2.00 --> 1.90

You can appreciate from the output that the instance of interest ("original") has a 53.26% probability of being
Medium/High Risk, but the counterfactual is barely on the Low Risk side with 50.03%! A counterfactual that is
slightly on the other side is what we would like to see because it likely means that it is as close as possible to our
instance of interest. There are four feature differences between them, three of which are categorical ( sex , race ,
and c_charge_degree ). The fourth difference is with the priors_count numerical feature, which is treated as
continuous since the explainer doesn't know it's discrete. In any case, it should be monotonic, and therefore fewer
priors should always mean lower risk, which means we can interpret the 1.90 as a 1 because if 0.1 fewer priors
helped reduce the risk, a whole prior should also do so.

A more powerful insight derived from CounterFactualProto's output is that two demographic features were present
in the closest counterfactual to this instance. And this counterfactual was found with a method that is designed to follow our classes'
statistical distributions and isn't biased against or in favor of specific types of features. And even though it is
surprising to see Asian female in our counterfactual because it doesn't fit the narrative that White males are getting
preferential treatment, it is troubling to realize that race appears in the counterfactual at all.

The Alibi library has several counterfactual finding methods, including one that leverages Reinforcement
Learning. Alibi also uses k-d trees for its Trust Score, which I highly recommend as well! Trust Score is a method
to measure the agreement between any classifier and a modified nearest neighbors classifier by calculating the
ratio between the distances to the predicted class and the nearest other class. The reasoning behind this is that a
model's predictions should be consistent on a local level to be trustworthy. In other words, if you and your
neighbor are almost the same in every way, why would you be treated differently?
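A rough sketch of what using it could look like follows; the arguments shown are assumptions based on alibi's documented TrustScore API, so check the current documentation before relying on them:
# Hedged sketch: score how much to trust the prediction for our instance
from alibi.confidence import TrustScore
ts = TrustScore()
ts.fit(X_train.values, y_train.values, classes=2)
score, closest_class = ts.score(X_test_eval,\
                                predict_cb_fn(X_test_eval), k=2)
print(score, closest_class)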

Counterfactual instances and much more with the What-If Tool (WIT)
Google's WIT is a very versatile tool. It requires very little input or preparation and opens up in your Jupyter or
Colab notebook as an interactive dashboard with three tabs:

Datapoint editor: To visualize your datapoints, edit them, and explain their predictions.
Performance: To see high-level model performance metrics (for all regression and classification models).
For binary classification, this tab is called Performance and Fairness because, in addition to high-level
metrics, predictive fairness can be compared between your dataset's feature-based slices.
Features: To view general feature statistics.

Given that the Features tab doesn't relate to model interpretations, we will explore only the first two in this
section.

Configuring WIT

Optionally, we can enrich our interpretations in WIT by creating attributions, which are values that explain how
much each feature contributes to each prediction. You could use any method to generate attributions, but we will
use SHAP. We covered SHAP first in Chapter 4, Global Model-Agnostic Interpretation Methods. Since we will
interpret our CatBoost model in the WIT dashboard, the SHAP explainer that is most suitable is TreeExplainer ,
but DeepExplainer would work for a neural network model (and KernelExplainer for either). To initialize
TreeExplainer , all we need to pass is the fitted model ( fitted_cb_mdl ):

shap_cb_explainer = shap.TreeExplainer(fitted_cb_mdl)

WIT requires all the features in the dataset (including the labels). We will use the test dataset, so you could
concatenate X_test and y_test , but even those two exclude the ground truth feature ( is_recid ). One way of
getting all of them is to subset recidivism_df with the test dataset indexes ( y_test.index ). WIT also needs
your data and your columns in list format so we can save them as variables for later use ( test_np and cols_l ).
Lastly, for predictions and attributions, we will need to remove our ground truth ( is_recid ) and classification
label ( compas_score ), so let's save the index of these columns ( delcol_idx ):
test_df = recidivism_df.loc[y_test.index]
test_np = test_df.values
cols_l = test_df.columns
delcol_idx = [cols_l.get_loc("is_recid"),\
cols_l.get_loc("compas_score")]

WIT has several useful functions for customizing the dashboard, such as setting a custom distance metric
( set_custom_distance_fn ), displaying class names instead of numbers ( set_label_vocab ), setting a custom
predict function ( set_custom_predict_fn ), and a second predict function to compare two models
( compare_custom_predict_fn ).

In addition to set_label_vocab , we are only going to use a custom predict function


( custom_predict_with_shap ). All it needs to function is to take an array with your examples_np dataset and
produce some predictions ( preds ). However, we first must remove features that we want in the dashboard but
weren't used for the training ( delcol_idx ). This function's required output is a dictionary with the predictions
stored in a predictions key. But we'd also like some attributions too, which is why we need an attributions
key in that dictionary. Therefore, we take our SHAP explainer and generate shap_values , which is a NumPy
array. However, attributions need to be a list of dictionaries to be understood by the WIT dashboard. To this end,
we iterate shap_output and convert each observation's SHAP values array to a dictionary ( attrs ) and then
append this to a list ( attributions ):
def custom_predict_with_shap(examples_np):
    # For SHAP values, we only need the same features
    # that were used for training
    inputs_np = np.delete(np.array(examples_np), delcol_idx, axis=1)
    # Get the model's class predictions
    preds = predict_cb_fn(inputs_np)
    # Generate SHAP values for the examples and convert them
    # to a list-of-dictionaries format
    keepcols_l = [c for i, c in enumerate(cols_l)\
                  if i not in delcol_idx]
    shap_output = shap_cb_explainer.shap_values(inputs_np)
    attributions = []
    for shap_values in shap_output:
        attrs = {}
        for i, col in enumerate(keepcols_l):
            attrs[col] = shap_values[i]
        attributions.append(attrs)
    # The prediction function must output predictions/attributions
    # in a dictionary
    output = {'predictions': preds, 'attributions': attributions}
    return output
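Before wiring this function into WIT, you can optionally sanity-check it on a couple of rows; it should return two predictions along with two attribution dictionaries:
sample_output = custom_predict_with_shap(test_np[:2])
print(sample_output['predictions'])
print(list(sample_output['attributions'][0].items())[:3])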

Before we build the WIT dashboard, it's important to note that to find our instance of interest in the dashboard, we
need to know its position within the NumPy array provided to WIT because these don't have indexes as pandas
DataFrames do. To find the position, all we need to do is provide the get_loc function with the index:
print(y_test.index.get_loc(5231))

The preceding code outputs 2910 , so we can take note of this number. Building the WIT dashboard is fairly
straightforward now. We first initialize a config ( WitConfigBuilder ) with our test dataset in NumPy format
( test_np ) and our list of features ( cols_l ). Both are converted to lists with tolist() . Then, we set our custom
predict function with set_custom_predict_fn and our target feature ( is_recid ) and provide our class names.
We will use the ground truth this time to evaluate fairness from the perspective of what really happened. Once the
config is initialized, the widget ( WitWidget ) builds the dashboard with it. You can optionally provide a height
(default is 1000 pixels):
wit_config_builder = WitConfigBuilder(\
test_np.tolist(), feature_names=cols_l.tolist()
).set_custom_predict_fn(custom_predict_with_shap).\
set_target_feature("is_recid").set_label_vocab(class_names)
WitWidget(wit_config_builder, height=800)

Datapoint editor

In Figure 6.5, you can see the WIT dashboard with its three tabs. We will first explore the first tab (Datapoint
editor). It has Visualize and Edit panes on the left, and on the right, it can show you either Datapoints or Partial
dependence plots. When you have Datapoints selected, you can visualize the datapoints in many ways using the
controls in the upper right (highlighted area A). What we have done in Figure 6.5 is set the following:

Binning | X-axis: c_charge_degree_(F7) .


Binning | Y-axis: compas_score .
Color By: race_African-American .
Everything else stays the same.

These settings resulted in all our datapoints neatly organized in 2 rows and 2 columns and color-coded by African
American or not. The right column is for those with a level 7 charge degree, and the upper row is for those with a
Medium/High Risk COMPAS score. We can look for datapoint 2910 in this subgroup (B) by clicking on the top-
rightmost item. It should appear in the Edit pane (C). Interestingly enough, the SHAP attributions for this
datapoint are three times higher for age than they are for race_African-American . But still, race altogether is
second to age in importance. Also, notice that in the Infer pane, you see the predicted probability for
Medium/High Risk is approximately 83%:
Figure 6.5 – WIT dashboard with our instance of interest

WIT can find the nearest counterfactual using L1 or L2 distances. And it can use either feature values or
attributions to calculate the distances. As mentioned earlier, WIT can also include a custom distance finding
function if you add it to the configuration. For now, we will select L2 with Feature value. In Figure 6.6, these
options appear in the highlighted A area. Once you choose a distance metric and enable Nearest counterfactual, it
appears side by side with our instance of interest (area B), and it compares their predictions as shown in the
following figure (C). You can sort the features by Absolute attribution for a clearer understanding of feature
importance on a local level. The counterfactual is only 3 years older but has zero priors instead of two, yet that was
enough to reduce the Medium/High Risk probability to nearly 5%:
Figure 6.6 – How to find the nearest counterfactual in WIT

While both our instance of interest and counterfactual remain selected, we can visualize them along with all other
points. By doing this, you take insights from local interpretations and can create enough context for global
understandings. For instance, let's change our visualization settings to the following:

Binning | X-axis: Inference label .


Binning | Y-axis: (none) .
Scatter | X-axis: age .
Scatter | Y-axis: priors_count .

Everything else stays the same.

The result of this visualization is depicted in Figure 6.7. You can tell that the Low Risk bin's points tend to hover
in the lower end of priors_count . Both bins show that priors_count and age have a slight correlation, although
this is substantially more pronounced in the Medium/High Risk bin. However, what is most interesting is the
sheer density of African American data points deemed Medium/High Risk aged 18-25 and with
priors_count below three compared to those in the Low Risk bin. It suggests that both lower age and higher
priors_count increase risk more for African Americans than others:
Figure 6.7 – Visualizing age versus priors_count in WIT

We can try creating our own counterfactuals by editing the datapoint. What happens when we reduce
priors_count to one? The answer to this question is depicted in Figure 6.8. Once you make the change and click
on the Predict button in the Infer pane, it adds an entry to the prediction history in that same pane. You can tell
in Run #2 that the risk drops to roughly 33.5%, nearly 50 percentage points lower!
Figure 6.8 – Editing the datapoint to decrease priors_count in WIT

Now, what happens if the defendant is only 2 years older but has two priors? In Figure 6.9, Run #3 tells you that it
barely made it into the Low Risk category:

Figure 6.9 – Editing the datapoint to increase the age in WIT


Another feature that the Datapoint editor tab has is partial dependence plots, which we covered in Chapter 4,
Global Model-Agnostic Interpretation Methods. If you click on this radio button, it will modify the right pane to
look like Figure 6.10. By default, if a data point is selected, the PDPs are local, meaning they pertain to the chosen
datapoint. But you can switch to global. In any case, it's best to sort plots by variation as done for Figure 6.10,
where age and priors_count have the highest variation. Interestingly, neither of them is monotonic, which
doesn't make sense. The model should be learning that an increase in priors_count should consistently increase
risk. It should be the same with a decrease in age . After all, academic research shows that crime tends to peak in
the mid-20s and that higher priors increase the likelihood of recidivism. The relationship between these two
variables is also well understood, so perhaps some data engineering and monotonic constraints could make sure a
model is consistent with known phenomena rather than learning the inconsistencies in the data that lead to
unfairness. We will cover this in Chapter 12, Monotonic Constraints and Model Tuning for Interpretability:

Figure 6.10 – Local partial dependence plot for age and priors_count
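As a preview, here's a hedged sketch of what a monotonically constrained proxy could look like with CatBoost's monotone_constraints parameter; the constraint directions are assumptions based on the relationships described above, not a tuned configuration from this book:
# Hypothetical sketch: risk non-increasing in age, non-decreasing in priors_count
mono_constraints = [0] * X_train.shape[1]
mono_constraints[X_train.columns.get_loc('age')] = -1
mono_constraints[X_train.columns.get_loc('priors_count')] = 1
cb_mono_mdl = CatBoostClassifier(iterations=500, learning_rate=0.5,\
                                 depth=8,\
                                 monotone_constraints=mono_constraints)
cb_mono_mdl.fit(X_train, y_train, verbose=False)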

Is there something that can be done to improve fairness in a model that has already been trained? Indeed, there is.
The Performance & Fairness tab can help with that.

Performance & Fairness

When you click on the Performance & Fairness tab, you will see that it has Configure and Fairness panes on the
left. And on the right, you can explore the overall performance of the model (see Figure 6.11). In the upper part of
this pane, it has False Positives (%), False Negatives (%), Accuracy (%), and F1 fields. If you expand the pane,
it shows the ROC curve, PR curve, confusion matrix, and mean attributions – the average Shapley values. We have
covered all of these previously in this book either directly or indirectly, except for the PR curve. The Precision-
Recall (PR) is very much like the ROC curve, except it plots precision against recall instead of TPR versus FPR.
In this plot, precision is expected to decrease as recall increases. Unlike ROC, it's considered worse than a coin
toss when the line is close to the x axis, and it's best suited to imbalanced classification problems:

Figure 6.11 – Performance & Fairness tab initial view
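If you'd like to reproduce a PR curve outside of WIT, a minimal sketch with scikit-learn could look like this; it assumes y_test_cb_prob holds the positive-class probabilities returned by evaluate_class_mdl :
prec, rec, _ = metrics.precision_recall_curve(y_test, y_test_cb_prob)
plt.plot(rec, prec)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()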

A classification model will output probabilities that an observation is in one class or another. We usually take
every observation above or equal to 0.5 to belong to the positive class. Otherwise, we predict it to belong to the
negative class. This threshold is called the classification threshold, and you don't always have to use the standard
0.5.

There are many cases in which it is appropriate to perform threshold tuning. One of the most compelling reasons
is imbalanced classification problems because often models optimize performance on accuracy alone but end up
with bad recall or precision. Adjusting the threshold will improve the metric you most care about:
Figure 6.12 – Slicing performance metrics by race_African-American

Another primary reason to adjust thresholds is for fairness. To this end, you need to examine the metric you most
care about across different slices of your data. In our case, False Positives (%) is where we can appreciate
unfairness the most. For instance, take a look at Figure 6.12. With the Configure pane, we can slice the data by
race_African-American , and to the right of it, we can see what we observed at the beginning of this chapter,
which is that FPs for African Americans are substantially higher than for other segments. One way to fix this is
through an automatic optimization method such as Demographic parity or Equal opportunity. If you are to use
one of these, it's best to adjust Cost Ratio (FP/FN) to tell the optimizer that FPs are worth more than FNs:

Figure 6.13 – Adjusting the classification threshold for the dataset sliced by race_African-American

We can also adjust thresholds manually using the default Custom Thresholds setting (see Figure 6.13). For these
slices, if we want approximate parity with our FPs, we should use 0.78 as our threshold for when
race_African-American=1 . The drawback is that FNs will increase for this group, not achieving parity on that
end. A cost ratio would help determine whether 14.7% in FPs justifies 24.4% in FNs, but to do this we would have
to understand the average costs involved. We will examine odds calibration methods further in Chapter 11, Bias
Mitigation and Causal Inference Methods.
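To get a feel for this trade-off outside of the dashboard, here's a rough sketch of applying the higher threshold only to the African American slice and recomputing FPRs against the ground truth; it assumes y_test_cb_prob holds positive-class probabilities and reuses the fpr helper sketched earlier in this chapter:
aa_mask = (X_test['race_African-American'] == 1).values
y_true_gt = test_df['is_recid'].values  # ground truth, as in the dashboard
y_pred_adj = np.where(aa_mask, y_test_cb_prob >= 0.78,\
                      y_test_cb_prob >= 0.5).astype(int)
print('FPR (African American): %.3f' %\
      fpr(y_true_gt[aa_mask], y_pred_adj[aa_mask]))
print('FPR (everyone else):    %.3f' %\
      fpr(y_true_gt[~aa_mask], y_pred_adj[~aa_mask]))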

Mission accomplished
This chapter's mission was to see whether there was unfair bias in predicting whether a particular defendant would
recidivate. We demonstrated that the FPR for African American defendants is 1.87 times higher than for Caucasian
defendants. This disparity was confirmed with WIT, indicating that the model in question is much more likely to
misclassify the positive class on the basis of race. However, this is a global interpretation method, so it doesn't
answer our question regarding a specific defendant. Incidentally, in Chapter 11, Bias Mitigation and Causal
Inference Methods, we will cover other global interpretation methods for unfairness.
To ascertain whether the model was racially biased toward the defendant in question, we leveraged anchor and
counterfactual explanations – they both output race as a primary feature in their explanations. The anchor did it
with relatively high precision and coverage, and Counterfactuals Guided by Prototypes found that the closest one
has a different race. That being said, in both cases, race wasn't the only feature in the explanations. The features
usually included any or all of the following: priors_count , age , charge_degree , and sex . The inconsistent
rules involving the first three regarding race suggest double standards and the involvement of sex suggests
intersectionality. Double standards are when rules are applied unfairly to different groups. Intersectionality is to
do with how overlapping identities create different systems of interconnected modes of discrimination. However,
we know that females of all races are less likely to recidivate according to academic research. Still, we have to ask
ourselves whether they have a structural advantage that makes them necessarily privileged in this context. There's
a more elaborate dynamic going on than meets the eye. The bottom line is that despite all the other factors that
interplay with race, and provided that there's no relevant criminological information that we are missing, yes,
there's racial bias involved in this particular prediction.

Summary
After reading this chapter, you should know how to leverage anchors, to understand the decision rules that impact
a classification, and counterfactuals, to grasp what needs to change for the predicted class to change. You also
learned how to assess fairness using confusion matrices and Google's WIT. In the next chapter, we will study
interpretation methods for Convolutional Neural Networks (CNNs).

Dataset sources
ProPublica Data Store. (2019). COMPAS Recidivism Risk Score Data and Analysis. Originally retrieved from
https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysisv

Further reading
Desmarais, S.L., Johnson, K.L., & Singh, J.P. (2016). Performance of recidivism risk assessment instruments
in U.S. correctional settings. Psychol Serv;13(3):206-222. https://doi.org/10.1037/ser0000075
Berk, R., Heidari, H., Jabbari, S., Kearns, M., & Roth, A. (2017). Fairness in Criminal Justice Risk
Assessments: The State of the Art. Sociological Methods & Research.
Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). Machine Bias. There is software that is used across
the county to predict future criminals. https://www.propublica.org/article/machine-bias-risk-assessments-in-
criminal-sentencing
Ribeiro, M.T., Singh, S., & Guestrin, C. (2018). Anchors: High-Precision Model-Agnostic Explanations.
Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society.
https://doi.org/10.1145/3375627.3375830
Rocque, M., Posick, C., & Hoyle, J. (2015). Age and Crime. The encyclopedia of crime and punishment, 1-8.
https://doi.org/10.1002/9781118519639.wbecpx275
Dhurandhar, A., Chen, P., Luss, R., Tu, C., Ting, P., Shanmugam, K., & Das, P. (2018). Explanations based on
the Missing: Towards Contrastive Explanations with Pertinent Negatives. NeurIPS.
https://arxiv.org/abs/1802.07623
Jiang, H., Kim, B., & Gupta, M.R. (2018). To Trust Or Not To Trust A Classifier. NeurIPS.
https://arxiv.org/pdf/1805.11783.pdf
9 Interpretation Methods for Multivariate
Forecasting and Sensitivity Analysis
Join our book community on Discord
https://packt.link/EarlyAccessCommunity

Throughout this book, we have learned about various methods we can use to interpret supervised learning models.
They can be quite effective at assessing models while also uncovering their most influential predictors and their
hidden interactions. But as the term supervised learning suggests, these methods can only leverage known samples
and permutations based on these known samples' distributions. However, when these samples represent the past,
things can get tricky! As the Nobel laureate in Physics Niels Bohr famously quipped, "Prediction is very difficult,
especially if it's about the future."

Indeed, when you see datapoints fluctuating in a time series, they may appear to be rhythmically dancing in a
predictable pattern – at least in the best-case scenarios. Like a dancer moving to a beat, every repetitive movement
(or frequency) can be attributed to seasonal patterns, while a gradual change in volume (or amplitude) is attributed
to an equally predictable trend. The dance is inevitably misleading because there are always missing pieces of the
puzzle that slightly shift the data points, such as a delay in a supplier's supply chain causing an unexpected dent in
today's sales figures. To make matters worse, there are also unforeseen catastrophic once-in-a-decade, once-in-a-
generation, or, simply, once-ever events that can radically make the somewhat understood movement of a time
series unrecognizable, similar to a ballroom dancer having a seizure. For instance, in 2020, sales forecasts
everywhere, either for better or worse, were rendered useless by COVID-19!

We could call this an extreme outlier event, but we must recognize that models weren't built to predict these
momentous events because they were trained on almost entirely likely occurrences. Not predicting these unlikely
yet most consequential events is why we shouldn't place so much trust in forecasting models to begin with,
especially without discussing certainty or confidence bounds.

This chapter will examine a multivariate forecasting problem with Long Short-Term Memory (LSTM) models.
We will first assess the models with traditional interpretation methods, followed by the Integrated Gradients
method we learned about in Chapter 7, Visualizing Convolutional Neural Networks, to generate our model's local
attributions. But more importantly, we will understand the LSTM's learning process and limitations better. We will
then employ a prediction approximator method and SHAP's KernelExplainer for both global and local
interpretation. Lastly, forecasting and uncertainty are intrinsically linked, and Sensitivity Analysis is a family of
methods designed to measure the uncertainty of the model's output in relation to its input, so it's very useful in
forecasting scenarios. We will also study two such methods: Morris for factor prioritization and Sobol for factor
fixing, which involves cost sensitivity.

The following are the main topics we are going to cover:

Assessing time series models with traditional interpretation methods
Generating LSTM attributions with integrated gradients
Computing global and local attributions with SHAP's KernelExplainer
Identifying influential features with factor prioritization
Quantifying uncertainty and cost sensitivity with factor fixing

Technical requirements
This chapter's example uses the mldatasets , pandas , numpy , sklearn , tensorflow , matplotlib , seaborn ,
alibi , distython , shap , and SALib libraries. Instructions on how to install all these libraries can be found in
this book's preface. The code for this chapter is located here: https://github.com/PacktPublishing/Interpretable-
Machine-Learning-with-Python/tree/master/Chapter09.

The mission
Highway traffic congestion is a problem that's affecting cities across the world. As vehicles per capita steadily increase across the developing world without enough road and parking infrastructure to keep up, congestion has been rising to alarming levels. In the United States, the vehicle-per-capita statistic is among the highest in the world (838 per 1,000 people for 2019). For this reason, US cities represent 62 out of the 381 cities worldwide with at least a 15% congestion level.

Minneapolis is one such city (see the following screenshot) where that threshold was recently surpassed and keeps rising. To put this metropolitan area into context, congestion levels are extremely severe above 50%, but moderate congestion (15 – 25%) is already a warning sign of worse congestion to come. It's challenging to reverse congestion once it reaches 25% because any infrastructure improvement will be costly to implement without disrupting traffic even further. One of the worst congestion points is between the twin cities of Minneapolis and St. Paul along the Interstate 94 (I-94) highway, which congests alternate routes as commuters try to cut down on travel time. Knowing this, the mayors of both cities have obtained some federal funding to expand the highway:

Figure 9.1 – TomTom's 2019 traffic index for Minneapolis

The mayors want to be able to tout a completed expansion as a joint accomplishment to get reelected for a second
term. However, they are well aware that a noisy, dirty, and obstructive expansion can be a big nuisance for
commuters, so the construction project could backfire politically if it's not made nearly invisible. Therefore, they
have stipulated that the construction company prefabricate as much as possible elsewhere and assemble only
during low-volume hours, defined as having fewer than 1,500 vehicles per hour. They can also only work on one direction of the highway at a time and block no more than half of its lanes while working on it. To ensure compliance with these stipulations, they will fine the company, at a rate of $15 per vehicle, if it is blocking any portion of the highway (even the permitted quarter) at any time that volume is above this threshold.

In addition to that, if the highway exceeds half-capacity while the construction crew are on-site, it will cost them
$5,000 a day. To put this into perspective, blocking during a typical peak hour could cost them $67,000 per hour,
plus the $5,000 daily fee! The local authorities will use Automated Traffic Recorder (ATR) stations along the
route to monitor traffic volume, as well as local traffic police to register when lanes are getting blocked for
construction.
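As a quick sanity check on that hourly figure, here is a back-of-the-envelope calculation; the roughly 4,500 vehicles in a typical peak hour is an assumed, illustrative number rather than one taken from the dataset:
# Hypothetical back-of-the-envelope estimate of the potential hourly fine
# (the ~4,500 vehicles/hour peak volume is an assumption for illustration)
peak_hour_volume = 4500        # approximate vehicles during a typical peak hour
fine_per_vehicle = 15          # $15 fine per vehicle above the threshold
daily_half_capacity_fee = 5000 # flat daily fee if half-capacity is exceeded
print(peak_hour_volume * fine_per_vehicle)   # roughly $67,500 for one blocked peak hour
print(peak_hour_volume * fine_per_vehicle + daily_half_capacity_fee)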
It's been planned as a 2-year construction project; the first year will expand the westbound lanes on the I-94 route,
while the second will expand the eastbound lanes. The on-site portion of the construction will only occur from
May through October because snow is less likely to delay construction during these months. Throughout the rest of
the year, they will focus on prefabrication. They will attempt to work weekdays only because the workers' union negotiated generous overtime pay for weekends. Therefore, weekend construction will happen only if there are significant delays. However, the union agreed to work holidays from May through October at the same rate.

The construction company doesn't want to take any risks! Therefore, they need a model to predict traffic for the I-
94 route and, more importantly, to understand what factors create uncertainty and possibly increase costs. They
have hired a machine learning expert to do this: you!

The ATR data provided by the construction company includes hourly traffic volumes up to September 2018, as
well as weather data at the same timescale. It only consists of the westbound lanes because that expansion will
come first. Also, since 2015, congestion has become considerably worse during peak hours, which has become the
new normal for commuters. Therefore, they are only interested in training the model with 3 years' worth of data.

The approach
You have trained a Stateful Bidirectional LSTM model with almost four years' worth of data (October 2012 –
September 2016). You reserved the last year for testing (September 2017 – 2018) and the year prior to that for validation (September 2016 – 2017). This made sense because the combined testing and validation datasets align
well with the highway expansion project's expected conditions (March – November). You wondered about using
other splitting schemes that leveraged only the data representative of these conditions, but you didn't want to
reduce the training data so drastically, and maybe they might need it for winter predictions after all. A look-back
window defines how much past data a time series model has access to. You chose 168 hours (1 week) as the look-
back window size. Given the stateful nature of the model, as the model moves forward in the training data, it can
learn daily and weekly seasonality, as well as some trends and patterns that can only be observed across several
weeks. You also trained two other models. You have outlined the following steps to meet the client's expectations:

1. With RMSE, regression plots, confusion matrices, and much more, you will assess the model's predictive performance and, more importantly, how the error is distributed.
2. With Integrated Gradients, you will understand if you took the best modeling strategy since it can help you
visualize each of the model's pathways to a decision, and help you choose a model based on that.
3. With SHAP's KernelExplainer and a prediction approximation method, you will derive both a global and local
understanding of what features matter to the chosen model.
4. With Morris Sensitivity Analysis, you will perform factor prioritization, which ranks factors (in other words, features) by how much they can drive output variability.
5. With Sobol Sensitivity Analysis, you will perform factor fixing, which helps determine which factors aren't influential. It does this by quantifying the input factors' contributions, and their interactions, to the output's variability. With this, you can understand which uncertain factors may have the most effect on potential fines and costs, thus producing a variance-based cost-sensitivity analysis.

The preparations
You can find the code for this example here: https://github.com/PacktPublishing/Interpretable-Machine-Learning-
with-Python-2E/blob/main/Chapter09/Traffic_compact1.ipynb.

Loading the libraries

To run this example, you will need to install the following libraries:

mldatasets to load the dataset
pandas and numpy to manipulate the dataset
tensorflow to load the model
sklearn (scikit-learn), matplotlib , seaborn , alibi , distython , shap , and SALib to create and
visualize the interpretations

You should load all of them first:


import math
import os
import mldatasets
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator
from keras.utils.data_utils import get_file
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm
import seaborn as sns
from alibi.explainers import IntegratedGradients
from distython import HEOM
import shap
from SALib.sample import morris as ms
from SALib.analyze import morris as ma
from SALib.plotting import morris as mp
from SALib.sample.saltelli import sample as ss
from SALib.analyze.sobol import analyze as sa
from SALib.plotting.bar import plot as barplot

Let's check that TensorFlow has loaded the right version by using the print(tf.__version__) command. It
should be 2.0 or above.

Understanding and preparing the data

In the following snippet, we are loading the data into a DataFrame called traffic_df . Please note that the
prepare=True parameter is important because it performs necessary tasks such as subsetting it to the required
timeframe since October 2015, some interpolation, correcting holidays, and performing one-hot encoding:
traffic_df = mldatasets.load("traffic-volume-v2", prepare=True)

There should be over 52,000 records and 16 columns. We can verify this with traffic_df.info() . The output
should check out. All the features are numeric and have no missing values, and the categorical features have
already been one-hot encoded for us.

The data dictionary

There are only nine features, but they become 16 columns because of categorical encoding:

dow : Ordinal; day of the week starting with Monday (between 0 and 6)
hr : Ordinal; hour of the day (between 0 and 23)
temp : Continuous; average temperature in Celsius (between -30 and 37)
rain_1h : Continuous; mm of rainfall that occurred in the hour (between 0 and 21)
snow_1h : Continuous; cm of snow (when converted to liquid form) that occurred in the hour (between 0 and 2.5)
cloud_coverage : Continuous; percentage of cloud coverage (between 0 and 100)
is_holiday : Binary; is the day a national or state holiday that falls Monday – Friday (1 for yes, 0 for no)?
traffic_volume : Continuous; target; hourly traffic volume (number of vehicles)
weather : Categorical; a short description of the weather during that hour (Clear | Clouds | Haze | Mist | Rain | Snow | Unknown | Other)

Understanding the data


The first step in understanding a time series problem is understanding the target variable. This is because it
determines how you approach everything else, from data preparation to modeling. The target variable is likely to
have a special relationship with time, such as a seasonal movement or a trend.
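As a quick sanity check of that relationship, we can look at the autocorrelation of the target at a daily and a weekly lag; values close to 1 would suggest strong daily and weekly seasonality. This is only a minimal sketch, assuming traffic_df has been loaded as shown previously:
# Quick seasonality check (sketch): autocorrelation at a daily and a weekly lag
print(traffic_df['traffic_volume'].autocorr(lag=24))   # daily seasonality
print(traffic_df['traffic_volume'].autocorr(lag=168))  # weekly seasonality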

Understanding weeks

First, we can sample one 168-hour period from every season to understand the variance a bit better between days
of the week, and then get an idea of how they could vary across seasons and holidays:
lb = 168
fig, (ax0,ax1,ax2,ax3) = plt.subplots(4,1, figsize=(15,8))
plt.subplots_adjust(top = 0.99, bottom=0.01, hspace=0.4)
traffic_df[(lb*160):(lb*161)].traffic_volume.plot(ax=ax0)
traffic_df[(lb*173):(lb*174)].traffic_volume.plot(ax=ax1)
traffic_df[(lb*186):(lb*187)].traffic_volume.plot(ax=ax2)
traffic_df[(lb*199):(lb*200)].traffic_volume.plot(ax=ax3)

The preceding code generates the plots shown in the following image. If you read them from left to right, you'll see
that they all start with Wednesday and end with Tuesday of the following week. Every day of the week starts and
ends at a low point, with a high point in-between. Weekdays tend to have two peaks corresponding to morning and
afternoon rush hour, while weekends only have one mid-afternoon bump:
Figure 9.2 – Several sample weekly periods for traffic_volume representing each season

There are some major inconsistencies, such as Saturday, October 31, which is Halloween and not an official holiday. Also, February 2 (a Tuesday) was the beginning of a severe snowstorm, and the period in the late summer is much more chaotic than the other sample weeks. It turns out that the State Fair was taking place during that period. Like Halloween, it's neither a federal nor a regional holiday, but it's important to note that the fairgrounds are
located halfway between Minneapolis and St. Paul. You'll also notice that on Friday July 29, there's a midnight
bump in traffic, which can be attributed to this being a big day for Minneapolis concerts.

Trying to explain these inconsistencies while comparing periods in your time series is a good exercise as it helps
you figure out what variables to add to your model, or at least know what is missing. In our case, we know our
is_holiday variable doesn't include days such as Halloween or the entire State Fair week, nor do we have a
variable for big music or sporting events. To produce a more robust model, it would be advisable to look for
reliable external data sources and add more features that cover all these possibilities, not to mention validate the
existing variables. For now, we will work with what we've got.

Understanding days
It is crucial for the highway expansion project to understand what traffic looks like for the average workday. The
construction crew will be working on weekdays only (Monday – Friday) unless they experience delays, in which
case they will also work weekends. We must also make a distinction between holidays and other weekdays because
these are likely to be different.

To this end, we will create a DataFrame ( weekend_df ) and engineer a new column ( type_of_day ) that codes
hours as being part of a "Holiday," "Weekday," or "Weekend." Then, we can group by this column and the hr
column, and aggregate with the mean and standard deviation ( std ). We can then pivot so that we have one column with the average and another with the standard deviation of traffic volume for every type_of_day category, where the rows represent the hours of the day ( hr ). Then, we can plot the resulting DataFrame. We can create intervals with the
standard deviations:
weekend_df =\
traffic_df[['hr', 'dow', 'is_holiday', 'traffic_volume']].copy()
weekend_df['type_of_day'] = np.where(weekend_df.is_holiday == 1,\
'Holiday', np.where(weekend_df.dow >= 5, 'Weekend', 'Weekday'))
weekend_df = weekend_df.groupby(['type_of_day','hr'])\
['traffic_volume'].agg(['mean','std']).\
reset_index().pivot(index='hr', columns='type_of_day',\
values=['mean', 'std'])
weekend_df.columns = ['_'.join(col).strip().replace('mean_','')
                      for col in weekend_df.columns.values]
fig, ax = plt.subplots(figsize=(15,8))
weekend_df[['Holiday','Weekday','Weekend']].plot(ax=ax)
plt.fill_between(weekend_df.index,\
np.maximum(weekend_df.Weekday - 2 * weekend_df.std_Weekday, 0),\
weekend_df.Weekday + 2 * weekend_df.std_Weekday,\
color='darkorange', alpha=0.2)
plt.fill_between(weekend_df.index,\
np.maximum(weekend_df.Weekend - 2 * weekend_df.std_Weekend, 0),\
weekend_df.Weekend + 2 * weekend_df.std_Weekend,
color='green', alpha=0.1)
plt.fill_between(weekend_df.index,\
np.maximum(weekend_df.Holiday - 2 * weekend_df.std_Holiday, 0),\
weekend_df.Holiday + 2 * weekend_df.std_Holiday,
color='cornflowerblue', alpha=0.1)
ax.axhline(y=5300, linewidth=3, color='red', dashes=(2,2))
ax.axhline(y=2650, linewidth=2, color='darkviolet', dashes=(2,2))
ax.axhline(y=1500, linewidth=2, color='teal', dashes=(2,2))

The preceding snippet results in the following plot. It represents the hourly average, but there's quite a bit of
variation, which is why the construction company is proceeding with caution. There are horizontal lines that have
been plotted representing each of the thresholds:

5,300 for full capacity.
2,650 for half-capacity, after which the construction company will be fined the daily amount specified.
1,500 for the no-construction threshold, after which the construction company will be fined the per-vehicle amount specified.

They only want to work Monday – Friday during the hours that are typically below the 1,500 threshold. These hours would be from 11 p.m. (the day before) to 5 a.m. If they had to work weekends, this schedule would typically be
delayed until 1 a.m. and end at 6 a.m. There's considerably less variance during weekdays, so it's understandable
why the construction company is adamant about only working weekdays. During these hours, holidays appear to
be similar to weekends, but holidays tend to vary even more than weekends, which is potentially even more
problematic:
Figure 9.3 – The average hourly traffic volume for holidays, weekdays, and weekends, with intervals

Usually, for a project like this, you would explore the predictor variables to the extent we have done with the
target. This book is about model interpretation, so we will learn about the predictors by interpreting the models.
But before we get to the models, we must prepare the data for them.

Data preparation

The first data preparation step is to split the data into train, validation, and test sets. Please note that the test dataset comprises the last 13 weeks (2,184 hours), while the validation dataset comprises the 13 weeks before that, so it starts 4,368 hours and ends 2,184 hours before the last row of the DataFrame:
train = traffic_df[:-4368]
valid = traffic_df[-4368:-2184]
test = traffic_df[-2184:]

Now that the DataFrame has been split, we can plot it to ensure that its parts are split as intended. We can do so
with the following code:
plt.plot(train.index.values, train.traffic_volume.values,
label='train')
plt.plot(valid.index.values, valid.traffic_volume.values,
label='validation')
plt.plot(test.index.values, test.traffic_volume.values,
label='test')
plt.ylabel('Traffic Volume')
plt.legend()
The preceding code produces Figure 9.4. It shows that the vast majority of the data was allocated to the training dataset, with equally sized validation and test sets at the end. We won't reference the validation dataset from this point on during this
exercise because it was only instrumental during training to assess the model's predictive performance after every
epoch.

Figure 9.4 – Time series split into train, validation, and test sets

The next step is to min-max normalize the data. We are doing this because larger values lead to slower learning for
all neural networks in general and LSTMs are very prone to exploding and vanishing gradients. Relatively
uniform and small numbers can help counter these problems. We will discuss this later in this chapter, but
basically, the network becomes either numerically unstable or ineffective at reaching a global minimum.
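For reference, min-max normalization rescales each variable x into the [0, 1] range using its observed minimum and maximum:
$$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$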

We can min-max normalize with MinMaxScaler from the scikit-learn package. For now, all we will do is fit the
scaler so that we can use them whenever we need them. We will create a scaler for our target ( traffic_volume )
called y_scaler and another for the rest of the variables ( X_scaler ) with the entire dataset, so that
transformations are consistent no matter what part you are using, be it train , valid , or test . All the fit
process does is save the formula to make each variable fit between zero and one:
y_scaler = MinMaxScaler()
y_scaler.fit(traffic_df[['traffic_volume']])
X_scaler = MinMaxScaler()
X_scaler.fit(traffic_df.drop(['traffic_volume'], axis=1))

Now, we will transform both our train and test datasets with our scaler, creating y and X pairs for each:
y_train = y_scaler.transform(train[['traffic_volume']])
X_train = X_scaler.transform(train.drop(['traffic_volume'], axis=1))
y_test = y_scaler.transform(test[['traffic_volume']])
X_test = X_scaler.transform(test.drop(['traffic_volume'], axis=1))

However, for a time series model, the y and X pairs we created aren't directly useful because each observation is a timestep, and the model needs not only the variables for that timestep but also those of the previous timesteps, going a certain number of lags backward. Therefore, you have to generate an array for every timestep, as well as its lags.
Fortunately, keras has a function called TimeseriesGenerator that takes your X and y and produces a generator
that feeds the data to your model. You must specify a certain length , which is the number of lagging timesteps
(also known as the lookback window). The default batch_size is one, but we are using 24 because the client
prefers to get forecasts 24 hours at a time, and also training and inference are much faster with a larger batch size.

Naturally, when you need to forecast tomorrow, you will need tomorrow's weather, but you can complete the
timesteps with weather forecasts:
gen_train = TimeseriesGenerator(X_train, y_train, length=lb,\
batch_size=24)
gen_test = TimeseriesGenerator(X_test, y_test, length=lb,
batch_size=24)
print("gen_train:%s×%s→%s" % (len(gen_train),
gen_train[0][0].shape,
gen_train[0][1].shape))
print("gen_test:%s×%s→%s" % (len(gen_test),
gen_test[0][0].shape,
gen_test[0][1].shape))

The preceding snippet outputs the dimensions of the training generator ( gen_train ) and the testing generator
( gen_test ), which use a length of 168 hours and a batch size of 24:
gen_train: 1454 × (24, 168, 15) → (24, 1)
gen_test: 357 × (24, 168, 15) → (24, 1)

Any model that was trained with a 1-week look-back window and 24 hour batch size will need this generator. Each
generator is a list of tuples corresponding to each batch. Index 0 of this tuple is the X feature array, while index 1
is the y label array. Therefore, the first number output is the length of the list, which is the number of batches. The
dimensions of the X and y array follow. For instance, gen_train has 1454 batches, and each batch has 24
timesteps, with a length of 168 and 15 features. The shape of the predicted labels expected from these 24 timesteps
is (24,1) .

Lastly, before moving forward with handling models and stochastic interpretation methods, let's attempt to make
things more reproducible by initializing our random seeds:
rand = 9
os.environ['PYTHONHASHSEED']=str(rand)
tf.random.set_seed(rand)
np.random.seed(rand)

Loading the LSTM model

We can quickly load the model and output its summary like this:
model_name = 'LSTM_traffic_168_compact1.hdf5'
model_path = get_file(model_name,\
'https://github.com/PacktPublishing/Interpretable-Machine-Learning-with-Python-2E/blob/main/mo
lstm_traffic_mdl = keras.models.load_model(model_path)
lstm_traffic_mdl.summary()

As you can tell by the summary that's produced by the preceding snippet, the model starts with a Bidirectional
LSTM layer with an output of (24, 168) . 24 corresponds to the batch size, while 168 means that there's not one
but two 84-unit LSTMs going in opposite directions and meeting in the middle. It has a dropout of 10%, and then
a dense layer with a single ReLU-activated unit. The ReLU ensures that no prediction falls below zero since negative traffic volume makes no sense:
Model: "LSTM_traffic_168_compact1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
Bidir_LSTM (Bidirectional) (24, 168) 67200
_________________________________________________________________
Dropout (Dropout) (24, 168) 0
_________________________________________________________________
Dense (Dense) (24, 1) 169
=================================================================
Total params: 67,369
Trainable params: 67,369
Non-trainable params: 0
_________________________________________________________________

Now, let's assess the “LSTM_traffic_168_compact1” model using traditional interpretation methods.

Assessing time series models with traditional interpretation methods


A time series regression model can be evaluated as you would evaluate any regression model; that is, using metrics derived from the mean squared error or the R-squared score. There are, of course, cases in which you will need to use a metric based on medians, logs, deviances, or absolute values, but this model doesn't require any of those.
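As a reminder of what these metrics look like in code, here is a minimal, self-contained sketch using scikit-learn; the y_true and y_pred arrays are hypothetical stand-ins for observed and predicted traffic volumes:
import numpy as np
from sklearn import metrics
# Hypothetical observed and predicted traffic volumes (for illustration only)
y_true = np.array([450., 1200., 3100., 5200.])
y_pred = np.array([500., 1100., 2900., 4800.])
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_pred))  # root mean square error
r2 = metrics.r2_score(y_true, y_pred)                       # R-squared score
print('RMSE: %.1f  R-squared: %.3f' % (rmse, r2))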

Using standard regression metrics

The evaluate_reg_mdl function can evaluate the model, output some standard regression metrics, and plot them. The parameters for this function are the fitted model ( lstm_traffic_mdl ), X_train ( gen_train ), X_test ( gen_test ), y_train , and y_test .

Optionally, we can specify a y_scaler so that the model is evaluated with the labels inverse transformed, which
makes the plot and root mean square error (RMSE) much easier to interpret. Another optional parameter that is
very much necessary, in this case, is y_truncate=True because our y_train and y_test are of larger
dimensions than the predicted labels. This discrepancy happens because the first prediction occurs several
timesteps after the first timestep in the dataset due to the look-back window. Therefore, we would need to deduct
these timesteps from y_train in order to match the length of gen_train .
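Conceptually, this adjustment just drops the first lb (168) labels, since those timesteps only serve as the look-back window and have no corresponding predictions. A rough sketch of what y_truncate=True amounts to (illustrative variable names; evaluate_reg_mdl handles this for us):
# Rough sketch of the alignment performed when y_truncate=True (illustrative only)
y_train_aligned = y_train[lb:]
y_test_aligned = y_test[lb:]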

We will now evaluate the model with the following code. To observe the prediction's progress as it happens, we
will use predopts={"verbose":1} .
y_train_pred, y_test_pred, y_train, y_test =\
mldatasets.evaluate_reg_mdl(lstm_traffic_mdl, gen_train,\
gen_test, y_train, y_test, scaler=y_scaler,
y_truncate=True, predopts={"verbose":1})

The preceding snippet produced the plot and metrics shown in Figure 9.5. The "regression plot" is, essentially, a
scatter plot of the observed versus predicted traffic volumes, fitted to a linear regression model to show how well
they match. These plots show that the model tends to predict zero traffic when it's substantially higher. Besides that, there are a number of extreme outliers, but the model fits relatively well, with a test RMSE of 430 and an only slightly better train RMSE:
Figure 9.5 – Predictive performance evaluations for the “LSTM_traffic_168_compact1” model

We can also evaluate the model by comparing observed versus predicted traffic. It would be helpful to break down
the error by the hour and type of day too. To this end, we can create a DataFrame with these values. But first, we must truncate the test DataFrame ( -y_test_pred.shape[0] ) so that it matches the length of the predictions array, and we won't need all the columns, so we are providing indexes for only those we are interested in: traffic_volume is #7, but we will also want dow (#0), hr (#1), and is_holiday (#6). We will rename
traffic_volume to actual_traffic and create a new column called predicted_traffic with our predictions.
Then, we will engineer a type_of_day column, as we did previously, which tells us if it's a holiday, weekday, or
weekend. Finally, we can drop the dow and is_holiday columns since we won't need them:
evaluate_df = test.iloc[-y_test_pred.shape[0]:,[0,1,6,7]].\
rename(columns={'traffic_volume':'actual_traffic'})
evaluate_df['predicted_traffic'] = y_test_pred
evaluate_df['type_of_day'] =\
np.where(evaluate_df.is_holiday == 1, 'Holiday',\
np.where(evaluate_df.dow >= 5, 'Weekend', 'Weekday'))
evaluate_df.drop(['dow','is_holiday'], axis=1, inplace=True)
You can quickly review the contents of the DataFrame by simply running a cell with evaluate_df . It should have four columns.

Predictive error aggregations

It may be that some days and times of day are more prone to predictive errors. To get a better sense of how these
errors are distributed across time, we can plot RMSE on an hourly basis segmented by type_of_day . To do this,
we must first define an rmse function and then group the evaluated DataFrame by type_of_day
and hr and use the apply function to aggregate using the rmse function. We can then pivot to ensure that each
type_of_day has a column with the RMSEs on an hourly basis. We can then average these columns and store
them in a Series :
def rmse(g):
    rmse = np.sqrt(metrics.mean_squared_error(g['actual_traffic'],
                                              g['predicted_traffic']))
    return pd.Series({'rmse': rmse})
evaluate_by_hr_df = evaluate_df.\
    groupby(['type_of_day', 'hr']).apply(rmse).reset_index().\
    pivot(index='hr', columns='type_of_day', values='rmse')
mean_by_daytype_s = evaluate_by_hr_df.mean(axis=0)

Now that we have a DataFrame with the hourly RMSEs for holidays, weekdays, and weekends, as well as the average for these "types" of day, we can plot them using the evaluate_by_hr_df DataFrame. We will also create dotted horizontal lines with the averages for each type_of_day from the mean_by_daytype_s pandas Series:
evaluate_by_hr_df.plot()
ax = plt.gca()
ax.set_title('Hourly RMSE distribution', fontsize=16)
ax.set_ylim([0,2500])
ax.axhline(y=mean_by_daytype_s.Holiday, linewidth=2,
color='cornflowerblue', dashes=(2,2))
ax.axhline(y=mean_by_daytype_s.Weekday, linewidth=2,
color='darkorange', dashes=(2,2))
ax.axhline(y=mean_by_daytype_s.Weekend, linewidth=2,
color='green', dashes=(2,2))

The preceding code generated the plot shown in Figure 9.6. As we can see, the model has a high RMSE for holidays. However, the model could be overestimating the traffic volume, and overestimating is not as bad as underestimating in this particular use case.
Figure 9.6 – Hourly RMSE segmented by type_of_day for the “LSTM_traffic_168_compact1” model

Evaluating it like a classification problem

Indeed, just as classification problems can have false positives and false negatives, where one is often more costly than the other, you can frame any regression problem with concepts such as underestimation and overestimation. This framing is especially useful when one is more costly than the other. If you have clearly defined thresholds, as we have for this project, you can evaluate any regression problem as you would a classification one. We will assess it with confusion matrices for the half-capacity and no-construction thresholds.
To accomplish this, we can use np.where to get binary arrays for when the actuals and predictions surpassed each
threshold. We can then use the compare_confusion_matrices function to compare the confusion matrices for the
model:
actual_over_half_cap = np.where(evaluate_df['actual_traffic'] >\
2650, 1, 0)
pred_over_half_cap = np.where(evaluate_df['predicted_traffic'] >\
2650, 1, 0)
actual_over_nc_thresh = np.where(evaluate_df['actual_traffic'] >\
1500, 1, 0)
pred_over_nc_thresh = np.where(evaluate_df['predicted_traffic'] >\
1500, 1, 0)
mldatasets.\
compare_confusion_matrices(actual_over_half_cap,\
pred_over_half_cap, actual_over_nc_thresh, pred_over_nc_thresh,\
'Over Half-Capacity', 'Over No-Construction Threshold')

The preceding snippet produced the confusion matrices shown in Figure 9.7.
Figure 9.7 – Confusion matrices for going over half and the no-construction threshold for the
“LSTM_traffic_168_compact1” model

We are most interested in the percentage of false negatives (bottom left quadrant) because predicting no traffic
beyond the threshold when, in fact, it did rise above it, will lead to a steep fine. On the other hand, the cost of false
positives is in preemptively leaving the construction site when traffic didn't rise above the threshold after all. It's
better to be safe than sorry, though! If you compare the false negatives for the "no-construction" threshold (0.85%), they amount to less than a third of those for the half-capacity threshold (3.08%). Ultimately, what matters most is the no-construction threshold because the idea is to stop construction before traffic gets close to half-capacity.

Now that we have leveraged traditional methods to understand the model's decisions, let's move on to some more
advanced model-agnostic methods.

Generating LSTM attributions with integrated gradients


We first learned about integrated gradients (IG) in Chapter 7, Visualizing Convolutional Neural Networks. Unlike the other gradient-based attribution methods studied in that chapter, path-integrated gradients is not contingent on convolutional layers, nor is it limited to classification problems. In fact, since it computes the gradients of the output with respect to the inputs, averaged along a path, the input and output could be anything! It is common to use integrated gradients with CNNs and Recurrent Neural Networks (RNNs), like the one we are interpreting in this chapter. Typically, when you see an IG LSTM example online, it has an embedding layer and is an NLP classifier, but IG can be used very effectively for LSTMs that process other kinds of data too, even sounds or genetic data!
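As a refresher, for an input x, a baseline x′, and a model F, the attribution that integrated gradients assigns to the i-th input is the path integral of the gradients along the straight line from x′ to x, which is approximated in practice with a fixed number of steps (the n_steps parameter we will set shortly):
$$IG_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha$$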

The integrated gradient explainer and the explainers that we will use moving forward can access any part of the
traffic dataset. First, let's create a generator for all of it:
y_all = y_scaler.transform(traffic_df[['traffic_volume']])
X_all = X_scaler.transform(traffic_df.drop(['traffic_volume'],\
axis=1))
gen_all = TimeseriesGenerator(X_all, y_all, length=lb,\
batch_size=24)
Integrated gradients is a local interpretation method, so let's get a few sample "instances of interest" we can
interpret. We know holidays are a concern, so let's see if our model picks up on the importance of is_holiday for
one example ( holiday_afternoon_s ). Also, mornings are a concern, especially mornings with a larger than
average rush hour because of weather conditions, so we have one example for that ( peak_morning_s ). Lastly, a
hot day might have more traffic, especially on a weekend ( hot_saturday_s ):
X_df = traffic_df.drop(['traffic_volume'], axis=1).\
reset_index(drop=True)
holiday_afternoon_s = X_df[(X_df.index >= 43800) & (X_df.dow==0) &\
(X_df.hr==16) &(X_df.is_holiday==1)].tail(1)
peak_morning_s = X_df[(X_df.index >= 43800) & (X_df.dow==2) &\
(X_df.hr==8) & (X_df.weather_Clouds==1) &\
(X_df.temp<20)].tail(1)
hot_saturday_s = X_df[(X_df.index >= 43800) & (X_df.dow==5) &\
(X_df.hr==12) & (X_df.temp>29) &\
(X_df.weather_Clear==1)].tail(1)

Now that we have created some instances, let's instantiate our explainer. IntegratedGradients from the alibi
package only requires a deep learning model, but it is recommended to set a number of steps ( n_steps ) for the
integral approximation and internal_batch_size . We will instantiate an explainer for our model:
ig = IntegratedGradients(lstm_traffic_mdl,
n_steps=25, internal_batch_size=24)

Before we iterate over our samples, it is important to understand how we need to input a sample into the explainer, because it will need a batch of 24. To this end, we will have to get the index of the sample once we've deducted the lookback window ( nidx ). Then, you can obtain the batch for this sample from the generator ( gen_all ). Each batch includes 24 timesteps, so you floor-divide nidx by 24 ( nidx//24 ) to get the batch's position for
that sample. Once you've got the batch for the sample ( batch_X ) and printed the shape (24, 168, 15) , it
shouldn't surprise you that the first number is 24. Of course, we will need to get the index of the sample within the
batch ( nidx%24 ) to obtain the data for that sample:
nidx = holiday_afternoon_s.index.tolist()[0] - lb
batch_X = gen_all[nidx//24][0]
print(batch_X.shape)

The for loop will use the previously explained method to locate the batch for the sample ( batch_X ). This batch_X is inputted into the explain function with target=None because this is a regression problem and there's no target class. Once the explanation is produced, the attributions property will have the attributions for the entire batch. We then obtain the attributions for just our sample and transpose them to produce an image that has this shape: (15, lb) . The rest of the code in the for loop simply obtains the labels to use in the tick marks
and then plots an image stretched out to fit the dimensions of our figure , along with its labels:
samples = [holiday_afternoon_s, peak_morning_s, hot_saturday_s]
sample_names = ['Holiday Afternoon', 'Peak Morning', 'Hot Saturday']
for s in range(len(samples)):
    nidx = samples[s].index.tolist()[0] - lb
    batch_X = gen_all[nidx//24][0]
    explanation = ig.explain(batch_X, target=None)
    attributions = explanation.attributions[0]
    attribution_img = np.transpose(attributions[nidx%24,:,:])

    end_date = traffic_df.iloc[samples[s].index].\
        index.to_pydatetime()[0]
    date_range = pd.date_range(end=end_date, periods=8,\
        freq='1D').to_pydatetime().tolist()
    columns = samples[s].columns.tolist()
    plt.title('Integrated Gradient Attribution Map for "{}"'.\
        format(sample_names[s]), fontsize=16)
    divnorm = TwoSlopeNorm(vmin=attribution_img.min(), vcenter=0,\
        vmax=attribution_img.max())
    plt.imshow(attribution_img, interpolation='nearest',\
        aspect='auto', cmap='coolwarm_r', norm=divnorm)
    plt.xticks(np.linspace(0,lb,8).astype(int), labels=date_range)
    plt.yticks([*range(15)], labels=columns)
    plt.colorbar(pad=0.01, fraction=0.02, anchor=(1.0,0.0))
    plt.show()

The preceding code will generate the plots shown in Figure 9.8. On the y-axis, you can see the variable names,
while on the x-axis, you can see the dates corresponding to the lookback window for the sample in question. The
rightmost part of the x-axis is the sample's date, and as you move left, you go backward in time. For instance, the
holiday afternoon sample was 4 p.m. on September 3, and there is 1 week's worth of lookback, so each tick mark backward is a day before that date.

Figure 9.8 – Annotated integrated gradients attribution map for all samples for the
“LSTM_traffic_168_compact1” model

You can tell by the intensity in the attribution maps in Figure 9.8 which hour/variables mattered for the prediction.
The colorbar to the right of each attribution map can serve as a key. Negative numbers in red correspond to
negative correlation, while positive numbers in blue correspond to positive correlation. However, something that is pretty evident is the tendency for intensities to fade as you go backward in time. Since the model is bidirectional, this happens from both ends. What is surprising is how fast this happens.
Let's start from the bottom. For "Hot Saturday", the day of week, hour, temperature, and clear weather play an increasingly important role in this prediction as you get closer to the predicted time (midday Saturday). The day started cooler, which explains why there's a patch of red before the blue in the temperature feature.

As for "Peak Morning", the attributions make sense since the weather is clear after having been rainy and cloudy, which caused the rush hour to peak quickly rather than build up slowly. To a certain degree, the LSTM has learned
that only recent weather matters – no more than 2 or 3 days' worth. However, that is not the only reason the
integrated gradients fade. They also fade because of the vanishing gradient problem. This problem occurs during
backpropagation because the gradient values are multiplied by the weight matrices in each step, so gradients can
exponentially decrease to zero.
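To get a feel for how quickly this compounds over a 168-step look-back window, consider a toy illustration; the 0.9 per-step factor is just an assumed stand-in for whatever each backpropagation step contributes:
# Toy illustration of the vanishing gradient problem (the 0.9 factor is assumed):
# if every backpropagation step scales the gradient by roughly 0.9, very little
# signal reaches the oldest timesteps of a 168-step sequence
per_step_factor = 0.9
steps = 168
print(per_step_factor ** steps)  # approximately 2e-8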

When LSTMs process very long sequences, the network becomes ever more ineffective at capturing long-term dependencies. Fortunately, these LSTMs are stateful, which means they string batches together in a sequence by leveraging states from the previous batch. Statefulness enables learning from a long sequence, despite vanishing gradients. This is why, when we observe the attribution map for "Holiday Afternoon," there are negative attributions for is_holiday , which makes sense because it anticipates no rush hour. It turns out that September 3 (Labor Day) is nearly 2 months after the previous holiday (Independence Day), which is a more festive holiday. Is it possible
that the model is picking up on these patterns?

We could try subcategorizing holidays by their traffic patterns to see if that helps the model identify them. We
could also make rolling aggregations of previous weather conditions to make it easier for the model to pick up on
recent weather patterns. Weather patterns span hours, so it is intuitive to aggregate, not to mention easier to
interpret. Interpretation methods can point us in the right direction as to how to improve models, and there's
certainly a lot of room for improvement.

Next, we will have a stab at a permutation-based method!

Computing global and local attributions with SHAP's KernelExplainer


Permutation methods make changes to the input to assess how much difference they make to a model's output. We first discussed this in Chapter 4, Global Model-Agnostic Interpretation Methods, but if you recall, there's a coalitional framework to perform these permutations that will produce the average marginal contribution for each feature across different coalitions of features. This process's outcome is Shapley values, which have essential mathematical properties such as additivity and symmetry. Unfortunately, Shapley values are costly to compute for datasets that aren't small, so the SHAP library has approximation methods. One of these methods is the KernelExplainer, which we also explained in Chapter 4 and used in Chapter 5, Local Model-Agnostic Interpretation Methods. It approximates the Shapley values with a weighted local linear regression, just like LIME does.
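For reference, the Shapley value of feature i is its marginal contribution averaged over all possible coalitions S drawn from the remaining features N \ {i}, where v(S) is the payoff (prediction) attributed to coalition S:
$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ v(S \cup \{i\}) - v(S) \right]$$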

Why use the KernelExplainer?

We have a deep learning model, so why aren't we using SHAP's DeepExplainer as we did with the CNN in
Chapter 7, Visualizing Convolutional Neural Networks? DeepExplainer adapted the DeepLIFT algorithm to approximate the Shapley values. It works very well with any feedforward network that's used for tabular data,
CNNs, and RNNs with an embedding layer, such as those used for an NLP classifier, or even to detect genomic
sequences. It gets trickier for multivariate time series because DeepExplainer doesn't know what to do with the
input's three-dimensional array. Even if it did, it includes data for previous timesteps, so you cannot permute one
timestep without considering the previous ones. For instance, if the permutation dictates that the temperature is 5
degrees lower, shouldn't that affect all the previous timesteps' temperatures up to a certain number of hours back? And
what if it's 20 degrees lower? Doesn't that mean it's likely in a different season with entirely different weather –
perhaps more clouds and snow as well?

SHAP's KernelExplainer can receive any arbitrary black box predict function. It also makes assumptions about the
input dimensions. Fortunately, we can change the input data before it permutes it, making it seem to the
KernelExplainer like it's dealing with a tabular dataset. The arbitrary predict function doesn't have to simply call
the model's predict function – it can change data both on the way in and on the way out!
Defining a strategy to get it to work with a multivariate time series model

To mimic likely past weather patterns based on the permutated input data, we could create a generative model or
something to that effect. This strategy would help us generate a variety of plausible past timesteps that fit the permutated timestep. Although this would likely lead to more accurate
predictions, we won't use this strategy because it's incredibly time-consuming.

Instead, we will find the time series data that best suits the permutated input with existing examples from our
gen_all generator. There are distance metrics we can use to find the one that is closest to the permutated input.
However, we must place some guardrails because if the permutation is for a Saturday at 5 a.m. with a temperature
of 27 degrees Celsius and 90 percent cloud coverage, the closest observation to this one could be on a Friday at
7a.m., but regardless of the weather traffic, it would be completely different. Therefore, we can implement a filter
function that ensures that it only finds closest observations for the same dow , is_holiday , and hr . The filter
function can also clean up the permutated sample to remove or modify anything nonsensical for the model, such as
a continuous value for a categorical feature:

Figure 9.9 – Permutation approximation strategy

The preceding diagram depicts the rest of the process where it uses a distance function to find the closest
observation to the modified permutated sample. This function returns the closest observation index, but the model
can't predict on singular observations (or timesteps), so it requires its past hourly history up to the lookback
window. For this reason, it retrieves the right batch from the generator and makes a prediction on that, but the
predictions will be on a different scale, so they need to be inverse transformed with y_scaler . Once the predict
function has iterated through all the samples, made a prediction for each, and rescaled them, it sends them back to
the KernelExplainer, which outputs their SHAP values.
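While the actual implementation lives in the mldatasets package, a simplified sketch helps make this pipeline concrete. The function and variable names below are illustrative and do not reflect the package's exact API; the real implementation also scales features before computing distances:
import numpy as np
def approx_predict_ts_sketch(X, X_df, gen, model, dist_metric,
                             lookback, filt_fn, X_scaler, y_scaler):
    # Illustrative sketch only - NOT the mldatasets.approx_predict_ts implementation
    preds = []
    for x in np.atleast_2d(X):
        # Constrain the candidates and clean up the permutated sample
        X_filt_df, x_ = filt_fn(X_df, x, lookback)
        # Find the closest real observation to the permutated sample
        dists = [dist_metric(row, x_) for row in X_filt_df.values]
        closest_idx = X_filt_df.index[int(np.argmin(dists))]
        # Retrieve the batch containing that observation and predict on it
        nidx = closest_idx - lookback
        batch_X = gen[nidx // 24][0]
        y_batch_pred = model.predict(batch_X)
        # Return the prediction for that timestep on the original scale
        preds.append(y_scaler.inverse_transform(y_batch_pred)[nidx % 24])
    return np.array(preds)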

Laying the groundwork for the permutation approximation strategy


You can define a custom filter function ( filt_fn ). It takes a pandas DataFrame with the entire dataset ( X_df )
you want to filter from, as well as the permutated sample ( x ) for filtering and the length of the lookback window.
The function can also modify the permutated sample. In this case, we have to do this because so many features of
the model are discrete, but the permutation process makes them continuous. As we mentioned previously, all the
filtering does is protect the distance function from finding a nonsensical closest sample to the permutated sample
by limiting the options:
def filt_fn(X_df, x, lookback):
    x_ = x.copy()
    # Round the discrete features: dow (0), hr (1), and is_holiday (6)
    x_[0] = round(x_[0])
    x_[1] = round(x_[1])
    x_[6] = round(x_[6])
    # Translate negative hours into the equivalent hour of the previous day
    if x_[1] < 0:
        x_[1] = 24 + x_[1]
        x_[0] = x_[0] - 1
    if x_[0] < 0:
        x_[0] = 7 + x_[0]
    # Keep only observations with the same dow, hr, and is_holiday, and a
    # temperature within 5 degrees of the permutated sample's temperature
    X_filt_df = X_df[(X_df.index >= lookback) & (X_df.dow==x_[0]) &\
                     (X_df.hr==x_[1]) & (X_df.is_holiday==x_[6]) &\
                     (X_df.temp-5<=x_[2]) & (X_df.temp+5>=x_[2])]
    return X_filt_df, x_

If you refer to Figure 9.9, after the filter function, the next thing we ought to define is the distance function. We
could use any standard distance function accepted by scipy.spatial.distance.cdist , such as "Euclidean,"
"cosine," or "Hamming." The problem with these standard distance functions is that they either work well with
continuous or discrete variables but not both. We have both in this dataset!

Fortunately, some alternatives exist that can handle both, such as the Heterogeneous Euclidean-Overlap Metric (HEOM) and the Heterogeneous Value Difference Metric (HVDM). Both methods apply a different distance metric depending on the nature of the variable. For continuous features, HEOM uses a range-normalized absolute difference, and for discrete features, it uses the "overlap" distance; that is, a distance of zero if the values are the same and one otherwise.

HVDM is more complicated because, for continuous variables, it uses the absolute difference between both values divided by four times the standard deviation of the feature in question (|a - b| / 4σ), which is a great distance metric for handling outliers. For discrete variables, it uses a normalized Value Difference Metric, which is based on the difference between the conditional probabilities of both values.
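For reference, HEOM combines per-feature distances d_j with a Euclidean aggregation, where range_j is the observed range of a continuous feature j:
$$d_j(a, b) = \begin{cases} \mathbb{1}[a \neq b] & \text{if } j \text{ is categorical} \\ \dfrac{|a - b|}{range_j} & \text{if } j \text{ is continuous} \end{cases} \qquad HEOM(x, y) = \sqrt{\sum_{j=1}^{m} d_j(x_j, y_j)^2}$$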

Even though HVDM is better than HEOM for datasets with many continuous values, it is overkill in this case.
Once the dataset has been filtered by day of week ( dow ) and hour ( hr ), the remaining discrete features are all
binary, so "overlap" distance is ideal, and for the four remaining continuous features ( temp , rain_1h , snow_1h , and cloud_coverage ), Euclidean distance should suffice. distython has an HEOM distance method, and all it
requires is a background dataset ( X_df.values ) and the indexes of the categorical features ( cat_idxs ). We can
programmatically identify these features with an np.where command. If you want to verify that these are the right
ones, run print(cat_idxs) in a cell. Only indexes 2, 3, 4, and 5 should be omitted:
cat_idxs = np.where(traffic_df.drop(['traffic_volume'],\
axis=1).dtypes != np.float64)[0]
heom_dist = HEOM(X_df.values, cat_idxs)
print(cat_idxs)

Now, we can create a lambda function that puts everything depicted in Figure 9.9 together. It leverages a function called approx_predict_ts that takes care of the entire pipeline. It takes our filter function ( filt_fn ), distance function ( heom_dist.heom ), generator ( gen_all ), and fitted model ( lstm_traffic_mdl ), and chains them together, as described in Figure 9.9. It also scales the data with our scalers ( X_scaler and y_scaler ). Distance is computed on transformed features for higher accuracy, and the predictions are inverse transformed on the way out:
predict_fn = lambda X: mldatasets.\
    approx_predict_ts(X, X_df, gen_all, lstm_traffic_mdl,\
                      dist_metric=heom_dist.heom, lookback=lb,\
                      filt_fn=filt_fn, X_scaler=X_scaler, y_scaler=y_scaler)

We can now use the prediction function with KernelExplainer , but it should be done on samples that are most
representative of the construction crew's expected working conditions; that is, they plan to work March through
November only, preferably on weekdays and low-traffic hours. To this end, let's create a DataFrame
( working_season_df ) that only includes these months and initializes a KernelExplainer with predict_fn and
the k-means of the DataFrame as background data:
working_season_df =\
traffic_df[lb:].drop(['traffic_volume'], axis=1).copy()
working_season_df =\
working_season_df[(working_season_df.index.month >= 3) &\
(working_season_df.index.month <= 11)]
explainer = shap.KernelExplainer(predict_fn,\
shap.kmeans(working_season_df.values, 24))

We can now produce SHAP values for a random set of observations of the working_season_df dataframe.

Computing the SHAP values

We will sample 48 observations from it. KernelExplainer is rather slow, especially when it's using our
approximation method. To get an optimal global interpretation, it is best to use a high number of observations but
also a high nsamples , which is the number of times we need to reevaluate the model when explaining each
prediction. Unfortunately, having 50 of each would cause the explainer to take many hours to run, depending on
your available compute, so we will use nsamples=10 . You can look at SHAP's progress bar and adjust it
accordingly. Once it's done, it will produce a feature importance summary_plot containing the SHAP values:
X_samp_df = working_season_df.sample(48, random_state=rand)
shap_values = explainer.shap_values(X_samp_df, nsamples=10)
shap.summary_plot(shap_values, X_samp_df)

The preceding code plots the summary shown in the following graph. Not surprisingly, hr and dow are the most important features, followed by some weather features. Strangely enough, temperature and rain don't seem to weigh in on the predictions, but perhaps they simply aren't significant factors from late spring through fall. Or maybe more observations and a higher nsamples would yield a better global interpretation:
Figure 9.10 – SHAP summary plot based on the SHAP values produced by 48 sampled observations

We can do the same with the instances of interest we chose in the previous section for local interpretations. Let's iterate through all these datapoints. For each one, we will produce its SHAP values, this time with nsamples=80 , and then generate a force_plot :
for s in range(len(samples)):
    print('Local Force Plot for "{}"'.format(sample_names[s]))
    shap_values_single = explainer.shap_values(samples[s],\
                                               nsamples=80)
    shap.force_plot(explainer.expected_value, shap_values_single[0],\
                    samples[s], matplotlib=True)
    plt.show()

The preceding code generates the plots shown in Figure 9.11. "Holiday afternoon" has the hour ( hr=16 ) pushing
toward a higher prediction, while the fact that it's a Monday ( dow=0 ) and a holiday ( is_holiday=1 ) is a driving
force in the opposite direction. On the other hand, "Peak Morning" is mostly peak due to the hour ( hr=8.0 ), but it
has a high cloud_coverage , affirmative weather_Clouds , and yet no rain ( rain_1h=0.0 ). Lastly, "Hot
Saturday" has the day of week ( dow=5 ) pushing for a lower value, but the abnormally high value is mostly due to
it being midday with no rain or clouds. Strangely, the higher-than-normal temperature is not one of the factors.

Figure 9.11 – Force plots generated with SHAP values using nsamples=80 for a Holiday Afternoon, Peak
Morning, and Hot Saturday

With SHAP's game theory-based approach, we can gauge how permutations of the existing observations marginally vary the predicted outcome across many possible coalitions of features. However, this approach can be
very limiting because our background data's existing variance shapes our understanding of outcome variance.

In the real world, variability is often determined by what is NOT represented in your data – but is still entirely plausible. For instance, reaching 25°C (77°F) before 5 a.m. in a Minneapolis summer is not a common occurrence,
but with global warming, it could become frequent, so we would want to simulate how it could impact traffic
patterns. Forecasting models are particularly prone to risk, so simulating is a crucial interpretation component to
assess this uncertainty. A better understanding of uncertainty can yield more robust models or directly inform
decisions. Next, we will discuss how we can produce simulations with sensitivity analysis methods.

Identifying influential features with factor prioritization


The Morris Method is one of several global sensitivity analysis methods that range from simpler Fractional
Factorial to complicated Monte Carlo Filtering. Morris sits somewhere in between on this spectrum and falls into two categories. It uses one-at-a-time sampling, which means that only one value changes between consecutive simulations. It's also an elementary effects (EE) method, which means that it doesn't quantify the exact effect of a factor on the model but rather gauges its importance and relationship with other factors. By the way, factor is just another word
for a feature or variable that's commonly used in applied statistics. To be consistent with the related theory, we will
use this word in this and the next section.

Another property of Morris is that it's less computationally expensive than the variance-based methods we will
study next. It can provide more insights than simpler and less costly methods such as regression-, derivative-, or
factorial-based ones. It can't quantify effects precisely but can identify factors with negligible effects or with interaction effects, making it an ideal method for screening a large number of factors at a reasonable computational cost. Screening is also known as
factor prioritization because it can prioritize your factors by how they are classified.

Computing Morris sensitivity indices

The Morris method derives a distribution of elementary effects for each individual factor. Each EE distribution has a mean (µ) and a standard deviation (σ). These two statistics are what help map the factors into different classifications. The mean could be negative when the model is non-monotonic, so a Morris method
variation adjusts for this with absolute values (µ*) so that it is more manageable to interpret. We will use this
variation here.
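Concretely, for a factor x_i, each elementary effect is computed from a one-at-a-time perturbation of size Δ, and the screening statistics are the mean of the absolute effects (µ*) and their standard deviation (σ) over the r trajectories:
$$EE_i = \frac{Y(x_1, \ldots, x_i + \Delta, \ldots, x_k) - Y(x_1, \ldots, x_k)}{\Delta}, \qquad \mu_i^{*} = \frac{1}{r} \sum_{j=1}^{r} \left| EE_i^{(j)} \right|, \qquad \sigma_i = \sqrt{\frac{1}{r - 1} \sum_{j=1}^{r} \left( EE_i^{(j)} - \mu_i \right)^2}$$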

Now, let's limit the scope of this problem to make it more manageable. The traffic uncertainties the construction
crew will face will be ongoing from May to October, Monday to Friday, from 11 p.m. to 5 a.m. Therefore, we can
take the working_season_df DataFrame and subset it further to produce a working hours one ( working_hrs_df )
that we can describe . We will include the 1%, 50%, and 99% percentiles to understand where the median and
outliers lie:
working_hrs_df = working_season_df[(working_season_df.dow < 5) &\
((working_season_df.hr < 5) |\
(working_season_df.hr > 22))]
working_hrs_df.describe(percentiles=[.01,.5,.99]).transpose()

The preceding code produced the table in Figure 9.12. We can use this table to extract the ranges we will use for
our features in the simulation. Typically, we would use plausible values that exceed the existing maximums or minimums. For most models, any feature value can be increased or decreased beyond its known limits, and if the model learned a monotonic relationship, it can infer a realistic outcome. For instance, it might learn that rain beyond a certain point will increasingly diminish traffic. Then, if you want to simulate a severe flood with, say, 30 mm of rain per hour, it can plausibly predict no traffic:
Figure 9.12 – Summary statistics for the period that the construction crew plans to work through

However, because we are using a prediction approximation method that samples from historical values, we are
limited to how far we can push the boundaries outside of the known. For this reason, we will use the 1% and 99%
percentile values as our limits. We should note that this is an important caveat for any findings, especially for
features that could plausibly extend beyond these limits, such as temp , rain_1h , and snow_1h .

Another thing to note from the summary of Figure 9.12 is that many weather-related binary features are very
sparse. You can tell by their extremely low mean. Each factor that's added to the sensitivity analysis simulation
slows it down, so we will only take the top three; that is, weather_Clear , weather_Clouds , and weather_Rain .
These factors are specified along with the other seven factors in a "problem" dictionary ( morris_problem ), which has their corresponding names , bounds , and groups . Now, bounds is critical because it denotes what ranges of values will be simulated for each factor. We will use [0,4] (Monday – Friday) for dow and [-1,4] (11 p.m. – 4 a.m.) for hr . The filter function automatically translates negative hours into hours from the day before so that -1 on a Tuesday is equivalent to 23 on a Monday. The rest of the bounds were informed by the percentiles. Note that in groups , every factor has its own group, except for the three weather ones, which share one:

morris_problem = {
# There are ten variables
'num_vars': 10,
# These are their names
'names': ['dow', 'hr', 'temp', 'rain_1h', 'snow_1h',\
'cloud_coverage', 'is_holiday', 'weather_Clear',\
'weather_Clouds', 'weather_Rain'],
# Plausible ranges over which we'll move the variables
'bounds': [[0, 4], # dow
[-1, 4], # hr
[-12, 25.], # temp (C)
[0., 3.1], # rain_1h
[0., .3], # snow_1h
[0., 100.], # cloud_coverage
[0, 1], # is_holiday
[0, 1], # weather_Clear
[0, 1], # weather_Clouds
[0, 1] # weather_Rain
],
# Only weather is grouped together
'groups': ['dow', 'hr', 'temp', 'rain_1h', 'snow_1h',\
'cloud_coverage', 'is_holiday', 'weather', 'weather',\
'weather']
}

Once the dictionary has been defined, we can generate Morris method samples with SALib's sample method. In
addition to the dictionary, it takes a number of trajectories ( 256 ) and levels ( num_levels=4 ). The method uses a
grid with factors and levels to construct the trajectories for which inputs are randomly moved one-at-a-time
(OAT). It's important to note that more levels add resolution to this grid, potentially making for a better analysis, but this can also be very time-consuming. It's better to start with a ratio of trajectories to levels of 25:1 or higher and then decrease this ratio progressively. In other words, if you have enough compute, you can make num_levels match the number of trajectories, and with that much compute available, you could also try optimal_trajectories=True . However, given that we have groups, local_optimization would have to be False . The output of sample is an array with one column for each factor and (G + 1) × T rows (where G is the number of groups and T is the number of trajectories). We have eight groups and 256 trajectories, so print should output a shape of 2304 rows and 10 columns:
morris_sample = ms.sample(morris_problem, 256,\
num_levels=4, seed=rand)
print(morris_sample.shape)

Given that the predict function will only work with 15 factors, we should modify the samples to fill the remaining
five factors with zeroes. We use zeroes because that is the median value for these features. Medians are least likely
to increase traffic, but you ought to tailor your default values on a case-by-case basis. If you recall our
Cardiovascular Disease (CVD) example from Chapter 2, Key Concepts of Interpretability, the feature value that
would increase CVD risk was sometimes the minimum or maximum.

The np.hstack function can concatenate arrays horizontally so that three zero factors follow the samples for the first nine factors. Then, there's a lonely tenth sample factor corresponding to weather_Rain , followed by two more zero factors. The resulting array should have the same number of rows as before but 15 columns:
morris_sample_mod = np.hstack((morris_sample[:,0:9],\
np.zeros((morris_sample.shape[0],3)),\
morris_sample[:,9:10],\
np.zeros((morris_sample.shape[0],2))))
print(morris_sample_mod.shape)

The numpy array known as morris_sample_mod now has the Morris samples in a shape that can be understood by
our predict function. If this was a model that had been trained on a tabular dataset, we could just leverage the
model's predict function. However, just as we did with SHAP, we have to use the approximation method. This
time, we won't use predict_fn because we want to set one additional option, progress_bar=True , in
approx_predict_ts . Everything else will remain the same. The progress bar will come in handy because this
should take a while. Run the cell and take a coffee break:
morris_preds = mldatasets.\
approx_predict_ts(morris_sample_mod, X_df, gen_all,\
lstm_traffic_mdl, filt_fn=filt_fn,\
dist_metric=heom_dist.heom,lookback=lookback,\
X_scaler=X_scaler, y_scaler=y_scaler,\
progress_bar=True)

To produce a sensitivity analysis with SALib's analyze function, all you need is your problem dictionary ( morris_problem ), the original Morris samples ( morris_sample ), and the predictions we just produced with those samples ( morris_preds ). There's an optional confidence interval level argument ( conf_level ), but the default of 0.95 is good. It uses resampling, 1,000 resamples by default, to compute the confidence intervals; this can be changed with the optional num_resamples argument:
morris_sensitivities = ma.analyze(morris_problem, morris_sample,\
morris_preds, print_to_console=False)
Analyzing the elementary effects

analyze will return a dictionary with the Morris sensitivity indices, including the mean (µ) and standard deviation (σ) of the elementary effects, as well as the mean of their absolute values (µ*). It's easier to appreciate these values in a tabular format, so we can place them into a DataFrame and sort and color-code them according to µ*, which can be interpreted as the overall importance of the factor. σ, on the other hand, indicates how much the factor interacts with other ones:
morris_df = pd.DataFrame({'features':morris_sensitivities['names'],\
'μ':morris_sensitivities['mu'],\
'μ*':morris_sensitivities['mu_star'],\
'σ':morris_sensitivities['sigma']})
morris_df.sort_values('μ*', ascending=False).style.\
background_gradient(cmap='plasma', subset=['μ*'])

The preceding code outputs the DataFrame depicted in Figure 9.13. You can tell that is_holiday surprisingly becomes the second-most important factor, although not by a huge margin, at least within the bounds specified in the problem definition ( morris_problem ). Another thing to note is that weather does have an absolute mean elementary effect but inconclusive interaction effects. Groups are challenging to assess, especially when they are sparse binary factors:

Figure 9.13 – The Elementary Effects (EE) decomposition of the factors

The DataFrame in the preceding figure is not the best way to visualize the elementary effects. When there are not
too many factors, it's easier to plot them. SALib comes with two plotting methods. The horizontal bar plot
( horizontal_bar_plot ) and covariance plot ( covariance_plot ) can be placed side by side. The covariance plot
is excellent, but it doesn't annotate the areas it delineates. We will learn about these next. So, solely for
instructional purposes, we will use text to place the annotations:
fig, (ax0, ax1) = plt.subplots(1,2, figsize=(12,8))
mp.horizontal_bar_plot(ax0, morris_sensitivities, {})
mp.covariance_plot(ax1, morris_sensitivities, {})
ax1.text(ax1.get_xlim()[1] * 0.45, ax1.get_ylim()[1] * 0.75,\
'Non-linear and/or non-monotonic', color='gray',\
horizontalalignment='center')
ax1.text(ax1.get_xlim()[1] * 0.75, ax1.get_ylim()[1] * 0.5,\
'Almost Monotonic', color='gray', horizontalalignment='center')
ax1.text(ax1.get_xlim()[1] * 0.83, ax1.get_ylim()[1] * 0.2,\
'Monotonic', color='gray', horizontalalignment='center')
ax1.text(ax1.get_xlim()[1] * 0.9, ax1.get_ylim()[1] * 0.025,
'Linear', color='gray', horizontalalignment='center')
The preceding code produces the plots shown in Figure 9.14. The bar plot on the left ranks the factors by µ *,
while the lines sticking out of each bar signify their corresponding confidence bands. The covariance plot to the
right is a scatter plot with µ * on the x-axis and σ on the y-axis. Therefore, the farther right the point is, the more
important it is, while the further up it is in the plot, the more it interacts with other factors and becomes
increasingly less monotonic. Naturally, this means that factors that don't interact much and are mostly monotonic comply with linear regression assumptions, such as linearity and the absence of multicollinearity. The spectrum between linear and non-linear or non-monotonic is determined diagonally by the ratio between σ and µ*.

Figure 9.14 – A bar and covariance plot depicting the Elementary Effects (EE)

You can tell from the preceding covariance plot that all the factors are non-linear or non-monotonic. hr is by far the most important, with the next two ( dow and temp ) clustered relatively nearby, followed by weather and is_holiday in the bar plot. The weather group is missing from the covariance plot because its interactivity was inconclusive, yet cloud_coverage , rain_1h , and snow_1h are considerably more interactive than they are important on their own.

Elementary effects help us understand how to classify our factors in accordance with their effects on model
outcomes. However, it's not a robust method to properly quantify their effects or those derived from factor
interactions. For that, we would have to turn to a variance-based global method that uses a probabilistic framework
to decompose the output's variance and trace it back to the inputs. Those methods include Fourier Amplitude
Sensitivity Test (FAST) and Sobol. We will study the latter approach next.

Quantifying uncertainty and cost sensitivity with factor fixing


With the Morris indices, it became evident that all the factors are non-linear or non-monotonic. There's a high
degree of interactivity between them – as expected! It should be no surprise that climate factors ( temp , rain_1h ,
snow_1h , and cloud_coverage ) are likely multicollinear with hr . There are also patterns to be found between
hr , is_holiday , and dow and the target. Many of these factors most definitely don't have a monotonic relationship with the target. We know this already. For instance, traffic doesn't consistently increase as the hours of the day go by, nor does it across the days of the week!

However, we didn't know to what degree is_holiday and temp impacted the model, particularly during the
crew's working hours, which was an important insight. That being said, factor prioritization with Morris indices is
usually to be taken as a starting point or "first setting" because once you ascertain that there are interaction effects,
it's best if you disentangle them. To this end, there's a "second setting" called factor fixing. We can quantify the
variance and, by doing so, the uncertainty brought on by all the factors.

Only variance-based methods can quantify these effects in a statistically rigorous fashion. Sobol Sensitivity
Analysis is one of these methods, which means that it decomposes the model's output variance into percentages
and attributes it to the model's inputs and interactions. Like Morris, it has a sampling step, as well as a sensitivity
index estimation step.

Unlike Morris, the sampling doesn't follow a series of levels but the input data's distribution. It uses a quasi-
Monte Carlo method, where it samples points in hyperspace that follow the inputs' probability distributions.
Monte Carlo methods are a family of algorithms that perform random sampling, often for optimization or
simulation. They seek to cut corners on problems that would be impossible to solve with brute force or entirely
deterministic approaches. Monte Carlo methods are common in sensitivity analysis precisely for this reason.
Quasi-Monte Carlo methods have the same goal. However, they converge faster because they use a deterministic
low-discrepancy sequence instead of using a pseudorandom one. The Sobol method uses the Sobol sequence,
devised by the same mathematician. We will use another sampling scheme derived from Sobol's, called Saltelli's.

Once the samples have been produced, Monte Carlo estimators compute the variance-based sensitivity indices.
These indices are capable of quantifying non-linear non-additive effects and second-order indices, which relate to
the interaction between two factors. Morris can reveal interactivity in your model, but not precisely how it is
manifested. Sobol can tell you what factors are interacting and to what degree.
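Before applying this to our traffic model, here is a quick, self-contained illustration of these indices using SALib's classic Ishigami test function (assuming a SALib version that still exposes the saltelli sampler; the variable names are ours):
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol
from SALib.test_functions import Ishigami
ishigami_problem = {
    'num_vars': 3,
    'names': ['x1', 'x2', 'x3'],
    'bounds': [[-np.pi, np.pi]] * 3
}
# Quasi-Monte Carlo (Saltelli) samples, then evaluate the test function
param_values = saltelli.sample(ishigami_problem, 1024, calc_second_order=True)
Y = Ishigami.evaluate(param_values)
Si = sobol.analyze(ishigami_problem, Y, calc_second_order=True)
print(Si['S1'])  # first-order: variance explained by each factor on its own
print(Si['ST'])  # total-order: includes all interactions
print(Si['S2'])  # second-order: pairwise interactions
In this test function, the third factor has practically no first-order effect, yet its total-order index is substantial because it acts only through its interaction with the first factor, which is exactly the kind of pattern Sobol can quantify and Morris can only hint at.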

Generating and predicting on Saltelli samples

To begin a Sobol sensitivity analysis with SALib , we must first define a problem. We'll do the same as we did with Morris, except this time we will reduce the factors because we realized that the weather grouping led to inconclusive results. We will only include the least sparse of the weather factors; that is, weather_Clear . And since
Sobol uses a probabilistic framework, there's no harm in expanding the bounds to their minimum and maximum
values for temp , rain_1h , and cloud_coverage , as seen in Figure 9.12:
sobol_problem = {
'num_vars': 8,
'names': ['dow', 'hr', 'temp', 'rain_1h', 'snow_1h',\
'cloud_coverage', 'is_holiday', 'weather_Clear'],
'bounds': [[0, 4], # dow
[-1, 4], # hr
[-3., 31.], # temp (C)
[0., 21.], # rain_1h
[0., 1.6], # snow_1h
[0., 100.], # cloud_coverage
[0, 1], # is_holiday
[0, 1] # weather_Clear
],
'groups': None
}

Generating the samples should look familiar too. The Saltelli sample function requires the following:

A problem statement ( sobol_problem )


A number of samples to produce per factor ( 256 )
Second-order indices to compute ( calc_second_order=True )

Given that we want the interactions, the output of sample is an array that has one column for each factor and N × (2F + 2) rows (where N is the number of samples and F is the number of factors). We have eight factors and 256
samples per factor, so print should output a shape of 4,608 rows and 8 columns. First, we will modify it, as we
did previously, with hstack to add the 7 empty factors needed to make the predictions, resulting in 15 columns
instead:
saltelli_sample = ss.sample(sobol_problem, 256,\
calc_second_order=True, seed=rand)
saltelli_sample_mod = np.hstack((saltelli_sample,\
np.zeros((saltelli_sample.shape[0],7))))
print(saltelli_sample_mod.shape)

Now, let's predict on these samples. This should take a while, so it's coffee time once more:
saltelli_preds = mldatasets.\
approx_predict_ts(saltelli_sample_mod, X_df, gen_all,\
lstm_traffic_mdl, filt_fn=filt_fn,\
dist_metric=heom_dist.heom, lookback=lookback,\
X_scaler=X_scaler, y_scaler=y_scaler,\
progress_bar=True)

Performing Sobol sensitivity analysis

For Sobol sensitivity analysis ( analyze ), all you need is a problem statement ( sobol_problem ) and the model
outputs ( saltelli_preds ). But the predictions don't tell the story of uncertainty. Sure, there's variance in the
predicted traffic, but that traffic is only a problem once it exceeds 1,500. Uncertainty is something you want to
relate to risk or reward, costs or revenue, loss or profit – something tangible you can connect to your problem.

First, we must assess if there's any risk at all. To get an idea of whether the predicted traffic in the samples
exceeded the no-construction threshold during the working hours, we can use
print(max(saltelli_preds[:,0])) . The maximum traffic level should be somewhere in the neighborhood of
1,800-1,900, which means that there's at least some risk that the construction company will pay a fine. Instead of
using the predictions ( saltelli_preds ) as the model's output, we can create a simple binary array with ones
when it exceeded 1,500 and zero otherwise. We will call this costs , and then run the analyze function with it.
Note that calc_second_order=True is also set here too. It will throw an error if sample and analyze don't have
a consistent setting. Like with Morris, there's an optional confidence interval level argument ( conf_level ), but
the default of 0.95 is good:
costs = np.where(saltelli_preds > 1500, 1,0)[:,0]
factor_fixing_sa = sa.analyze(sobol_problem, costs,\
calc_second_order=True,\
print_to_console=False)

analyze will return a dictionary with the Sobol sensitivity indices, including the first-order ( S1 ), second-order ( S2 ), and total-order ( ST ) indices, as well as the total-order confidence bounds ( ST_conf ). The indices correspond to fractions of the output variance, but they won't necessarily add up to one unless the model is purely additive. It's easier to appreciate these values in a tabular format, so we can place them into a DataFrame and sort and color-code them according to the total, which can be interpreted as the overall importance of the factor. However, we will leave the second-order indices out because they are two-dimensional and akin to a correlation plot:
sobol_df = pd.DataFrame({'features':sobol_problem['names'],\
'1st':factor_fixing_sa['S1'],\
'Total':factor_fixing_sa['ST'],\
'Total Conf':factor_fixing_sa['ST_conf'],\
'Mean of Input':saltelli_sample.mean(axis=0)[:8]})
sobol_df.sort_values('Total', ascending=False).style.\
background_gradient(cmap='plasma', subset=['Total'])

The preceding code outputs the DataFrame depicted in Figure 9.15. You can tell that temp and is_holiday are in the top four, at least within the bounds specified in the problem definition ( sobol_problem ). Another thing to note
is that weather_Clear does have more of an effect on its own, but rain_1h and cloud_coverage seem to have
no effect on the potential cost:

Figure 9.15 – Sobol global sensitivity indices for the eight factors

Something interesting about the first-order values is how low they are, suggesting that interactions account for
most of the model output variance. We can easily produce a heatmap with second-order indices to corroborate this.
It's the combination of these indices and the first-order ones that add up to the totals:
S2 = factor_fixing_sa['S2']
divnorm = TwoSlopeNorm(vmin=S2.min(), vcenter=0, vmax=S2.max())
sns.heatmap(S2, center=0.00, norm=divnorm, cmap='coolwarm_r',\
annot=True, fmt ='.2f',\
xticklabels=sobol_problem['names'],\
yticklabels=sobol_problem['names'])

The preceding code outputs the heatmap in Figure 9.16:


Figure 9.16 – Sobol second-order indices for the eight factors

Here, you can tell that is_holiday and weather_Clear are the two factors that contribute the most to the output
variance. dow and hr have sizable interactions with all the factors.

Incorporating a realistic cost function

Now, we can create a cost function that takes our inputs ( saltelli_sample ) and outputs ( saltelli_preds ) and computes how much the twin cities would fine the construction company, plus any additional costs the extra traffic could produce. It's easiest to do this if the inputs and outputs are in the same array because we will need details from both to calculate the costs. We can use hstack to join the samples and their corresponding predictions, producing an array with nine columns ( saltelli_sample_preds ). We can then define a cost function ( cost_fn ) that can compute the costs, given an array with these nine columns:
#Join input and outputs into a sample+prediction array
saltelli_sample_preds = np.hstack((saltelli_sample, saltelli_preds))
We know that the half-capacity threshold wasn't exceeded for any sample predictions, so we won't even bother to
include the daily penalty in the function. Besides that, the fines are $15 per vehicle that exceeds the hourly no-
construction threshold. In addition to these fines, to be able to leave on time, the construction company estimates
additional costs: $1,500 in extra wages if the threshold is exceeded at 4 a.m. and $4,500 more on Fridays to speed
up the move of their equipment because it can't stay on the highway shoulder during weekends. Once we have the
cost function, we can iterate through the combined array ( saltelli_sample_preds ), calculating costs for each
sample. List comprehension can do this efficiently:
#Define cost function
def cost_fn(x):
    cost = 0
    # Fines and surcharges only apply when the hourly threshold is exceeded
    if x[8] > 1500:
        # $15 per vehicle over the no-construction threshold
        cost = (x[8] - 1500) * 15
        # Extra wages if the threshold is exceeded at 4 a.m. (hr is column 1)
        if round(x[1]) == 4:
            cost = cost + 1500
        # Equipment-moving surcharge on Fridays (dow is column 0)
        if round(x[0]) == 4:
            cost = cost + 4500
    return cost
#Use list comprehension to compute costs for sample+prediction array
costs2 = np.array([cost_fn(xi) for xi in saltelli_sample_preds])
#Print total fines for entire sample predictions
print('Total Fines: $%s' % '{:,.2f}'.format(sum(costs2)))

The print statement should output a cost somewhere between $170-200 thousand. But not to worry! The
construction crew only plans to work about 195 days on-site per year and 5 hours each day, for a total of 975
hours. However, there are 4,608 samples, which means that there's almost 5 years' worth of predicted costs due to
excess traffic. In any case, the point of calculating these costs is to figure out how they relate to the model's inputs.
More years' worth of samples means tighter confidence intervals.
factor_fixing2_sa = sa.analyze(sobol_problem, costs2,\
calc_second_order=True,\
print_to_console=False)

We can now perform the analysis again but with costs2 , saving it into a factor_fixing2_sa dictionary. Lastly, we can produce a new sorted and color-coded DataFrame with this dictionary's values, as we did previously for Figure 9.15, which generates the output shown in Figure 9.17.
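That code isn't repeated here; a minimal sketch mirroring the earlier snippet (the sobol2_df name is ours) could look like this:
sobol2_df = pd.DataFrame({'features':sobol_problem['names'],\
                          '1st':factor_fixing2_sa['S1'],\
                          'Total':factor_fixing2_sa['ST'],\
                          'Total Conf':factor_fixing2_sa['ST_conf'],\
                          'Mean of Input':saltelli_sample.mean(axis=0)[:8]})
sobol2_df.sort_values('Total', ascending=False).style.\
    background_gradient(cmap='plasma', subset=['Total'])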

As you can tell from Figure 9.17, once the actual costs have been factored in, dow , hr , and is_holiday become riskier factors, while snow_1h and temp become less relevant compared to Figure 9.15.

Figure 9.17 – Sobol global sensitivity indices for the eight factors using the realistic cost function
One thing that is hard to appreciate with a table is the confidence intervals of the sensitivity indices. For that, we
can use a bar plot, but first, we must convert the entire dictionary into a DataFrame so that SALib's plotting
function can plot it:
factor_fixing2_df = factor_fixing2_sa.to_df()
fig, (ax) = plt.subplots(1,1, figsize=(15, 7))
sp.plot(factor_fixing2_df[0], ax=ax)

The preceding code generates the bar plot in Figure 9.18. The 95% confidence interval for dow is much larger
than for other important factors, which shouldn't be surprising considering how much variance there is between
days of the week. Another interesting insight is how weather_Clear has negative first-order effects, so the
positive total-order indices are entirely attributed to second-order ones, which expand the confidence interval:

Figure 9.18 – Bar plot with the Sobol sensitivity total-order indices and their confidence intervals using a realistic
cost function

To understand how, let's plot the heatmap shown in Figure 9.16 again but this time using factor_fixing2_sa
instead of factor_fixing_sa . The heatmap in Figure 9.19 should depict how the realistic costs reflect the
interactions in the model:
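The plotting code isn't shown again either; a sketch reusing the earlier heatmap snippet with the new indices might look like this:
S2_2 = factor_fixing2_sa['S2']
divnorm2 = TwoSlopeNorm(vmin=S2_2.min(), vcenter=0, vmax=S2_2.max())
sns.heatmap(S2_2, center=0.00, norm=divnorm2, cmap='coolwarm_r',\
            annot=True, fmt='.2f',\
            xticklabels=sobol_problem['names'],\
            yticklabels=sobol_problem['names'])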
Figure 9.19 – Sobol second-order indices for seven factors while factoring a more realistic cost function

The preceding heatmap shows salient interactions similar to those in Figure 9.16, but they're much more nuanced since there are more shades. It becomes evident that weather_Clear has a magnifying effect when combined with is_holiday , and a tempering effect on dow and hr .

Mission accomplished
The mission was to train a traffic prediction model and understand what factors create uncertainty and possibly
increase costs for the construction company. We can conclude that a significant portion of the potential $35,000/year in fines can be attributed to the is_holiday factor. Therefore, the construction company should rethink working holidays. There are only seven or eight holidays between March and November, and because of the fines, working them could cost more than working a few Sundays instead. With this caveat, the mission was successful, but there's still a lot of room for improvement.

Of course, these conclusions are for the “LSTM_traffic_168_compact1” model – which we can compare with
other models. Try replacing the model_name at the beginning of the notebook with
“LSTM_traffic_168_compact2”, an equally small but significantly more robust model, or
“LSTM_traffic_168_optimal”, a larger, slightly better-performing model, and re-running the notebook. Or glance at the notebooks named Traffic_compact2 and Traffic_optimal, which have already been re-run with these corresponding models. You will find that it is possible to train and select models that manage uncertain inputs much better. That being said, improvement doesn't always come simply from selecting a better model.

For instance, one thing that could be covered in further depth is the true impact of temp , rain_1h and snow_1h .
Our prediction approximation method precluded Sobol from testing the effect of extreme weather events. If we
modified the model to train on aggregated weather features at single timesteps and built in some guardrails, we
could simulate weather extremes with Sobol. And the "third setting" of sensitivity analysis, known as factor
mapping, could help pinpoint how exactly some factor values affect the predicted outcome, leading to a sturdier
cost-benefit analysis, but we won't cover this in this chapter.

Throughout Part Two of this book, we explored an ecosystem of interpretation methods: global and local; model-
specific and model-agnostic; permutation-based and sensitivity-based. There's no shortage of interpretation
methods to choose from for any machine learning use case. However, it cannot be stressed enough that NO method
is perfect. Still, they can complement each other to approximate a better understanding of your machine learning
solution and the problem it aims to solve.

This chapter's focus on certainty in forecasting was designed to shed light on a particular problem in the machine
learning community: overconfidence. Chapter 1, Interpretation, Interpretability, Explainability, and Why It All
Matters, in the Recognizing the business importance of interpretability section, described the many biases that
infest human decision-making. These biases are often fueled by overconfidence in domain knowledge or our
models' impressive results. And these impressive results keep us from grasping the limitations of our models, even as public distrust of AI increases.

As we discussed in Chapter 1, Interpretation, Interpretability, Explainability, and Why It All Matters, machine
learning is only meant to tackle incomplete problems. Otherwise, we might as well use deterministic and
procedural programming like those found in closed-loop systems. An incomplete problem requires an incomplete
solution, which should be optimized to solve as much of it as possible. Whether through gradient descent, least-
squares estimation, or splitting and pruning a decision tree, machine learning doesn't produce a model that
generalizes perfectly. That lack of completeness in machine learning is precisely why we need interpretation
methods. In a nutshell: models learn from our data, and we can learn a lot from our models, but only if we interpret
them!

Interpretability doesn't stop there, though. Model interpretations can drive decisions and help us understand model
strengths and weaknesses. However, often, there are problems in the data or models themselves that can make
them less interpretable. In Part Three of this book, we'll learn how to tune models and the training data for
interpretability by reducing complexity, mitigating bias, placing guardrails, and enhancing reliability.

Statistician George E.P. Box famously quipped that "all models are wrong, but some are useful." Perhaps they
aren't always wrong, but humility is required from machine learning practitioners to accept that even high-
performance models should be subject to scrutiny and our assumptions about them. Uncertainty with machine
learning models is expected and shouldn't be a source of shame or embarrassment. This leads us to another
takeaway from this chapter: that uncertainty comes with ramifications, be it costs or profit lift, and that we can
gauge these with sensitivity analysis.

Summary
After reading this chapter, you should understand how to assess a time series model's predictive performance,
know how to perform local interpretations for them with integrated gradients, and know how to produce both local
and global attributions with SHAP. You should also know how to leverage sensitivity analysis factor prioritization
and factor fixing for any model.

In the next chapter, we will learn how to reduce complexity in a model and make it more interpretable with feature
selection and engineering.
Dataset and image sources
TomTom. (2019). Traffic Index: https://www.tomtom.com/en_gb/traffic-index/ranking/?
congestion=WORST,BAD,MODERATE
UCI Machine Learning Repository (2019). Metro Interstate Traffic Volume Data Set: https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume

References
Wilson, D.R., & Martinez, T. (1997). Improved Heterogeneous Distance Functions. Journal of Artificial Intelligence Research, 6, 1-34. https://arxiv.org/abs/cs/9701101
Morris, M.D. (1991). Factorial sampling plans for preliminary computational experiments. Technometrics, 33(2), 161-174. https://doi.org/10.2307/1269043
Saltelli, A., Tarantola, S., Campolongo, F., & Ratto, M. (2007). Sensitivity analysis in practice: A guide to assessing scientific models. Chichester: John Wiley & Sons.
Sobol, I.M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and Computers in Simulation, 55(1-3), 271-280. https://doi.org/10.1016/S0378-4754(00)00270-6
Saltelli, A., Annoni, P., Azzini, I., Campolongo, F., Ratto, M., & Tarantola, S. (2010). Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index. Computer Physics Communications, 181(2), 259-270. https://doi.org/10.1016/j.cpc.2009.09.018
10 Feature Selection and Engineering for
Interpretability
Join our book community on Discord
https://packt.link/EarlyAccessCommunity

In the first three chapters, we discussed how complexity hinders machine learning (ML) interpretability. There's a
trade-off because you want some complexity to maximize predictive performance, yet not to the extent that you
cannot rely on the model to satisfy the tenets of interpretability: fairness, accountability, and transparency. This
chapter is the first of four focused on how to tune for interpretability. One of the easiest ways to improve
interpretability is through feature selection. It has many benefits, such as faster training and making the model
easier to interpret. But if these two reasons don't convince you, perhaps another one will.

A common misunderstanding is that complex models can self-select features and perform well nonetheless, so why
even bother to select features? Yes, many model classes have mechanisms that can take care of useless features,
but they aren't perfect. And the potential for overfitting increases with each one that remains. Overfitted models
aren't reliable, even if they are more accurate. So, while employing model mechanisms such as regularization is
still highly recommended to avoid overfitting, feature selection is the first step.

In this chapter, you will comprehend how irrelevant features adversely weigh on the outcome of a model and, thus, the importance of feature selection for model interpretability. Then, we will review filter-based feature selection methods such as Spearman's correlation and learn about embedded methods such as LASSO and Ridge regression. Next, you will discover wrapper methods such as sequential feature selection and hybrid ones such as recursive feature elimination (RFE), as well as more advanced ones, such as genetic algorithms (GAs). Lastly, even though feature engineering is typically conducted before selection, there's value in exploring it after the dust has settled and the features have been selected.

These are the main topics we are going to cover in this chapter:

Understanding the effect of irrelevant features


Reviewing filter-based feature selection methods
Exploring embedded feature selection methods
Discovering wrapper, hybrid, and advanced feature selection methods
Considering feature engineering

Technical requirements
This chapter's example uses the mldatasets , pandas , numpy , scipy , mlxtend , sklearn_genetic , xgboost ,
sklearn , matplotlib , and seaborn libraries. Instructions on how to install all of these libraries are in the
Preface.

The GitHub code for this chapter is located here: https://github.com/PacktPublishing/Interpretable-Machine-


Learning-with-Python/tree/master/Chapter10/.

The mission
It has been estimated that there are over 10 million non-profits worldwide, and while a large portion of them have public funding, most depend primarily on private donors, both corporate and individual, to continue operations. As such, fundraising is mission-critical and carried out throughout the year.

Year over year, donation revenue has grown but there are several problems non-profits face: donor interests evolve,
so a charity popular one year might be forgotten the next; competition is fierce between non-profits; and
demographics are shifting. In the United States, the average donor only gives two charitable gifts per year and is
over 64 years old. Identifying potential donors is challenging and campaigns to reach them can be expensive.

A National Veterans Organization non-profit arm has a large mailing list of about 190,000 past donors and would
like to send a special mailer to ask for donations. However, even with a special bulk discount rate, it costs them
$0.68 per address. This adds up to over $130,000. They only have a marketing budget of $35,000. Given that they
have made this a high priority, they are willing to extend the budget but only if the return on investment (ROI) is
high enough to justify the additional cost.

To minimize the use of their limited budget, instead of mass mailing, they'd like to try direct mailing, which aims
to identify potential donors using what is already known, such as past donations, geographic location, and
demographic data. They will reach other donors via email instead, which is much cheaper, costing no more than
/month for their entire list. They hope this hybrid marketing plan will yield better results. They also recognize that
high-value donors respond better to personalized paper mailers, while smaller donors respond better to email
anyway.

No more than six percent of the mailing list donates at any given campaign. Using ML to predict human behavior
is by no means an easy task, especially when it's so imbalanced. Nevertheless, success is not measured by the
highest predictive accuracy but by profit lift. In other words, the direct mailing model evaluated on the test dataset
should produce more profit than if they mass-mailed the entire dataset.

They have sought your assistance to use ML to produce a model that identifies the most probable donors, but also
in a way that guarantees an ROI. Note that the model must be reliable in producing an ROI.

You received the dataset from the non-profit, which is more or less evenly split between train and test. If you send
the mailer to absolutely everybody in the test dataset, you make a profit of $11,173, but if you manage somehow to
identify only those that will donate, the maximum yield of $73,136 will be attained. Your goal is to achieve a high profit lift and a reasonable ROI. When the campaign runs, it will identify the most probable donors for the entire mailing list, and the non-profit hopes to spend not much more than $35,000 in total. However, the dataset has 435 columns, and some simple statistical tests and modeling exercises show that the data is too noisy to identify the potential donors reliably because of overfitting.

The approach
You've decided to first fit a base model with all the features and assess it at different levels of complexity to
understand how having more features increases the propensity to overfit. Then, you employ a series of feature
selection methods ranging from simple filter-based methods to the most advanced ones to determine which one
achieves the profitability and reliability goals sought by the client. Lastly, once a final list of features has been selected, feature engineering can be considered to enhance model interpretability.

Given the cost-sensitive nature of the problem, thresholds are important to optimize the profit lift. We will get into
the role of thresholds later on, but one significant effect is that even though this is a classification problem, it is
best to use regression models, and then use predictions to classify so that there's only one threshold to tune. That
is, for classification models, you would need a threshold for the label, say those that donated over $1, and then another one for the predicted probabilities. On the other hand, regression predicts the donation amount, and a single threshold can be optimized based on that.
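To illustrate the single-threshold idea with a tiny, hypothetical example (the numbers and variable names below are made up), a regression model predicts the donation amount, one cut-off decides who gets a mailer, and profit is computed against the actual donations:
import numpy as np
y_true = np.array([0., 0., 5., 0., 20.])      # actual donations (made up)
y_pred = np.array([0.1, 0.4, 0.9, 0.2, 3.0])  # regression predictions (made up)
var_cost, thresh = 0.68, 0.75                 # mailing cost and a single threshold
mailed = y_pred > thresh                      # one cut-off decides who is mailed
profit = y_true[mailed].sum() - mailed.sum() * var_cost
print(profit)  # donations received from those mailed minus mailing costs
Sweeping thresh over a range of values and keeping the most profitable one is essentially what we will do later in the chapter, across many thresholds at once.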

The preparations
You will find the code for this example here: https://github.com/PacktPublishing/Interpretable-Machine-Learning-
with-Python/tree/master/Chapter10/Mailer.ipynb.

Loading the libraries

To run this example, you need to install the following libraries:

mldatasets to load the dataset


pandas , numpy , and scipy to manipulate it
mlxtend , sklearn_genetic , xgboost , and sklearn (scikit-learn) to fit the models
matplotlib and seaborn to create and visualize the interpretations

To load the libraries, use the following code block:


import math
import os
import mldatasets
import pandas as pd
import numpy as np
import timeit
from tqdm.notebook import tqdm
from sklearn.feature_selection import VarianceThreshold,\
mutual_info_classif, SelectKBest
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression,\
LassoCV, LassoLarsCV, LassoLarsIC
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.feature_selection import RFECV
from sklearn.decomposition import PCA
import shap
from sklearn_genetic import GAFeatureSelectionCV
from scipy.stats import rankdata
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns

Next, we will load and prepare the dataset.

Understanding and preparing the data

We load the data like this into two dataframes ( X_train , X_test ) with the features and two NumPy arrays with the corresponding labels ( y_train , y_test ). Please note that these dataframes have already been prepared for us by removing sparse or unnecessary features, treating missing values, and encoding categorical features:
X_train, X_test, y_train, y_test =\
mldatasets.load("nonprofit-mailer", prepare=True)
y_train = y_train.squeeze()
y_test = y_test.squeeze()

All features are numeric with no missing values and categorical features have already been one-hot encoded for us.
Between both train and test mailing lists, there should be over 191,500 records and 435 features. You can check
this is the case like this:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

The preceding code should output the following:


(95485, 435)
(95485,)
(96017, 435)
(96017,)

Next we can verify that the test labels have the right amount of donators ( test_donators ), donations
( test_donations ), and profit ranges ( test_min_profit , test_max_profit ). We can print these, and then do
the same for the training dataset:
var_cost = 0.68
y_test_donators = y_test[y_test > 0]
test_donators = len(y_test_donators)
test_donations = sum(y_test_donators)
test_min_profit = test_donations - (len(y_test)*var_cost)
test_max_profit = test_donations - (test_donators*var_cost)
print('%s test donators totaling $%.0f (min profit: $%.0f, max profit: $%.0f)' %\
      (test_donators, test_donations, test_min_profit,\
       test_max_profit))
y_train_donators = y_train[y_train > 0]
train_donators = len(y_train_donators)
train_donations = sum(y_train_donators)
train_min_profit = train_donations -\
    (len(y_train)*var_cost)
train_max_profit = train_donations -\
    (train_donators*var_cost)
print('%s train donators totaling $%.0f (min profit: $%.0f, max profit: $%.0f)' %\
      (train_donators, train_donations, train_min_profit,\
       train_max_profit))

The preceding code should output the following:


4894 test donators totaling $76464 (min profit: $11173, max profit: $73136)
4812 train donators totaling $75113 (min profit: $10183, max profit: $71841)

Indeed, if the non-profit mass-mailed to everyone on the test mailing list, they'd make about $11,000 profit but
would have to go grossly over budget to achieve this. The non-profit recognizes that making the max profit by
identifying and targeting only donors is a nearly impossible feat. Therefore, they would be content with a model that can reliably yield more than the minimum profit at a smaller cost, preferably under budget.

Understanding the effect of irrelevant features


Feature selection is also known as variable or attribute selection. It is the method by which you can
automatically or manually select a subset of specific features useful to the construction of ML models.

It's not necessarily true that more features lead to better models. Irrelevant features can impact the learning
process, leading to overfitting. Therefore, we need some strategies to remove any features that might adversely
affect learning. Some of the advantages of selecting a smaller subset of features include the following:

It's easier to understand simpler models: For instance, feature importance for a model that uses 15 variables is
much easier to grasp than one that uses 150 variables.
Shorter training time: Reducing the number of variables decreases the cost of computing, speeds up model
training, and perhaps most notably, simpler models have quicker inference times.
Improved generalization by reducing overfitting: Many variables have little predictive value and are just noise. The ML model, however, learns from this noise, which leads to overfitting while hurting generalization. We may significantly enhance the generalization of ML models by removing these irrelevant, noisy features.
Variable redundancy: It is common for datasets to have collinear features, which could mean some are redundant. In cases like these, as long as no significant information is lost, we can retain only one variable and delete the others.

Now, we will fit some models to demonstrate the effect of too many features.
Creating a base model

Let's create a base model for our mailing list dataset to see how this plays out. But first, let's set our random
numbers for reproducibility:
rand = 9
os.environ['PYTHONHASHSEED']=str(rand)
np.random.seed(rand)

We will use XGBoost's Random Forest (RF) regressor ( XGBRFRegressor ) throughout this chapter. It's just like
scikit-learn's but faster because it uses second-order approximations of the objective function. It also has more
options, such as setting the learning rate and monotonic constraints, examined in Chapter 12, Monotonic
Constraints and Model Tuning for Interpretability. We initialize XGBRFRegressor with a max_depth value of 4
and always use 200 estimators for consistency. Then, we fit it with our training data. We will use timeit to
measure how long it takes, which we save in a variable ( baseline_time ) for later reference:
stime = timeit.default_timer()
reg_mdl = xgb.XGBRFRegressor(max_depth=4, n_estimators=200, seed=rand)
fitted_mdl = reg_mdl.fit(X_train, y_train)
etime = timeit.default_timer()
baseline_time = etime-stime

Now that we have a base model, let's evaluate it.

Evaluating the model

Next, let's create a dictionary ( reg_mdls ) to house all the models we will fit in this chapter to test which feature
subsets produce the best models. Here, we can evaluate the RF model with all the features and a max_depth value
of 4 ( rf_4_all ) using evaluate_reg_mdl . It will make a summary and a scatter plot with a regression line:
reg_mdls = {}
reg_mdls['rf_4_all'] = mldatasets.evaluate_reg_mdl(fitted_mdl,\
X_train, X_test, y_train, y_test,\
plot_regplot=True, ret_eval_dict=True)

The preceding code produces the metrics and plot shown in Figure 10.1:
Figure 10.1 – Base model predictive performance

For a plot like the one in Figure 10.1, you would usually expect the points to follow a diagonal line, so one glance at this plot would tell you that the model is useless. Also, the RMSEs may not seem bad, but in the context of such a lopsided problem, they are dismal. Consider this: only 5% of the list makes a donation, and only 20% of those donations are over $20, so an average error of $4.30 to $4.60 is enormous.

So, is this model useless? The answer lies in what thresholds we use to classify with it. Let's start by defining an
array of thresholds ( threshs ), ranging from $0.40 to $25. We start spacing these out by a cent until it reaches $1,
then by 10 cents until it reaches $3, and after that spaced by $1:
threshs = np.hstack([np.linspace(0.40,1,61), np.linspace(1.1,3,20),\
np.linspace(4,25,22)])

There's a function in mldatasets that can compute profit at every threshold ( profits_by_thresh ). All it needs is
the actual ( y_test ) and predicted labels, followed by the thresholds ( threshs ), the variable cost ( var_costs ),
and the min_profit required. It produces a pandas dataframe with the revenue, costs, profit, and ROI for every
threshold, as long as profit is above the min_profit . Remember, we had set this minimum at the beginning of the
chapter as $11,173 because it makes no sense to target donators under this amount. After we generate these profit
dataframes for the test and train datasets, we can place the maximum, and minimum amounts in the model's
dictionary for later use. And then, we employ compare_df_plots to plot the costs, profits, and ROI ratio for test
and train for every threshold where it exceeded the profit minimum:
y_formatter = plt.FuncFormatter(lambda x, loc: "${:,}K".format(x/1000))
profits_test = mldatasets.profits_by_thresh(y_test,\
reg_mdls['rf_4_all']['preds_test'], threshs,\
var_costs=var_cost, min_profit=test_min_profit)
profits_train = mldatasets.profits_by_thresh(y_train,\
reg_mdls['rf_4_all']['preds_train'], threshs,\
var_costs=var_cost, min_profit=train_min_profit)
reg_mdls['rf_4_all']['max_profit_train'] =profits_train.profit.max()
reg_mdls['rf_4_all']['max_profit_test'] = profits_test.profit.max()
reg_mdls['rf_4_all']['max_roi'] = profits_test.roi.max()
reg_mdls['rf_4_all']['min_costs'] = profits_test.costs.min()
reg_mdls['rf_4_all']['profits_train'] = profits_train
reg_mdls['rf_4_all']['profits_test'] = profits_test
mldatasets.compare_df_plots(\
profits_test[['costs', 'profit', 'roi']],\
profits_train[['costs', 'profit', 'roi']],\
'Test', 'Train', y_formatter=y_formatter, x_label='Threshold',\
plot_args={'secondary_y':'roi'})

The preceding snippet generates the plots in Figure 10.2. You can tell that Test and Train are almost identical. Costs decrease steadily at a high rate and profit at a lower rate, while ROI increases steadily. However, some differences exist: the ROI eventually becomes a bit higher for Train, and although viable thresholds start at the same point, Train ends at a different threshold. It turns out the model can turn a profit, so despite the appearance of the plot in Figure 10.1, the model is far from useless:

Figure 10.2 – Comparison between profit, costs, and ROI for the test and train datasets for the base model across
thresholds

The difference in RMSEs for the train and test sets didn't lie. The model did not overfit. The main reason for this is
that we used relatively shallow trees by setting our max_depth value at 4. We can easily see this effect of using
shallow trees by computing how many features had a feature_importances_ value of over 0:
reg_mdls['rf_4_all']['total_feat'] =\
    reg_mdls['rf_4_all']['fitted'].feature_importances_.shape[0]
reg_mdls['rf_4_all']['num_feat'] =\
    sum(reg_mdls['rf_4_all']['fitted'].feature_importances_ > 0)
print(reg_mdls['rf_4_all']['num_feat'])
The preceding code outputs 160. In other words, only 160 features out of 435 were used; there are only so many features that can be accommodated into such shallow trees! Naturally, this reduces overfitting but, at the same time, the features chosen by impurity measures over a random selection of features are not necessarily the most optimal.

Training the base model at different max depths

So, what happens if we make the trees deeper? Let's repeat all the steps we did for the shallow one but for max
depths between 5 and 12:
for depth in tqdm(range(5, 13)):
    mdlname = 'rf_'+str(depth)+'_all'
    stime = timeit.default_timer()
    reg_mdl = xgb.XGBRFRegressor(max_depth=depth, n_estimators=200,\
                                 seed=rand)
    fitted_mdl = reg_mdl.fit(X_train, y_train)
    etime = timeit.default_timer()
    reg_mdls[mdlname] = mldatasets.evaluate_reg_mdl(fitted_mdl, X_train,\
        X_test, y_train, y_test, plot_regplot=False,\
        show_summary=False, ret_eval_dict=True)
    reg_mdls[mdlname]['speed'] = (etime - stime)/baseline_time
    reg_mdls[mdlname]['depth'] = depth
    reg_mdls[mdlname]['fs'] = 'all'
    profits_test = mldatasets.profits_by_thresh(y_test,\
        reg_mdls[mdlname]['preds_test'], threshs,\
        var_costs=var_cost, min_profit=test_min_profit)
    profits_train = mldatasets.profits_by_thresh(y_train,\
        reg_mdls[mdlname]['preds_train'], threshs,\
        var_costs=var_cost, min_profit=train_min_profit)
    reg_mdls[mdlname]['max_profit_train'] = profits_train.profit.max()
    reg_mdls[mdlname]['max_profit_test'] = profits_test.profit.max()
    reg_mdls[mdlname]['max_roi'] = profits_test.roi.max()
    reg_mdls[mdlname]['min_costs'] = profits_test.costs.min()
    reg_mdls[mdlname]['profits_train'] = profits_train
    reg_mdls[mdlname]['profits_test'] = profits_test
    reg_mdls[mdlname]['total_feat'] =\
        reg_mdls[mdlname]['fitted'].feature_importances_.shape[0]
    reg_mdls[mdlname]['num_feat'] =\
        sum(reg_mdls[mdlname]['fitted'].feature_importances_ > 0)

Now, let's plot the details in the profits dataframes for the "deepest" model (with a max depth of 12) as we did
before with compare_df_plots , producing Figure 10.3:
Figure 10.3 – Comparison between profit, costs, and ROI for the test and train datasets for a "deep" base model
across thresholds

See how different Test and Train are this time in Figure 10.3. Test reaches a maximum of about $15,000 while Train exceeds $20,000. Train's costs fall dramatically, making its ROI orders of magnitude higher than Test's. Also, the ranges of viable thresholds are quite different. Why is this a problem, you ask? If we had to guess what threshold to use to pick who to target in the next mailer, the optimum for Train is higher than for Test, meaning that by using an overfit model, we could miss the mark and underperform on unseen data.

Next, let's convert our model dictionary ( reg_mdls ) into a dataframe and extract some details from it. Then, we
can sort it by depth, format it, color-code it, and output it:
def display_mdl_metrics(reg_mdls, sort_by='depth', max_depth=None):
    reg_metrics_df = pd.DataFrame.from_dict(reg_mdls, 'index')\
        [['depth', 'fs', 'rmse_train', 'rmse_test',\
          'max_profit_train', 'max_profit_test', 'max_roi',\
          'min_costs', 'speed', 'num_feat']]
    pd.set_option('precision', 2)
    html = reg_metrics_df.sort_values(by=sort_by, ascending=False).style.\
        format({'max_profit_train':'${0:,.0f}',\
                'max_profit_test':'${0:,.0f}', 'min_costs':'${0:,.0f}'}).\
        background_gradient(cmap='plasma', low=0.3, high=1,\
                            subset=['rmse_train', 'rmse_test']).\
        background_gradient(cmap='viridis', low=1, high=0.3,\
                            subset=['max_profit_train', 'max_profit_test'])
    return html
display_mdl_metrics(reg_mdls)

The preceding snippet leverages the display_mdl_metrics function to output the dataframe shown in Figure 10.4. Something that should be immediately visible is how the train and test RMSEs move in opposite directions: one decreases dramatically while the other increases slightly as depth increases. The same can be said for profit. Maximum ROI also tends to increase with depth, as do the training time and the number of features used:
Figure 10.4 – Comparing metrics for all base RF models with different depths

You could be tempted to use rf_11_all since it has the highest profitability, but it would be risky to use it! A common misunderstanding is that black-box models can effectively cut through any amount of irrelevant features. While they will often find something of value and make the most out of it, too many features make them overfit more easily, hindering their reliability. Fortunately, there is a sweet spot where you can reach high profitability with minimal overfitting, but to get there, you have to reduce the number of features first!

Reviewing filter-based feature selection methods


Filter-based methods independently pick out features from a dataset without employing any ML. These methods
depend only on the variables' characteristics and are relatively effective, computationally inexpensive, and quick to
perform. Therefore, being the low-hanging fruit of feature selection methods, they are usually the first step in any
feature selection pipeline.

Two kinds of filter-based methods exist:

Univariate: Individually and independently of the feature space, they evaluate and rate a single feature at a
time. One problem that can occur with univariate methods is that they may filter out too much since they don't
take into consideration the relationship between features.
Multivariate: These take into account the entire feature space and how features within interact with each
other.

Overall, for the removal of obsolete, redundant, constant, duplicated, and uncorrelated features, filter methods are
very strong. However, by not accounting for complex, non-linear, non-monotonic correlations and interactions that
only ML models can find, they aren't effective whenever these relationships are prominent in the data.

We will review three categories of filter-based methods:

Basic
Correlation
Ranking

We will explain them further in their own sections.

Basic filter-based methods

We employ basic filter methods in the data preparation stage, specifically, the data cleaning stage, before any
modeling. The reason for this is there's a low risk of taking feature selection decisions that would adversely impact
models. They involve common-sense operations such as removing features that carry no information or duplicate
it.

Constant features with a variance threshold

Constant features don't change in the training dataset and, therefore, carry no information, and the model can't
learn from them. We can use a univariate method called VarianceThreshold , which filters out features that are
low-variance. We will use a threshold of zero because we want to filter out only features with zero variance—in
other words, constant. It only works with numeric features, so we must first identify which features are numeric
and which are categorical. Once we fit the method on the numeric columns, get_support() returns the list of
features that aren't constant, and we can use set algebra to return only the constant features ( num_const_cols ):
num_cols_l = X_train.select_dtypes([np.number]).columns
cat_cols_l = X_train.select_dtypes([np.bool,
np.object]).columns
num_const = VarianceThreshold(threshold=0)
num_const.fit(X_train[num_cols_l])
num_const_cols = list(set(X_train[num_cols_l].columns) -\
set(num_cols_l[num_const.get_support()]))

The preceding snippet produced a list of constant numeric features, but how about categorical features? A constant categorical feature would have only one category, or unique value. You can easily check this by applying the nunique() function to the categorical features. It will return a pandas Series, and then a lambda function can filter out only those with one unique value. Then, .index.tolist() returns the names of the features as a list. Now, you just join both lists of constant features and voilà! You have all the constants ( all_const_cols ). You can print them;
there should be three:
cat_const_cols = X_train[cat_cols_l].nunique()[lambda x:\
x<2].index.tolist()
all_const_cols = num_const_cols + cat_const_cols
print(all_const_cols)

In most cases, removing constant features isn't good enough. A redundant feature might be almost constant or
quasi-constant.

Quasi-constant features with Value-Counts

Quasi-constant features are almost entirely the same value. Unlike constant filtering, using a variance threshold
won't work because high variance and quasi-constantness aren't mutually exclusive. Instead, we will iterate all
features and get value_counts() , which returns the number of rows for each value. Then, divide these counts by
the total number of rows to get a percentage and sort by the highest. If the top value is higher than the
predetermined threshold ( thresh ), we append it to a list of quasi-constant columns ( quasi_const_cols ). Please
note that choosing this threshold must be done with a lot of care and understanding of the problem. For instance, in
this case, we know that it's lopsided because only 5% donate, most of which donate a low amount, so even a tiny
percentage of a feature might make an impact, which is why our threshold is so high at 99.9%:
thresh = 0.999
quasi_const_cols = []
num_rows = X_train.shape[0]
for col in tqdm(X_train.columns):
    top_val = (X_train[col].value_counts() /
               num_rows).sort_values(ascending=False).values[0]
    if top_val >= thresh:
        quasi_const_cols.append(col)
print(quasi_const_cols)

The preceding code should have printed five features, which include the three that were previously obtained. Next,
we will deal with another form of irrelevant features: duplicates!

Duplicating features

Usually, when you discuss duplicates with data, you immediately think of duplicate rows, but duplicate columns
are also problematic. You can find them just as you would find duplicate rows with the pandas duplicated()
function, except you would transpose the dataframe first, swapping columns and rows:
X_train_transposed = X_train.T
dup_cols =\
X_train_transposed[X_train_transposed.duplicated()].index.tolist()
print(dup_cols)

The preceding snippet outputs a list with the two duplicated columns.

Removing unnecessary features

Unlike other feature selection methods, which you should test with models, you can apply basic filter-based feature
selection methods right away by removing the features you deemed useless. But just in case, it's good practice to
make a copy of the original data. Please note that we don't include the constant columns ( all_const_cols ) in the
columns we are to drop ( drop_cols ) because the quasi-constant ones already include them:
X_train_orig = X_train.copy()
X_test_orig = X_test.copy()
drop_cols = quasi_const_cols + dup_cols
X_train.drop(labels=drop_cols, axis=1, inplace=True)
X_test.drop(labels=drop_cols, axis=1, inplace=True)

Next, we will explore multivariate filter-based methods on the remaining features.

Correlation filter-based methods

Correlation filter-based methods quantify the strength of the relationship between two features. This is useful for
feature selection because we might want to filter out extremely correlated features or those that aren't correlated
with any others at all. Either way, it is a multivariate feature selection method—bivariate, to be precise.

But first, we ought to choose a correlation method:

Pearson's correlation coefficient: Measures how linearly correlated two features are, between -1 (negative)
and 1 (positive), with 0 meaning no linear correlation. Like linear regression, it assumes linearity, normality,
and homoscedasticity.
Spearman's rank correlation coefficient: Measures the strength of monotonicity of two features regardless
of whether they are linearly related or not. It is also measured between -1 and 1, with 0 meaning no monotonic
correlation. It makes no distribution assumptions and can work with both continuous and discrete features.
However, its weakness is with non-monotonic relationships.
Kendall's tau correlation coefficient: Measures the ordinal association between features. It also ranges
between -1 and 1, indicating low and high association, respectively. It's useful with discrete features (a quick
comparison of all three follows this list).
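To get a feel for how they differ, here's a minimal sketch comparing the three coefficients on a single pair of gift-amount features (the pair is an assumption, chosen only for illustration):
# Compare the three coefficients on one illustrative pair of features
for method in ('pearson', 'spearman', 'kendall'):
    print(method, round(X_train.LASTGIFT.corr(X_train.AVGGIFT, method=method), 3))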

The dataset is a mix of continuous and discrete features, and we cannot make any linear assumptions about it, so Spearman
is the right choice. Any of the three can be used with the pandas corr function though:
corrs = X_train.corr(method='spearman')
print(corrs.shape)
The preceding code should output the shape of the correlation matrix, which is (428, 428) . This dimension
makes sense because there are 428 features left, and each feature has a relationship with 428 features, including
itself.

We can now look for features to remove in the correlation matrix ( corrs ). Note that to do so, we must establish
thresholds. For instance, we can say that an extremely correlated feature has an absolute value coefficient over
0.99 and less than 0.15 for an uncorrelated feature. With these thresholds in mind, we can find features that are
correlated to only one feature and extremely correlated to more than one feature. Why one feature? Because the
diagonals in a correlation matrix are always 1 because a feature is always perfectly correlated with itself. The
lambda functions in the following code make sure we are accounting for this:

extcorr_cols = (abs(corrs) > 0.99).sum(axis=1)[lambda x: x>1].\
               index.tolist()
print(extcorr_cols)
uncorr_cols = (abs(corrs) > 0.15).sum(axis=1)[lambda x: x==1].\
              index.tolist()
print(uncorr_cols)

The preceding code outputs the two lists as follows:


['MAJOR', 'HHAGE1', 'HHAGE3', 'HHN3', 'HHP1', 'HV1', 'HV2', 'MDMAUD_R', 'MDMAUD_F', 'MDMAUD_A']
['TCODE', 'MAILCODE', 'NOEXCH', 'CHILD03', 'CHILD07', 'CHILD12', 'CHILD18', 'HC15', 'MAXADATE']

The first list contains features that are extremely correlated with features other than themselves. While this is useful to
know, you shouldn't remove features from this list without understanding what features they are correlated with
and how, as well as how they relate to the target. Then, only if redundancy is found, make sure you remove only one of
each redundant pair. The second list contains features that aren't correlated with any feature other than themselves,
which in this case is suspicious given the sheer number of features. That being said, you should also inspect them one
by one, especially measuring them against the target to see whether they are redundant. However, we will take a chance
and make a feature subset ( corr_cols ) excluding the uncorrelated ones:
corr_cols = X_train.columns[~X_train.columns.isin(uncorr_cols)].tolist()
print(len(corr_cols))

The preceding code should output 419. Let's now fit the RF model with only these features. Given that there are
still over 400 features, we will use a max_depth value of 11 . Except for that and a different model name
( mdlname ), it's the same code as before:
mdlname = 'rf_11_f-corr'
stime = timeit.default_timer()
reg_mdl = xgb.XGBRFRegressor(max_depth=11, n_estimators=200, seed=rand)
fitted_mdl = reg_mdl.fit(X_train[corr_cols], y_train)
:
reg_mdls[mdlname]['num_feat'] =\
sum(reg_mdls[mdlname]['fitted'].feature_importances_ > 0)

Before we compare the results for the preceding model, let's learn about ranking filter methods.

Ranking filter-based methods

Ranking filter-based methods are based on statistical univariate ranking tests, which assess the strength of
features against the target. These are some of the most popular methods:

ANOVA F-test: The Analysis of Variance (ANOVA) F-test measures the linear dependency between features
and the target. As the name suggests, it does this by decomposing the variance. It makes similar assumptions
to linear regression, such as normality, independence, and homoscedasticity. In scikit-learn, you can use
f_regression and f_classif for regression and classification, respectively, to rank features by the
F-score yielded by the F-test (see the sketch after this list).
Chi-square test of independence: This test measures the association between non-negative features (such as
booleans or frequencies) and classification targets, so it's only suitable for classification problems. In
scikit-learn, you can use chi2 .
Mutual information (MI): Unlike the two previous methods, this one is derived from information theory
rather than classical statistical hypothesis testing. It's a different name but a concept we have already
discussed in this book as the Kullback-Leibler (KL) divergence because it's the KL for feature X and target
Y. The Python implementation in scikit-learn uses a numerically stable and symmetric offshoot of KL called
Jensen-Shannon (JS) divergence instead and leverages k-nearest neighbors to compute distances. Features
can be ranked by MI with mutual_info_regression and mutual_info_classif for regression and
classification, respectively.
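As a quick reference, here is a sketch of how the F-test could rank and select features against the continuous target (we won't use this subset later, and the choice of k is an assumption); chi2 would work analogously once a classification target exists, which is exactly what we create next:
from sklearn.feature_selection import SelectKBest, f_regression  # if not already loaded

# Rank features by their F-score and keep the top 160 (arbitrary k for illustration)
f_selection = SelectKBest(f_regression, k=160).fit(X_train, y_train)
f_cols = X_train.columns[f_selection.get_support()].tolist()
print(len(f_cols))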

Of the three options mentioned, the one that is most appropriate for this dataset is MI because we cannot assume
linearity among our features, and most of them aren't categorical either. We can try classification with a threshold
of $0.68, which at least covers the cost of sending the mailer. To that end, we must first create a binary
classification target ( y_train_class ) with that threshold:
y_train_class = np.where(y_train > 0.68, 1, 0)

Next, we can use SelectKBest to get the top-160 features according to MI classification (MIC). We then employ
get_support() to obtain a Boolean vector (or mask), which tells us which features are in the top 160, and we
subset the list of features with this mask:
mic_selection = SelectKBest(mutual_info_classif, k=160).\
fit(X_train, y_train_class)
mic_cols = X_train.columns[mic_selection.get_support()].tolist()
print(len(mic_cols))

The preceding code should confirm that there are 160 features in the mic_cols list. Incidentally, this is an
arbitrary number. Ideally, if there were time, we could test different thresholds for the classification target and
different values of k for MI, looking for the model that achieves the highest profit lift while underfitting the least. Next, we can fit the
RF model as we've done before with the MIC features. This time, we will use a max depth of 5 because there are
significantly fewer features:
mdlname = 'rf_5_f-mic'
stime = timeit.default_timer()
reg_mdl = xgb.XGBRFRegressor(max_depth=5, n_estimators=200, seed=rand)
fitted_mdl = reg_mdl.fit(X_train[mic_cols], y_train)
:
reg_mdls[mdlname]['num_feat'] =\
sum(reg_mdls[mdlname]['fitted'].feature_importances_ > 0)

Now, let's plot the profits for test and train as we did in Figure 10.3 but for the MIC model. It will produce what's
shown in Figure 10.5:
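Here is a sketch of the code that could generate it, mirroring the compare_df_plots comparison we make at the end of the chapter (it assumes the profits dataframes were stored under the 'rf_5_f-mic' key during the elided evaluation code, and that y_formatter was defined earlier, as for the other models):
profits_test = reg_mdls['rf_5_f-mic']['profits_test']
profits_train = reg_mdls['rf_5_f-mic']['profits_train']
mldatasets.compare_df_plots(
    profits_test[['costs', 'profit', 'roi']],
    profits_train[['costs', 'profit', 'roi']],
    'Test', 'Train', x_label='Threshold',
    y_formatter=y_formatter,
    plot_args={'secondary_y':'roi'})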
Figure 10.5 – Comparison between profit, costs, and ROI for the test and train datasets for a model with MIC
features across thresholds

In Figure 10.5, you can tell that there is quite a bit of difference between Test and Train, yet similarities indicate
minimal overfitting. For instance, the highest profitability can be found between 0.66 and 0.75 for Train, and
while Test is mostly between 0.66 and 0.7, it only gradually decreases afterward.

Although we have visually examined the MIC model, it's nice to have some reassurance by looking at raw metrics.
Next, we will compare all the models we have trained so far using consistent metrics.

Comparing filter-based methods

We have been saving metrics into a dictionary ( reg_mdls ), which we easily convert to a dataframe and output as
we have done before, but this time we sort by max_profit_test :
display_mdl_metrics(reg_mdls, 'max_profit_test')

The preceding snippet generated what is shown in Figure 10.6. It is evident that the filter MIC model is the least
overfitted of all. It ranked higher than more-complex models with more features and took less time to train than
any model. Its speed is an advantage for hyperparameter tuning. What if we wanted to find the best classification
target thresholds or MIC k? We won't do this now, but we could likely get a better model if we ran every
combination; however, that would take time, and even more so with more features:
Figure 10.6 – Comparing metrics for all base models and filter-based feature-selected models

In Figure 10.6, you can tell that the correlation filter model ( f-corr ) performs worse than the model with more
features and the same max_depth , which suggests that we must have removed an important feature. As
cautioned in that section, the problem with blindly setting thresholds and removing anything above them is that you
can inadvertently remove something useful. Not all extremely correlated and uncorrelated features are useless, so
further inspection is required. Next, we will explore some embedded methods that, when combined with cross-
validation, require less oversight.

Exploring embedded feature selection methods


Embedded methods exist within models themselves by naturally selecting features during training. You can
leverage the intrinsic properties of any model that has them to capture the features selected:

Tree-based models: For instance, we have used the following code many times to count the number of
features used by the RF models, which is evidence of feature selection naturally occurring in the learning
process:
sum(reg_mdls[mdlname]['fitted'].feature_importances_ > 0)

To compute feature importance, XGBoost's RF uses gain by default, which is the average decrease in error across all
splits where the feature was used. We can increase the threshold above 0 to select even fewer features according to this
relative contribution (see the sketch after this list). However, by constraining the trees' depth, we have already forced
the model to choose fewer features.
Regularized models with coefficients: We will study this further in Chapter 12, Monotonic Constraints and
Model Tuning for Interpretability, but many model classes can incorporate penalty-based regularization, such
as L1, L2, and elastic net. However, not all of them have intrinsic parameters such as coefficients that can be
extracted to determine which features were penalized.
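Returning to the first point, a minimal sketch of selecting features by a gain-based importance threshold could look like this (the threshold value, and the assumption that rf_11_all was fit on X_train_orig 's columns, are for illustration only):
# Keep only features whose gain-based importance exceeds a small threshold
gain_thresh = 0.001  # assumption; tune to your needs
fitted_rf = reg_mdls['rf_11_all']['fitted']  # assumes it was fit on X_train_orig's columns
gain_cols = X_train_orig.columns[
    fitted_rf.feature_importances_ > gain_thresh].tolist()
print(len(gain_cols))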

This section will only cover regularized models given that we are using a tree-based model already. It's best to
leverage different model classes to get different perspectives of what features matter the most.

We covered some of these models in Chapter 3, Interpretation Challenges, but these are a few model classes that
incorporate penalty-based regularization and output feature-specific coefficients:

Least absolute shrinkage and selection operator (LASSO): Because it uses L1 penalty in the loss function,
LASSO can set coefficients to 0.
Least-angle regression (LARS): Similar to LASSO but is vector-based and is more suitable to high-
dimensional data. It is also fairer toward equally correlated features.
Ridge regression: Uses the L2 penalty in the loss function and, because of this, can only shrink the coefficients of
irrelevant features close to 0, but not all the way to 0.
Elastic net regression: Uses a mix of both L1 and L2 norms as penalties.
Logistic regression: Contingent on the solver, it can handle L1, L2, or elastic net penalties.

There are also several variations of the preceding models, such as LASSO LARS, which is a LASSO fit using the
LARS algorithm, or even LASSO LARS IC, which is the same but uses the AIC or BIC criteria for model
selection:

Akaike's Information Criteria (AIC): A relative goodness of fit measure founded in information theory
Bayesian Information Criteria (BIC): Has a similar formula to AIC but has a different penalty term

OK, now let's use SelectFromModel to extract top features from a LASSO model. We will use LassoCV because
it can automatically cross-validate to find the optimal penalty strength. Once we fit it, we can get the feature mask
with get_support() . We can then print the number of features and the list of features:
lasso_selection = SelectFromModel(LassoCV(n_jobs=-1, random_state=rand))
lasso_selection.fit(X_train, y_train)
lasso_cols = X_train.columns[lasso_selection.get_support()].tolist()
print(len(lasso_cols))
print(lasso_cols)

The preceding code outputs the following:


7
['ODATEDW', 'TCODE', 'POP901', 'POP902', 'HV2', 'RAMNTALL', 'MAXRDATE']

Now, let's try the same but with LassoLarsCV :


llars_selection = SelectFromModel(LassoLarsCV(n_jobs=-1))
llars_selection.fit(X_train, y_train)
llars_cols = X_train.columns[llars_selection.get_support()].tolist()
print(len(llars_cols))
print(llars_cols)

The preceding snippet produces the following output:


8
['RECPGVG', 'MDMAUD', 'HVP3', 'RAMNTALL', 'LASTGIFT', 'AVGGIFT', 'MDMAUD_A', 'DOMAIN_SOCIALCLS']

Lasso shrank the coefficients for all but seven features to 0, and Lasso LARS did the same but for eight. However,
notice how there's no overlap between the two lists! OK, so let's try incorporating AIC model selection into Lasso
Lars with LassoLarsIC :
llarsic_selection = SelectFromModel(LassoLarsIC(criterion='aic'))
llarsic_selection.fit(X_train, y_train)
llarsic_cols = X_train.columns[llarsic_selection.get_support()].tolist()
print(len(llarsic_cols))
print(llarsic_cols)

The preceding snippet generates the following output:


111
['TCODE', 'STATE', 'MAILCODE', 'RECINHSE', 'RECP3', 'RECPGVG', 'RECSWEEP',..., 'DOMAIN_URBANICITY

It's the same algorithm but with a different method for selecting the value of the regularization parameter. Note
how this less-conservative approach expands the number of features to 111. Now, so far, all of the methods we
have used have the L1 norm. Let's try one with L2—more specifically, L2-penalized logistic regression. We do
exactly what we did before, but this time we fit with the binary classification targets ( y_train_class ):
log_selection = SelectFromModel(LogisticRegression(C=0.0001,\
solver='sag', penalty='l2',\
n_jobs=-1, random_state=rand))
log_selection.fit(X_train, y_train_class)
log_cols = X_train.columns[log_selection.get_support()].tolist()
print(len(log_cols))
print(log_cols)

The preceding code produces the following output:


87
['ODATEDW', 'TCODE', 'STATE', 'POP901', 'POP902', 'POP903', 'ETH1', 'ETH2', 'ETH5', 'CHIL1', 'HHN
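For completeness, elastic net, which was listed previously but isn't evaluated further, could be wired in exactly the same way. Here's a sketch (the l1_ratio grid is an assumption):
from sklearn.linear_model import ElasticNetCV  # if not already loaded

# Cross-validate the L1/L2 mix, then keep features above SelectFromModel's default threshold
enet_selection = SelectFromModel(
    ElasticNetCV(l1_ratio=[.25, .5, .75], n_jobs=-1, random_state=rand))
enet_selection.fit(X_train, y_train)
enet_cols = X_train.columns[enet_selection.get_support()].tolist()
print(len(enet_cols))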

Now that we have a few feature subsets to test, we can place their names into a list ( fsnames ) and the feature
subset lists into another list ( fscols ):
fsnames = ['e-lasso', 'e-llars', 'e-llarsic', 'e-logl2']
fscols = [lasso_cols, llars_cols, llarsic_cols, log_cols]

We can then iterate across all list names and fit and evaluate our XGBRFRegressor model as we have done before
but increasing max_depth at every iteration:
def train_mdls_with_fs(reg_mdls, fsnames, fscols, depths):
    for i, fsname in tqdm(enumerate(fsnames), total=len(fsnames)):
        depth = depths[i]
        cols = fscols[i]
        mdlname = 'rf_'+str(depth)+'_'+fsname
        stime = timeit.default_timer()
        reg_mdl = xgb.XGBRFRegressor(max_depth=depth, n_estimators=200,\
                                     seed=rand)
        fitted_mdl = reg_mdl.fit(X_train[cols], y_train)
        :
        reg_mdls[mdlname]['num_feat'] =\
            sum(reg_mdls[mdlname]['fitted'].feature_importances_ > 0)
train_mdls_with_fs(reg_mdls, fsnames, fscols, [3, 4, 5, 6])

Now, let's see how our embedded feature-selected models fare in comparison to the filtered ones. We will rerun the
code we ran to output what was shown in Figure 10.6. This time, we will get what is shown in Figure 10.7:
Figure 10.7 – Comparing metrics for all base models and filter-based and embedded feature-selected models

According to Figure 10.7, three out of the four embedded methods we tried produced models with the lowest test
RMSE. They also all train much faster than any others and are more profitable than any other model of equal
complexity. One of them ( rf_5_e-llarsic ) is even highly profitable. Compare it with rf_9_all , which has similar
test profitability, to see how performance diverges on the training data.

Discovering wrapper, hybrid, and advanced feature selection methods


The feature selection methods studied so far are computationally inexpensive because they either require no model fitting
at all or only fit simpler white-box models. In this section, we will learn about other, more exhaustive methods with many
possible tuning options. The categories of methods included here are as follows:

Wrapper: Exhaustively look for the best subset of features by fitting an ML model using a search strategy
that measures improvement on a metric.
Hybrid: A method that combines embedded and filter methods with wrapper methods.
Advanced: A method that doesn't fall into any of the previously discussed categories. Examples include
dimensionality reduction, model-agnostic feature importance, and GAs.

And now, let's get started with wrapper methods!

Wrapper methods

The concept behind wrapper methods is reasonably simple: evaluate different subsets of features on the ML
model and choose the one that achieves the best score in a predetermined objective function. What varies here is
the search strategy:

Sequential forward selection (SFS): This approach begins without a feature and adds one, one at a time.
Sequential forward floating selection (SFFS): Same as the previous except for every feature it adds, it can
remove one as long as the objective function increases.
Sequential backward selection (SBS): This process begins with all features present and eliminates one
feature at a time.
Sequential floating backward selection (SFBS): Same as the previous except for every feature it removes, it
can add one as long as the objective function increases.
Exhaustive feature selection (EFS): This approach seeks all possible combinations of features.
Bidirectional search (BDS): This last one simultaneously allows both forward and backward feature
selection to get one unique solution.

These methods are greedy algorithms because they solve the problem piece by piece, choosing pieces based on
their immediate benefit. Even though they may arrive at a global maximum, they take an approach more suited for
finding local maxima. Depending on the number of features, they might be too computationally expensive to be
practical, especially EFS, which grows combinatorially.

To allow for shorter search times, we will do two things:

1. Start our search with the features collectively selected by other methods to have a smaller feature space to
choose from. To that end, we combine feature lists from several methods into a single top_cols list:
top_cols = list(set(mic_cols).union(set(llarsic_cols)).\
union(set(log_cols)))
len(top_cols)

2. Sample our datasets so that the ML models train faster. We can use np.random.choice to do a random selection of
row indexes without replacement:
sample_size = 0.1
sample_train_idx = np.random.choice(X_train.shape[0],\
math.ceil(X_train.shape[0]*sample_size),\
replace=False)
sample_test_idx = np.random.choice(X_test.shape[0],\
math.ceil(X_test.shape[0]*sample_size),\
replace=False)

Out of the wrapper methods presented, we will only perform SFS given how time-consuming they are. Still, with
an even smaller dataset, you can try the other options, which the mlxtend library also supports.

Sequential forward selection (SFS)

The first argument of a wrapper method is an unfitted estimator (a model). In SequentialFeatureSelector , we
are placing a LinearDiscriminantAnalysis model. Other arguments include the direction ( forward=True ),
whether it's floating ( floating=False ), the number of features we wish to select ( k_features=100 ), the number
of cross-validations ( cv=3 ), and the scoring function to use ( scoring='f1' ). Some recommended optional arguments
to enter are the verbosity ( verbose=2 ) and the number of jobs to run in parallel ( n_jobs=-1 ). Since it could take
a while, you'll definitely want it to output something and use as many processors as possible:
sfs_lda = SequentialFeatureSelector(\
LinearDiscriminantAnalysis(n_components=1), forward=True,\
floating=False, k_features=100, cv=3, scoring='f1',\
verbose=2, n_jobs=-1)
sfs_lda = sfs_lda.fit(X_train.iloc[sample_train_idx][top_cols],\
y_train_class[sample_train_idx])
sfs_lda_cols = X_train.columns[list(sfs_lda.k_feature_idx_)].tolist()

Once we fit the SFS, it will return the index of features that have been selected with k_feature_idx_ , and we can
use those to subset the columns and obtain the list of feature names.

Hybrid methods
Starting with 435 features, there are over 10⁴² combinations of 27-feature subsets alone! So, you can see how EFS
would be impractical on such a large feature space. Therefore, except for EFS on the entire dataset, wrapper
methods will invariably take some shortcuts to select the features. Whether you are going forward, backward, or
both, as long as you are not assessing every single combination of features, you could easily miss out on the best
one.

However, we can leverage the more rigorous, exhaustive search approach of wrapper methods with filter and
embedded methods' efficiency. The result of this is hybrid methods. For instance, you could employ filter or
embedded methods to derive only the top-10 features and perform EFS or SBS on only those.
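For instance, here's a sketch of the second idea, running EFS over only the eight features Lasso LARS selected earlier, using mlxtend's ExhaustiveFeatureSelector (the estimator and scoring choices are assumptions that mirror the SFS example above):
from mlxtend.feature_selection import ExhaustiveFeatureSelector

# Exhaustively evaluate every combination of 3 to 8 of the e-llars features
efs_lda = ExhaustiveFeatureSelector(
    LinearDiscriminantAnalysis(n_components=1),
    min_features=3, max_features=8, scoring='f1', cv=3, n_jobs=-1)
efs_lda = efs_lda.fit(X_train.iloc[sample_train_idx][llars_cols],
                      y_train_class[sample_train_idx])
efs_lda_cols = list(efs_lda.best_feature_names_)
print(efs_lda_cols)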

Recursive feature elimination

Another, more common approach is something such as SBS, but instead of removing features based on improving
a metric alone, using the model's intrinsic parameters to rank the features and only removing the least ranked. The
name of this approach is RFE, and it is a hybrid between embedded and wrapper methods. You can only use
models with feature_importances_ or coefficients ( coef_ ) because this is how the method knows what features
to remove. Model classes in scikit-learn with these attributes are classified under linear_model , tree , and
ensemble . Also, the scikit-learn-compatible versions of XGBoost, LightGBM, and CatBoost have
feature_importances_ .

We will use the cross-validated version of RFE because it's more reliable. RFECV takes the estimator first
( LinearDiscriminantAnalysis ). We can then define step , which sets how many features it should remove in
every iteration, the number of cross-validations ( cv ), and the metric used for evaluation ( scoring ). Lastly, it is
recommended to set the verbosity ( verbose=2 ) and leverage as many processors as possible ( n_jobs=-1 ). To
speed it up, we will again use a sample for training and start with the 267 features in top_cols :
rfe_lda = RFECV(LinearDiscriminantAnalysis(n_components=1),\
step=2, cv=3, scoring='f1', verbose=2, n_jobs=-1)
rfe_lda.fit(X_train.iloc[sample_train_idx][top_cols],
y_train_class[sample_train_idx])
rfe_lda_cols = np.array(top_cols)[rfe_lda.support_].tolist()

Next, we will try different methods that don't relate to the main three feature selection categories: filter, embedded,
and wrapper.

Advanced methods

Many methods can be categorized under advanced feature selection methods, including the following
subcategories:

Model-agnostic feature importance: Any feature importance method covered in Chapter 4, Global
Model-Agnostic Interpretation Methods, can be used to obtain the top features of a model for feature
selection purposes.
GA: This is a wrapper method in the sense that it "wraps" a model assessing predictive performance across
many feature subsets. However, unlike the wrapper methods we examined, it's not greedy, and it's more
optimized to work with large feature spaces. It's called genetic because it's inspired by biology—natural
selection, specifically.
Dimensionality reduction: Some dimensionality reduction methods, such as Principal Component
Analysis (PCA), can return explained variance on a feature basis. For others, such as factor analysis, it can be
derived from other outputs. Explained variance can be used to rank features.
Auto-encoders: We won't delve into this one, but deep learning can be leveraged for feature selection with
auto-encoders.

We will briefly cover the first two in this section so you can understand how they can be implemented. Let's dive
right in!

Model-agnostic feature importance

A popular model-agnostic feature importance method that we have used throughout this book is SHAP, and it has
many properties that make it more reliable than other methods. In the following code, we can take our best model
and extract shap_values for it using TreeExplainer :
fitted_rf_mdl = reg_mdls['rf_11_all']['fitted']
shap_rf_explainer = shap.TreeExplainer(fitted_rf_mdl)
shap_rf_values =\
shap_rf_explainer.shap_values(X_test_orig.iloc[sample_test_idx])
shap_imps = pd.DataFrame({'col':X_train_orig.columns,\
'imp':np.abs(shap_rf_values).mean(0)}).\
sort_values(by='imp',ascending=False)
shap_cols = shap_imps.head(120).col.tolist()

Then, averaging the absolute value of the SHAP values across the first dimension provides us with a ranking for each
feature. We put these values in a dataframe and sort them by importance. Lastly, we take the top
120 features and place them in a list ( shap_cols ).

Genetic algorithms

GAs are a stochastic global optimization technique inspired by natural selection, which wrap a model much like
wrapper methods do. However, they don't follow a sequence on a step-by-step basis. GAs don't have iterations but
generations, which include populations of chromosomes. Each chromosome is a binary representation of your
feature space where 1 means to select a feature and 0 to not. Each generation is produced with the following
operations:

Selection: Like with natural selection, this is partially random (exploration) and partially based on what has
already worked (exploitation). What has worked is its fitness. Fitness is assessed with a "scorer" much like
wrapper methods. Poor fitness chromosomes are removed, whereas good ones get to reproduce through
"crossover."
Crossover: Randomly, some good bits (or features) of each parent go to a child.
Mutation: Even when a chromosome has proved effective, given a low mutation rate, it will occasionally
mutate or flip one of its bits, in other words, features.

The Python implementation we will use has many options. We won't explain all of them here, but they are well
documented in the code should you be interested. The first argument is the estimator. We can also define the
cross-validation iterations ( cv=3 ) and the scoring used to determine whether chromosomes are fit. There are some important
probabilistic properties, such as the probability that a bit will mutate ( mutation_probability ) and that bits will get
exchanged between parents ( crossover_probability ). Generation-wise, n_gen_no_change provides a means for early stopping
if generations haven't improved, and generations sets how many to run (we will use 5). You can fit
GAFeatureSelectionCV as you would any model. It can take a while, so it is best to define the verbosity and allow it
to use all the processing capacity. Once finished, we can use the Boolean mask ( best_features_ ) to subset the features:
ga_rf = GAFeatureSelectionCV(RandomForestRegressor(random_state=rand, max_depth=3),\
                             crossover_probability=0.8, mutation_probability=0.1,\
                             generations=5, n_jobs=-1)
ga_rf = ga_rf.fit(X_train.iloc[sample_train_idx][top_cols].values,\
y_train[sample_train_idx])
ga_rf_cols = np.array(top_cols)[ga_rf.best_features_].tolist()

OK, now that we have covered a wide variety of wrapper, hybrid, and advanced feature selection methods in this
section, let's evaluate all of them at once and compare results.

Evaluating all feature-selected models

As we have done with embedded methods, we can place feature subset names ( fsnames ), lists ( fscols ), and
corresponding depths in lists:
fsnames = ['w-sfs-lda', 'h-rfe-lda', 'a-shap', 'a-ga-rf']
fscols = [sfs_lda_cols, rfe_lda_cols, shap_cols, ga_rf_cols]
depths = [5, 6, 5, 6]

Then, we can use the two functions we created to first iterate across all feature subsets, training and evaluating a
model with them. Then the second function outputs the results of the evaluation in a dataframe with previously
trained models:
train_mdls_with_fs(reg_mdls, fsnames, fscols, depths)
display_mdl_metrics(reg_mdls, 'max_profit_test', max_depth=7)

This time, we are limiting the models to those with no more than a depth of seven, since deeper models are very overfitted. The
result of the snippet is depicted in Figure 10.8:

Figure 10.8 – Comparing metrics for all feature-selected models

Figure 10.8 shows how feature-selected models are more profitable than ones that include all the features
compared at the same depths. Also, the embedded Lasso LARS with AIC ( e-llarsic ) method and the MIC
( f-mic ) filter method outperform all wrapper, hybrid, and advanced methods with the same depths. Still, we also
impeded these methods by using a sample of the training dataset, which was necessary to speed up the process.
Maybe they would have outperformed the top ones otherwise. However, the three feature selection methods that
follow are pretty competitive:

RFE with LDA: Hybrid method ( h-rfe-lda )
Logistic regression with L2 regularization: Embedded method ( e-logl2 )
GAs with RF: Advanced method ( a-ga-rf )

It would make sense to spend many days running many variations of the methods reviewed in this book. For
instance, perhaps RFE with L1 regularized logistic regression or GA with support vector machines with additional
mutation yields the best model. There are so many different possibilities! Nevertheless, if you were forced to make
a recommendation based on Figure 10.8, by profit alone, the 111-feature e-llarsic is the best option, but it also
has higher minimum costs and lower maximum ROI than any of the top models. There's a trade-off. And even
though it has among the highest test RMSEs, the 160-feature model ( f-mic ) has a similar spread between max
profit train and test and beat it in max ROI and min costs. Therefore, these are the two reasonable options. But
before making a final determination, profitability would have to be compared side by side across different
thresholds to assess when each model can make the most reliable predictions and at what costs and ROIs.

Considering feature engineering


Let's assume that the non-profit has chosen to use the model whose features were selected with Lasso LARS with
AIC ( e-llarsic ) but would like to evaluate whether you can improve it further. Now that you have removed over
300 features that might have only marginally improved predictive performance but mostly added noise, you are
left with more relevant features. However, you also know that the 8 features selected by e-llars produced the same
RMSE as the 111 features. This means that while there's something in those extra features that improves
profitability, it does not improve the RMSE.

From a feature selection standpoint, many things can be done to approach this problem. For instance, examine the
overlap and difference of features between e-llarsic and e-llars , and do feature selection variations strictly
on those features to see whether the RMSE dips on any combination while keeping or improving on current
profitability. However, there's also another possibility, which is feature engineering. There are a few important
reasons you would want to perform feature engineering at this stage:

Make model interpretation easier to understand: For instance, sometimes features have a scale that is not
intuitive, or the scale is intuitive but the distribution makes it hard to understand. As long as transformations
to these features don't worsen model performance, there's value in transforming the features to understand the
outputs of interpretation methods better. As you train models on more engineered features, you realize what
works and why it does. This will help you understand the model and, more importantly, the data.
Place guardrails on individual features: Sometimes, features have an uneven distribution, and models tend
to overfit in sparser areas of the feature's histogram or where influential outliers exist.
Clean up counterintuitive interactions: Some interactions that models find make no sense and only exist
because the features correlate, but not for the right reasons. They could be confounding variables or perhaps
even redundant ones (such as the one we found in Chapter 4, Global Model-Agnostic Interpretation
Methods). You could decide to engineer an interaction feature or remove a redundant one.

In reference to the last two reasons, we will examine feature engineering strategies in more detail in Chapter 12,
Monotonic Constraints and Model Tuning for Interpretability. This section will focus on the first reason,
particularly because it's a good place to start since it will allow you to understand the data better until you know it
well enough to make more transformational changes.

So, we are left with 111 features but have no idea how they relate to the target or each other. The first thing we
ought to do is run a feature importance method. We can use SHAP's TreeExplainer on the e-llarsic model.
An advantage of TreeExplainer is that it can compute SHAP interaction values, shap_interaction_values ,
instead of outputting an array of (N, 111) dimensions where N is the number of observations as shap_values
does; it will output (N, 111, 111) . You can produce a summary_plot graph with it that ranks both individual
features and interactions. The only difference for interaction values is you use plot_type="compact_dot" :
winning_mdl = 'rf_5_e-llarsic'
fitted_rf_mdl = reg_mdls[winning_mdl]['fitted']
shap_rf_explainer = shap.TreeExplainer(fitted_rf_mdl)
shap_rf_interact_values = shap_rf_explainer.\
shap_interaction_values(X_test.\
iloc[sample_test_idx][llarsic_cols])
shap.summary_plot(shap_rf_interact_values,\
X_test.iloc[sample_test_idx][llarsic_cols],
plot_type="compact_dot", sort=True)

The preceding snippet produces the SHAP interaction summary plot shown in Figure 10.9:

Figure 10.9 – SHAP interaction summary plot

You can read Figure 10.9 as you would any summary plot except it includes bivariate interactions twice—first
with one feature and then with another. For instance, MDMAUD_A* - CLUSTER is the interaction SHAP values for
that interaction from MDMAUD_A 's perspective, so the feature values correspond to that feature alone, but the SHAP
values are for the interaction. One thing that we can agree on here is that the plot is hard to read, given the scale of
the importance values and the complexity of comparing bivariate interactions in no particular order. We will address this later.

Throughout this book, chapters with tabular data have started with a data dictionary. This one was an exception,
given that there were 435 features to begin with. Now, it makes sense to at the very least understand what the top
features are. The complete data dictionary can be found here,
https://kdd.ics.uci.edu/databases/kddcup98/epsilon_mirror/cup98dic.txt, but some of the features have already
been changed because of categorical encoding, so we will explain them in more detail here:

MAXRAMNT: Continuous, the dollar amount of the largest gift to date


HVP2: Discrete, percentage of homes with a value of >= $150,000 in the neighborhoods of donors (values
between 0 and 100)
LASTGIFT: Continuous, the dollar amount of the most recent gift
RAMNTALL: Continuous, the dollar amount of lifetime gifts to date
AVGGIFT: Continuous, the average dollar amount of gifts to date
MDMAUD_A: Ordinal, the donation amount code for donors who have given a $100 + gift at any time in
their giving history (values between 0 and 3, -1 for those who have never exceeded $100). The amount code
is the third byte of an RFA (recency/frequency/amount) major customer matrix code, which is the amount
given. The categories are as follows:

0: Less than $100 (low dollar)

1: $100 – 499 (core)

2: $500 – 999 (major)

3: $1,000 + (top)

NGIFTALL: Discrete, number of lifetime gifts to date


AMT_14: Ordinal, donation amount code of the RFA for the 14th previous promotion (2 years prior), which
corresponds to the last dollar amount given back then:

0: $0.01 – 1.99

1: $2.00 – 2.99

2: $3.00 – 4.99

3: $5.00 – 9.99

4: $10.00 – 14.99

5: $15.00 – 24.99

6: $25.00 and above

DOMAIN_SOCIALCLS: Nominal, socio-economic status (SES) of the neighborhood, which combines
with DOMAIN_URBANICITY (0: Urban, 1: City, 2: Suburban, 3: Town, 4: Rural), meaning the following:

1: Highest SES

2: Average SES, except above average for urban communities

3: Lowest SES, except below average for urban communities

4: Lowest SES for urban communities only

CLUSTER: Nominal, code indicating which cluster group the donor falls in
MINRAMNT: Continuous, dollar amount of the smallest gift to date
LSC2: Discrete, percentage of Spanish-speaking families in the donor's neighborhood (values between 0 and
100)
IC15: Discrete, percentage of families with an income of < $15,000 in the donor's neighborhood (values
between 0 and 100)

The following insights can be distilled by the preceding dictionary and Figure 10.9:

Gift amounts prevail: Seven of the top features pertain to gift amounts, whether it's a total, min, max,
average, or last. If you include the count of gifts ( NGIFTALL ), there are eight features involving donation
history, which makes complete sense. So, why is this relevant? Because they are likely highly correlated, and
understanding how could hold the key to improving the model. Perhaps other features can be created
that distill these relationships much better.
High values of continuous gift amount features have high SHAP values: Plot a box plot of any of those
features like this, plt.boxplot(X_test.MAXRAMNT) , and you'll see how right-skewed these features are.
Perhaps a transformation such as breaking them into bins—called "discretization"—or using a different scale
such as logarithmic (try plt.boxplot(np.log(X_test.MAXRAMNT)) ) can help interpret these features but also
help find the pockets where the likelihood of donation dramatically increases (see the sketch after this list).
Relationship with the 14th previous promotion: What happened 2 years before that connects that promotion
to the one denoted in the dataset labels? Were the promotional materials similar? Is there a
seasonality factor occurring at the same time every couple of years? Maybe you can engineer a feature that
better identifies this phenomenon.
Inconsistent classifications: DOMAIN_SOCIALCLS has different categories depending on the
DOMAIN_URBANICITY value. We can make this consistent by using all five categories in the scale (Highest,
Above Average, Average, Below Average, and Lowest) even if this means non-urban donors would be using
only three. The advantage to doing this would be easier interpretation, and it's highly unlikely it would
adversely impact the model's performance.
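To illustrate the second point above, here's a sketch of two possible transformations of MAXRAMNT : a log scale and quantile-based bins (the number of bins is an assumption):
# Log-transform to tame the right skew (log1p handles zeros safely)
plt.boxplot(np.log1p(X_test.MAXRAMNT))
plt.show()
# Or discretize into ten quantile-based bins and inspect the bin counts
maxramnt_binned = pd.qcut(X_test.MAXRAMNT, q=10, duplicates='drop')
print(maxramnt_binned.value_counts().sort_index())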

The SHAP interaction summary plot can be useful for identifying feature and interaction rankings and some
commonalities between them, but in this case (see Figure 10.9) it was hard to read. But to dig deeper into
interactions, you first need to quantify their impact. To this end, let's create a heatmap with only the top
interactions as measured by their mean absolute SHAP value ( shap_rf_interact_avgs ). We should then set all
the diagonal values to 0 ( shap_rf_interact_avgs_nodiag ) because these aren't interactions but feature SHAP
values, and it's easier to observe the interactions without them. We can place this matrix in a dataframe but it's a
dataframe of 111 columns and 111 rows, so to filter it by those features with those most interactions, we sum them
and rank them with scipy 's rankdata . Then, we use the ranking to identify the 12 most interactive features
( most_interact_cols ) and subset the dataframe by them. Finally, we plot the dataframe as a heatmap:
shap_rf_interact_avgs = np.abs(shap_rf_interact_values).mean(0)
shap_rf_interact_avgs_nodiag = shap_rf_interact_avgs.copy()
np.fill_diagonal(shap_rf_interact_avgs_nodiag, 0)
shap_rf_interact_df = pd.DataFrame(shap_rf_interact_avgs_nodiag)
shap_rf_interact_df.columns = X_test[llarsic_cols].columns
shap_rf_interact_df.index = X_test[llarsic_cols].columns
shap_rf_interact_ranks = 112 -\
rankdata(np.sum(shap_rf_interact_avgs_nodiag, axis=0))
most_interact_cols =\
shap_rf_interact_df.columns[shap_rf_interact_ranks < 13]
shap_rf_interact_df =\
shap_rf_interact_df.loc[most_interact_cols,most_interact_cols]
sns.heatmap(shap_rf_interact_df, cmap='Blues', annot=True,\
annot_kws={'size':10}, fmt='.3f', linewidths=.5)

The preceding snippet outputs what is shown in Figure 10.10. It depicts the most salient feature interactions
according to SHAP interaction absolute mean values. Note that these are averages, so given how right-skewed
most of these features are, it is likely much higher for many observations. However, it's still a good indication of
relative impact:
Figure 10.10 – SHAP interactions heatmap

One way in which we can understand feature interactions one by one is with SHAP's dependence_plot . For
instance, we can take our top feature, MAXRAMNT , and plot it with color-coded interactions with features such as
RAMNTALL , LSC4 , HVP2 , and AVGGIFT . But first, we will need to compute shap_values . There are a couple of
problems, though, that need to be addressed, which we mentioned earlier. They have to do with the following:

The prevalence of outliers: We can cut them out of the plot by limiting the x and y axes using percentiles for
the feature and SHAP values, respectively, with plt.xlim and plt.ylim . This essentially zooms in to cases
that lie between the 1st and 99th percentiles.
Lopsided distribution of dollar amount features: It is common in any feature involving money for it to be
right-skewed. There are many ways to simplify it, such as using percentiles to bin the feature, but a quick way
to make it easier to appreciate is by using a logarithmic scale. In matplotlib , you can do this with
plt.xscale('log') without any need to transform the feature.

The following code accounts for the two issues. You can try commenting out xlim , ylim , or xscale to see the
big difference they individually make in understanding dependence_plot :
shap_rf_values =\
shap_rf_explainer.shap_values(X_test.iloc[sample_test_idx]\
[llarsic_cols])
maxramt_shap = shap_rf_values[:,llarsic_cols.index("MAXRAMNT")]
shap.dependence_plot("MAXRAMNT", shap_rf_values,\
X_test.iloc[sample_test_idx][llarsic_cols],\
interaction_index="AVGGIFT", show=False, alpha=0.1)
plt.xlim(xmin=np.percentile(X_test.MAXRAMNT, 1),\
xmax=np.percentile(X_test.MAXRAMNT, 99))
plt.ylim(ymin=np.percentile(maxramt_shap, 1),\
ymax=np.percentile(maxramt_shap, 99))
plt.xscale('log')

The preceding code generates what is shown in Figure 10.11. It shows how there's a tipping point somewhere
between 10 and 100 for MAXRAMNT where the mean impact on the model output starts to creep up, and these
higher impacts correlate with a higher AVGGIFT value:

Figure 10.11 – SHAP interaction plot between MAXRAMNT and AVGGIFT

A lesson you could take from Figure 10.11 is that a cluster is formed by certain values of these features, and
possibly a few others, that increases the likelihood of a donation. From a feature engineering standpoint, you
could use unsupervised methods to create special cluster features based solely on the few features you have
identified as related. Or you could take a more manual route, comparing different plots to understand how to best
identify clusters. You could derive binary features from this process or even a ratio between features that more
clearly depicts interactions or cluster belonging.
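For example, here's a sketch of both routes: a k-means cluster feature over a pair of gift-amount features and a simple ratio feature (the feature pair, number of clusters, and new column names are all assumptions):
from sklearn.cluster import KMeans  # if not already loaded

gift_feats = ['MAXRAMNT', 'AVGGIFT']  # assumption: pair identified as related
# Cluster on a log scale to reduce the influence of the right skew
kmeans = KMeans(n_clusters=4, random_state=rand, n_init=10)
X_train['GIFT_CLUSTER'] = kmeans.fit_predict(np.log1p(X_train[gift_feats]))
X_test['GIFT_CLUSTER'] = kmeans.predict(np.log1p(X_test[gift_feats]))
# Or a ratio capturing how the largest gift compares to the average gift
X_train['MAX_TO_AVG_GIFT'] = X_train.MAXRAMNT / (X_train.AVGGIFT + 1)
X_test['MAX_TO_AVG_GIFT'] = X_test.MAXRAMNT / (X_test.AVGGIFT + 1)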

The idea here is not to reinvent the wheel trying to do what the model already does so well but to, first and
foremost, aim for a more straightforward model interpretation. Hopefully, that will even have a positive impact on
predictive performance by tidying up the features, because if you understand them better, maybe the model does so
too! It's like smoothing a grainy image; it might confuse you less and the model too (see Chapter 13, Adversarial
Robustness, for more on that)! But understanding the data better through the model has other positive side effects.

In fact, the lessons don't stop with feature engineering or modeling but can be directly applied to promotions. What
if tipping points identified could be used to encourage donations? Perhaps get a free mug if you donate over $X?
Or set up a recurring donation of $X and be in the exclusive list of "silver" patrons?

We will end this topic on that curious note, but hopefully, this inspires you to appreciate how we can apply lessons
from model interpretation to feature selection, engineering, and much more.

Mission accomplished
To approach this mission, you have reduced overfitting using primarily the toolset of feature selection. The non-
profit is pleased with a profit lift of roughly 30%, costing a total of $35,601, which is $30,000 less than it would
cost to send everyone in the test dataset the mailer. However, they still want assurance that they can safely employ
this model without worries that they'll experience losses.

In this chapter, we've examined how overfitting can cause the profitability curves not to align. Misalignment is
critical because it could mean that choosing a threshold based on training data would not be reliable on out-of-
sample data. So, you use compare_df_plots to compare profitability between the test and train sets as you've
done before, but this time for the chosen model ( rf_5_e-llarsic ):
profits_test = reg_mdls['rf_5_e-llarsic']['profits_test']
profits_train = reg_mdls['rf_5_e-llarsic']['profits_train']
mldatasets.compare_df_plots(\
profits_test[['costs', 'profit', 'roi']],\
profits_train[['costs', 'profit', 'roi']], 'Test',\
'Train', x_label='Threshold', \
y_formatter=y_formatter,\
plot_args={'secondary_y':'roi'})

The preceding code generates what is shown in Figure 10.12. You can show this to the non-profit to prove that
there's a sweet spot at $0.68 that is the second highest profit attainable in Test. It is also within reach of their
budget and achieves an ROI of 41%. More importantly, these numbers are not far from what they are for Train.
Another thing that is great to see is that the profit curve slowly slides down for both Train and Test instead of
dramatically falling off a cliff. The non-profit can be assured that the operation would still be profitable if they
choose to increase the threshold. After all, they want to target donors from the entire mailing list, and for that to be
financially feasible, they have to be more exclusive. Say they are using a threshold of $0.77 on the entire mailing
list; the campaign would cost about $46,000 but return over $24,000 in profit:
Figure 10.12 – Comparison between profit, costs, and ROI for the test and train datasets for the model with Lasso
Lars via AIC features across different thresholds

Congratulations! You have accomplished this mission!

But there's one crucial detail we'd be remiss if we didn't bring up.

Although we trained this model with the next campaign in mind, the model will likely be used in future direct
marketing campaigns without retraining. Reusing the model in this way presents a problem. There's a concept called data
drift, also known as feature drift, which is that, over time, what the model learned about the features concerning
the target variable no longer holds true. Another, concept drift, is about how the definition of the target feature
changes over time. For instance, what constitutes a profitable donor can change. Both drifts can happen
simultaneously, and with problems involving human behavior, this is to be expected. Behavior is shaped by
cultures, habits, attitudes, technologies, and fashions, which are always evolving. You can caution the non-profit
that you can only assure that the model will be reliable for the next campaign, but they can't afford to hire you for
model retraining every single time!

You can propose to the client to create a script that monitors drift directly on their mailing list database. If it finds
significant changes in the features used by the model, it will alert both them and you. You could, at this point,
trigger automatic retraining of the model. However, if the drift is due to data corruption, you won't have an
opportunity to address the problem. And even if automatic retraining is done, it can't be deployed if performance
metrics don't meet predetermined standards. Either way, you should keep a close eye on predictive performance to
be able to guarantee reliability. Reliability is an essential theme in model interpretability because it relates heavily
to accountability. We won't cover drift detection in this book, but future chapters discuss data augmentation
(Chapter 11, Bias Mitigation and Causal Inference Methods) and adversarial robustness (Chapter 13), which
pertain to reliability.
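Although drift detection is out of scope, here's a minimal sketch of what such a monitoring script could check: comparing each model feature's distribution in fresh mailing-list data against the training data with a two-sample Kolmogorov-Smirnov test (the new_donors_df dataframe and the alert threshold are hypothetical):
from scipy.stats import ks_2samp  # if not already loaded

def detect_drift(train_df, new_df, cols, p_thresh=0.01):
    # Flag features whose distribution differs significantly from training
    drifted = []
    for col in cols:
        _, p_value = ks_2samp(train_df[col], new_df[col])
        if p_value < p_thresh:
            drifted.append(col)
    return drifted

# new_donors_df is a hypothetical dataframe of fresh mailing-list records
drifted_cols = detect_drift(X_train, new_donors_df, llarsic_cols)
if drifted_cols:
    print('Drift detected in %d features' % len(drifted_cols))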

Summary
In this chapter, we have learned about how irrelevant features impact model outcomes and how feature selection
provides a toolset to solve this problem. We then explored many different methods in this toolset, from the most
basic filter methods to the most advanced ones. Lastly, we broached the subject of feature engineering for
interpretability. Feature engineering can make for a more interpretable model that will perform better. We will
cover this topic in more detail in Chapter 12, Monotonic Constraints and Model Tuning for Interpretability. In the
next chapter, we will discuss methods for bias mitigation and causal inference.

Dataset sources
Ling, C., & Li, C. (1998). Data Mining for Direct Marketing: Problems and Solutions. In Proceedings of the
Fourth International Conference on Knowledge Discovery and Data Mining (KDD'98). AAAI Press, 73–79.
https://dl.acm.org/doi/10.5555/3000292.3000304
UCI Machine Learning Repository. (1998). KDD Cup 1998 Data Data Set.
https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1998+Data

Further reading
Ross, B.C. (2014). Mutual Information between Discrete and Continuous Data Sets. PLoS ONE, 9.
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0087357
P. Geurts, D. Ernst., & L. Wehenkel. (2006). Extremely randomized trees. Machine Learning, 63(1), 3-42.
https://link.springer.com/article/10.1007/s10994-006-6226-1
Abid, A., Balin, M.F., & Zou, J. (2019). Concrete Autoencoders for Differentiable Feature Selection and
Reconstruction. ICML. https://arxiv.org/abs/1901.09346
Tan, F., Fu, X., Zhang, Y., & Bourgeois, A.G. (2008). A genetic algorithm-based method for feature subset
selection. Soft Computing, 12, 111-120. https://link.springer.com/article/10.1007/s00500-007-0193-8
Manuel Calzolari. (2020, October 12). manuel-calzolari/sklearn-genetic: sklearn-genetic 0.3.0 (Version 0.3.0).
Zenodo. http://doi.org/10.5281/zenodo.4081754
14 What's Next for Machine
Learning Interpretability?
Join our book community on Discord
https://packt.link/EarlyAccessCommunity

Over the last thirteen chapters, we have explored the field of Machine
Learning (ML) interpretability. As stated in the preface, it's a broad area of
research, most of which hasn't even left the lab and become widely used yet,
and this book has no intention of covering absolutely all of it. Instead, the
objective is to present various interpretability tools in sufficient depth to be
useful as a starting point for beginners and even complement the knowledge
of more advanced readers. This chapter will summarize what we've learned
in the context of the ecosystem of ML interpretability methods, and then
speculate on what's to come next!

These are the main topics we are going to cover in this chapter:

Understanding the current landscape of ML interpretability


Speculating on the future of ML interpretability

Understanding the current landscape of ML interpretability
First, we will provide some context on how the book relates to the main
goals of ML interpretability and how practitioners can start applying the
methods to achieve those broad goals. Then, we'll discuss what the current
areas of growth in research are.
Tying everything together!

As discussed in Chapter 1, Interpretation, Interpretability, and
Explainability; and Why Does It All Matter?, there are three main themes
when talking about ML interpretability: Fairness, Accountability, and
Transparency (FAT), and each of these presents a series of concerns (see
Figure 14.1). I think we can all agree these are all desirable properties for a
model! Indeed, these concerns all present opportunities for the improvement
of Artificial Intelligence (AI) systems. These improvements start by
leveraging model interpretation methods to evaluate models, confirm or
dispute assumptions, and find problems.

What your aim is will depend on what stage you are at in the ML workflow.
If the model is already in production, the objective might be to evaluate it
with a whole suite of metrics, but if the model is still in early development,
the aim may be to find deeper problems that a metric won't discover. Perhaps
you are also just using black-box models for knowledge discovery as we did
in Chapters 4, and 5; in other words, leveraging the models to learn from the
data with no plan to take it into production. If this is the case, you might
confirm or dispute the assumptions you had about the data, and by extension,
the model.
Figure 14.1 – ML interpretation methods

In any case, none of these aims are mutually exclusive, and you should
probably always be looking for problems and disputing assumptions, even
when the model appears to be performing well!

And regardless of the aim and primary concern, it is recommended that you
use many interpretation methods, not only because no technique is perfect,
but also because all problems and aims are interrelated. In other words,
there's no justice without consistency and no reliability without transparency.
In fact, you can read Figure 14.1 from bottom to top as if it were a pyramid,
because transparency is foundational, followed by accountability in the
second tier, and, ultimately, fairness as the cherry on top. Therefore, even
when the goal is to assess model fairness, the model should be stress-tested
for robustness. All feature importances and interactions should be
understood. Otherwise, it won't matter if predictions aren't robust and
transparent.

There are many interpretation methods covered in Figure 14.1, and these are
by no means every interpretation method available. They represent the most
popular methods with well-maintained open source libraries behind them. In
this book, we have touched on most of them, albeit some of them only
briefly. Those that weren't discussed are in italics and those that were have
the relevant chapter numbers provided next to them. There's been a focus on
model-agnostic methods for black-box supervised learning models. Still,
outside of this realm, there are also many other interpretation methods, such
as those found in reinforcement learning, generative models, or the many
statistical methods used strictly for linear regression. And even within the
supervised learning black-box model realm, there are hundreds of
application-specific model interpretation methods used for applications
ranging from chemistry graph CNNs to customer churn classifiers.

That being said, many of the methods discussed in this book can be tailored
to a wide variety of applications. Integrated gradients can be used to interpret audio classifiers and hydrological forecasting models. Sensitivity
analysis can be employed in financial modeling and infectious disease risk
models. Causal inference methods can be leveraged to improve user
experience and drug trials.

Improve is the operative word here because interpretation methods have a flip side!

In this book, that flip side has been referred to as tuning for interpretability,
which means creating solutions to problems with FAT. Those solutions can
be appreciated in Figure 14.2:
Figure 14.2 – Toolset to treat FAT issues

I have observed five approaches to interpretability solutions:

Mitigating Bias: Any corrective measure taken to account for bias. Please note that this bias refers to the sampling, exclusion, prejudice, and measurement biases in the data, along with any other bias introduced in the ML workflow.
Placing Guardrails: Any solution that ensures that the model doesn't contradict domain knowledge or predict without confidence.
Enhancing Reliability: Any fix that increases the confidence and
consistency of predictions, excluding those that do so by reducing
complexity.
Reducing Complexity: Any means by which sparsity is introduced. As
a side effect, this generally enhances reliability by generalizing better.
Ensuring Privacy: Any effort to secure private data and model
architecture from third parties. We didn't cover this approach in this
book.

There are also three areas in which these approaches can be applied:

Data ("pre-processing"): By modifying the training data


Model ("in-processing"): By modifying the model, its parameters, or
training procedure
Prediction ("post-processing"): By intervening in the inference of the
model
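
To make the post-processing area more concrete, here is a minimal sketch of intervening at the prediction level: choosing a separate decision threshold per group so that true positive rates are roughly equalized. The scores, groups, labels, and target rate are entirely hypothetical assumptions for illustration, not a recipe from earlier chapters.

import numpy as np

# Hypothetical model scores, protected-group labels, and ground truth
rng = np.random.default_rng(42)
scores = rng.uniform(size=1000)
group = rng.integers(0, 2, size=1000)
y_true = (scores + 0.1 * group + rng.normal(scale=0.2, size=1000) > 0.7).astype(int)

def true_positive_rate(y, y_pred):
    positives = y == 1
    return (y_pred[positives] == 1).mean()

# For each group, pick the candidate threshold whose TPR is closest to a common target
target_tpr = 0.8
candidates = np.linspace(0.05, 0.95, 19)
thresholds = {}
for g in (0, 1):
    mask = group == g
    tprs = np.array([true_positive_rate(y_true[mask], (scores[mask] >= t).astype(int))
                     for t in candidates])
    thresholds[g] = candidates[np.argmin(np.abs(tprs - target_tpr))]

print(thresholds)  # group-specific thresholds applied at inference time

Note that nothing about the trained model changes; only the inference step does, which is what distinguishes post-processing from the other two areas.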

There's a fourth area that can impact the other three; namely, data and
algorithmic governance. This includes regulations and standards that dictate
a certain methodology or framework. It's a missing column because very few
industries and jurisdictions have laws dictating what methods and
approaches should be applied to comply with FAT. For instance, governance
could impose a standard for explaining algorithmic decisions, data
provenance, or a robustness certification threshold. We will discuss this
further in the next section.

As you can tell in Figure 14.2, many of the methods appear under more than one FAT concern. Feature Selection and Engineering, Monotonic Constraints, and Regularization benefit all three, although they are not always leveraged by the same approach. Data Augmentation can also enhance reliability for both fairness and accountability. A minimal sketch of combining two of these in-processing methods follows. As with Figure 14.1, the items in italics were not covered in the book, and three of them stand out: Uncertainty Estimation, Adversarial Robustness, and Privacy Preservation are fascinating topics that deserve books of their own.
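
As a concrete illustration of the in-processing methods just mentioned, here is a minimal sketch, on a hypothetical synthetic dataset, of combining monotonic constraints (a guardrail that keeps the model consistent with domain knowledge) with regularization and shallow trees (complexity reduction) in XGBoost. The data-generating process and hyperparameter values are assumptions for the example only.

import numpy as np
from xgboost import XGBClassifier

# Hypothetical data: the target rises with feature 0 and falls with feature 1
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

model = XGBClassifier(
    # Guardrail: predictions must be monotonically increasing in feature 0,
    # decreasing in feature 1, and unconstrained in feature 2
    monotone_constraints="(1,-1,0)",
    # Complexity reduction: L1/L2 penalties and shallow trees
    reg_alpha=0.1,
    reg_lambda=1.0,
    max_depth=3,
    n_estimators=200,
)
model.fit(X, y)

Constraining directions the domain already agrees on usually costs little predictive performance, but it rules out implausible reversals, which makes downstream fairness and accountability checks easier to interpret.
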
Current trends

One of the most significant deterrents to AI adoption is a lack of interpretability, which is part of the reason why 50-90% of AI projects never take off; another is the ethical transgressions that happen as a result of not complying with FAT. In this respect, Interpretable Machine Learning (iML) has the power to lead ML as a whole because it can help with both problems through the corresponding methods in Figure 14.1 and Figure 14.2.

Thankfully, we are witnessing an increase in both interest in and publications on iML, mostly under the banner of Explainable Artificial Intelligence, or XAI (see Figure 14.3). In the scientific community, iML is still the most popular term, but XAI dominates in public settings:

XAI VERSUS IML – WHICH ONE TO USE? My take: Although they are understood as synonyms in industry and iML is regarded as more of an academic term, ML practitioners, even those in industry, should be wary about using the term XAI. Words can have outsized suggestive power. Explainable presumes full understanding, but interpretable leaves room for error, as there always should be when talking about models, and extraordinarily complex black-box ones at that. Furthermore, AI has captured the public imagination as a panacea or has been vilified as dangerous. Either way, pairing it with the word explainable only adds to the hubris of those who think it's a panacea, and perhaps calms some concerns for those who think it's dangerous. XAI as a marketing term might be serving a purpose. However, for those who build models, the suggestive power of the word explainable can make us overconfident in our interpretations. That being said, this is just an opinion.
Figure 14.3 – Publication and search trends for iML and XAI

This means that just as ML is starting to get standardized, regulated, consolidated, and integrated into a whole host of other disciplines, interpretation will soon get a seat at the table.
ML is replacing software in all industries. As more is automated, more models are deployed to the cloud, and this will only intensify with the Artificial Intelligence of Things (AIoT). Deployment is not traditionally in the ML practitioner's wheelhouse, which is why ML increasingly depends on Machine Learning Operations (MLOps). The pace of automation also means that more tools are needed to build, test, deploy, and monitor these models.

At the same time, there's a need for the standardization of tools, methods, and metrics. Slowly but surely, this is happening. Since 2017, we have had the Open Neural Network Exchange (ONNX), an open standard for interoperability. And at the time of writing, the International Organization for Standardization (ISO) has over two dozen AI standards being written (and one published), several of which involve interpretability. Naturally, some things will get standardized simply through common use, due to the consolidation of ML model classes, methods, libraries, service providers, and practices. Over time, one or a few in each area will emerge as the victors.

Lastly, given ML's outsized role in algorithmic decision-making, it's only a matter of time before these algorithms get regulated. So far, trading algorithms are regulated only in some financial markets, by bodies such as the Securities and Exchange Commission (SEC) in the United States and the Financial Conduct Authority (FCA) in the UK. Besides that, only data privacy and provenance regulations are widely enforced, such as HIPAA in the US and LGPD in Brazil. The GDPR in the European Union takes this a bit further with the "right to an explanation" for algorithmic decisions, but the intended scope and methodology are still unclear.

ML interpretability is growing quickly but is lagging behind ML. Some interpretation tools have been integrated into the cloud ecosystem, from SageMaker to DataRobot. They are yet to be fully automated, standardized, consolidated, and regulated, but there's no doubt that this will happen.

Speculating on the future of ML interpretability


I'm used to hearing the metaphor of this period being the "Wild West of AI",
or worse, an "AI Gold Rush"! It conjures images of unexplored and untamed
territory being eagerly conquered, or worse, civilized. Yet, in the 19th
century, the United States' western areas were not too different from other
regions on the planet and had already been inhabited by Native Americans
for millennia, so the metaphor doesn't quite work. Predicting with the
accuracy and confidence that we can achieve with ML would spook our
ancestors and is not a "natural" position for us humans. It's more akin to
flying than exploring unknown land.

The article Toward the Jet Age of machine learning (linked in the Further
reading section at the end of this chapter) presents a much more fitting
metaphor of AI being like the dawn of aviation. It's new and exciting, and
people still marvel at what we can do from down below (see Figure 14.4)!

However, it had yet to fulfill its potential. Decades after the barnstorming
era, aviation matured into the safe, reliable, and efficient Jet Age of
commercial aviation. In the case of aviation, the promise was that it could
reliably take goods and people halfway around the world in less than a day.
In AI's case, the promise is that it can make fair, accountable, and
transparent decisions — maybe not for every decision, but at least for those it was
designed to make, unless it's an example of Artificial General Intelligence
(AGI):
Figure 14.4 – Barnstorming during the 1920s (United States Library of
Congress's Prints and Photographs Division)
So how do we get there? The following are a few ideas I anticipate will
occur in the pursuit of reaching the Jet Age of ML.

A new vision for ML

As we intend to go farther with AI than we have ever gone before, the ML practitioners of tomorrow have to be more aware of the dangers of the sky.
And by the sky, I mean the new frontiers of predictive and prescriptive
analytics. The risks are numerous and involve all kinds of biases and
assumptions, problems with data both known and potential, and our models'
mathematical properties and limitations. It's easy to be deceived into thinking ML models are just software. Still, in this analogy, software is
completely deterministic in nature – it's solidly anchored to the ground, not
hovering in the sky!

For civil aviation to become safe, it required a new mindset — a new culture. The fighter pilots of WWII, as capable as they were, had to be retrained
to work in civil aviation. It's not the same mission because when you know
that you are carrying passengers on board, and the stakes are high,
everything changes. Ethical AI, and by extension, iML, ultimately require
this awareness that models directly or indirectly carry passengers "on
board." And that models aren't as robust as they seem. A robust model must
be able to reliably withstand almost any condition over and over again in the
same way the planes of today do. To that end, we need to be using more
instruments, and those instruments come in the form of interpretation
methods.

A multidisciplinary approach

Tighter integration with many disciplines is needed for models that comply
with the principles of FAT. This means more significant involvement of AI
ethicists, lawyers, sociologists, psychologists, human-centered designers,
and countless other professions. Along with AI technologists and software
engineers, they will help code best practices into standards and regulations.

Adequate standardization
New standards will be needed not only for code, metrics, and methodologies,
but also for language. The language behind data has mostly been derived
from statistics, math, computer science, and econometrics, which leads to a
lot of confusion.

Enforcing regulation

It will likely be required that all production models fulfill the following
specifications:

Are certifiably robust and fair
Are capable of explaining their reasoning behind a given prediction with a TRACE command and, in some cases, are required to deliver the reasoning with the prediction
Can abstain from a prediction they aren't confident about
Yield confidence levels for all predictions (see conformal prediction; a minimal sketch follows this list)
Have metadata with training data provenance (even if anonymized) and authorship and, when needed, regulatory compliance certificates and metadata tied to a public ledger – possibly a blockchain
Have security certificates, much like websites do, to ensure a certain level of trust
Expire, and stop working upon expiration, until they are retrained with new data
Are taken offline automatically when they fail model diagnostics and are only put online again when they pass
Have Continuous Training/Continuous Integration (CT/CI) pipelines that help retrain the model and perform the model diagnostics at regular intervals to avoid any model downtime
Are diagnosed by a certified AI auditor when they fail catastrophically and cause public damage
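
To illustrate how the confidence and abstention requirements above could be met, here is a minimal sketch of split conformal prediction on synthetic data, where the model abstains whenever the calibrated prediction set does not contain exactly one class. The dataset, model, and coverage level are assumptions for the example; libraries such as MAPIE implement this more robustly.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data split into training, calibration, and test sets
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_calib, X_test, y_calib, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Nonconformity score on the calibration set: 1 - probability of the true class
calib_proba = model.predict_proba(X_calib)
scores = 1 - calib_proba[np.arange(len(y_calib)), y_calib]

# Conformal threshold for roughly 90% coverage (alpha = 0.1)
alpha = 0.1
n = len(scores)
q_level = np.ceil((n + 1) * (1 - alpha)) / n
qhat = np.quantile(scores, q_level, method="higher")

# Prediction sets: every class whose probability clears the calibrated threshold
test_proba = model.predict_proba(X_test)
prediction_sets = test_proba >= (1 - qhat)

# Abstain whenever the set is empty or contains more than one class
set_sizes = prediction_sets.sum(axis=1)
abstain = set_sizes != 1
print(f"Abstaining on {abstain.mean():.1%} of test predictions")

The same prediction sets also serve as the confidence levels mentioned above: the wider the set, the less certain the model, which is exactly the kind of information a regulator could require alongside each prediction.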

New regulations will likely create new professions such as AI auditors and
model diagnostics engineers. But they will also prop up MLOps engineers
and ML automation tools.

Seamless machine learning automation with built-in


interpretation
In the future, we won't program an ML pipeline; it will mostly be a drag-
and-drop affair with a dashboard offering all kinds of metrics. It will evolve
to be mostly automated. Automation shouldn't come as a surprise because some existing libraries already perform automated feature selection and model training.
Some interpretability-enhancing procedures may be done automatically, but
most of them should require human discretion. However, interpretation
ought to be injected throughout the process, much like planes that mostly fly
themselves have instruments that alert pilots of issues; the value is in
informing the ML practitioner of potential problems and improvements at
every step. Did it find a feature to recommend for monotonic constraints?
Did it find some imbalances that might need adjusting? Did it find anomalies
in the data that might need some correction? Show the practitioner what
needs to be seen to make an informed decision and let them make it.
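
As an illustration of what those automated, interpretation-aware checks could look like, here is a minimal sketch with assumed thresholds and hypothetical heuristics; a real pipeline would surface these findings in a dashboard rather than return a list of strings, and numeric features are assumed throughout.

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def pipeline_checks(X: pd.DataFrame, y: pd.Series):
    """Return findings for the practitioner to review, not act on automatically."""
    findings = []
    # Imbalances that might need adjusting (assumed 20% threshold)
    minority_share = y.value_counts(normalize=True).min()
    if minority_share < 0.2:
        findings.append(f"Minority class share is {minority_share:.1%}; "
                        "consider reweighting or resampling.")
    # Anomalies in the data that might need correction (assumed 5% threshold)
    outlier_share = (IsolationForest(random_state=0).fit_predict(X) == -1).mean()
    if outlier_share > 0.05:
        findings.append(f"{outlier_share:.1%} of rows flagged as anomalous; "
                        "review before training.")
    # Features that might warrant monotonic constraints (assumed correlation heuristic)
    for col in X.columns:
        corr = X[col].corr(y)
        if abs(corr) > 0.5:
            direction = "increasing" if corr > 0 else "decreasing"
            findings.append(f"'{col}' is strongly {direction} with the target; "
                            "consider a monotonic constraint.")
    return findings

# Hypothetical usage:
# for finding in pipeline_checks(X_train, y_train):
#     print(finding)

The point is not the specific heuristics, which are placeholders, but that each finding is a prompt for human discretion rather than an automatic change to the pipeline.
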

Tighter integration with MLOps engineers

Certifiably robust models trained, validated, and deployed at the click of a button require more than just cloud infrastructure; they require the orchestration of tools, configurations, and people trained in MLOps to monitor them and perform maintenance at regular intervals.

Much like aviation took a few decades to become the safest mode of
transportation, it will take AI a few decades to become the safest mode of
decision-making. It will take a global village to get us there, but it will be an
exciting journey! And remember, the best way to predict the future is to
create it.

Further reading
O'Neil, C. (2017). Weapons of Math Destruction. Penguin Books.
Talwalkar, A. (2018, April 25). Toward the Jet Age of machine
learning. O'Reilly. https://www.oreilly.com/content/toward-the-jet-age-
of-machine-learning/
Angelopoulos, A.N., & Bates, S. (2021). A Gentle Introduction to
Conformal Prediction and Distribution-Free Uncertainty Quantification.
https://arxiv.org/abs/2107.07511
