This is an exposition of three techniques, namely the Partial Dependence Plot (PDP), the Marginal Plot (M-Plot), and the Accumulated Local Effects (ALE) Plot, which are popular model-agnostic methods for measuring and visualizing the “effect” of a given feature on the predictions of an ML model. A key reference for this post is the research article that proposed ALE, from which the notation is also largely borrowed.

Setup

For simplicity, suppose two features, $X_1$ and $X_2$, are used to build an ML model. Given a specific pair of feature values $(x_1, x_2)$, denote the prediction generated by the model as $f(x_1,x_2)$. Further denote $p_1(\cdot)$ and $p_2(\cdot)$ as the marginal probability density functions of $X_1$ and $X_2$, and $p_{2|1}(\cdot|\cdot)$ as the conditional density of $X_2$ given $X_1$.

The interpretation objective is to quantify (and visualize) the “effect” of feature $X_1$ on the predictions of the model.

Three Model-Agnostic Methods

Partial Dependence Plot (PDP)

PDP represents a straightforward (yet somewhat naïve) approach to quantify the effect of a feature on predictions. Specifically, for a given feature value $x_1$, it simply computes the average prediction when $X_1 = x_1$, where the average is taken over all possible values of $X_2$ in order to “marginalize out” the effect of the second feature on predictions. More precisely, PDP computes:

$$f_{1,PDP}(x_1) = \mathbb{E}\left(f(x_1, X_2)\right) = \int p_2(x_2)\, f(x_1, x_2)\, dx_2 \tag{1}$$

then plots $f_{1,PDP}(x_1)$ against $x_1$ for different values of $x_1$.
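To make the computation concrete, here is a minimal sketch of how Eq (1) is typically estimated in practice: the integral over $p_2$ is replaced by an average over the observed rows, with the focal feature forced to each grid value. The function and argument names (`pdp_1d`, `model_predict`, `feature_idx`, `grid`) are illustrative choices of mine, not from the original article.

```python
import numpy as np

def pdp_1d(model_predict, X, feature_idx, grid):
    """Estimate Eq (1): for each grid value, force the focal feature to that
    value for every observed row and average the resulting predictions."""
    pdp_values = []
    for x in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = x  # set X_1 = x, keep the other features as observed
        pdp_values.append(model_predict(X_mod).mean())
    return np.array(pdp_values)
```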

Despite its intuitiveness, PDP suffers from two key limitations:

  1. Because the average prediction at $X_1 = x_1$ is taken over all possible values of $X_2$ (i.e., over the entire marginal density of $X_2$), it commits an “extrapolation error”: the computation includes pairs $(x_1, x_2)$ that may never actually occur in the data.
  2. It also relies on the “feature independence assumption”. If $X_1$ and $X_2$ are not independent (e.g., correlated), then the computation in Eq (1) is generally “biased”, in the sense that it blends the effects of both features.

Marginal Plot (M-Plot)

M-Plot builds on PDP by making a single tweak. For a given feature value $x_1$, it still computes the average prediction when $X_1 = x_1$, but now the “average” is taken over values of $X_2$ conditional on $X_1 = x_1$. More precisely, M-plot computes:

$$f_{1,M}(x_1) = \mathbb{E}\left(f(X_1, X_2) \mid X_1 = x_1\right) = \int p_{2|1}(x_2|x_1)\, f(x_1, x_2)\, dx_2 \tag{2}$$

then plots $f_{1,M}(x_1)$ against $x_1$ for different values of $x_1$.

By taking the average over the conditional density of $X_2$ given $X_1=x_1$, it effectively avoids the “extrapolation error”, because only feasible values of $X_2$ when $X_1 = x_1$ are considered in the computation. Moreover, if the two features are indeed independent, then Eq (2) becomes equivalent to Eq (1). However, it turns out that the validity of M-Plot still relies on the “feature independence assumption” (more on this later).
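An empirical version of Eq (2) replaces the conditional density with a local average: predictions are averaged only over the observations whose $X_1$ actually falls near the focal value, typically within quantile bins. A minimal sketch follows; the function and parameter names are my own.

```python
import numpy as np

def mplot_1d(model_predict, X, feature_idx, n_bins=10):
    """Estimate Eq (2): average predictions over the rows whose focal feature
    falls in each quantile bin, i.e., over the conditional distribution of X_2."""
    x = X[:, feature_idx]
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    centers, m_values = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x >= lo) & (x <= hi)
        if mask.any():
            centers.append(x[mask].mean())
            m_values.append(model_predict(X[mask]).mean())  # only observed rows, so no extrapolation
    return np.array(centers), np.array(m_values)
```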

Accumulated Local Effects (ALE) Plot

Finally, the ALE plot was proposed as an interpretation method that remains valid when $X_1$ and $X_2$ are not independent. It computes the local “marginal effect” of $X_1$, then accumulates these marginal effects to obtain the overall main effect. More precisely, ALE computes:

$$f_{1,ALE}(x_1) = \int_{\min(X_1)}^{x_1} \mathbb{E}\left(f^1(X_1, X_2) \mid X_1 = z_1\right) dz_1 = \int_{\min(X_1)}^{x_1} \int p_{2|1}(x_2|z_1)\, f^1(z_1, x_2)\, dx_2\, dz_1 \tag{3}$$

where $f^1(z_1, x_2)$ is the partial derivative of the prediction with respect to the first feature, i.e., $f^1(z_1, x_2)=\frac{\partial f(z_1,x_2)}{\partial z_1}$, which measures the “marginal effect” of $X_1$ on the prediction in the neighborhood of $X_1 = z_1$. These marginal effects are then accumulated from the minimum value that $X_1$ takes up to the focal value $x_1$. Finally, $f_{1,ALE}(x_1)$ is plotted against $x_1$ for different values of $x_1$ to obtain the ALE plot.
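In practice the partial derivative in Eq (3) is approximated by a finite difference within bins of $X_1$, and the outer integral by a cumulative sum over bins. Here is a rough sketch of such an estimator (the names `ale_1d`, `model_predict`, etc. are mine, and the published estimator also centers the resulting curve, which is omitted here):

```python
import numpy as np

def ale_1d(model_predict, X, feature_idx, n_bins=10):
    """Estimate Eq (3): within each bin of X_1, average the finite-difference
    effect of moving X_1 across the bin (other features held at their observed
    values), then accumulate the bin effects from min(X_1) upward."""
    x = X[:, feature_idx]
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    local_effects = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x >= lo) & (x <= hi)
        if not mask.any():
            local_effects.append(0.0)
            continue
        X_hi, X_lo = X[mask].copy(), X[mask].copy()
        X_hi[:, feature_idx] = hi  # move X_1 to the upper bin edge
        X_lo[:, feature_idx] = lo  # move X_1 to the lower bin edge
        local_effects.append((model_predict(X_hi) - model_predict(X_lo)).mean())
    return edges[1:], np.cumsum(local_effects)  # accumulated local effects
```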

The elements of differentiation and accumulation in ALE are not super intuitive at first sight. For example, it is not immediately clear (1) why ALE addresses the feature dependency issue (and why M-plot doesn’t); and (2) why the integration in ALE starts from $\min(X_1)$, i.e., the minimum value of feature $X_1$. In the next section, I offer an intuitive (though not 100% rigorous) explanation for both questions.

Understanding ALE

To understand why ALE works, it helps to consider a (drastically simplified) special case where the prediction is a linear combination of the two features, i.e., $f(x_1,x_2) := \beta_1 x_1 + \beta_2 x_2$. Under this linear setting, the effect of the first feature on predictions is simply $\beta_1 x_1$. Now, let’s work out the effects computed by M-plot and ALE.

First, in the M-plot:

$$\begin{align*} f_{1,M}(x_1) & = \int p_{2|1}(x_2|x_1) (\beta_1 x_1 + \beta_2 x_2)\, dx_2 \\ & = \beta_1 x_1 \int p_{2|1}(x_2|x_1)\, dx_2 + \beta_2 \int p_{2|1}(x_2|x_1)\, x_2\, dx_2 \\ & = \beta_1 x_1 + \beta_2\, \mathbb{E}(X_2 \mid X_1 = x_1) \end{align*}$$

In other words, unless the two features are independent, the term $\beta_2\, \mathbb{E}(X_2 \mid X_1 = x_1)$ varies with $x_1$, so $f_{1,M}(x_1)$ is unable to separate the effects of $X_1$ and $X_2$ on the predictions.

In comparison, in the ALE plot, we have $f^1 (z_1, x_2) = \beta_1$, so

$$\begin{align*} f_{1,ALE}(x_1) & = \int_{\min(X_1)}^{x_1} \int p_{2|1}(x_2|z_1)\, \beta_1\, dx_2\, dz_1 \\ & = \int_{\min(X_1)}^{x_1} \beta_1 \int p_{2|1}(x_2|z_1)\, dx_2\, dz_1 \\ & = \int_{\min(X_1)}^{x_1} \beta_1\, dz_1 \\ & = \beta_1 x_1 - \beta_1 \min(X_1) = \beta_1 x_1 - \text{constant} \end{align*}$$

Note that $f_{1,ALE}(x_1)$ correctly recovers the pure effect of $X_1$, up to a constant, without being confounded by $X_2$. This demonstration also helps illustrate why the integration in ALE starts at $\min(X_1)$. As the authors of ALE note in their paper, the exact choice of lower bound does not matter much: any fixed lower bound only contributes a constant term, and the ALE plot can later be shifted vertically to remove it. With that in mind, $\min(X_1)$ is a convenient choice, as it covers all possible values of $X_1$ up to the focal value $x_1$.
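The linear case above can also be checked numerically. The toy simulation below (reusing the hypothetical `mplot_1d` and `ale_1d` sketches from earlier) builds two strongly correlated features, uses $f(x_1, x_2) = 2x_1 + 5x_2$ as the “model”, and compares the slopes of the two curves: the M-plot slope comes out near $\beta_1 + \beta_2 = 7$, while the ALE slope comes out near $\beta_1 = 2$.

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2 = 2.0, 5.0

# Strongly dependent features: X_2 tracks X_1 with a little noise
x1 = rng.uniform(0.0, 1.0, 5000)
x2 = x1 + rng.normal(0.0, 0.05, 5000)
X = np.column_stack([x1, x2])

predict = lambda X: beta1 * X[:, 0] + beta2 * X[:, 1]  # the "ML model" f

centers, m_curve = mplot_1d(predict, X, feature_idx=0)
edges, ale_curve = ale_1d(predict, X, feature_idx=0)

print(np.polyfit(centers, m_curve, 1)[0])   # ~7: M-plot blends beta_1 and beta_2
print(np.polyfit(edges, ale_curve, 1)[0])   # ~2: ALE recovers beta_1
```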

The more important lesson, which is also the key novelty of ALE, is the use of the partial derivative to block out the impact of the (potentially correlated) $X_2$. While this strategy is well understood in the context of linear regression, it applies more broadly to nonlinear, nonparametric, or black-box ML models.