Global Explainability#

PFI (Permutation Feature Importance)#

Permutation feature importance measures the influence of individual features on the model by calculating the increase in loss \(L\) when a subset of features, typically a single feature, is permuted [Breiman2001]. When a feature's values are randomly shuffled, the relationship between the feature and the target is broken, and the resulting drop in model performance indicates the feature's significance. PFI can be used to assess the model's reliance on each feature in the input set. Note, however, that different models can have very different feature importance rankings; PFI results only reveal the importance of each feature to that specific model.

Assume we have a fully trained model, denoted by \(f(x)\), and a dataset (either training or testing data). First, we evaluate the performance of the model on the data and record the resulting performance score. Next, for each feature \(k\), we do the following:

  • Randomly shuffle the values of feature \(k\) in the original dataset, while keeping the rest of the feature values unchanged. This produces a shuffled dataset.

  • Evaluate the model performance on the shuffled dataset, and record the resulting performance score.

  • Compute the performance degradation; this is taken as the importance of feature \(k\).

Finally, since the above steps involve randomness, we usually repeat them for several iterations and average the results.

The larger the average performance degradation, the more important the feature is considered to be. Note that the performance degradation may be negative, which means that the model achieves better performance when that feature is permuted. In this case, we truncate the negative importance to zero. In Modeva, the above calculations are based on the permutation_importance function of scikit-learn. For regression tasks, the performance metric is set to MSE; for binary classification, the AUC metric is used. For more analysis of this algorithm, please refer to the documentation here_.
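The shuffle-score-average procedure above can be sketched with scikit-learn's permutation_importance. The synthetic dataset and random-forest model below are illustrative placeholders, not part of Modeva's implementation.

```python
# Sketch of PFI via scikit-learn's permutation_importance; the dataset
# and model here are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# shuffle=False keeps the two informative features in columns 0 and 1.
X, y = make_regression(n_samples=300, n_features=5, n_informative=2,
                       shuffle=False, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature, re-score, and average over 10 repetitions;
# for regression the metric is MSE (used here as negative MSE).
result = permutation_importance(model, X, y,
                                scoring="neg_mean_squared_error",
                                n_repeats=10, random_state=0)

# Truncate any negative importance to zero, as described above.
importance = np.clip(result.importances_mean, 0, None)
```

Each entry of `importance` is the average increase in loss caused by permuting the corresponding feature, truncated at zero.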

Hstats (Friedman’s H-statistic)#

The H-statistic measures the interaction strength between two features [Friedman2008].

Consider a set of features, represented by \(X\), and a fitted model, represented by \(\hat{f}\). The H-statistic is defined based on partial dependence, as follows:

\[\begin{align} H_{j k}^2=\frac{\sum_{i=1}^n\left[P D_{j k}\left(x_j^{(i)}, x_k^{(i)}\right)-P D_j\left(x_j^{(i)}\right)-P D_k\left(x_k^{(i)}\right)\right]^2}{\sum_{i=1}^n P D_{j k}^2\left(x_j^{(i)}, x_k^{(i)}\right)}, \tag{1} \end{align}\]

where \(j\) and \(k\) are two features in \(X\), \(x_j^{(i)}\) and \(x_k^{(i)}\) are the values of features \(j\) and \(k\) for the \(i\)-th sample, respectively, and \(PD_{jk}(x_j^{(i)}, x_k^{(i)})\) is the partial dependence of \(\hat{f}\) on features \(j\) and \(k\) at \((x_j^{(i)}, x_k^{(i)})\). The H-statistic is a measure of the interaction strength between features \(j\) and \(k\): the larger the H-statistic, the stronger the interaction. The H-statistic is symmetric, i.e., \(H_{jk}=H_{kj}\).
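Equation (1) can be evaluated directly by brute force. The sketch below estimates each partial dependence by averaging model predictions over the sample; note that each partial dependence is centered before forming the ratio, as in Friedman's original formulation. The dataset and model are illustrative placeholders.

```python
# Brute-force evaluation of the H-statistic of equation (1); each partial
# dependence is estimated by averaging predictions over the sample and is
# centered (following Friedman's original definition). Illustrative only.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=100, random_state=0)
f = GradientBoostingRegressor(random_state=0).fit(X, y)

def pd_at(model, X, cols, vals):
    """Partial dependence at `vals` for features `cols`, averaging the
    model's predictions over the empirical distribution of the rest."""
    Xc = X.copy()
    Xc[:, cols] = vals
    return model.predict(Xc).mean()

def h_statistic(model, X, j, k):
    """Squared H-statistic H_{jk}^2 for features j and k."""
    n = X.shape[0]
    pd_jk = np.array([pd_at(model, X, [j, k], X[i, [j, k]]) for i in range(n)])
    pd_j = np.array([pd_at(model, X, [j], X[i, [j]]) for i in range(n)])
    pd_k = np.array([pd_at(model, X, [k], X[i, [k]]) for i in range(n)])
    # Center each partial dependence before forming the ratio.
    pd_jk, pd_j, pd_k = (v - v.mean() for v in (pd_jk, pd_j, pd_k))
    return ((pd_jk - pd_j - pd_k) ** 2).sum() / (pd_jk ** 2).sum()

h = h_statistic(f, X, 0, 1)  # x0 and x1 interact through sin(pi*x0*x1)
```

On the Friedman #1 benchmark, the interacting pair \((x_0, x_1)\) yields a noticeably larger value than an additive pair such as \((x_3, x_4)\).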

PDP (Partial Dependence Plot)#

A partial dependence plot (PDP) [Friedman2001] is a model-agnostic tool that helps visualize the relationship between a subset of features and the predicted response. This allows us to determine whether the relationship between the target and an input feature is linear, monotonic, or more complex. However, one key assumption of PDP is that the features in the complement set are not correlated with the features of interest.

Consider a set of features, represented by \(X\), and a fitted model, represented by \(\hat{f}\). Note that for binary classification, we use the predicted log odds instead of the probability as the output function. Suppose we partition \(X\) into two sets: \(X_S\), which represents the features of interest, and \(X_C\), which represents the complement of \(X_S\). In this context, the partial function is defined as follows:

\[\begin{align} \mathrm{PD}_{S}(x_{S})=\mathbb{E}_{X_{C}}[\hat{f}(x_{S},X_{C})]= \int \hat{f}(x_{S}, X_{C})p(X_{C})dX_{C}, \tag{1} \end{align}\]

where the integral above can be approximated using the training data,

\[\begin{align} \mathrm{PD}_{S}(x_{S}) = \frac{1}{n}\sum_{i=1}^n \hat{f}(x_{S}, x_{C}^{(i)}), \tag{2} \end{align}\]

where \(x_{C}^{(i)}\) denotes the complement feature values of the \(i\)-th training sample. The integral approximation described above is commonly referred to as the “brute” method. However, for some tree-based estimators, a faster recursive method is also available. In Modeva, the PDP is obtained by calling the partial_dependence function of the scikit-learn package; see more details here_.

Warning

PDPs also have a few limitations, including:

  • Assumption of independence: PDPs assume that the features of interest are independent of each other. If the features are highly correlated, the results can be inaccurate, as they require extrapolation of the response at predictor values that are far outside the multivariate envelope of the training data.

  • Inconsistent global and local explanation: PDPs provide an average view of features’ effect on the predicted response. Local effects or effects specific to certain subsets of data may be different from global ones.

ALE (Accumulated Local Effects)#

Accumulated Local Effects (ALE; [Apley2016]) is a model-agnostic method for explaining how features impact a model’s prediction. Its aim is similar to that of PDP, but PDP results may be biased when features are correlated. ALE overcomes this limitation and offers a quicker and unbiased alternative to PDP.

ALE plots, by definition, accumulate local effects over the range of a feature. Here we describe only the one-way ALE for a single numerical feature. It first divides the feature of interest into \(K\) intervals (bins). The local effect within each bin is computed as the average difference in predictions between the bin's upper and lower endpoints. Let \(N_{j}(k) = (z_{k-1,j} , z_{k,j}]\) be the \(k\)-th interval for the \(j\)-th feature. The split point \(z_{k,j}\) is usually set to the \(\frac{k}{K}\) quantile of the \(j\)-th feature. Based on the binning results, we can compute the uncentered effect of feature \(j\) as follows:

\[\begin{align} \hat{h}_{j,ALE}(x) = \sum_{k=1}^{k_{j}(x)}\frac{1}{n_{j}(k)}\sum_{i:x_{j}^{(i)}\in N_{j}(k)}[\hat{f}(z_{k,j},\textbf{x}_{-j}^{(i)})- \hat{f}(z_{k-1,j},\textbf{x}_{-j}^{(i)})], \tag{1} \end{align}\]

where \(k_{j}(x)\) is the index of the interval containing \(x\), \(n_{j}(k)\) is the number of samples in \(N_{j}(k)\), and \(\hat{f}\) is the model being explained. Finally, the ALE is centered using the following formula:

\[\begin{align} \hat{f}_{j,ALE}(x) = \hat{h}_{j,ALE}(x)-\frac{1}{n} \sum_{i=1}^{n} \hat{h}_{j, ALE}(x_{j}^{(i)}). \tag{2} \end{align}\]

Note that computing ALE is faster than computing PDP, as it requires fewer function calls to \(\hat{f}\). Moreover, there is no standard rule for selecting the number of intervals: if the number is too small, the ALE plot may not be very accurate; if it is too large, the curve will have many small ups and downs. For additional details on the two-way ALE, please refer to the original paper. In Modeva, the ALE plot is generated based on the Python package PyALE.
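To make equations (1) and (2) concrete, the sketch below implements the one-way ALE directly in NumPy rather than calling PyALE: quantile binning, per-bin local effects, accumulation, and final centering mirror the formulas above. The dataset and model are illustrative placeholders.

```python
# Direct NumPy sketch of one-way ALE (equations (1)-(2)); illustrative,
# not the PyALE implementation that Modeva uses.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=300, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

def ale_1d(model, X, j, K=10):
    """Centered one-way ALE of numerical feature j with K quantile bins."""
    # Bin edges z_{0,j} <= ... <= z_{K,j} at the k/K quantiles of feature j.
    z = np.quantile(X[:, j], np.linspace(0.0, 1.0, K + 1))
    # Interval index k_j(x) in 1..K for each sample, interval (z_{k-1}, z_k].
    idx = np.clip(np.searchsorted(z, X[:, j], side="left"), 1, K)
    local = np.zeros(K)
    for k in range(1, K + 1):
        mask = idx == k
        if not mask.any():
            continue
        lo, hi = X[mask].copy(), X[mask].copy()
        lo[:, j], hi[:, j] = z[k - 1], z[k]
        # Mean prediction difference across the k-th interval.
        local[k - 1] = (model.predict(hi) - model.predict(lo)).mean()
    h = np.cumsum(local)               # uncentered effect, equation (1)
    return h - h[idx - 1].mean(), z    # centered effect, equation (2)

ale, edges = ale_1d(model, X, j=3)     # x3 enters Friedman #1 linearly
```

Because \(x_3\) has a linear effect in this benchmark, the resulting ALE curve is approximately a centered straight line.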

Warning

By construction, ALE effects are computed per interval (locally) and thus require little extrapolation beyond the data envelope within each interval. However, when features are strongly correlated, interpreting the accumulated effect across intervals is not recommended. If we fix one feature and move another, highly correlated feature from one interval to another, the model being explained \(\hat{f}\) may be queried outside the data envelope and produce unreliable predictions, which in turn makes the accumulated ALE values unreliable. Therefore, when features are strongly correlated, we should interpret ALE plots locally, bin by bin.

Examples#

References#