Feature Selection#

Feature selection aims to select a subset of features that are most relevant to the target. When the number of features is large, feature selection can help reduce the computational burden and avoid overfitting. Reducing the number of modeling features also improves model interpretability.

Modeva has implemented three feature selection strategies:

Correlation Coefficient#

A correlation measure is computed between each feature and the target variable; the specific measure depends on the types of the pair:

  • Pearson’s correlation for a numerical-numerical pair,

  • Theil’s U for a categorical-categorical pair [TheilsU], and

  • Correlation ratio for a mixed numerical-categorical pair [CorrRatio].

Features whose correlation strength with the target exceeds the specified threshold are selected.
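
Below is a minimal sketch of how these three measures could be computed and thresholded in Python. The helper names (correlation_ratio, theils_u, select_by_correlation) and the default threshold are illustrative assumptions, not Modeva's API.

```python
import numpy as np
import pandas as pd

def correlation_ratio(cats, vals):
    # Correlation ratio (eta) for a categorical-numerical pair
    df = pd.DataFrame({"g": np.asarray(cats), "y": np.asarray(vals, dtype=float)})
    grand_mean = df["y"].mean()
    ss_total = ((df["y"] - grand_mean) ** 2).sum()
    ss_between = sum(len(grp) * (grp.mean() - grand_mean) ** 2
                     for _, grp in df.groupby("g")["y"])
    return np.sqrt(ss_between / ss_total) if ss_total > 0 else 0.0

def theils_u(x, y):
    # Theil's U (uncertainty coefficient) for a categorical-categorical pair:
    # U(X|Y) = (H(X) - H(X|Y)) / H(X)
    joint = pd.crosstab(pd.Series(x), pd.Series(y), normalize=True)
    p_y = joint.sum(axis=0)
    h_x_given_y = 0.0
    for col in joint.columns:
        p = joint[col] / p_y[col]          # conditional p(x | y = col)
        p = p[p > 0]
        h_x_given_y -= p_y[col] * (p * np.log(p)).sum()
    p_x = joint.sum(axis=1)
    p_x = p_x[p_x > 0]
    h_x = -(p_x * np.log(p_x)).sum()
    return (h_x - h_x_given_y) / h_x if h_x > 0 else 1.0

def select_by_correlation(df, target, threshold=0.1):
    y = df[target]
    y_num = pd.api.types.is_numeric_dtype(y)
    selected = []
    for col in df.columns.drop(target):
        x = df[col]
        x_num = pd.api.types.is_numeric_dtype(x)
        if x_num and y_num:
            strength = abs(x.corr(y))                 # Pearson
        elif not x_num and not y_num:
            strength = theils_u(x, y)                 # Theil's U
        else:
            cats, vals = (x, y) if not x_num else (y, x)
            strength = correlation_ratio(cats, vals)  # correlation ratio
        if strength > threshold:
            selected.append(col)
    return selected
```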

Feature Importance#

This strategy first fits a predictive model, then applies a post-hoc feature attribution method to select the important variables. Modeva's DataSet.feature_select_xgbpfi fits an XGBoost model using all the covariates and the response, then runs a permutation feature importance (PFI) test on the fitted model. The resulting importance scores are normalized to sum to 1 and sorted in descending order; the smallest set of top-ranked features whose cumulative importance exceeds a pre-defined threshold is selected.

Note that the fitted XGBoost model can overfit or underfit, so the feature selection results should be treated with caution.
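
The following sketch reproduces this workflow directly with xgboost and scikit-learn rather than through DataSet.feature_select_xgbpfi; the model hyperparameters and the 0.95 cumulative-importance threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from xgboost import XGBRegressor

def select_by_xgb_pfi(X, y, threshold=0.95, random_state=0):
    # X is assumed to be a pandas DataFrame; use XGBClassifier for a
    # categorical response.
    model = XGBRegressor(n_estimators=200, random_state=random_state)
    model.fit(X, y)
    # Permutation importance: mean decrease in score when each feature
    # is randomly shuffled
    result = permutation_importance(model, X, y, n_repeats=10,
                                    random_state=random_state)
    imp = np.clip(result.importances_mean, 0.0, None)
    imp = imp / imp.sum()                  # normalize so importances sum to 1
    order = np.argsort(imp)[::-1]          # sort in descending order
    cum = np.cumsum(imp[order])
    k = int(np.searchsorted(cum, threshold)) + 1
    return [X.columns[i] for i in order[:k]]
```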

Conditional Independence#

This method implements a two-stage feature selection process that combines the Randomized Conditional Independence Test (RCIT) with Forward-Backward selection with Early Dropping (FBEDk). It first performs forward selection to identify potentially important features, then backward elimination to remove redundant ones, using random Fourier features for non-parametric conditional independence testing.

The RCIT-based feature selection method is capable of handling non-linear relationships, and the selected features are all causally related to the response at the pre-defined significance level. The disadvantages of this method are: (a) its computational burden is relatively high; (b) since it is a sequential selection approach, the results may differ slightly across different initial Markov boundary sets.

RCIT Test#

RCIT can detect non-linear relationships. Given a Markov boundary set \(Z\), the goal is to test whether a feature \(X\) is conditionally independent of the response variable \(Y\) given \(Z\), namely \(X \perp Y \mid Z\). The details are as follows:

  • \(X, Y, Z\) are first transformed using random Fourier features;

  • Then the null hypothesis \(X \perp Y \mid Z\) corresponds to a zero partial cross-covariance \(\Sigma_{X Y \mid Z}=\Sigma_{X Y}-\Sigma_{X Z} \Sigma_{Z Z}^{-1} \Sigma_{Z Y}=0\);

  • The test statistic is then \(n\left\|\hat{\Sigma}_{X Y \mid Z}\right\|_F^2\), where \(\hat{\Sigma}_{X Y \mid Z}=\hat{\Sigma}_{X Y}-\hat{\Sigma}_{X Z}\left(\hat{\Sigma}_{Z Z}+\gamma I\right)^{-1} \hat{\Sigma}_{Z Y}\) and \(\gamma\) is a small ridge regularization parameter;

  • The asymptotic null distribution of the test statistic is \(\sum_{i} \lambda_i z_i^2\), where the \(\lambda_i\) are data-dependent eigenvalue weights and the \(z_i\) are i.i.d. standard Gaussian variables;

  • The Lindsay-Pilla-Basak method (LPB; [Lindsayl2000]) approximates the CDF of this distribution under the null using a finite mixture of Gamma distributions.

See also [Zhang2012] and [Strobl2019]. In Modeva, the number of random Fourier features is set to 100 for \(Z\) and 5 for \(X\) and \(Y\).
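
A compact sketch of the statistic above, assuming standard random Fourier features for an RBF kernel; the kernel bandwidth choice and the LPB Gamma-mixture p-value step are omitted for brevity, and all names are illustrative.

```python
import numpy as np

def rff(v, n_features, seed):
    # Random Fourier features approximating an RBF kernel (unit bandwidth
    # here; practical implementations typically use the median heuristic)
    rng = np.random.default_rng(seed)
    V = np.asarray(v, dtype=float)
    if V.ndim == 1:
        V = V[:, None]
    W = rng.normal(size=(V.shape[1], n_features))    # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, n_features)    # random phases
    return np.sqrt(2.0 / n_features) * np.cos(V @ W + b)

def rcit_statistic(x, y, z, gamma=1e-10, seed=0):
    # Transform X, Y, Z (5, 5, and 100 features, matching Modeva's defaults)
    fx = rff(x, 5, seed)
    fy = rff(y, 5, seed + 1)
    fz = rff(z, 100, seed + 2)
    n = fx.shape[0]
    fx, fy, fz = (f - f.mean(axis=0) for f in (fx, fy, fz))
    cxy = fx.T @ fy / n
    cxz = fx.T @ fz / n
    czy = fz.T @ fy / n
    czz = fz.T @ fz / n
    # Partial cross-covariance with a ridge-regularized inverse of Sigma_ZZ
    cxy_z = cxy - cxz @ np.linalg.solve(czz + gamma * np.eye(czz.shape[0]), czy)
    # n * squared Frobenius norm; RCIT compares this against the LPB
    # Gamma-mixture approximation of the null distribution (not shown)
    return n * np.linalg.norm(cxy_z, "fro") ** 2
```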

FBEDk Algorithm#

The FBEDk (Forward-Backward selection with Early Dropping) algorithm [Borboudakis2019] combines forward selection and backward elimination. We first run forward selection, initialized with a user-defined Markov boundary set, and then conduct backward elimination to delete insignificant features.

Forward Selection

  • Given a predefined Markov boundary set, we initialize all the remaining covariate features as candidate features.

  • Run the RCIT test between each candidate feature and the response variable, conditional on the Markov boundary set.

  • Candidate features with p_value <= threshold are retained; the rest are dropped from further consideration (the "early dropping" step).

  • Among the retained candidates, the most significant one (smallest p_value) is added to the Markov boundary set.

  • Repeat the last three steps; the algorithm stops when the candidate set is empty.

The above steps describe one run of the forward phase. To increase accuracy, the whole forward phase is repeated k times (the "k" in "FBEDk"). As recommended by [Yu2020], the default is k = 2; you may change this parameter via the n_forward_phase argument. A sketch of the forward phase is shown below.
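The following schematic sketch covers a single forward run, assuming a hypothetical helper rcit_pvalue(x, y, Z) that returns the RCIT p-value for testing \(X \perp Y \mid Z\) (e.g., built on the statistic sketched earlier); the function and argument names are illustrative.

```python
def forward_phase(features, mb, y, data, threshold, rcit_pvalue):
    # One forward run: grow the Markov boundary `mb` (a list of column
    # names in the DataFrame `data`) by repeatedly adding the most
    # significant remaining candidate.
    candidates = [f for f in features if f not in mb]
    while candidates:
        pvals = {f: rcit_pvalue(data[f], y, data[mb]) for f in candidates}
        # Early dropping: keep only candidates still significant given mb
        candidates = [f for f in candidates if pvals[f] <= threshold]
        if not candidates:
            break
        best = min(candidates, key=pvals.get)   # most significant feature
        mb.append(best)
        candidates.remove(best)
    return mb
```

FBEDk repeats this run k times, reinstating previously dropped features as candidates at the start of each run.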

Backward Elimination

  • Temporarily remove feature \(j\) from the Markov boundary set.

  • Run the RCIT test for feature \(j\) and \(Y\), conditional on the temporary Markov boundary set.

  • Permanently remove feature \(j\) from the Markov boundary set if p_value > threshold. These steps are repeated for every feature in the Markov boundary set.
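
A matching sketch of the backward pass, using the same hypothetical rcit_pvalue helper as above:

```python
def backward_phase(mb, y, data, threshold, rcit_pvalue):
    # Test each selected feature against Y, conditioning on the others
    for f in list(mb):
        rest = [g for g in mb if g != f]        # temporarily remove f
        if rcit_pvalue(data[f], y, data[rest]) > threshold:
            mb.remove(f)                        # permanently remove f
    return mb
```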

Examples#

References#