modeva.DataSet.feature_select_corr

DataSet.feature_select_corr(dataset: str = 'train', method: str = 'pearson', threshold: float = 0.2, random_state: int = 0)

Selects features based on their correlation strength with the target variable.

Analyzes the relationship between features and the target variable using different correlation metrics:

Pearson / Spearman / Kendall / XI correlation for numerical-numerical pairs;
Theil’s U for categorical-categorical pairs;
correlation ratio for numerical-categorical pairs.

Features with absolute correlation above the specified threshold are selected.
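
To make the two non-numerical measures listed above concrete, here is a minimal pure-Python sketch of the correlation ratio and Theil's U. It illustrates the textbook definitions only; the function names, the toy data, and the direction chosen for Theil's U (which is asymmetric) are assumptions and may differ from what Modeva computes internally.

```python
import math
from collections import Counter, defaultdict


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numerical variable."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    grand_mean = sum(values) / len(values)
    # Between-group sum of squares relative to the total sum of squares.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    return math.sqrt(ss_between / ss_total) if ss_total > 0 else 0.0


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())


def theils_u(x, y):
    """Theil's U(x | y): share of the entropy of x that is explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so nothing is left to explain
    n = len(x)
    by_y = defaultdict(list)
    for xi, yi in zip(x, y):
        by_y[yi].append(xi)
    # Conditional entropy H(x | y) as a weighted average over the levels of y.
    h_x_given_y = sum(len(g) / n * entropy(g) for g in by_y.values())
    return (h_x - h_x_given_y) / h_x


# Hypothetical toy data: one categorical feature and three possible targets.
segment = ["a", "a", "b", "b", "c", "c"]
numeric_target = [1.0, 1.2, 3.0, 3.1, 5.0, 5.2]
label_dependent = ["yes", "yes", "no", "no", "yes", "yes"]   # fixed by segment
label_independent = ["yes", "no", "yes", "no", "yes", "no"]  # unrelated to segment

eta = correlation_ratio(segment, numeric_target)        # close to 1
u_dependent = theils_u(label_dependent, segment)        # 1: fully explained
u_independent = theils_u(label_independent, segment)    # 0: nothing explained
```

Both measures lie in [0, 1], so taking the absolute value before comparing with the threshold leaves them unchanged.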

Parameters:

dataset ({"main", "train", "test"}, default="train") – Dataset partition to use for feature selection analysis.
method ({"pearson", "spearman", "kendall", "xicor"}, default="pearson") –
The algorithm for calculating the correlation when the variable and target are both numerical.
- "pearson": Pearson correlation measures the linear relationship between two continuous variables. Its value ranges from −1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), with 0 indicating no linear correlation. It captures linear relationships but can miss nonlinear patterns.
- "spearman": Spearman correlation assesses the strength and direction of a monotonic relationship between two variables, based on their ranks. It ranges from −1 to 1, where 1 indicates a perfect increasing monotonic relationship and −1 a perfect decreasing one. It is robust to outliers and can capture nonlinear relationships as long as they are monotonic.
- "kendall": Kendall's tau measures the association between two ranked variables, focusing on the consistency of the ordering between them. Its value ranges from −1 (perfect discordance) to 1 (perfect concordance). It is particularly useful for ordinal data and is robust to outliers.
- "xicor": Xi correlation (XiCor) detects both linear and nonlinear dependencies between continuous variables. It typically ranges from 0 (no dependence) to 1 (strong dependence). Small negative values can occur; they carry no special meaning and should be read as close to zero. See the paper [1] for details.
[1] Chatterjee, S. (2021). A new coefficient of correlation. J. Amer. Statist. Assoc., 116, no. 536, 2009–2022.
threshold (float, default=0.2) – Minimum absolute correlation value required for feature selection. Features whose absolute correlation with the target is above this threshold are selected.
random_state (int, default=0) – Random seed for reproducibility in calculations involving random sampling.
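
The choice of method can change which features pass the threshold. The sketch below is a self-contained illustration, not Modeva's implementation: it implements Pearson correlation and the Xi coefficient from [1] in plain Python and applies the absolute-value threshold rule to a feature whose relationship with the target is purely quadratic. The helper names and the toy data are assumptions, and the `xicor` helper assumes the feature has no tied values.

```python
import math
from bisect import bisect_left, bisect_right


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def xicor(x, y):
    """Xi coefficient from [1], general form; assumes no ties in x."""
    n = len(x)
    # Reorder y by increasing x, then compare ranks of neighbouring y values.
    y_by_x = [b for _, b in sorted(zip(x, y))]
    y_sorted = sorted(y)
    r = [bisect_right(y_sorted, v) for v in y_by_x]       # count of y_j <= y_i
    geq = [n - bisect_left(y_sorted, v) for v in y_by_x]  # count of y_j >= y_i
    numerator = n * sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    denominator = 2 * sum(g * (n - g) for g in geq)
    return 1 - numerator / denominator


# Hypothetical feature with a symmetric, purely quadratic effect on the target.
x = [i / 100 for i in range(-100, 101)]
target = [v ** 2 for v in x]

threshold = 0.2
scores = {"pearson": pearson(x, target), "xicor": xicor(x, target)}
# Selection rule from this page: keep the feature if |score| > threshold.
selected_by = {method: abs(score) > threshold for method, score in scores.items()}
```

On this data the Pearson score is essentially zero while the Xi score is about 0.97, so with `threshold=0.2` the feature would be kept under "xicor" and dropped under "pearson".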

Returns:

A container object with the following components:

key: “data_fs_corr”
data: Name of the dataset used
inputs: Input parameters used for feature selection
value: Dictionary containing:
- "selected": List of selected feature names
table: DataFrame with feature names, importance scores, and selection status
options: Dictionary of visualization configuration for a horizontal bar plot of feature importance. Run results.plot() to show this plot.

Return type:

ValidationResult

Examples

Feature Selection