modeva.DataSet.feature_select_corr
- DataSet.feature_select_corr(dataset: str = 'train', method: str = 'pearson', threshold: float = 0.2, random_state: int = 0)
Selects features based on their correlation strength with the target variable.
Analyzes the relationship between features and the target variable using different correlation metrics:
Pearson / Spearman / Kendall / XI correlation for numerical-numerical pairs;
Theil’s U for categorical-categorical pairs;
correlation ratio for numerical-categorical pairs.
Features with absolute correlation above the specified threshold are selected.
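The two non-numerical measures and the selection rule can be sketched in plain Python. This is an illustrative sketch, not Modeva's implementation: the helper names (`correlation_ratio`, `theils_u`, `select_features`) and the sample scores are hypothetical, the direction of Theil's U (uncertainty in the first argument explained by the second) is an assumption, and the source does not say whether the threshold comparison is strict.

```python
import math
from collections import Counter, defaultdict


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numerical variable."""
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    overall_mean = sum(values) / len(values)
    # Between-group sum of squares relative to the total sum of squares.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    )
    ss_total = sum((v - overall_mean) ** 2 for v in values)
    return math.sqrt(ss_between / ss_total) if ss_total > 0 else 0.0


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())


def theils_u(x, y):
    """Theil's U(x | y): fraction of the uncertainty in x explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so there is nothing left to explain
    n = len(x)
    by_y = defaultdict(list)
    for xi, yi in zip(x, y):
        by_y[yi].append(xi)
    # Conditional entropy H(x | y) as a weighted average over the levels of y.
    h_x_given_y = sum(len(g) / n * entropy(g) for g in by_y.values())
    return (h_x - h_x_given_y) / h_x


def select_features(scores, threshold=0.2):
    """Keep features whose absolute score is above the threshold (assumed strict)."""
    return [name for name, s in scores.items() if abs(s) > threshold]


# Made-up scores, only to show the thresholding step.
scores = {"temp": 0.63, "windspeed": -0.23, "holiday": 0.05}
print(select_features(scores, threshold=0.2))  # ['temp', 'windspeed']
```

Note that the sign is ignored at the selection step, so a feature with a score of -0.23 is kept at the default threshold of 0.2.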
- Parameters:
dataset ({"main", "train", "test"}, default="train") – Dataset partition to use for feature selection analysis.
method ({"pearson", "spearman", "kendall", "xicor"}, default="pearson") –
The algorithm for calculating the correlation when the variable and target are both numerical.
"pearson": Pearson correlation measures the linear relationship between two continuous variables. Its value ranges from −1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), with 0 indicating no linear correlation. It captures linear relationships and can miss nonlinear patterns.
"spearman": Spearman correlation assesses the strength and direction of a monotonic relationship between two variables, based on their ranks. It ranges from −1 to 1, where 1 indicates a perfect increasing monotonic relationship and −1 a perfect decreasing one. It is robust to outliers and can capture nonlinear relationships as long as they are monotonic.
"kendall": Kendall's tau measures the association between two ranked variables, focusing on the consistency of the order between them. Its value ranges from −1 (perfect discordance) to 1 (perfect concordance). It is particularly useful for ordinal data and is robust to outliers.
"xicor": XiCor detects both linear and nonlinear dependencies between continuous variables. It typically ranges from 0 (no dependence) to 1 (strong dependence), so it can flag relationships that the other three coefficients miss. Negative values can occur in finite samples; they carry no special meaning and should be read as dependence close to zero. See [1] for details.
[1] Chatterjee, S. (2021). A new coefficient of correlation. Journal of the American Statistical Association, 116(536), 2009–2022.
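The difference between the four options is easiest to see on a relationship that is strong but not monotonic. The sketch below implements textbook versions of the coefficients in plain Python, assuming no ties in the data; the function names are hypothetical and the code is not Modeva's implementation, which may handle ties and edge cases differently.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """Ranks 1..n; assumes all values are distinct (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def xicor(x, y):
    """Chatterjee's xi for data without ties."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])  # sort pairs by x
    r = ranks(y)
    r_sorted = [r[i] for i in order]  # ranks of y in x-sorted order
    total = sum(abs(r_sorted[i + 1] - r_sorted[i]) for i in range(n - 1))
    return 1 - 3 * total / (n * n - 1)


# U-shaped relationship; the small linear term only avoids ties in y.
x = list(range(-50, 51))
y = [v * v + 0.001 * v for v in x]

for name, fn in [("pearson", pearson), ("spearman", spearman),
                 ("kendall", kendall), ("xicor", xicor)]:
    print(f"{name:>8}: {fn(x, y): .3f}")
```

On this data the first three coefficients come out close to 0, while xi is above 0.9. With the default `method="pearson"` and `threshold=0.2`, a feature related to the target in this way would therefore not be selected. In Chatterjee's definition, ties in the first variable are broken at random, which is one plausible reason the method accepts a `random_state`; the source does not state this explicitly.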
threshold (float, default=0.2) – Minimum absolute correlation value required for feature selection. Features whose absolute correlation with the target exceeds this threshold are selected.
random_state (int, default=0) – Random seed for reproducibility in calculations involving random sampling.
- Returns:
A container object with the following components:
key: "data_fs_corr"
data: Name of the dataset used
inputs: Input parameters used for feature selection
value: Dictionary containing:
"selected": List of selected feature names
table: DataFrame with feature names, importance scores, and selection status
options: Dictionary holding the visualization configuration for a horizontal bar plot of feature importance. Run results.plot() to show this plot.
- Return type:
Examples
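A minimal usage sketch, based only on the signature and return components documented above. The wrapper name `run_correlation_selection` is hypothetical, `ds` is assumed to be a `DataSet` that has already been loaded, given a target, and split, and reading the result through attribute access (`result.value`) is an assumption about how the returned container exposes its components.

```python
def run_correlation_selection(ds, method="spearman", threshold=0.3):
    """Run correlation-based feature selection on a prepared DataSet-like object.

    `ds` only needs a `feature_select_corr` method with the documented signature.
    """
    result = ds.feature_select_corr(
        dataset="train",      # partition to analyze
        method=method,        # used for numerical-numerical pairs
        threshold=threshold,  # minimum absolute correlation to keep a feature
        random_state=0,
    )
    # Assumption: the container exposes its "value" component as an attribute.
    selected = result.value["selected"]
    return selected, result


# Hypothetical usage with a prepared modeva DataSet (setup not shown):
# selected, result = run_correlation_selection(ds, method="xicor", threshold=0.2)
# print(selected)   # names of the selected features
# result.plot()     # horizontal bar plot of feature importance
```

Raising `threshold` keeps fewer features. Switching `method` changes only how numerical features are scored against a numerical target, since categorical pairs use Theil's U or the correlation ratio regardless of this setting.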