modeva.DataSet.feature_select_corr

DataSet.feature_select_corr(dataset: str = 'train', method: str = 'pearson', threshold: float = 0.2, random_state: int = 0)

Selects features based on their correlation strength with the target variable.

Analyzes the relationship between features and the target variable using different correlation metrics:

  • Pearson, Spearman, Kendall, or Xi correlation (chosen by the method parameter) for numerical-numerical pairs;

  • Theil’s U for categorical-categorical pairs;

  • correlation ratio for numerical-categorical pairs.
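The two mixed-type measures in the list above can be sketched in plain Python. This is only an illustration of the standard definitions, not Modeva's implementation: the function names are hypothetical, and the direction of Theil's U (which variable is conditioned on) is an assumption.

```python
import math
from collections import Counter


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numerical sequence."""
    groups = {}
    for c, v in zip(categories, values):
        groups.setdefault(c, []).append(v)
    grand_mean = sum(values) / len(values)
    # Between-group sum of squares, weighted by group size.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    return math.sqrt(ss_between / ss_total) if ss_total > 0 else 0.0


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())


def theils_u(x, y):
    """Theil's U(x | y): share of the entropy of x explained by y (asymmetric)."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so it is trivially determined
    n = len(x)
    by_y = {}
    for xi, yi in zip(x, y):
        by_y.setdefault(yi, []).append(xi)
    # Conditional entropy H(x | y) as a weighted average over the levels of y.
    h_x_given_y = sum(len(g) / n * entropy(g) for g in by_y.values())
    return (h_x - h_x_given_y) / h_x
```

Both measures lie in [0, 1], so for these pair types the absolute value in the selection rule has no effect.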

Features with absolute correlation above the specified threshold are selected.
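The selection rule can be illustrated with a minimal sketch, assuming Pearson correlation and purely numerical columns. The helper names (pearson, select_by_abs_corr) and the toy data are hypothetical and are not part of Modeva.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def select_by_abs_corr(features, target, threshold=0.2):
    """Keep features whose absolute correlation with the target exceeds threshold."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    selected = [name for name, r in scores.items() if abs(r) > threshold]
    return selected, scores


# Toy data: one increasing feature, one decreasing feature, one noisy feature.
features = {
    "temp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "neg": [5.0, 4.0, 3.0, 2.0, 1.0],
    "noise": [2.0, 1.0, 3.0, 1.0, 2.0],
}
target = [1.1, 1.9, 3.2, 3.9, 5.1]

selected, scores = select_by_abs_corr(features, target, threshold=0.2)
print(selected)
```

Because the absolute value is used, a strongly negative feature such as "neg" is selected alongside the strongly positive one, while "noise" falls below the threshold.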

Parameters:
  • dataset ({"main", "train", "test"}, default="train") – Dataset partition to use for feature selection analysis.

  • method ({"pearson", "spearman", "kendall", "xicor"}, default="pearson") –

    The correlation measure used when both the feature and the target are numerical.

    • "pearson": Pearson correlation measures the linear relationship between two continuous variables. Its value ranges from −1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), with 0 indicating no linear correlation. It captures linear relationships but can miss nonlinear patterns.

    • "spearman": Spearman correlation assesses the strength and direction of a monotonic relationship between two variables, based on their ranks. It ranges from −1 to 1, where 1 indicates a perfect increasing monotonic relationship and −1 a perfect decreasing one. It is more robust to outliers than Pearson correlation and can capture nonlinear relationships, provided they are monotonic.

    • "kendall": Kendall's tau measures the association between two ranked variables, based on how consistently pairs of observations are ordered in both. Its value ranges from −1 (perfect discordance) to 1 (perfect concordance). It is particularly useful for ordinal data and is robust to outliers.

    • "xicor": XiCor (Chatterjee's xi coefficient) detects both linear and nonlinear dependencies between continuous variables. It typically ranges from 0 (no dependence) to 1 (strong dependence). Negative values can occur, but they carry no special meaning and should be read as close to zero. See [1] for details.

    [1] Chatterjee, S. (2021). A new coefficient of correlation. Journal of the American Statistical Association, 116(536), 2009–2022.

  • threshold (float, default=0.2) – Selection cutoff. Features whose absolute correlation with the target exceeds this value are selected.

  • random_state (int, default=0) – Random seed for reproducibility in calculations involving random sampling.
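To make the contrast between "pearson" and "xicor" concrete, the sketch below implements Chatterjee's xi from its published definition and applies it to a symmetric, non-monotonic relationship. It is not Modeva's code. The function names are hypothetical, and the use of the seed to break ties in x at random follows the paper; whether random_state plays exactly this role in Modeva is an assumption.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def xicor(x, y, random_state=0):
    """Chatterjee's xi of y on x, using the general formula that allows ties in y."""
    n = len(x)
    rng = random.Random(random_state)
    # Sort by x; ties in x are broken at random (assumed role of the seed).
    order = sorted(range(n), key=lambda i: (x[i], rng.random()))
    y_sorted = [y[i] for i in order]
    # r[i]: number of y values <= y_sorted[i]; l[i]: number of y values >= y_sorted[i].
    r = [sum(1 for v in y_sorted if v <= yi) for yi in y_sorted]
    l = [sum(1 for v in y_sorted if v >= yi) for yi in y_sorted]
    num = n * sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    den = 2 * sum(li * (n - li) for li in l)
    return 1 - num / den if den > 0 else 0.0  # den == 0 when y is constant


# y depends on x deterministically, but not linearly or monotonically.
x = [float(v) for v in range(-5, 6)]
y = [v * v for v in x]

r_pearson = pearson(x, y)
xi = xicor(x, y, random_state=0)
print(r_pearson, xi)
```

On this data Pearson correlation is zero, so the feature would be dropped at any positive threshold, while xi is clearly positive. Note that xi is asymmetric: xicor(x, y) and xicor(y, x) generally differ.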

Returns:

A container object with the following components:

  • key: "data_fs_corr"

  • data: Name of the dataset used

  • inputs: Input parameters used for feature selection

  • value: Dictionary containing:

    • "selected": List of selected feature names

  • table: DataFrame with feature names, importance scores, and selection status

  • options: Dictionary of visualization configuration for a horizontal bar plot of feature importance. Call results.plot() on the returned object to show this plot.

Return type:

ValidationResult

Examples

Feature Selection
