modeva.DataSet.eda_correlation

DataSet.eda_correlation(features: Tuple | List = None, dataset: str = 'main', method: str = 'pearson', sample_size: int = None, random_state: int = 0)

Calculate and visualize correlation matrices between features in the dataset.

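The measure used for each pair of features depends on their types. For two numerical features, Pearson, Spearman, Kendall, or XI correlation is used, as selected by method. For two categorical features, symmetric Theil’s U is used. For one numerical and one categorical feature, the correlation ratio is used.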
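The two mixed-type measures named above can be sketched with their textbook definitions. This is a minimal illustration using only the Python standard library, not modeva's internal implementation: the function names are hypothetical, and details such as whether the correlation ratio is reported as eta or eta squared, or how constant features are handled, are assumptions.

```python
import math
from collections import Counter, defaultdict


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numerical feature.

    eta^2 is the share of the total variance of `values` that is explained
    by the group means of `categories`.
    """
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    overall_mean = sum(values) / len(values)
    between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    )
    total = sum((v - overall_mean) ** 2 for v in values)
    # Assumption: a constant numerical feature yields 0.
    return math.sqrt(between / total) if total > 0 else 0.0


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of hashable labels."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())


def symmetric_theils_u(x, y):
    """Symmetric uncertainty: 2 * I(X; Y) / (H(X) + H(Y))."""
    h_x, h_y = entropy(x), entropy(y)
    h_xy = entropy(list(zip(x, y)))
    if h_x + h_y == 0:
        # Assumption: two constant features are reported as 0.
        return 0.0
    mutual_information = h_x + h_y - h_xy
    return 2.0 * mutual_information / (h_x + h_y)


# Group membership fully determines the value, so eta is at its maximum.
eta = correlation_ratio(["a", "a", "b", "b"], [1.0, 1.0, 3.0, 3.0])

# y is a relabelling of x (U at its maximum); z is independent of x.
u_dependent = symmetric_theils_u(["a", "a", "b", "b"], ["u", "u", "v", "v"])
u_independent = symmetric_theils_u(["a", "a", "b", "b"], ["u", "v", "u", "v"])
```

Both measures lie in the interval from 0 to 1 and carry no sign, unlike Pearson, Spearman, and Kendall correlations.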

Parameters:

features (tuple or list, default=None) – Feature names to include in the correlation analysis. If None, all features in the dataset are used.
dataset ({"main", "train", "test"}, default="main") – Specifies which dataset partition to use for correlation analysis.
method ({"pearson", "spearman", "kendall", "xicor"}, default="pearson") –
The algorithm for calculating the correlation between two numerical features.
- "pearson": Pearson correlation measures the linear relationship between two continuous variables. Its value ranges from −1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), with 0 indicating no linear correlation. It captures linear relationships but can miss nonlinear patterns.
- "spearman": Spearman correlation assesses the strength and direction of a monotonic relationship between two variables, based on their ranks. It ranges from −1 to 1, where 1 indicates a perfect increasing monotonic relationship and −1 a perfect decreasing one. It is more robust to outliers than Pearson correlation and can capture nonlinear relationships, provided they are monotonic.
- "kendall": Kendall's tau measures the association between two ranked variables, based on how consistently pairs of observations are ordered in both. Its value ranges from −1 (perfect discordance) to 1 (perfect concordance). It is particularly useful for ordinal data and is robust to outliers.
- "xicor": XI correlation (XiCor) detects both linear and nonlinear dependencies between continuous variables, including non-monotonic ones. It typically ranges from 0 (no dependence) to 1 (strong dependence). Negative values can occur, but they carry no special meaning and should be read as close to zero. See the paper [1] for details.
[1] Chatterjee, S. (2021). A new coefficient of correlation. J. Amer. Statist. Assoc., 116, no. 536, 2009–2022.
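To illustrate why "xicor" can pick up dependencies that "pearson" misses, the sketch below implements the no-ties form of the XI coefficient from [1], xi = 1 − 3 Σ|r(i+1) − r(i)| / (n² − 1), where r(i) is the rank of the y value paired with the i-th smallest x. It is a standard-library illustration under the assumption that neither x nor y contains ties. It is not modeva's implementation, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def xi_correlation(x, y):
    """XI coefficient of y on x, assuming no ties in x or y."""
    n = len(x)
    # Reorder the y values by increasing x.
    order = sorted(range(n), key=lambda i: x[i])
    y_by_x = [y[i] for i in order]
    # Rank of each y value (1 = smallest); valid only without ties.
    rank_of = {v: r for r, v in enumerate(sorted(y_by_x), start=1)}
    ranks = [rank_of[v] for v in y_by_x]
    # Sum of absolute rank changes between neighbours in x order.
    total = sum(abs(ranks[i + 1] - ranks[i]) for i in range(n - 1))
    return 1.0 - 3.0 * total / (n * n - 1)


# A U-shaped (non-monotonic) relationship; the 0.25 shift avoids ties in y.
xs = list(range(-10, 11))
ys = [(v - 0.25) ** 2 for v in xs]

print("pearson:", round(pearson(xs, ys), 3))
print("xicor:  ", round(xi_correlation(xs, ys), 3))
```

On this data the Pearson coefficient stays close to zero while the XI coefficient is well above it. Note that in this no-ties form the coefficient stays below 1 for any finite sample, even for a perfectly monotonic relationship.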
sample_size (int, default=None) – Number of rows to sample at random for the calculation. If None, the entire dataset is used. Useful for reducing computation time on large datasets.
random_state (int, default=0) – Random seed for reproducible sampling when sample_size is specified.
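The interplay between sample_size and random_state can be pictured with a small standard-library sketch. The helper sample_rows is hypothetical and only mirrors the documented behaviour (None means use everything; a fixed seed gives a reproducible subset). How modeva draws its sample internally is not stated here, so the behaviour when sample_size exceeds the number of rows is an assumption.

```python
import random


def sample_rows(rows, sample_size=None, random_state=0):
    """Return a reproducible random subset of `rows`."""
    # Assumption: None, or a size at least as large as the data, means "use all rows".
    if sample_size is None or sample_size >= len(rows):
        return list(rows)
    rng = random.Random(random_state)  # seeded generator, independent of global state
    picked = rng.sample(range(len(rows)), sample_size)
    # Keep the original row order among the sampled rows.
    return [rows[i] for i in sorted(picked)]


rows = list(range(1000))
first = sample_rows(rows, sample_size=100, random_state=0)
second = sample_rows(rows, sample_size=100, random_state=0)
other = sample_rows(rows, sample_size=100, random_state=1)
```

Here first and second contain the same rows because they share a seed, while other is drawn with a different seed. Because a correlation computed on a subsample is an estimate, comparing results across a few seeds gives a rough sense of how stable it is.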

Returns:

A container object with the following components:

key: “data_eda_correlation”
data: Name of the dataset used
inputs: Dictionary of input parameters
table: Correlation matrix as a pandas DataFrame
options: Dictionary of visualization configuration for a heatmap plot. Call plot() on the returned object (for example, results.plot()) to show this plot.

Return type:

ValidationResult

Examples

Exploratory Data Analysis