modeva.DataSet.eda_umap

DataSet.eda_umap(features: Tuple | List = None, n_neighbors: int = 15, n_components: int = 2, metric: str = 'euclidean', dataset: str = 'main', sample_size: int = None, categorical_encoding: str = 'ordinal', random_state: int = 0)

Performs UMAP dimensionality reduction on the specified features of the dataset.

This function processes both numerical and categorical features, applies appropriate encoding and scaling, and then performs UMAP dimensionality reduction. It handles data preprocessing including optional subsampling, categorical encoding, and numerical scaling before applying the UMAP algorithm.
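The preprocessing steps named above (subsampling, categorical encoding, numerical scaling) can be illustrated with a minimal standard-library sketch. This is not modeva's implementation: the helper names are hypothetical, and min-max scaling is an assumption, since the documentation does not say which scaler is used.

```python
import random

def ordinal_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]

def onehot_encode(values):
    """Create one 0/1 column per distinct category, in sorted order."""
    categories = sorted(set(values))
    return [[1 if v == cat else 0 for cat in categories] for v in values]

def minmax_scale(values):
    """Rescale numbers to [0, 1]; a constant column maps to all zeros.

    Assumption: the scaler modeva actually applies is not documented here.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]

def subsample(rows, sample_size, random_state=0):
    """Return a reproducible random subset, or all rows if sample_size is None."""
    if sample_size is None or sample_size >= len(rows):
        return list(rows)
    return random.Random(random_state).sample(rows, sample_size)

# Toy rows: one numerical feature and one categorical feature.
rows = [(10.0, "red"), (20.0, "blue"), (30.0, "red"), (40.0, "green")]
picked = subsample(rows, sample_size=3, random_state=0)
numeric = minmax_scale([r[0] for r in picked])
colors_ordinal = ordinal_encode([r[1] for r in picked])
colors_onehot = onehot_encode([r[1] for r in picked])
```

Ordinal encoding keeps one column per feature but imposes an artificial order on the categories, while one-hot encoding avoids that order at the cost of one column per category. Either choice changes the distances that UMAP sees.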

Parameters:

features (tuple or list, default=None) – Names of features to use for UMAP. If None, all features are used.
n_neighbors (int, default=15) – Number of neighboring points used in local approximations. Values typically range from 2 to 100; larger values emphasize global structure, while smaller values emphasize local structure.
n_components (int, default=2) – Number of dimensions in the reduced space. Should be between 2 and min(n_samples-1, n_features).
metric (str, default="euclidean") – Distance metric for UMAP. Supported metrics include ‘euclidean’, ‘manhattan’, ‘chebyshev’, ‘minkowski’, ‘cosine’, among others.
dataset ({"main", "train", "test"}, default="main") – Specifies which dataset partition to use for the analysis.
sample_size (int, default=None) – If set, randomly samples this many points from the dataset. Useful for large datasets.
categorical_encoding ({"ordinal", "onehot"}, default="ordinal") –
Method for encoding categorical variables:
- "ordinal": Converts categories to integer values
- "onehot": Creates binary columns for each category
random_state (int, default=0) – Seed for random number generation, ensuring reproducibility.

Returns:

A container object with the following components:

key: “data_eda_umap”
data: Name of the dataset used
inputs: Dictionary of input parameters
table: DataFrame containing the UMAP reduced dimensions

Return type:

ValidationResult
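A call can be sketched as below, using only the parameters from the signature above. The wrapper name `umap_embedding` is hypothetical, `ds` stands for an already loaded DataSet, and reading the reduced dimensions through a `table` attribute is an assumption based on the component list above.

```python
def umap_embedding(ds, features=None, sample_size=None):
    """Run eda_umap on a DataSet-like object and return the reduced table."""
    result = ds.eda_umap(
        features=features,            # None means all features are used
        n_neighbors=15,
        n_components=2,
        metric="euclidean",
        dataset="main",
        sample_size=sample_size,      # None means no subsampling
        categorical_encoding="ordinal",
        random_state=0,               # fixed seed for reproducibility
    )
    # Assumption: the ValidationResult exposes its components as attributes.
    return result.table
```

Keeping `random_state` fixed makes repeated calls return the same subsample and embedding, which helps when comparing runs that differ only in `n_neighbors` or `metric`.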

Examples

Exploratory Data Analysis