modeva.DataSet.detect_outlier_pca

DataSet.detect_outlier_pca(dataset: str = 'main', threshold: float = 0.99, cumulative_variance_threshold: float = 0.9, method: str = 'mahalanobis', sparse_pca: bool = False, alpha: float = 1.0, random_state: int = 0)

Performs outlier detection using PCA-based methods.

This function implements two PCA-based outlier detection approaches: Mahalanobis distance in PCA space and reconstruction error using PCA components. It first transforms the data using PCA (or Sparse PCA), then calculates outlier scores based on the chosen method, and identifies outliers using a threshold on these scores.
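The two scoring approaches can be illustrated with a minimal sketch built on scikit-learn and NumPy. This is not Modeva's internal implementation: the function names `pca_outlier_scores` and `flag_outliers` are hypothetical, and standardizing the features before PCA is an assumption of this sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_outlier_scores(X, cumulative_variance_threshold=0.9,
                       method="mahalanobis", random_state=0):
    """Return one outlier score per row of X (higher means more anomalous)."""
    # Assumption of this sketch: features are standardized before PCA.
    X_scaled = StandardScaler().fit_transform(X)

    # A float n_components in (0, 1) keeps the smallest number of components
    # whose cumulative explained variance exceeds that fraction.
    pca = PCA(n_components=cumulative_variance_threshold,
              svd_solver="full", random_state=random_state)
    Z = pca.fit_transform(X_scaled)

    if method == "mahalanobis":
        # Principal components are uncorrelated, so the covariance in PCA
        # space is diagonal and the squared Mahalanobis distance reduces to
        # sum_k z_k^2 / lambda_k.
        return np.sqrt(np.sum(Z ** 2 / pca.explained_variance_, axis=1))
    if method == "reconst_error":
        # Project back to the scaled feature space and measure what was lost.
        X_hat = pca.inverse_transform(Z)
        return np.sum((X_scaled - X_hat) ** 2, axis=1)
    raise ValueError(f"unknown method: {method!r}")


def flag_outliers(scores, threshold=0.99):
    """Flag samples whose score lies above the given quantile of the scores."""
    cutoff = np.quantile(scores, threshold)
    return scores > cutoff
```

The two scores respond to different anomalies: the Mahalanobis score is large for samples that are extreme along the retained components, while the reconstruction error is large for samples that the retained components describe poorly.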

Parameters:

dataset ({"main", "train", "test"}, default="main") – Dataset to analyze for outliers. Options are “main” (full dataset), “train” (training set), or “test” (test set).
threshold (float, default=0.99) – Quantile threshold for outlier classification. Samples whose outlier score exceeds this quantile of the score distribution are classified as outliers; with the default of 0.99, roughly the top 1% of samples are flagged. Must be between 0 and 1.
cumulative_variance_threshold (float, default=0.9) – Threshold for selecting the number of principal components based on cumulative explained variance. Must be between 0 and 1.
method ({"mahalanobis", "reconst_error"}, default="mahalanobis") – Method for calculating outlier scores:
- “mahalanobis”: Uses Mahalanobis distance in PCA space
- “reconst_error”: Uses reconstruction error with PCA components
sparse_pca (bool, default=False) – Whether to use Sparse PCA instead of standard PCA for dimensionality reduction.
alpha (float, default=1.0) – Sparsity-controlling parameter for Sparse PCA. Only used when sparse_pca=True.
random_state (int, default=0) – Random seed for reproducibility.
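As an illustration of how cumulative_variance_threshold could translate into a number of components, the sketch below picks the smallest count whose cumulative explained-variance ratio reaches the threshold. The helper name `n_components_for_variance` is hypothetical, and whether Modeva uses "reaches" or "strictly exceeds" at the boundary is not stated in this page, so treat the comparison as an assumption.

```python
import numpy as np


def n_components_for_variance(explained_variance_ratio,
                              cumulative_variance_threshold=0.9):
    """Smallest number of leading components reaching the variance threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value >= threshold, converted to a count.
    k = int(np.searchsorted(cumulative, cumulative_variance_threshold,
                            side="left")) + 1
    # Guard against thresholds that rounding keeps just out of reach.
    return min(k, len(cumulative))


# Example ratios (made up for illustration): cumulative sums are
# approximately 0.5, 0.8, 0.95, 1.0.
ratios = np.array([0.5, 0.3, 0.15, 0.05])
k_default = n_components_for_variance(ratios, 0.9)
```

A lower threshold keeps fewer components, which makes the reconstruction-error score more sensitive, since more of each sample's variation is left unexplained.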

Returns:

A container object with the following components:

key: “data_outlier_pca”
data: Name of the dataset used
inputs: Dictionary of input parameters
table: Dictionary containing:
- “outliers”: DataFrame of detected outlier samples
- “non-outliers”: DataFrame of normal samples
func: Callable function that computes outlier scores for new data
options: Dictionary of visualization configuration for a histogram of the outlier scores (x-axis: outlier score; y-axis: density). Run results.plot() to show this plot.

Return type:

ValidationResult
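A minimal usage sketch follows. It assumes `ds` is an already loaded and prepared DataSet, and that the components listed above are reachable as attributes of the returned object (for example `results.table`); the wrapper `summarize_pca_outliers` is a hypothetical helper introduced only for illustration.

```python
def summarize_pca_outliers(ds, dataset="train"):
    """Run PCA-based outlier detection and count the two result groups.

    `ds` is assumed to be a prepared modeva DataSet; attribute-style access
    to `table` on the result is an assumption of this sketch.
    """
    results = ds.detect_outlier_pca(
        dataset=dataset,
        threshold=0.99,
        cumulative_variance_threshold=0.9,
        method="reconst_error",
    )
    outliers = results.table["outliers"]      # detected outlier samples
    normal = results.table["non-outliers"]    # remaining samples
    return results, len(outliers), len(normal)
```

Calling `results.plot()` on the returned object then shows the histogram of outlier scores described under options.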

Examples

Outlier Detection