modeva.DataSet.detect_outlier_cblof

DataSet.detect_outlier_cblof(dataset: str = 'main', threshold: float = 0.99, method: str = 'kmeans', n_clusters: int = 10, cluster_threshold: float = 0.1, use_weights: bool = False, random_state: int = 0)

Performs Cluster-Based Local Outlier Factor (CBLOF) detection on the dataset.

CBLOF is an outlier detection method that uses clustering to identify anomalies. It first clusters the data and classifies each cluster as small or large, then computes each point's outlier score from its distance to a cluster center, optionally weighted by cluster size.
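The scoring step can be sketched in plain Python. This is a minimal illustration of the commonly described CBLOF rule, not Modeva's implementation: it assumes the clustering has already been run, that points in a large cluster are scored against their own center, and that points in a small cluster are scored against the nearest large-cluster center. The function name `cblof_scores` and its arguments are hypothetical.

```python
import math
from collections import Counter


def cblof_scores(points, labels, centers, cluster_threshold=0.1, use_weights=False):
    """Score points from a precomputed clustering (illustrative sketch only)."""
    n = len(points)
    sizes = Counter(labels)
    # A cluster counts as "large" when its share of points exceeds cluster_threshold.
    large = {c for c, size in sizes.items() if size / n > cluster_threshold}
    if not large:
        raise ValueError("no cluster exceeds cluster_threshold")

    scores = []
    for point, label in zip(points, labels):
        if label in large:
            # Large-cluster point: distance to its own cluster center.
            dist = math.dist(point, centers[label])
        else:
            # Small-cluster point: distance to the nearest large-cluster center.
            dist = min(math.dist(point, centers[c]) for c in large)
        # Assumed weighting: multiply by the size of the point's own cluster.
        scores.append(dist * sizes[label] if use_weights else dist)
    return scores


# Toy data: four points in one dense cluster and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
labels = [0, 0, 0, 0, 1]
centers = {0: (0.5, 0.5), 1: (10, 10)}
scores = cblof_scores(points, labels, centers, cluster_threshold=0.3)
```

With `cluster_threshold=0.3`, the single-point cluster is treated as small, so the isolated point is scored against the dense cluster's center and receives the highest score.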

Parameters:

dataset ({"main", "train", "test"}, default="main") – Specifies which dataset partition to analyze for outlier detection.
threshold (float, default=0.99) – Quantile threshold for determining outliers. Must be between 0 and 1; higher values flag fewer samples as outliers.
method ({"kmeans", "gmm"}, default="kmeans") – Clustering algorithm to use: K-means or Gaussian Mixture Model.
n_clusters (int, default=10) – Number of clusters to partition the data into.
cluster_threshold (float, default=0.1) – Proportion threshold that separates small clusters from large ones. A cluster is considered large when its share of the points exceeds this value.
use_weights (bool, default=False) – Whether to weight outlier scores by cluster sizes.
random_state (int, default=0) – Random seed for reproducibility in clustering algorithms.
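To illustrate how a quantile threshold turns scores into outlier flags, here is a small sketch. It assumes a linear-interpolation quantile and a strict "greater than" comparison; the exact convention used by the library is not stated here and may differ. The helper `flag_outliers` and the sample scores are hypothetical.

```python
def flag_outliers(scores, threshold=0.99):
    """Flag scores strictly above the empirical `threshold` quantile (sketch)."""
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    ordered = sorted(scores)
    # Linear-interpolation quantile (an assumed convention).
    pos = threshold * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    cutoff = ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])
    return cutoff, [s > cutoff for s in scores]


# Made-up outlier scores with one obvious extreme value.
scores = [0.2, 0.3, 0.25, 0.4, 0.35, 0.3, 0.28, 0.33, 0.31, 5.0]
cutoff_strict, flags_strict = flag_outliers(scores, threshold=0.9)
cutoff_loose, flags_loose = flag_outliers(scores, threshold=0.5)
```

Raising `threshold` moves the cutoff up the score distribution, so the higher setting flags fewer samples than the lower one.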

Returns:

A container object with the following components:

key: “data_outlier_cblof”
data: Name of the dataset used
inputs: Dictionary of input parameters
table: Dictionary containing:
- “outliers”: DataFrame of detected outlier samples
- “non-outliers”: DataFrame of normal samples
func: Callable function that computes outlier scores for new data
options: Dictionary of visualization configuration for a histogram plot, where the x-axis is the outlier score and the y-axis is the density. Run results.plot() to show this plot.

Return type:

ValidationResult
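A short sketch of how the returned object might be consumed. It assumes `ds` is a DataSet with data already loaded and that the components listed above are exposed as attributes of the result (for example `results.table`); the wrapper `summarize_cblof` is a hypothetical helper, not part of the library.

```python
def summarize_cblof(ds, dataset="main", threshold=0.99):
    """Run CBLOF on a loaded DataSet-like object and count the flagged rows."""
    results = ds.detect_outlier_cblof(
        dataset=dataset,
        threshold=threshold,
        method="kmeans",
        n_clusters=10,
        cluster_threshold=0.1,
        use_weights=False,
        random_state=0,
    )
    # Assumption: result components are available as attributes.
    outliers = results.table["outliers"]
    normal = results.table["non-outliers"]
    return {
        "n_outliers": len(outliers),
        "n_normal": len(normal),
        "results": results,
    }
```

The returned `results` object can then be passed on, for instance to call `results.plot()` for the score histogram described above.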

Examples

Outlier Detection