Outlier Detection#

Outlier detection is a crucial step in data preprocessing. Outliers are data points that deviate significantly from the rest of the data; they can arise from measurement errors, data corruption, or genuinely rare events.

Modeva implements three outlier detection strategies:

CBLOF#

CBLOF (Cluster-Based Local Outlier Factor) is an outlier detection method that uses clustering to identify anomalies. It first partitions data into small and large clusters, then calculates outlier scores based on the distances between points and cluster centers, optionally weighted by cluster sizes.

This clustering-based outlier detection method was originally proposed by [He2003]. The process consists of the following steps:

  • Partition the data into clusters using K-means or a Gaussian mixture model. Clusters are then classified as either large or small according to the number of points they contain, using a given threshold.

  • Calculate the CBLOF score for each sample.

    • For samples belonging to large clusters, compute the Euclidean distance between each sample and its corresponding cluster centroid. This distance represents the outlier score for samples within large clusters.

    • For samples belonging to small clusters, calculate the Euclidean distance between each sample and the centroid of the nearest large cluster. This distance serves as the outlier score for samples within small clusters.

  • Multiplication by cluster size (optional): By default, the outlier score is not multiplied by the cluster size. If desired, it can be multiplied by the size of the sample's cluster to emphasize the impact of outliers within larger clusters.

This method exploits the cluster structure of the data: the score combines distances to a point's own cluster centroid with distances to neighboring large clusters. Together, these factors make the CBLOF score a single measure for identifying and quantifying outliers in the dataset.
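A minimal sketch of this scoring logic, using scikit-learn's KMeans, is given below. It is an illustration rather than Modeva's implementation: the function name cblof_scores and the alpha-based coverage rule for separating large from small clusters are assumptions (a similar threshold is used in [He2003]).

```python
import numpy as np
from sklearn.cluster import KMeans

def cblof_scores(X, n_clusters=8, alpha=0.9, use_weights=False, random_state=0):
    """Score each point by its distance to a large-cluster centroid (CBLOF sketch)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    labels, centers = km.labels_, km.cluster_centers_
    sizes = np.bincount(labels, minlength=n_clusters)

    # Split clusters into "large" and "small": take the biggest clusters
    # until they cover a fraction `alpha` of the data (one common rule).
    order = np.argsort(sizes)[::-1]
    n_large = int(np.searchsorted(np.cumsum(sizes[order]), alpha * len(X))) + 1
    large = set(order[:n_large].tolist())
    large_centers = centers[sorted(large)]

    scores = np.empty(len(X))
    for i in range(len(X)):
        if labels[i] in large:
            # Large cluster: distance to the point's own centroid.
            scores[i] = np.linalg.norm(X[i] - centers[labels[i]])
        else:
            # Small cluster: distance to the nearest large-cluster centroid.
            scores[i] = np.linalg.norm(large_centers - X[i], axis=1).min()
        if use_weights:
            scores[i] *= sizes[labels[i]]  # optional cluster-size weighting
    return scores
```

By default the score is unweighted, matching the description above; passing use_weights=True applies the optional cluster-size multiplication.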

Isolation Forest#

This method isolates observations through a random selection process [Liu2008]. It begins by randomly choosing a feature from the dataset and then selecting a split value uniformly between that feature's minimum and maximum. This process is repeated recursively to construct isolation trees, until each node contains a single instance or all data at the node share the same values.

The algorithm computes the anomaly score from the average path length required to isolate each observation. Outliers are expected to have shorter average path lengths, since they are easier to separate from the rest of the data, whereas normal observations require longer paths. The randomness in feature and split-value selection makes the algorithm efficient and well suited to high-dimensional datasets, and because it does not rely on any specific distribution of the data, it is applicable to a wide range of data types and outlier detection tasks. Note that DataSet.detect_outlier_isolation_forest in Modeva is a wrapper around scikit-learn's sklearn.ensemble.IsolationForest.
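Since the Modeva method wraps scikit-learn, the underlying estimator can be demonstrated directly. This is a minimal sketch using the scikit-learn API (Modeva's wrapper signature may differ), with a few planted outliers:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[:5] += 6  # plant a few obvious outliers

iso = IsolationForest(n_estimators=100, contamination="auto", random_state=0).fit(X)

labels = iso.predict(X)        # -1 for outliers, +1 for inliers
scores = iso.score_samples(X)  # lower score = shorter average path = more anomalous
print(np.flatnonzero(labels == -1))  # indices flagged as outliers
```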

PCA-based Method#

In addition to the methods above, dimensionality reduction techniques such as Principal Component Analysis (PCA) can also be used for outlier detection. In Modeva, DataSet.detect_outlier_pca implements two PCA-based approaches for calculating the outlier score: Mahalanobis distance and reconstruction error.

  • Mahalanobis distance: This method computes the distance between each data point and the centroid of the dataset while accounting for the covariance structure of the data, i.e., the correlations between variables. The Mahalanobis distance can be obtained easily from the PCA decomposition via the formula below (see [Shyu2003] for details, and the first code sketch after this list):

\[MD(x)^2 = \sum_{i=1}^{p} \frac{z_{i}^{2}}{\lambda_{i}},\]

where \(z_{i}\) is the \(i\)-th principal component score of \(x\), \(\lambda_{i}\) is the \(i\)-th eigenvalue of the covariance matrix (the variance explained by the \(i\)-th component), and \(p\) is the number of components. This is precisely the sum of squared coordinates in the variance-scaled PCA space, which recovers the usual Mahalanobis distance.

  • Reconstruction error: This method uses the error obtained from reconstructing each data point from the principal components. The reconstruction error quantifies the dissimilarity between the original data point and its reconstructed representation; higher reconstruction errors indicate potential outliers (see the second code sketch below).

In Modeva, the reconstruction is performed by fitting an XGBoost model that predicts the covariates from the principal components. This model is used to reconstruct the covariates \(X_{new}\), and the reconstruction error is the difference between the original covariates and the reconstruction, i.e., \(X - X_{new}\). Finally, the Mahalanobis distance of the reconstruction errors is taken as the final outlier score, in order to account for correlations among the errors. If the reconstruction errors of the individual features are mutually independent, the outlier score reduces to the mean squared reconstruction error.
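The Mahalanobis distance identity above translates directly into a few lines of code. This is a sketch assuming all principal components are retained; in practice, components with near-zero eigenvalues would be truncated to keep the score numerically stable.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_mahalanobis_sq(X):
    """Squared Mahalanobis distance via MD(x)^2 = sum_i z_i^2 / lambda_i."""
    pca = PCA().fit(X)             # keep all components; PCA centers X internally
    Z = pca.transform(X)           # principal component scores z_i
    lam = pca.explained_variance_  # eigenvalues lambda_i of the covariance matrix
    return (Z ** 2 / lam).sum(axis=1)
```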
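The reconstruction-error score can be sketched in the same spirit. The hyperparameters below (number of components, XGBoost settings) are illustrative assumptions rather than Modeva's defaults, and the input is assumed to be a NumPy array:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.multioutput import MultiOutputRegressor
from sklearn.covariance import EmpiricalCovariance
from xgboost import XGBRegressor

def pca_reconstruction_scores(X, n_components=3):
    """Outlier scores: Mahalanobis distance of PCA reconstruction errors."""
    Z = PCA(n_components=n_components).fit_transform(X)

    # Reconstruct the covariates from their leading principal components
    # with XGBoost (one model per covariate via MultiOutputRegressor).
    model = MultiOutputRegressor(XGBRegressor(n_estimators=100)).fit(Z, X)
    X_new = model.predict(Z)

    # Mahalanobis distance of the errors X - X_new accounts for the
    # correlations among per-feature reconstruction errors.
    E = X - X_new
    return EmpiricalCovariance().fit(E).mahalanobis(E)  # squared distances
```

EmpiricalCovariance().mahalanobis returns squared distances, so larger values indicate stronger outliers.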

Examples#

References#