Subsampling and Data Drift#
This section covers random subsampling of a dataset, with options for shuffling and stratification, and testing whether the distribution of the subsampled data has drifted from that of the original data.
DataSet.subsample_random: Subsample data randomly with shuffling and stratification options.
DataSet.data_drift_test: Test distributional differences between two data samples (e.g., subsampled data vs. original data, or training data vs. testing data).
Subsampling#
DataSet.subsample_random draws a random subsample of the requested sample_size from the original dataset. Stratified subsampling is supported through the stratify parameter. The function returns the indices of the subsampled data, which can be used to set the active samples in the dataset.
# Draw a stratified subsample of 1000 rows from the "main" dataset
subsampler = ds.subsample_random(dataset="main",
                                 sample_size=1000, stratify="Gender")
idx = subsampler.value["sample_idx"]  # indices of the subsampled rows
ds.set_active_samples(dataset="main", sample_idx=idx)  # restrict the dataset to the subsample
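As a quick sanity check of the stratification (not part of the Modeva API), one can compare the proportions of the stratify column before and after subsampling. The sketch below assumes the raw data is also available as a pandas DataFrame df and that sample_idx holds positional row indices:

# Assumption: df is the pandas DataFrame underlying the "main" dataset
print(df["Gender"].value_counts(normalize=True))             # original proportions
print(df.iloc[idx]["Gender"].value_counts(normalize=True))   # subsample proportions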
Distribution Drift#
Given two datasets, DataSet.data_drift_test computes distance metrics that quantify distributional drift. It supports three distance metrics (PSI, KS, and WD1) for evaluating the marginal distribution drift between the two datasets.
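A minimal calling sketch is shown below. The argument names dataset1, dataset2, and distance_metric are illustrative assumptions rather than documented Modeva parameters; consult the API reference for the exact signature.

# Hypothetical argument names -- check the DataSet.data_drift_test reference for the real ones
result = ds.data_drift_test(dataset1="main", dataset2="test",
                            distance_metric="PSI")  # one of "PSI", "KS", "WD1" (assumed spelling)
print(result.value)  # assumed: per-feature drift scores, mirroring subsample_random's .value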
Population Stability Index (PSI)
PSI is a statistical measure of how much the distribution of a variable has changed between two samples. It is a discrete (binned) version of the Kullback-Leibler (K-L) divergence between two sets of sample data. More specifically, choose a binning scheme that bins the two datasets \(P\) and \(Q\) into \(B\) bins. The discrete K-L divergence of \(Q\) from \(P\) is \(D_{KL}(p \| q) = \sum_{i=1}^{B} p_i \ln(p_i / q_i)\). Note that this quantity is asymmetric; the divergence with respect to \(q\) is \(D_{KL}(q \| p) = \sum_{i=1}^{B} q_i \ln(q_i / p_i)\). PSI is the sum of the two asymmetric versions:
\[ \mathrm{PSI} = D_{KL}(p \| q) + D_{KL}(q \| p) = \sum_{i=1}^{B} (p_i - q_i) \ln\!\left(\frac{p_i}{q_i}\right). \]
Here, \(B\) is the number of bins, and the \(p_i\)’s and \(q_i\)’s are the proportions of the two samples falling in each bin. Note that the PSI value depends on the binning scheme; Modeva provides two binning options, “equal width” and “equal quantile”, and the number of bins is fixed at 10.
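The following standalone sketch (not Modeva’s implementation) illustrates the PSI formula with equal-width binning; the function name and the eps guard against empty bins are assumptions made for this example:

import numpy as np

def psi_equal_width(p_sample, q_sample, n_bins=10, eps=1e-6):
    # Equal-width bin edges spanning the pooled range of both samples
    lo = min(p_sample.min(), q_sample.min())
    hi = max(p_sample.max(), q_sample.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    # Bin proportions; eps avoids log(0) and division by zero in empty bins
    p = np.histogram(p_sample, bins=edges)[0] / len(p_sample) + eps
    q = np.histogram(q_sample, bins=edges)[0] / len(q_sample) + eps
    # PSI = sum_i (p_i - q_i) * ln(p_i / q_i)
    return np.sum((p - q) * np.log(p / q))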
Wasserstein Distance 1D (WD1)
WD1 is the integral of the absolute difference between the cumulative distribution functions of the two samples:
\[ \mathrm{WD1} = \int_{-\infty}^{\infty} \lvert F(x) - G(x) \rvert \, dx, \]
where \(F(x)\) and \(G(x)\) are the cumulative distribution functions of the target and base populations, respectively.
Kolmogorov-Smirnov (KS) Distance
KS is the maximum absolute distance between the cumulative distribution functions of the two samples:
\[ \mathrm{KS} = \sup_x \lvert F(x) - G(x) \rvert. \]
Note that in Modeva, the WD1 and KS statistics are calculated by the wasserstein_distance and ks_2samp functions from scipy.stats.
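For illustration, the two scipy.stats functions can be applied directly to the raw feature values of the two samples; the synthetic data below is purely for demonstration:

import numpy as np
from scipy.stats import wasserstein_distance, ks_2samp

rng = np.random.default_rng(0)
base = rng.normal(0.0, 1.0, size=5000)       # base population
target = rng.normal(0.3, 1.2, size=1000)     # shifted target sample

wd1 = wasserstein_distance(base, target)     # integral of |F(x) - G(x)|
ks_stat, ks_pvalue = ks_2samp(base, target)  # sup_x |F(x) - G(x)| and its p-value
print(f"WD1 = {wd1:.3f}, KS = {ks_stat:.3f} (p = {ks_pvalue:.3g})")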