modeva.DataSet.data_drift_test
- DataSet.data_drift_test(dataset1: str = 'train', dataset2: str = 'test', sample_idx1: Tuple[int] | ndarray = None, sample_idx2: Tuple[int] | ndarray = None, name1: str = None, name2: str = None, distance_metric: str = 'PSI', psi_method: str = 'uniform', psi_bins: int = 10)
Evaluates the distributional differences between two data samples using a selected distance metric.
This function compares the two samples feature by feature, computing the chosen distance metric for each one so that data drift can be assessed. Only features in X and y are compared.
- Parameters:
dataset1 ({"main", "train", "test"}, default="train") – Identifier for the first dataset to analyze. Must be one of the predefined dataset types.
dataset2 ({"main", "train", "test"}, default="test") – Identifier for the second dataset to analyze. Must be one of the predefined dataset types.
sample_idx1 (Tuple or np.ndarray of int, default=None) – Indices of samples to select from the first dataset. If None, all samples in the specified dataset are used.
sample_idx2 (Tuple or np.ndarray of int, default=None) – Indices of samples to select from the second dataset. If None, all samples in the specified dataset are used.
name1 (str, default=None) – Custom label for the first dataset, used in visualization outputs. If None, the value of dataset1 is used.
name2 (str, default=None) – Custom label for the second dataset, used in visualization outputs. If None, the value of dataset2 is used.
distance_metric ({"PSI", "WD1", "KS"}, default="PSI") –
Method to calculate distribution difference:
"PSI": Population Stability Index
"WD1": Wasserstein Distance
"KS": Kolmogorov-Smirnov test
psi_method ({"uniform", "quantile"}, default="uniform") – Binning strategy for PSI calculation. Only applicable when distance_metric="PSI".
psi_bins (int, default=10) – Number of bins for PSI calculation. Only applicable when distance_metric="PSI".
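To make the three distance_metric options and the two psi_method binning strategies concrete, the following is a minimal standard-library sketch for a single numeric feature. It is not Modeva's implementation: the helper names (psi, ks_and_wd1), deriving bin edges from the first sample only, the nearest-rank quantile rule, and the eps floor on empty bins are all assumptions made for illustration.

```python
import math
from bisect import bisect_right


def _bin_edges(reference, bins, method):
    """Interior bin edges derived from the reference sample only (assumption)."""
    ref = sorted(reference)
    if method == "uniform":
        # Equal-width bins spanning the reference range.
        lo, hi = ref[0], ref[-1]
        return [lo + (hi - lo) * i / bins for i in range(1, bins)]
    if method == "quantile":
        # Nearest-rank quantiles; heavy ties can yield duplicate edges (empty bins).
        n = len(ref)
        return [ref[min(n - 1, (n * i) // bins)] for i in range(1, bins)]
    raise ValueError(f"unknown method: {method!r}")


def _proportions(values, edges, eps):
    """Share of values per bin, floored at eps so the logarithm stays defined."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(expected, actual, bins=10, method="uniform", eps=1e-6):
    """Population Stability Index: sum over bins of (q - p) * ln(q / p)."""
    edges = _bin_edges(expected, bins, method)
    p = _proportions(expected, edges, eps)
    q = _proportions(actual, edges, eps)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def ks_and_wd1(x, y):
    """Return (KS statistic, Wasserstein-1 distance) from the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    grid = sorted(set(xs) | set(ys))
    ks = wd1 = 0.0
    for i, v in enumerate(grid):
        # Gap between the two empirical CDFs at v.
        gap = abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        ks = max(ks, gap)  # KS: largest CDF gap
        if i + 1 < len(grid):
            wd1 += gap * (grid[i + 1] - v)  # WD1: area between the CDFs
    return ks, wd1
```

In this sketch, "uniform" gives equal-width bins, which can leave sparse bins for skewed features, while "quantile" gives roughly equal-count bins on the first sample. Comparing a sample with itself yields a PSI of zero under either method.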
- Returns:
A container object with the following components:
key: "data_drift"
data: Name of the dataset used
inputs: Dictionary of input parameters used in the analysis
value: Dictionary containing:
"Distance_Scores": Feature-wise distance metrics between the two datasets
table: DataFrame representation of the distance scores
options: Dictionary of visualization configurations. Run results.plot() to show all plots, or results.plot(name=xxx) to display a single plot. The following names are available:
"summary": Horizontal bar plot visualizing the distance metric for each feature.
("density", <feature_name>): Density plots comparing the distributions of the two sets of samples for the given feature.
- Return type:
Examples
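A minimal usage sketch, assuming `ds` is a DataSet that has already been loaded and split into train and test sets (that setup is not shown here). The wrapper name run_drift_check and its feature_name argument are hypothetical, and reading the result's table component as an attribute is an assumption based on the Returns description above.

```python
def run_drift_check(ds, feature_name):
    """Run a quantile-binned PSI drift test on a prepared DataSet-like object.

    `ds` is assumed to already hold data with a train/test split.
    `feature_name` is a hypothetical feature to inspect in detail.
    """
    results = ds.data_drift_test(
        dataset1="train",
        dataset2="test",
        distance_metric="PSI",
        psi_method="quantile",  # only used because distance_metric="PSI"
        psi_bins=10,
    )
    # Horizontal bar plot of the distance score for every feature.
    results.plot(name="summary")
    # Density comparison of the two samples for one feature.
    results.plot(name=("density", feature_name))
    # Assumed attribute access to the DataFrame of distance scores.
    return results.table
```

Switching distance_metric to "WD1" or "KS" follows the same pattern; psi_method and psi_bins then have no effect.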