modeva.DataSet.data_drift_test

DataSet.data_drift_test(dataset1: str = 'train', dataset2: str = 'test', sample_idx1: Tuple[int] | ndarray = None, sample_idx2: Tuple[int] | ndarray = None, name1: str = None, name2: str = None, distance_metric: str = 'PSI', psi_method: str = 'uniform', psi_bins: int = 10)

Evaluates the distributional differences between two data samples using a selected distance metric.

This method compares two data samples by computing the chosen distance metric for each feature, which makes it possible to assess data drift. Only features in X and y are compared.
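The feature-wise comparison described above can be illustrated with a minimal, pure-Python sketch. It is not Modeva's implementation: the helper names (`ks_statistic`, `wasserstein_1`, `feature_wise_distance`) and the toy data are hypothetical. It assumes that "KS" refers to the two-sample Kolmogorov-Smirnov statistic and "WD1" to the one-dimensional Wasserstein distance of order 1. Whether Modeva scales features or reports a p-value is not stated here.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    # The gap can only reach its maximum at an observed value, so check each one.
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )


def wasserstein_1(a, b):
    """1-D Wasserstein-1 distance: area between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a + b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Both CDFs are constant on [left, right), so the area is gap * width.
        gap = abs(bisect_right(a, left) / len(a) - bisect_right(b, left) / len(b))
        total += gap * (right - left)
    return total


def feature_wise_distance(data1, data2, metric):
    """Apply `metric` to every feature present in both samples."""
    shared = [name for name in data1 if name in data2]
    return {name: metric(data1[name], data2[name]) for name in shared}


# Hypothetical toy data: "temp" is shifted by 2 between samples, "hour" is unchanged.
train = {"temp": [1.0, 2.0, 3.0, 4.0], "hour": [0.0, 6.0, 12.0, 18.0]}
test = {"temp": [3.0, 4.0, 5.0, 6.0], "hour": [0.0, 6.0, 12.0, 18.0]}

ks_scores = feature_wise_distance(train, test, ks_statistic)
wd_scores = feature_wise_distance(train, test, wasserstein_1)
print(ks_scores)  # {'temp': 0.5, 'hour': 0.0}
print(wd_scores)  # {'temp': 2.0, 'hour': 0.0}
```

In this sketch each feature gets one score, so a drifted feature stands out against the unchanged ones.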

Parameters:
  • dataset1 ({"main", "train", "test"}, default="train") – Identifier for the first dataset to analyze. Must be one of the predefined dataset types.

  • dataset2 ({"main", "train", "test"}, default="test") – Identifier for the second dataset to analyze. Must be one of the predefined dataset types.

  • sample_idx1 (Tuple or np.ndarray of int, default=None) – Indices of samples to select from the first dataset. If None, all samples in the specified dataset are used.

  • sample_idx2 (Tuple or np.ndarray of int, default=None) – Indices of samples to select from the second dataset. If None, all samples in the specified dataset are used.

  • name1 (str, default=None) – Custom label for the first dataset used in visualization outputs. If None, defaults to the value of dataset1.

  • name2 (str, default=None) – Custom label for the second dataset used in visualization outputs. If None, defaults to the value of dataset2.

  • distance_metric ({"PSI", "WD1", "KS"}, default="PSI") –

    Method to calculate distribution difference:

    • "PSI": Population Stability Index

    • "WD1": Wasserstein Distance

    • "KS": Kolmogorov-Smirnov test

  • psi_method ({"uniform", "quantile"}, default="uniform") – Binning strategy for PSI calculation. Only applicable when distance_metric="PSI".

  • psi_bins (int, default=10) – Number of bins for PSI calculation. Only applicable when distance_metric="PSI".
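To make the roles of psi_method and psi_bins concrete, here is a minimal sketch of a PSI calculation under both binning strategies. It is an illustration under stated assumptions, not Modeva's implementation: the function names are hypothetical, the bin edges are derived from the first sample, and empty bins are floored at a small `eps`. Modeva's own choices on these points are not documented here.

```python
import math
from bisect import bisect_right


def uniform_edges(values, n_bins):
    """Interior edges of equal-width bins spanning the range of `values`."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def quantile_edges(values, n_bins):
    """Interior edges at evenly spaced quantiles of `values`."""
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // n_bins] for i in range(1, n_bins)]


def bin_fractions(values, edges, eps=1e-6):
    """Share of `values` in each bin; out-of-range values fall in the end bins."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    # Assumption: floor empty bins at eps so the logarithm stays finite.
    return [max(c / len(values), eps) for c in counts]


def psi(expected, actual, method="uniform", bins=10, eps=1e-6):
    """Population Stability Index: sum of (p - q) * ln(p / q) over bins."""
    edge_makers = {"uniform": uniform_edges, "quantile": quantile_edges}
    if method not in edge_makers:
        raise ValueError(f"unknown method: {method!r}")
    # Assumption: bins are defined on the first sample and reused for the second.
    edges = edge_makers[method](expected, bins)
    p = bin_fractions(expected, edges, eps)
    q = bin_fractions(actual, edges, eps)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical skewed feature and a shifted copy of it.
base = [float(i * i) for i in range(100)]
shifted = [v + 2000.0 for v in base]

psi_uniform = psi(base, shifted, method="uniform", bins=10)
psi_quantile = psi(base, shifted, method="quantile", bins=10)
```

On skewed data, equal-width bins concentrate most observations in a few bins, whereas quantile bins hold roughly equal shares of the first sample, so the two strategies can yield noticeably different PSI values for the same pair of samples.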

Returns:

A container object with the following components:

  • key: "data_drift"

  • data: Name of the dataset used

  • inputs: Dictionary of input parameters used in the analysis

  • value: Dictionary containing:

    • "Distance_Scores": Feature-wise distance metrics between the two datasets

  • table: DataFrame representation of the distance scores

  • options: Dictionary of visualization configurations. Run results.plot() to show all plots, or results.plot(name=xxx) to display a single plot. The following names are available:

    • "summary": Horizontal bar plot visualizing the distance metric for each feature.

    • ("density", <feature_name>): Density plot comparing the distribution of the given feature across the two sets of samples.

Return type:

ValidationResult

Examples

Data Drift Test
