modeva.DataSet.data_drift_test
- DataSet.data_drift_test(dataset1: str = 'train', dataset2: str = 'test', sample_idx1: Tuple[int] | ndarray = None, sample_idx2: Tuple[int] | ndarray = None, name1: str = None, name2: str = None, distance_metric: str = 'PSI', psi_method: str = 'uniform', psi_bins: int = 10)
- Evaluates the distributional difference between two data samples using a selected distance metric.
- The metric is computed feature by feature, which allows data drift between the two samples to be assessed. Only features in X and y are compared.
- Parameters:
- dataset1 ({"main", "train", "test"}, default="train") – Identifier for the first dataset to analyze. Must be one of the predefined dataset types. 
- dataset2 ({"main", "train", "test"}, default="test") – Identifier for the second dataset to analyze. Must be one of the predefined dataset types. 
- sample_idx1 (Tuple or np.ndarray of int, default=None) – Indices of samples to select from the first dataset. If None, all samples in the specified dataset are used. 
- sample_idx2 (Tuple or np.ndarray of int, default=None) – Indices of samples to select from the second dataset. If None, all samples in the specified dataset are used. 
- name1 (str, default=None) – Custom label for the first dataset, used in visualization outputs. If None, the value of dataset1 is used. 
- name2 (str, default=None) – Custom label for the second dataset, used in visualization outputs. If None, the value of dataset2 is used. 
- distance_metric ({"PSI", "WD1", "KS"}, default="PSI") – Method used to calculate the distribution difference:
  - "PSI": Population Stability Index
  - "WD1": Wasserstein Distance
  - "KS": Kolmogorov-Smirnov test
 
- psi_method ({"uniform", "quantile"}, default="uniform") – Binning strategy for PSI calculation. Only applicable when distance_metric="PSI". 
- psi_bins (int, default=10) – Number of bins for PSI calculation. Only applicable when distance_metric="PSI". 
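To make the roles of psi_method and psi_bins concrete, the following is a minimal standard-library sketch of a Population Stability Index calculation with uniform or quantile binning. It is not Modeva's implementation: the helper names (`bin_edges`, `bin_fractions`, `psi`), the choice to derive bin edges from the first sample, and the small `eps` floor applied to empty bins are all assumptions made for illustration.

```python
import math
import random
from bisect import bisect_right
from statistics import quantiles


def bin_edges(reference, n_bins=10, method="uniform"):
    """Return the n_bins - 1 interior cut points derived from the reference sample."""
    if method == "uniform":
        # Equal-width bins spanning the range of the reference sample.
        lo, hi = min(reference), max(reference)
        width = (hi - lo) / n_bins
        return [lo + width * i for i in range(1, n_bins)]
    if method == "quantile":
        # Cut points that put roughly equal numbers of reference points in each bin.
        return quantiles(reference, n=n_bins, method="inclusive")
    raise ValueError(f"unknown binning method: {method!r}")


def bin_fractions(sample, edges, eps=1e-6):
    """Fraction of the sample in each bin, floored at eps (assumption) to avoid log(0)."""
    counts = [0] * (len(edges) + 1)
    for x in sample:
        # Values outside the reference range fall into the first or last bin.
        counts[bisect_right(edges, x)] += 1
    total = len(sample)
    return [max(c / total, eps) for c in counts]


def psi(reference, comparison, n_bins=10, method="uniform"):
    """PSI = sum over bins of (q - p) * ln(q / p)."""
    edges = bin_edges(reference, n_bins, method)
    p = bin_fractions(reference, edges)
    q = bin_fractions(comparison, edges)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Synthetic illustration: a standard normal sample and a copy shifted by one unit.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [x + 1.0 for x in reference]

print("uniform bins :", psi(reference, shifted, method="uniform"))
print("quantile bins:", psi(reference, shifted, method="quantile"))
```

Each term of the sum is non-negative, so the score is zero only when both samples have identical bin fractions. Quantile binning tends to be less sensitive to outliers than uniform binning because extreme values do not stretch the bin widths.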
 
- Returns:
- A container object with the following components:
  - key: "data_drift"
  - data: Name of the dataset used
  - inputs: Dictionary of input parameters used in the analysis
  - value: Dictionary containing:
    - "Distance_Scores": Feature-wise distance metrics between the two datasets
  - table: DataFrame representation of the distance scores
  - options: Dictionary of visualization configurations. Run results.plot() to show all plots, or results.plot(name=xxx) to display a single plot. The following names are available:
    - "summary": Horizontal bar plot of the distance metric for each feature.
    - ("density", <feature_name>): Density plot comparing the distribution of one feature across the two sets of samples.
 
 
- Return type:
 - Examples 
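A minimal usage sketch, based only on the signature and return components documented above. The wrapper `run_drift_test` is a hypothetical helper, not part of Modeva, and it assumes `ds` is a DataSet that has already been loaded and split into train and test sets; how that is done is outside the scope of this page.

```python
def run_drift_test(ds, distance_metric="PSI", psi_method="quantile", psi_bins=10):
    """Compare the train and test splits of an already-prepared DataSet-like object."""
    kwargs = {
        "dataset1": "train",
        "dataset2": "test",
        "name1": "Train",
        "name2": "Test",
        "distance_metric": distance_metric,
    }
    # The PSI binning options only apply when distance_metric="PSI".
    if distance_metric == "PSI":
        kwargs["psi_method"] = psi_method
        kwargs["psi_bins"] = psi_bins
    return ds.data_drift_test(**kwargs)


# Hypothetical usage, assuming `ds` is a DataSet that already has train/test splits:
# results = run_drift_test(ds)
# print(results.table)                  # per-feature distance scores as a DataFrame
# results.plot(name="summary")          # bar plot of the metric for each feature
# results.plot(name=("density", "x1"))  # "x1" is a placeholder feature name
```

Omitting the PSI options for "WD1" and "KS" is a stylistic choice in this sketch; the documentation above only states that they are not applicable for those metrics.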
 
    