DataSet
Data Loading and Management
Load built-in data.
Load data from a CSV file.
Load data from a Spark file.
Load data from a DataFrame or NumPy ndarray.
Load data from a DataFrame or NumPy ndarray with given train and test data.
Return the source data as a pandas DataFrame.
Register the dataset in MLflow.
Return the list of datasets registered in MLflow.
Load registered data.
Delete a registered dataset from MLflow.
Data Exploration
Generates comprehensive descriptive statistics and analysis of the dataset.
Creates a univariate visualization for analyzing the distribution of a single feature.
Creates a bivariate visualization between two features with optional color encoding.
Creates an interactive 3D scatter plot visualization for exploring relationships between three features.
Calculates and visualizes correlation matrices between features in the dataset.
Performs Principal Component Analysis (PCA) on specified features with preprocessing for both numerical and categorical variables.
Performs UMAP dimensionality reduction on the specified features of the dataset.
Preprocessing
Performs data imputation for missing values with optional indicator columns.
Scales numerical features using various scaling methods.
Performs binning transformation on numerical features by discretizing continuous values into discrete bins.
Encodes categorical features using ordinal, one-hot, or target encoding.
Preprocess the internal raw data and extra data based on existing settings.
Convert raw data to preprocessed data.
Convert preprocessed data to raw data.
Remove all previous preprocessing steps.
Get the preprocessor.
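
The raw-to-preprocessed and preprocessed-to-raw conversions listed above can be pictured with a minimal pure-Python sketch. This is not the DataSet API: `MinMaxScaler1D` is a hypothetical helper, and min-max scaling is assumed here only as one example of a scaling method.

```python
from dataclasses import dataclass


@dataclass
class MinMaxScaler1D:
    """Hypothetical one-column scaler; not part of the DataSet API."""

    low: float = 0.0
    high: float = 1.0

    def fit(self, values):
        # Remember the observed range so the mapping can be inverted later.
        self.low = min(values)
        self.high = max(values)
        return self

    def transform(self, values):
        # Map raw values onto [0, 1]; a constant column maps to 0.0.
        span = self.high - self.low
        if span == 0:
            return [0.0 for _ in values]
        return [(v - self.low) / span for v in values]

    def inverse_transform(self, scaled):
        # Map scaled values back onto the raw scale.
        span = self.high - self.low
        return [self.low + s * span for s in scaled]


raw = [10.0, 15.0, 20.0, 30.0]
scaler = MinMaxScaler1D().fit(raw)
scaled = scaler.transform(raw)                # raw -> preprocessed
restored = scaler.inverse_transform(scaled)   # preprocessed -> raw
```

Keeping the fitted range on the object is what makes the inverse conversion possible, which is presumably why a fitted preprocessor can be retrieved and saved.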
Feature Selection and Management
Selects features based on their correlation strength with the target variable.
Selects important features using an XGBoost model and permutation importance analysis.
Performs feature selection using RCIT and FBEDk to identify important features based on conditional independence testing.
Update the type of the selected feature.
Set active features.
Set inactive features.
Set the target feature.
Set the task type.
Set the column that will be used as sample weight.
Set the column that will be used as prediction.
Set the column that will be used as prediction probability.
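
As an illustration of the correlation-based selection listed in this section, the sketch below keeps the features whose absolute Pearson correlation with the target reaches a threshold. The function names, the threshold value, and the choice of Pearson correlation are assumptions made for this example; the library's own method may use different criteria.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_by_correlation(features, target, threshold=0.5):
    """Return names of features with |correlation to target| >= threshold."""
    return [
        name
        for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    ]


# Hypothetical toy data: two informative columns and one noisy column.
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],  # increases with the target
    "x2": [5.0, 4.0, 3.0, 2.0, 1.0],  # decreases with the target
    "x3": [2.0, 1.0, 2.0, 1.0, 2.0],  # nearly unrelated to the target
}
target = [1.1, 1.9, 3.2, 3.9, 5.1]
selected = select_by_correlation(features, target, threshold=0.5)
```

Using the absolute value means strongly negative relationships are kept as well as strongly positive ones.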
Outlier Detection
Performs outlier detection using PCA-based methods.
Performs outlier detection using the Isolation Forest algorithm.
Performs Cluster-Based Local Outlier Factor (CBLOF) detection on the dataset.
Data Access and Properties
Return the name of this dataset.
Return the raw data as a pd.DataFrame.
Return the preprocessed version of the data as a pd.DataFrame.
Return the shape of this data.
Get the list of all column names of the data.
Get the list of selected feature names (only X).
Get the list of selected numerical feature names (only X).
Get the list of selected categorical feature names (only X).
Get the list of selected mixed-type feature names (only X).
Get the selected target feature name (only y).
Get the selected sample weight column name.
Get the prediction column name.
Get the prediction probability column name.
Get the list of all column names of the raw data.
Get the list of all column types of the data, including "categorical" and "numerical".
Get the list of selected feature types (only X).
Get the number of selected features (only X).
Get the task type, including "Regression" and "Classification".
Get the preprocessed data for the selected features (only X).
Get the preprocessed data for the selected target feature (only y).
Get the preprocessed data for the selected sample weight.
Train-Test Split Management
Set random train-test split on active samples.
Set training indices.
Set testing indices.
Get the training x (active and preprocessed).
Get the training y.
Get the training sample_weight (active and preprocessed).
Get the testing x.
Get the testing y (active and preprocessed).
Get the testing sample_weight (active and preprocessed).
Return True if data is already split, otherwise False.
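
A random train-test split over sample indices, as described in this section, might look like the following sketch. `random_split`, its parameters, and the default ratio are hypothetical and chosen for illustration; they are not taken from the DataSet API.

```python
import random


def random_split(n_samples, test_ratio=0.2, seed=0):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    # A local RNG keeps the split reproducible without touching global state.
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_ratio))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


train_idx, test_idx = random_split(10, test_ratio=0.2, seed=42)
```

Working with indices rather than copies of the rows is what allows training and testing indices to be set directly as an alternative to a random split.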
Extra and Protected Data Management
Set protected features, e.g., demographic features.
Set protected features in extra data, e.g., demographic features.
Add a new group of data to the DataSet.
Delete a group of data.
Get the extra data split names.
Get the available data split names.
Get the raw data in the format of np.ndarray.
Get the preprocessed data in the format of np.ndarray (all variables including X, y, sample_weight, etc.).
Get the preprocessed data in the form of X, y, sample_weight.
Get the sample indices (active).
Get the protected data in the format of np.ndarray (raw, as there is no preprocessed version of protected data).
Data Drift and Sampling
Subsample data randomly.
Evaluates the distributional differences between two data samples using various distance metrics.
Save the preprocessing steps to file.
Load preprocessing steps from a saved file.
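
To make the distributional comparison in this section concrete, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, one common distance between samples. Which metrics the library actually supports is not stated here, so the choice of metric and the name `ks_statistic` are assumptions for illustration.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    i = j = 0
    max_gap = 0.0
    # Visit each distinct value in order, moving both pointers past ties.
    while i < len(a) and j < len(b):
        value = min(a[i], b[j])
        while i < len(a) and a[i] == value:
            i += 1
        while j < len(b) and b[j] == value:
            j += 1
        # i / len(a) and j / len(b) are the empirical CDFs at `value`.
        gap = abs(i / len(a) - j / len(b))
        max_gap = max(max_gap, gap)
    return max_gap


reference = [1.0, 2.0, 3.0, 4.0]
shifted = [3.0, 4.0, 5.0, 6.0]
drift = ks_statistic(reference, shifted)
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means the samples do not overlap at all.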