DataSet

Data Loading and Management

DataSet.load

Load built-in data.

DataSet.load_csv

Load data from a CSV file.

DataSet.load_spark

Load data from a Spark file.

DataSet.load_dataframe

Load data from a DataFrame or a NumPy ndarray.

DataSet.load_dataframe_train_test

Load data from DataFrames or NumPy ndarrays, with the training and test data supplied separately.

DataSet.to_df

Return the source data as a pandas DataFrame.

DataSet.register

Register the dataset in MLflow.

DataSet.list_registered_data

Return the list of datasets registered in MLflow.

DataSet.load_registered_data

Load a registered dataset from MLflow.

DataSet.delete_registered_data

Delete a registered dataset from MLflow.

Data Exploration

DataSet.summary

Generates comprehensive descriptive statistics and analysis of the dataset.

DataSet.eda_1d

Creates a univariate visualization for analyzing the distribution of a single feature.

DataSet.eda_2d

Creates a bivariate visualization between two features with optional color encoding.

DataSet.eda_3d

Creates an interactive 3D scatter plot visualization for exploring relationships between three features.

DataSet.eda_correlation

Calculate and visualize correlation matrices between features in the dataset.

DataSet.eda_pca

Performs Principal Component Analysis (PCA) on specified features with preprocessing for both numerical and categorical variables.

DataSet.eda_umap

Performs UMAP dimensionality reduction on the specified features of the dataset.

Preprocessing

DataSet.impute_missing

Performs data imputation for missing values with optional indicator columns.

DataSet.scale_numerical

Scales numerical features using various scaling methods.

DataSet.bin_numerical

Performs binning transformation on numerical features by discretizing continuous values into discrete bins.

DataSet.encode_categorical

Encodes categorical features using ordinal, one-hot, or target encoding.

DataSet.preprocess

Preprocess the internal raw data and extra data based on existing settings.

DataSet.transform

Convert raw data to preprocessed data.

DataSet.inverse_transform

Convert preprocessed data to raw data.

DataSet.reset_preprocess

Remove all previously configured preprocessing steps.

DataSet.get_preprocessor

Get the preprocessor.
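The relationship between `DataSet.transform` and `DataSet.inverse_transform` is a round trip: preprocessed values should map back to the raw values. The sketch below illustrates that idea for a single numerical column with min-max scaling. The `MinMaxScaler1D` class is a hypothetical helper written for this example; it is not part of the `DataSet` API and does not reflect how the library implements scaling.

```python
from typing import List


class MinMaxScaler1D:
    """Hypothetical helper for illustration only (not part of DataSet)."""

    def fit(self, values: List[float]) -> "MinMaxScaler1D":
        self.lo = min(values)
        self.hi = max(values)
        # Guard against a constant column, where hi == lo.
        self.span = (self.hi - self.lo) or 1.0
        return self

    def transform(self, values: List[float]) -> List[float]:
        # Raw -> preprocessed: map the fitted range onto [0, 1].
        return [(v - self.lo) / self.span for v in values]

    def inverse_transform(self, scaled: List[float]) -> List[float]:
        # Preprocessed -> raw: undo the scaling with the fitted statistics.
        return [s * self.span + self.lo for s in scaled]


raw = [10.0, 15.0, 20.0, 30.0]
scaler = MinMaxScaler1D().fit(raw)
scaled = scaler.transform(raw)
restored = scaler.inverse_transform(scaled)
print(scaled)
print(restored)
```

The fitted statistics (`lo`, `span`) are what make the inverse possible, which is why preprocessing steps need to be stored alongside the data rather than applied once and discarded.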

Feature Selection and Management

DataSet.feature_select_corr

Selects features based on their correlation strength with the target variable.

DataSet.feature_select_xgbpfi

Selects important features using an XGBoost model and permutation feature importance analysis.

DataSet.feature_select_rcit

Performs feature selection using the Randomized Conditional Independence Test (RCIT) and Forward-Backward selection with Early Dropping (FBEDk) to identify important features based on conditional independence testing.

DataSet.set_feature_type

Update the type of the selected feature.

DataSet.set_active_features

Set active features.

DataSet.set_inactive_features

Set inactive features.

DataSet.set_target

Set the target feature.

DataSet.set_task_type

Set the task type.

DataSet.set_sample_weight

Set the column that will be used as sample weight.

DataSet.set_prediction

Set the column that will be used as prediction.

DataSet.set_prediction_proba

Set the column that will be used as prediction probability.
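As a rough illustration of the idea behind `DataSet.feature_select_corr`, the sketch below keeps features whose correlation with the target is strong enough. It assumes Pearson correlation, absolute values, and a fixed threshold; the function names `pearson` and `select_by_correlation` and the `threshold` parameter are hypothetical, and the library's actual method and parameters may differ.

```python
import math
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)


def select_by_correlation(
    features: Dict[str, List[float]], target: List[float], threshold: float
) -> List[str]:
    """Keep features whose absolute correlation with the target >= threshold."""
    return [
        name
        for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    ]


features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],  # perfectly correlated with the target
    "x2": [5.0, 3.0, 4.0, 1.0, 2.0],  # strongly negatively correlated
    "x3": [2.0, 1.0, 2.0, 1.0, 2.0],  # essentially uncorrelated
}
target = [2.0, 4.0, 6.0, 8.0, 10.0]
selected = select_by_correlation(features, target, threshold=0.5)
print(selected)
```

Using the absolute value means that strongly negative relationships are kept as well. A correlation filter of this kind only captures linear, one-feature-at-a-time signal, which is one reason model-based and conditional-independence-based selectors are listed alongside it.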

Outlier Detection

DataSet.detect_outlier_pca

Performs outlier detection using PCA-based methods.

DataSet.detect_outlier_isolation_forest

Performs outlier detection using the Isolation Forest algorithm.

DataSet.detect_outlier_cblof

Performs Cluster-Based Local Outlier Factor (CBLOF) detection on the dataset.

Data Access and Properties

DataSet.name

Return the name of this dataset.

DataSet.raw_data

Return the raw data as a pd.DataFrame.

DataSet.data

Return the preprocessed data as a pd.DataFrame.

DataSet.shape

Return the shape of this data.

DataSet.all_feature_names

Get the list of all column names of the data.

DataSet.feature_names

Get the list of selected feature names (only X).

DataSet.feature_names_numerical

Get the list of selected numerical feature names (only X).

DataSet.feature_names_categorical

Get the list of selected categorical feature names (only X).

DataSet.feature_names_mixed

Get the list of selected mixed type feature names (only X).

DataSet.target_feature_name

Get the selected target feature name (only y).

DataSet.sample_weight_name

Get the selected sample weight column name.

DataSet.prediction_name

Get the prediction column name.

DataSet.prediction_proba_name

Get the prediction probability column name.

DataSet.protected_feature_names

Get the list of all column names of the protected data.

DataSet.all_feature_types

Get the list of all column types of the data; values include "categorical" and "numerical".

DataSet.feature_types

Get the list of selected feature types (only X).

DataSet.n_features

Get the number of selected features (only X).

DataSet.task_type

Get the task type, either "Regression" or "Classification".

DataSet.x

Get the preprocessed version of data of selected features (only X).

DataSet.y

Get the preprocessed version of data of selected target feature (only y).

DataSet.sample_weight

Get the preprocessed version of data of selected sample weight.

Train-Test Split Management

DataSet.set_random_split

Set random train-test split on active samples.

DataSet.set_train_idx

Set training indices.

DataSet.set_test_idx

Set testing indices.

DataSet.train_x

Get the training x (active and preprocessed).

DataSet.train_y

Get the training y.

DataSet.train_sample_weight

Get the training sample_weight (active and preprocessed).

DataSet.test_x

Get the testing x.

DataSet.test_y

Get the testing y (active and preprocessed).

DataSet.test_sample_weight

Get the testing sample_weight (active and preprocessed).

DataSet.is_splitted

Return True if data is already split, otherwise False.
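The split entries above revolve around two index sets, one for training and one for testing. The sketch below shows one way a seeded random split over sample indices could be produced, in the spirit of `DataSet.set_random_split`. The function `random_split_indices` and its `test_ratio` and `seed` parameters are hypothetical names chosen for this example, not the library's signature.

```python
import random
from typing import List, Tuple


def random_split_indices(
    n_samples: int, test_ratio: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices with a fixed seed and cut them into train/test."""
    indices = list(range(n_samples))
    rng = random.Random(seed)  # local generator, so global state is untouched
    rng.shuffle(indices)
    n_test = int(round(n_samples * test_ratio))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


train_idx, test_idx = random_split_indices(10, test_ratio=0.3, seed=42)
print(train_idx)
print(test_idx)
```

Index lists like these are the kind of input that `DataSet.set_train_idx` and `DataSet.set_test_idx` are described as accepting. Fixing the seed keeps the split reproducible across runs.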

Extra and Protected Data Management

DataSet.set_protected_data

Set protected features, e.g., demographic features.

DataSet.set_protected_extra_data

Set protected features (e.g., demographic features) in extra data.

DataSet.set_raw_extra_data

Add a new group of data to the DataSet.

DataSet.delete_extra_data

Delete a group of data.

DataSet.get_extra_data_list

Get the extra data split names.

DataSet.get_data_list

Get the available data split names.

DataSet.get_raw_data

Get the raw data as an np.ndarray.

DataSet.get_data

Get the preprocessed data as an np.ndarray (all variables, including X, y, sample_weight, etc.).

DataSet.get_X_y_data

Get the preprocessed data in the form of X, y, sample_weight.

DataSet.get_active_sample_idx

Get the indices of the active samples.

DataSet.get_protected_data

Get the protected data as an np.ndarray (raw, since protected data has no preprocessed version).

Data Drift and Sampling

DataSet.subsample_random

Subsample data randomly.

DataSet.data_drift_test

Evaluates the distributional differences between two data samples using various distance metrics.

DataSet.save_preprocessing

Save the preprocessing steps to a file.

DataSet.load_preprocessing

Load preprocessing steps from a saved file.
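To make the description of `DataSet.data_drift_test` more concrete, the sketch below computes one common distance between two samples of a single feature: the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the two empirical CDFs. Treating this as one of the "various distance metrics" is an assumption made for illustration; the function `ks_statistic` is hypothetical and is not the library's implementation.

```python
from bisect import bisect_right
from typing import List


def ks_statistic(sample_a: List[float], sample_b: List[float]) -> float:
    """Largest absolute gap between the empirical CDFs of two samples."""
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for value in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, value) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, value) / len(b_sorted)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


reference = [1.0, 2.0, 3.0, 4.0]
shifted = [3.0, 4.0, 5.0, 6.0]
no_drift = ks_statistic(reference, reference)
some_drift = ks_statistic(reference, shifted)
print(no_drift, some_drift)
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means the samples do not overlap at all. In a drift check, a statistic like this would typically be computed per feature, for example between the training and test splits.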