DataSet#

Data Loading and Management#

`DataSet.load`	Load built-in data.
`DataSet.load_csv`	Load data from a CSV file.
`DataSet.load_spark`	Load data from a Spark file.
`DataSet.load_dataframe`	Load data from a DataFrame or NumPy ndarray.
`DataSet.load_dataframe_train_test`	Load data from DataFrames or NumPy ndarrays with given train and test data.
`DataSet.to_df`	Return the source data as a pandas DataFrame.
`DataSet.register`	Register the dataset in MLflow.
`DataSet.list_registered_data`	Return the list of registered datasets in MLflow.
`DataSet.load_registered_data`	Load registered data.
`DataSet.delete_registered_data`	Delete a registered dataset from MLflow.

Data Exploration#

`DataSet.summary`	Generates comprehensive descriptive statistics and analysis of the dataset.
`DataSet.eda_1d`	Creates a univariate visualization for analyzing the distribution of a single feature.
`DataSet.eda_2d`	Creates a bivariate visualization between two features with optional color encoding.
`DataSet.eda_3d`	Creates an interactive 3D scatter plot visualization for exploring relationships between three features.
`DataSet.eda_correlation`	Calculate and visualize correlation matrices between features in the dataset.
`DataSet.eda_pca`	Performs Principal Component Analysis (PCA) on specified features with preprocessing for both numerical and categorical variables.
`DataSet.eda_umap`	Performs UMAP dimensionality reduction on the specified features of the dataset.

Preprocessing#

`DataSet.impute_missing`	Performs data imputation for missing values with optional indicator columns.
`DataSet.scale_numerical`	Scales numerical features using various scaling methods.
`DataSet.bin_numerical`	Performs binning transformation on numerical features by discretizing continuous values into discrete bins.
`DataSet.encode_categorical`	Encodes categorical features using either ordinal, one-hot, or target encoding methods.
`DataSet.preprocess`	Preprocess the internal raw data and extra data based on existing settings.
`DataSet.transform`	Convert raw data to preprocessed data.
`DataSet.inverse_transform`	Convert preprocessed data to raw data.
`DataSet.reset_preprocess`	Remove all previous preprocessing steps.
`DataSet.get_preprocessor`	Get the preprocessor.
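
The relationship between `DataSet.transform` and `DataSet.inverse_transform` can be pictured as a fit-once, map-forward, map-back round trip. The sketch below does not call the library: `MinMaxScaler1D` is a hypothetical stand-in for a single fitted preprocessing step, and min-max scaling is only an assumed example of a scaling method.

```python
class MinMaxScaler1D:
    """Hypothetical stand-in for one fitted preprocessing step (not the library's class)."""

    def fit(self, values):
        self.lo = min(values)
        self.hi = max(values)
        # Fall back to a span of 1.0 for constant columns to avoid division by zero.
        self.span = (self.hi - self.lo) or 1.0
        return self

    def transform(self, values):
        # Raw scale -> [0, 1].
        return [(v - self.lo) / self.span for v in values]

    def inverse_transform(self, scaled):
        # [0, 1] -> raw scale.
        return [s * self.span + self.lo for s in scaled]


raw = [10.0, 15.0, 20.0, 30.0]
scaler = MinMaxScaler1D().fit(raw)
scaled = scaler.transform(raw)
restored = scaler.inverse_transform(scaled)
```

Because the fitted parameters (`lo`, `span`) are stored on the object, the same mapping can be applied to new data and undone later, which is the property the transform/inverse-transform pair relies on.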

Feature Selection and Management#

`DataSet.feature_select_corr`	Selects features based on their correlation strength with the target variable.
`DataSet.feature_select_xgbpfi`	Selects important features using XGBoost model and permutation importance analysis.
`DataSet.feature_select_rcit`	Performs feature selection using RCIT and FBEDk to identify important features based on conditional independence testing.
`DataSet.set_feature_type`	Update the type of a selected feature.
`DataSet.set_active_features`	Set active features.
`DataSet.set_inactive_features`	Set inactive features.
`DataSet.set_target`	Set the target feature.
`DataSet.set_task_type`	Set the task type.
`DataSet.set_sample_weight`	Set the column that will be used as sample weight.
`DataSet.set_prediction`	Set the column that will be used as prediction.
`DataSet.set_prediction_proba`	Set the column that will be used as prediction probability.
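
As a rough illustration of correlation-based selection in the spirit of `DataSet.feature_select_corr`, the sketch below ranks features by absolute Pearson correlation with the target and keeps those above a cutoff. It is a standalone example, not the library's implementation: the helper names, the choice of Pearson correlation, and the `threshold` parameter are all assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def select_by_correlation(columns, target, threshold):
    """Keep features whose absolute correlation with the target is >= threshold."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return [name for name, score in scores.items() if score >= threshold]


# Toy data: two informative features (one negatively related) and one noise column.
columns = {
    "temp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [1.0, -1.0, 0.0, -1.0, 1.0],
    "neg": [5.0, 4.0, 3.0, 2.0, 1.0],
}
target = [2.1, 3.9, 6.2, 8.0, 9.8]
selected = select_by_correlation(columns, target, threshold=0.5)
```

Using the absolute value means strongly negative relationships count as much as strongly positive ones, which matches the idea of selecting on correlation strength rather than direction.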

Outlier Detection#

`DataSet.detect_outlier_pca`	Performs outlier detection using PCA-based methods.
`DataSet.detect_outlier_isolation_forest`	Performs outlier detection using the Isolation Forest algorithm.
`DataSet.detect_outlier_cblof`	Performs Cluster-Based Local Outlier Factor (CBLOF) detection on the dataset.

Data Access and Properties#

`DataSet.name`	Return the name of this dataset.
`DataSet.raw_data`	Return the raw data as pd.DataFrame.
`DataSet.data`	Return the preprocessed version of data as pd.DataFrame.
`DataSet.shape`	Return the shape of this data.
`DataSet.all_feature_names`	Get the list of all column names of the data.
`DataSet.feature_names`	Get the list of selected feature names (only X).
`DataSet.feature_names_numerical`	Get the list of selected numerical feature names (only X).
`DataSet.feature_names_categorical`	Get the list of selected categorical feature names (only X).
`DataSet.feature_names_mixed`	Get the list of selected mixed type feature names (only X).
`DataSet.target_feature_name`	Get the selected target feature name (only y).
`DataSet.sample_weight_name`	Get the selected sample weight column name.
`DataSet.prediction_name`	Get the prediction column name.
`DataSet.prediction_proba_name`	Get the prediction probability column name.
`DataSet.protected_feature_names`	Get the list of protected feature names.
`DataSet.all_feature_types`	Get the list of all column types of the data, including "categorical" and "numerical".
`DataSet.feature_types`	Get the list of selected feature types (only X).
`DataSet.n_features`	Get the number of selected features (only X).
`DataSet.task_type`	Get the task type, either "Regression" or "Classification".
`DataSet.x`	Get the preprocessed version of data of selected features (only X).
`DataSet.y`	Get the preprocessed version of data of selected target feature (only y).
`DataSet.sample_weight`	Get the preprocessed version of data of selected sample weight.

Train-Test Split Management#

`DataSet.set_random_split`	Set random train-test split on active samples.
`DataSet.set_train_idx`	Set training indices.
`DataSet.set_test_idx`	Set testing indices.
`DataSet.train_x`	Get the training x (active and preprocessed).
`DataSet.train_y`	Get the training y.
`DataSet.train_sample_weight`	Get the training sample_weight (active and preprocessed).
`DataSet.test_x`	Get the testing x.
`DataSet.test_y`	Get the testing y (active and preprocessed).
`DataSet.test_sample_weight`	Get the testing sample_weight (active and preprocessed).
`DataSet.is_splitted`	Return True if data is already split, otherwise False.
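
A random train-test split over active sample indices, as described for `DataSet.set_random_split`, can be sketched in a few lines. This is a standalone illustration rather than the library's code; the function name `random_split` and the `test_ratio` and `seed` parameters are hypothetical.

```python
import random


def random_split(sample_idx, test_ratio=0.2, seed=0):
    """Shuffle indices with a fixed seed, then cut off the test portion."""
    idx = list(sample_idx)
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(idx) * test_ratio))
    test_idx = sorted(idx[:n_test])
    train_idx = sorted(idx[n_test:])
    return train_idx, test_idx


active_idx = range(10)
train_idx, test_idx = random_split(active_idx, test_ratio=0.2, seed=0)
```

Index lists of this kind are the sort of input that `DataSet.set_train_idx` and `DataSet.set_test_idx` are described as accepting; fixing the seed keeps the split reproducible across runs.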

Extra and Protected Data Management#

`DataSet.set_protected_data`	Set protected features, e.g., demographic features.
`DataSet.set_protected_extra_data`	Set protected features, e.g., demographic features in extra data.
`DataSet.set_raw_extra_data`	Add a new group of data to the DataSet.
`DataSet.delete_extra_data`	Delete a group of data.
`DataSet.get_extra_data_list`	Get the extra data split names.
`DataSet.get_data_list`	Get the available data split names.
`DataSet.get_raw_data`	Get the raw data in the format of np.ndarray.
`DataSet.get_data`	Get the preprocessed data in the format of np.ndarray (all variables including X, y, sample_weight, etc.).
`DataSet.get_X_y_data`	Get the preprocessed data in the form of X, y, sample_weight.
`DataSet.get_active_sample_idx`	Get the sample indices (active).
`DataSet.get_protected_data`	Get the protected data in the format of np.ndarray (raw, as there is no preprocessed version of protected data).

Data Drift and Sampling#

`DataSet.subsample_random`	Subsample data randomly.
`DataSet.data_drift_test`	Evaluates the distributional differences between two data samples using various distance metrics.
`DataSet.save_preprocessing`	Save the preprocessing steps to file.
`DataSet.load_preprocessing`	Load preprocessing steps from a saved file.
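
To make the idea behind `DataSet.data_drift_test` concrete, the sketch below computes one possible distance between two samples: the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between their empirical CDFs. The listing above does not say which metrics the library supports, so this metric and the helper name `ks_statistic` are assumptions chosen for illustration.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    points = sorted(set(a) | set(b))
    max_gap = 0.0
    i = j = 0
    for p in points:
        # Advance each pointer past all values <= p, so i/len(a) is the ECDF at p.
        while i < len(a) and a[i] <= p:
            i += 1
        while j < len(b) and b[j] <= p:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap


reference = [1, 2, 3, 4, 5]
identical = ks_statistic(reference, [1, 2, 3, 4, 5])
overlapping = ks_statistic(reference, [3, 4, 5, 6, 7])
disjoint = ks_statistic(reference, [11, 12, 13, 14, 15])
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap), so larger values indicate stronger drift between the two samples.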