Basic Data Operations#

This section introduces some basic data operatios in Modeva, such as loading, summary, preprocessing and registration. All these operations are based on the DataSet class.

Data Loading#

Built-in Dataset#

There are four built-in datasets available in Modeva for demo purposes. The datasets are:

  • BikeSharing (Regression case)

  • CaliforniaHousing (Regression case),

  • SimuCredit (Classification case)

  • TaiwanCredit (Classification case)

One may use the DataSet.load function to a built-in dataset. For example,

## Create an instance of DataSet class
from modeva import DataSet
ds = DataSet()
ds.load("SimuCredit")
ds.data.head(5)
Built-in SimuCredit Dataset

External Dataset#

Modeva DataSet class supports load_csv, load_dataframe and load_spark functions to load external datasets. For example,

# Load external data
import pandas as pd
from sklearn.datasets import load_iris
from modeva import DataSet
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
ds = DataSet(name="IrisData")
ds.load_dataframe(df)

Data Summary#

Run DataSet.summary to get a summary of a dataset, which includes the overall summary of the dataset, descriptive statistics of categorical variables and numerical variables.

The overall summary, return from res.table[“summary”], includes the number of samples, number of features with different types (numerical, categorical, and mixed), number of duplicated samples, number and percentage of missing and infinite values.

res = ds.summary()
res.table["summary"]
Overall Summary of the Dataset

Categorical Variables#

The summary statistics of categorical variables, return from res.table[“categorical”], includes the number of missing values, number of unique values, and the frequency of top 2 unique values, for each categorical variable.

Summary of Categorical Variables in Dataset

Numerical Variables#

The summary statistics of numerical variables, return from res.table[“numerical”], includes the number of missing and inf values, number of unique values, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum value, for each numerical variable.

Summary of Numerical Variables in Dataset

Data Preprocessing#

Data preprocessing in Modeva enables cleaning and transforming raw datasets to ensure they are ready for model development. All preprocessing steps are executed using the DataSet class, with DataSet.reset_preprocess to initiate the preprocessing and DataSet.preprocess to execute the defined preprocessing steps.

ds.reset_preprocess()
ds.xxxxxx() # defined preprocessing steps
ds.preprocess()

Below is a list of key functionalities in Modeva for data preprocessing:

Handling Missing Values#

Run DataSet.impute_missing function to impute missing values in the dataset. The function supports imputing missing values of numerical, categorical, and mixed features with different methods: {mean, median, most_frequent, constant}. The function also supports adding an indicator for imputed values.

# Impute missing values of umerical features with `mean/median/constant` and add an indicator
ds.impute_missing(features=ds.feature_names_numerical,
    method='mean', add_indicators=True)

# Impute missing values of categorical features with `most_frequent` and add an indicator
ds.impute_missing(features=ds.feature_names_categorical,
    method='most_frequent', add_indicators=True)

# Impute missing and special values of mixed features and add an indicator.
ds.impute_missing(features=ds.feature_names_mixed,
    method='median', add_indicators=True, special_values=["SV1", "SV2"])

Categorical Variable Encoding#

Run DataSet.encode_categorical function to encode categorical features. The function supports encoding categorical features using {one-hot, ordinal} methods.

ds.encode_categorical(features=("Gender", "Race"), method="onehot")

Numerical Variable Scaling#

Run DataSet.scale_numerical function to scale numerical features. The function supports scaling numerical features using {standardize, minmax, quantile, log1p, square} methods.

ds.scale_numerical(features=("Mortgage", "Balance"), method="log1p")
ds.scale_numerical(features=("Delinquency",), method="minmax")
ds.scale_numerical(features=("Inquiry",), method="quantile")

Numerical Variable Binning#

Run DataSet.bin_numerical function to bin numerical features. The function supports binning numerical features using {uniform, quantile, precompute} methods.

ds.bin_numerical(features=("Utilization",),
    bins=10, method="uniform")
ds.bin_numerical(features=("Mortgage", "Balance","Amount Past Due"),
    bins=10, method="quantile")

Data Preparation#

Data preparation involves configuring the dataset for modeling purpose. Modeva provides the following functionalities

ds.set_random_split()
ds.set_target("Status")
ds.set_inactive_features(features=('Gender','Race'))

Data Registration#

Modeva supports registration of datasets, making it easier to manage and reuse them across multiple experiments. It leverages the open-source MLflow framework and provide the following functionalities:

ds.register(name="A0-SimuCredit", override=True)
ds.list_registered_data()
Modeva Data Registry

Examples#