modeva.DataSet.impute_missing

Performs data imputation for missing values with optional indicator columns.

This function handles missing value imputation across multiple features using various strategies, with special consideration for categorical features. It can create indicator columns to track where imputation occurred and handles both numeric and categorical data appropriately.
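The general idea of imputing with indicator columns can be sketched in plain pandas. This is an illustration of the technique only, not the Modeva implementation: the toy DataFrame, the column names, and the "_missing" suffix for indicator columns are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one numeric and one categorical column, each with a gap.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "city": ["NY", None, "NY", "LA"],
})

# Record where values were missing before filling them in.
# The "_missing" suffix is a naming choice made for this sketch.
for col in ["age", "city"]:
    df[f"{col}_missing"] = df[col].isna().astype(int)

# Numeric column: fill with the mean of the observed values.
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: fill with the most frequent observed value.
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])

print(df)
```

The indicator columns keep the information that a value was originally absent, which a downstream model can use even after the gap has been filled.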

Parameters:

features (str or tuple, default=None) – Feature names to process. If None, all features in the dataset will be processed.
dataset ({"main", "train", "test"}, default="main") – The data split used to fit the imputer.
method (str, default='mean') –
The imputation strategy. Categorical features use “most_frequent” by default.
- If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
- If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
- If “most_frequent”, then replace missing values using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
- If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas DataFrames with nullable integer dtypes that contain missing values, missing_values can be set to either np.nan or pd.NA.
fill_value (str or numerical value, default=None) – When method == “constant”, fill_value is used to replace all occurrences of missing_values. For string or object data types, fill_value must be a string. If None, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
add_indicators (bool, default=True) – If True, adds columns indicating where missing and special values were imputed.
special_values (list, optional, default=None) – A list of special values (e.g., [“999”, “888”]) to be treated as missing. Must be compatible with the data type of the columns.
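The strategy descriptions above closely match scikit-learn's SimpleImputer, so the behaviour of the options can be explored with that class directly. The following is a minimal sketch under that assumption; it does not call Modeva. The array values and the manual replacement of the special value 999 are illustrative stand-ins for what special_values is described as doing.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric data (hypothetical); 999 plays the role of a special value.
X = np.array([[1.0, 999.0],
              [np.nan, 2.0],
              [3.0, 2.0],
              [1.0, 4.0]])

# Step 1: treat the special value as missing by mapping it to NaN.
X = np.where(X == 999.0, np.nan, X)

# Step 2: median imputation; add_indicator appends one 0/1 column
# for each feature that had missing values during fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)
print(X_out)

# Tie-breaking for "most_frequent": 1.0 and 3.0 both occur twice,
# so the smaller value is used to fill the gap.
tie = SimpleImputer(strategy="most_frequent").fit_transform(
    np.array([[1.0], [3.0], [3.0], [1.0], [np.nan]])
)
print(tie[-1, 0])
```

In the first output, the two leading columns hold the imputed features and the two trailing columns mark which entries were filled.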

Returns:

A container object with the following components:

key: “data_preprocess_imputing”
data: Name of the dataset used
inputs: Dictionary of input parameters
value: Dictionary containing imputation configuration for each feature:
- “fidx”: Feature index
- “imputer”: Fitted IndicatorImputer instance
- “feature_names_out”: List of output feature names including indicators

Return type:

ValidationResult

Examples

Data Processing and Feature Engineering