Residual Analysis (Regression)#

Evaluate model residuals.
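As a quick refresher before the diagnostics below: a regression residual is the observed target minus the model prediction, and systematic structure in the residuals (trends, clusters, growing variance) signals model misfit. A minimal illustrative sketch, independent of the Modeva API:

```python
# Residual = observed - predicted; the diagnostics in this example look
# for structure (trends, clusters, heteroscedasticity) in these values.
y_true = [4.0, 2.5, 7.0, 1.0]
y_pred = [3.5, 3.0, 6.0, 1.5]

residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]
mae = sum(abs(r) for r in residuals) / len(residuals)

print(residuals)  # [0.5, -0.5, 1.0, -0.5]
print(mae)        # 0.625
```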

Installation

# To install the required package, use the following command:
# !pip install modeva

Authentication

# To authenticate, use the following command. (For full access, replace the token with your own.)
# from modeva.utils.authenticate import authenticate
# authenticate(auth_code='eaaa4301-b140-484c-8e93-f9f633c8bacb')

Import modeva modules

from modeva import DataSet
from modeva import TestSuite
from modeva.models import MoLGBMRegressor

Load BikeSharing Dataset

ds = DataSet()
ds.load(name="BikeSharing")
ds.set_random_split()
ds.set_target("cnt")

ds.scale_numerical(features=("cnt",), method="log1p")
ds.preprocess()
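The `log1p` scaling applied to `cnt` above compresses the right-skewed count distribution, and `expm1` is its exact inverse, so predictions can be mapped back to the original count scale. A small standard-library illustration (not Modeva-specific):

```python
import math

counts = [0, 10, 100, 1000]                 # right-skewed count values
scaled = [math.log1p(c) for c in counts]    # log(1 + c), safe at c = 0
restored = [math.expm1(s) for s in scaled]  # exact inverse transform

# log1p maps 0 -> 0 and shrinks the spread between large counts, so
# squared-error training is less dominated by the largest values.
print(scaled)
print(restored)
```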

Fit an LGBM model

model = MoLGBMRegressor(name="LGBM-2", max_depth=2, verbose=-1, random_state=0)
model.fit(ds.train_x, ds.train_y.ravel())
MoLGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
                importance_type='split', learning_rate=0.1, max_depth=2,
                min_child_samples=20, min_child_weight=0.001,
                min_split_gain=0.0, n_estimators=100, n_jobs=None,
                num_leaves=31, objective=None, random_state=0, reg_alpha=0.0,
                reg_lambda=0.0, subsample=1.0, subsample_for_bin=200000,
                subsample_freq=0, verbose=-1)


Analyze residual feature importance#

ts = TestSuite(ds, model)
results = ts.diagnose_residual_interpret(dataset="train")
results.plot()


Visualize residuals against a predictor#

results = ts.diagnose_residual_analysis(features="hr", dataset="train")
results.plot()


Visualize residuals against the response variable#

results = ts.diagnose_residual_analysis(features="cnt", dataset="train")
results.plot()


Visualize residuals against the model prediction#

results = ts.diagnose_residual_analysis(use_prediction=True, dataset="train")
results.plot()
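Plotting residuals against the model prediction is chiefly a check for heteroscedasticity, i.e. whether error magnitude grows with the predicted value. A crude stdlib check in the same spirit (an illustrative helper, not part of Modeva):

```python
def abs_residual_by_half(y_pred, residuals):
    """Mean |residual| in the lower vs. upper half of predictions."""
    pairs = sorted(zip(y_pred, residuals))
    mid = len(pairs) // 2
    low = [abs(r) for _, r in pairs[:mid]]
    high = [abs(r) for _, r in pairs[mid:]]
    return sum(low) / len(low), sum(high) / len(high)

# Error magnitude grows with the prediction -> possible heteroscedasticity.
low_mae, high_mae = abs_residual_by_half(
    [1.0, 2.0, 3.0, 4.0], [0.1, -0.2, 0.5, -0.8]
)
print(low_mae, high_mae)
```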


Interpret residuals with an XGB depth-2 model#

results = ts.diagnose_residual_interpret(dataset='test', n_estimators=100, max_depth=2)

XGB-2 feature importance

results.plot("feature_importance")


XGB-2 effect importance

results.plot("effect_importance")


Further interpretation (main effect plot)

ts_residual = results.value["TestSuite"]
ts_residual.interpret_effects("hr", dataset="test").plot()


Further interpretation (local interpretation)

ts_residual.interpret_local_fi(sample_index=20).plot()


Random forest-based residual clustering analysis (absolute residual)#

results = ts.diagnose_residual_cluster(
    dataset="test",
    response_type="abs_residual",
    metric="MAE",
    n_clusters=10,
    cluster_method="pam",
    sample_size=2000,
    rf_n_estimators=100,
    rf_max_depth=5,
)
results.table
Cluster     MAE   Size   abs_residual
4        0.6961  349.0         0.6961
6        0.6672  227.0         0.6672
7        0.5565  176.0         0.5565
2        0.5036  240.0         0.5036
1        0.4871  402.0         0.4871
3        0.3387  237.0         0.3387
9        0.2995  648.0         0.2995
8        0.2889  117.0         0.2889
0        0.2883  717.0         0.2883
5        0.2789  363.0         0.2789
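The MAE column is simply the mean absolute residual within each cluster, so poorly served regions of the data surface as high-MAE clusters. A stdlib sketch of that aggregation (names are illustrative, not Modeva internals):

```python
from collections import defaultdict

def cluster_mae(labels, residuals):
    """Mean absolute residual per cluster label."""
    groups = defaultdict(list)
    for label, r in zip(labels, residuals):
        groups[label].append(abs(r))
    return {label: sum(v) / len(v) for label, v in groups.items()}

maes = cluster_mae([0, 0, 1, 1], [0.2, -0.4, 1.0, -1.2])
# Cluster 1 carries the larger errors and would rank first in the table.
print(maes)
```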


Residual value for each cluster

results.plot("cluster_residual")


Performance metric for each cluster

results.plot("cluster_performance")


Feature importance of the random forest model

results.plot("feature_importance")


Analyze data drift for a specific cluster

data_results = ds.data_drift_test(
    **results.value["clusters"][2]["data_info"],
    distance_metric="PSI",
    psi_method="uniform",
    psi_bins=10
)
data_results.plot("summary")


data_results.plot(name=('density', 'hr'))
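PSI with uniform bins compares the binned distribution of a feature inside the cluster against a reference sample. A stdlib sketch of the standard PSI formula under those assumptions (Modeva's exact implementation may differ):

```python
import math

def psi_uniform(expected, actual, n_bins=10, eps=1e-6):
    """PSI = sum((p - q) * ln(p / q)) over uniform-width bins."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / n_bins or 1.0

    def props(xs):
        counts = [0] * n_bins
        for x in xs:
            counts[min(int((x - lo) / width), n_bins - 1)] += 1
        return [max(c / len(xs), eps) for c in counts]  # eps avoids log(0)

    p, q = props(expected), props(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical samples drift by zero; a shifted sample drifts by much more.
same = psi_uniform([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
shifted = psi_uniform([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(same, shifted)
```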


Random forest-based residual clustering analysis (perturbed residual)#

results = ts.diagnose_residual_cluster(
    dataset="test",
    response_type="abs_residual_perturb",
    metric="MAE",
    n_clusters=10,
    cluster_method="pam",
    sample_size=2000,
    rf_n_estimators=100,
    rf_max_depth=5,
)
results.table
Cluster     MAE   Size   abs_residual_perturb
7        1.0488  150.0                 1.0877
3        0.6053  400.0                 0.8061
8        0.6487  120.0                 0.7433
1        0.4809  350.0                 0.6016
5        0.5565  176.0                 0.6006
4        0.3423  242.0                 0.4784
2        0.3873  440.0                 0.4007
9        0.3002  920.0                 0.3294
6        0.2531  277.0                 0.2955
0        0.2046  401.0                 0.2631


Random forest-based residual clustering analysis (prediction interval width)#

results = ts.diagnose_residual_cluster(
    dataset="test",
    response_type="pi_width",
    metric="MAE",
    n_clusters=10,
    cluster_method="pam",
    sample_size=2000,
    rf_n_estimators=100,
    rf_max_depth=5,
)
results.table
Cluster     MAE   Size   pi_width
9        0.6680   59.0     1.5924
2        0.4744  179.0     1.5321
5        0.4803   63.0     1.3994
0        0.4606  506.0     1.3455
8        0.3719   98.0     1.3270
4        0.6190  147.0     1.2788
6        0.4209  155.0     1.1828
1        0.3035  188.0     1.1128
7        0.2979  113.0     1.0051
3        0.2265  230.0     0.8901
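The `pi_width` response clusters test points by the width of a prediction interval, so wide-interval clusters flag regions of high model uncertainty. One common way to obtain such intervals is split conformal prediction; a stdlib sketch under that assumption (Modeva's exact interval construction may differ):

```python
import math

def conformal_halfwidth(calib_residuals, alpha=0.1):
    """Symmetric interval half-width from held-out calibration residuals:
    the ceil((1 - alpha)(n + 1))-th smallest |residual| (split conformal)."""
    scores = sorted(abs(r) for r in calib_residuals)
    n = len(scores)
    k = min(n - 1, math.ceil((1 - alpha) * (n + 1)) - 1)
    return scores[k]

# With 9 calibration residuals and alpha=0.1 the largest |residual| is
# used, so each prediction gets an interval of width 2 * halfwidth.
halfwidth = conformal_halfwidth(
    [0.1, -0.2, 0.3, 0.4, -0.5, 0.6, 0.7, -0.8, 0.9], alpha=0.1
)
print(halfwidth)
```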


Compare residual clusters across multiple models#

benchmark = MoLGBMRegressor(name="LGBM-5", max_depth=5, verbose=-1, random_state=0)
benchmark.fit(ds.train_x, ds.train_y.ravel())

tsc = TestSuite(ds, models=[model, benchmark])
results = tsc.compare_residual_cluster(dataset="test")
results.table
          LGBM-2          LGBM-5
Cluster   MSE     Size    MSE     Size
4         0.5837  349.0   0.1151  349.0
6         0.5975  227.0   0.1531  227.0
7         0.4996  176.0   0.3053  176.0
2         0.3003  240.0   0.0650  240.0
1         0.3756  402.0   0.1956  402.0
3         0.1880  237.0   0.1145  237.0
9         0.1468  648.0   0.0506  648.0
8         0.1422  117.0   0.0730  117.0
0         0.2079  717.0   0.0872  717.0
5         0.1150  363.0   0.0370  363.0


results.plot("cluster_performance")


Total running time of the script: (2 minutes 7.870 seconds)

Gallery generated by Sphinx-Gallery