Residual Analysis (Regression)#

Evaluate model residuals.
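As a quick refresher before the diagnostics below: a regression residual is the observed target minus the model prediction, and systematic structure in the residuals (trends, clusters, growing variance) signals model misfit. A minimal illustrative sketch, independent of the Modeva API:

```python
# Residual = observed - predicted; the diagnostics in this example look
# for structure (trends, clusters, heteroscedasticity) in these values.
y_true = [4.0, 2.5, 7.0, 1.0]
y_pred = [3.5, 3.0, 6.0, 1.5]

residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]
mae = sum(abs(r) for r in residuals) / len(residuals)

print(residuals)  # [0.5, -0.5, 1.0, -0.5]
print(mae)        # 0.625
```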

Installation

# To install the required package, use the following command:
# !pip install modeva

Authentication

# To authenticate, use the following command. (For full access, replace the token with your own.)
# from modeva.utils.authenticate import authenticate
# authenticate(auth_code='eaaa4301-b140-484c-8e93-f9f633c8bacb')

Import modeva modules

from modeva import DataSet
from modeva import TestSuite
from modeva.models import MoLGBMRegressor

Load BikeSharing Dataset

ds = DataSet()
ds.load(name="BikeSharing")
ds.set_random_split()
ds.set_target("cnt")

ds.scale_numerical(features=("cnt",), method="log1p")
ds.preprocess()
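The `log1p` scaling applied to `cnt` above compresses the right-skewed count distribution, and `expm1` is its exact inverse, so predictions can be mapped back to the original count scale. A small standard-library illustration (not Modeva-specific):

```python
import math

counts = [0, 10, 100, 1000]                 # right-skewed count values
scaled = [math.log1p(c) for c in counts]    # log(1 + c), safe at c = 0
restored = [math.expm1(s) for s in scaled]  # exact inverse transform

# log1p maps 0 -> 0 and shrinks the spread between large counts, so
# squared-error training is less dominated by the largest values.
print(scaled)
print(restored)
```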

Fit an LGBM model

model = MoLGBMRegressor(name="LGBM-2", max_depth=2, verbose=-1, random_state=0)
model.fit(ds.train_x, ds.train_y.ravel())
MoLGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
                importance_type='split', learning_rate=0.1, max_depth=2,
                min_child_samples=20, min_child_weight=0.001,
                min_split_gain=0.0, n_estimators=100, n_jobs=None,
                num_leaves=31, objective=None, random_state=0, reg_alpha=0.0,
                reg_lambda=0.0, subsample=1.0, subsample_for_bin=200000,
                subsample_freq=0, verbose=-1)


Analyze residual feature importance#

ts = TestSuite(ds, model)
results = ts.diagnose_residual_interpret(dataset="train")
results.plot()


Visualize residuals against a predictor#

results = ts.diagnose_residual_analysis(features="hr", dataset="train")
results.plot()


Visualize residuals against the response variable#

results = ts.diagnose_residual_analysis(features="cnt", dataset="train")
results.plot()


Visualize residuals against the model prediction#

results = ts.diagnose_residual_analysis(use_prediction=True, dataset="train")
results.plot()
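Plotting residuals against the model prediction is chiefly a check for heteroscedasticity, i.e. whether error magnitude grows with the predicted value. A crude stdlib check in the same spirit (an illustrative helper, not part of Modeva):

```python
def abs_residual_by_half(y_pred, residuals):
    """Mean |residual| in the lower vs. upper half of predictions."""
    pairs = sorted(zip(y_pred, residuals))
    mid = len(pairs) // 2
    low = [abs(r) for _, r in pairs[:mid]]
    high = [abs(r) for _, r in pairs[mid:]]
    return sum(low) / len(low), sum(high) / len(high)

# Error magnitude grows with the prediction -> possible heteroscedasticity.
low_mae, high_mae = abs_residual_by_half(
    [1.0, 2.0, 3.0, 4.0], [0.1, -0.2, 0.5, -0.8]
)
print(low_mae, high_mae)
```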


Interpret residuals with an XGB depth-2 model#

results = ts.diagnose_residual_interpret(dataset='test', n_estimators=100, max_depth=2)

XGB-2 feature importance

results.plot("feature_importance")


XGB-2 effect importance

results.plot("effect_importance")


Further interpretation (main effect plot)

ts_residual = results.value["TestSuite"]
ts_residual.interpret_effects("hr", dataset="test").plot()


Further interpretation (local interpretation)

ts_residual.interpret_local_fi(sample_index=20).plot()


Random forest-based residual clustering analysis (absolute residual)#

results = ts.diagnose_residual_cluster(
    dataset="test",
    response_type="abs_residual",
    metric="MAE",
    n_clusters=10,
    cluster_method="pam",
    sample_size=2000,
    rf_n_estimators=100,
    rf_max_depth=5,
)
results.table
Cluster     MAE   Size   abs_residual
4        0.6961  349.0         0.6961
6        0.6672  227.0         0.6672
7        0.5565  176.0         0.5565
2        0.5036  240.0         0.5036
1        0.4871  402.0         0.4871
3        0.3387  237.0         0.3387
9        0.2995  648.0         0.2995
8        0.2889  117.0         0.2889
0        0.2883  717.0         0.2883
5        0.2789  363.0         0.2789
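The MAE column is simply the mean absolute residual within each cluster, so poorly served regions of the data surface as high-MAE clusters. A stdlib sketch of that aggregation (names are illustrative, not Modeva internals):

```python
from collections import defaultdict

def cluster_mae(labels, residuals):
    """Mean absolute residual per cluster label."""
    groups = defaultdict(list)
    for label, r in zip(labels, residuals):
        groups[label].append(abs(r))
    return {label: sum(v) / len(v) for label, v in groups.items()}

maes = cluster_mae([0, 0, 1, 1], [0.2, -0.4, 1.0, -1.2])
# Cluster 1 carries the larger errors and would rank first in the table.
print(maes)
```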


Residual value for each cluster

results.plot("cluster_residual")


Performance metric for each cluster

results.plot("cluster_performance")


Feature importance of the random forest model

results.plot("feature_importance")


Analyze data drift for a specific cluster

data_results = ds.data_drift_test(
    **results.value["clusters"][2]["data_info"],
    distance_metric="PSI",
    psi_method="uniform",
    psi_bins=10
)
data_results.plot("summary")


data_results.plot(name=('density', 'hr'))
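PSI with uniform bins compares the binned distribution of a feature inside the cluster against a reference sample. A stdlib sketch of the standard PSI formula under those assumptions (Modeva's exact implementation may differ):

```python
import math

def psi_uniform(expected, actual, n_bins=10, eps=1e-6):
    """PSI = sum((p - q) * ln(p / q)) over uniform-width bins."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / n_bins or 1.0

    def props(xs):
        counts = [0] * n_bins
        for x in xs:
            counts[min(int((x - lo) / width), n_bins - 1)] += 1
        return [max(c / len(xs), eps) for c in counts]  # eps avoids log(0)

    p, q = props(expected), props(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical samples drift by zero; a shifted sample drifts by much more.
same = psi_uniform([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
shifted = psi_uniform([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(same, shifted)
```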


Random forest-based residual clustering analysis (perturbed residual)#

results = ts.diagnose_residual_cluster(
    dataset="test",
    response_type="abs_residual_perturb",
    metric="MAE",
    n_clusters=10,
    cluster_method="pam",
    sample_size=2000,
    rf_n_estimators=100,
    rf_max_depth=5,
)
results.table
Cluster     MAE   Size   abs_residual_perturb
7        1.0488  150.0                 1.0877
3        0.6053  400.0                 0.8061
8        0.6487  120.0                 0.7433
1        0.4809  350.0                 0.6016
5        0.5565  176.0                 0.6006
4        0.3423  242.0                 0.4784
2        0.3873  440.0                 0.4007
9        0.3002  920.0                 0.3294
6        0.2531  277.0                 0.2955
0        0.2046  401.0                 0.2631


Random forest-based residual clustering analysis (prediction interval width)#

results = ts.diagnose_residual_cluster(
    dataset="test",
    response_type="pi_width",
    metric="MAE",
    n_clusters=10,
    cluster_method="pam",
    sample_size=2000,
    rf_n_estimators=100,
    rf_max_depth=5,
)
results.table
Cluster     MAE   Size   pi_width
9        0.6680   59.0     1.5924
2        0.4744  179.0     1.5321
5        0.4803   63.0     1.3994
0        0.4606  506.0     1.3455
8        0.3719   98.0     1.3270
4        0.6190  147.0     1.2788
6        0.4209  155.0     1.1828
1        0.3035  188.0     1.1128
7        0.2979  113.0     1.0051
3        0.2265  230.0     0.8901
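The `pi_width` response clusters test points by the width of a prediction interval, so wide-interval clusters flag regions of high model uncertainty. One common way to obtain such intervals is split conformal prediction; a stdlib sketch under that assumption (Modeva's exact interval construction may differ):

```python
import math

def conformal_halfwidth(calib_residuals, alpha=0.1):
    """Symmetric interval half-width from held-out calibration residuals:
    the ceil((1 - alpha)(n + 1))-th smallest |residual| (split conformal)."""
    scores = sorted(abs(r) for r in calib_residuals)
    n = len(scores)
    k = min(n - 1, math.ceil((1 - alpha) * (n + 1)) - 1)
    return scores[k]

# With 9 calibration residuals and alpha=0.1 the largest |residual| is
# used, so each prediction gets an interval of width 2 * halfwidth.
halfwidth = conformal_halfwidth(
    [0.1, -0.2, 0.3, 0.4, -0.5, 0.6, 0.7, -0.8, 0.9], alpha=0.1
)
print(halfwidth)
```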


Compare residual clusters across multiple models#

benchmark = MoLGBMRegressor(name="LGBM-5", max_depth=5, verbose=-1, random_state=0)
benchmark.fit(ds.train_x, ds.train_y.ravel())

tsc = TestSuite(ds, models=[model, benchmark])
results = tsc.compare_residual_cluster(dataset="test")
results.table
          LGBM-2          LGBM-5
Cluster   MSE     Size    MSE     Size
4         0.5837  349.0   0.1151  349.0
6         0.5975  227.0   0.1531  227.0
7         0.4996  176.0   0.3053  176.0
2         0.3003  240.0   0.0650  240.0
1         0.3756  402.0   0.1956  402.0
3         0.1880  237.0   0.1145  237.0
9         0.1468  648.0   0.0506  648.0
8         0.1422  117.0   0.0730  117.0
0         0.2079  717.0   0.0872  717.0
5         0.1150  363.0   0.0370  363.0


results.plot("cluster_performance")


Total running time of the script: (2 minutes 7.870 seconds)

Gallery generated by Sphinx-Gallery