Dealing with Date Variables

Installation

# To install the required package, use the following command:
# !pip install modeva

Authentication

# To authenticate, use the following command. (For full access, replace the token with your own.)
# from modeva.utils.authenticate import authenticate
# authenticate(auth_code='eaaa4301-b140-484c-8e93-f9f633c8bacb')

Load BikeSharing Dataset

import pandas as pd
from modeva import DataSet
from modeva.data.utils.loading import load_builtin_data

data = load_builtin_data("BikeSharing")
data['Date'] = (pd.to_datetime('2011-01-01') + pd.to_timedelta(data.index / 24, unit='D')).date
data.head()

	season	mnth	hr	weekday	weathersit	temp	atemp	hum	cnt	Date
0	1	1	0	6	1	0.24	0.2879	0.81	16	2011-01-01
1	1	1	1	6	1	0.22	0.2727	0.80	40	2011-01-01
2	1	1	2	6	1	0.22	0.2727	0.80	32	2011-01-01
3	1	1	3	6	1	0.24	0.2879	0.75	13	2011-01-01
4	1	1	4	6	1	0.24	0.2879	0.75	1	2011-01-01
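
The `Date` column is derived from the row index rather than read from the dataset: row `i` is treated as `i / 24` days after 2011-01-01, so every block of 24 consecutive rows shares one calendar date. A minimal sketch of that arithmetic on a stand-in index (the name `row_index` is illustrative; 17379 is the row count reported further down in this example):

```python
import pandas as pd

# Stand-in for data.index: one integer per row, starting at 0.
row_index = pd.RangeIndex(17379)

# Same expression as in the example: row i -> i / 24 days after the start date.
dates = (pd.to_datetime("2011-01-01") + pd.to_timedelta(row_index / 24, unit="D")).date

# Rows 0-23 share the first date; row 24 starts the next one.
print(dates[0], dates[23], dates[24])
# Last derived date and number of distinct dates.
print(dates[-1], len(set(dates)))
```

Because the mapping assumes one row per consecutive hour, the result is a synthetic calendar for demonstration purposes. In the output shown later, for instance, the last date begins on a row whose `hr` is 21 rather than 0.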

Create some missing and special values for demonstration purposes

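# Use one .iloc call per assignment to avoid chained indexing.
date_col = data.columns.get_loc("Date")
data.iloc[:10, date_col] = "SV1"
data.iloc[10:15, date_col] = "SV2"
# This slice overlaps both ranges above: rows 5-9 lose "SV1" and rows 10-14 lose "SV2".
data.iloc[5:20, date_col] = pd.NA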
data.head()

	season	mnth	hr	weekday	weathersit	temp	atemp	hum	cnt	Date
0	1	1	0	6	1	0.24	0.2879	0.81	16	SV1
1	1	1	1	6	1	0.22	0.2727	0.80	40	SV1
2	1	1	2	6	1	0.22	0.2727	0.80	32	SV1
3	1	1	3	6	1	0.24	0.2879	0.75	13	SV1
4	1	1	4	6	1	0.24	0.2879	0.75	1	SV1
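
The three slices assigned above overlap, and the last assignment wins: rows 5-19 end up missing, which removes `SV1` from rows 5-9 and removes `SV2` entirely. The sketch below replays the same assignments on a small stand-in frame (`demo` is illustrative, not part of the example) with a single `.iloc` call per step, then counts what survives.

```python
import pandas as pd

# Stand-in frame with 24 rows and an object-typed Date column.
demo = pd.DataFrame({"Date": pd.Series(["2011-01-01"] * 24, dtype=object)})
date_col = demo.columns.get_loc("Date")

demo.iloc[:10, date_col] = "SV1"    # rows 0-9
demo.iloc[10:15, date_col] = "SV2"  # rows 10-14
demo.iloc[5:20, date_col] = pd.NA   # rows 5-19, overwriting part of both ranges

n_sv1 = int(demo["Date"].eq("SV1").sum())
n_sv2 = int(demo["Date"].eq("SV2").sum())
n_missing = int(demo["Date"].isna().sum())
print(n_sv1, n_sv2, n_missing)
```

These counts line up with the categorical summary later in the example, which reports 5 rows flagged as `SV1`, 15 rows flagged as missing, and no indicator column for `SV2`.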

Load the data into Modeva DataSet

ds = DataSet()
ds.load_dataframe(data)
ds.set_target("cnt")
ds.set_inactive_features(features=('Date', ))
ds.set_random_split(shuffle=False)

ds.reset_preprocess()
ds.impute_missing(features="Date", method='constant', fill_value="2011-01-01",
                  add_indicators=True, special_values=["SV1", "SV2"])
# Uncomment the following lines to convert the date into binned integers.
# ds.encode_categorical(features=("Date", ), method="ordinal")
# ds.bin_numerical(features=("Date", ), bins=5)
ds.preprocess()
ds.to_df()

	season	yr	mnth	hr	holiday	weekday	workingday	weathersit	temp	atemp	hum	windspeed	cnt	Date	Date_missing_nan	Date_special_SV1
0	1	0	1	0	0	6	0	1	0.24	0.2879	0.81	0.0000	16	2011-01-01	0	True
1	1	0	1	1	0	6	0	1	0.22	0.2727	0.80	0.0000	40	2011-01-01	0	True
2	1	0	1	2	0	6	0	1	0.22	0.2727	0.80	0.0000	32	2011-01-01	0	True
3	1	0	1	3	0	6	0	1	0.24	0.2879	0.75	0.0000	13	2011-01-01	0	True
4	1	0	1	4	0	6	0	1	0.24	0.2879	0.75	0.0000	1	2011-01-01	0	True
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
17374	1	1	12	19	0	1	1	2	0.26	0.2576	0.60	0.1642	119	2012-12-24	0	False
17375	1	1	12	20	0	1	1	2	0.26	0.2576	0.60	0.1642	89	2012-12-24	0	False
17376	1	1	12	21	0	1	1	1	0.26	0.2576	0.60	0.1642	90	2012-12-25	0	False
17377	1	1	12	22	0	1	1	1	0.26	0.2727	0.56	0.1343	61	2012-12-25	0	False
17378	1	1	12	23	0	1	1	1	0.26	0.2727	0.65	0.1343	49	2012-12-25	0	False

17379 rows × 16 columns
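
To make the effect of `impute_missing` with `add_indicators=True` easier to follow, here is a rough pandas-only approximation of the result shown above. It is a sketch, not Modeva's implementation: the helper `impute_constant_with_indicators` is hypothetical, the indicator column names simply mirror the output above, and skipping indicators for special values that never occur is an assumption inferred from the absence of an `SV2` column.

```python
import pandas as pd


def impute_constant_with_indicators(series, fill_value, special_values):
    """Hypothetical pandas-only helper; not Modeva's implementation."""
    name = series.name
    out = pd.DataFrame({name: series})

    # Indicator for genuinely missing entries (0/1, as in the output above).
    missing = series.isna()
    out[f"{name}_missing_nan"] = missing.astype(int)

    # One boolean indicator per special value that actually occurs.
    special = pd.Series(False, index=series.index)
    for value in special_values:
        hit = series.eq(value)
        if hit.any():
            out[f"{name}_special_{value}"] = hit
        special |= hit

    # Replace both missing and special entries with the constant.
    out[name] = series.mask(missing | special, fill_value)
    return out


raw = pd.Series(["SV1", None, "2011-01-02", "SV1"], name="Date", dtype=object)
imputed = impute_constant_with_indicators(raw, "2011-01-01", ["SV1", "SV2"])
print(imputed)
```

The indicator columns keep a record of which rows originally held a missing or special value, even though the `Date` column itself is fully populated after imputation.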

Data summary

result = ds.summary()
result.table["summary"]

	samples	features	numerical	categorical	mixed	date	duplicated	missing cells	missing cells (%)	infinite cells	infinite cells (%)
0	17379	16	8	7	0	1	0	0	0.0	0	0.0

Data summary results for numerical variables

result.table["numerical"]

	missing	inf	unique	mean	std	min	25%	median	75%	max
name
mnth	0	0	12	6.537775	3.438677	1.00	4.0000	7.0000	10.0000	12.0000
hr	0	0	24	11.546752	6.914206	0.00	6.0000	12.0000	18.0000	23.0000
weekday	0	0	7	3.003683	2.005714	0.00	1.0000	3.0000	5.0000	6.0000
temp	0	0	50	0.496987	0.192551	0.02	0.3400	0.5000	0.6600	1.0000
atemp	0	0	65	0.475775	0.171845	0.00	0.3333	0.4848	0.6212	1.0000
hum	0	0	89	0.627229	0.192924	0.00	0.4800	0.6300	0.7800	1.0000
windspeed	0	0	30	0.190098	0.122337	0.00	0.1045	0.1940	0.2537	0.8507
cnt	0	0	869	189.463088	181.382380	1.00	40.0000	142.0000	281.0000	977.0000

Data summary results for categorical variables

result.table["categorical"]

	missing	unique	top1	top1-counts	top2	top2-counts	top3	top3-counts
name
season	0	4	3	4496	2	4409	1.0	4242.0
yr	0	2	1	8734	0	8645	NaN	NaN
holiday	0	2	0	16879	1	500	NaN	NaN
workingday	0	2	1	11865	0	5514	NaN	NaN
weathersit	0	4	1	11413	2	4544	3.0	1419.0
Date_missing_nan	0	2	0	17364	1	15	NaN	NaN
Date_special_SV1	0	2	False	17374	True	5	NaN	NaN

Data summary results for mixed numerical and categorical variables

result.table["mixed"]

	missing	inf	unique	mean	std	min	25%	median	75%	max	categories
name

Data summary results for date type variables

result.table["date"]

	missing	unique	min	max	range_days
name
Date	0	725	2011-01-01	2012-12-25	724
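
The `range_days` value is consistent with the number of days between the `min` and `max` dates. That reading is an assumption based on the numbers in the table, not a documented definition. A quick check with the standard library:

```python
from datetime import date

first, last = date(2011, 1, 1), date(2012, 12, 25)

# Days elapsed between the earliest and latest date.
range_days = (last - first).days
print(range_days)

# If no calendar day is skipped, the number of distinct dates is one more.
print(range_days + 1)
```

The second number matches the `unique` count of 725 in the table, which suggests every calendar day between the two endpoints appears in the column.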

EDA 2D

EDA 2D between Date and a numerical feature

result = ds.eda_2d(feature_x="Date", feature_y="cnt")
result.plot()

EDA 3D

result = ds.eda_3d(feature_x="Date", feature_y="hr", feature_z="cnt", sample_size=1000)
result.plot()

