Motivations
Real-world data often contains many missing values, caused for example by data corruption or a failure to record data. Handling missing data matters because many machine learning algorithms do not accept inputs with missing values. XGBoost, however, handles missing values natively, so we may not need to impute missing data before training it.
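To make the idea of native missing-value handling concrete, here is a minimal pure-Python sketch of a "learned default direction" split search: for every candidate threshold, the rows with a missing value are tried on the left and on the right, and the better direction is kept. This is a simplified illustration and not XGBoost's actual implementation; the function names `split_gain` and `best_split_with_default`, the toy data, and the squared-error setup (gradient = prediction - target, hessian = 1, initial prediction 0) are all assumptions made for the example.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Gain of a split from gradient/hessian sums (constant factors omitted)."""
    def score(g, h):
        return g * g / (h + lam)
    return (score(g_left, h_left) + score(g_right, h_right)
            - score(g_left + g_right, h_left + h_right))


def best_split_with_default(values, grads, hess, lam=1.0):
    """Return (gain, threshold, default_direction) for a single feature.

    Missing rows are sent left or right as a group, whichever gives more gain.
    """
    present = sorted((v, g, h) for v, g, h in zip(values, grads, hess)
                     if not is_missing(v))
    g_miss = sum(g for v, g in zip(values, grads) if is_missing(v))
    h_miss = sum(h for v, h in zip(values, hess) if is_missing(v))
    g_total, h_total = sum(grads), sum(hess)

    best = (0.0, None, None)
    g_left = h_left = 0.0
    for i in range(len(present) - 1):
        value, g, h = present[i]
        g_left += g
        h_left += h
        next_value = present[i + 1][0]
        if next_value == value:
            continue  # no threshold fits between identical values
        threshold = (value + next_value) / 2
        for direction in ("left", "right"):
            gl = g_left + (g_miss if direction == "left" else 0.0)
            hl = h_left + (h_miss if direction == "left" else 0.0)
            gain = split_gain(gl, hl, g_total - gl, h_total - hl, lam)
            if gain > best[0]:
                best = (gain, threshold, direction)
    return best


# Toy data (made up): rows with a missing feature have the same target as
# the rows with large feature values.
values = [1.0, 2.0, None, 8.0, 9.0, float("nan")]
targets = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
grads = [0.0 - y for y in targets]  # squared error, initial prediction 0
hess = [1.0] * len(targets)

best_gain, threshold, direction = best_split_with_default(values, grads, hess)
print(best_gain, threshold, direction)
```

On this toy data the search sends the missing rows to the right branch, together with the large feature values they resemble. The point of the sketch is that "missing" can itself carry signal that a tree can use, which imputation may blur.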
Findings
Experimental results show that XGBoost trained on the original dataset (with missing values) achieves an AUC 3% higher than XGBoost trained on the imputed dataset.
Python Code
Python code for the above findings, using the hmeq.csv dataset:
# ==============================================
# Load data and impute missing data
# ==============================================
# Load data:
import pandas as pd
df = pd.read_csv("http://www.creditriskanalytics.net/uploads/1/9/5/1/19511601/hmeq.csv")
# Convert categories to dummies:
df = pd.get_dummies(df)
# Impute missing data (https://academic.oup.com/bioinformatics/article/28/1/112/219101,
# https://academic.oup.com/aje/article/179/6/764/107562,
# https://github.com/epsilon-machine/missingpy):
from missingpy import MissForest
imputer = MissForest()
df_imputed = imputer.fit_transform(df)
# Convert to data frame:
df_imputed = pd.DataFrame(df_imputed)
# Rename for columns:
df_imputed.columns = df.columns
# Prepare features and target (BAD is the 0/1 target column):
X = df_imputed.drop(labels=["BAD"], axis=1)
# MissForest returns floats, so cast the labels back to integers:
Y = df_imputed["BAD"].astype(int)
# ======================================
# Train XGBoost with imputed dataset
# ======================================
# Train XGBClassifier with cross-validation:
from xgboost import XGBClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=29)
xgb1 = XGBClassifier(random_state=29)
auc_scores1 = cross_val_score(xgb1, X, Y, cv=cv, scoring="roc_auc", n_jobs=-1)
# =====================================
# Train XGBoost with original dataset (missing values kept)
# =====================================
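# XGBoost accepts NaN in the features, so no imputation is needed here.
# cross_val_score clones the estimator, so reusing xgb1 is safe:
X_o = df.drop(labels=["BAD"], axis=1)
auc_scores2 = cross_val_score(xgb1, X_o, Y, cv=cv, scoring="roc_auc", n_jobs=-1)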
# Compare mean AUC (imputed vs. original with missing values):
print("Imputed:", auc_scores1.mean(), "Original:", auc_scores2.mean())
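Comparing only the two means hides how consistent the difference is across folds. Because both runs above use the same `cv` splitter (fixed `random_state`) and the same `Y`, the scores in `auc_scores1` and `auc_scores2` come from matching folds and can be compared pairwise. A minimal sketch of such a comparison follows; the helper name `compare_cv_scores` is hypothetical, and the numbers are made-up placeholders, not results from the experiment.

```python
from statistics import mean, stdev


def compare_cv_scores(scores_a, scores_b):
    """Summarize paired per-fold scores of two models (b compared against a)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both score lists must come from the same folds.")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        # Absolute difference in score units:
        "abs_diff": mean_b - mean_a,
        # Relative difference as a percentage of model a's mean:
        "rel_diff_pct": 100 * (mean_b - mean_a) / mean_a,
        # Spread of the per-fold differences:
        "diff_std": stdev(diffs),
        # Number of folds where model b scored higher:
        "folds_b_better": sum(d > 0 for d in diffs),
    }


# Placeholder values for illustration only (not real AUC results):
imputed_scores = [0.50, 0.60, 0.70]
original_scores = [0.60, 0.65, 0.80]

summary = compare_cv_scores(imputed_scores, original_scores)
for name, value in summary.items():
    print(name, value)
```

With the real arrays, the call would be `compare_cv_scores(list(auc_scores1), list(auc_scores2))`. Reporting both the absolute and the relative difference also makes clear which of the two a statement such as "3% higher" refers to.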