In hospitals, medical treatments and surgeries can be categorized into inpatient and outpatient procedures. For patients, it is important to understand the difference between these two types of care, because they impact the length of a patient’s stay in a medical facility and the cost of a procedure.
The Patient Treatment Classification dataset is an Electronic Health Record collected from a private hospital in Indonesia. It contains patients’ demographics (i.e., age, sex) and laboratory test results that can be used to determine the next patient care setting: inpatient or outpatient care. The target variable, source, is labelled 1 for inpatient and 0 for outpatient.
Source: https://www.kaggle.com/datasets/manishkc06/patient-treatment-classification
Build machine learning models to predict whether a patient should receive inpatient or outpatient care based on their laboratory test results, using the Patient Treatment Classification dataset. Here, we will build logistic regression and random forest models.
Loading the modules and data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
df = pd.read_csv("../data/patient_treatment_classification.csv")
df.columns = df.columns.str.lower()
Restructuring the data
# Encode sex as a 0/1 dummy variable (1 = female) and store the categorical columns as the category dtype
df['sex'] = df['sex'].apply(lambda x: 1 if x=='F' else 0).astype('category')
df = df.rename(columns = {'sex':'sex_f'})
df['source'] = df['source'].astype('category')
with pd.option_context('display.max_columns', None):
    print(df.head())
## haematocrit haemoglobins erythrocyte leucocyte thrombocyte mch mchc \
## 0 33.8 11.1 4.18 4.6 150 26.6 32.8
## 1 44.6 14.0 6.86 6.3 232 20.4 31.4
## 2 42.9 14.0 4.57 6.2 336 30.6 32.6
## 3 41.9 14.4 4.67 3.5 276 30.8 34.4
## 4 40.6 13.3 4.85 14.9 711 27.4 32.8
##
## mcv age sex_f source
## 0 80.9 33 1 1
## 1 65.0 36 0 0
## 2 93.9 70 1 0
## 3 89.7 18 1 0
## 4 83.7 36 0 0
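Before splitting the data, it is worth checking how balanced the two classes are, since overall accuracy can be misleading when one class dominates. A quick check (the exact proportions are not shown here):
# Share of outpatient (0) vs. inpatient (1) records in the target
df['source'].value_counts(normalize = True)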
Splitting the data into train and test sets and scaling the x matrices.
x_train, x_test, y_train, y_test = train_test_split(
    df[df.columns.difference(['source'])], df['source'],
    test_size = 0.25, random_state = 0
)
# Scale the numeric features; fit the scaler on the training set only and
# reuse it on the test set to avoid data leakage
scaler = StandardScaler()
num_cols = x_train.columns.difference(['sex_f'])
x_train[num_cols] = scaler.fit_transform(x_train[num_cols])
x_test[num_cols] = scaler.transform(x_test[num_cols])
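As an alternative to scaling by hand, the preprocessing can be bundled with the estimator in a scikit-learn Pipeline, which guarantees the scaler is only ever fit on training data. A minimal sketch (it would be fit on the unscaled split, so it replaces the manual scaling above rather than following it):
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Scale every feature except the sex_f dummy; pass sex_f through unchanged
preprocess = ColumnTransformer(
    [('scale', StandardScaler(), x_train.columns.difference(['sex_f']))],
    remainder = 'passthrough'
)
lr_pipe = Pipeline([
    ('prep', preprocess),
    ('model', LogisticRegression(max_iter = 250))
])
# lr_pipe.fit(x_train, y_train) would then fit the scaler and model together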
Running the logistic regression model
lr_fit = LogisticRegression(max_iter = 250)
lr_fit.fit(x_train, y_train)
## LogisticRegression(max_iter=250)
score = lr_fit.score(x_test, y_test)
score.round(2)
## 0.73
The accuracy is 73%.
Creating the confusion matrix.
predictions = lr_fit.predict(x_test)
cmtx = pd.DataFrame(
metrics.confusion_matrix(y_test, predictions),
index=['true: outpatient', 'true: inpatient'],
columns=['pred: outpatient', 'pred: inpatient']
)
cmtx
## pred: outpatient pred: inpatient
## true: outpatient 446 60
## true: inpatient 164 158
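Because 164 of the 322 inpatients are misclassified, per-class precision and recall are more informative here than overall accuracy: inpatient recall is only 158/322 ≈ 49%. scikit-learn’s classification_report summarizes both per class:
# Per-class precision, recall and F1 score for the logistic regression model
print(metrics.classification_report(y_test, predictions, target_names = ['outpatient', 'inpatient']))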
Question: Which variables are positively correlated with the probability of inpatient status?
coefs = pd.DataFrame({'features': x_train.columns,
                      'coef': lr_fit.coef_.ravel().round(2)})
pos_coefs = coefs.loc[coefs['coef'] >= 0]
neg_coefs = coefs.loc[coefs['coef'] < 0]
pos_coefs; neg_coefs
##        features  coef
## 0           age  0.09
## 2   haematocrit  1.19
## 4     leucocyte  0.41
## 6          mchc  0.44
## 7           mcv  0.06
##        features  coef
## 1   erythrocyte -0.70
## 3  haemoglobins -1.20
## 5           mch -0.60
## 8         sex_f -0.39
## 9   thrombocyte -0.72
Based on the model, inpatient status is more likely with higher age, haematocrit, leucocyte, mchc, and mcv values, with lower erythrocyte, haemoglobin, mch, and thrombocyte values, and with male sex. However, some of these coefficients (e.g., mcv at 0.06 and age at 0.09) are close to zero and may contribute little to the prediction.
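Since logistic regression coefficients are on the log-odds scale, exponentiating them yields odds ratios, which can be easier to interpret: haematocrit’s coefficient of 1.19, for instance, corresponds to an odds ratio of exp(1.19) ≈ 3.3 per standard deviation, since the features were standardized. A short sketch:
# Odds ratios: values > 1 raise the odds of inpatient status per
# one-standard-deviation increase in the feature
odds_ratios = pd.Series(np.exp(lr_fit.coef_.ravel()), index = x_train.columns)
odds_ratios.sort_values(ascending = False)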
Running the random forest model
rf_fit = RandomForestClassifier()
rf_fit.fit(x_train, y_train)
## RandomForestClassifier()
score = rf_fit.score(x_test, y_test)
score.round(2)
## 0.75
The accuracy is 75%.
Creating the confusion matrix.
predictions = rf_fit.predict(x_test)
cmtx = pd.DataFrame(
metrics.confusion_matrix(y_test, predictions),
index=['true: outpatient', 'true: inpatient'],
columns=['pred: outpatient', 'pred: inpatient']
)
cmtx
## pred: outpatient pred: inpatient
## true: outpatient 422 84
## true: inpatient 124 198
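Accuracy and the confusion matrix both depend on the default 0.5 decision threshold. A threshold-free comparison of the two models can be made with ROC AUC computed from predicted probabilities; a sketch (the resulting values are not shown here):
# Threshold-independent comparison of the two classifiers
for name, model in [('logistic regression', lr_fit), ('random forest', rf_fit)]:
    probs = model.predict_proba(x_test)[:, 1]  # predicted probability of inpatient
    print(f"{name}: ROC AUC = {metrics.roc_auc_score(y_test, probs):.3f}")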
Question: What are the five most important variables based on feature importance?
# Pair each importance with the column the model was trained on: x_train.columns,
# not df.columns, whose order differs and which still contains the target
importances = pd.DataFrame(
    {'Gini-importance': rf_fit.feature_importances_},
    index = x_train.columns
)
importances.sort_values(by = 'Gini-importance', ascending = False)
##               Gini-importance
## thrombocyte          0.212562
## leucocyte            0.128001
## haematocrit          0.109839
## erythrocyte          0.105465
## age                  0.096791
## haemoglobins         0.090135
## mcv                  0.085937
## mchc                 0.079435
## mch                  0.074821
## sex_f                0.017015
Based on the model, the five most important variables are thrombocyte, leucocyte, haematocrit, erythrocyte, and age.
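Gini importance can be biased toward continuous features with many distinct values, so permutation importance on the held-out test set is a useful cross-check. A minimal sketch:
from sklearn.inspection import permutation_importance
# Measure how much test accuracy drops when each feature is shuffled
result = permutation_importance(rf_fit, x_test, y_test, n_repeats = 10, random_state = 0)
pd.Series(result.importances_mean, index = x_test.columns).sort_values(ascending = False)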