In hospitals, medical treatments and surgeries can be categorized into inpatient and outpatient procedures. For patients, it is important to understand the difference between these two types of care, because they impact the length of a patient’s stay in a medical facility and the cost of a procedure.
The Patient Treatment Classification dataset is an Electronic Health Record collected from a private hospital in Indonesia. It contains patients’ demographics (i.e., age, sex) and laboratory test results that can be used to determine the next patient care setting: inpatient or outpatient care. The target variable, source, is labelled 1 for inpatient and 0 for outpatient.
Source: https://www.kaggle.com/datasets/manishkc06/patient-treatment-classification
Build machine learning models to predict whether a patient should receive inpatient or outpatient care based on their laboratory test results, using the Patient Treatment Classification dataset. Here, we will build logistic regression and random forest models.
Loading the modules and data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
df = pd.read_csv("../data/patient_treatment_classification.csv")
df.columns = df.columns.str.lower()
Restructuring the data
# Encode sex as a 0/1 dummy variable (1 = female) and store the categorical columns as the category dtype
df['sex'] = df['sex'].apply(lambda x: 1 if x=='F' else 0).astype('category')
df = df.rename(columns = {'sex':'sex_f'})
df['source'] = df['source'].astype('category')
with pd.option_context('display.max_columns', None):
    print(df.head())
## haematocrit haemoglobins erythrocyte leucocyte thrombocyte mch mchc \
## 0 33.8 11.1 4.18 4.6 150 26.6 32.8
## 1 44.6 14.0 6.86 6.3 232 20.4 31.4
## 2 42.9 14.0 4.57 6.2 336 30.6 32.6
## 3 41.9 14.4 4.67 3.5 276 30.8 34.4
## 4 40.6 13.3 4.85 14.9 711 27.4 32.8
##
## mcv age sex_f source
## 0 80.9 33 1 1
## 1 65.0 36 0 0
## 2 93.9 70 1 0
## 3 89.7 18 1 0
## 4 83.7 36 0 0
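Before splitting the data, it is worth checking how balanced the two classes are, since overall accuracy can be misleading when one class dominates. A quick check (the exact proportions are not shown here):
# Share of outpatient (0) vs. inpatient (1) records in the target
df['source'].value_counts(normalize = True)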
Splitting the data into train and test sets and scaling the x matrices.
x_train, x_test, y_train, y_test = train_test_split(
    df[df.columns.difference(['source'])], df['source'],
    test_size = 0.25, random_state = 0
)
# Scale the numeric features; fit the scaler on the training set only and
# reuse it on the test set to avoid data leakage
scaler = StandardScaler()
num_cols = x_train.columns.difference(['sex_f'])
x_train[num_cols] = scaler.fit_transform(x_train[num_cols])
x_test[num_cols] = scaler.transform(x_test[num_cols])
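As an alternative to scaling by hand, the preprocessing can be bundled with the estimator in a scikit-learn Pipeline, which guarantees the scaler is only ever fit on training data. A minimal sketch (it would be fit on the unscaled split, so it replaces the manual scaling above rather than following it):
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Scale every feature except the sex_f dummy; pass sex_f through unchanged
preprocess = ColumnTransformer(
    [('scale', StandardScaler(), x_train.columns.difference(['sex_f']))],
    remainder = 'passthrough'
)
lr_pipe = Pipeline([
    ('prep', preprocess),
    ('model', LogisticRegression(max_iter = 250))
])
# lr_pipe.fit(x_train, y_train) would then fit the scaler and model together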
Running the logistic regression model
lr_fit = LogisticRegression(max_iter = 250)
lr_fit.fit(x_train, y_train)
## LogisticRegression(max_iter=250)
score = lr_fit.score(x_test, y_test)
score.round(2)
## 0.73
The accuracy is 73%.
Creating the confusion matrix.
predictions = lr_fit.predict(x_test)
cmtx = pd.DataFrame(
metrics.confusion_matrix(y_test, predictions),
index=['true: outpatient', 'true: inpatient'],
columns=['pred: outpatient', 'pred: inpatient']
)
cmtx
## pred: outpatient pred: inpatient
## true: outpatient 446 60
## true: inpatient 164 158
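Because 164 of the 322 inpatients are misclassified, per-class precision and recall are more informative here than overall accuracy: inpatient recall is only 158/322 ≈ 49%. scikit-learn’s classification_report summarizes both per class:
# Per-class precision, recall and F1 score for the logistic regression model
print(metrics.classification_report(y_test, predictions, target_names = ['outpatient', 'inpatient']))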
Question: Which variables are positively correlated with the probability of inpatient status?
coefs = pd.DataFrame({'features': x_train.columns,
                      'coef': lr_fit.coef_.ravel().round(2)})
pos_coefs = coefs.loc[coefs['coef'] >= 0]
neg_coefs = coefs.loc[coefs['coef'] < 0]
pos_coefs; neg_coefs
##        features  coef
## 0           age  0.09
## 2   haematocrit  1.19
## 4     leucocyte  0.41
## 6          mchc  0.44
## 7           mcv  0.06
##        features  coef
## 1   erythrocyte -0.70
## 3  haemoglobins -1.20
## 5           mch -0.60
## 8         sex_f -0.39
## 9   thrombocyte -0.72
Based on the model, inpatient status is more likely with higher age, haematocrit, leucocyte, mchc, and mcv values, with lower erythrocyte, haemoglobin, mch, and thrombocyte values, and with male sex. However, some of these coefficients (e.g., mcv at 0.06 and age at 0.09) are close to zero and may contribute little to the prediction.
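Since logistic regression coefficients are on the log-odds scale, exponentiating them yields odds ratios, which can be easier to interpret: haematocrit’s coefficient of 1.19, for instance, corresponds to an odds ratio of exp(1.19) ≈ 3.3 per standard deviation, since the features were standardized. A short sketch:
# Odds ratios: values > 1 raise the odds of inpatient status per
# one-standard-deviation increase in the feature
odds_ratios = pd.Series(np.exp(lr_fit.coef_.ravel()), index = x_train.columns)
odds_ratios.sort_values(ascending = False)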
Running the random forest model
rf_fit = RandomForestClassifier()
rf_fit.fit(x_train, y_train)
## RandomForestClassifier()
score = rf_fit.score(x_test, y_test)
score.round(2)
## 0.75
The accuracy is 75%.
Creating the confusion matrix.
predictions = rf_fit.predict(x_test)
cmtx = pd.DataFrame(
metrics.confusion_matrix(y_test, predictions),
index=['true: outpatient', 'true: inpatient'],
columns=['pred: outpatient', 'pred: inpatient']
)
cmtx
## pred: outpatient pred: inpatient
## true: outpatient 422 84
## true: inpatient 124 198
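Accuracy and the confusion matrix both depend on the default 0.5 decision threshold. A threshold-free comparison of the two models can be made with ROC AUC computed from predicted probabilities; a sketch (the resulting values are not shown here):
# Threshold-independent comparison of the two classifiers
for name, model in [('logistic regression', lr_fit), ('random forest', rf_fit)]:
    probs = model.predict_proba(x_test)[:, 1]  # predicted probability of inpatient
    print(f"{name}: ROC AUC = {metrics.roc_auc_score(y_test, probs):.3f}")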
Question: What are the five most important variables based on feature importance?
# Pair each importance with the column the model was trained on: x_train.columns,
# not df.columns, whose order differs and which still contains the target
importances = pd.DataFrame(
    {'Gini-importance': rf_fit.feature_importances_},
    index = x_train.columns
)
importances.sort_values(by = 'Gini-importance', ascending = False)
##               Gini-importance
## thrombocyte          0.212562
## leucocyte            0.128001
## haematocrit          0.109839
## erythrocyte          0.105465
## age                  0.096791
## haemoglobins         0.090135
## mcv                  0.085937
## mchc                 0.079435
## mch                  0.074821
## sex_f                0.017015
Based on the model, the five most important variables are thrombocyte, leucocyte, haematocrit, erythrocyte, and age.
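Gini importance can be biased toward continuous features with many distinct values, so permutation importance on the held-out test set is a useful cross-check. A minimal sketch:
from sklearn.inspection import permutation_importance
# Measure how much test accuracy drops when each feature is shuffled
result = permutation_importance(rf_fit, x_test, y_test, n_repeats = 10, random_state = 0)
pd.Series(result.importances_mean, index = x_test.columns).sort_values(ascending = False)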