2025-03-21

Introduction

Dataset:

  • In this project, we analyze a medicine dataset with the goal of recommending fever medication based on patient characteristics. The dataset includes information on Temperature, Age, Gender, BMI, symptoms (Headache, Body Ache, Fatigue), lifestyle factors (Physical Activity, Diet), environmental measures (Humidity, AQI), etc.

Problem we are solving:

  • Our goal is to clean and preprocess the data, perform thorough exploratory data analysis, and build and evaluate a classification tree model that predicts the recommended medication, so that patients can easily decide which medicine to take.

Loading Libraries

We begin by loading the libraries needed for data import, manipulation, visualization, and model building.

# readr: For reading CSV files quickly
library(readr)

# dplyr: For efficient data manipulation (mutate, filter, summarize)
library(dplyr)

# ggplot2: For creating advanced visualizations
library(ggplot2)

# rpart: For building decision trees (classification and regression)
library(rpart)

# rpart.plot: For visualizing the decision trees built with rpart
library(rpart.plot)

# For kable(), used on our confusion matrix
library(knitr)       

# For kable_styling() used on our confusion matrix
library(kableExtra)  

Importing The Dataset

Next, we import our dataset from a CSV file. We then display the first few rows to verify that the data loads correctly.

# Import the dataset from the specified file path (this path is specific to my system)
data <- read_csv(paste0("/Users/colepatola/Desktop/ASU-Year-3/",
                 "DAT-301/Project1/enhanced_fever_medicine_recommendation.csv"))

# Display the first six rows to confirm successful import
head(data) 
## # A tibble: 6 × 20
##   Temperature Fever_Severity   Age Gender   BMI Headache Body_Ache Fatigue
##         <dbl> <chr>          <dbl> <chr>  <dbl> <chr>    <chr>     <chr>  
## 1        36.1 Normal            89 Female  24   No       Yes       No     
## 2        37.5 Mild Fever        94 Male    26.6 No       No        No     
## 3        36.4 Normal            92 Male    27.8 No       No        No     
## 4        39.8 High Fever        66 Male    18.7 No       Yes       No     
## 5        39.3 High Fever        28 Male    21   No       Yes       No     
## 6        38.3 High Fever        48 Female  24.3 No       Yes       No     
## # ℹ 12 more variables: Chronic_Conditions <chr>, Allergies <chr>,
## #   Smoking_History <chr>, Alcohol_Consumption <chr>, Humidity <dbl>,
## #   AQI <dbl>, Physical_Activity <chr>, Diet_Type <chr>, Heart_Rate <dbl>,
## #   Blood_Pressure <chr>, Previous_Medication <chr>,
## #   Recommended_Medication <chr>

Data Overview

Before proceeding to data cleaning, we need a thorough understanding of our dataset’s structure and contents.

  • The str() function shows each column’s data type (numeric, character, factor).
  • The summary() function provides basic descriptive statistics (min, max, mean, median, and quartiles).
  • This guides our decisions about which columns need factor conversion, whether there are potential outliers, and if any values are missing.
# Display the structure (data type and dimensions) of the first column, using only the first six rows
str(head(data) %>% select(1))
## tibble [6 × 1] (S3: tbl_df/tbl/data.frame)
##  $ Temperature: num [1:6] 36.1 37.5 36.4 39.8 39.3 38.3
# Generate summary statistics (range, quartiles, mean, median) for the same first column and rows
summary(head(data) %>% select(1))
##   Temperature   
##  Min.   :36.10  
##  1st Qu.:36.67  
##  Median :37.90  
##  Mean   :37.90  
##  3rd Qu.:39.05  
##  Max.   :39.80

Data Cleaning and Pre-Processing

  • In this section, we ensure that each variable is in the correct format.

  • Many variables in our dataset represent categories (like gender or symptom presence), so we convert them to factors.

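  • When converting, we sometimes also specify the level order (for example, “Normal” -> “Mild Fever” -> “High Fever”) so that tables and plots list the categories in their natural progression. Note that factor() with levels alone still creates an unordered factor; ordered = TRUE would be needed for a model to treat the categories as ordinal.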

  • We also check for missing values that might interfere with the analysis.
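
The difference between fixing the level order and creating a truly ordered factor can be sketched as follows. This is a minimal illustration on a toy vector (the values are illustrative, not taken from the dataset), using only base R:

```r
# Toy vector standing in for the Fever_Severity column (illustrative values)
severity <- c("High Fever", "Normal", "Mild Fever", "High Fever")
lvls <- c("Normal", "Mild Fever", "High Fever")

# levels = ... fixes the order in which categories are listed,
# but the factor itself remains nominal (unordered)
nominal <- factor(severity, levels = lvls)

# ordered = TRUE additionally marks the categories as ordinal
ordinal <- factor(severity, levels = lvls, ordered = TRUE)

is.ordered(nominal)      # FALSE
is.ordered(ordinal)      # TRUE
levels(nominal)          # "Normal" "Mild Fever" "High Fever"
ordinal[1] > ordinal[2]  # TRUE: "High Fever" ranks above "Normal"
```

The conversion code in this report uses the first form, so the model is free to group non-adjacent categories together when it splits on such a variable.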


# Convert relevant character columns to factors, fixing the level order where applicable.
# Note: levels = ... only sets the level order; these factors stay unordered (no ordered = TRUE).
data <- data %>% 
  mutate(
    # Severity levels listed in their natural progression
    Fever_Severity = factor(Fever_Severity, levels = c("Normal", "Mild Fever", "High Fever")),
    Gender = factor(Gender),  # No inherent order

    # Yes/No indicators, with "No" as the first level
    Headache = factor(Headache, levels = c("No", "Yes")),
    Body_Ache = factor(Body_Ache, levels = c("No", "Yes")),
    Fatigue = factor(Fatigue, levels = c("No", "Yes")),
    Chronic_Conditions = factor(Chronic_Conditions, levels = c("No", "Yes")),
    Allergies = factor(Allergies, levels = c("No", "Yes")),
    Smoking_History = factor(Smoking_History, levels = c("No", "Yes")),
    Alcohol_Consumption = factor(Alcohol_Consumption, levels = c("No", "Yes")),
    
    # Activity levels listed from least to most active
    Physical_Activity = factor(Physical_Activity, levels = c("Sedentary", "Moderate", "Active")), 
    
    Diet_Type = factor(Diet_Type),  # No inherent order
    
    # Blood pressure levels listed from low to high
    Blood_Pressure = factor(Blood_Pressure, levels = c("Low", "Normal", "High")),
    
    Previous_Medication = factor(Previous_Medication),       # No inherent order
    Recommended_Medication = factor(Recommended_Medication)  # Target variable, no inherent order
  )

# Check for any missing values in each column; this will return a named vector with counts of NA values
missing_counts <- sapply(data, function(x) sum(is.na(x)))
missing_counts  # Print out the missing value counts for review (nothing was missing)
##            Temperature         Fever_Severity                    Age 
##                      0                      0                      0 
##                 Gender                    BMI               Headache 
##                      0                      0                      0 
##              Body_Ache                Fatigue     Chronic_Conditions 
##                      0                      0                      0 
##              Allergies        Smoking_History    Alcohol_Consumption 
##                      0                      0                      0 
##               Humidity                    AQI      Physical_Activity 
##                      0                      0                      0 
##              Diet_Type             Heart_Rate         Blood_Pressure 
##                      0                      0                      0 
##    Previous_Medication Recommended_Medication 
##                      0                      0

Exploratory Data Analysis

1) Distribution of Fever Severity

  • Important Notes:

    • This bar chart displays how many patients fall into each of the three categories: Normal, Mild Fever, and High Fever.

    • In the bar chart, we see that High Fever has the highest count, indicating a large portion of the dataset involves patients with more severe fever. Normal fever severity ranks second, and Mild Fever is the least common.

    • This skew in the data might imply we have more samples of patients in severe conditions, which can affect how we interpret the results or build our models (i.e., potential class imbalance).
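
A bar chart like the one described above could be produced roughly as follows. This is a sketch, not the report's original plotting code: the toy data frame and its counts are made up, and the labels are assumptions.

```r
library(ggplot2)

# Hypothetical toy data standing in for the real dataset (counts are made up)
toy <- data.frame(
  Fever_Severity = factor(
    c("Normal", "High Fever", "Mild Fever", "High Fever", "Normal", "High Fever"),
    levels = c("Normal", "Mild Fever", "High Fever")
  )
)

# geom_bar() counts the rows in each category; bars follow the factor's level order
p_severity <- ggplot(toy, aes(x = Fever_Severity)) +
  geom_bar() +
  labs(
    title = "Distribution of Fever Severity",
    x = "Fever Severity",
    y = "Number of patients"
  )

p_severity
```

With the real data, replacing toy with the cleaned data frame should give the same kind of chart, since Fever_Severity is already a factor with this level order.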


2) Temperature Distribution

  • Important Notes:

    • This histogram visualizes how the Temperature variable is distributed among all patients.

    • In this dataset, body temperature ranges from about 36°C to 40°C, with a peak around 38°C–39°C.

    • A higher concentration near the upper end (39°C–40°C) may reflect why so many patients are classified under High Fever in the previous plot.
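
A histogram of this kind might be drawn as sketched here. The six temperatures are the ones printed by head(data) earlier in this report; the bin width of 0.5°C and the bin boundary at 36°C are assumptions, not the settings used for the original figure.

```r
library(ggplot2)

# The six temperatures shown in the head(data) output above
toy <- data.frame(Temperature = c(36.1, 37.5, 36.4, 39.8, 39.3, 38.3))

# binwidth and boundary are illustrative choices for the 36-40 degree range
p_temp <- ggplot(toy, aes(x = Temperature)) +
  geom_histogram(binwidth = 0.5, boundary = 36, color = "white") +
  labs(
    title = "Temperature Distribution",
    x = "Temperature (°C)",
    y = "Number of patients"
  )

p_temp
```

Choosing the bin width explicitly avoids the default of 30 bins, which tends to look noisy on a narrow range like this one.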


3) Recommended Medicine Count

  • Important Notes:

    • This bar chart illustrates how many times each medication (Ibuprofen or Paracetamol) is recommended in the dataset.

    • In the output, Ibuprofen is recommended far more frequently than Paracetamol (roughly 800 vs. 200).

    • This imbalance could suggest that Ibuprofen is the default choice in many scenarios.
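
The counts behind such a chart can also be tabulated directly, which makes the imbalance easy to quantify. The following is a small sketch with dplyr on made-up toy data; the column name share is a hypothetical choice.

```r
library(dplyr)

# Hypothetical toy data (the real dataset has far more rows)
toy <- data.frame(
  Recommended_Medication = c("Ibuprofen", "Ibuprofen", "Paracetamol",
                             "Ibuprofen", "Ibuprofen")
)

# Count rows per medication and express each count as a share of the total
med_counts <- toy %>%
  count(Recommended_Medication, name = "n") %>%
  mutate(share = n / sum(n))

med_counts
```

On the toy data this yields 4 Ibuprofen rows (share 0.8) and 1 Paracetamol row (share 0.2), the same 80/20 proportion described above.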


4) Recommended Medicine in Relation to Temperature

  • Important Notes:

    • This boxplot compares temperature ranges for each recommended medication.

    • From the plot, Ibuprofen seems to be used across a broader or higher temperature range, while Paracetamol is recommended at a slightly lower or narrower range of temperatures.

    • Furthermore, the median temperature for Ibuprofen appears higher than that for Paracetamol, suggesting that Ibuprofen may be favored for more severe fevers in this dataset.
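
A boxplot comparison like this, together with the group medians it visualizes, could be sketched as follows. The toy pairings of temperature and medication are invented for illustration and do not come from the dataset.

```r
library(ggplot2)

# Hypothetical toy data; the temperature/medication pairings are invented
toy <- data.frame(
  Recommended_Medication = factor(rep(c("Ibuprofen", "Paracetamol"), each = 3)),
  Temperature = c(39.8, 39.3, 38.3, 37.5, 37.9, 38.0)
)

# One box per medication, showing the spread of temperatures
p_box <- ggplot(toy, aes(x = Recommended_Medication, y = Temperature)) +
  geom_boxplot() +
  labs(
    title = "Temperature by Recommended Medication",
    x = "Recommended Medication",
    y = "Temperature (°C)"
  )

# Median temperature per medication, to back up what the boxes show
group_medians <- tapply(toy$Temperature, toy$Recommended_Medication, median)
group_medians

p_box
```

Printing the medians alongside the plot turns a visual impression ("appears higher") into a number that can be quoted.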

Model Creation

Explanation of our Classification tree

  1. Root Node: The node label reads “Ibuprofen 642 / 800”. This means that if you labeled all 800 training patients as Ibuprofen, you would be correct for 642 of them (about 80%).

  2. Split on Fever_Severity: During training, the tree found that the single best variable to split on is Fever_Severity. Because this one split already produces two pure nodes, the tree has only one layer.

  3. Left Child Node: The label “Ibuprofen 642 / 642” means all 642 individuals in that node received Ibuprofen, so the node is 100% accurate and there is no reason to split further.

  4. Right Child Node: This node contains all Mild Fever patients. The label “Paracetamol 158 / 158” shows that the node is 100% accurate, so the model has no reason to split further.
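
A tree with this shape can be reproduced in miniature. The sketch below is not the report's original modeling code: the toy data, the 80/20 split, the seed, and the formula are all assumptions, constructed so that medication is fully determined by fever severity, as the tree above suggests. The extra = 2 argument to rpart.plot() is also an assumption; it is the setting that prints "correct / total" labels like those described.

```r
library(rpart)

# Hypothetical toy data: medication depends only on fever severity,
# mimicking the pattern the tree found; Age is unrelated filler
severity_levels <- c("Normal", "Mild Fever", "High Fever")
toy <- data.frame(
  Fever_Severity = factor(rep(severity_levels, times = c(30, 20, 50)),
                          levels = severity_levels),
  Age = rep(c(25, 40, 60, 75, 33), times = 20)
)
toy$Recommended_Medication <- factor(
  ifelse(toy$Fever_Severity == "Mild Fever", "Paracetamol", "Ibuprofen")
)

# Assumed 80/20 train/test split with a fixed seed for reproducibility
set.seed(123)
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit a classification tree on all remaining columns
fit <- rpart(Recommended_Medication ~ ., data = train, method = "class")

# Variables actually used for splitting ("<leaf>" rows are terminal nodes)
split_vars <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")
split_vars

# Draw the tree only if rpart.plot is installed
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::rpart.plot(fit, extra = 2)
}
```

On this toy data the only split variable is Fever_Severity: once both child nodes are pure, no further split can improve the fit, which is the same reason the tree in this report stops after one layer.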

Model Evaluation

Confusion Matrix

              Ibuprofen   Paracetamol
Ibuprofen           159             0
Paracetamol           0            41
## Model Accuracy: 100.00%

Explanation of our confusion matrix

  • The diagonal cells represent correct classifications (where Predicted = Actual).

  • The off-diagonal cells represent misclassifications (where Predicted != Actual).

This confusion matrix indicates that the model correctly classified all 200 test cases:

  • It predicted ‘Ibuprofen’ 159 times for patients who actually received Ibuprofen and never mislabeled them as ‘Paracetamol’.

  • It predicted ‘Paracetamol’ 41 times for patients who actually received Paracetamol and never mislabeled them as ‘Ibuprofen’.

  • Since there are no off-diagonal errors (0 and 0), the model achieved 100% accuracy.
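
The accuracy figure follows directly from the matrix: correct predictions on the diagonal divided by all predictions. The sketch below rebuilds the matrix from the counts reported above using base R; the vectors are reconstructed from those counts rather than taken from the actual model output, and the axis labels Predicted and Actual are assumed names.

```r
# Reconstruct actual labels from the reported counts (159 and 41)
actual <- factor(rep(c("Ibuprofen", "Paracetamol"), times = c(159, 41)))

# Every prediction matched the actual label, as the matrix shows
predicted <- actual

# Cross-tabulate predictions against actual labels
conf_mat <- table(Predicted = predicted, Actual = actual)
conf_mat

# Accuracy = correct predictions (diagonal) / all predictions
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
cat(sprintf("Model Accuracy: %.2f%%\n", 100 * accuracy))
```

Here the diagonal sums to 159 + 41 = 200 out of 200 cases, which gives the 100.00% printed above.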

Conclusion

Overall, in this project we:

  • Imported and explored a dataset containing various patient characteristics and their recommended fever medication.

  • Cleaned and pre-processed the data by converting columns to appropriate types and checking for missing values.

  • Performed exploratory data analysis (EDA) to visualize distributions and relationships between variables.

  • Built a classification tree model to predict which medication is recommended based on patient attributes.

  • Evaluated the model’s performance using a confusion matrix and accuracy.