Central Hospital

Akanni, Samuel Ifeoluwa

2024-11-21

INTODUCTION

This is a presentation on the data exploration and visualization, and data cleaning and preprocessing of Central Hospital data set

Data Exploration and Visualization

  1. Load and explore the dataset using RStudio.

  2. Create visualizations (e.g., histograms, scatter plots, bar charts) to understand data distribution.

  3. Perform principal component analysis (PCA) or clustering (k-means) on a dataset.

Data Cleaning and Preprocessing

  1. Handle missing values in a dataset using various methods (e.g., mean imputation, median imputation)

  2. Remove duplicates and outliers from a dataset.

  3. Transform categorical variables into numerical variables.

Import the data set

setwd("C:/Users/USER PC/Documents/GIT/Data Analytics with Prof/data sets")
central <- read.csv("Central Hospital.csv")
head(central)
##   case_id generation date_infection date_onset date_hospitalisation
## 1  275cc7          5     2014-05-24 2014-05-27           2014-05-28
## 2  64c8ef          6                2014-07-14           2014-07-15
## 3  e56412          9     2014-07-12 2014-07-15           2014-07-17
## 4  62a2ef          9     2014-07-02 2014-07-17           2014-07-19
## 5  92e129         10     2014-08-01 2014-08-10           2014-08-12
## 6  240da4         10                2014-08-20           2014-08-20
##   date_outcome outcome gender age age_unit age_years age_cat age_cat5
## 1   2014-06-07   Death      f   4    years         4     0-4      0-4
## 2   2014-07-25 Recover      m  30    years        30   30-49    30-34
## 3   2014-07-19   Death      f  23    years        23   20-29    20-24
## 4   2014-07-23   Death      f  14    years        14   10-14    10-14
## 5   2014-08-22 Recover      f  42    years        42   30-49    40-44
## 6                Death      m   1    years         1     0-4      0-4
##           hospital       lon      lat infector  source wt_kg ht_cm ct_blood
## 1 Central Hospital -13.23694 8.469161   e02f66   other    41    75       23
## 2 Central Hospital -13.26147 8.457506                     66   177       23
## 3 Central Hospital -13.23433 8.478321   894024 funeral    55   150       23
## 4 Central Hospital -13.23794 8.469774   b0c500   other    59   136       22
## 5 Central Hospital -13.23573 8.474329   4b501e   other    85   196       23
## 6 Central Hospital -13.26799 8.462502                     17    47       23
##   fever chills cough aches vomit temp time_admission      bmi days_onset_hosp
## 1    no     no   yes    no    no 37.2          09:21 72.88889               1
## 2    no    yes   yes    no    no 36.8          10:12 21.06674               1
## 3    no     no    no    no   yes 37.4          16:46 24.44444               2
## 4    no    yes   yes    no    no 36.6          13:52 31.89879               2
## 5                                36.7          14:25 22.12620               2
## 6    no    yes   yes    no   yes 36.3          16:06 76.95790               0
View(central)

Check the summary of the data

summary(central)
##    case_id            generation    date_infection      date_onset       
##  Length:454         Min.   : 2.00   Length:454         Length:454        
##  Class :character   1st Qu.:13.00   Class :character   Class :character  
##  Mode  :character   Median :16.00   Mode  :character   Mode  :character  
##                     Mean   :16.88                                        
##                     3rd Qu.:20.00                                        
##                     Max.   :36.00                                        
##                                                                          
##  date_hospitalisation date_outcome         outcome             gender         
##  Length:454           Length:454         Length:454         Length:454        
##  Class :character     Class :character   Class :character   Class :character  
##  Mode  :character     Mode  :character   Mode  :character   Mode  :character  
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##       age         age_unit           age_years       age_cat         
##  Min.   : 0.0   Length:454         Min.   : 0.00   Length:454        
##  1st Qu.: 7.0   Class :character   1st Qu.: 7.00   Class :character  
##  Median :15.0   Mode  :character   Median :15.00   Mode  :character  
##  Mean   :17.4                      Mean   :17.38                     
##  3rd Qu.:24.0                      3rd Qu.:24.00                     
##  Max.   :87.0                      Max.   :87.00                     
##  NA's   :9                         NA's   :9                         
##    age_cat5           hospital              lon              lat       
##  Length:454         Length:454         Min.   :-13.27   Min.   :8.448  
##  Class :character   Class :character   1st Qu.:-13.25   1st Qu.:8.460  
##  Mode  :character   Mode  :character   Median :-13.23   Median :8.468  
##                                        Mean   :-13.23   Mean   :8.469  
##                                        3rd Qu.:-13.22   3rd Qu.:8.479  
##                                        Max.   :-13.21   Max.   :8.490  
##                                                                        
##    infector            source              wt_kg            ht_cm      
##  Length:454         Length:454         Min.   : -2.00   Min.   : 15.0  
##  Class :character   Class :character   1st Qu.: 43.00   1st Qu.: 97.0  
##  Mode  :character   Mode  :character   Median : 57.50   Median :135.0  
##                                        Mean   : 55.13   Mean   :129.7  
##                                        3rd Qu.: 67.00   3rd Qu.:161.0  
##                                        Max.   :103.00   Max.   :335.0  
##                                                                        
##     ct_blood        fever              chills             cough          
##  Min.   :17.00   Length:454         Length:454         Length:454        
##  1st Qu.:20.00   Class :character   Class :character   Class :character  
##  Median :22.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :21.19                                                           
##  3rd Qu.:22.00                                                           
##  Max.   :25.00                                                           
##                                                                          
##     aches              vomit                temp      time_admission    
##  Length:454         Length:454         Min.   :35.7   Length:454        
##  Class :character   Class :character   1st Qu.:37.8   Class :character  
##  Mode  :character   Mode  :character   Median :38.8   Mode  :character  
##                                        Mean   :38.5                     
##                                        3rd Qu.:39.2                     
##                                        Max.   :40.4                     
##                                        NA's   :3                        
##       bmi         days_onset_hosp 
##  Min.   :-41.32   Min.   : 0.000  
##  1st Qu.: 24.10   1st Qu.: 1.000  
##  Median : 32.23   Median : 1.000  
##  Mean   : 44.57   Mean   : 1.852  
##  3rd Qu.: 47.71   3rd Qu.: 2.000  
##  Max.   :370.37   Max.   :12.000  
## 
dim(central)
## [1] 454  30

Check for missing values

# Identify columns with missing values or empty strings
missing_summary <- sapply(central, function(x) sum(is.na(x) | x == "" | x == "NA"))
# Filter and display only columns with missing-like values
missing_summary <- missing_summary[missing_summary > 0]
missing_summary
## date_infection   date_outcome        outcome         gender            age 
##            154             74             96             26              9 
##      age_years        age_cat       age_cat5       infector         source 
##              9              9              9            154            154 
##          fever         chills          cough          aches          vomit 
##             26             26             26             26             26 
##           temp time_admission 
##              3             63

#are all the ages inyears?
table(central$age_unit)
## 
## months  years 
##      1    453
#convert two ages in months to years
library(tidyverse)
central <- central %>% 
  mutate(
    age = ifelse(age_unit == "months", age /12, age),
    age_unit = ifelse(age_unit == "months", "years", age_unit)
  )
central$age <- round(central$age)

Remove some columns

Date of infection, source, and infector 3,17,18 has so many missing values so i will delete the columns date_onset4,5and date_hospipalize will be removed because the result of their difference is on column days_onset_hosp age_unit, age_years,age_cat 10,11,12 will be removed since we have a column for age already hospital will also be removed since every one is from the same hospital 14 also remove date_outcome 6

central1 <- central[, -c(3,4,5,6,10,11,12,14,17,18)]
head(central1)
##   case_id generation outcome gender age age_cat5       lon      lat wt_kg ht_cm
## 1  275cc7          5   Death      f   4      0-4 -13.23694 8.469161    41    75
## 2  64c8ef          6 Recover      m  30    30-34 -13.26147 8.457506    66   177
## 3  e56412          9   Death      f  23    20-24 -13.23433 8.478321    55   150
## 4  62a2ef          9   Death      f  14    10-14 -13.23794 8.469774    59   136
## 5  92e129         10 Recover      f  42    40-44 -13.23573 8.474329    85   196
## 6  240da4         10   Death      m   1      0-4 -13.26799 8.462502    17    47
##   ct_blood fever chills cough aches vomit temp time_admission      bmi
## 1       23    no     no   yes    no    no 37.2          09:21 72.88889
## 2       23    no    yes   yes    no    no 36.8          10:12 21.06674
## 3       23    no     no    no    no   yes 37.4          16:46 24.44444
## 4       22    no    yes   yes    no    no 36.6          13:52 31.89879
## 5       23                                36.7          14:25 22.12620
## 6       23    no    yes   yes    no   yes 36.3          16:06 76.95790
##   days_onset_hosp
## 1               1
## 2               1
## 3               2
## 4               2
## 5               2
## 6               0
dim(central1)
## [1] 454  20
view(central1)

Remove rows with missing values when age, age_cat5 and time_admission is not included

# remove age from central 1
central2 <- central1[,-c(5,6,18)]
rows_with_missing <- central2[rowSums(is.na(central2) | central2 == "") > 0, ]
head(rows_with_missing)
##    case_id generation outcome gender       lon      lat wt_kg ht_cm ct_blood
## 5   92e129         10 Recover      f -13.23573 8.474329    85   196       23
## 8   9b4647          9   Death      m -13.21540 8.482433    73   207       23
## 10  24ef7d         11 Recover      m -13.22377 8.461412    90   236       22
## 12  f5e8d0         11              m -13.25317 8.457906    77   147       21
## 13  7c67f1         14 Recover      f -13.21197 8.483053    59   185       22
## 14  5d6f13          7 Recover      m -13.26767 8.462242    71   175       23
##    fever chills cough aches vomit temp      bmi days_onset_hosp
## 5                                 36.7 22.12620               2
## 8                                 37.8 17.03657               0
## 10                                37.5 16.15915               2
## 12                                36.9 35.63330               0
## 13                                37.8 17.23886               0
## 14                                35.8 23.18367               2
nrow(rows_with_missing)
## [1] 141
View(central2)

We have 141 rows(Because the column outcome was not removed and sounds important) with missing values that are hard to fix, so i deleted 41 rows

# we have 37 rows with missing values that are hard to fix which is not too much so i will delete them
newcentral <- anti_join(central1, rows_with_missing)
dim(newcentral)
## [1] 313  20
head(newcentral)
##   case_id generation outcome gender age age_cat5       lon      lat wt_kg ht_cm
## 1  275cc7          5   Death      f   4      0-4 -13.23694 8.469161    41    75
## 2  64c8ef          6 Recover      m  30    30-34 -13.26147 8.457506    66   177
## 3  e56412          9   Death      f  23    20-24 -13.23433 8.478321    55   150
## 4  62a2ef          9   Death      f  14    10-14 -13.23794 8.469774    59   136
## 5  240da4         10   Death      m   1      0-4 -13.26799 8.462502    17    47
## 6  31e797         13   Death      m  49    45-49 -13.26021 8.457613    82   220
##   ct_blood fever chills cough aches vomit temp time_admission      bmi
## 1       23    no     no   yes    no    no 37.2          09:21 72.88889
## 2       23    no    yes   yes    no    no 36.8          10:12 21.06674
## 3       23    no     no    no    no   yes 37.4          16:46 24.44444
## 4       22    no    yes   yes    no    no 36.6          13:52 31.89879
## 5       23    no    yes   yes    no   yes 36.3          16:06 76.95790
## 6       21    no     no   yes    no    no 36.2          13:10 16.94215
##   days_onset_hosp
## 1               1
## 2               1
## 3               2
## 4               2
## 5               0
## 6               2

Check for missing values again

# I will check for missing values again
missing_summary2 <- sapply(newcentral, function(x) sum(is.na(x) | x == "" | x == "NA"))
missing_summary2 <- missing_summary2[missing_summary2 > 0]
missing_summary2
## time_admission 
##             41

lets work on the time_admission

 # Identify rows with invalid time formats in the "time_admission" column
invalid_times <- newcentral[!grepl("^([01]\\d|2[0-3]):([0-5]\\d)(:[0-5]\\d)?$", newcentral$time_admission), ]

# View the problematic rows
invalid_times$time_admission
##  [1] ""      ""      ""      ""      ""      ""      ""      ""      ""     
## [10] ""      ""      "11:60" ""      "24:20" ""      ""      ""      ""     
## [19] ""      ""      ""      ""      ""      ""      ""      ""      ""     
## [28] ""      ""      ""      ""      "14:60" ""      ""      ""      ""     
## [37] ""      ""      ""      ""      ""      ""      ""      ""

Correct the 12:60 and and the 13:60

newcentral$time_admission[newcentral$time_admission == "11:60"] <- "11:59"
newcentral$time_admission[newcentral$time_admission == "14:60"] <- "14:59"
newcentral$time_admission[newcentral$time_admission == ""] <- "00:00"
newcentral$time_admission[newcentral$time_admission == "24:20"] <- "00:20"
#convert the time_admission column to a proper time format
library(lubridate)
# converting using lubridate's `hm` 
newcentral$time_admission <- lubridate::hm(newcentral$time_admission)

Convert categorical variable to booling

newcentral$gender <- ifelse(newcentral$gender == "m", 1, 0)
newcentral$fever <- ifelse(newcentral$fever == "yes", 1, 0)
newcentral$chills <- ifelse(newcentral$chills == "yes", 1, 0)
newcentral$cough <- ifelse(newcentral$cough == "yes", 1, 0)
newcentral$aches <- ifelse(newcentral$aches == "yes", 1, 0)
newcentral$vomit <- ifelse(newcentral$vomit == "yes", 1, 0)

I will check for missing values again

sum(is.na(newcentral))
## [1] 0
summary(newcentral)
##    case_id            generation      outcome              gender     
##  Length:313         Min.   : 2.00   Length:313         Min.   :0.000  
##  Class :character   1st Qu.:13.00   Class :character   1st Qu.:0.000  
##  Mode  :character   Median :16.00   Mode  :character   Median :1.000  
##                     Mean   :16.93                      Mean   :0.524  
##                     3rd Qu.:20.00                      3rd Qu.:1.000  
##                     Max.   :35.00                      Max.   :1.000  
##       age          age_cat5              lon              lat       
##  Min.   : 0.00   Length:313         Min.   :-13.27   Min.   :8.449  
##  1st Qu.: 7.00   Class :character   1st Qu.:-13.25   1st Qu.:8.460  
##  Median :15.00   Mode  :character   Median :-13.23   Median :8.468  
##  Mean   :17.64                      Mean   :-13.23   Mean   :8.469  
##  3rd Qu.:24.00                      3rd Qu.:-13.22   3rd Qu.:8.479  
##  Max.   :73.00                      Max.   :-13.21   Max.   :8.490  
##      wt_kg            ht_cm          ct_blood         fever       
##  Min.   :  0.00   Min.   : 26.0   Min.   :17.00   Min.   :0.0000  
##  1st Qu.: 44.00   1st Qu.: 96.0   1st Qu.:20.00   1st Qu.:1.0000  
##  Median : 58.00   Median :135.0   Median :22.00   Median :1.0000  
##  Mean   : 55.53   Mean   :129.7   Mean   :21.16   Mean   :0.7764  
##  3rd Qu.: 68.00   3rd Qu.:161.0   3rd Qu.:22.00   3rd Qu.:1.0000  
##  Max.   :100.00   Max.   :281.0   Max.   :25.00   Max.   :1.0000  
##      chills          cough            aches            vomit       
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.000   Median :1.0000   Median :0.0000   Median :1.0000  
##  Mean   :0.147   Mean   :0.8626   Mean   :0.1086   Mean   :0.5112  
##  3rd Qu.:0.000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##       temp       time_admission                           bmi        
##  Min.   :35.90   Min.   :0S                          Min.   :  0.00  
##  1st Qu.:38.20   1st Qu.:9H 5M 0S                    1st Qu.: 24.45  
##  Median :38.80   Median :12H 19M 0S                  Median : 32.63  
##  Mean   :38.57   Mean   :11H 33M 32.2044728434485S   Mean   : 44.15  
##  3rd Qu.:39.20   3rd Qu.:16H 6M 0S                   3rd Qu.: 47.90  
##  Max.   :40.40   Max.   :22H 40M 0S                  Max.   :370.37  
##  days_onset_hosp 
##  Min.   : 0.000  
##  1st Qu.: 1.000  
##  Median : 1.000  
##  Mean   : 1.904  
##  3rd Qu.: 3.000  
##  Max.   :11.000

Now lets Check for correlations and outliers

# Calculate correlations
subset_central <-  newcentral[,-c(1,3,6)]
cor_matrix <- cor(subset_central)
cor_matrix
##                    generation       gender          age          lon
## generation       1.0000000000 -0.043210346 -0.004295300  0.028329794
## gender          -0.0432103463  1.000000000  0.242972913  0.064615459
## age             -0.0042952998  0.242972913  1.000000000 -0.063946375
## lon              0.0283297940  0.064615459 -0.063946375  1.000000000
## lat             -0.0558026704  0.035849589 -0.058903768  0.086448612
## wt_kg            0.0377018920  0.353133891  0.838619948 -0.010091031
## ht_cm            0.0193576254  0.303663303  0.891307556 -0.006929528
## ct_blood         0.2976287002  0.082825712  0.048222869  0.044910119
## fever            0.0921457473  0.071808170 -0.045948027 -0.025652367
## chills          -0.0336678305 -0.001847142 -0.066410988  0.028399351
## cough           -0.0002445515 -0.064476085 -0.005755009  0.003176636
## aches           -0.0262540104 -0.016748795 -0.050737861 -0.045092735
## vomit            0.0598378425  0.014923548  0.007207912 -0.008932279
## temp             0.1325428997  0.049538999 -0.028018579 -0.008782092
## time_admission             NA           NA           NA           NA
## bmi             -0.0066983902 -0.155717147 -0.525988929 -0.104983100
## days_onset_hosp -0.3417351437 -0.097358077 -0.028462864  0.001975443
##                          lat        wt_kg        ht_cm     ct_blood
## generation      -0.055802670  0.037701892  0.019357625  0.297628700
## gender           0.035849589  0.353133891  0.303663303  0.082825712
## age             -0.058903768  0.838619948  0.891307556  0.048222869
## lon              0.086448612 -0.010091031 -0.006929528  0.044910119
## lat              1.000000000 -0.043512817 -0.072504430  0.040036147
## wt_kg           -0.043512817  1.000000000  0.885137783  0.074808786
## ht_cm           -0.072504430  0.885137783  1.000000000  0.063429932
## ct_blood         0.040036147  0.074808786  0.063429932  1.000000000
## fever            0.005492878 -0.039684905 -0.061990823 -0.047208797
## chills          -0.120296068 -0.139905055 -0.077011856  0.076064476
## cough           -0.043024795  0.051227921  0.015842098 -0.080871410
## aches            0.038387844 -0.044433085 -0.022972805 -0.009188953
## vomit            0.013657458 -0.017952742 -0.014660060  0.003453408
## temp            -0.009701322 -0.003385039 -0.054323848 -0.031792524
## time_admission            NA           NA           NA           NA
## bmi              0.089096430 -0.529545039 -0.713099072 -0.096466353
## days_onset_hosp -0.020829928 -0.045219005 -0.039753008 -0.656144736
##                        fever        chills         cough        aches
## generation       0.092145747 -0.0336678305 -0.0002445515 -0.026254010
## gender           0.071808170 -0.0018471424 -0.0644760847 -0.016748795
## age             -0.045948027 -0.0664109877 -0.0057550088 -0.050737861
## lon             -0.025652367  0.0283993512  0.0031766356 -0.045092735
## lat              0.005492878 -0.1202960685 -0.0430247954  0.038387844
## wt_kg           -0.039684905 -0.1399050547  0.0512279213 -0.044433085
## ht_cm           -0.061990823 -0.0770118561  0.0158420978 -0.022972805
## ct_blood        -0.047208797  0.0760644758 -0.0808714099 -0.009188953
## fever            1.000000000 -0.1237031649 -0.0137337470  0.064159932
## chills          -0.123703165  1.0000000000 -0.0702605111 -0.028905367
## cough           -0.013733747 -0.0702605111  1.0000000000 -0.069465611
## aches            0.064159932 -0.0289053670 -0.0694656107  1.000000000
## vomit            0.012006273 -0.0453873137  0.0182102773 -0.048888758
## temp             0.875433945 -0.1207631977 -0.0216745226  0.024110014
## time_admission            NA            NA            NA           NA
## bmi              0.033171459  0.0003451653  0.0155250299 -0.022285916
## days_onset_hosp -0.087511596 -0.0253555972  0.0554068700  0.006495686
##                        vomit         temp time_admission           bmi
## generation       0.059837842  0.132542900             NA -0.0066983902
## gender           0.014923548  0.049538999             NA -0.1557171470
## age              0.007207912 -0.028018579             NA -0.5259889291
## lon             -0.008932279 -0.008782092             NA -0.1049831002
## lat              0.013657458 -0.009701322             NA  0.0890964304
## wt_kg           -0.017952742 -0.003385039             NA -0.5295450388
## ht_cm           -0.014660060 -0.054323848             NA -0.7130990717
## ct_blood         0.003453408 -0.031792524             NA -0.0964663533
## fever            0.012006273  0.875433945             NA  0.0331714589
## chills          -0.045387314 -0.120763198             NA  0.0003451653
## cough            0.018210277 -0.021674523             NA  0.0155250299
## aches           -0.048888758  0.024110014             NA -0.0222859160
## vomit            1.000000000  0.007955103             NA  0.0636913962
## temp             0.007955103  1.000000000             NA  0.0557669485
## time_admission            NA           NA              1            NA
## bmi              0.063691396  0.055766948             NA  1.0000000000
## days_onset_hosp -0.014983646 -0.079754626             NA  0.0270226972
##                 days_onset_hosp
## generation         -0.341735144
## gender             -0.097358077
## age                -0.028462864
## lon                 0.001975443
## lat                -0.020829928
## wt_kg              -0.045219005
## ht_cm              -0.039753008
## ct_blood           -0.656144736
## fever              -0.087511596
## chills             -0.025355597
## cough               0.055406870
## aches               0.006495686
## vomit              -0.014983646
## temp               -0.079754626
## time_admission               NA
## bmi                 0.027022697
## days_onset_hosp     1.000000000

The correlation plot

library(corrplot)
# Create correlation plot
corrplot(cor_matrix, method="color", type="upper", 
         addCoef.col = "black", number.cex = 0.7,
         tl.col="black", tl.srt=90)

There is a strong positive correlation between age and weight, age and height, weight and height, fever and temperature. Also there is a strong negative correlation between blood count and difference between days on set and ct_blood, height and BMI, weight and BMI, age and BMI, which makes a lot of sense.

Confirm if there are outliers in age, ht_cm, bmi and days_onset_hosp

a <-ggplot(newcentral, aes(age))+
  geom_boxplot()
b <- ggplot(newcentral, aes(ht_cm))+
  geom_boxplot()
c <- ggplot(newcentral, aes(bmi))+
  geom_boxplot()
d <- ggplot(newcentral, aes(days_onset_hosp))+
  geom_boxplot()
e <- ggplot(newcentral, aes(wt_kg))+
  geom_boxplot()
f <- ggplot(newcentral, aes(temp))+
  geom_boxplot()

library(gridExtra)
grid.arrange(a,b,c,d,e,f, ncol = 3)

lets investigate and replace the outliers

# remove outliers in days_onset_hosp
cq <- quantile( newcentral$days_onset_hosp , 0.95)
newcentral$days_onset_hosp[newcentral$days_onset_hosp > cq]<- round(cq)
#lets take a look at the outliers in age
table(newcentral$age)
## 
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
##  4 11 10 18  9 10 10 15  7 13 11  8  7  8  9  9  9  7  6  7  7 13  6 12 12  4 
## 26 27 28 29 30 31 32 33 34 35 37 38 39 40 43 44 46 47 49 52 53 57 58 61 64 65 
##  6  3  4  5  7  2  2  6  1  6  2  2  5  2  1  1  1  3  3  2  1  1  1  1  1  1 
## 73 
##  1
newcentral %>% 
  filter(age > 50) %>% 
  select(age, age_cat5, ht_cm)
##   age age_cat5 ht_cm
## 1  61    60-64   251
## 2  58    55-59   228
## 3  52    50-54   220
## 4  64    60-64   240
## 5  53    50-54   215
## 6  52    50-54   206
## 7  57    55-59   237
## 8  65    65-69   281
## 9  73    70-74   274

since there is an age category between 60 and 74 i don’t think the ourliers are invalid

Outliers in weights

table(newcentral$wt_kg)
## 
##   0   3   5  10  11  13  15  16  17  18  21  23  24  25  26  27  28  30  31  32 
##   1   1   1   1   1   2   1   2   2   2   1   1   3   3   2   2   2   1   2   4 
##  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52 
##   4   2   4   2   7   6   3   2   7   5   1   4   2   3   4   2   3   9   9   7 
##  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72 
##   3   9  11   7   5   8   4  10   7   9  10   9   8   6   7   3   4   8   7   5 
##  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  91  95  97 
##   4   6   4   3   4   4   2   3   3   3   2   2   1   3   1   1   1   1   1   1 
##  98 100 
##   1   1
newcentral %>% 
  filter(wt_kg < 10) %>% 
  select(outcome, age, wt_kg, ht_cm, bmi)
##   outcome age wt_kg ht_cm      bmi
## 1   Death   0     5    33 45.91368
## 2 Recover   0     3    30 33.33333
## 3 Recover   0     0    26  0.00000

this is not quite clear, lets check the height

newcentral %>% 
  filter(ht_cm < 50 )%>% 
  select(outcome, age, wt_kg, ht_cm, bmi)
##    outcome age wt_kg ht_cm       bmi
## 1    Death   1    17    47  76.95790
## 2    Death   2    21    39 138.06706
## 3  Recover   3    32    46 151.22873
## 4  Recover   1    17    47  76.95790
## 5  Recover   2    16    40 100.00000
## 6    Death   0     5    33  45.91368
## 7  Recover   1    13    38  90.02770
## 8    Death   1    16    38 110.80332
## 9    Death   2    25    34 216.26298
## 10 Recover   0     3    30  33.33333
## 11 Recover   0    11    32 107.42188
## 12 Recover   0     0    26   0.00000
## 13   Death   2    18    48  78.12500
## 14   Death   3    35    48 151.90972
## 15   Death   1    25    42 141.72336
## 16 Recover   1    10    40  62.50000
## 17   Death   2    38    49 158.26739
## 18   Death   1    26    40 162.50000
## 19 Recover   1    13    42  73.69615
## 20   Death   3    23    49  95.79342
## 21 Recover   3    49    46 231.56900
## 22   Death   1    18    40 112.50000
## 23   Death   1    15    38 103.87812
## 24   Death   2    27    27 370.37037

From here there are some anomaly the range of height for a child in this range of age o - 2 should be between 49 and 100. I dont think there is much that could be done about this, i will just change the zero weight to 1

newcentral$wt_kg[newcentral$wt_kg == 0] <- 1

Outliers in height

##   age wt_kg ht_cm
## 1  65    86   281
## 2  73    97   274

these heights are possible but extreme lets leave them as they might give us information on some patients

outliers in temp

#outlier in temp
#lets check these temperatures
newcentral %>% 
  filter(temp < 36.5 )%>%
  select(temp, chills, aches, fever, cough, outcome)
##    temp chills aches fever cough outcome
## 1  36.3      1     0     0     1   Death
## 2  36.2      0     0     0     1   Death
## 3  35.9      1     0     0     1 Recover
## 4  36.4      0     0     0     1   Death
## 5  36.2      0     1     0     1 Recover
## 6  36.1      0     0     0     1 Recover
## 7  36.0      0     0     0     0   Death
## 8  36.4      1     0     0     1 Recover
## 9  36.4      0     0     0     0   Death
## 10 36.3      0     0     0     1 Recover

I think the low temperature is possible and those with it have some problems

outliers in bmi

# since weight and height do not have outliers, lets see if the bmi calculation is correct
newcentral$calculated_BMI <- newcentral$wt_kg / ((newcentral$ht_cm/100) ^ 2)
#Compare calculated BMI with the existing BMI column
newcentral$bmi_correct <- abs(newcentral$calculated_BMI - newcentral$bmi) < 0.01 
#Identify rows where BMI is incorrect
incorrect_BMI <- newcentral[!newcentral$bmi_correct, ] # Rows where BMI is not correct
cat("Number of incorrect BMI entries:", nrow(incorrect_BMI), "\n")
## Number of incorrect BMI entries: 1
# since all bmi is correct i dont think the outliers in bmi are invalid

#lets replace the invalid bmi
newcentral$bmi[newcentral$bmi == 0] <- 1 / ((26/100)^2)

lets move on to visualizations

lets look at the weights of different category of individuals we can see from here that the older one gets the heigher the weight or the height

Lets check those with abnormal weight

# ;ets first calculate mean and standard deviation of weight for each age group
weight_stats <- newcentral %>%
  group_by(age_cat5) %>%
  summarise(mean_weight = mean(wt_kg),
            sd_weight = sd(wt_kg))
# Join the calculated stats back with the original data
newcentral <- newcentral %>%
  left_join(weight_stats, by = "age_cat5") %>%
  mutate(is_abnormal = (wt_kg < mean_weight - 2 * sd_weight) #(a common statistical approach to flag extreme low values)
         | (wt_kg > mean_weight + 2 * sd_weight))
#lets plot the graph
ggplot(newcentral, aes(x = age, fill= is_abnormal)) +
  geom_bar(position = "dodge") +
  scale_color_manual(values = c("black", "red")) +
  labs(title = "Abnormal Weights by Age",
       x = "Age (Years)",
       y = "Weight (kg)",
       color = "Abnormal Weight") +
  theme_minimal()

lets check how many of these people of good or abnormal weight survives

the weight does not matter much in if they survive or not

lets see what age group has the most abnormal weight

From here we can see that only age range from 0 - 34 experience underweight or overweight adults from 35 above do not

Patients with low temperature and the symptoms they encounter

lt_only <- subset(newcentral, temp < 36.5)
t1 <- ggplot(lt_only, aes(x = as.factor(fever), fill = as.factor(fever))) +
  geom_bar() +
  scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  theme_minimal()
t2 <- ggplot(lt_only, aes(x = as.factor(chills), fill = as.factor(chills))) +
  geom_bar() +
  scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  theme_minimal()
t3 <- ggplot(lt_only, aes(x = as.factor(aches), fill = as.factor(aches))) +
  geom_bar() +
  scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  theme_minimal()
t4 <- ggplot(lt_only, aes(x = as.factor(cough), fill = as.factor(cough))) +
  geom_bar() +
  scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  theme_minimal()
library(grid) 
library(gridExtra)
grid.arrange(t1,t2,t3,t4, ncol = 2,
             top = textGrob("patients with low temperature and their symptoms", gp = gpar(fontsize = 16, fontface = "bold")))

Patients with low temperature and the symptoms they encounter

ht_only <- subset(newcentral, temp > 38)
t1 <- ggplot(ht_only, aes(x = as.factor(fever), fill = as.factor(fever))) +
  geom_bar() +
  scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  theme_minimal()
t2 <- ggplot(ht_only, aes(x = as.factor(chills), fill = as.factor(chills))) +
  geom_bar() +
  scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  theme_minimal()
t3 <- ggplot(ht_only, aes(x = as.factor(aches), fill = as.factor(aches))) +
  geom_bar() +
  scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  theme_minimal()
t4 <- ggplot(ht_only, aes(x = as.factor(cough), fill = as.factor(cough))) +
  geom_bar() +
  scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  theme_minimal()
library(grid) 
library(gridExtra)
grid.arrange(t1,t2,t3,t4, ncol = 2,
             top = textGrob("patients with low temperature and their symptoms", gp = gpar(fontsize = 16, fontface = "bold")))

lets take a look at the blood counts

Since there a correlation between blood cound and the number of days it takes between showing symptoms and vising the hospital lets take a look visually

ggplot(newcentral, aes(x = days_onset_hosp, y = ct_blood)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red", se = TRUE) +  # Regression line with confidence interval
  labs(title = "Blood Count Over Time", x = "symptom-to-admission delay", y = "Blood Count")

This shows a declining blood count which means that the patients’s condition worsens as they wait to seek medical care.

Does this delay affect the outcome of each patients ?

ggplot(newcentral, aes(days_onset_hosp, fill = outcome)) +
  geom_bar(position = "dodge")+
  labs(title = "Admission delay and outcome", x ="symptom-to-admission delay" )

\ We can see that there are relatively more deaths than recoveries for those who experience delay in visition the hospital.

What age category survive or recover

ggplot(newcentral, aes(age_cat5, fill = outcome ))+
  geom_bar(position = "dodge")+
  xlab("Age category")+
  ylab("No of deaths or recovery")

there are more death than recovery among children between age 0-9, and there are more death generally than recovery.

lets check the relationship between how much time it takes between knowing the infection and being in the hospital and the outcome

ggplot(newcentral, aes(days_onset_hosp, fill = outcome))+
  geom_bar(position = "dodge")

More people get recovered visiting the hospital earlier, than late. but visiting the hospital earlier does not reduce death rate.

compare symptoms with outcome

death_only <- subset(newcentral, outcome == "Death")
f <- ggplot(death_only, aes(as.factor(fever), fill = outcome))+
  geom_bar()
g <- ggplot(death_only, aes(as.factor(chills), fill = outcome))+
  geom_bar()
h <- ggplot(death_only, aes(as.factor(aches), fill = outcome))+
  geom_bar()
i <- ggplot(death_only, aes(as.factor(vomit), fill = outcome))+
  geom_bar()
j <- ggplot(death_only, aes(as.factor(cough), fill = outcome))+
  geom_bar()
grid.arrange(f,g,h,i,j, ncol = 3)

This show that there is more death among those with symptoms such as fever, vomits and cough.

lets talk about the bmi again

# some bmi values are  not within the healthy range shows that there are some people with unhealthy bmi
# <18.5 and >=30 are unhealthy bmi
newcentral$bmi_range <- ifelse(newcentral$bmi < 18.5, "< 18.5",
                         ifelse(newcentral$bmi >= 18.5 & newcentral$bmi < 30, "18.5 - 29.9",
                                ifelse(newcentral$bmi >= 30, ">= 30", NA)))
ggplot(newcentral, aes(bmi_range))+
  geom_bar(color = "yellow")

there are more obese people

ggplot(newcentral, aes(bmi_range, fill = outcome))+
  geom_bar(position = "dodge")

This graph is not too informative, so lets calculate percentage

lets calculate how many percent each of death or recovered we have for each range of bmis

## percentage of < 18.5 that recovered 47.82609
## percentage of 18.5 - 29.9 that recovered 54.71698
## percentage of >= 30 that recovered 38.04348

From this we found out that those that are obesed are less likely to recover