Akanni, Samuel Ifeoluwa
2024-11-21
This is a presentation on the data exploration and visualization, and data cleaning and preprocessing of Central Hospital data set
Data Exploration and Visualization
Load and explore the dataset using RStudio.
Create visualizations (e.g., histograms, scatter plots, bar charts) to understand data distribution.
Perform principal component analysis (PCA) or clustering (k-means) on a dataset.
Data Cleaning and Preprocessing
Handle missing values in a dataset using various methods (e.g., mean imputation, median imputation)
Remove duplicates and outliers from a dataset.
Transform categorical variables into numerical variables.
setwd("C:/Users/USER PC/Documents/GIT/Data Analytics with Prof/data sets")
central <- read.csv("Central Hospital.csv")
head(central)## case_id generation date_infection date_onset date_hospitalisation
## 1 275cc7 5 2014-05-24 2014-05-27 2014-05-28
## 2 64c8ef 6 2014-07-14 2014-07-15
## 3 e56412 9 2014-07-12 2014-07-15 2014-07-17
## 4 62a2ef 9 2014-07-02 2014-07-17 2014-07-19
## 5 92e129 10 2014-08-01 2014-08-10 2014-08-12
## 6 240da4 10 2014-08-20 2014-08-20
## date_outcome outcome gender age age_unit age_years age_cat age_cat5
## 1 2014-06-07 Death f 4 years 4 0-4 0-4
## 2 2014-07-25 Recover m 30 years 30 30-49 30-34
## 3 2014-07-19 Death f 23 years 23 20-29 20-24
## 4 2014-07-23 Death f 14 years 14 10-14 10-14
## 5 2014-08-22 Recover f 42 years 42 30-49 40-44
## 6 Death m 1 years 1 0-4 0-4
## hospital lon lat infector source wt_kg ht_cm ct_blood
## 1 Central Hospital -13.23694 8.469161 e02f66 other 41 75 23
## 2 Central Hospital -13.26147 8.457506 66 177 23
## 3 Central Hospital -13.23433 8.478321 894024 funeral 55 150 23
## 4 Central Hospital -13.23794 8.469774 b0c500 other 59 136 22
## 5 Central Hospital -13.23573 8.474329 4b501e other 85 196 23
## 6 Central Hospital -13.26799 8.462502 17 47 23
## fever chills cough aches vomit temp time_admission bmi days_onset_hosp
## 1 no no yes no no 37.2 09:21 72.88889 1
## 2 no yes yes no no 36.8 10:12 21.06674 1
## 3 no no no no yes 37.4 16:46 24.44444 2
## 4 no yes yes no no 36.6 13:52 31.89879 2
## 5 36.7 14:25 22.12620 2
## 6 no yes yes no yes 36.3 16:06 76.95790 0
## case_id generation date_infection date_onset
## Length:454 Min. : 2.00 Length:454 Length:454
## Class :character 1st Qu.:13.00 Class :character Class :character
## Mode :character Median :16.00 Mode :character Mode :character
## Mean :16.88
## 3rd Qu.:20.00
## Max. :36.00
##
## date_hospitalisation date_outcome outcome gender
## Length:454 Length:454 Length:454 Length:454
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## age age_unit age_years age_cat
## Min. : 0.0 Length:454 Min. : 0.00 Length:454
## 1st Qu.: 7.0 Class :character 1st Qu.: 7.00 Class :character
## Median :15.0 Mode :character Median :15.00 Mode :character
## Mean :17.4 Mean :17.38
## 3rd Qu.:24.0 3rd Qu.:24.00
## Max. :87.0 Max. :87.00
## NA's :9 NA's :9
## age_cat5 hospital lon lat
## Length:454 Length:454 Min. :-13.27 Min. :8.448
## Class :character Class :character 1st Qu.:-13.25 1st Qu.:8.460
## Mode :character Mode :character Median :-13.23 Median :8.468
## Mean :-13.23 Mean :8.469
## 3rd Qu.:-13.22 3rd Qu.:8.479
## Max. :-13.21 Max. :8.490
##
## infector source wt_kg ht_cm
## Length:454 Length:454 Min. : -2.00 Min. : 15.0
## Class :character Class :character 1st Qu.: 43.00 1st Qu.: 97.0
## Mode :character Mode :character Median : 57.50 Median :135.0
## Mean : 55.13 Mean :129.7
## 3rd Qu.: 67.00 3rd Qu.:161.0
## Max. :103.00 Max. :335.0
##
## ct_blood fever chills cough
## Min. :17.00 Length:454 Length:454 Length:454
## 1st Qu.:20.00 Class :character Class :character Class :character
## Median :22.00 Mode :character Mode :character Mode :character
## Mean :21.19
## 3rd Qu.:22.00
## Max. :25.00
##
## aches vomit temp time_admission
## Length:454 Length:454 Min. :35.7 Length:454
## Class :character Class :character 1st Qu.:37.8 Class :character
## Mode :character Mode :character Median :38.8 Mode :character
## Mean :38.5
## 3rd Qu.:39.2
## Max. :40.4
## NA's :3
## bmi days_onset_hosp
## Min. :-41.32 Min. : 0.000
## 1st Qu.: 24.10 1st Qu.: 1.000
## Median : 32.23 Median : 1.000
## Mean : 44.57 Mean : 1.852
## 3rd Qu.: 47.71 3rd Qu.: 2.000
## Max. :370.37 Max. :12.000
##
## [1] 454 30
# Identify columns with missing values or empty strings
missing_summary <- sapply(central, function(x) sum(is.na(x) | x == "" | x == "NA"))
# Filter and display only columns with missing-like values
missing_summary <- missing_summary[missing_summary > 0]
missing_summary## date_infection date_outcome outcome gender age
## 154 74 96 26 9
## age_years age_cat age_cat5 infector source
## 9 9 9 154 154
## fever chills cough aches vomit
## 26 26 26 26 26
## temp time_admission
## 3 63
##
## months years
## 1 453
Date of infection, source, and infector 3,17,18 has so many missing values so i will delete the columns date_onset4,5and date_hospipalize will be removed because the result of their difference is on column days_onset_hosp age_unit, age_years,age_cat 10,11,12 will be removed since we have a column for age already hospital will also be removed since every one is from the same hospital 14 also remove date_outcome 6
## case_id generation outcome gender age age_cat5 lon lat wt_kg ht_cm
## 1 275cc7 5 Death f 4 0-4 -13.23694 8.469161 41 75
## 2 64c8ef 6 Recover m 30 30-34 -13.26147 8.457506 66 177
## 3 e56412 9 Death f 23 20-24 -13.23433 8.478321 55 150
## 4 62a2ef 9 Death f 14 10-14 -13.23794 8.469774 59 136
## 5 92e129 10 Recover f 42 40-44 -13.23573 8.474329 85 196
## 6 240da4 10 Death m 1 0-4 -13.26799 8.462502 17 47
## ct_blood fever chills cough aches vomit temp time_admission bmi
## 1 23 no no yes no no 37.2 09:21 72.88889
## 2 23 no yes yes no no 36.8 10:12 21.06674
## 3 23 no no no no yes 37.4 16:46 24.44444
## 4 22 no yes yes no no 36.6 13:52 31.89879
## 5 23 36.7 14:25 22.12620
## 6 23 no yes yes no yes 36.3 16:06 76.95790
## days_onset_hosp
## 1 1
## 2 1
## 3 2
## 4 2
## 5 2
## 6 0
## [1] 454 20
# remove age from central 1
central2 <- central1[,-c(5,6,18)]
rows_with_missing <- central2[rowSums(is.na(central2) | central2 == "") > 0, ]
head(rows_with_missing)## case_id generation outcome gender lon lat wt_kg ht_cm ct_blood
## 5 92e129 10 Recover f -13.23573 8.474329 85 196 23
## 8 9b4647 9 Death m -13.21540 8.482433 73 207 23
## 10 24ef7d 11 Recover m -13.22377 8.461412 90 236 22
## 12 f5e8d0 11 m -13.25317 8.457906 77 147 21
## 13 7c67f1 14 Recover f -13.21197 8.483053 59 185 22
## 14 5d6f13 7 Recover m -13.26767 8.462242 71 175 23
## fever chills cough aches vomit temp bmi days_onset_hosp
## 5 36.7 22.12620 2
## 8 37.8 17.03657 0
## 10 37.5 16.15915 2
## 12 36.9 35.63330 0
## 13 37.8 17.23886 0
## 14 35.8 23.18367 2
## [1] 141
We have 141 rows(Because the column outcome was not removed and sounds important) with missing values that are hard to fix, so i deleted 41 rows
# we have 37 rows with missing values that are hard to fix which is not too much so i will delete them
newcentral <- anti_join(central1, rows_with_missing)
dim(newcentral)## [1] 313 20
## case_id generation outcome gender age age_cat5 lon lat wt_kg ht_cm
## 1 275cc7 5 Death f 4 0-4 -13.23694 8.469161 41 75
## 2 64c8ef 6 Recover m 30 30-34 -13.26147 8.457506 66 177
## 3 e56412 9 Death f 23 20-24 -13.23433 8.478321 55 150
## 4 62a2ef 9 Death f 14 10-14 -13.23794 8.469774 59 136
## 5 240da4 10 Death m 1 0-4 -13.26799 8.462502 17 47
## 6 31e797 13 Death m 49 45-49 -13.26021 8.457613 82 220
## ct_blood fever chills cough aches vomit temp time_admission bmi
## 1 23 no no yes no no 37.2 09:21 72.88889
## 2 23 no yes yes no no 36.8 10:12 21.06674
## 3 23 no no no no yes 37.4 16:46 24.44444
## 4 22 no yes yes no no 36.6 13:52 31.89879
## 5 23 no yes yes no yes 36.3 16:06 76.95790
## 6 21 no no yes no no 36.2 13:10 16.94215
## days_onset_hosp
## 1 1
## 2 1
## 3 2
## 4 2
## 5 0
## 6 2
# I will check for missing values again
missing_summary2 <- sapply(newcentral, function(x) sum(is.na(x) | x == "" | x == "NA"))
missing_summary2 <- missing_summary2[missing_summary2 > 0]
missing_summary2## time_admission
## 41
# Identify rows with invalid time formats in the "time_admission" column
invalid_times <- newcentral[!grepl("^([01]\\d|2[0-3]):([0-5]\\d)(:[0-5]\\d)?$", newcentral$time_admission), ]
# View the problematic rows
invalid_times$time_admission## [1] "" "" "" "" "" "" "" "" ""
## [10] "" "" "11:60" "" "24:20" "" "" "" ""
## [19] "" "" "" "" "" "" "" "" ""
## [28] "" "" "" "" "14:60" "" "" "" ""
## [37] "" "" "" "" "" "" "" ""
Correct the 12:60 and and the 13:60
newcentral$time_admission[newcentral$time_admission == "11:60"] <- "11:59"
newcentral$time_admission[newcentral$time_admission == "14:60"] <- "14:59"
newcentral$time_admission[newcentral$time_admission == ""] <- "00:00"
newcentral$time_admission[newcentral$time_admission == "24:20"] <- "00:20"
#convert the time_admission column to a proper time format
library(lubridate)
# converting using lubridate's `hm`
newcentral$time_admission <- lubridate::hm(newcentral$time_admission)newcentral$gender <- ifelse(newcentral$gender == "m", 1, 0)
newcentral$fever <- ifelse(newcentral$fever == "yes", 1, 0)
newcentral$chills <- ifelse(newcentral$chills == "yes", 1, 0)
newcentral$cough <- ifelse(newcentral$cough == "yes", 1, 0)
newcentral$aches <- ifelse(newcentral$aches == "yes", 1, 0)
newcentral$vomit <- ifelse(newcentral$vomit == "yes", 1, 0)## [1] 0
## case_id generation outcome gender
## Length:313 Min. : 2.00 Length:313 Min. :0.000
## Class :character 1st Qu.:13.00 Class :character 1st Qu.:0.000
## Mode :character Median :16.00 Mode :character Median :1.000
## Mean :16.93 Mean :0.524
## 3rd Qu.:20.00 3rd Qu.:1.000
## Max. :35.00 Max. :1.000
## age age_cat5 lon lat
## Min. : 0.00 Length:313 Min. :-13.27 Min. :8.449
## 1st Qu.: 7.00 Class :character 1st Qu.:-13.25 1st Qu.:8.460
## Median :15.00 Mode :character Median :-13.23 Median :8.468
## Mean :17.64 Mean :-13.23 Mean :8.469
## 3rd Qu.:24.00 3rd Qu.:-13.22 3rd Qu.:8.479
## Max. :73.00 Max. :-13.21 Max. :8.490
## wt_kg ht_cm ct_blood fever
## Min. : 0.00 Min. : 26.0 Min. :17.00 Min. :0.0000
## 1st Qu.: 44.00 1st Qu.: 96.0 1st Qu.:20.00 1st Qu.:1.0000
## Median : 58.00 Median :135.0 Median :22.00 Median :1.0000
## Mean : 55.53 Mean :129.7 Mean :21.16 Mean :0.7764
## 3rd Qu.: 68.00 3rd Qu.:161.0 3rd Qu.:22.00 3rd Qu.:1.0000
## Max. :100.00 Max. :281.0 Max. :25.00 Max. :1.0000
## chills cough aches vomit
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000 Median :1.0000 Median :0.0000 Median :1.0000
## Mean :0.147 Mean :0.8626 Mean :0.1086 Mean :0.5112
## 3rd Qu.:0.000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## temp time_admission bmi
## Min. :35.90 Min. :0S Min. : 0.00
## 1st Qu.:38.20 1st Qu.:9H 5M 0S 1st Qu.: 24.45
## Median :38.80 Median :12H 19M 0S Median : 32.63
## Mean :38.57 Mean :11H 33M 32.2044728434485S Mean : 44.15
## 3rd Qu.:39.20 3rd Qu.:16H 6M 0S 3rd Qu.: 47.90
## Max. :40.40 Max. :22H 40M 0S Max. :370.37
## days_onset_hosp
## Min. : 0.000
## 1st Qu.: 1.000
## Median : 1.000
## Mean : 1.904
## 3rd Qu.: 3.000
## Max. :11.000
# Calculate correlations
subset_central <- newcentral[,-c(1,3,6)]
cor_matrix <- cor(subset_central)
cor_matrix## generation gender age lon
## generation 1.0000000000 -0.043210346 -0.004295300 0.028329794
## gender -0.0432103463 1.000000000 0.242972913 0.064615459
## age -0.0042952998 0.242972913 1.000000000 -0.063946375
## lon 0.0283297940 0.064615459 -0.063946375 1.000000000
## lat -0.0558026704 0.035849589 -0.058903768 0.086448612
## wt_kg 0.0377018920 0.353133891 0.838619948 -0.010091031
## ht_cm 0.0193576254 0.303663303 0.891307556 -0.006929528
## ct_blood 0.2976287002 0.082825712 0.048222869 0.044910119
## fever 0.0921457473 0.071808170 -0.045948027 -0.025652367
## chills -0.0336678305 -0.001847142 -0.066410988 0.028399351
## cough -0.0002445515 -0.064476085 -0.005755009 0.003176636
## aches -0.0262540104 -0.016748795 -0.050737861 -0.045092735
## vomit 0.0598378425 0.014923548 0.007207912 -0.008932279
## temp 0.1325428997 0.049538999 -0.028018579 -0.008782092
## time_admission NA NA NA NA
## bmi -0.0066983902 -0.155717147 -0.525988929 -0.104983100
## days_onset_hosp -0.3417351437 -0.097358077 -0.028462864 0.001975443
## lat wt_kg ht_cm ct_blood
## generation -0.055802670 0.037701892 0.019357625 0.297628700
## gender 0.035849589 0.353133891 0.303663303 0.082825712
## age -0.058903768 0.838619948 0.891307556 0.048222869
## lon 0.086448612 -0.010091031 -0.006929528 0.044910119
## lat 1.000000000 -0.043512817 -0.072504430 0.040036147
## wt_kg -0.043512817 1.000000000 0.885137783 0.074808786
## ht_cm -0.072504430 0.885137783 1.000000000 0.063429932
## ct_blood 0.040036147 0.074808786 0.063429932 1.000000000
## fever 0.005492878 -0.039684905 -0.061990823 -0.047208797
## chills -0.120296068 -0.139905055 -0.077011856 0.076064476
## cough -0.043024795 0.051227921 0.015842098 -0.080871410
## aches 0.038387844 -0.044433085 -0.022972805 -0.009188953
## vomit 0.013657458 -0.017952742 -0.014660060 0.003453408
## temp -0.009701322 -0.003385039 -0.054323848 -0.031792524
## time_admission NA NA NA NA
## bmi 0.089096430 -0.529545039 -0.713099072 -0.096466353
## days_onset_hosp -0.020829928 -0.045219005 -0.039753008 -0.656144736
## fever chills cough aches
## generation 0.092145747 -0.0336678305 -0.0002445515 -0.026254010
## gender 0.071808170 -0.0018471424 -0.0644760847 -0.016748795
## age -0.045948027 -0.0664109877 -0.0057550088 -0.050737861
## lon -0.025652367 0.0283993512 0.0031766356 -0.045092735
## lat 0.005492878 -0.1202960685 -0.0430247954 0.038387844
## wt_kg -0.039684905 -0.1399050547 0.0512279213 -0.044433085
## ht_cm -0.061990823 -0.0770118561 0.0158420978 -0.022972805
## ct_blood -0.047208797 0.0760644758 -0.0808714099 -0.009188953
## fever 1.000000000 -0.1237031649 -0.0137337470 0.064159932
## chills -0.123703165 1.0000000000 -0.0702605111 -0.028905367
## cough -0.013733747 -0.0702605111 1.0000000000 -0.069465611
## aches 0.064159932 -0.0289053670 -0.0694656107 1.000000000
## vomit 0.012006273 -0.0453873137 0.0182102773 -0.048888758
## temp 0.875433945 -0.1207631977 -0.0216745226 0.024110014
## time_admission NA NA NA NA
## bmi 0.033171459 0.0003451653 0.0155250299 -0.022285916
## days_onset_hosp -0.087511596 -0.0253555972 0.0554068700 0.006495686
## vomit temp time_admission bmi
## generation 0.059837842 0.132542900 NA -0.0066983902
## gender 0.014923548 0.049538999 NA -0.1557171470
## age 0.007207912 -0.028018579 NA -0.5259889291
## lon -0.008932279 -0.008782092 NA -0.1049831002
## lat 0.013657458 -0.009701322 NA 0.0890964304
## wt_kg -0.017952742 -0.003385039 NA -0.5295450388
## ht_cm -0.014660060 -0.054323848 NA -0.7130990717
## ct_blood 0.003453408 -0.031792524 NA -0.0964663533
## fever 0.012006273 0.875433945 NA 0.0331714589
## chills -0.045387314 -0.120763198 NA 0.0003451653
## cough 0.018210277 -0.021674523 NA 0.0155250299
## aches -0.048888758 0.024110014 NA -0.0222859160
## vomit 1.000000000 0.007955103 NA 0.0636913962
## temp 0.007955103 1.000000000 NA 0.0557669485
## time_admission NA NA 1 NA
## bmi 0.063691396 0.055766948 NA 1.0000000000
## days_onset_hosp -0.014983646 -0.079754626 NA 0.0270226972
## days_onset_hosp
## generation -0.341735144
## gender -0.097358077
## age -0.028462864
## lon 0.001975443
## lat -0.020829928
## wt_kg -0.045219005
## ht_cm -0.039753008
## ct_blood -0.656144736
## fever -0.087511596
## chills -0.025355597
## cough 0.055406870
## aches 0.006495686
## vomit -0.014983646
## temp -0.079754626
## time_admission NA
## bmi 0.027022697
## days_onset_hosp 1.000000000
library(corrplot)
# Create correlation plot
corrplot(cor_matrix, method="color", type="upper",
addCoef.col = "black", number.cex = 0.7,
tl.col="black", tl.srt=90)
There is a strong positive correlation between age and weight, age and
height, weight and height, fever and temperature. Also there is a strong
negative correlation between blood count and difference between days on
set and ct_blood, height and BMI, weight and BMI, age and BMI, which
makes a lot of sense.
a <-ggplot(newcentral, aes(age))+
geom_boxplot()
b <- ggplot(newcentral, aes(ht_cm))+
geom_boxplot()
c <- ggplot(newcentral, aes(bmi))+
geom_boxplot()
d <- ggplot(newcentral, aes(days_onset_hosp))+
geom_boxplot()
e <- ggplot(newcentral, aes(wt_kg))+
geom_boxplot()
f <- ggplot(newcentral, aes(temp))+
geom_boxplot()# remove outliers in days_onset_hosp
cq <- quantile( newcentral$days_onset_hosp , 0.95)
newcentral$days_onset_hosp[newcentral$days_onset_hosp > cq]<- round(cq)
#lets take a look at the outliers in age
table(newcentral$age)##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 4 11 10 18 9 10 10 15 7 13 11 8 7 8 9 9 9 7 6 7 7 13 6 12 12 4
## 26 27 28 29 30 31 32 33 34 35 37 38 39 40 43 44 46 47 49 52 53 57 58 61 64 65
## 6 3 4 5 7 2 2 6 1 6 2 2 5 2 1 1 1 3 3 2 1 1 1 1 1 1
## 73
## 1
## age age_cat5 ht_cm
## 1 61 60-64 251
## 2 58 55-59 228
## 3 52 50-54 220
## 4 64 60-64 240
## 5 53 50-54 215
## 6 52 50-54 206
## 7 57 55-59 237
## 8 65 65-69 281
## 9 73 70-74 274
since there is an age category between 60 and 74 i don’t think the ourliers are invalid
##
## 0 3 5 10 11 13 15 16 17 18 21 23 24 25 26 27 28 30 31 32
## 1 1 1 1 1 2 1 2 2 2 1 1 3 3 2 2 2 1 2 4
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
## 4 2 4 2 7 6 3 2 7 5 1 4 2 3 4 2 3 9 9 7
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 3 9 11 7 5 8 4 10 7 9 10 9 8 6 7 3 4 8 7 5
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 91 95 97
## 4 6 4 3 4 4 2 3 3 3 2 2 1 3 1 1 1 1 1 1
## 98 100
## 1 1
## outcome age wt_kg ht_cm bmi
## 1 Death 0 5 33 45.91368
## 2 Recover 0 3 30 33.33333
## 3 Recover 0 0 26 0.00000
this is not quite clear, lets check the height
## outcome age wt_kg ht_cm bmi
## 1 Death 1 17 47 76.95790
## 2 Death 2 21 39 138.06706
## 3 Recover 3 32 46 151.22873
## 4 Recover 1 17 47 76.95790
## 5 Recover 2 16 40 100.00000
## 6 Death 0 5 33 45.91368
## 7 Recover 1 13 38 90.02770
## 8 Death 1 16 38 110.80332
## 9 Death 2 25 34 216.26298
## 10 Recover 0 3 30 33.33333
## 11 Recover 0 11 32 107.42188
## 12 Recover 0 0 26 0.00000
## 13 Death 2 18 48 78.12500
## 14 Death 3 35 48 151.90972
## 15 Death 1 25 42 141.72336
## 16 Recover 1 10 40 62.50000
## 17 Death 2 38 49 158.26739
## 18 Death 1 26 40 162.50000
## 19 Recover 1 13 42 73.69615
## 20 Death 3 23 49 95.79342
## 21 Recover 3 49 46 231.56900
## 22 Death 1 18 40 112.50000
## 23 Death 1 15 38 103.87812
## 24 Death 2 27 27 370.37037
From here there are some anomaly the range of height for a child in this range of age o - 2 should be between 49 and 100. I dont think there is much that could be done about this, i will just change the zero weight to 1
## age wt_kg ht_cm
## 1 65 86 281
## 2 73 97 274
these heights are possible but extreme lets leave them as they might give us information on some patients
#outlier in temp
#lets check these temperatures
newcentral %>%
filter(temp < 36.5 )%>%
select(temp, chills, aches, fever, cough, outcome)## temp chills aches fever cough outcome
## 1 36.3 1 0 0 1 Death
## 2 36.2 0 0 0 1 Death
## 3 35.9 1 0 0 1 Recover
## 4 36.4 0 0 0 1 Death
## 5 36.2 0 1 0 1 Recover
## 6 36.1 0 0 0 1 Recover
## 7 36.0 0 0 0 0 Death
## 8 36.4 1 0 0 1 Recover
## 9 36.4 0 0 0 0 Death
## 10 36.3 0 0 0 1 Recover
I think the low temperature is possible and those with it have some problems
# since weight and height do not have outliers, lets see if the bmi calculation is correct
newcentral$calculated_BMI <- newcentral$wt_kg / ((newcentral$ht_cm/100) ^ 2)
#Compare calculated BMI with the existing BMI column
newcentral$bmi_correct <- abs(newcentral$calculated_BMI - newcentral$bmi) < 0.01
#Identify rows where BMI is incorrect
incorrect_BMI <- newcentral[!newcentral$bmi_correct, ] # Rows where BMI is not correct
cat("Number of incorrect BMI entries:", nrow(incorrect_BMI), "\n")## Number of incorrect BMI entries: 1
lets look at the weights of different category of individuals
we can see from here that the older one gets the heigher the weight or
the height
# ;ets first calculate mean and standard deviation of weight for each age group
weight_stats <- newcentral %>%
group_by(age_cat5) %>%
summarise(mean_weight = mean(wt_kg),
sd_weight = sd(wt_kg))
# Join the calculated stats back with the original data
newcentral <- newcentral %>%
left_join(weight_stats, by = "age_cat5") %>%
mutate(is_abnormal = (wt_kg < mean_weight - 2 * sd_weight) #(a common statistical approach to flag extreme low values)
| (wt_kg > mean_weight + 2 * sd_weight))
#lets plot the graph
ggplot(newcentral, aes(x = age, fill= is_abnormal)) +
geom_bar(position = "dodge") +
scale_color_manual(values = c("black", "red")) +
labs(title = "Abnormal Weights by Age",
x = "Age (Years)",
y = "Weight (kg)",
color = "Abnormal Weight") +
theme_minimal()
the weight does not matter much in if they survive or not
From here we can see that only age range from 0 - 34 experience
underweight or overweight adults from 35 above do not
lt_only <- subset(newcentral, temp < 36.5)
t1 <- ggplot(lt_only, aes(x = as.factor(fever), fill = as.factor(fever))) +
geom_bar() +
scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
theme_minimal()
t2 <- ggplot(lt_only, aes(x = as.factor(chills), fill = as.factor(chills))) +
geom_bar() +
scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
theme_minimal()
t3 <- ggplot(lt_only, aes(x = as.factor(aches), fill = as.factor(aches))) +
geom_bar() +
scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
theme_minimal()
t4 <- ggplot(lt_only, aes(x = as.factor(cough), fill = as.factor(cough))) +
geom_bar() +
scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
theme_minimal()
library(grid)
library(gridExtra)
grid.arrange(t1,t2,t3,t4, ncol = 2,
top = textGrob("patients with low temperature and their symptoms", gp = gpar(fontsize = 16, fontface = "bold")))ht_only <- subset(newcentral, temp > 38)
t1 <- ggplot(ht_only, aes(x = as.factor(fever), fill = as.factor(fever))) +
geom_bar() +
scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
theme_minimal()
t2 <- ggplot(ht_only, aes(x = as.factor(chills), fill = as.factor(chills))) +
geom_bar() +
scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
theme_minimal()
t3 <- ggplot(ht_only, aes(x = as.factor(aches), fill = as.factor(aches))) +
geom_bar() +
scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
theme_minimal()
t4 <- ggplot(ht_only, aes(x = as.factor(cough), fill = as.factor(cough))) +
geom_bar() +
scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "No", "1" = "Yes")) +
scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
theme_minimal()
library(grid)
library(gridExtra)
grid.arrange(t1,t2,t3,t4, ncol = 2,
top = textGrob("patients with low temperature and their symptoms", gp = gpar(fontsize = 16, fontface = "bold")))Since there a correlation between blood cound and the number of days it takes between showing symptoms and vising the hospital lets take a look visually
ggplot(newcentral, aes(x = days_onset_hosp, y = ct_blood)) +
geom_point() +
geom_smooth(method = "lm", color = "red", se = TRUE) + # Regression line with confidence interval
labs(title = "Blood Count Over Time", x = "symptom-to-admission delay", y = "Blood Count")
This shows a declining blood count which means that the patients’s
condition worsens as they wait to seek medical care.
ggplot(newcentral, aes(days_onset_hosp, fill = outcome)) +
geom_bar(position = "dodge")+
labs(title = "Admission delay and outcome", x ="symptom-to-admission delay" )
\ We can see that there are relatively more deaths than recoveries for
those who experience delay in visition the hospital.
ggplot(newcentral, aes(age_cat5, fill = outcome ))+
geom_bar(position = "dodge")+
xlab("Age category")+
ylab("No of deaths or recovery")
there are more death than recovery among children between age 0-9, and
there are more death generally than recovery.
More people get recovered visiting the hospital earlier, than late. but
visiting the hospital earlier does not reduce death rate.
death_only <- subset(newcentral, outcome == "Death")
f <- ggplot(death_only, aes(as.factor(fever), fill = outcome))+
geom_bar()
g <- ggplot(death_only, aes(as.factor(chills), fill = outcome))+
geom_bar()
h <- ggplot(death_only, aes(as.factor(aches), fill = outcome))+
geom_bar()
i <- ggplot(death_only, aes(as.factor(vomit), fill = outcome))+
geom_bar()
j <- ggplot(death_only, aes(as.factor(cough), fill = outcome))+
geom_bar()
grid.arrange(f,g,h,i,j, ncol = 3)
This show that there is more death among those with symptoms such as
fever, vomits and cough.
# some bmi values are not within the healthy range shows that there are some people with unhealthy bmi
# <18.5 and >=30 are unhealthy bmi
newcentral$bmi_range <- ifelse(newcentral$bmi < 18.5, "< 18.5",
ifelse(newcentral$bmi >= 18.5 & newcentral$bmi < 30, "18.5 - 29.9",
ifelse(newcentral$bmi >= 30, ">= 30", NA)))
there are more obese people
This graph is not too informative, so lets calculate percentage
## percentage of < 18.5 that recovered 47.82609
## percentage of 18.5 - 29.9 that recovered 54.71698
## percentage of >= 30 that recovered 38.04348
From this we found out that those that are obesed are less likely to recover