Task 1: install and load the packages “rio”, “skimr”, “janitor”, and “tidyverse” using p_load. If you get stuck, press on the “code” button on the right of the HTML file to see the code.
install.packages("pacman", repos = "http://cran.us.r-project.org")
library(pacman)
## Warning: package 'pacman' was built under R version 4.3.2
# p_load: load (and install if necessary) packages required for this analysis.
p_load(
rio,
skimr,
tidyverse,
gtsummary,
here,
rstatix,
scales,
flextable,
janitor
)
Task 2: Import the dataset “linelist_cleaned.rds” using import() and call the imported dataset linelist. If you get stuck, press on the “code” button on the right of the HTML file to see the code.
linelist<-import("linelist_cleaned.rds")
skim(linelist)
| Name | linelist |
| Number of rows | 5888 |
| Number of columns | 30 |
| _______________________ | |
| Column type frequency: | |
| character | 13 |
| Date | 4 |
| factor | 2 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| case_id | 0 | 1.00 | 6 | 6 | 0 | 5888 | 0 |
| outcome | 1323 | 0.78 | 5 | 7 | 0 | 2 | 0 |
| gender | 278 | 0.95 | 1 | 1 | 0 | 2 | 0 |
| age_unit | 0 | 1.00 | 5 | 6 | 0 | 2 | 0 |
| hospital | 0 | 1.00 | 5 | 36 | 0 | 6 | 0 |
| infector | 2088 | 0.65 | 6 | 6 | 0 | 2697 | 0 |
| source | 2088 | 0.65 | 5 | 7 | 0 | 2 | 0 |
| fever | 249 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| chills | 249 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| cough | 249 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| aches | 249 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| vomit | 249 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| time_admission | 765 | 0.87 | 5 | 5 | 0 | 1072 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date_infection | 2087 | 0.65 | 2014-03-19 | 2015-04-27 | 2014-10-11 | 359 |
| date_onset | 256 | 0.96 | 2014-04-07 | 2015-04-30 | 2014-10-23 | 367 |
| date_hospitalisation | 0 | 1.00 | 2014-04-17 | 2015-04-30 | 2014-10-23 | 363 |
| date_outcome | 936 | 0.84 | 2014-04-19 | 2015-06-04 | 2014-11-01 | 371 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| age_cat | 86 | 0.99 | FALSE | 8 | 0-4: 1095, 5-9: 1095, 20-: 1073, 10-: 941 |
| age_cat5 | 86 | 0.99 | FALSE | 17 | 0-4: 1095, 5-9: 1095, 10-: 941, 15-: 743 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| generation | 0 | 1.00 | 16.56 | 5.79 | 0.00 | 13.00 | 16.00 | 20.00 | 37.00 | ▁▆▇▂▁ |
| age | 86 | 0.99 | 16.07 | 12.62 | 0.00 | 6.00 | 13.00 | 23.00 | 84.00 | ▇▅▁▁▁ |
| age_years | 86 | 0.99 | 16.02 | 12.64 | 0.00 | 6.00 | 13.00 | 23.00 | 84.00 | ▇▅▁▁▁ |
| lon | 0 | 1.00 | -13.23 | 0.02 | -13.27 | -13.25 | -13.23 | -13.22 | -13.21 | ▅▃▃▆▇ |
| lat | 0 | 1.00 | 8.47 | 0.01 | 8.45 | 8.46 | 8.47 | 8.48 | 8.49 | ▅▇▇▇▆ |
| wt_kg | 0 | 1.00 | 52.64 | 18.58 | -11.00 | 41.00 | 54.00 | 66.00 | 111.00 | ▁▃▇▅▁ |
| ht_cm | 0 | 1.00 | 124.96 | 49.52 | 4.00 | 91.00 | 129.00 | 159.00 | 295.00 | ▂▅▇▂▁ |
| ct_blood | 0 | 1.00 | 21.21 | 1.69 | 16.00 | 20.00 | 22.00 | 22.00 | 26.00 | ▁▃▇▃▁ |
| temp | 149 | 0.97 | 38.56 | 0.98 | 35.20 | 38.20 | 38.80 | 39.20 | 40.80 | ▁▂▂▇▁ |
| bmi | 0 | 1.00 | 46.89 | 55.39 | -1200.00 | 24.56 | 32.12 | 50.01 | 1250.00 | ▁▁▇▁▁ |
| days_onset_hosp | 256 | 0.96 | 2.06 | 2.26 | 0.00 | 1.00 | 1.00 | 3.00 | 22.00 | ▇▁▁▁▁ |
summary(linelist)
## case_id generation date_infection date_onset
## Length:5888 Min. : 0.00 Min. :2014-03-19 Min. :2014-04-07
## Class :character 1st Qu.:13.00 1st Qu.:2014-09-06 1st Qu.:2014-09-16
## Mode :character Median :16.00 Median :2014-10-11 Median :2014-10-23
## Mean :16.56 Mean :2014-10-22 Mean :2014-11-03
## 3rd Qu.:20.00 3rd Qu.:2014-12-05 3rd Qu.:2014-12-19
## Max. :37.00 Max. :2015-04-27 Max. :2015-04-30
## NA's :2087 NA's :256
## date_hospitalisation date_outcome outcome
## Min. :2014-04-17 Min. :2014-04-19 Length:5888
## 1st Qu.:2014-09-19 1st Qu.:2014-09-26 Class :character
## Median :2014-10-23 Median :2014-11-01 Mode :character
## Mean :2014-11-03 Mean :2014-11-12
## 3rd Qu.:2014-12-17 3rd Qu.:2014-12-28
## Max. :2015-04-30 Max. :2015-06-04
## NA's :936
## gender age age_unit age_years
## Length:5888 Min. : 0.00 Length:5888 Min. : 0.00
## Class :character 1st Qu.: 6.00 Class :character 1st Qu.: 6.00
## Mode :character Median :13.00 Mode :character Median :13.00
## Mean :16.07 Mean :16.02
## 3rd Qu.:23.00 3rd Qu.:23.00
## Max. :84.00 Max. :84.00
## NA's :86 NA's :86
## age_cat age_cat5 hospital lon
## 0-4 :1095 0-4 :1095 Length:5888 Min. :-13.27
## 5-9 :1095 5-9 :1095 Class :character 1st Qu.:-13.25
## 20-29 :1073 10-14 : 941 Mode :character Median :-13.23
## 10-14 : 941 15-19 : 743 Mean :-13.23
## 30-49 : 754 20-24 : 638 3rd Qu.:-13.22
## (Other): 844 (Other):1290 Max. :-13.21
## NA's : 86 NA's : 86
## lat infector source wt_kg
## Min. :8.446 Length:5888 Length:5888 Min. :-11.00
## 1st Qu.:8.461 Class :character Class :character 1st Qu.: 41.00
## Median :8.469 Mode :character Mode :character Median : 54.00
## Mean :8.470 Mean : 52.64
## 3rd Qu.:8.480 3rd Qu.: 66.00
## Max. :8.492 Max. :111.00
##
## ht_cm ct_blood fever chills
## Min. : 4 Min. :16.00 Length:5888 Length:5888
## 1st Qu.: 91 1st Qu.:20.00 Class :character Class :character
## Median :129 Median :22.00 Mode :character Mode :character
## Mean :125 Mean :21.21
## 3rd Qu.:159 3rd Qu.:22.00
## Max. :295 Max. :26.00
##
## cough aches vomit temp
## Length:5888 Length:5888 Length:5888 Min. :35.20
## Class :character Class :character Class :character 1st Qu.:38.20
## Mode :character Mode :character Mode :character Median :38.80
## Mean :38.56
## 3rd Qu.:39.20
## Max. :40.80
## NA's :149
## time_admission bmi days_onset_hosp
## Length:5888 Min. :-1200.00 Min. : 0.000
## Class :character 1st Qu.: 24.56 1st Qu.: 1.000
## Mode :character Median : 32.12 Median : 1.000
## Mean : 46.89 Mean : 2.059
## 3rd Qu.: 50.01 3rd Qu.: 3.000
## Max. : 1250.00 Max. :22.000
## NA's :256
Skim result is easier to read because it shows firstly the summary of descreption and then describle each variable into the table
skim(linelist$temp)
| Name | linelist$temp |
| Number of rows | 5888 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 149 | 0.97 | 38.56 | 0.98 | 35.2 | 38.2 | 38.8 | 39.2 | 40.8 | ▁▂▂▇▁ |
# Method 1
fch1 <- function(x)
{
mean=mean(x)
median=median(x)
standard_deviation=sd(x)
interquartile_range=quantile(x)
range=range(x)
return(c("mean"=mean, "media"=median, "sd"=standard_deviation, "iqr"=interquartile_range, "range"=range))
}
fch1(linelist$bmi)
## mean media sd iqr.0% iqr.25% iqr.50%
## 46.89023 32.12396 55.38829 -1200.00000 24.56033 32.12396
## iqr.75% iqr.100% range1 range2
## 50.00676 1250.00000 -1200.00000 1250.00000
# Method 2
linelist %>%
get_summary_stats(
bmi, temp
)
## # A tibble: 2 × 13
## variable n min max median q1 q3 iqr mad mean sd
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 bmi 5888 -1200 1250 32.1 24.6 50.0 25.4 14.4 46.9 55.4
## 2 temp 5739 35.2 40.8 38.8 38.2 39.2 1 0.741 38.6 0.977
## # ℹ 2 more variables: se <dbl>, ci <dbl>
# Method 1
fch1 <- function(x)
{
mean=mean(x, na.rm = T)
median=median(x, na.rm = T)
standard_deviation=sd(x, na.rm = T)
interquartile_range=quantile(x, na.rm = T)
range=range(x, na.rm = T)
return(c("mean"=mean, "media"=median, "sd"=standard_deviation, "iqr"=interquartile_range, "range"=range))
}
fch1(linelist$bmi)
## mean media sd iqr.0% iqr.25% iqr.50%
## 46.89023 32.12396 55.38829 -1200.00000 24.56033 32.12396
## iqr.75% iqr.100% range1 range2
## 50.00676 1250.00000 -1200.00000 1250.00000
fch1(linelist$temp)
## mean media sd iqr.0% iqr.25% iqr.50% iqr.75%
## 38.5582854 38.8000000 0.9767722 35.2000000 38.2000000 38.8000000 39.2000000
## iqr.100% range1 range2
## 40.8000000 35.2000000 40.8000000
# Method 2
linelist %>%
na.omit() %>%
get_summary_stats(bmi, temp)
## # A tibble: 2 × 13
## variable n min max median q1 q3 iqr mad mean sd se
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 bmi 1818 -592. 741. 32.5 24.3 51.2 26.9 14.9 46.6 49.9 1.17
## 2 temp 1818 35.2 40.6 38.9 38.3 39.2 0.9 0.593 38.6 0.933 0.022
## # ℹ 1 more variable: ci <dbl>
linelist %>%
tabyl(temp, age_cat)
## temp 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70+ NA_
## 35.2 0 0 1 0 0 0 0 0 0
## 35.3 0 0 0 0 1 0 0 0 0
## 35.5 0 1 0 0 1 0 0 0 0
## 35.6 0 0 1 0 1 0 0 0 0
## 35.7 0 0 0 0 0 1 0 0 0
## 35.8 0 1 1 0 1 2 0 0 0
## 35.9 3 0 2 4 2 0 2 0 1
## 36.0 1 3 4 1 2 0 0 0 0
## 36.1 2 3 4 4 5 2 0 0 1
## 36.2 3 6 7 3 7 4 1 0 0
## 36.3 11 5 8 5 5 4 0 0 1
## 36.4 12 11 16 8 8 6 2 0 0
## 36.5 16 14 8 12 13 7 1 0 1
## 36.6 18 15 6 7 15 6 3 0 1
## 36.7 21 13 9 5 13 13 2 0 1
## 36.8 13 23 14 4 15 12 3 0 1
## 36.9 21 19 18 14 17 16 3 0 0
## 37.0 17 21 14 15 17 19 2 0 2
## 37.1 14 21 22 10 17 11 2 0 1
## 37.2 19 13 17 14 11 16 1 0 0
## 37.3 17 16 16 16 9 14 2 0 1
## 37.4 18 9 14 12 10 17 1 0 1
## 37.5 17 11 18 7 10 9 2 0 0
## 37.6 16 5 13 8 10 5 1 0 0
## 37.7 6 6 2 4 6 5 0 0 1
## 37.8 4 5 3 7 3 5 1 0 0
## 37.9 7 7 5 4 5 1 0 0 1
## 38.0 3 5 5 3 5 5 3 0 1
## 38.1 11 12 20 14 12 12 0 1 3
## 38.2 26 18 21 12 18 9 0 0 2
## 38.3 25 22 25 20 24 15 1 0 4
## 38.4 32 30 21 17 34 28 1 0 2
## 38.5 45 48 31 29 40 24 2 1 2
## 38.6 62 45 36 38 45 35 6 0 0
## 38.7 47 60 47 41 60 35 5 0 3
## 38.8 58 76 57 33 59 46 5 0 8
## 38.9 60 64 47 43 74 48 4 1 8
## 39.0 54 66 58 36 63 53 2 1 10
## 39.1 73 77 60 33 53 57 9 0 9
## 39.2 64 77 58 38 67 35 5 0 5
## 39.3 61 53 40 37 72 38 7 1 3
## 39.4 49 63 40 33 49 39 2 0 5
## 39.5 42 40 37 32 40 21 4 0 3
## 39.6 35 26 22 30 32 17 5 1 0
## 39.7 20 19 22 25 35 17 1 0 0
## 39.8 17 12 18 9 11 10 0 0 2
## 39.9 6 11 12 7 14 1 2 0 1
## 40.0 6 9 8 10 12 5 0 0 0
## 40.1 7 4 5 8 3 7 0 0 1
## 40.2 3 2 2 2 5 2 0 0 0
## 40.3 1 1 2 4 2 1 0 0 0
## 40.4 3 0 2 0 2 2 0 0 0
## 40.5 1 1 0 1 2 1 0 0 0
## 40.6 0 0 1 2 0 0 0 0 0
## 40.7 0 0 1 0 0 0 0 0 0
## 40.8 0 0 0 0 1 0 0 0 0
## NA 28 26 20 22 35 16 2 0 0
linelist %>%
tabyl(gender, age_cat)
## gender 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70+ NA_
## f 640 641 518 359 468 179 2 0 0
## m 416 412 383 364 575 557 91 5 0
## <NA> 39 42 40 20 30 18 2 1 86
s <- linelist %>%
group_by(hospital) %>%
summarise(min=min(bmi), max=max(bmi), mean=mean(bmi), standard_deviation=sd(bmi))
s
## # A tibble: 6 × 5
## hospital min max mean standard_deviation
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Central Hospital -156. 336 45.9 40.5
## 2 Military Hospital -39.1 741. 47.3 46.4
## 3 Missing -309. 741. 47.9 50.1
## 4 Other -1200 494. 44.9 69.5
## 5 Port Hospital -592. 816. 46.7 55.4
## 6 St. Mark's Maternity Hospital (SMMH) 12.7 1250 48.3 69.3
s <- linelist %>%
na.omit() %>%
group_by(hospital) %>%
reframe(min=min(bmi), max=max(bmi), mean=mean(bmi), standard_deviation=sd(bmi), bmi_p05 = quantile(bmi, probs = 0.05), bmi_p95 = quantile(bmi, probs = 0.95))
s
## # A tibble: 6 × 7
## hospital min max mean standard_deviation bmi_p05 bmi_p95
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Central Hospital 13.1 211. 45.3 33.3 17.1 107.
## 2 Military Hospital 0 301. 48.9 43.9 17.9 132.
## 3 Missing -309. 741. 49.6 63.0 18.0 144.
## 4 Other -136. 429. 48.1 49.2 17.2 129.
## 5 Port Hospital -592. 352. 43.5 49.2 17.0 113.
## 6 St. Mark's Maternity Ho… 12.8 171. 43.5 28.5 17.8 112.
t <- linelist %>%
group_by(hospital) %>%
summarise(
underweight=sum(bmi<=18.5, na.rm = T),
overweight=sum(bmi>=30, na.rm = T)
)
t
## # A tibble: 6 × 3
## hospital underweight overweight
## <chr> <int> <int>
## 1 Central Hospital 32 262
## 2 Military Hospital 60 495
## 3 Missing 104 818
## 4 Other 59 493
## 5 Port Hospital 121 999
## 6 St. Mark's Maternity Hospital (SMMH) 33 244
A t-test is a statistical test that you can use to compare the mean of two samples. The t-test is based on the following assumptions: the data are independent, they are (approximately) normally distributed, they are an approximately similar variance.
t.test(ht_cm~gender, data=linelist)
##
## Welch Two Sample t-test
##
## data: ht_cm by gender
## t = -26.334, df = 5596.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group f and group m is not equal to 0
## 95 percent confidence interval:
## -35.29745 -30.40620
## sample estimates:
## mean in group f mean in group m
## 108.6669 141.5187
t.test(linelist$bmi, mu=25)
##
## One Sample t-test
##
## data: linelist$bmi
## t = 30.326, df = 5887, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 25
## 95 percent confidence interval:
## 45.47518 48.30528
## sample estimates:
## mean of x
## 46.89023
The Shapiro-Wilk is used to test assumption of normal distribution. If the value of the W statistic is too extreme, this gives an indication that your data would not conform to the Normal distribution.
linelist1 <- linelist %>%
slice(1:5000)
shapiro.test(linelist1$ht_cm)
##
## Shapiro-Wilk normality test
##
## data: linelist1$ht_cm
## W = 0.99016, p-value < 2.2e-16
A Chi-square test assesse if there is a statistically significant association between the rows and the columns of the table. The test computes the values of the contingency table that one should expect under the assumption of rows and columns being independent. If the actual values are too extreme when compared to the expected value, then one is forced to reject the assumption of independence in favour of one of dependence between the two.
linelist %>%
freq_table(fever, na.rm = F)
## # A tibble: 3 × 3
## fever n prop
## <chr> <int> <dbl>
## 1 no 1090 18.5
## 2 yes 4549 77.3
## 3 <NA> 249 4.2
linelist %>%
freq_table(gender, na.rm = F)
## # A tibble: 3 × 3
## gender n prop
## <chr> <int> <dbl>
## 1 f 2807 47.7
## 2 m 2803 47.6
## 3 <NA> 278 4.7
linelist %>%
tabyl(gender,fever)
## gender no yes NA_
## f 520 2161 126
## m 518 2172 113
## <NA> 52 216 10
linelist %>%
select(gender,fever) %>%
tbl_summary(by=fever) %>%
add_p()
## 249 observations missing `fever` have been removed. To include these observations, use `forcats::fct_na_value_to_level()` on `fever` column before passing to `tbl_summary()`.
| Characteristic | no, N = 1,0901 | yes, N = 4,5491 | p-value2 |
|---|---|---|---|
| gender | 0.9 | ||
| f | 520 (50%) | 2,161 (50%) | |
| m | 518 (50%) | 2,172 (50%) | |
| Unknown | 52 | 216 | |
| 1 n (%) | |||
| 2 Pearson’s Chi-squared test | |||
Several data points are missing in the gender. There is no significant association between gender and fever (p>0.05)
The correlation coefficient is a number ranging between -1 and +1 expressing the strength of a linear relation between two variables, if a linear relationship is indeed present to begin with. The sign of the coefficient tells whether the correlation, if present, is positive (as one variable increases, so does the other) or negative (as one variable increases, the other decreases, and vice versa). A value on or around zero expresses the lack of a linear relation between the two variables considered, the two variables just don’t happen to be related.
library(tidyverse)
correlation_table <- linelist %>%
select(bmi, age, ht_cm) %>%
cor()
correlation_table
## bmi age ht_cm
## bmi 1.0000000 NA -0.5407457
## age NA 1 NA
## ht_cm -0.5407457 NA 1.0000000
class(linelist$bmi)
## [1] "numeric"
cor(linelist$age,linelist$bmi)
## [1] NA