Install required packages

Task 1: install and load the packages “rio”, “skimr”, “janitor”, and “tidyverse” using p_load. If you get stuck, press on the “code” button on the right of the HTML file to see the code.

install.packages("pacman", repos = "http://cran.us.r-project.org")
library(pacman)
## Warning: package 'pacman' was built under R version 4.3.2
# p_load: load (and install if necessary) packages required for this analysis.
p_load(
       rio,         
       skimr,       
       tidyverse, 
       gtsummary,
       here,
       rstatix,
       scales,
       flextable,
       janitor   
       )

Import data

Task 2: Import the dataset “linelist_cleaned.rds” using import() and call the imported dataset linelist. If you get stuck, press on the “code” button on the right of the HTML file to see the code.

linelist<-import("linelist_cleaned.rds")

Exercises

1) Dataset Exploration

  • Browse the data to get an overview of the variables present in your dataset. Compare the output from skim() with the one of summary().
skim(linelist)
Data summary
Name linelist
Number of rows 5888
Number of columns 30
_______________________
Column type frequency:
character 13
Date 4
factor 2
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
case_id 0 1.00 6 6 0 5888 0
outcome 1323 0.78 5 7 0 2 0
gender 278 0.95 1 1 0 2 0
age_unit 0 1.00 5 6 0 2 0
hospital 0 1.00 5 36 0 6 0
infector 2088 0.65 6 6 0 2697 0
source 2088 0.65 5 7 0 2 0
fever 249 0.96 2 3 0 2 0
chills 249 0.96 2 3 0 2 0
cough 249 0.96 2 3 0 2 0
aches 249 0.96 2 3 0 2 0
vomit 249 0.96 2 3 0 2 0
time_admission 765 0.87 5 5 0 1072 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date_infection 2087 0.65 2014-03-19 2015-04-27 2014-10-11 359
date_onset 256 0.96 2014-04-07 2015-04-30 2014-10-23 367
date_hospitalisation 0 1.00 2014-04-17 2015-04-30 2014-10-23 363
date_outcome 936 0.84 2014-04-19 2015-06-04 2014-11-01 371

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
age_cat 86 0.99 FALSE 8 0-4: 1095, 5-9: 1095, 20-: 1073, 10-: 941
age_cat5 86 0.99 FALSE 17 0-4: 1095, 5-9: 1095, 10-: 941, 15-: 743

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
generation 0 1.00 16.56 5.79 0.00 13.00 16.00 20.00 37.00 ▁▆▇▂▁
age 86 0.99 16.07 12.62 0.00 6.00 13.00 23.00 84.00 ▇▅▁▁▁
age_years 86 0.99 16.02 12.64 0.00 6.00 13.00 23.00 84.00 ▇▅▁▁▁
lon 0 1.00 -13.23 0.02 -13.27 -13.25 -13.23 -13.22 -13.21 ▅▃▃▆▇
lat 0 1.00 8.47 0.01 8.45 8.46 8.47 8.48 8.49 ▅▇▇▇▆
wt_kg 0 1.00 52.64 18.58 -11.00 41.00 54.00 66.00 111.00 ▁▃▇▅▁
ht_cm 0 1.00 124.96 49.52 4.00 91.00 129.00 159.00 295.00 ▂▅▇▂▁
ct_blood 0 1.00 21.21 1.69 16.00 20.00 22.00 22.00 26.00 ▁▃▇▃▁
temp 149 0.97 38.56 0.98 35.20 38.20 38.80 39.20 40.80 ▁▂▂▇▁
bmi 0 1.00 46.89 55.39 -1200.00 24.56 32.12 50.01 1250.00 ▁▁▇▁▁
days_onset_hosp 256 0.96 2.06 2.26 0.00 1.00 1.00 3.00 22.00 ▇▁▁▁▁
summary(linelist)
##    case_id            generation    date_infection         date_onset        
##  Length:5888        Min.   : 0.00   Min.   :2014-03-19   Min.   :2014-04-07  
##  Class :character   1st Qu.:13.00   1st Qu.:2014-09-06   1st Qu.:2014-09-16  
##  Mode  :character   Median :16.00   Median :2014-10-11   Median :2014-10-23  
##                     Mean   :16.56   Mean   :2014-10-22   Mean   :2014-11-03  
##                     3rd Qu.:20.00   3rd Qu.:2014-12-05   3rd Qu.:2014-12-19  
##                     Max.   :37.00   Max.   :2015-04-27   Max.   :2015-04-30  
##                                     NA's   :2087         NA's   :256         
##  date_hospitalisation  date_outcome          outcome         
##  Min.   :2014-04-17   Min.   :2014-04-19   Length:5888       
##  1st Qu.:2014-09-19   1st Qu.:2014-09-26   Class :character  
##  Median :2014-10-23   Median :2014-11-01   Mode  :character  
##  Mean   :2014-11-03   Mean   :2014-11-12                     
##  3rd Qu.:2014-12-17   3rd Qu.:2014-12-28                     
##  Max.   :2015-04-30   Max.   :2015-06-04                     
##                       NA's   :936                            
##     gender               age          age_unit           age_years    
##  Length:5888        Min.   : 0.00   Length:5888        Min.   : 0.00  
##  Class :character   1st Qu.: 6.00   Class :character   1st Qu.: 6.00  
##  Mode  :character   Median :13.00   Mode  :character   Median :13.00  
##                     Mean   :16.07                      Mean   :16.02  
##                     3rd Qu.:23.00                      3rd Qu.:23.00  
##                     Max.   :84.00                      Max.   :84.00  
##                     NA's   :86                         NA's   :86     
##     age_cat        age_cat5      hospital              lon        
##  0-4    :1095   0-4    :1095   Length:5888        Min.   :-13.27  
##  5-9    :1095   5-9    :1095   Class :character   1st Qu.:-13.25  
##  20-29  :1073   10-14  : 941   Mode  :character   Median :-13.23  
##  10-14  : 941   15-19  : 743                      Mean   :-13.23  
##  30-49  : 754   20-24  : 638                      3rd Qu.:-13.22  
##  (Other): 844   (Other):1290                      Max.   :-13.21  
##  NA's   :  86   NA's   :  86                                      
##       lat          infector            source              wt_kg       
##  Min.   :8.446   Length:5888        Length:5888        Min.   :-11.00  
##  1st Qu.:8.461   Class :character   Class :character   1st Qu.: 41.00  
##  Median :8.469   Mode  :character   Mode  :character   Median : 54.00  
##  Mean   :8.470                                         Mean   : 52.64  
##  3rd Qu.:8.480                                         3rd Qu.: 66.00  
##  Max.   :8.492                                         Max.   :111.00  
##                                                                        
##      ht_cm        ct_blood        fever              chills         
##  Min.   :  4   Min.   :16.00   Length:5888        Length:5888       
##  1st Qu.: 91   1st Qu.:20.00   Class :character   Class :character  
##  Median :129   Median :22.00   Mode  :character   Mode  :character  
##  Mean   :125   Mean   :21.21                                        
##  3rd Qu.:159   3rd Qu.:22.00                                        
##  Max.   :295   Max.   :26.00                                        
##                                                                     
##     cough              aches              vomit                temp      
##  Length:5888        Length:5888        Length:5888        Min.   :35.20  
##  Class :character   Class :character   Class :character   1st Qu.:38.20  
##  Mode  :character   Mode  :character   Mode  :character   Median :38.80  
##                                                           Mean   :38.56  
##                                                           3rd Qu.:39.20  
##                                                           Max.   :40.80  
##                                                           NA's   :149    
##  time_admission          bmi           days_onset_hosp 
##  Length:5888        Min.   :-1200.00   Min.   : 0.000  
##  Class :character   1st Qu.:   24.56   1st Qu.: 1.000  
##  Mode  :character   Median :   32.12   Median : 1.000  
##                     Mean   :   46.89   Mean   : 2.059  
##                     3rd Qu.:   50.01   3rd Qu.: 3.000  
##                     Max.   : 1250.00   Max.   :22.000  
##                                        NA's   :256

Skim result is easier to read because it shows firstly the summary of descreption and then describle each variable into the table

  • How many individuals have missing data on the variables “temp” and “age_years”?
skim(linelist$temp)
Data summary
Name linelist$temp
Number of rows 5888
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 149 0.97 38.56 0.98 35.2 38.2 38.8 39.2 40.8 ▁▂▂▇▁

2) Summary Statistics

  • Compute the following summary statistics for the variables “bmi” and “temp”: mean, median, standard deviation, interquartile range, range.
# Method 1
fch1 <- function(x)
{
  mean=mean(x)
  median=median(x)
  standard_deviation=sd(x)
  interquartile_range=quantile(x)
  range=range(x)
  return(c("mean"=mean, "media"=median, "sd"=standard_deviation, "iqr"=interquartile_range, "range"=range))
}
fch1(linelist$bmi)
##        mean       media          sd      iqr.0%     iqr.25%     iqr.50% 
##    46.89023    32.12396    55.38829 -1200.00000    24.56033    32.12396 
##     iqr.75%    iqr.100%      range1      range2 
##    50.00676  1250.00000 -1200.00000  1250.00000
# Method 2
linelist %>% 
  get_summary_stats(
    bmi, temp
  )
## # A tibble: 2 × 13
##   variable     n     min    max median    q1    q3   iqr    mad  mean     sd
##   <fct>    <dbl>   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>
## 1 bmi       5888 -1200   1250     32.1  24.6  50.0  25.4 14.4    46.9 55.4  
## 2 temp      5739    35.2   40.8   38.8  38.2  39.2   1    0.741  38.6  0.977
## # ℹ 2 more variables: se <dbl>, ci <dbl>
  • Compute the mean and median again, but this time add the option “na.rm = TRUE”. Do you notice any difference in the results? Why do you think this happens?
# Method 1
fch1 <- function(x)
{
  mean=mean(x, na.rm = T)
  median=median(x, na.rm = T)
  standard_deviation=sd(x, na.rm = T)
  interquartile_range=quantile(x, na.rm = T)
  range=range(x, na.rm = T)
  return(c("mean"=mean, "media"=median, "sd"=standard_deviation, "iqr"=interquartile_range, "range"=range))
}
fch1(linelist$bmi)
##        mean       media          sd      iqr.0%     iqr.25%     iqr.50% 
##    46.89023    32.12396    55.38829 -1200.00000    24.56033    32.12396 
##     iqr.75%    iqr.100%      range1      range2 
##    50.00676  1250.00000 -1200.00000  1250.00000
fch1(linelist$temp)
##       mean      media         sd     iqr.0%    iqr.25%    iqr.50%    iqr.75% 
## 38.5582854 38.8000000  0.9767722 35.2000000 38.2000000 38.8000000 39.2000000 
##   iqr.100%     range1     range2 
## 40.8000000 35.2000000 40.8000000
# Method 2
linelist %>% 
  na.omit() %>% 
  get_summary_stats(bmi, temp)
## # A tibble: 2 × 13
##   variable     n    min   max median    q1    q3   iqr    mad  mean     sd    se
##   <fct>    <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
## 1 bmi       1818 -592.  741.    32.5  24.3  51.2  26.9 14.9    46.6 49.9   1.17 
## 2 temp      1818   35.2  40.6   38.9  38.3  39.2   0.9  0.593  38.6  0.933 0.022
## # ℹ 1 more variable: ci <dbl>
  • Compute the table with frequencies for the variables “temp” and “age_cat” (including missing values)
linelist %>% 
  tabyl(temp, age_cat)
##  temp 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70+ NA_
##  35.2   0   0     1     0     0     0     0   0   0
##  35.3   0   0     0     0     1     0     0   0   0
##  35.5   0   1     0     0     1     0     0   0   0
##  35.6   0   0     1     0     1     0     0   0   0
##  35.7   0   0     0     0     0     1     0   0   0
##  35.8   0   1     1     0     1     2     0   0   0
##  35.9   3   0     2     4     2     0     2   0   1
##  36.0   1   3     4     1     2     0     0   0   0
##  36.1   2   3     4     4     5     2     0   0   1
##  36.2   3   6     7     3     7     4     1   0   0
##  36.3  11   5     8     5     5     4     0   0   1
##  36.4  12  11    16     8     8     6     2   0   0
##  36.5  16  14     8    12    13     7     1   0   1
##  36.6  18  15     6     7    15     6     3   0   1
##  36.7  21  13     9     5    13    13     2   0   1
##  36.8  13  23    14     4    15    12     3   0   1
##  36.9  21  19    18    14    17    16     3   0   0
##  37.0  17  21    14    15    17    19     2   0   2
##  37.1  14  21    22    10    17    11     2   0   1
##  37.2  19  13    17    14    11    16     1   0   0
##  37.3  17  16    16    16     9    14     2   0   1
##  37.4  18   9    14    12    10    17     1   0   1
##  37.5  17  11    18     7    10     9     2   0   0
##  37.6  16   5    13     8    10     5     1   0   0
##  37.7   6   6     2     4     6     5     0   0   1
##  37.8   4   5     3     7     3     5     1   0   0
##  37.9   7   7     5     4     5     1     0   0   1
##  38.0   3   5     5     3     5     5     3   0   1
##  38.1  11  12    20    14    12    12     0   1   3
##  38.2  26  18    21    12    18     9     0   0   2
##  38.3  25  22    25    20    24    15     1   0   4
##  38.4  32  30    21    17    34    28     1   0   2
##  38.5  45  48    31    29    40    24     2   1   2
##  38.6  62  45    36    38    45    35     6   0   0
##  38.7  47  60    47    41    60    35     5   0   3
##  38.8  58  76    57    33    59    46     5   0   8
##  38.9  60  64    47    43    74    48     4   1   8
##  39.0  54  66    58    36    63    53     2   1  10
##  39.1  73  77    60    33    53    57     9   0   9
##  39.2  64  77    58    38    67    35     5   0   5
##  39.3  61  53    40    37    72    38     7   1   3
##  39.4  49  63    40    33    49    39     2   0   5
##  39.5  42  40    37    32    40    21     4   0   3
##  39.6  35  26    22    30    32    17     5   1   0
##  39.7  20  19    22    25    35    17     1   0   0
##  39.8  17  12    18     9    11    10     0   0   2
##  39.9   6  11    12     7    14     1     2   0   1
##  40.0   6   9     8    10    12     5     0   0   0
##  40.1   7   4     5     8     3     7     0   0   1
##  40.2   3   2     2     2     5     2     0   0   0
##  40.3   1   1     2     4     2     1     0   0   0
##  40.4   3   0     2     0     2     2     0   0   0
##  40.5   1   1     0     1     2     1     0   0   0
##  40.6   0   0     1     2     0     0     0   0   0
##  40.7   0   0     1     0     0     0     0   0   0
##  40.8   0   0     0     0     1     0     0   0   0
##    NA  28  26    20    22    35    16     2   0   0
  • Compute a table of cross-tabulation between the variables “gender” and “age_cat5”
linelist %>% 
  tabyl(gender, age_cat)
##  gender 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70+ NA_
##       f 640 641   518   359   468   179     2   0   0
##       m 416 412   383   364   575   557    91   5   0
##    <NA>  39  42    40    20    30    18     2   1  86

3) Summary Statistics by subgroup

  • With the help of the command “summarise()”, create a new data frame, grouped by “hospital”, including the following summaries for the variable “bmi”: min, max, mean, standard deviation.
s <- linelist %>% 
  group_by(hospital) %>% 
  summarise(min=min(bmi), max=max(bmi), mean=mean(bmi), standard_deviation=sd(bmi))
s
## # A tibble: 6 × 5
##   hospital                                 min   max  mean standard_deviation
##   <chr>                                  <dbl> <dbl> <dbl>              <dbl>
## 1 Central Hospital                      -156.   336   45.9               40.5
## 2 Military Hospital                      -39.1  741.  47.3               46.4
## 3 Missing                               -309.   741.  47.9               50.1
## 4 Other                                -1200    494.  44.9               69.5
## 5 Port Hospital                         -592.   816.  46.7               55.4
## 6 St. Mark's Maternity Hospital (SMMH)    12.7 1250   48.3               69.3
  • Extend the previous table to include the values for the 0.05 and 0.95 percentiles of the variable “bmi”)
s <- linelist %>%
  na.omit() %>% 
  group_by(hospital) %>% 
  reframe(min=min(bmi), max=max(bmi), mean=mean(bmi), standard_deviation=sd(bmi), bmi_p05 = quantile(bmi, probs = 0.05), bmi_p95 = quantile(bmi, probs = 0.95))
s
## # A tibble: 6 × 7
##   hospital                    min   max  mean standard_deviation bmi_p05 bmi_p95
##   <chr>                     <dbl> <dbl> <dbl>              <dbl>   <dbl>   <dbl>
## 1 Central Hospital           13.1  211.  45.3               33.3    17.1    107.
## 2 Military Hospital           0    301.  48.9               43.9    17.9    132.
## 3 Missing                  -309.   741.  49.6               63.0    18.0    144.
## 4 Other                    -136.   429.  48.1               49.2    17.2    129.
## 5 Port Hospital            -592.   352.  43.5               49.2    17.0    113.
## 6 St. Mark's Maternity Ho…   12.8  171.  43.5               28.5    17.8    112.
  • For each hospital, how many individuals have a bmi <=18.5? How many individuals have a bmi >=30? Adapt the previous data frame to include these information.
t <- linelist %>% 
  group_by(hospital) %>% 
  summarise(
    underweight=sum(bmi<=18.5, na.rm = T),
    overweight=sum(bmi>=30, na.rm = T)
  )
t
## # A tibble: 6 × 3
##   hospital                             underweight overweight
##   <chr>                                      <int>      <int>
## 1 Central Hospital                              32        262
## 2 Military Hospital                             60        495
## 3 Missing                                      104        818
## 4 Other                                         59        493
## 5 Port Hospital                                121        999
## 6 St. Mark's Maternity Hospital (SMMH)          33        244

4) Statistical Tests

T test

A t-test is a statistical test that you can use to compare the mean of two samples. The t-test is based on the following assumptions: the data are independent, they are (approximately) normally distributed, they are an approximately similar variance.

  • Perform a t-test to compare the mean height by gender.
t.test(ht_cm~gender, data=linelist)
## 
##  Welch Two Sample t-test
## 
## data:  ht_cm by gender
## t = -26.334, df = 5596.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group f and group m is not equal to 0
## 95 percent confidence interval:
##  -35.29745 -30.40620
## sample estimates:
## mean in group f mean in group m 
##        108.6669        141.5187
  • Perform a t-test to test whether mean bmi is significantly different from 25.
t.test(linelist$bmi, mu=25)
## 
##  One Sample t-test
## 
## data:  linelist$bmi
## t = 30.326, df = 5887, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 25
## 95 percent confidence interval:
##  45.47518 48.30528
## sample estimates:
## mean of x 
##  46.89023

Shapiro-Wilk Test

The Shapiro-Wilk is used to test assumption of normal distribution. If the value of the W statistic is too extreme, this gives an indication that your data would not conform to the Normal distribution.

  • Perform a Shapiro-Wilk test on the first 5000 records for the variable height. What can you conclude?
linelist1 <- linelist %>% 
  slice(1:5000)
shapiro.test(linelist1$ht_cm)
## 
##  Shapiro-Wilk normality test
## 
## data:  linelist1$ht_cm
## W = 0.99016, p-value < 2.2e-16

Chi-square test

A Chi-square test assesse if there is a statistically significant association between the rows and the columns of the table. The test computes the values of the contingency table that one should expect under the assumption of rows and columns being independent. If the actual values are too extreme when compared to the expected value, then one is forced to reject the assumption of independence in favour of one of dependence between the two.

  • Perform a Chi-square test between gender and fever. What can you conclude?
linelist %>% 
  freq_table(fever, na.rm = F) 
## # A tibble: 3 × 3
##   fever     n  prop
##   <chr> <int> <dbl>
## 1 no     1090  18.5
## 2 yes    4549  77.3
## 3 <NA>    249   4.2
linelist %>% 
  freq_table(gender, na.rm = F)
## # A tibble: 3 × 3
##   gender     n  prop
##   <chr>  <int> <dbl>
## 1 f       2807  47.7
## 2 m       2803  47.6
## 3 <NA>     278   4.7
linelist %>% 
  tabyl(gender,fever)
##  gender  no  yes NA_
##       f 520 2161 126
##       m 518 2172 113
##    <NA>  52  216  10
linelist %>% 
  select(gender,fever) %>% 
  tbl_summary(by=fever) %>% 
  add_p()
## 249 observations missing `fever` have been removed. To include these observations, use `forcats::fct_na_value_to_level()` on `fever` column before passing to `tbl_summary()`.
Characteristic no, N = 1,0901 yes, N = 4,5491 p-value2
gender

0.9
    f 520 (50%) 2,161 (50%)
    m 518 (50%) 2,172 (50%)
    Unknown 52 216
1 n (%)
2 Pearson’s Chi-squared test

Several data points are missing in the gender. There is no significant association between gender and fever (p>0.05)

Correlation

The correlation coefficient is a number ranging between -1 and +1 expressing the strength of a linear relation between two variables, if a linear relationship is indeed present to begin with. The sign of the coefficient tells whether the correlation, if present, is positive (as one variable increases, so does the other) or negative (as one variable increases, the other decreases, and vice versa). A value on or around zero expresses the lack of a linear relation between the two variables considered, the two variables just don’t happen to be related.

  • Generate a correlation table between the variables height, bmi, and age.
library(tidyverse)
correlation_table <- linelist %>% 
  select(bmi, age, ht_cm) %>% 
  cor()
correlation_table
##              bmi age      ht_cm
## bmi    1.0000000  NA -0.5407457
## age           NA   1         NA
## ht_cm -0.5407457  NA  1.0000000
class(linelist$bmi)
## [1] "numeric"
cor(linelist$age,linelist$bmi)
## [1] NA