Exercises

1) Dataset Exploration

Browse the data to get an overview of the variables present in your dataset. Compare the output from skim() with the one of summary().

skim(linelist)

Data summary
Name	linelist
Number of rows	5888
Number of columns	30
_______________________
Column type frequency:
character	13
Date	4
factor	2
numeric	11
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
case_id	0	1.00	6	6	5888
outcome	1323	0.78	5	7	2
gender	278	0.95	1	1	2
age_unit	0	1.00	5	6	2
hospital	0	1.00	5	36	6
infector	2088	0.65	6	6	2697
source	2088	0.65	5	7	2
fever	249	0.96	2	3	2
chills	249	0.96	2	3	2
cough	249	0.96	2	3	2
aches	249	0.96	2	3	2
vomit	249	0.96	2	3	2
time_admission	765	0.87	5	5	1072

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date_infection	2087	0.65	2014-03-19	2015-04-27	2014-10-11	359
date_onset	256	0.96	2014-04-07	2015-04-30	2014-10-23	367
date_hospitalisation	0	1.00	2014-04-17	2015-04-30	2014-10-23	363
date_outcome	936	0.84	2014-04-19	2015-06-04	2014-11-01	371

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
age_cat	86	0.99	FALSE	8	0-4: 1095, 5-9: 1095, 20-: 1073, 10-: 941
age_cat5	86	0.99	FALSE	17	0-4: 1095, 5-9: 1095, 10-: 941, 15-: 743

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
generation	0	1.00	16.56	5.79	0.00	13.00	16.00	20.00	37.00	▁▆▇▂▁
age	86	0.99	16.07	12.62	0.00	6.00	13.00	23.00	84.00	▇▅▁▁▁
age_years	86	0.99	16.02	12.64	0.00	6.00	13.00	23.00	84.00	▇▅▁▁▁
lon	0	1.00	-13.23	0.02	-13.27	-13.25	-13.23	-13.22	-13.21	▅▃▃▆▇
lat	0	1.00	8.47	0.01	8.45	8.46	8.47	8.48	8.49	▅▇▇▇▆
wt_kg	0	1.00	52.64	18.58	-11.00	41.00	54.00	66.00	111.00	▁▃▇▅▁
ht_cm	0	1.00	124.96	49.52	4.00	91.00	129.00	159.00	295.00	▂▅▇▂▁
ct_blood	0	1.00	21.21	1.69	16.00	20.00	22.00	22.00	26.00	▁▃▇▃▁
temp	149	0.97	38.56	0.98	35.20	38.20	38.80	39.20	40.80	▁▂▂▇▁
bmi	0	1.00	46.89	55.39	-1200.00	24.56	32.12	50.01	1250.00	▁▁▇▁▁
days_onset_hosp	256	0.96	2.06	2.26	0.00	1.00	1.00	3.00	22.00	▇▁▁▁▁

summary(linelist)

##    case_id            generation    date_infection         date_onset        
##  Length:5888        Min.   : 0.00   Min.   :2014-03-19   Min.   :2014-04-07  
##  Class :character   1st Qu.:13.00   1st Qu.:2014-09-06   1st Qu.:2014-09-16  
##  Mode  :character   Median :16.00   Median :2014-10-11   Median :2014-10-23  
##                     Mean   :16.56   Mean   :2014-10-22   Mean   :2014-11-03  
##                     3rd Qu.:20.00   3rd Qu.:2014-12-05   3rd Qu.:2014-12-19  
##                     Max.   :37.00   Max.   :2015-04-27   Max.   :2015-04-30  
##                                     NA's   :2087         NA's   :256         
##  date_hospitalisation  date_outcome          outcome         
##  Min.   :2014-04-17   Min.   :2014-04-19   Length:5888       
##  1st Qu.:2014-09-19   1st Qu.:2014-09-26   Class :character  
##  Median :2014-10-23   Median :2014-11-01   Mode  :character  
##  Mean   :2014-11-03   Mean   :2014-11-12                     
##  3rd Qu.:2014-12-17   3rd Qu.:2014-12-28                     
##  Max.   :2015-04-30   Max.   :2015-06-04                     
##                       NA's   :936                            
##     gender               age          age_unit           age_years    
##  Length:5888        Min.   : 0.00   Length:5888        Min.   : 0.00  
##  Class :character   1st Qu.: 6.00   Class :character   1st Qu.: 6.00  
##  Mode  :character   Median :13.00   Mode  :character   Median :13.00  
##                     Mean   :16.07                      Mean   :16.02  
##                     3rd Qu.:23.00                      3rd Qu.:23.00  
##                     Max.   :84.00                      Max.   :84.00  
##                     NA's   :86                         NA's   :86     
##     age_cat        age_cat5      hospital              lon        
##  0-4    :1095   0-4    :1095   Length:5888        Min.   :-13.27  
##  5-9    :1095   5-9    :1095   Class :character   1st Qu.:-13.25  
##  20-29  :1073   10-14  : 941   Mode  :character   Median :-13.23  
##  10-14  : 941   15-19  : 743                      Mean   :-13.23  
##  30-49  : 754   20-24  : 638                      3rd Qu.:-13.22  
##  (Other): 844   (Other):1290                      Max.   :-13.21  
##  NA's   :  86   NA's   :  86                                      
##       lat          infector            source              wt_kg       
##  Min.   :8.446   Length:5888        Length:5888        Min.   :-11.00  
##  1st Qu.:8.461   Class :character   Class :character   1st Qu.: 41.00  
##  Median :8.469   Mode  :character   Mode  :character   Median : 54.00  
##  Mean   :8.470                                         Mean   : 52.64  
##  3rd Qu.:8.480                                         3rd Qu.: 66.00  
##  Max.   :8.492                                         Max.   :111.00  
##                                                                        
##      ht_cm        ct_blood        fever              chills         
##  Min.   :  4   Min.   :16.00   Length:5888        Length:5888       
##  1st Qu.: 91   1st Qu.:20.00   Class :character   Class :character  
##  Median :129   Median :22.00   Mode  :character   Mode  :character  
##  Mean   :125   Mean   :21.21                                        
##  3rd Qu.:159   3rd Qu.:22.00                                        
##  Max.   :295   Max.   :26.00                                        
##                                                                     
##     cough              aches              vomit                temp      
##  Length:5888        Length:5888        Length:5888        Min.   :35.20  
##  Class :character   Class :character   Class :character   1st Qu.:38.20  
##  Mode  :character   Mode  :character   Mode  :character   Median :38.80  
##                                                           Mean   :38.56  
##                                                           3rd Qu.:39.20  
##                                                           Max.   :40.80  
##                                                           NA's   :149    
##  time_admission          bmi           days_onset_hosp 
##  Length:5888        Min.   :-1200.00   Min.   : 0.000  
##  Class :character   1st Qu.:   24.56   1st Qu.: 1.000  
##  Mode  :character   Median :   32.12   Median : 1.000  
##                     Mean   :   46.89   Mean   : 2.059  
##                     3rd Qu.:   50.01   3rd Qu.: 3.000  
##                     Max.   : 1250.00   Max.   :22.000  
##                                        NA's   :256

Skim result is easier to read because it shows firstly the summary of descreption and then describle each variable into the table

How many individuals have missing data on the variables “temp” and “age_years”?

skim(linelist$temp)

Data summary
Name	linelist$temp
Number of rows	5888
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
data	149	0.97	38.56	0.98	35.2	38.2	38.8	39.2	40.8	▁▂▂▇▁

2) Summary Statistics

Compute the following summary statistics for the variables “bmi” and “temp”: mean, median, standard deviation, interquartile range, range.

# Method 1
fch1 <- function(x)
{
  mean=mean(x)
  median=median(x)
  standard_deviation=sd(x)
  interquartile_range=quantile(x)
  range=range(x)
  return(c("mean"=mean, "media"=median, "sd"=standard_deviation, "iqr"=interquartile_range, "range"=range))
}
fch1(linelist$bmi)

##        mean       media          sd      iqr.0%     iqr.25%     iqr.50% 
##    46.89023    32.12396    55.38829 -1200.00000    24.56033    32.12396 
##     iqr.75%    iqr.100%      range1      range2 
##    50.00676  1250.00000 -1200.00000  1250.00000

# Method 2
linelist %>% 
  get_summary_stats(
    bmi, temp
  )

## # A tibble: 2 × 13
##   variable     n     min    max median    q1    q3   iqr    mad  mean     sd
##   <fct>    <dbl>   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>
## 1 bmi       5888 -1200   1250     32.1  24.6  50.0  25.4 14.4    46.9 55.4  
## 2 temp      5739    35.2   40.8   38.8  38.2  39.2   1    0.741  38.6  0.977
## # ℹ 2 more variables: se <dbl>, ci <dbl>

Compute the mean and median again, but this time add the option “na.rm = TRUE”. Do you notice any difference in the results? Why do you think this happens?

# Method 1
fch1 <- function(x)
{
  mean=mean(x, na.rm = T)
  median=median(x, na.rm = T)
  standard_deviation=sd(x, na.rm = T)
  interquartile_range=quantile(x, na.rm = T)
  range=range(x, na.rm = T)
  return(c("mean"=mean, "media"=median, "sd"=standard_deviation, "iqr"=interquartile_range, "range"=range))
}
fch1(linelist$bmi)

##        mean       media          sd      iqr.0%     iqr.25%     iqr.50% 
##    46.89023    32.12396    55.38829 -1200.00000    24.56033    32.12396 
##     iqr.75%    iqr.100%      range1      range2 
##    50.00676  1250.00000 -1200.00000  1250.00000

fch1(linelist$temp)

##       mean      media         sd     iqr.0%    iqr.25%    iqr.50%    iqr.75% 
## 38.5582854 38.8000000  0.9767722 35.2000000 38.2000000 38.8000000 39.2000000 
##   iqr.100%     range1     range2 
## 40.8000000 35.2000000 40.8000000

# Method 2
linelist %>% 
  na.omit() %>% 
  get_summary_stats(bmi, temp)

## # A tibble: 2 × 13
##   variable     n    min   max median    q1    q3   iqr    mad  mean     sd    se
##   <fct>    <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
## 1 bmi       1818 -592.  741.    32.5  24.3  51.2  26.9 14.9    46.6 49.9   1.17 
## 2 temp      1818   35.2  40.6   38.9  38.3  39.2   0.9  0.593  38.6  0.933 0.022
## # ℹ 1 more variable: ci <dbl>

Compute the table with frequencies for the variables “temp” and “age_cat” (including missing values)

linelist %>% 
  tabyl(temp, age_cat)

##  temp 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70+ NA_
##  35.2   0   0     1     0     0     0     0   0   0
##  35.3   0   0     0     0     1     0     0   0   0
##  35.5   0   1     0     0     1     0     0   0   0
##  35.6   0   0     1     0     1     0     0   0   0
##  35.7   0   0     0     0     0     1     0   0   0
##  35.8   0   1     1     0     1     2     0   0   0
##  35.9   3   0     2     4     2     0     2   0   1
##  36.0   1   3     4     1     2     0     0   0   0
##  36.1   2   3     4     4     5     2     0   0   1
##  36.2   3   6     7     3     7     4     1   0   0
##  36.3  11   5     8     5     5     4     0   0   1
##  36.4  12  11    16     8     8     6     2   0   0
##  36.5  16  14     8    12    13     7     1   0   1
##  36.6  18  15     6     7    15     6     3   0   1
##  36.7  21  13     9     5    13    13     2   0   1
##  36.8  13  23    14     4    15    12     3   0   1
##  36.9  21  19    18    14    17    16     3   0   0
##  37.0  17  21    14    15    17    19     2   0   2
##  37.1  14  21    22    10    17    11     2   0   1
##  37.2  19  13    17    14    11    16     1   0   0
##  37.3  17  16    16    16     9    14     2   0   1
##  37.4  18   9    14    12    10    17     1   0   1
##  37.5  17  11    18     7    10     9     2   0   0
##  37.6  16   5    13     8    10     5     1   0   0
##  37.7   6   6     2     4     6     5     0   0   1
##  37.8   4   5     3     7     3     5     1   0   0
##  37.9   7   7     5     4     5     1     0   0   1
##  38.0   3   5     5     3     5     5     3   0   1
##  38.1  11  12    20    14    12    12     0   1   3
##  38.2  26  18    21    12    18     9     0   0   2
##  38.3  25  22    25    20    24    15     1   0   4
##  38.4  32  30    21    17    34    28     1   0   2
##  38.5  45  48    31    29    40    24     2   1   2
##  38.6  62  45    36    38    45    35     6   0   0
##  38.7  47  60    47    41    60    35     5   0   3
##  38.8  58  76    57    33    59    46     5   0   8
##  38.9  60  64    47    43    74    48     4   1   8
##  39.0  54  66    58    36    63    53     2   1  10
##  39.1  73  77    60    33    53    57     9   0   9
##  39.2  64  77    58    38    67    35     5   0   5
##  39.3  61  53    40    37    72    38     7   1   3
##  39.4  49  63    40    33    49    39     2   0   5
##  39.5  42  40    37    32    40    21     4   0   3
##  39.6  35  26    22    30    32    17     5   1   0
##  39.7  20  19    22    25    35    17     1   0   0
##  39.8  17  12    18     9    11    10     0   0   2
##  39.9   6  11    12     7    14     1     2   0   1
##  40.0   6   9     8    10    12     5     0   0   0
##  40.1   7   4     5     8     3     7     0   0   1
##  40.2   3   2     2     2     5     2     0   0   0
##  40.3   1   1     2     4     2     1     0   0   0
##  40.4   3   0     2     0     2     2     0   0   0
##  40.5   1   1     0     1     2     1     0   0   0
##  40.6   0   0     1     2     0     0     0   0   0
##  40.7   0   0     1     0     0     0     0   0   0
##  40.8   0   0     0     0     1     0     0   0   0
##    NA  28  26    20    22    35    16     2   0   0

Compute a table of cross-tabulation between the variables “gender” and “age_cat5”

linelist %>% 
  tabyl(gender, age_cat)

##  gender 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70+ NA_
##       f 640 641   518   359   468   179     2   0   0
##       m 416 412   383   364   575   557    91   5   0
##    <NA>  39  42    40    20    30    18     2   1  86

3) Summary Statistics by subgroup

With the help of the command “summarise()”, create a new data frame, grouped by “hospital”, including the following summaries for the variable “bmi”: min, max, mean, standard deviation.

s <- linelist %>% 
  group_by(hospital) %>% 
  summarise(min=min(bmi), max=max(bmi), mean=mean(bmi), standard_deviation=sd(bmi))
s

## # A tibble: 6 × 5
##   hospital                                 min   max  mean standard_deviation
##   <chr>                                  <dbl> <dbl> <dbl>              <dbl>
## 1 Central Hospital                      -156.   336   45.9               40.5
## 2 Military Hospital                      -39.1  741.  47.3               46.4
## 3 Missing                               -309.   741.  47.9               50.1
## 4 Other                                -1200    494.  44.9               69.5
## 5 Port Hospital                         -592.   816.  46.7               55.4
## 6 St. Mark's Maternity Hospital (SMMH)    12.7 1250   48.3               69.3

Extend the previous table to include the values for the 0.05 and 0.95 percentiles of the variable “bmi”)

s <- linelist %>%
  na.omit() %>% 
  group_by(hospital) %>% 
  reframe(min=min(bmi), max=max(bmi), mean=mean(bmi), standard_deviation=sd(bmi), bmi_p05 = quantile(bmi, probs = 0.05), bmi_p95 = quantile(bmi, probs = 0.95))
s

## # A tibble: 6 × 7
##   hospital                    min   max  mean standard_deviation bmi_p05 bmi_p95
##   <chr>                     <dbl> <dbl> <dbl>              <dbl>   <dbl>   <dbl>
## 1 Central Hospital           13.1  211.  45.3               33.3    17.1    107.
## 2 Military Hospital           0    301.  48.9               43.9    17.9    132.
## 3 Missing                  -309.   741.  49.6               63.0    18.0    144.
## 4 Other                    -136.   429.  48.1               49.2    17.2    129.
## 5 Port Hospital            -592.   352.  43.5               49.2    17.0    113.
## 6 St. Mark's Maternity Ho…   12.8  171.  43.5               28.5    17.8    112.

For each hospital, how many individuals have a bmi <=18.5? How many individuals have a bmi >=30? Adapt the previous data frame to include these information.

t <- linelist %>% 
  group_by(hospital) %>% 
  summarise(
    underweight=sum(bmi<=18.5, na.rm = T),
    overweight=sum(bmi>=30, na.rm = T)
  )
t

## # A tibble: 6 × 3
##   hospital                             underweight overweight
##   <chr>                                      <int>      <int>
## 1 Central Hospital                              32        262
## 2 Military Hospital                             60        495
## 3 Missing                                      104        818
## 4 Other                                         59        493
## 5 Port Hospital                                121        999
## 6 St. Mark's Maternity Hospital (SMMH)          33        244

4) Statistical Tests

T test

A t-test is a statistical test that you can use to compare the mean of two samples. The t-test is based on the following assumptions: the data are independent, they are (approximately) normally distributed, they are an approximately similar variance.

Perform a t-test to compare the mean height by gender.

t.test(ht_cm~gender, data=linelist)

## 
##  Welch Two Sample t-test
## 
## data:  ht_cm by gender
## t = -26.334, df = 5596.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group f and group m is not equal to 0
## 95 percent confidence interval:
##  -35.29745 -30.40620
## sample estimates:
## mean in group f mean in group m 
##        108.6669        141.5187

Perform a t-test to test whether mean bmi is significantly different from 25.

t.test(linelist$bmi, mu=25)

## 
##  One Sample t-test
## 
## data:  linelist$bmi
## t = 30.326, df = 5887, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 25
## 95 percent confidence interval:
##  45.47518 48.30528
## sample estimates:
## mean of x 
##  46.89023

Shapiro-Wilk Test

The Shapiro-Wilk is used to test assumption of normal distribution. If the value of the W statistic is too extreme, this gives an indication that your data would not conform to the Normal distribution.

Perform a Shapiro-Wilk test on the first 5000 records for the variable height. What can you conclude?

linelist1 <- linelist %>% 
  slice(1:5000)
shapiro.test(linelist1$ht_cm)

## 
##  Shapiro-Wilk normality test
## 
## data:  linelist1$ht_cm
## W = 0.99016, p-value < 2.2e-16

Chi-square test

A Chi-square test assesse if there is a statistically significant association between the rows and the columns of the table. The test computes the values of the contingency table that one should expect under the assumption of rows and columns being independent. If the actual values are too extreme when compared to the expected value, then one is forced to reject the assumption of independence in favour of one of dependence between the two.

Perform a Chi-square test between gender and fever. What can you conclude?

linelist %>% 
  freq_table(fever, na.rm = F)

## # A tibble: 3 × 3
##   fever     n  prop
##   <chr> <int> <dbl>
## 1 no     1090  18.5
## 2 yes    4549  77.3
## 3 <NA>    249   4.2

linelist %>% 
  freq_table(gender, na.rm = F)

## # A tibble: 3 × 3
##   gender     n  prop
##   <chr>  <int> <dbl>
## 1 f       2807  47.7
## 2 m       2803  47.6
## 3 <NA>     278   4.7

linelist %>% 
  tabyl(gender,fever)

##  gender  no  yes NA_
##       f 520 2161 126
##       m 518 2172 113
##    <NA>  52  216  10

linelist %>% 
  select(gender,fever) %>% 
  tbl_summary(by=fever) %>% 
  add_p()

## 249 observations missing `fever` have been removed. To include these observations, use `forcats::fct_na_value_to_level()` on `fever` column before passing to `tbl_summary()`.

Characteristic	no, N = 1,090¹	yes, N = 4,549¹	p-value²
gender			0.9
f	520 (50%)	2,161 (50%)
m	518 (50%)	2,172 (50%)
Unknown	52	216
¹ n (%)
² Pearson’s Chi-squared test

Several data points are missing in the gender. There is no significant association between gender and fever (p>0.05)

Correlation

The correlation coefficient is a number ranging between -1 and +1 expressing the strength of a linear relation between two variables, if a linear relationship is indeed present to begin with. The sign of the coefficient tells whether the correlation, if present, is positive (as one variable increases, so does the other) or negative (as one variable increases, the other decreases, and vice versa). A value on or around zero expresses the lack of a linear relation between the two variables considered, the two variables just don’t happen to be related.

Generate a correlation table between the variables height, bmi, and age.

library(tidyverse)
correlation_table <- linelist %>% 
  select(bmi, age, ht_cm) %>% 
  cor()
correlation_table

##              bmi age      ht_cm
## bmi    1.0000000  NA -0.5407457
## age           NA   1         NA
## ht_cm -0.5407457  NA  1.0000000

class(linelist$bmi)

## [1] "numeric"

cor(linelist$age,linelist$bmi)

## [1] NA

Practical_3: Data Analysis

Thanhthuy

Install required packages

Import data