This is a part of basic statistical analysis topic of FETP training,
Thailand.
This article aim to provide basis R code for those who not familiar with
R.
The data set for this article is not provided.
##Setting environment first
library(tidyverse) # Load essential package.
`%notin%` <- Negate(`%in%`) # Code notin operation mean "not in"
eclairplus <- readxl::read_xlsx("dataset_basic data_day1.xlsx",
sheet = 1) # Load dataset into R.
head(eclairplus)
## # A tibble: 6 × 21
## No INSTITUTE HOSPITAL SEX AGE OCC EXPTIME BEEFCURRY
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dttm> <dbl>
## 1 1 1 1 1 13 1 1990-08-25 18:00:00 1
## 2 2 1 0 1 14 1 1990-08-25 18:00:00 1
## 3 3 1 0 1 13 1 1990-08-25 18:00:00 1
## 4 4 1 0 1 15 1 1990-08-25 18:00:00 1
## 5 5 1 0 1 14 1 1990-08-25 18:00:00 1
## 6 6 1 0 1 11 1 1990-08-25 18:00:00 1
## # … with 13 more variables: SALTEGG <dbl>, ECLAIR <dbl>, WATER <dbl>,
## # OTHERS <dbl>, ONSET <dttm>, NAUSEA <dbl>, VOMITING <dbl>, ABDPAIN <dbl>,
## # DIARRHEA <dbl>, OTHSYMP <dbl>, INCPERIOD <dbl>, CASECONT <chr>, BMI <dbl>
glimpse(eclairplus)
## Rows: 1,094
## Columns: 21
## $ No <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ INSTITUTE <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ HOSPITAL <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, …
## $ SEX <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ AGE <dbl> 13, 14, 13, 15, 14, 11, 19, 17, 15, 20, 18, 17, 21, 16, 17, …
## $ OCC <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EXPTIME <dttm> 1990-08-25 18:00:00, 1990-08-25 18:00:00, 1990-08-25 18:00:…
## $ BEEFCURRY <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ SALTEGG <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ ECLAIR <dbl> 1, 0, 0, 50, 0, 0, 0, 1, 50, 0, 0, 1, 0, 0, 3, 1, 2, 1, 2, 1…
## $ WATER <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ OTHERS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ONSET <dttm> 1990-08-25 22:00:00, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ NAUSEA <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ VOMITING <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ ABDPAIN <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ DIARRHEA <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ OTHSYMP <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ INCPERIOD <dbl> 240, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 240, 0, 0, 180, 0, …
## $ CASECONT <chr> "1", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ BMI <dbl> 16.47505, 18.47097, 23.71044, 20.10565, 16.99689, 21.83215, …
Basic data describe for continuous variables
This dataset contain 22 variables, as the dataset coded AGE, INCPERIOD
and BMI are continuous data that, for example, we can describe with
central tendency and spreading. But what type of statistics we should
use? The answer is it depend on distribution of each variable. So let
examine these variables distribution first. We can examine the
distribution by using histogram, Q-Q plot, boxplot and perform
Shapiro-Wilk normality test. But for example I prefer using plots.
par(mfrow=c(3,3)) ## This line tell R to make our canvas show 3x3 plots
###AGE
## for AGE variable the code 99 mean unknown not 99 years, so we should exclude 99 from our analysis first.
hist(subset(eclairplus,AGE !=99)$AGE,
col = "green",
main = "Histogram of age",
xlab = "age (years)")
qqnorm(subset(eclairplus,AGE !=99)$AGE)
qqline(subset(eclairplus,AGE !=99)$AGE,
col = "red")
boxplot(subset(eclairplus,AGE !=99)$AGE,
main = "boxplot of age",
horizontal = TRUE)
###BMI
hist(eclairplus$BMI,
col = "darksalmon",
main = "Histogram of BMI",
xlab = "BMI (kg/square-mater)")
qqnorm(eclairplus$BMI)
qqline(eclairplus$BMI,
col = "red")
boxplot(eclairplus$BMI,
main = "boxplot of BMI",
horizontal = TRUE)
###INCPERIOD = inclubation period time in minute.
## for the coded 0 mean not in case group, so no incubation period and 999 mean unknown. Again we should exclude these two coded from our analysis.
# Note : %notin% operator mean "not in".
hist(subset(eclairplus,INCPERIOD %notin% c(0,999))$INCPERIOD,
col = "pink",
main = "Histogram of incubation period",
xlab = "Incubation period (minutes)")
qqnorm(subset(eclairplus,INCPERIOD %notin% c(0,999))$INCPERIOD)
qqline(subset(eclairplus,INCPERIOD %notin% c(0,999))$INCPERIOD,
col = "red")
boxplot(subset(eclairplus,INCPERIOD %notin% c(0,999))$INCPERIOD,
main = "boxplot of incubation period",
horizontal = TRUE)
As plots show BMI is approximate normal distributed, while AGE and INCPERIOD are not. So the statistics use to describe BMI is mean and S.D. and for AGE and INCPERIOD we should use median and IQR.
###Measure of central tendency and spreading.
median(subset(eclairplus,AGE !=99)$AGE)
## [1] 17
IQR(subset(eclairplus,AGE !=99)$AGE)
## [1] 8
mean(eclairplus$BMI)
## [1] 19.99721
sd(eclairplus$BMI)
## [1] 2.001817
median(subset(eclairplus,INCPERIOD %notin% c(0,999))$INCPERIOD)
## [1] 210
IQR(subset(eclairplus,INCPERIOD %notin% c(0,999))$INCPERIOD)
## [1] 60
Conclusion : We can describe that,
The median of age is 17 years and IQR is 8 years.
The mean of BMI is 19.99 and S.D. equal to 2.00 kg per
square-meter.
The median of incubation period is 210 minutes and IQR is 60
minutes.
Specific Risk calculation example
For basic epidemiology we can calculate specific risk by using
proportion.
###The overall attack rate.
prop.table(table(factor(eclairplus$CASECONT,
labels = c("normal",
"sick"))))
##
## normal sick
## 0.5703839 0.4296161
The overall attack rate is 42.96% .
###Sex specific attack rate
prop.table(table(factor(eclairplus$CASECONT,
labels =c("normal",
"sick")),
factor(eclairplus$SEX,
labels = c("female",
"male"))),
margin = 2)## margin = 2 mean we calculate column-wise proportion.
##
## female male
## normal 0.6327078 0.5381415
## sick 0.3672922 0.4618585
The specific attack for female and male are 36.73% and 46.19% respectively.
###Specific attack rate for those who ate beef curry.
prop.table(table(factor(eclairplus$CASECONT,
labels = c("normal",
"sick")),
factor(eclairplus$BEEFCURRY,
labels = c("not eat","eat","unknow"))),
margin = 2)
##
## not eat eat unknow
## normal 0.7692308 0.5501002 1.0000000
## sick 0.2307692 0.4498998 0.0000000
The specific attack rate for those who ate beef curry is near 45%.
###Specific attack rate for those who ate eclair.
##For eclair we should re-code eclair to group 90 with 0 first because 90 mean unknown and o mean not ate while other coded mean ate eclair in difference quantity. And we only interested in those who ate eclair.
eclairplus <- eclairplus%>%
mutate("eclair_eat"= case_when(ECLAIR %in% c(1:20,50,80)~1,
TRUE ~0))
prop.table(table(factor(eclairplus$CASECONT,
labels = c("normal",
"sick")),
factor(eclairplus$eclair_eat,
labels = c("other","eat"))),
margin = 2)
##
## other eat
## normal 0.7907543 0.4377745
## sick 0.2092457 0.5622255
Specific attack rate for those who ate eclair is 56.22%.