Basic statistical analysis with R

This is a part of basic statistical analysis topic of FETP training, Thailand.
This article aim to provide basis R code for those who not familiar with R.
The data set for this article is not provided.

##Setting environment first

library(tidyverse) # Load essential package.
`%notin%` <- Negate(`%in%`) # Code notin operation mean "not in"
eclairplus <- readxl::read_xlsx("dataset_basic data_day1.xlsx",
                                sheet = 1) # Load dataset into R.
head(eclairplus)

## # A tibble: 6 × 21
##      No INSTITUTE HOSPITAL   SEX   AGE   OCC EXPTIME             BEEFCURRY
##   <dbl>     <dbl>    <dbl> <dbl> <dbl> <dbl> <dttm>                  <dbl>
## 1     1         1        1     1    13     1 1990-08-25 18:00:00         1
## 2     2         1        0     1    14     1 1990-08-25 18:00:00         1
## 3     3         1        0     1    13     1 1990-08-25 18:00:00         1
## 4     4         1        0     1    15     1 1990-08-25 18:00:00         1
## 5     5         1        0     1    14     1 1990-08-25 18:00:00         1
## 6     6         1        0     1    11     1 1990-08-25 18:00:00         1
## # … with 13 more variables: SALTEGG <dbl>, ECLAIR <dbl>, WATER <dbl>,
## #   OTHERS <dbl>, ONSET <dttm>, NAUSEA <dbl>, VOMITING <dbl>, ABDPAIN <dbl>,
## #   DIARRHEA <dbl>, OTHSYMP <dbl>, INCPERIOD <dbl>, CASECONT <chr>, BMI <dbl>

glimpse(eclairplus)

## Rows: 1,094
## Columns: 21
## $ No        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ INSTITUTE <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ HOSPITAL  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, …
## $ SEX       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ AGE       <dbl> 13, 14, 13, 15, 14, 11, 19, 17, 15, 20, 18, 17, 21, 16, 17, …
## $ OCC       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EXPTIME   <dttm> 1990-08-25 18:00:00, 1990-08-25 18:00:00, 1990-08-25 18:00:…
## $ BEEFCURRY <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ SALTEGG   <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ ECLAIR    <dbl> 1, 0, 0, 50, 0, 0, 0, 1, 50, 0, 0, 1, 0, 0, 3, 1, 2, 1, 2, 1…
## $ WATER     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ OTHERS    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ONSET     <dttm> 1990-08-25 22:00:00, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ NAUSEA    <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ VOMITING  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ ABDPAIN   <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ DIARRHEA  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ OTHSYMP   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ INCPERIOD <dbl> 240, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 240, 0, 0, 180, 0, …
## $ CASECONT  <chr> "1", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ BMI       <dbl> 16.47505, 18.47097, 23.71044, 20.10565, 16.99689, 21.83215, …

Basic data describe for continuous variables
This dataset contain 22 variables, as the dataset coded AGE, INCPERIOD and BMI are continuous data that, for example, we can describe with central tendency and spreading. But what type of statistics we should use? The answer is it depend on distribution of each variable. So let examine these variables distribution first. We can examine the distribution by using histogram, Q-Q plot, boxplot and perform Shapiro-Wilk normality test. But for example I prefer using plots.

par(mfrow=c(3,3)) ## This line tell R to make our canvas show 3x3 plots
###AGE
## for AGE variable the code 99 mean unknown not 99 years, so we should exclude 99 from our analysis first.
hist(subset(eclairplus,AGE !=99)$AGE, 
     col = "green",
     main = "Histogram of age",
     xlab = "age (years)")
qqnorm(subset(eclairplus,AGE !=99)$AGE)
qqline(subset(eclairplus,AGE !=99)$AGE,
       col = "red")
boxplot(subset(eclairplus,AGE !=99)$AGE,
        main = "boxplot of age",
        horizontal = TRUE)

###BMI
hist(eclairplus$BMI,
     col = "darksalmon",
     main = "Histogram of BMI",
     xlab = "BMI (kg/square-mater)")
qqnorm(eclairplus$BMI)
qqline(eclairplus$BMI,
       col = "red")
boxplot(eclairplus$BMI,
        main = "boxplot of BMI",
        horizontal = TRUE)

###INCPERIOD = inclubation period time in minute.
## for the coded 0 mean not in case group, so no incubation period and 999 mean unknown. Again we should exclude these two coded from our analysis.  
# Note : %notin% operator mean "not in".
hist(subset(eclairplus,INCPERIOD %notin% c(0,999))$INCPERIOD,
     col = "pink",
     main = "Histogram of incubation period",
     xlab = "Incubation period (minutes)")
qqnorm(subset(eclairplus,INCPERIOD %notin% c(0,999))$INCPERIOD)
qqline(subset(eclairplus,INCPERIOD %notin% c(0,999))$INCPERIOD,
       col = "red")
boxplot(subset(eclairplus,INCPERIOD %notin% c(0,999))$INCPERIOD,
        main = "boxplot of incubation period",
        horizontal = TRUE)

As plots show BMI is approximate normal distributed, while AGE and INCPERIOD are not. So the statistics use to describe BMI is mean and S.D. and for AGE and INCPERIOD we should use median and IQR.

###Measure of central tendency and spreading.
median(subset(eclairplus,AGE !=99)$AGE)

## [1] 17

IQR(subset(eclairplus,AGE !=99)$AGE)

## [1] 8

mean(eclairplus$BMI)

## [1] 19.99721

sd(eclairplus$BMI)

## [1] 2.001817

median(subset(eclairplus,INCPERIOD %notin% c(0,999))$INCPERIOD)

## [1] 210

IQR(subset(eclairplus,INCPERIOD %notin% c(0,999))$INCPERIOD)

## [1] 60

Conclusion : We can describe that,
The median of age is 17 years and IQR is 8 years.
The mean of BMI is 19.99 and S.D. equal to 2.00 kg per square-meter.
The median of incubation period is 210 minutes and IQR is 60 minutes.

Specific Risk calculation example
For basic epidemiology we can calculate specific risk by using proportion.

###The overall attack rate.
prop.table(table(factor(eclairplus$CASECONT, 
                        labels = c("normal",
                                   "sick"))))

## 
##    normal      sick 
## 0.5703839 0.4296161

The overall attack rate is 42.96% .

###Sex specific attack rate
prop.table(table(factor(eclairplus$CASECONT, 
                        labels =c("normal",
                                  "sick")),
                 factor(eclairplus$SEX,
                        labels = c("female",
                                   "male"))),
           margin = 2)## margin = 2 mean we calculate column-wise proportion.

##         
##             female      male
##   normal 0.6327078 0.5381415
##   sick   0.3672922 0.4618585

The specific attack for female and male are 36.73% and 46.19% respectively.

###Specific attack rate for those who ate beef curry.
prop.table(table(factor(eclairplus$CASECONT,
                        labels = c("normal",
                                  "sick")),
                 factor(eclairplus$BEEFCURRY,
                        labels = c("not eat","eat","unknow"))),
           margin = 2)

##         
##            not eat       eat    unknow
##   normal 0.7692308 0.5501002 1.0000000
##   sick   0.2307692 0.4498998 0.0000000

The specific attack rate for those who ate beef curry is near 45%.

###Specific attack rate for those who ate eclair.
##For eclair we should re-code eclair to group 90 with 0 first because 90 mean unknown and o mean not ate while other coded mean ate eclair in difference quantity. And we only interested in those who ate eclair.
eclairplus <- eclairplus%>%
                  mutate("eclair_eat"= case_when(ECLAIR %in% c(1:20,50,80)~1,
                                                 TRUE ~0))
prop.table(table(factor(eclairplus$CASECONT,
                        labels = c("normal",
                                  "sick")),
                 factor(eclairplus$eclair_eat,
                        labels = c("other","eat"))),
           margin = 2)

##         
##              other       eat
##   normal 0.7907543 0.4377745
##   sick   0.2092457 0.5622255

Specific attack rate for those who ate eclair is 56.22%.

Basic statistical analysis with R

Jirapanakorn Sutham

2022-07-11