Basic statistical analysis with R : continuous data comparison part 1

This article is part of the basic statistical analysis topic of the FETP training, Thailand. It aims to provide basic R code for comparing continuous data, for readers who are not familiar with R. The data set for this article is not provided.

Continuous data comparison

For continuous data comparison we compare a measure of central tendency, such as the mean, median, or mode, chosen according to the distribution of the data (normal vs non-normal). Comparisons are divided into one-sample, two-sample (independent or dependent), and more-than-two-sample comparisons. Each type has its own statistical test with its own required assumptions.

One sample t-test

We use the one-sample t-test to find out whether the mean of our data differs from a specific value, e.g. a value from a prior study or from expert opinion.

Assumptions

        1. Random sample  
        2. Independent observations  
        3. The population is normally distributed

In R we can use t.test() and set its mu argument to the specific value.
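Because the article's data set is not provided, the following is a minimal sketch on simulated data. It shows what t.test() computes when mu is given, by reproducing the statistic by hand. The vector x, the value mu0, and the simulation settings are all hypothetical.

```r
# Simulated sample (hypothetical values, not the article's data)
set.seed(1)
x   <- rnorm(24, mean = 51, sd = 8)
mu0 <- 50  # the specific value we compare against

# Built-in one-sample t-test
res <- t.test(x, mu = mu0)

# The same statistic by hand: t = (mean - mu0) / (sd / sqrt(n))
n        <- length(x)
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

print(res)
# Both comparisons should agree with the built-in result
all.equal(unname(res$statistic), t_manual)
all.equal(res$p.value, p_manual)
```

The hand calculation is only there to make the test statistic visible; in practice t.test() alone is enough.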

Example

#Load packages first
library(readxl)
library(tidyverse)

angina <- read_xlsx("dataset_basic_2.xlsx",
                    sheet = "angina")
head(angina)

## # A tibble: 6 × 13
##      id   age   sex ihdfami   fat diabetes smoke alcohol weight height   sbp
##   <dbl> <dbl> <dbl>   <dbl> <dbl>    <dbl> <dbl>   <dbl>  <dbl>  <dbl> <dbl>
## 1     1    59     1       0     1        0     2       2   58.5   164.   185
## 2     2    55     1       0     0        0     1       1   73     174.   133
## 3     3    55     0       0     1        1     0       0   52     161    110
## 4     4    54     0       0     1        1     0       0   51     160    110
## 5     5    53     0       0     1        1     0       0   53     162    110
## 6     6    54     0       0     1        1     0       0   50     159    110
## # … with 2 more variables: dbp <dbl>, angina <dbl>

Say we want to know, from the angina dataset, whether the mean age among those who have angina is 50.
So our null hypothesis is “The mean age of those who have angina is equal to 50 years”.

hist(angina$age,
     col = "aquamarine",
     main = "Histogram of age",
     xlab = "Age (years)")

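From the histogram, age is approximately normally distributed. Note that this histogram shows age for all subjects, while the test below uses only those with angina.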
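A histogram can be hard to judge in a small sample. As an optional supplement (not part of the original analysis), a normal Q-Q plot and the Shapiro-Wilk test are sometimes used alongside it. The sketch below uses simulated values; age_sim is a hypothetical stand-in for the age of the subjects being tested.

```r
# Simulated ages (hypothetical values, not the article's data)
set.seed(2)
age_sim <- rnorm(24, mean = 50, sd = 8)

# Normal Q-Q plot: points close to the line suggest approximate normality
qqnorm(age_sim, main = "Normal Q-Q plot of simulated age")
qqline(age_sim)

# Shapiro-Wilk test: the null hypothesis is that the data are normal
sw <- shapiro.test(age_sim)
print(sw)
```

A large p-value here does not prove normality; it only means the test found no clear departure from it.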

t.test(x = subset(angina, angina == 1)$age,
       mu = 50) # define mu = 50

## 
##  One Sample t-test
## 
## data:  subset(angina, angina == 1)$age
## t = 0.41517, df = 23, p-value = 0.6819
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
##  47.34488 53.98846
## sample estimates:
## mean of x 
##  50.66667

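The p-value is 0.68, so we fail to reject the null hypothesis: there is not enough evidence that the mean age of those who have angina differs from 50 years. This is not the same as showing that the mean is exactly 50; the 95% confidence interval (47.3 to 54.0) gives the range of plausible values.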

Two samples t-test

For two samples, we can test the difference between two independent groups with the (unpaired) t-test; if the two groups are dependent, we use the paired t-test instead.

Assumptions of two sample t-test

        1. Random sample  
        2. Independent observations  
        3. The data of each group is normally distributed   
        4. The variances for the two independent groups are equal

Again we can use t.test(), but we supply the second group of data with the y argument instead of mu, and we state whether the two groups have equal variances with the var.equal argument (the default, var.equal = FALSE, gives the Welch t-test).
We can use var.test() to test the homogeneity of variance of the two groups.
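The effect of var.equal can be seen in a small sketch on simulated data. The vectors group_a and group_b and their simulation settings are hypothetical; only var.test() and t.test() come from the text above.

```r
# Two simulated independent groups (hypothetical values)
set.seed(3)
group_a <- rnorm(20, mean = 245, sd = 35)
group_b <- rnorm(20, mean = 210, sd = 45)

# F test for homogeneity of variance
vt <- var.test(group_a, group_b)
print(vt)

# Pooled-variance t-test: assumes equal variances, df = n1 + n2 - 2
pooled <- t.test(group_a, group_b, var.equal = TRUE)

# Welch t-test (the default): does not assume equal variances,
# so its degrees of freedom are usually not a whole number
welch <- t.test(group_a, group_b)

pooled$parameter  # degrees of freedom of the pooled test
welch$parameter   # degrees of freedom of the Welch test
```

When the variance test gives no evidence of unequal variances, the two versions usually lead to similar conclusions.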

Example

personality <- read_xlsx("dataset_basic_2.xlsx",
                        sheet = "personality")
head(personality)

## # A tibble: 6 × 9
##      ID type   chol ...4  ...5  ...6  ...7  `type A` `type B`
##   <dbl> <chr> <dbl> <lgl> <lgl> <lgl> <lgl>    <dbl>    <dbl>
## 1     1 A       233 NA    NA    NA    NA         233      344
## 2     2 A       291 NA    NA    NA    NA         291      185
## 3     3 A       312 NA    NA    NA    NA         312      263
## 4     4 A       250 NA    NA    NA    NA         250      246
## 5     5 A       246 NA    NA    NA    NA         246      224
## 6     6 A       197 NA    NA    NA    NA         197      212

The personality dataset records the cholesterol level and personality type of each person. Say we want to know “Does the cholesterol level (chol) of those with personality type A differ from that of those with personality type B?”.
We set the null hypothesis to “The mean cholesterol level of those with personality type A is equal to that of those with personality type B”.
Let us check the assumptions first.

# Examine the distribution first.
par(mfrow = c(2, 1))
hist(personality$`type A`,
     main = "Histogram of cholesterol level in personality type A group",
     xlab = "Cholesterol level")
hist(personality$`type B`,
     main = "Histogram of cholesterol level in personality type B group",
     xlab = "Cholesterol level")

# Then check homogeneity of variance
var.test(personality$`type A`, personality$`type B`)

## 
##  F test to compare two variances
## 
## data:  personality$`type A` and personality$`type B`
## F = 0.57446, num df = 19, denom df = 19, p-value = 0.236
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.2273779 1.4513427
## sample estimates:
## ratio of variances 
##          0.5744591

The plots show approximately normal distributions, and the F test gives no evidence that the variances of the two groups differ (p = 0.236). We can therefore perform the t-test with var.equal = TRUE.

t.test(personality$`type A`, personality$`type B`,
       var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  personality$`type A` and personality$`type B`
## t = 2.5621, df = 38, p-value = 0.01449
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.293091 62.206909
## sample estimates:
## mean of x mean of y 
##    245.05    210.30

The p-value is less than 0.05 (p = 0.014), so we reject the null hypothesis and conclude that the mean cholesterol level of those with personality type A (245.05) differs significantly from that of those with personality type B (210.30).
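The head() output above also shows a long layout, with one column for the group (type) and one for the value (chol). With data in that shape, t.test() accepts a formula instead of two vectors. The sketch below uses a simulated data frame named long (hypothetical) and assumes the long and wide columns hold the same values.

```r
# Simulated long-format data (hypothetical values):
# one row per person, group in `type`, measurement in `chol`
set.seed(4)
long <- data.frame(
  type = rep(c("A", "B"), each = 20),
  chol = c(rnorm(20, mean = 245, sd = 35),
           rnorm(20, mean = 210, sd = 45))
)

# Formula interface: compare chol between the levels of type
res_formula <- t.test(chol ~ type, data = long, var.equal = TRUE)

# Equivalent call with two separate vectors
res_vectors <- t.test(long$chol[long$type == "A"],
                      long$chol[long$type == "B"],
                      var.equal = TRUE)

print(res_formula)
# The two calls should give the same test statistic
all.equal(unname(res_formula$statistic), unname(res_vectors$statistic))
```

The formula form avoids having to keep a separate column per group.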

Another example : dependent (paired) t-test

This test is used when the two groups are dependent, such as before-and-after measurements on the same subjects.

Assumptions of two sample paired t-test

        1. Random sample  
        2. Independent observations  
        3. The differences are normally distributed

“Magic egg” example

magicegg <- read_xlsx("dataset_basic_2.xlsx",
                       sheet = "magic_egg")
head(magicegg)

## # A tibble: 6 × 3
##      ID    re    me
##   <dbl> <dbl> <dbl>
## 1     1  4.61  3.84
## 2     2  6.42  5.57
## 3     3  5.4   5.85
## 4     4  4.54  4.8 
## 5     5  3.98  3.68
## 6     6  3.82  2.96

This dataset records the LDL cholesterol level (mmol/L) of 12 subjects at the beginning, on a regular diet (re), and then after eating the magic egg (me). Does the magic egg reduce LDL cholesterol?
We can use t.test() with the argument paired = TRUE to compare these before-and-after measurements.

t.test(x = magicegg$re,
       y = magicegg$me,
       paired = TRUE)

## 
##  Paired t-test
## 
## data:  magicegg$re and magicegg$me
## t = 3.0431, df = 11, p-value = 0.01118
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.1053908 0.6562759
## sample estimates:
## mean difference 
##       0.3808333

The p-value is less than 0.05 (p = 0.011), so we conclude that the difference in LDL level before and after eating the magic egg is statistically significant (mean difference 0.38 mmol/L, regular diet minus magic egg).
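The paired t-test is a one-sample t-test on the within-subject differences, which is why its normality assumption concerns the differences. The sketch below illustrates this on simulated data; before and after are hypothetical vectors, not the magic egg data. It also shows a one-sided version, which may suit a directional question such as whether a value was reduced.

```r
# Simulated before/after measurements on 12 subjects (hypothetical values)
set.seed(5)
before <- rnorm(12, mean = 5, sd = 1)
after  <- before - rnorm(12, mean = 0.4, sd = 0.4)

# Paired t-test
paired <- t.test(before, after, paired = TRUE)

# One-sample t-test on the differences, testing against 0
one_sample <- t.test(before - after, mu = 0)

# The statistic and p-value should be identical
all.equal(unname(paired$statistic), unname(one_sample$statistic))
all.equal(paired$p.value, one_sample$p.value)

# One-sided alternative: mean of (before - after) is greater than 0
one_sided <- t.test(before, after, paired = TRUE, alternative = "greater")
print(one_sided)
```

The choice between a two-sided and a one-sided test should be made before looking at the data.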

Parametric test

Jirapanakorn Sutham

2022-07-15
