This article is part of the basic statistical analysis topic of the FETP training, Thailand. It aims to provide basic R code for comparing continuous data, for readers who are not familiar with R. The data set used in this article is not provided.
When comparing continuous data, we compare a measure of central tendency, such as the mean, median, or mode, chosen according to the distribution of the data (normal vs. non-normal). These comparisons are divided into one-sample, two-sample (independent or dependent), and more-than-two-sample comparisons. Each type has its own statistical test with its own required assumptions.
We use the one-sample t-test to find out whether our data differ from a specific value, e.g. a value from a prior study or expert opinion. The test assumes:
1. Random sample
2. Independent observations
3. The population is normally distributed
In R, we can use t.test() and set the mu = argument
to the specific value we want to compare against.
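Because the course data set is not distributed with this article, the following is a minimal sketch on simulated data (the vector `age_sim` and its parameters are assumptions made purely for illustration). It shows that the statistic reported by t.test() is the sample mean minus mu, divided by the standard error.

```r
# Simulated ages; NOT the course data set
set.seed(123)
age_sim <- rnorm(24, mean = 51, sd = 8)

# One-sample t-test against the hypothesized mean of 50
res <- t.test(x = age_sim, mu = 50)

# The same quantities computed by hand
n        <- length(age_sim)
t_manual <- (mean(age_sim) - 50) / (sd(age_sim) / sqrt(n))  # (mean - mu) / standard error
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)               # two-sided p-value

res$statistic   # t reported by t.test(); matches t_manual
res$parameter   # degrees of freedom, n - 1
res$p.value     # matches p_manual
```

Computing the statistic by hand once makes it easier to read the printed output of t.test() later on.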
#Load packages first
library(readxl)
library(tidyverse)
angina <- read_xlsx("dataset_basic_2.xlsx",
sheet = "angina")
head(angina)
## # A tibble: 6 × 13
## id age sex ihdfami fat diabetes smoke alcohol weight height sbp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 59 1 0 1 0 2 2 58.5 164. 185
## 2 2 55 1 0 0 0 1 1 73 174. 133
## 3 3 55 0 0 1 1 0 0 52 161 110
## 4 4 54 0 0 1 1 0 0 51 160 110
## 5 5 53 0 0 1 1 0 0 53 162 110
## 6 6 54 0 0 1 1 0 0 50 159 110
## # … with 2 more variables: dbp <dbl>, angina <dbl>
Say we want to know, from the angina dataset, whether the mean age
among those who have angina is 50.
Our null hypothesis is then “The mean age of those who have angina is equal
to 50 years”.
hist(angina$age,
col = "aquamarine",
main = "Histogram of age",
xlab = "Age (years)")
The histogram suggests that age is approximately normally distributed.
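A histogram can be complemented by a Q-Q plot and the Shapiro-Wilk test. The sketch below uses simulated values (`age_sim` is a hypothetical stand-in, not the course data). With the real data, the same calls could be applied to the ages of the angina cases only, since that is the group being tested.

```r
# Simulated ages standing in for the angina cases (assumption for illustration)
set.seed(42)
age_sim <- rnorm(24, mean = 50, sd = 8)

# Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(age_sim, main = "Normal Q-Q plot of simulated ages")
qqline(age_sim)

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed
sw <- shapiro.test(age_sim)
sw$p.value
```

A small p-value from shapiro.test() would be evidence against normality. With small samples the test has little power, so it is usually read together with the plots.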
t.test(x = subset(angina, angina == 1)$age,
       mu = 50) # define mu = 50
##
## One Sample t-test
##
## data: subset(angina, angina == 1)$age
## t = 0.41517, df = 23, p-value = 0.6819
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
## 47.34488 53.98846
## sample estimates:
## mean of x
## 50.66667
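The p-value is 0.68, so we fail to reject the null hypothesis: there is not enough evidence to conclude that the mean age of those who have angina differs from 50 years. Note that this does not prove the mean is exactly 50; the 95% confidence interval (47.3 to 54.0) shows the range of values compatible with the data.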
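The object returned by t.test() can also be inspected piece by piece instead of reading the printed output. The sketch below, on simulated data (the vector `x` is hypothetical), extracts the estimate, confidence interval, and p-value, and checks the usual correspondence between a two-sided test at the 0.05 level and the 95% confidence interval.

```r
# Simulated ages; NOT the course data set
set.seed(10)
x   <- rnorm(24, mean = 50.5, sd = 8)
res <- t.test(x, mu = 50)

res$estimate   # sample mean
res$conf.int   # 95% confidence interval for the mean
res$p.value    # two-sided p-value

# For a two-sided test, p >= 0.05 corresponds to the 95% CI containing mu
ci_contains_mu <- res$conf.int[1] <= 50 && 50 <= res$conf.int[2]
ci_contains_mu == (res$p.value >= 0.05)
```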
For two samples, we can test the difference between two independent groups with the (unpaired) t-test; if the two groups are dependent, we use the paired t-test instead. The unpaired t-test assumes:
1. Random sample
2. Independent observations
3. The data of each group is normally distributed
4. The variances for the two independent groups are equal
Again we can use t.test(), but we have to supply the second
data group in the y = argument instead of mu =, and we have to
state whether the two groups have equal variances in the
var.equal = argument (the default, var.equal = FALSE, gives Welch's t-test).
We can use var.test() to test the homogeneity of variance of
the two groups.
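The workflow described above can be sketched on simulated data as follows. The vectors `chol_a` and `chol_b` are hypothetical, and using the 0.05 cut-off on the F test to choose var.equal is one common convention rather than the only option.

```r
# Simulated cholesterol values for two independent groups (illustration only)
set.seed(1)
chol_a <- rnorm(20, mean = 245, sd = 35)
chol_b <- rnorm(20, mean = 210, sd = 45)

# F test for equality of the two variances
vt <- var.test(chol_a, chol_b)
vt$p.value

# Assume equal variances only when the F test is not significant at 0.05
equal_var <- vt$p.value >= 0.05

# Pooled-variance t-test if equal_var is TRUE, Welch's t-test otherwise
res <- t.test(x = chol_a, y = chol_b, var.equal = equal_var)
res$method
res$p.value
```

If there is any doubt about the variances, leaving var.equal at its default (Welch's t-test) is a reasonable choice, because it does not rely on the equal-variance assumption.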
personality <- read_xlsx("dataset_basic_2.xlsx",
sheet = "personality")
head(personality)
## # A tibble: 6 × 9
## ID type chol ...4 ...5 ...6 ...7 `type A` `type B`
## <dbl> <chr> <dbl> <lgl> <lgl> <lgl> <lgl> <dbl> <dbl>
## 1 1 A 233 NA NA NA NA 233 344
## 2 2 A 291 NA NA NA NA 291 185
## 3 3 A 312 NA NA NA NA 312 263
## 4 4 A 250 NA NA NA NA 250 246
## 5 5 A 246 NA NA NA NA 246 224
## 6 6 A 197 NA NA NA NA 197 212
The personality dataset records the cholesterol level
and personality type of each person. Say we want to know “Does the cholesterol
level (chol) of those with personality type A differ from that of those with
personality type B?”.
We set the null hypothesis to “The mean cholesterol level of those with personality
type A is equal to that of those with personality type B”.
Let's check the assumptions first.
#Examine the distribution first.
par(mfrow =c(2,1))
hist(personality$`type A`,
main = "Histogram of cholesterol level in personality type A group",
xlab = "Cholesterol level")
hist(personality$`type B`,
main = "Histogram of cholesterol level in personality type B group",
xlab = "Cholesterol level")
#Then check homogeneity of variance.
var.test(personality$`type A`,personality$`type B` )
##
## F test to compare two variances
##
## data: personality$`type A` and personality$`type B`
## F = 0.57446, num df = 19, denom df = 19, p-value = 0.236
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.2273779 1.4513427
## sample estimates:
## ratio of variances
## 0.5744591
The histograms show approximately normal distributions, and the F test gives no evidence that the variances of the two groups differ (p = 0.236). We can therefore perform the t-test with equal variances assumed.
t.test(personality$`type A`, personality$`type B`,
       var.equal = TRUE)
##
## Two Sample t-test
##
## data: personality$`type A` and personality$`type B`
## t = 2.5621, df = 38, p-value = 0.01449
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.293091 62.206909
## sample estimates:
## mean of x mean of y
## 245.05 210.30
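The p-value is less than 0.05, so we reject the null hypothesis and conclude that the mean cholesterol level of those with personality type A (245.05) differs significantly from that of those with personality type B (210.30).
The paired t-test is used when the two groups are dependent, such as measurements taken before and after an intervention. It assumes: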
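When the data are stored in long format, with one column for the group and one for the outcome (like the `type` and `chol` columns in the table printed earlier), t.test() also accepts a formula. The sketch below uses a simulated data frame `sim` (hypothetical values) and shows that the formula call and the two-vector call give the same result.

```r
# Simulated long-format data: one row per person (illustration only)
set.seed(2)
sim <- data.frame(
  type = rep(c("A", "B"), each = 20),
  chol = c(rnorm(20, mean = 245, sd = 40),
           rnorm(20, mean = 210, sd = 40))
)

# Formula interface: outcome ~ grouping variable
res_formula <- t.test(chol ~ type, data = sim, var.equal = TRUE)

# Equivalent call with two separate vectors
res_vectors <- t.test(sim$chol[sim$type == "A"],
                      sim$chol[sim$type == "B"],
                      var.equal = TRUE)

res_formula$p.value
res_vectors$p.value   # same value as above
```

The formula interface avoids having to keep a separate column per group, which is how the wide `type A` and `type B` columns are used above.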
1. Random sample
2. Independent observations
3. The differences are normally distributed
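The third assumption concerns the within-subject differences, not the two sets of raw measurements. A minimal sketch of how this could be checked, using simulated before-and-after values (`before` and `after` are hypothetical names and values):

```r
# Simulated paired measurements for 12 subjects (illustration only)
set.seed(7)
before <- rnorm(12, mean = 4.8, sd = 0.9)
after  <- before - rnorm(12, mean = 0.4, sd = 0.4)

# Within-subject differences: this is the quantity the paired t-test analyses
d <- before - after

# Inspect the distribution of the differences
hist(d, main = "Histogram of paired differences", xlab = "Difference")
sw_d <- shapiro.test(d)
sw_d$p.value
```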
magicegg <- read_xlsx("dataset_basic_2.xlsx",
sheet = "magic_egg")
head(magicegg)
## # A tibble: 6 × 3
## ID re me
## <dbl> <dbl> <dbl>
## 1 1 4.61 3.84
## 2 2 6.42 5.57
## 3 3 5.4 5.85
## 4 4 4.54 4.8
## 5 5 3.98 3.68
## 6 6 3.82 2.96
This dataset records the LDL cholesterol level (mmol/L) of 12 subjects at the
beginning, on a regular diet (re), and then after eating magic egg (me). Does
magic egg reduce LDL cholesterol?
We can use t.test() with the paired = TRUE
argument to compare these before-and-after measurements.
t.test(x=magicegg$re,
y=magicegg$me,
paired = TRUE)
##
## Paired t-test
##
## data: magicegg$re and magicegg$me
## t = 3.0431, df = 11, p-value = 0.01118
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.1053908 0.6562759
## sample estimates:
## mean difference
## 0.3808333
The p-value is less than 0.05, so we conclude that the LDL level before and after eating magic egg differs significantly: the mean LDL level was 0.38 mmol/L lower after eating magic egg (95% CI 0.11 to 0.66).
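Two points about the paired t-test can be illustrated with a short sketch on simulated data (`before` and `after` are hypothetical, not the magic egg measurements). First, the paired t-test is the same as a one-sample t-test on the differences with mu = 0. Second, because the question asks about a reduction, a one-sided test could also be considered through the alternative = argument; whether a one-sided test is appropriate is a design decision that should be made before looking at the data.

```r
# Simulated paired measurements for 12 subjects (illustration only)
set.seed(7)
before <- rnorm(12, mean = 4.8, sd = 0.9)
after  <- before - rnorm(12, mean = 0.4, sd = 0.4)

# Paired t-test and the equivalent one-sample t-test on the differences
paired     <- t.test(before, after, paired = TRUE)
one_sample <- t.test(before - after, mu = 0)

paired$statistic
one_sample$statistic   # same t value

# One-sided alternative: mean of (before - after) is greater than 0,
# i.e. values are lower after the intervention
one_sided <- t.test(before, after, paired = TRUE, alternative = "greater")
one_sided$p.value
```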