This article is part of the basic statistical analysis topic of the FETP training in Thailand. It aims to provide basic R code for comparing continuous data across more than 2 groups, for readers who are not familiar with R. The data set used in this article is not provided.
To compare a continuous variable across more than 2 groups, we can use
analysis of variance (ANOVA), aov(), for normally distributed data, or
the Kruskal-Wallis test, kruskal.test(), for data that are not normally
distributed. In general, these 2 tests only tell us whether at least one
group differs from the others; they do not tell us which groups differ.
To answer that question, we have to do a post hoc analysis afterward.
ANOVA relies on the following assumptions:
1. Each sample was drawn from a normally distributed population.
2. The variances of the populations that the samples come from are equal.
3. The observations in each group are independent of each other and the observations within groups were obtained by a random sample.
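The first two assumptions can be checked informally in R. The following is a minimal sketch on simulated data, since the article's data set is not provided. The data frame `sim` and its columns `group` and `value` are made up for illustration, and the choice of the Shapiro-Wilk and Bartlett tests is an assumption of this sketch, not something the article prescribes.

```r
# Simulated data for illustration only; this is NOT the undergrad dataset.
set.seed(42)
sim <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 30)),
  value = c(rnorm(30, mean = 10, sd = 2),
            rnorm(30, mean = 12, sd = 2),
            rnorm(30, mean = 11, sd = 2))
)

fit <- aov(value ~ group, data = sim)

# Assumption 1: normality, checked here on the model residuals
normality <- shapiro.test(residuals(fit))

# Assumption 2: equal variances across groups
equal_var <- bartlett.test(value ~ group, data = sim)

normality$p.value   # a small p-value suggests non-normal residuals
equal_var$p.value   # a small p-value suggests unequal variances
```

Small p-values from either test would be a reason to consider kruskal.test() instead of aov(). The third assumption (independence and random sampling) depends on the study design and cannot be tested from the data alone.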
library(readxl)
library(tidyverse)
undergrad <- read_xlsx("dataset_basic_2.xlsx",
                       sheet = "undergrad")
head(undergrad)
## # A tibble: 6 × 23
## ID Gender Classification Height `Shoe Size` `Phone Time` `# of Shoes`
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 male senior 67.8 7 12 12
## 2 2 male freshman 71 7.5 1.5 5
## 3 3 female freshman 64 6 25 15
## 4 4 female freshman 63 6.5 30 30
## 5 5 male senior 69 6.5 23 8
## 6 6 female senior 64 8.5 13 25
## # … with 16 more variables: `Birth order` <chr>, Pets <dbl>, Happy <dbl>,
## # Funny <dbl>, College <chr>, `Bfast Calories` <dbl>, Exercise <dbl>,
## # `Stat Pre` <dbl>, `Stat Post` <dbl>, `Phone Type` <chr>, Sleep <dbl>,
## # `Social Media` <dbl>, `Impact of SocNetworking` <chr>, Political <chr>,
## # Animal <chr>, Superhero <chr>
# Clean variable names
undergrad <- undergrad %>%
  rename(phone_time = `Phone Time`,
         birth_order = `Birth order`)
From the undergrad dataset: does the mean time spent on the
phone differ depending on whether students are the oldest, middle,
youngest, or only child?
So the null hypothesis is that the mean time spent on the phone is equal
among students who are the oldest, middle, youngest, or only child.
boxplot(phone_time ~ birth_order,
        data = undergrad,
        main = "Boxplot",
        ylab = "Time spent on phone",
        xlab = "Birth order")
The plot suggests that the oldest students spent more time on the phone
than the others. To tell whether this difference is significant, we can
use the aov() and summary() functions.
summary(aov(phone_time ~ birth_order,
            data = undergrad))
## Df Sum Sq Mean Sq F value Pr(>F)
## birth_order 3 1782 593.8 3.633 0.0169 *
## Residuals 71 11606 163.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
The p-value from ANOVA (0.0169) tells us that at least one birth order
group spent a different mean amount of time on the phone than the others.
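To see where the F value and p-value in the ANOVA table come from, they can be recomputed by hand from the sums of squares and degrees of freedom. This sketch uses the rounded values printed in the table above, so the results match the table only approximately. The variable names are chosen here for illustration.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_between <- 1782;  df_between <- 3    # birth_order row
ss_within  <- 11606; df_within  <- 71   # Residuals row

# Mean square = sum of squares / degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group mean squares
f_value <- ms_between / ms_within

# Pr(>F) is the upper-tail probability of the F distribution
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(f_value, 2)
round(p_value, 3)
```

A large F value means the variation between group means is large relative to the variation within groups, which is what leads to a small p-value.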
To find which groups differ from each other, we can use
TukeyHSD() for the post hoc analysis.
TukeyHSD(aov(phone_time ~ birth_order,
             data = undergrad))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = phone_time ~ birth_order, data = undergrad)
##
## $birth_order
## diff lwr upr p adj
## oldest-middle 11.70000000 1.125527 22.2744726 0.0242915
## only child-middle 2.35714286 -12.749247 17.4635323 0.9764775
## youngest-middle 2.32692308 -8.164916 12.8187624 0.9367405
## only child-oldest -9.34285714 -23.727017 5.0413028 0.3267672
## youngest-oldest -9.37307692 -18.795377 0.0492235 0.0517176
## youngest-only child -0.03021978 -14.353742 14.2933020 0.9999999
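Tukey's test shows that the adjusted p-value (p adj) is less than 0.05 only for the oldest vs middle comparison (0.024), so we conclude that the oldest students spent significantly more time on the phone than middle children. The youngest vs oldest comparison is borderline (p adj = 0.052, with a confidence interval that just includes 0) and does not reach significance at the 0.05 level.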
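When there are many groups, reading the Tukey table by eye is error-prone, so it can help to filter the significant pairs programmatically. The following is a minimal sketch on simulated data (the article's data set is not provided). The data frame `sim`, its columns, and the group labels are hypothetical.

```r
# Simulated data for illustration only; group "c" is given a higher mean.
set.seed(1)
sim <- data.frame(
  group = factor(rep(c("a", "b", "c", "d"), each = 20)),
  value = c(rnorm(20, mean = 10, sd = 3),
            rnorm(20, mean = 10, sd = 3),
            rnorm(20, mean = 16, sd = 3),
            rnorm(20, mean = 10, sd = 3))
)

tukey <- TukeyHSD(aov(value ~ group, data = sim))

# The result is a list with one matrix per factor; columns are
# diff, lwr, upr and "p adj", and rows are the pairwise comparisons.
res <- tukey$group

# Keep only the pairs whose adjusted p-value is below 0.05
sig <- res[res[, "p adj"] < 0.05, , drop = FALSE]
sig
```

A pair with p adj below 0.05 always has a 95% confidence interval (lwr to upr) that excludes 0, so either column can be used to judge significance.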
If our data are not approximately normally distributed, we can use
kruskal.test() instead of aov(), for example:
kruskal.test(phone_time ~ birth_order,
             data = undergrad)
##
## Kruskal-Wallis rank sum test
##
## data: phone_time by birth_order
## Kruskal-Wallis chi-squared = 15.759, df = 3, p-value = 0.00127
The p-value from the Kruskal-Wallis test (0.00127) also shows that at least one group differs from the others in time spent on the phone.
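Like ANOVA, the Kruskal-Wallis test does not say which groups differ. One common post hoc option in base R is pairwise Wilcoxon rank sum tests with a multiple-comparison correction. This is an assumption of this sketch, not a method used in the article, and other choices exist (for example Dunn's test from add-on packages). The example runs on simulated, skewed data with hypothetical names.

```r
# Simulated skewed data for illustration only; group "c" has a larger mean.
set.seed(7)
sim <- data.frame(
  group = factor(rep(c("a", "b", "c", "d"), each = 20)),
  value = c(rexp(20, rate = 1 / 10),
            rexp(20, rate = 1 / 10),
            rexp(20, rate = 1 / 30),
            rexp(20, rate = 1 / 10))
)

# Overall test: is at least one group different?
kruskal.test(value ~ group, data = sim)

# Post hoc: pairwise Wilcoxon tests, p-values adjusted with Holm's method
posthoc <- pairwise.wilcox.test(sim$value, sim$group,
                                p.adjust.method = "holm")

# Lower-triangular matrix of adjusted p-values, one cell per pair
posthoc$p.value
```

Each cell of the matrix is the adjusted p-value for one pair of groups; cells below 0.05 point to the pairs that differ.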