Basic statistical analysis with R : continuous data comparison part 3

This is part of the basic statistical analysis topic of the FETP training, Thailand. This article aims to provide basic R code for comparing continuous data across more than two groups, for those who are not familiar with R. The data set for this article is not provided.

Continuous data comparison for more than two groups

To compare a continuous variable across more than two groups, we can use analysis of variance (ANOVA), aov(), for normally distributed data, or the Kruskal-Wallis test, kruskal.test(), for data that are not normally distributed. In general, these two tests only tell us whether at least one group differs from the others, not which groups differ. To answer that question, we have to do a post hoc analysis afterward.

Assumptions of ANOVA

        1. Each sample was drawn from a normally distributed population.
        2. The variances of the populations that the samples come from are equal.
        3. The observations in each group are independent of each other and the observations within groups were obtained by a random sample.
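
The first two assumptions can be checked in R before relying on the ANOVA result. The sketch below is only an illustration: because the data set for this article is not provided, it uses R's built-in PlantGrowth data (weight by group) as a stand-in, and the choice of shapiro.test() on the model residuals and bartlett.test() for equal variances is an assumption here, not a method prescribed by this article.

```r
# Stand-in data (assumption): PlantGrowth ships with base R,
# with a numeric `weight` and a 3-level factor `group`.
fit <- aov(weight ~ group, data = PlantGrowth)

# Assumption 1: normality, checked on the model residuals.
normality <- shapiro.test(residuals(fit))

# Assumption 2: equal variances across the groups.
equal_var <- bartlett.test(weight ~ group, data = PlantGrowth)

print(normality)
print(equal_var)
```

A large p-value in either test means there is no evidence against that assumption. The third assumption (independence and random sampling) depends on the study design and cannot be checked with a test like these.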

Example

library(readxl)
library(tidyverse)
undergrad <- read_xlsx("dataset_basic_2.xlsx",
                       sheet = "undergrad")
head(undergrad)

## # A tibble: 6 × 23
##      ID Gender Classification Height `Shoe Size` `Phone Time` `# of Shoes`
##   <dbl> <chr>  <chr>           <dbl>       <dbl>        <dbl>        <dbl>
## 1     1 male   senior           67.8         7           12             12
## 2     2 male   freshman         71           7.5          1.5            5
## 3     3 female freshman         64           6           25             15
## 4     4 female freshman         63           6.5         30             30
## 5     5 male   senior           69           6.5         23              8
## 6     6 female senior           64           8.5         13             25
## # … with 16 more variables: `Birth order` <chr>, Pets <dbl>, Happy <dbl>,
## #   Funny <dbl>, College <chr>, `Bfast Calories` <dbl>, Exercise <dbl>,
## #   `Stat Pre` <dbl>, `Stat Post` <dbl>, `Phone Type` <chr>, Sleep <dbl>,
## #   `Social Media` <dbl>, `Impact of SocNetworking` <chr>, Political <chr>,
## #   Animal <chr>, Superhero <chr>

## Clean variable names
undergrad <- undergrad %>%
  rename(phone_time = `Phone Time`,
         birth_order = `Birth order`)

From the undergrad dataset, does the mean time spent on the phone differ depending on whether students are the oldest, middle, youngest, or only child?
So the null hypothesis is that the mean time spent on the phone is equal among students who are the oldest, middle, youngest, or only child.

boxplot(phone_time ~ birth_order,
        data = undergrad,
        main = "Boxplot",
        ylab = "Time spent on phone",
        xlab = "Birth order")

The plot suggests that the oldest students spent more time on the phone than the others. To tell whether this difference is significant, we can use the aov() and summary() functions.

summary(aov(phone_time ~ birth_order,
            data = undergrad))

##             Df Sum Sq Mean Sq F value Pr(>F)  
## birth_order  3   1782   593.8   3.633 0.0169 *
## Residuals   71  11606   163.5                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness

The p-value from ANOVA tells us that at least one birth order group spent a different amount of time on the phone than the others.
To find which groups differ from each other, we can use TukeyHSD() for post hoc analysis.
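
The F value and p-value in the ANOVA table can be reproduced by hand from the sums of squares and degrees of freedom, which may help when reading the output. This is a minimal sketch using the rounded values printed in the table, so small rounding differences from the printed F value and p-value are expected. The variable names are chosen here for illustration only.

```r
# Values copied from the ANOVA table (rounded as printed).
ss_between <- 1782    # Sum Sq for birth_order
df_between <- 3       # Df for birth_order
ss_within  <- 11606   # Sum Sq for Residuals
df_within  <- 71      # Df for Residuals

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# The F statistic is the ratio of the two mean squares.
f_value <- ms_between / ms_within

# The p-value is the upper tail of the F distribution.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(c(F = f_value, p = p_value), 4))
```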

TukeyHSD(aov(phone_time ~ birth_order,
             data = undergrad))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = phone_time ~ birth_order, data = undergrad)
## 
## $birth_order
##                            diff        lwr        upr     p adj
## oldest-middle       11.70000000   1.125527 22.2744726 0.0242915
## only child-middle    2.35714286 -12.749247 17.4635323 0.9764775
## youngest-middle      2.32692308  -8.164916 12.8187624 0.9367405
## only child-oldest   -9.34285714 -23.727017  5.0413028 0.3267672
## youngest-oldest     -9.37307692 -18.795377  0.0492235 0.0517176
## youngest-only child -0.03021978 -14.353742 14.2933020 0.9999999

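Tukey's test shows that the adjusted p-value (p adj) is below 0.05 only for the oldest vs middle comparison (p = 0.024), so we conclude that the oldest children spent significantly more time on the phone than the middle children. The youngest vs oldest comparison is borderline (p = 0.052) and does not reach significance at the 0.05 level.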
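
When there are many pairs, it can be easier to filter the Tukey result than to read the table by eye. The sketch below shows one way to keep only the pairs with an adjusted p-value below 0.05. Because the article's data set is not provided, it uses R's built-in PlantGrowth data as a stand-in (an assumption), so the list element is named `group` here rather than `birth_order`.

```r
# Stand-in data (assumption): PlantGrowth from base R.
fit <- aov(weight ~ group, data = PlantGrowth)
tukey <- TukeyHSD(fit)

# TukeyHSD() returns a list with one matrix per factor;
# the columns are diff, lwr, upr and "p adj".
pairs <- tukey$group

# Keep only the pairs whose adjusted p-value is below 0.05.
significant <- pairs[pairs[, "p adj"] < 0.05, , drop = FALSE]
print(significant)
```

For the undergrad data, the same filter would be applied to the `birth_order` element of the TukeyHSD() result.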

If our data are not approximately normally distributed, we can use kruskal.test() instead of aov(), for example:

kruskal.test(phone_time ~ birth_order,
             data = undergrad)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  phone_time by birth_order
## Kruskal-Wallis chi-squared = 15.759, df = 3, p-value = 0.00127

The p-value from the Kruskal-Wallis test also shows that at least one group spent a different amount of time on the phone than the others.
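
Like ANOVA, the Kruskal-Wallis test does not say which groups differ, so a post hoc step is still needed. One possible approach, which is an assumption here and not something this article prescribes, is pairwise Wilcoxon rank sum tests with a multiple-comparison correction, using pairwise.wilcox.test() from base R. The sketch uses the built-in PlantGrowth data as a stand-in because the article's data set is not provided.

```r
# Stand-in data (assumption): PlantGrowth from base R.
kw <- kruskal.test(weight ~ group, data = PlantGrowth)
print(kw)

# Pairwise Wilcoxon rank sum tests with Holm-adjusted p-values.
# exact = FALSE is passed on to wilcox.test() and requests the normal
# approximation, which avoids warnings when the data contain ties.
posthoc <- pairwise.wilcox.test(PlantGrowth$weight,
                                PlantGrowth$group,
                                p.adjust.method = "holm",
                                exact = FALSE)
print(posthoc)
```

The result is a lower-triangular table of adjusted p-values, one for each pair of groups.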

More than 2 groups comparison

Jirapanakorn Sutham

2022-07-15
