data()
data(package = .packages(all.available=TRUE))
#install.packages("carData")
library(carData)
mydata <- force(Angell)
head(mydata)
## moral hetero mobility region
## Rochester 19.0 20.6 15.0 E
## Syracuse 17.0 15.6 20.2 E
## Worcester 16.4 22.1 13.6 E
## Erie 16.2 14.0 14.8 E
## Milwaukee 15.8 17.4 17.6 MW
## Bridgeport 15.3 27.9 17.5 E
nrow(mydata)
## [1] 43
*Note: For the given research question, the only needed variables are region and mobility.
Converting categorical variables into factors
mydata$regionF <- factor(mydata$region,
levels = c("E", "MW" , "W" , "S"),
labels = c ("E" , "MW" , "W" , "S"))
head(mydata,3)
## moral hetero mobility region regionF
## Rochester 19.0 20.6 15.0 E E
## Syracuse 17.0 15.6 20.2 E E
## Worcester 16.4 22.1 13.6 E E
Statistical description of the geographic mobility (of
residents) by the region (where the city is located)
library(psych)
describeBy(x=mydata$mobility, group=mydata$regionF)
##
## Descriptive statistics by group
## group: E
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 9 15.9 2.66 15 15.9 2.08 12.1 20.2 8.1 0.33 -1.35
## se
## X1 0.89
## ----------------------------------------------------
## group: MW
## vars n mean sd median trimmed mad min max range skew
## X1 1 14 26.06 7.11 24 25.38 5.86 17.6 42.7 25.1 0.85
## kurtosis se
## X1 -0.29 1.9
## ----------------------------------------------------
## group: W
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 6 37.4 6.57 35.15 37.4 3.48 31.2 49.8 18.6 0.95 -0.73
## se
## X1 2.68
## ----------------------------------------------------
## group: S
## vars n mean sd median trimmed mad min max range skew
## X1 1 14 32.46 8.43 30.15 32.32 8.3 19.4 47.2 27.8 0.3
## kurtosis se
## X1 -1.31 2.25
For the given research question, the appropriate parametric test would be One-way ANOVA, since the intention is to assess whether there is relationship between geographic mobility of the residents and the region where the city was located (hypothesis about the equality of three or more population arithmetic means for independent samples). In the given data, there are four different independent regions that are measured at the same time and the differences in the geographic mobility averages for each group are compared.
However, in order to do this parametric test, certain assumptions need to be met:
Since the assumption of the numeric variable is met, the following assumption that needs to be checked is normal distribution.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.3.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(regionF) %>%
shapiro_test(mobility)
## # A tibble: 4 Γ 4
## regionF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 E mobility 0.943 0.613
## 2 MW mobility 0.910 0.156
## 3 W mobility 0.832 0.111
## 4 S mobility 0.937 0.385
H0: Variable mobility is normally distributed within E group (Northeast region).
H1: Variable mobility is not normally distributed within E group.
p-value=0.613, cannot reject H0.
H0: Variable mobility is normally distributed within MW group (Midwest region).
H1: Variable mobility is not normally distributed within MW group.
p-value=0.156, cannot reject H0.
H0: Variable mobility is normally distributed within W group (West region) .
H1: Variable mobility is not normally distributed within W group.
p-value=0.111, cannot reject H0.
H0: Variable mobility is normally distributed within S group (Southeast region).
H1: Variable mobility is not normally distributed within S group.
p-value=0.385, cannot reject H0.
Because p-value for each group is larger than 5%, we cannot reject normality assumption.
While the Shapiro-Wilk test is commonly used to assess normality, it is important to acknowledge its limitations, particularly with smaller sample sizes (43 units). Because of this it is important we need to check data visually as well.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes (x = mobility)) +
geom_histogram(binwidth = 5, colour="blue", fill="pink") +
facet_wrap(~regionF, ncol = 2) +
ylab("'Frequency")
Although Shapiro-Wilk normality test suggests normality, from this ggplot the distribution does not look normal within each group. Since the sample size is not very large as well, results of non-parametric alternative test (Kruskal-Wallis Rank Sum Test) will also be shown as a robustness check.
Now, using the Levene Test, homoskedasticity assumption will be checked to see if the variance of analyzed variable is the same within all groups.
library(car)
## Warning: package 'car' was built under R version 4.3.2
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
leveneTest(mydata$mobility, mydata$regionF)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 2.4694 0.07625 .
## 39
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: varE=varMW=varW=varS (Variances are the same within all groups.)
H1: At least one variance is different.
p-value=0.076, cannot reject H0, and therefore cannot reject the homoskedasticity assumption.
Since all of the four assumptions are met, the One-way ANOVA parametric test can indeed be conducted.
Extension of independent t-test
ANOVA_results <- aov(mobility ~ regionF,
data = mydata)
summary(ANOVA_results)
## Df Sum Sq Mean Sq F value Pr(>F)
## regionF 3 2172 724.0 15.24 1.02e-06 ***
## Residuals 39 1853 47.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: π(E) = π(MW) = π(W) = π(S)
H1: At least one π is different from others.
p-value < 0.001, reject H0.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
eta_squared(ANOVA_results)
## For one-way between subjects designs, partial eta squared is
## equivalent to eta squared. Returning eta squared.
## # Effect Size for ANOVA
##
## Parameter | Eta2 | 95% CI
## -------------------------------
## regionF | 0.54 | [0.34, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_eta_squared(0.54, rules = "cohen1992")
## [1] "large"
## (Rules: cohen1992)
Based on the sample data, we reject H0 at (p < 0.001), and conclude that at least one mean differs from the others. The size of the difference is large - effect size is large (π2 = 0.54).
Now we need to perform Post hoc tests to see between which groups this difference can be observed.
library(rstatix)
pairwise_t_test(mobility ~ regionF,
paired = FALSE,
p.adjust.method = "bonferroni",
data = mydata)
## # A tibble: 6 Γ 9
## .y. group1 group2 n1 n2 p p.signif p.adj
## * <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl>
## 1 mobility E MW 9 14 0.00136 ** 0.00818
## 2 mobility E W 9 6 0.000000672 **** 0.00000403
## 3 mobility MW W 14 6 0.00169 ** 0.0101
## 4 mobility E S 9 14 0.00000173 **** 0.0000104
## 5 mobility MW S 14 14 0.0186 * 0.111
## 6 mobility W S 6 14 0.15 ns 0.898
## # βΉ 1 more variable: p.adj.signif <chr>
*Note: We look at adjusted p-values.
H0: π(E) =π(MW)
H1: π(E) =/= π(MW)
p-value (adj) < 0.008, reject H0
H0: π(E) = π(W)
H1: π(E) =/= π(W).
p-value (adj) < 0.001, reject H0
H0: π(MW) = π(W)
H1: π(MW) =/= π(W)
p-value (adj) = 0.010, reject H0
H0: π(E) = π(S)
H1: π(E) =/= π(S)
p-value (adj) < 0.001, reject H0
H0: π(MW) = π(S)
H1: π(MW) =/= π(S)
p-value (adj) > 5%, cannot reject H0
H0: π(W) = π(S)
H1: π(W) =/= π(S)
p-value (adj) > 5%, cannot reject
Conclusion of One-way ANOVA and post hoc tests:
Extension of Wilcoxon Rank Sum Test Robustness check
kruskal.test(mobility ~ regionF,
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: mobility by regionF
## Kruskal-Wallis chi-squared = 25.328, df = 3, p-value =
## 1.319e-05
H0 = Locations of distribution are the same for all 4 groups.
H1 = At least one location of distribution is different.
p-value < 0.001, reject H0.
kruskal_effsize(mobility ~ regionF,
data = mydata)
## # A tibble: 1 Γ 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 mobility 43 0.573 eta2[H] large
Based on the sample data, we reject H0 (p<0.001) and conclude that at least one location differs from another. The differences are large (π2 = 0.573)
Now we need to perform Post hoc tests to see between which groups this difference can be observed.
library(rstatix)
wilcox_test(mobility ~ regionF,
paired = FALSE,
p.adjust.method = "bonferroni",
data=mydata)
## # A tibble: 6 Γ 9
## .y. group1 group2 n1 n2 statistic p p.adj
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 mobility E MW 9 14 5 0.0000465 0.000279
## 2 mobility E W 9 6 0 0.0004 0.002
## 3 mobility E S 9 14 1.5 0.000121 0.000726
## 4 mobility MW W 14 6 10 0.006 0.039
## 5 mobility MW S 14 14 48 0.021 0.127
## 6 mobility W S 6 14 56 0.274 1
## # βΉ 1 more variable: p.adj.signif <chr>
*Note: We look at adjusted p-values.
H0: The location distribution of the mobility variable between groups βEβ and βMWβ is the same.
H1: The location distribution of the mobility variable between groups βEβ and βMWβ is not the same.
p-value (adj) < 0.001, reject H0
H0: The location distribution of the mobility variable between groups βEβ and βWβ is the same.
H1: The location distribution of the mobility variable between groups βEβ and βWβ is not the same.
p-value (adj) = 0.002, reject H0
H0: The location distribution of the mobility variable between groups βEβ and βSβ is the same.
H1: The location distribution of the mobility variable between groups βEβ and βSβ is not the same.
p-value (adj) < 0.001, reject H0
H0: The location distribution of the mobility variable between groups βMWβ and βWβ is the same.
H1: The location distribution of the mobility variable between groups βMWβ and βWβ is not the same.
p-value (adj) = 0.039, reject H0
H0: The location distribution of the mobility variable between groups βMWβ and βSβ is the same.
H1: The location distribution of the mobility variable between groups βMWβ and βSβ is not the same.
p-value (adj) > 5%, cannot reject H0
H0: The location distribution of the mobility variable between groups βWβ and βSβ is the same.
H1: The location distribution of the mobility variable between groups βWβ and βSβ is not the same.
p-value (adj) > 5%, cannot reject
Conclusion of Kruskal-Wallis Rank Sum test and post hoc tests:
From both parametric One-way ANOVA and non-parametric test Kruskal-Wallis Rank Sum Test (as well as the post hoc tests), the same conclusions are drawn, with noticed differences in p-values, effect sizes and other statistics between the two tests.