Anastasija Šofranac MVA HW1 - Hypothesis Testing

Research Question: Did the geographic mobility of the residents in US cities around 1950 differed among the regions where the city was located?

Data

data()
data(package = .packages(all.available=TRUE))

#install.packages("carData")
library(carData)
mydata <- force(Angell)

head(mydata)

##            moral hetero mobility region
## Rochester   19.0   20.6     15.0      E
## Syracuse    17.0   15.6     20.2      E
## Worcester   16.4   22.1     13.6      E
## Erie        16.2   14.0     14.8      E
## Milwaukee   15.8   17.4     17.6     MW
## Bridgeport  15.3   27.9     17.5      E

nrow(mydata)

## [1] 43

Explanation of data

Unit of observation: an US city around 1950
Sample size: 43 (not very large sample - attention needs to be paid to normal distribution)
Variables:
- moral: Moral Integration - Composite of crime rate and welfare expenditures in a city (1-100 index).
- hetero: Ethnic Heterogeneity - Diversity based on percentage nonwhite and foreign-born white residents in the city (%).
- mobility: Geographic Mobility - The percentages of residents moving into and out of the city (%).
- region: Region where the city is located - Categorical variable with four levels: E is Northeast, MW is Midwest, S is Southeast, and W is West.

*Note: For the given research question, the only needed variables are region and mobility.

Source

Angell, R. C. (1951) The moral integration of American Cities. American Journal of Sociology 57 (part 2), 1–140.

Data manipulation and Descriptive statistics

Converting categorical variables into factors

mydata$regionF <- factor(mydata$region,
                         levels = c("E", "MW" , "W" , "S"),
                         labels = c ("E" , "MW" , "W" , "S"))
head(mydata,3)

##           moral hetero mobility region regionF
## Rochester  19.0   20.6     15.0      E       E
## Syracuse   17.0   15.6     20.2      E       E
## Worcester  16.4   22.1     13.6      E       E

Statistical description of the geographic mobility (of residents) by the region (where the city is located)

library(psych)

describeBy(x=mydata$mobility, group=mydata$regionF)

## 
##  Descriptive statistics by group 
## group: E
##    vars n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 9 15.9 2.66     15    15.9 2.08 12.1 20.2   8.1 0.33    -1.35
##      se
## X1 0.89
## ---------------------------------------------------- 
## group: MW
##    vars  n  mean   sd median trimmed  mad  min  max range skew
## X1    1 14 26.06 7.11     24   25.38 5.86 17.6 42.7  25.1 0.85
##    kurtosis  se
## X1    -0.29 1.9
## ---------------------------------------------------- 
## group: W
##    vars n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 6 37.4 6.57  35.15    37.4 3.48 31.2 49.8  18.6 0.95    -0.73
##      se
## X1 2.68
## ---------------------------------------------------- 
## group: S
##    vars  n  mean   sd median trimmed mad  min  max range skew
## X1    1 14 32.46 8.43  30.15   32.32 8.3 19.4 47.2  27.8  0.3
##    kurtosis   se
## X1    -1.31 2.25

Mean 15.9 in the E group: On average, 15.9% of residents in the Northeast (E) region move in and out of the city.
Min 17.6 in the MW group: The minimum value of 17.6% indicates the smallest value (observed geographical mobility) in the Midwest region (MW).
Median 35.15 in the S group: Half of the residents in the Southeast group (S) have geographic mobility lower than 35.15%, while the other half have a percentage higher than 35.15%.
Range 27.8 in the W group: The range of 27.8% in the West (W) region represents the difference between the highest and lowest mobility values.
Sd 8.43 in the W group: The standard deviation of 8.43 in the West (W) region indicates the average amount by which individual mobility values differ from the mean.

Hypothesis Testing

For the given research question, the appropriate parametric test would be One-way ANOVA, since the intention is to assess whether there is relationship between geographic mobility of the residents and the region where the city was located (hypothesis about the equality of three or more population arithmetic means for independent samples). In the given data, there are four different independent regions that are measured at the same time and the differences in the geographic mobility averages for each group are compared.

Assumptions

However, in order to do this parametric test, certain assumptions need to be met:

Analyzed variable (geographic mobility) is numeric.
Variable (geographic mobility) in the population is normally distributed within each group (region).
Homoskedasticity: the variance of analyzed variable is the same within all groups.
No significant outliers. (*Note: We neglect this and do not check it)

Since the assumption of the numeric variable is met, the following assumption that needs to be checked is normal distribution.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(rstatix)

## Warning: package 'rstatix' was built under R version 4.3.2

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata %>%
  group_by(regionF) %>%
  shapiro_test(mobility)

## # A tibble: 4 × 4
##   regionF variable statistic     p
##   <fct>   <chr>        <dbl> <dbl>
## 1 E       mobility     0.943 0.613
## 2 MW      mobility     0.910 0.156
## 3 W       mobility     0.832 0.111
## 4 S       mobility     0.937 0.385

E group

H0: Variable mobility is normally distributed within E group (Northeast region).

H1: Variable mobility is not normally distributed within E group.

p-value=0.613, cannot reject H0.

MW group

H0: Variable mobility is normally distributed within MW group (Midwest region).

H1: Variable mobility is not normally distributed within MW group.

p-value=0.156, cannot reject H0.

W group

H0: Variable mobility is normally distributed within W group (West region) .

H1: Variable mobility is not normally distributed within W group.

p-value=0.111, cannot reject H0.

S group

H0: Variable mobility is normally distributed within S group (Southeast region).

H1: Variable mobility is not normally distributed within S group.

p-value=0.385, cannot reject H0.

Because p-value for each group is larger than 5%, we cannot reject normality assumption.

While the Shapiro-Wilk test is commonly used to assess normality, it is important to acknowledge its limitations, particularly with smaller sample sizes (43 units). Because of this it is important we need to check data visually as well.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata, aes (x = mobility)) +
geom_histogram(binwidth = 5, colour="blue", fill="pink") +
facet_wrap(~regionF, ncol = 2) +
ylab("'Frequency")

Although Shapiro-Wilk normality test suggests normality, from this ggplot the distribution does not look normal within each group. Since the sample size is not very large as well, results of non-parametric alternative test (Kruskal-Wallis Rank Sum Test) will also be shown as a robustness check.

Now, using the Levene Test, homoskedasticity assumption will be checked to see if the variance of analyzed variable is the same within all groups.

library(car)

## Warning: package 'car' was built under R version 4.3.2

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:psych':
## 
##     logit

leveneTest(mydata$mobility, mydata$regionF)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  3  2.4694 0.07625 .
##       39                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: varE=varMW=varW=varS (Variances are the same within all groups.)

H1: At least one variance is different.

p-value=0.076, cannot reject H0, and therefore cannot reject the homoskedasticity assumption.

Since all of the four assumptions are met, the One-way ANOVA parametric test can indeed be conducted.

Parametric test - One-way ANOVA

Extension of independent t-test

ANOVA_results <- aov(mobility ~ regionF,
                     data = mydata)

summary(ANOVA_results)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## regionF      3   2172   724.0   15.24 1.02e-06 ***
## Residuals   39   1853    47.5                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: 𝜇(E) = 𝜇(MW) = 𝜇(W) = 𝜇(S)

H1: At least one 𝜇 is different from others.

p-value < 0.001, reject H0.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

eta_squared(ANOVA_results)

## For one-way between subjects designs, partial eta squared is
##   equivalent to eta squared. Returning eta squared.

## # Effect Size for ANOVA
## 
## Parameter | Eta2 |       95% CI
## -------------------------------
## regionF   | 0.54 | [0.34, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_eta_squared(0.54, rules = "cohen1992")

## [1] "large"
## (Rules: cohen1992)

Based on the sample data, we reject H0 at (p < 0.001), and conclude that at least one mean differs from the others. The size of the difference is large - effect size is large (𝜂2 = 0.54).

Now we need to perform Post hoc tests to see between which groups this difference can be observed.

Post hoc tests following the One-way ANOVA

library(rstatix)
pairwise_t_test(mobility ~ regionF,
                paired = FALSE,
                p.adjust.method = "bonferroni",
                data = mydata)

## # A tibble: 6 × 9
##   .y.      group1 group2    n1    n2           p p.signif      p.adj
## * <chr>    <chr>  <chr>  <int> <int>       <dbl> <chr>         <dbl>
## 1 mobility E      MW         9    14 0.00136     **       0.00818   
## 2 mobility E      W          9     6 0.000000672 ****     0.00000403
## 3 mobility MW     W         14     6 0.00169     **       0.0101    
## 4 mobility E      S          9    14 0.00000173  ****     0.0000104 
## 5 mobility MW     S         14    14 0.0186      *        0.111     
## 6 mobility W      S          6    14 0.15        ns       0.898     
## # ℹ 1 more variable: p.adj.signif <chr>

*Note: We look at adjusted p-values.

Test 1

H0: 𝜇(E) =𝜇(MW)

H1: 𝜇(E) =/= 𝜇(MW)

p-value (adj) < 0.008, reject H0

Test 2

H0: 𝜇(E) = 𝜇(W)

H1: 𝜇(E) =/= 𝜇(W).

p-value (adj) < 0.001, reject H0

Test 3

H0: 𝜇(MW) = 𝜇(W)

H1: 𝜇(MW) =/= 𝜇(W)

p-value (adj) = 0.010, reject H0

Test 4

H0: 𝜇(E) = 𝜇(S)

H1: 𝜇(E) =/= 𝜇(S)

p-value (adj) < 0.001, reject H0

Test 5

H0: 𝜇(MW) = 𝜇(S)

H1: 𝜇(MW) =/= 𝜇(S)

p-value (adj) > 5%, cannot reject H0

Test 6

H0: 𝜇(W) = 𝜇(S)

H1: 𝜇(W) =/= 𝜇(S)

p-value (adj) > 5%, cannot reject

Conclusion of One-way ANOVA and post hoc tests:

Based on the sample data, we reject the null hypothesis of equality of the four arithmetic means (𝑝 < 0.001, 𝜂2 = 0.54) and conclude that the average geographic mobility of residents differed between the US cities in 1950. Using the group comparisons, we find that the average geographic mobility of the Northeast region (group E) differs from each of the other regions (𝑝 < 5%), and the average geographic mobility also differs between the Midwest region - group MW and West region - group W (𝑝 = 0.010).

Corresponding non-parametic alternative - Kruskal Wallis Rank Sum Test

Extension of Wilcoxon Rank Sum Test Robustness check

kruskal.test(mobility ~ regionF, 
             data = mydata)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  mobility by regionF
## Kruskal-Wallis chi-squared = 25.328, df = 3, p-value =
## 1.319e-05

H0 = Locations of distribution are the same for all 4 groups.

H1 = At least one location of distribution is different.

p-value < 0.001, reject H0.

kruskal_effsize(mobility ~ regionF, 
             data = mydata)

## # A tibble: 1 × 5
##   .y.          n effsize method  magnitude
## * <chr>    <int>   <dbl> <chr>   <ord>    
## 1 mobility    43   0.573 eta2[H] large

Based on the sample data, we reject H0 (p<0.001) and conclude that at least one location differs from another. The differences are large (𝜂2 = 0.573)

Now we need to perform Post hoc tests to see between which groups this difference can be observed.

Post hoc tests following the Kruskal-Wallis Rank Sum test

library(rstatix)
wilcox_test(mobility ~ regionF,
     paired = FALSE,
     p.adjust.method = "bonferroni",
     data=mydata)

## # A tibble: 6 × 9
##   .y.      group1 group2    n1    n2 statistic         p    p.adj
## * <chr>    <chr>  <chr>  <int> <int>     <dbl>     <dbl>    <dbl>
## 1 mobility E      MW         9    14       5   0.0000465 0.000279
## 2 mobility E      W          9     6       0   0.0004    0.002   
## 3 mobility E      S          9    14       1.5 0.000121  0.000726
## 4 mobility MW     W         14     6      10   0.006     0.039   
## 5 mobility MW     S         14    14      48   0.021     0.127   
## 6 mobility W      S          6    14      56   0.274     1       
## # ℹ 1 more variable: p.adj.signif <chr>

*Note: We look at adjusted p-values.

Test 1

H0: The location distribution of the mobility variable between groups “E” and “MW” is the same.

H1: The location distribution of the mobility variable between groups “E” and “MW” is not the same.

p-value (adj) < 0.001, reject H0

Test 2

H0: The location distribution of the mobility variable between groups “E” and “W” is the same.

H1: The location distribution of the mobility variable between groups “E” and “W” is not the same.

p-value (adj) = 0.002, reject H0

Test 3

H0: The location distribution of the mobility variable between groups “E” and “S” is the same.

H1: The location distribution of the mobility variable between groups “E” and “S” is not the same.

p-value (adj) < 0.001, reject H0

Test 4

H0: The location distribution of the mobility variable between groups “MW” and “W” is the same.

H1: The location distribution of the mobility variable between groups “MW” and “W” is not the same.

p-value (adj) = 0.039, reject H0

Test 5

H0: The location distribution of the mobility variable between groups “MW” and “S” is the same.

H1: The location distribution of the mobility variable between groups “MW” and “S” is not the same.

p-value (adj) > 5%, cannot reject H0

Test 6

H0: The location distribution of the mobility variable between groups “W” and “S” is the same.

H1: The location distribution of the mobility variable between groups “W” and “S” is not the same.

p-value (adj) > 5%, cannot reject

Conclusion of Kruskal-Wallis Rank Sum test and post hoc tests:

Based on the sample data, we reject the null hypothesis about the equality of the four location distributions (p<0.001) and conclude that at least one location distributions differs from the others. We also found a large effect size (𝜂2 = 0.573). Using the group comparisons, we find that the location distribution of geographic mobility of the Northeast region (group E) differs from each of the other regions (𝑝 < 5%), and the location distribution also differs between the Midwest region - group MW and West region - group W (𝑝 = 0.039).

From both parametric One-way ANOVA and non-parametric test Kruskal-Wallis Rank Sum Test (as well as the post hoc tests), the same conclusions are drawn, with noticed differences in p-values, effect sizes and other statistics between the two tests.

Anastasija Šofranac MVA HW1 - Hypothesis Testing

2024-01-09

Research Question: Did the geographic mobility of the residents in US cities around 1950 differed among the regions where the city was located?

Data

Explanation of data

Source

Data manipulation and Descriptive statistics

Hypothesis Testing

Assumptions

Parametric test - One-way ANOVA

Post hoc tests following the One-way ANOVA

Corresponding non-parametic alternative - Kruskal Wallis Rank Sum Test

Post hoc tests following the Kruskal-Wallis Rank Sum test