Homework 5

Author

James Malloy

Complete an analysis of your choice using the attached SPSS data set. Use one of the time 2 variables (c2rscale, c2mscale or c2gscale) as the response/outcome variable. Use a combination of continuous and categorical variables as explanatory variables. The attached data set has our usual reading, math and general knowledge scales (indicated by the r, m or g in the name). The number in the outcome name indicates fall kindergarten (time 1) or spring kindergarten (time 2). Bring this analysis to class on 10/6 for discussion.

library(tidyverse)
library(ggpubr)
library(haven)
library(modelsummary)
library(broom)
theme_set(theme_light())

df <- read_sav("HW5practice.sav") 

df_esl <- df %>% 
  drop_na(esl) %>% 
  mutate(esl_factor = case_when(esl == 1 ~ "ESL",
                                esl == 0 ~ "Non-ESL"))

For this analysis, one of my first questions is whether there is a difference in reading scores between ESL students and non-ESL students. I hypothesize that there is a difference in reading scores, with Non-ESL students scoring higher than ESL students. This is based on my theory that being a native/fluent English speaker gives you an advantage on a test that assesses your ability to read, comprehend, and make inferences on texts written in the English language. To look at this relationship, let’s do some descriptive statistics.

To my surprise, although ESL students in the dataset score slightly lower, on average, than non-ESL students, a quick descriptive analysis reveals that difference to be much smaller than I had anticipated. Non-ESL students score, on average, less than a point higher than ESL students.

# let's summarize our data by ESL and Non-ESL students. 
df_esl %>% 
  group_by(esl_factor) %>% 
  summarize(
    n = n(),
    proportion = n/nrow(.),
    mean_reading = mean(c2rscale, na.rm = T)
  )
# A tibble: 2 × 4
  esl_factor     n proportion mean_reading
  <chr>      <int>      <dbl>        <dbl>
1 ESL           54      0.116         33.4
2 Non-ESL      412      0.884         34.3
# let's create a mean dot plot to visualize the mean and distribution of scores.
# the following chart uses a jitter, which is an argument that moves data points
# a long a categorical axis, but leaves 
ggplot(df_esl, aes(esl_factor, c2rscale, color = esl_factor)) + 
  geom_jitter(width = .2) +
  stat_summary(fun = mean, geom = "crossbar") +
  labs(x = "",
       y = "Reading Score",
       title = "Is there a difference in X2 Reading Score between ESL and non-ESL students?",
       caption = "Bars represent mean reading score for each group.") +
  theme(legend.position = "none")

We can do a bivariate regression analysis to determine if that difference is statistically significant. The regression analysis reveals that the true difference between these -4.64 and 2.99, which covers the value 0. This finding surprises me and contradicts my original theory.

m1 <- lm(data = df_esl, c2rscale ~ esl)
m1 %>% tidy(conf.int = T)
# A tibble: 2 × 7
  term        estimate std.error statistic   p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   34.3       0.571    60.0   7.03e-210    33.1      35.4 
2 esl           -0.823     1.94     -0.424 6.72e-  1    -4.64      2.99

I still don’t quite accept this “finding” as I believe my theory (that Non-ESL students have higher readings scores, on average, than ESL students) really does hold some weight.

My next thought is, “What if the difference is being masked by gender?” Girls typically do better on reading assignments than boys, so what if there is an interaction (can we say interaction?) such that the effect of ESL on reading scores is determined by gender. Let’s explore this further.

To do this analysis, we first have to create a new variable that I’ll call esl_female with 4 levels (Non-ESL Male, Non-ESL Female, ESL Male, and ESL Female).

# First we have to create our new variable for ESL/NON-ESL males and females
df_esl <- df_esl %>%
  mutate(
    female = case_when(gender == 1 ~ 0,
                       gender == 2 ~ 1),
    esl_female = case_when(esl == 0 & female == 0 ~ "Non-ESL Male", 
                           esl == 0 & female == 1 ~ "Non-ESL Female",
                           esl == 1 & female == 0 ~ "ESL Male",
                           esl == 1 & female == 1 ~ "ESL Female"))

Now that we have our newly created variable, esl_female, we can now graph our data. Based on the chart below, there does seem to be a difference among the four groups with ESL females scoring the highest, on average, and ESL males scoring the lowest, on average.

ggplot(df_esl, aes(esl_female, c2rscale, color = esl_female)) + 
  geom_jitter(width = .2) +
  stat_summary(fun = mean, geom = "crossbar") +
  labs(x = "",
       y = "Reading Score",
       title = "Do gender and ESL-status make a difference on reading score?",
       caption = "Bars represent mean score for each group.") +
  theme(legend.position = "none")

The differences are more apparent in the graph above, but I want to know if there are statistically significant differences Let’s do an anova to see if there are statistically significant group differences.

anova <- aov(data = df_esl, c2rscale ~ esl_female)
anova %>% summary()
             Df Sum Sq Mean Sq F value Pr(>F)  
esl_female    3   1047   349.0   2.776  0.041 *
Residuals   424  53298   125.7                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
38 observations deleted due to missingness

The anova tells us that at least one of the groups is statistically significantly different from the other. To see where the differences lie, you have to do a posthoc analysis. However, when I do a Tukey test (below), the results come back non-statistically significant. I guess this means I’ve (not) adjusted for something that I should (not) have. Let’s just move on from this part of the analysis :sweat_smile:.

TukeyHSD(anova) 
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = c2rscale ~ esl_female, data = df_esl)

$esl_female
                                 diff        lwr        upr     p adj
ESL Male-ESL Female         -7.990896 -17.587014  1.6052222 0.1398676
Non-ESL Female-ESL Female   -1.585891  -8.220869  5.0490872 0.9268013
Non-ESL Male-ESL Female     -3.717870 -10.364383  2.9286438 0.4733734
Non-ESL Female-ESL Male      6.405005  -1.109443 13.9194523 0.1253014
Non-ESL Male-ESL Male        4.273026  -3.251609 11.7976607 0.4598690
Non-ESL Male-Non-ESL Female -2.131979  -5.057314  0.7933562 0.2382985

To round things out, lets add SES to our model. Here, I’m wondering if ESL-status will still be statistically significant after controlling for income.

m2 <- lm(data = df_esl, c2rscale ~ esl_female + wksesl)
df_esl <- augment_columns(m2, df_esl)

ggplot(df_esl, aes(wksesl, c2rscale, color = esl_female)) +
  geom_point(aes(color = esl_female)) +
  geom_line(aes(y = .fitted), size = 1) +
  stat_regline_equation() +
  labs(x = "Socioeconomic Status",
       y = "Reading Score",
       title = "Relationship between SES and income, controlling for ESL")

ggplot(df_esl, aes(esl_female, c2rscale, group = esl_female)) + geom_point(aes(mean(~cr2scale, na.rm = T)))

modelsummary(list(m1, m2), stars = T)
Model 1 Model 2
(Intercept) 34.263*** 36.607***
(0.571) (2.272)
esl −0.823
(1.942)
esl_femaleESL Male −5.800+
(3.464)
esl_femaleNon-ESL Female −2.239
(2.389)
esl_femaleNon-ESL Male −4.476+
(2.394)
wksesl 5.934***
(0.714)
Num.Obs. 428 428
R2 0.000 0.157
R2 Adj. −0.002 0.149
AIC 3293.7 3226.8
BIC 3305.8 3251.2
Log.Lik. −1643.827 −1607.425
RMSE 11.27 10.35
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
ggplot(df_esl, aes(wksesl, c2rscale, color = esl_female)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  labs(x = "Socioeconomic Status",
       y = "Reading Score",
       title = "Is there an intereaction?")

m3 <- lm(data = df_esl, c2rscale ~ esl_female * wksesl )
m3 %>% summary()

Call:
lm(formula = c2rscale ~ esl_female * wksesl, data = df_esl)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.086  -6.380  -1.019   4.701  40.120 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       36.690      2.279  16.103   <2e-16 ***
esl_femaleESL Male                -5.060      3.880  -1.304    0.193    
esl_femaleNon-ESL Female          -2.211      2.401  -0.921    0.358    
esl_femaleNon-ESL Male            -4.710      2.407  -1.957    0.051 .  
wksesl                             4.221      2.702   1.562    0.119    
esl_femaleESL Male:wksesl          4.282      6.099   0.702    0.483    
esl_femaleNon-ESL Female:wksesl    1.012      2.904   0.349    0.728    
esl_femaleNon-ESL Male:wksesl      2.571      2.901   0.886    0.376    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.42 on 420 degrees of freedom
Multiple R-squared:  0.1602,    Adjusted R-squared:  0.1462 
F-statistic: 11.45 on 7 and 420 DF,  p-value: 2.557e-13
modelsummary(list(m1, m2, m3),stars = T)
Model 1 Model 2 Model 3
(Intercept) 34.263*** 36.607*** 36.690***
(0.571) (2.272) (2.279)
esl −0.823
(1.942)
esl_femaleESL Male −5.800+ −5.060
(3.464) (3.880)
esl_femaleNon-ESL Female −2.239 −2.211
(2.389) (2.401)
esl_femaleNon-ESL Male −4.476+ −4.710+
(2.394) (2.407)
wksesl 5.934*** 4.221
(0.714) (2.702)
esl_femaleESL Male × wksesl 4.282
(6.099)
esl_femaleNon-ESL Female × wksesl 1.012
(2.904)
esl_femaleNon-ESL Male × wksesl 2.571
(2.901)
Num.Obs. 428 428 428
R2 0.000 0.157 0.160
R2 Adj. −0.002 0.149 0.146
AIC 3293.7 3226.8 3231.1
BIC 3305.8 3251.2 3267.6
Log.Lik. −1643.827 −1607.425 −1606.552
RMSE 11.27 10.35 10.33
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
model_1 <- lm(data = df_esl, c2rscale ~ c1rscale,)
model_2 <- lm(data = df_esl, c2rscale ~ c1rscale + esl_female)
modelsummary(list(model_1, model_2))
Model 1 Model 2
(Intercept) 11.185 13.520
(0.893) (1.760)
c1rscale 0.991 0.991
(0.035) (0.035)
esl_femaleESL Male 0.391
(2.586)
esl_femaleNon-ESL Female −2.044
(1.611)
esl_femaleNon-ESL Male −3.013
(1.614)
Num.Obs. 413 413
R2 0.657 0.662
R2 Adj. 0.656 0.659
AIC 2731.0 2730.6
BIC 2743.1 2754.7
Log.Lik. −1362.514 −1359.277
RMSE 6.55 6.50