Quant 3 Midterm Exam

EPRS 8550

Author

James Malloy

This midterm exam has two sections. In section A, you will analyze a data set using a model of your choosing using the High School Longitudinal Study (HSLS) data set. In section B, you will read and interpret a published study that uses multiple regression analysis. Please be sure to provide appropriate output, and to share your reasoning for your answer to each question.

SECTION A

Description of the Data Set

The data included with the midterm is entitled MidtermSample.sav. Missing values have already been coded for each variable. Recall that the HSLS is a longitudinal, nationally representative data set of high school students, collecting information about their academic achievement and other demographic and survey data. The data here is a small, random subset of the whole data set. The variables included in the data set are:

X2SEX: Student’s sex. Male = 1; Female = 2

X2RACE: Student’s race. Native American = 1; Asian = 2; Black = 3; Hispanic, no race specified = 4; Hispanic, race specified = 5; More than one race = 6; Pacific Islander = 7; White = 8

X2DUALLANG: Student’s first language. First language is English = 1; First language is a non-English language = 2; First language is English and non-English equally = 3

X2NUMHS: Number of high schools attended by student

X2TXMTH: Student’s math score (response variable for this exercise)

X2PAR1EDU: Parent 1’s highest level of education. Ranges from 1=less than HS to 7=PhD/MD/Law/Other professional degree

X2SES: Student SES. Standardized composite SES, centered at the median income in the year of data collection

X2BEHAVEIN: Scale of student’s school motivation. Higher values are more positive assessments of the student’s in-school behavior.

X2MTHEFF: Scale of student’s math self-efficacy. Higher values represent higher math self-efficacy

X2STUEDEXPCT: How far in school student expects to achieve. Ranges from 1=less than high school to 12=PhD/MD/Law/Other professional degree. Note that 13 = don’t know

Directions

For the following question, fit a multiple regression model using X2TXMTH (math score) as your response variable with at least three explanatory variables of your choice. One of these variables should be a categorical variable (such as sex). You are free to use whatever explanatory variables are of interest to you. Please provide a short narrative report of your results that includes the following sections.

• Rationale for the model. Provide a discussion of the explanatory variables you chose and why you chose them. Include the model in this discussion.

• Description of the variables. Provide a short description of the variables you chose, including any relevant descriptive statistics. Discuss any transformations necessary for the variables.

• Assumptions for multiple regression. Discuss how the data meet the assumptions of multiple regression.

• Results. Discuss the fit of the model and interpret the model coefficients.

• Summary. Describe your results.

Load libraries and data

library(tidyverse) # ggplot and more
library(haven) #import spss files
library(modelsummary) # side-by-side model comparison
library(broom) # tidy dataframes
library(GGally) # pairwise plots
library(knitr) # print nice dataframes
library(kableExtra) # print nicer dataframes
theme_set(theme_light()) 

# BELOW I WILL IMPORT THE FULL HSLS DATASET SO THAT I CAN JOIN THE X1 MATH
# SCORE VAR TO THE MIDTERM SAMPLE. DATASETS ARE JOINED BY "STU_ID". I
# FOUND THE FULL DATASET ONLINE AT NCES.ED.GOV

df_hsls <- read_csv("hsls_17_student_pets_sr_v1_0.csv",  
                    col_select = c(STU_ID, X1TXMTH)) %>%
  mutate(STU_ID = as.character(STU_ID), 
         x1_math_score = X1TXMTH) %>% 
  select(-X1TXMTH) %>% 
  filter(x1_math_score > -3) # remove people with missing x1 math scores

# IMPORTING MIDTERM SAMPLE DATA
df_midterm_sample <- read_sav("MidtermSample.sav")  

# NOW I'M GOING TO CLEAN THE MIDTERM SAMPLE DATASET, CREATE/RENAME SOME NEW VARS,
# AND WILL ALSO JOIN THE HSLS DATASET (THAT HAS X1 MATH SCORES) TO THE MIDTERM
# SAMPLE

df <- 
  df_midterm_sample %>% 
  mutate(x2_math_score = X2TXMTH,
         x2_math_efficacy = factor(case_when(X2MTHEFF > 0 ~ 1,
                                             X2MTHEFF <= 0 ~ 0),
                                   labels = c("Low Math Efficacy", 
                                              "High Math Efficacy")),
         x2_ses = X2SES) %>% 
  relocate(x2_math_efficacy, .after = X2MTHEFF) %>% 
  left_join(df_hsls, by = "STU_ID") %>% 
  # REMOVING THE 318 ROWS WITH MISSING MATH EFFICACY SCORES; 1536 OBS. REMAINING
  filter(!is.na(x2_math_efficacy)) 

# df_midterm_sample %>% count(is.na(X2MTHEFF)) # COUNT MISSING VALUES

Model rationale

Math ability is an important skill for students to have. Not only is math taught in schools, but math knowledge is also a prerequisite for many of society’s best-paying jobs (e.g., STEM jobs) and is essential for some of today’s most important tasks (e.g., reducing our carbon footprint). If students are going to rise to these challenges, they must gain sufficient knowledge and understanding of the math concepts taught in school.

Math efficacy indicates students’ belief in their ability to overcome difficulties or obstacles when solving math problems. Students who have low math efficacy may not learn as much math as students with high math efficacy, and anxiety about math tests may show up in their math scores. This analysis will study the relationship between math scores and membership in the High Math Efficacy or Low Math Efficacy group.

Socioeconomic status can also impact students’ math scores. For example, students with more financial resources can afford more extracurricular enrichment that can improve math knowledge and math scores. Therefore, adding SES to the model will allow us to control for such differences.

Lastly, to really tease out the relationship between math scores and math efficacy, we need to add a pre-test to our model. Math efficacy aside, students are learning math concepts each day in class. It is intuitive that students had some math knowledge before taking their X2 math test. To account for this, I will add X1 Math scores to the model.

(Note: X1 Math Scores were not in the original Midterm sample, but I knew that I wanted to add them to my analysis. To get X1 math scores for each student in the midterm sample, I downloaded all X1 Math scores from the HSLS dataset from the following URL: https://nces.ed.gov/surveys/hsls09/hsls09_data.asp. I then merged the X1 Math Scores into the midterm sample, joining the two datasets by the unique ID variable “STU_ID”.)

\[ \widehat{\text{X2 Math Score}} = \beta_0 + \beta_1 (\text{High Math Efficacy}) + \beta_2 (\text{X2 SES}) + \beta_3 (\text{X1 Math Score}) \]

# A tibble: 4 × 7
  term                         estim…¹ std.e…² stati…³   p.value conf.…⁴ conf.…⁵
  <chr>                          <dbl>   <dbl>   <dbl>     <dbl>   <dbl>   <dbl>
1 (Intercept)                    0.524  0.0313   16.7  2.38e- 57   0.462   0.585
2 x2_math_efficacyHigh Math E…   0.258  0.0419    6.16 9.57e- 10   0.176   0.341
3 x2_ses                         0.199  0.0296    6.73 2.47e- 11   0.141   0.257
4 x1_math_score                  0.811  0.0230   35.3  1.38e-195   0.766   0.856
# … with abbreviated variable names ¹​estimate, ²​std.error, ³​statistic,
#   ⁴​conf.low, ⁵​conf.high

Variable description

This analysis will use multiple regression to study the relationship between the outcome variable X2 Math Score (x2_math_score) and the following explanatory variables:

X2 Math Efficacy (x2_math_efficacy):

  • This variable was originally a continuous variable (X2MTHEFF). From that continuous variable I created the dichotomous variable x2_math_efficacy, which has 2 levels: “High Math Efficacy” assigned to students with a math efficacy score greater than 0 (the mean) and “Low Math Efficacy”, assigned to students with a math efficacy score of 0 or less.

X1 Math Score (x1_math_score):

  • Student’s X1 math score; a continuous variable centered at mean = 0.

X2 SES (x2_ses):

  • Student’s SES at time 2; a continuous variable centered at mean = 0.

                               X2 Math Score      X2 Math Efficacy   SES                X1 Math Score
X2 Math Efficacy Group     n       Mean      SD       Mean      SD       Mean      SD       Mean      SD
Low Math Efficacy        652       0.38    1.10      -0.88    0.67       0.02    0.72      -0.19    0.92
High Math Efficacy       884       1.02    1.15       0.71    0.60       0.15    0.74       0.30    0.96
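
A table like the one above can be produced by summarizing within each efficacy group. The following is a minimal sketch on made-up data, written in base R so that it runs without extra packages. The data frame `toy` and its columns `efficacy_score`, `math_score`, and `efficacy_group` are hypothetical stand-ins, not the variables in `df`.

```r
# Toy data standing in for the report's `df` (all values are simulated)
set.seed(1)
toy <- data.frame(
  efficacy_score = rnorm(200),
  math_score = rnorm(200)
)

# Dichotomize at 0, with the low group as the reference level
toy$efficacy_group <- factor(
  ifelse(toy$efficacy_score > 0, "High Math Efficacy", "Low Math Efficacy"),
  levels = c("Low Math Efficacy", "High Math Efficacy")
)

# n, mean, and SD of the math score within each group
n_by_group <- tapply(toy$math_score, toy$efficacy_group, length)
mean_by_group <- tapply(toy$math_score, toy$efficacy_group, mean)
sd_by_group <- tapply(toy$math_score, toy$efficacy_group, sd)

desc <- data.frame(
  group = names(n_by_group),
  n = as.vector(n_by_group),
  mean_math = round(as.vector(mean_by_group), 2),
  sd_math = round(as.vector(sd_by_group), 2)
)
print(desc, row.names = FALSE)
```

The same pattern extends to the other columns (efficacy, SES, X1 math score) by repeating the `tapply()` calls on each variable.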

Assumptions

Before we can do a regression analysis, we should make sure that our data meet the assumptions of multiple regression.

Assumption 1: Is the outcome variable normally distributed?

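Based on the histogram below, math scores appear approximately normally distributed. (Strictly speaking, the normality assumption applies to the model’s residuals rather than to the raw outcome, and regression is fairly robust to mild departures from it in a sample of this size.)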

df %>% 
  ggplot(aes(x2_math_score)) +
  geom_histogram(color = "white", binwidth = .2) +
  labs(x = "X2 Math Score",
       title = "Are X2 Math Scores normally distributed?")
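
Because the normality assumption in regression is usually stated for the residuals, a normal Q-Q plot of the residuals is a common companion to the histogram. The following is a minimal sketch on simulated data; `x`, `y`, and `fit` are hypothetical names, and with the report's data one would pass the fitted model object instead.

```r
# Simulated predictor and outcome (not the HSLS data)
set.seed(2)
x <- rnorm(300)
y <- 0.5 + 0.8 * x + rnorm(300)

# Fit a simple regression and extract its residuals
fit <- lm(y ~ x)
res <- resid(fit)

# Points close to the reference line suggest approximately normal residuals
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)

# Shapiro-Wilk test as a numeric complement to the plot
sw <- shapiro.test(res)
print(sw)
```

With large samples the Shapiro-Wilk test can flag trivial departures, so the plot is usually the more informative of the two.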

Assumptions 2 & 3: Linear relationships and Multi-collinearity

Now I’ll address assumptions 2 and 3 together. The pairwise plot below visualizes the relationship between each of the variables in the model. Based on the visualizations, we can conclude that 1) our response variable X2 Math Score has a linear relationship with the other variables in the model and 2) we should not be concerned with multicollinearity between the explanatory variables given their low correlation.

(Note: To check these assumptions, I used the continuous version of the math efficacy variable (X2MTHEFF). The analysis, however, will use the dichotomous variable x2_math_efficacy (High Math Efficacy vs Low Math Efficacy).)

# pairwise plot
ggpairs(
  data.frame(df$X2MTHEFF, df$x2_ses, df$x1_math_score, df$x2_math_score),
  columnLabels = c("X2 Math Efficacy", "SES", "X1 Math Score", "X2 Math Score"),
  switch = "both",
  title = "What are the relationships between each variable in the model?")
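
Pairwise correlations can miss collinearity that involves several predictors at once, so a variance inflation factor (VIF) is often reported alongside a plot like this. For each predictor, VIF = 1 / (1 - R²), where R² comes from regressing that predictor on the remaining predictors. A base R sketch on simulated data follows; the variable names are hypothetical and only loosely mirror the report's predictors.

```r
# Simulated predictors with mild correlation (not the HSLS data)
set.seed(3)
n <- 500
ses <- rnorm(n)
x1_math <- 0.4 * ses + rnorm(n)
efficacy <- 0.3 * x1_math + rnorm(n)
predictors <- data.frame(ses, x1_math, efficacy)

# Regress each predictor on the others and convert R-squared to a VIF
vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  aux_fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux_fit)$r.squared)
})
print(round(vif, 2))
```

Values near 1 indicate little collinearity; common rules of thumb treat values above roughly 5 or 10 as a concern.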

Assumption 4: Independence of observations

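Our groups, High Math Efficacy and Low Math Efficacy, are mutually exclusive: each student belongs to one group or the other, and no one belongs to both. The plot below shows a sharp cutoff at the mean Math Efficacy score, m = 0, and non-overlapping groups. Strictly speaking, independence also requires that one student’s score not depend on another’s, which this plot cannot verify.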

df %>% ggplot(aes(x2_math_efficacy, X2MTHEFF,
                  color = x2_math_efficacy)) + 
  geom_jitter(width = .2, alpha = .5) +
  labs(x = "Math Efficacy Group",
       y = "Math Efficacy Score",
       title = "Are the observations independent?") +
  theme(legend.position = "none")

Assumption 5: Homoscedasticity

Homoscedasticity means that the variance of the errors is constant across the fitted values of the model. I don’t know how to make this plot yet, but I’ll learn how to plot it at some point. For now, let’s assume our data meet this assumption as well 🙂.
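
One common way to check homoscedasticity is to plot the residuals against the fitted values and look for a roughly even vertical spread. The following is a minimal sketch on simulated data, not the report's model; `fit` and `diag_df` are hypothetical names, and the same steps should apply to any `lm` object.

```r
# Simulated data with constant error variance (not the HSLS data)
set.seed(4)
n <- 300
x <- rnorm(n)
y <- 0.5 + 0.8 * x + rnorm(n)
fit <- lm(y ~ x)

# Collect fitted values and residuals in one data frame
diag_df <- data.frame(fitted = fitted(fit), residual = resid(fit))

# An even band around zero is consistent with homoscedasticity;
# a funnel shape would suggest the variance changes with the fitted value
plot(diag_df$fitted, diag_df$residual,
     xlab = "Fitted values",
     ylab = "Residuals",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)
```

Base R also offers `plot(fit, which = 1)` as a shortcut for essentially the same diagnostic.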

Results

I hypothesized that the data would show a difference in math scores between students with high math efficacy and students with low math efficacy. The plot below visualizes this difference through a mean-dot plot. The bars on the plot represent the mean math score for each group. The plot shows a visible difference between students with high vs low math efficacy, with high efficacy students scoring higher on the test than their counterparts.

ggplot(df, aes(x2_math_efficacy, x2_math_score,
           color = x2_math_efficacy)) +
  geom_jitter(width = .1) +
  stat_summary(fun = mean, geom = "crossbar") +
  labs(x = "",
       y = "X2 Math Score",
       title = "Does math efficacy impact a student's math score?") +
  theme(legend.position = "none")

The above plot supports our theory that math efficacy is related to math scores. Now let’s see if this relationship is statistically significant. To determine this, we will use bivariate regression analysis.

Model 1, below, shows us that students in the sample with high math efficacy scored, on average, 0.64 standard deviation points higher than students with low math efficacy (p < 0.001). Students in the data set with low math efficacy had an average score of 0.38 standard deviation points, which is represented by the y-intercept. Model 1 has an adjusted r-squared of .07, meaning that our model explains 7% of the variation in X2 Math Scores.

m1 <- lm(data = df, x2_math_score ~ x2_math_efficacy)  # model 1
m1 %>% summary()

Call:
lm(formula = x2_math_score ~ x2_math_efficacy, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9379 -0.8236  0.0183  0.8445  3.7651 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         0.38008    0.04424   8.592   <2e-16 ***
x2_math_efficacyHigh Math Efficacy  0.63896    0.05831  10.957   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.13 on 1534 degrees of freedom
Multiple R-squared:  0.07258,   Adjusted R-squared:  0.07198 
F-statistic: 120.1 on 1 and 1534 DF,  p-value: < 2.2e-16

According to model 1, we can expect students with High Math Efficacy to score somewhere between .53 and .75 standard deviation points higher on the math test than students with Low Math Efficacy. Thus, model 1 supports our original hypothesis.
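
The interval quoted above can be reproduced by hand from the `summary()` output, as estimate ± critical t value × standard error. A short sketch, with the three inputs copied from the Model 1 output (the object names `est`, `se`, and `df_resid` are just illustrative):

```r
# Inputs copied from the Model 1 summary() output
est <- 0.63896       # coefficient for High Math Efficacy
se <- 0.05831        # its standard error
df_resid <- 1534     # residual degrees of freedom

# 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- est + c(-1, 1) * t_crit * se
print(round(ci, 3))
```

Rounded to three decimals, this should agree with the [0.525, 0.753] interval reported for Model 1. In practice, `confint(m1)` returns the same t-based interval directly.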

m1 %>% modelsummary(statistic = "conf.int", stars = TRUE)

                                     Model 1
(Intercept)                          0.380***
                                     [0.293, 0.467]
x2_math_efficacyHigh Math Efficacy   0.639***
                                     [0.525, 0.753]
Num.Obs.                             1536
R2                                   0.073
R2 Adj.                              0.072
AIC                                  4737.4
BIC                                  4753.4
Log.Lik.                             −2365.705
RMSE                                 1.13
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Although there is a difference between these groups, our explanatory variable is endogenous, and other variables may be impacting this relationship. Let’s fit some hierarchical regression models to see if we can tease out the “true” relationship between math scores and math efficacy.

Before I run my model, I always like to visualize my data first. The plot below visualizes the relationship between math efficacy and math scores, holding SES constant.

m2 <- update(m1, ~. + x2_ses) # update model 1 to include ses

ggplot(df, aes(x2_ses, x2_math_score,
               color = x2_math_efficacy)) + 
  geom_point() +
  geom_line(aes(y = m2$fitted.values), size = 2) + # use model estimates to plot parallel lines
  labs(x = "X2 Socioeconomic Status",
       y = "X2 Math Score", 
       title = "Relationship b/w Math Effic. & Math Scores, holding SES constant",
       color = "Math Efficacy")

m2 %>% summary()

Call:
lm(formula = x2_math_score ~ x2_math_efficacy + x2_ses, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3022 -0.6824  0.0004  0.7618  4.1844 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         0.36597    0.04101   8.923   <2e-16 ***
x2_math_efficacyHigh Math Efficacy  0.56593    0.05425  10.433   <2e-16 ***
x2_ses                              0.58057    0.03653  15.892   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.047 on 1533 degrees of freedom
Multiple R-squared:  0.2038,    Adjusted R-squared:  0.2027 
F-statistic: 196.1 on 2 and 1533 DF,  p-value: < 2.2e-16

Model 2 tells us that students in the sample with high math efficacy scored, on average, .57 standard deviation points higher on the math test than students with low math efficacy, holding SES constant. Students with low math efficacy and an average SES (\(M_{SES}\) = 0) had an average X2 math score of .37 standard deviation points (the y-intercept). Model 2 has an adjusted r-squared of .20, so this model explains 20% of the variation in X2 Math scores. This is certainly an improvement on model 1, which had an adjusted r-squared of .07.

Additionally, the coefficient on X2 SES is .58. This means that for every 1 standard deviation increase in SES, we can expect X2 Math Scores to increase by .58 standard deviation points, holding Math Efficacy constant.

ggplot(df, aes(x = x2_ses, y = x2_math_score, color = x2_math_efficacy)) + 
  geom_point() + 
  geom_line(aes(y = m2$fitted.values)) +
  facet_wrap(~ x2_math_efficacy, ncol = 1) +
  labs(x = "SES",
       y = "X2 Math Score", 
       title = "Relationship b/w math scores and SES, holding Math Efficacy constant") +
  theme(legend.position = "none")

According to model 2, we can infer that students with high math efficacy will score between .46 and .67 standard deviation points higher on the math test than students with low math efficacy, holding SES constant. It is important to note that our confidence interval for Math Efficacy does not include 0 even after including SES. This is in line with our original hypothesis and suggests that Math Efficacy does, in fact, matter when it comes to math scores even after accounting for something as significant as SES.

modelsummary(
  list(m1, m2), 
  statistic = "conf.int", 
  stars = TRUE)

                                     Model 1          Model 2
(Intercept)                          0.380***         0.366***
                                     [0.293, 0.467]   [0.286, 0.446]
x2_math_efficacyHigh Math Efficacy   0.639***         0.566***
                                     [0.525, 0.753]   [0.460, 0.672]
x2_ses                                                0.581***
                                                      [0.509, 0.652]
Num.Obs.                             1536             1536
R2                                   0.073            0.204
R2 Adj.                              0.072            0.203
AIC                                  4737.4           4505.2
BIC                                  4753.4           4526.5
Log.Lik.                             −2365.705        −2248.588
RMSE                                 1.13             1.05
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

To end our regression analysis, let’s add prior knowledge to the model. This allows us to account for the fact that prior knowledge affects both math efficacy and math scores. For example, the more students know prior to the math test, the more likely they are to perform well on said math test. Similarly, it is also reasonable to assume that the more knowledge students have, the more confident they will be in their math skills. Using X1 Math scores should allow us to account for some of that confounding.

m3 <- update(m2, ~. + x1_math_score) # update m2 to include X1 Math Scores
m3 %>% summary()

Call:
lm(formula = x2_math_score ~ x2_math_efficacy + x2_ses + x1_math_score, 
    data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.05647 -0.47338  0.02393  0.49708  2.79333 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         0.52353    0.03131  16.721  < 2e-16 ***
x2_math_efficacyHigh Math Efficacy  0.25831    0.04194   6.159 9.57e-10 ***
x2_ses                              0.19893    0.02956   6.730 2.47e-11 ***
x1_math_score                       0.81062    0.02297  35.296  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7504 on 1396 degrees of freedom
  (136 observations deleted due to missingness)
Multiple R-squared:  0.5879,    Adjusted R-squared:  0.587 
F-statistic: 663.7 on 3 and 1396 DF,  p-value: < 2.2e-16

Model 3 tells us that holding SES and X1 math scores constant, students with High Math Efficacy score .26 standard deviation points higher on the math test than students with Low Math Efficacy (p < 0.001). For every 1 standard deviation point increase in SES, we can expect math scores to increase by .2 standard deviation points, holding Math Efficacy and X1 Math Scores constant. Additionally, for every 1 standard deviation increase in X1 Math Score, we can expect X2 Math scores to increase by .81 standard deviation points, holding SES and Math Efficacy constant. The y-intercept, 0.52, is the math score we’d expect for someone in the Low Math Efficacy group with an average SES and an average X1 math score.

Model 3 has an adjusted r-squared of .59, meaning our model now accounts for 59% of the variation in X2 Math Scores. This is a big increase from model 2 with an adjusted r-squared of 0.20. Even after controlling for SES and prior knowledge, the confidence interval for X2 Math Efficacy still does not include 0. Our model suggests that students with High Math Efficacy will score between .18 and .34 standard deviation points higher on the math test than students with Low Math Efficacy.

modelsummary(
  list(m1, m2, m3), 
  statistic = "conf.int", 
  stars = TRUE)

                                     Model 1          Model 2          Model 3
(Intercept)                          0.380***         0.366***         0.524***
                                     [0.293, 0.467]   [0.286, 0.446]   [0.462, 0.585]
x2_math_efficacyHigh Math Efficacy   0.639***         0.566***         0.258***
                                     [0.525, 0.753]   [0.460, 0.672]   [0.176, 0.341]
x2_ses                                                0.581***         0.199***
                                                      [0.509, 0.652]   [0.141, 0.257]
x1_math_score                                                          0.811***
                                                                       [0.766, 0.856]
Num.Obs.                             1536             1536             1400
R2                                   0.073            0.204            0.588
R2 Adj.                              0.072            0.203            0.587
AIC                                  4737.4           4505.2           3174.8
BIC                                  4753.4           4526.5           3201.0
Log.Lik.                             −2365.705        −2248.588        −1582.410
RMSE                                 1.13             1.05             0.75
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
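
Note that Model 3 is fitted to 1400 students while Models 1 and 2 use 1536, because rows with a missing X1 math score are dropped. One way to make the comparison like-for-like is to refit every model on the same complete cases. A minimal sketch on simulated data follows; `toy`, `score`, `ses`, and `pretest` are hypothetical names, not the report's variables.

```r
# Simulated data with some missing pre-test scores (not the HSLS data)
set.seed(5)
n <- 400
ses <- rnorm(n)
pretest <- 0.4 * ses + rnorm(n)
score <- 0.5 + 0.2 * ses + 0.8 * pretest + rnorm(n)
pretest[sample(n, 40)] <- NA
toy <- data.frame(score, ses, pretest)

# Keep only rows that are complete on every variable used by any model
complete <- toy[complete.cases(toy), ]

# Nested models fitted to the same rows
fit_small <- lm(score ~ ses, data = complete)
fit_large <- lm(score ~ ses + pretest, data = complete)

# F-test for the improvement from adding the pre-test
cmp <- anova(fit_small, fit_large)
print(cmp)
print(c(nobs(fit_small), nobs(fit_large)))
```

Fit statistics such as R-squared, AIC, and BIC are only directly comparable across models when, as here, the models share the same observations.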

This analysis studied the relationship between Math Efficacy and Math Scores. Students with High Math Efficacy were found to have higher math scores than students with Low Math Efficacy in all 3 models (p < 0.001).

A limitation of this study was group assignment. Students were placed into the High and Low Math Efficacy groups according to whether their math efficacy score fell above or below the mean. While this may be reasonable, it is also arguably arbitrary, as students just below or just above the cutoff could possibly belong in the opposite group.

Another limitation of this study was the handling of missing values. For this analysis, students with missing values for X1 Math Scores and X2 Math Efficacy were removed from the data set. That may be fine if the values are missing at random. However, the estimates could be biased if missingness is related to some important characteristic.

This analysis did not consider the relationship between math scores and sex/gender. Boys are stereotypically said to be better at math than girls. Furthermore, there may even be an interaction such that the relationship between math scores and math efficacy varies by gender. Therefore, future research should add sex to the model and study any main effects and interactions.
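
The suggested follow-up could be specified with an interaction term. The following is a sketch on simulated data only, so the estimates say nothing about the HSLS sample; `toy`, `efficacy`, `sex`, and `score` are hypothetical names.

```r
# Simulated two-group factors and outcome (not the HSLS data)
set.seed(6)
n <- 400
toy <- data.frame(
  efficacy = factor(sample(c("Low", "High"), n, replace = TRUE),
                    levels = c("Low", "High")),
  sex = factor(sample(c("Male", "Female"), n, replace = TRUE),
               levels = c("Male", "Female"))
)
toy$score <- 0.4 + 0.3 * (toy$efficacy == "High") +
  0.1 * (toy$sex == "Female") + rnorm(n)

# `efficacy * sex` expands to both main effects plus their interaction
fit_int <- lm(score ~ efficacy * sex, data = toy)
print(coef(summary(fit_int)))
```

The `efficacyHigh:sexFemale` coefficient estimates how much the high-versus-low efficacy gap differs for the second level of `sex` relative to the reference level.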

SECTION B:

Interpretation of a published article, Miller et al. (2016) Please read the attached article, paying close attention to the design of the study, the analysis and results. Please provide a short answer to the following question about this article.

  1. This research study was a randomized controlled trial, randomizing principals to the treatment and control group. Despite the use of randomization, are there any threats to internal validity in this study? Choose at least two of the threats to internal validity as discussed in class and outlined by the article by Shadish & Luellen (attached to the midterm information) and discuss why this threat does or does not apply to this study.

    • Attrition: The researchers lost a total of 31 schools during their study. Fortunately, they determined that differences between the treatment and control groups were not systematically related to school context or achievement status. (I think a chi-squared test would tell you this??)

    • Selection: Selection bias is when your sample differs systematically from your population of interest. Lots of things can lead to selection bias, one of them being attrition. As I mentioned before, the researchers stated that they found no statistically significant differences between those who participated in the program and those who withdrew.

  2. How well do you think this study generalizes to other principals and schools? Provide a rationale for your answer.

    • It is difficult to speak to the external validity of this study. The participants in this study all come from rural Michigan elementary schools. Can we expect similar outcomes from rural middle or high schools in Michigan? Can we expect similar outcomes from schools outside of Michigan? outside the Midwest? outside the United States? It’s hard to say.
  3. Do you have any concerns with differences among the two groups of principals (baseline equivalence) in this study? Provide a rationale for your answer.

    • No, I don’t have any concerns with baseline equivalence. Although participants dropped out of the program, the researchers determined those who withdrew were not systematically different from those who participated in the program. Furthermore, the pre-test determined participants in the treatment group were not systematically different from those in the control group.
  4. On page 543, the authors provide the regression equation they will use in the study. Describe one of the control variables used in the model, and provide a rationale for why the authors might choose to control for this variable when estimating the effect of the BLPD program.

    • The researchers used the pre-test scores as a control variable for their model. This is important because prior knowledge influences both the outcome variable and participation in the program. Adding pre-test scores to the model helps to make our treatment variable exogenous.
  5. Do you have any concerns with multicollinearity among the explanatory variables in this study? Provide a rationale for your answer.

    • No, I don’t think there is a concern with multicollinearity. My understanding from the article is that they just did a bunch of t-tests/bivariate regression analyses. The article says the following: “In addition, we planned no tests of relationships among the outcomes; rather, we simply examined the degree to which treatment and control school principals differ on these outcomes controlling for their baseline (pretreatment) status on each indicator.” This means there was only one dichotomous variable (Treated or Untreated) and no control variables.
  6. Choose an outcome in Table 4. Describe in your own words the meaning of the unstandardized coefficient. If a control variable was significant in this model, discuss what this significant variable means in the model.

    • Those in the program scored .424 points higher on “Principal Efficacy” than those in the control group. The researchers did not control for anything with this outcome.