Homework 4

Author

James Malloy

# First things first. Load libraries 
library(tidyverse)
library(ggpubr)
library(haven)
library(modelsummary)
library(broom)

# Now let's load our Framingham dataset
df_framingham <- read_sav("Framingham.sav")
  1. Using the Framingham data set linked in iCollege, create dummy codes for the variable Sex in the data set. Fit the model using cholesterol as the response variable, with age and a dummy code for sex as the explanatory variables. You might consider transforming age to improve the interpretability of the intercept term.

    # First, let's create our dichotomous variable "female"
    df_framingham <- 
      df_framingham %>% 
      mutate(female = case_when(Sex == "MALE" ~ 0,
                                Sex == "FEM" ~ 1))
    
    # Now that we've created our new female variable, let's double-check our work
    df_framingham %>% count(Sex, female)
    # A tibble: 2 × 3
      Sex   female     n
      <chr>  <dbl> <int>
    1 FEM        1   737
    2 MALE       0   669
    # Now let's center the age var. so the y-intercept has a meaningful value
    df_framingham <- 
      df_framingham %>%
      mutate(age_centered = Age - mean(Age, na.rm = TRUE))
    
    # Check our work: "age_centered" should have mean 0 and the same sd and range as "Age"
    psych::describe( 
      data.frame(
        df_framingham$Age, 
        df_framingham$age_centered), fast = TRUE)
                               vars    n  mean   sd   min   max range   se
    df_framingham.Age             1 1406 52.45 4.78 45.00 62.00    17 0.13
    df_framingham.age_centered    2 1406  0.00 4.78 -7.45  9.55    17 0.13
# Now let's fit our model
m1_summary <- 
  lm(Cholesterol ~ female + age_centered, data = df_framingham) %>% summary()
m1_summary

Call:
lm(formula = Cholesterol ~ female + age_centered, data = df_framingham)

Residuals:
     Min       1Q   Median       3Q      Max 
-133.434  -30.319   -3.914   27.980  186.028 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  225.8398     1.7552 128.673  < 2e-16 ***
female        16.9053     2.4243   6.973 4.75e-12 ***
age_centered   0.7891     0.2531   3.117  0.00186 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 45.4 on 1403 degrees of freedom
Multiple R-squared:  0.03965,   Adjusted R-squared:  0.03828 
F-statistic: 28.96 on 2 and 1403 DF,  p-value: 4.736e-13
  1. Which dummy code did you use in your regression model? What is the reference category (the sex represented by the value of 0) for this dummy code?

    I created the variable “female” that equals 1 when Sex == “FEM” and 0 when Sex == “MALE”. The reference category is males.
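    The manual dummy code can be compared with what R does on its own when a factor enters a model formula. The sketch below uses a small made-up data frame (`toy` and its `sex` column are illustrative names, not part of the Framingham file) and assumes the factor levels are ordered with "MALE" first, which makes males the reference category.

```r
# Toy data for illustration only (not the Framingham file).
toy <- data.frame(
  sex = factor(c("MALE", "FEM", "FEM", "MALE"), levels = c("MALE", "FEM"))
)

# The first factor level ("MALE") becomes the reference category, so R builds
# a single 0/1 column, sexFEM, that plays the same role as the "female" variable.
design <- model.matrix(~ sex, data = toy)
design[, "sexFEM"]
```

    Reordering the levels so that "FEM" comes first would switch the reference category to females and flip the sign of the sex coefficient.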

  2. Interpret the adjusted R-square and the F-test for this model

The adjusted R-squared is 0.0383, so our explanatory variables, age and sex, account for only about 3.8% of the variation in our outcome variable, cholesterol, after adjusting for the number of predictors.

For model 1, \(F_{2,\,1403} = 28.96\), \(p = 4.74 \times 10^{-13}\). We reject the null hypothesis that both slopes are zero: the model with age and sex fits significantly better than an intercept-only model that predicts the mean cholesterol for everyone, so at least one of the explanatory variables is related to cholesterol.
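As a rough check, the F-statistic and the adjusted R-squared can both be recovered from the multiple R-squared, the sample size, and the number of predictors. The sketch below copies the rounded values from the model output (R-squared 0.03965, n = 1406, two predictors), so small rounding differences from the printed output are expected; the variable names are illustrative.

```r
# Values copied (rounded) from the model summary above.
r_squared <- 0.03965
n <- 1406          # observations
k <- 2             # explanatory variables: female, age_centered
df_resid <- n - k - 1

# F = (explained variance per predictor) / (unexplained variance per residual df)
f_stat <- (r_squared / k) / ((1 - r_squared) / df_resid)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

# Upper-tail probability of the F distribution with (k, df_resid) df
p_value <- pf(f_stat, df1 = k, df2 = df_resid, lower.tail = FALSE)

round(c(F = f_stat, adj_R2 = adj_r_squared), 4)
p_value
```

The two values should land close to the reported 28.96 and 0.0383, which shows that the F-test and R-squared are two views of the same comparison against the intercept-only model.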

  1. Interpret the estimate for the intercept term and report the confidence interval. Is it statistically different from zero?

    #Let's remind ourselves of what the model coefficients were
    m1_coefficients <- m1_summary$coefficients
    m1_coefficients
                    Estimate Std. Error    t value     Pr(>|t|)
    (Intercept)  225.8398390  1.7551509 128.672606 0.000000e+00
    female        16.9052732  2.4242837   6.973307 4.752994e-12
    age_centered   0.7890559  0.2531302   3.117194 1.862890e-03
    # Now that we remember our coefficients, let's extract the y-intercept
    m1_y_intercept <- m1_coefficients["(Intercept)", "Estimate"] %>% round(2) # store y-intercept value

    Yes, the y-intercept, 225.84, is statistically different from 0, p < 0.001. Because we centered age, the intercept is the predicted cholesterol for a male (female = 0) at the mean age in our dataset, 52.45 years.
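    The effect of centering can be illustrated with simulated data. Everything in the sketch below is made up for illustration (the data frame `sim`, the seed, and the generating coefficients are assumptions, not Framingham values); it only shows that centering a predictor leaves the slopes unchanged and moves the intercept to the prediction at the mean age.

```r
# Simulated data for illustration only; these are NOT the Framingham values.
set.seed(4)
n <- 200
age <- round(runif(n, min = 45, max = 62))
female <- rbinom(n, size = 1, prob = 0.5)
chol <- 185 + 0.8 * age + 17 * female + rnorm(n, sd = 45)
sim <- data.frame(chol, age, female, age_centered = age - mean(age))

fit_raw <- lm(chol ~ female + age, data = sim)
fit_centered <- lm(chol ~ female + age_centered, data = sim)

# The age slope is the same in both models
c(raw = coef(fit_raw)[["age"]], centered = coef(fit_centered)[["age_centered"]])

# The centered intercept equals the raw model's prediction for a male at the mean age
shifted_intercept <- coef(fit_raw)[["(Intercept)"]] +
  coef(fit_raw)[["age"]] * mean(sim$age)
c(shifted = shifted_intercept, centered = coef(fit_centered)[["(Intercept)"]])
```

    Without centering, the intercept would describe a male of age 0, which lies far outside the observed age range of 45 to 62.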

  2. Interpret the estimate for the coefficient for age and report the confidence interval. Is it statistically different from zero?

# Now let's extract our age coefficient and its p-value
m1_age_coefficient <- m1_coefficients["age_centered", "Estimate"] %>% round(2) # store age coefficient
m1_age_p_value <- m1_coefficients["age_centered", "Pr(>|t|)"] # store p-value for age coefficient

For every one-year increase in age, we expect cholesterol to increase by 0.79 on average, holding sex constant. This coefficient is statistically different from zero, p = 0.0019.

  1. Interpret the estimate for the coefficient for sex and report the confidence interval. Is it statistically different from zero?
# Now let's extract our female coefficient and its p-value
m1_female_coefficient <- m1_coefficients["female", "Estimate"] %>% round() # store coefficient for female
m1_female_p_value <- m1_coefficients["female", "Pr(>|t|)"] # store p-value for female coefficient

A female’s cholesterol is, on average, 17 points higher than that of a male of the same age. This difference is statistically different from zero, \(p = 4.75 \times 10^{-12}\).
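The questions above also ask for confidence intervals. A minimal sketch of one way to obtain 95% intervals is shown below, assuming the usual t-based interval (estimate plus or minus the critical t value times the standard error) and using the estimates and standard errors copied from the coefficient table. The vector names are illustrative.

```r
# Estimates and standard errors copied from the coefficient table above.
estimate  <- c(intercept = 225.8398390, female = 16.9052732, age_centered = 0.7890559)
std_error <- c(intercept = 1.7551509,   female = 2.4242837,  age_centered = 0.2531302)
df_resid  <- 1403  # residual degrees of freedom from the model summary

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- cbind(lower = estimate - t_crit * std_error,
            upper = estimate + t_crit * std_error)
round(ci, 2)
```

Under these assumptions the intervals come out at roughly 222.40 to 229.28 for the intercept, 12.15 to 21.66 for female, and 0.29 to 1.29 for age. None of them contains zero, which agrees with the p-values. Calling `confint()` on the fitted `lm` object (not on the `summary()` object stored in `m1_summary`) should give the same intervals directly.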

  1. Write down the prediction equations for men and for women. What are the differences between these two equations?

\[ \widehat{Cholesterol}_{female} = 242.75 + 0.79 \times (Age - 52.45) \]

\[ \widehat{Cholesterol}_{male} = 225.84 + 0.79 \times (Age - 52.45) \]

The slopes for both equations are the same. What’s different is their y-intercepts: the female intercept is 225.84 + 16.91 = 242.75. Women are expected to have cholesterol about 17 points higher than men of the same age.
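The fitted model can also be wrapped in a small function to confirm that the gap between the two prediction lines is constant across ages. This is a sketch using the rounded coefficients from the model output and the mean age of 52.45 from the descriptive statistics. The function name `predict_cholesterol` and the chosen ages are illustrative.

```r
# Rounded coefficients from the model output; mean age from psych::describe().
predict_cholesterol <- function(age, female, mean_age = 52.45) {
  225.84 + 16.91 * female + 0.79 * (age - mean_age)
}

ages <- c(45, 52.45, 62)  # youngest, mean, and oldest ages in the data
male_pred   <- predict_cholesterol(ages, female = 0)
female_pred <- predict_cholesterol(ages, female = 1)

# The difference is the female coefficient at every age (parallel lines)
female_pred - male_pred
```

Because the model has no age-by-sex interaction, the difference equals the female coefficient regardless of age.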

  1. Let’s take a look at the Morton & Riegle-Crumb (2020) article on school differences in covering algebra content. Please skim the article and pay close attention to the regression models fit to the data. Several of the questions below refer to Table 2 on page 444. In Model 1, only the race composition variables are included in the model whereas in Model 2, the model includes the race composition, school SES, urbanicity, region, and percent English language learners explanatory variables. Note that urbanicity and region are both dummy codes.
  1. What is the design of this study (Randomized controlled trial, Quasi-experimental study, Observational study)? Provide a brief rationale for your answer.

    This is an observational study. The researchers did not assign schools or teachers to conditions or manipulate any explanatory variable; the data are administrative and come from the U.S. portion of the 2011 Trends in International Mathematics and Science Study (TIMSS).

  2. What is the response or dependent variable of interest in this study?

    DV: the amount of algebra and advanced math content (e.g., geometry) that teachers report covering

  3. Describe how the school racial/ethnic composition variables are coded in this study.

    Using 60% or more as the threshold, the researchers identify n = 9 “Predominantly Black” schools, n = 20 “Predominantly Latinx” schools, and n = 82 “not predominantly minority” schools.

  4. In Table 2, compare the coefficients for the school racial/ethnic composition variables for Model 1 versus Model 2. Describe what happens to the coefficients for the school race composition variables when the school SES, urbanicity, region and percent ELL are added to the model. State if the school racial/ethnic composition variables are statistically different from zero.

    After controlling for school SES, urbanicity, region, and percent ELL, the coefficient for predominantly Black schools shrinks in magnitude from 13.175 to 9.67. Despite the decrease, the estimate remains statistically significant with the control variables in the model. We interpret this coefficient to mean that predominantly Black schools are associated with, on average, 9.67 fewer minutes of math education than not predominantly minority schools, controlling for school SES, urbanicity, region, and percent ELL.

    The coefficient for predominantly Latinx schools changes sign, from negative in Model 1 (-6.273) to positive in Model 2 (3.649). However, the confidence interval includes 0 in both models, so neither estimate is statistically different from zero. Thus, we should not infer a population difference between predominantly Latinx and not predominantly minority schools.

  5. Describe how the urbanicity variables are coded in this study.

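    Urbanicity has 4 categories, each coded as a dichotomous variable: city (the reference group), suburb, town, and rural. Because city is the reference group, only the suburb, town, and rural dummy codes enter the model.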

  6. For model 2, interpret each of the coefficients for the urbanicity variable and indicate whether they are significantly different from the reference category.

    Suburban schools are the only group whose coefficient is statistically different from the reference group. We interpret it as follows: students in suburban schools receive, on average, 6.85 more minutes of math education than students in city schools, controlling for race, SES, region, school size, and percent ELL.

    The town and rural coefficients are not statistically significant, so we cannot conclude that town or rural schools differ from city schools.

  7. Describe how the region variables are coded in this study.

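    Region has 4 categories, each coded as a dichotomous variable: Northeast (the reference group), Midwest, South, and West. Because Northeast is the reference group, only the Midwest, South, and West dummy codes enter the model.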

  8. Interpret each of the coefficients for the region variable, and indicate whether they are significantly different from the reference category.

    Students in Midwest schools receive, on average, 8.5 more minutes of math education than students in Northeast schools, controlling for race, SES, urbanicity, school size, and percent ELL.

    Students in the South receive, on average, 11.5 more minutes of math education than students in the Northeast, controlling for race, SES, urbanicity, school size, and percent ELL.

    Students in the West receive, on average, 12.3 more minutes of math education than students in the Northeast, controlling for race, SES, urbanicity, school size, and percent ELL.

  9. In Model 2, describe the school that is represented by the constant value. Assume that the continuous variables in the model are mean-centered.

    The constant, 76.767, is the expected minutes of math education for students who attend a not predominantly minority school located in a city in the Northeast, with average SES, average school size, and an average percentage of English language learners.
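    One way to see this is to write the model schematically and set every dummy code and every mean-centered covariate to zero. The symbols \(b_0\) and \(b_{Midwest}\) are shorthand introduced here for illustration, not notation from the article, and the dots stand for the remaining terms in Model 2.

\[ \widehat{Y} = b_0 + b_{Midwest} \cdot Midwest + \dots \quad\Rightarrow\quad \widehat{Y} = b_0 = 76.767 \ \text{when all dummy codes and centered covariates equal } 0 \]

    Using the Midwest coefficient reported above, an otherwise identical school in the Midwest would have an expected value of

\[ \widehat{Y} = 76.767 + 8.5 = 85.267 \]

    so each dummy coefficient acts as a shift away from the school described by the constant.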