Regression Activity

Use the Regression Module R code and the ECLSK data. Pick one outcome and one predictor, then conduct a simple linear regression analysis.

1. Identify your model. What is your outcome? What is your predictor? Write your model in words, then in statistical notation.

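Outcome: the percentage of students on free or reduced lunch at a particular school (free.lunch). Predictor: household income (income). In words, this model estimates how the school's free or reduced lunch percentage changes with household income.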

Model for predicting free lunch eligibility from household income.
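
In statistical notation, the model described in words above could be written as below. The subscript i (indexing observations), the symbols β₀, β₁, ε, and the normal-error assumption are the usual simple-regression conventions and are assumed here rather than taken from the module notes.

```latex
\text{free.lunch}_i = \beta_0 + \beta_1\,\text{income}_i + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2)
```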

# Scatter plot with regression line

ggplot(eclsk, aes(x = income, y = free.lunch)) +
  geom_point(color = "steelblue", alpha = 0.6) +            # Scatter plot
  geom_smooth(method = "lm", se = TRUE, color = "firebrick") +  # Regression line with confidence band
  labs(
    title = "Relationship Between Household Income and Free Lunch",
    x = "Household Income",
    y = "Free Lunch"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Conduct analysis and report key results: Interpret the coefficients, including whether they are statistically significant. Be ready to share the model R², coefficient estimate, and standardized coefficient estimate.

lin.model <- lm(free.lunch ~ income, data = eclsk)
# Printing the object shows little; it is an object of class lm, and summary() gives a fuller picture
summary(lin.model)

## 
## Call:
## lm(formula = free.lunch ~ income, data = eclsk)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.788 -17.407  -6.227  13.842 138.112 
## 
## Coefficients:
##                 Estimate   Std. Error t value            Pr(>|t|)    
## (Intercept) 37.919930445  0.709861475   53.42 <0.0000000000000002 ***
## income      -0.000220040  0.000009986  -22.03 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.23 on 2433 degrees of freedom
## Multiple R-squared:  0.1664, Adjusted R-squared:  0.166 
## F-statistic: 485.5 on 1 and 2433 DF,  p-value: < 0.00000000000000022

Model Fit

R-squared: 0.1664. Approximately 16.6% of the variance in the free lunch percentage is explained by income.

Residual Standard Error: 23.23. Observed values typically fall about 23 percentage points from the fitted line.

F-statistic: 485.5 on 1 and 2433 DF, p-value < 2.2e-16. The model is statistically significant overall.
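
The fit statistics above can also be pulled out of the summary object instead of being read off the printout. The following is a minimal sketch on simulated data: the data frame `sim` and its coefficients are hypothetical stand-ins for `eclsk`, used only so the block runs on its own.

```r
# Simulated stand-in for eclsk (hypothetical values), so the block is self-contained
set.seed(42)
n <- 300
sim <- data.frame(income = runif(n, 5000, 150000))
sim$free.lunch <- 38 - 0.00022 * sim$income + rnorm(n, sd = 23)

fit <- lm(free.lunch ~ income, data = sim)
s <- summary(fit)

r2  <- s$r.squared    # Multiple R-squared
rse <- s$sigma        # Residual standard error
f   <- s$fstatistic   # named vector: value, numdf, dendf

# Overall F-test p-value, recomputed from the F statistic and its degrees of freedom
p_f <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# R-squared is 1 - SSE / SST, which can be checked by hand
sse <- sum(residuals(fit)^2)
sst <- sum((sim$free.lunch - mean(sim$free.lunch))^2)
r2_by_hand <- 1 - sse / sst

round(c(r2 = r2, r2_by_hand = r2_by_hand, rse = rse, p_f = p_f), 4)
```

With the real data, the same calls on `lin.model` should reproduce the values shown in the `summary()` output.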

Interpretation of Results

Intercept (β₀ = 37.92): When income is 0, the predicted free lunch percentage is 37.92%. Although an income of exactly $0 is largely a theoretical value, the intercept anchors the regression line and gives context to the slope.

Income (β₁ = -0.000220): Each additional $1 of income is associated with a 0.00022 percentage point decrease in the free lunch percentage. Equivalently, a $10,000 increase in income is associated with a decrease of about 2.2 percentage points (10,000 × 0.00022 = 2.2). This association is highly statistically significant (p < 0.001).
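
The activity also asks for a standardized coefficient. In simple regression the standardized slope equals the Pearson correlation, so the reported R² of 0.1664 with a negative slope implies a standardized slope of about -0.41 (the negative square root of 0.1664). The sketch below shows two equivalent ways to compute it; it uses simulated data (`sim`, with hypothetical coefficients) in place of `eclsk` so it can run on its own.

```r
# Simulated stand-in for eclsk (hypothetical values), so the block is self-contained
set.seed(1)
n <- 500
sim <- data.frame(income = runif(n, 5000, 150000))
sim$free.lunch <- 38 - 0.00022 * sim$income + rnorm(n, sd = 23)

# Unstandardized fit
fit <- lm(free.lunch ~ income, data = sim)

# Option 1: rescale the raw slope by sd(x) / sd(y)
b1 <- coef(fit)[["income"]]
beta_std <- b1 * sd(sim$income) / sd(sim$free.lunch)

# Option 2: refit on z-scored variables (as.numeric drops the matrix wrapper from scale())
sim$z_income <- as.numeric(scale(sim$income))
sim$z_lunch  <- as.numeric(scale(sim$free.lunch))
fit_z <- lm(z_lunch ~ z_income, data = sim)
beta_z <- coef(fit_z)[["z_income"]]

# In simple regression both match Pearson's r
r <- cor(sim$income, sim$free.lunch)
c(beta_std = beta_std, beta_z = beta_z, r = r)
```

The standardized slope reads as: a one standard deviation increase in income is associated with a change of that many standard deviations in the free lunch percentage.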

Check Assumptions (Normality, Linearity, Homogeneity of variance). Describe evidence supporting or refuting each assumption.

# Simple Regression: Assumption checks
# These checks are usually based on the residuals
lin.model.residuals <- residuals(lin.model)
# Studentized residuals work too and put the residuals on a common scale
studentized.residuals <- rstudent(lin.model)

# Normality
# Basic QQ plot using the raw residuals
qqnorm(lin.model.residuals)

qqnorm(studentized.residuals)  # same check on the studentized scale
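
A QQ plot is easier to judge with a reference line, and a formal test can back up the visual impression. The residual summary reported earlier (median -6.2, maximum 138.1) hints at a long upper tail, so the sketch below simulates right-skewed errors to show what a violation looks like. The data frame `sim` and the error distribution are assumptions for illustration, not the ECLSK data.

```r
# Simulated stand-in for eclsk with right-skewed errors (hypothetical values)
set.seed(7)
n <- 300
sim <- data.frame(income = runif(n, 5000, 150000))
# Exponential errors shifted to mean 0 give a long upper tail
sim$free.lunch <- 38 - 0.00022 * sim$income + (rexp(n, rate = 1 / 20) - 20)

fit <- lm(free.lunch ~ income, data = sim)
res <- rstudent(fit)

qqnorm(res)
qqline(res)  # reference line through the first and third quartiles

# Shapiro-Wilk test of normality (base R; needs between 3 and 5000 observations)
sw <- shapiro.test(res)
sw$p.value
```

Points bending away from the reference line in the upper tail, together with a small Shapiro-Wilk p-value, would be evidence against normality of the residuals. With samples in the thousands the test flags even mild departures, so the plot usually deserves more weight.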

#--- Scatter Plots --- Homogeneity of Variance and Linearity  
#plot studentized residuals vs. fitted values
fitted.values <- fitted(lin.model)
plot(fitted.values, studentized.residuals)
abline(h=0)

# You are looking for points spread roughly evenly across the fitted values
# A pattern or fan shape suggests heterogeneity of variance
# Uneven grouping above and below the line suggests a linearity issue
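
The fan-shape check described above can be supplemented with a numeric test. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan test by hand in base R: regress the squared residuals on the predictor and compare n × R² to a chi-squared distribution with one degree of freedom. The simulated data frame `sim` and its shrinking error spread are assumptions chosen to produce a visible fan shape.

```r
# Simulated stand-in for eclsk with non-constant error variance (hypothetical values)
set.seed(11)
n <- 400
sim <- data.frame(income = runif(n, 5000, 150000))
# Error spread shrinks as income rises, which produces a fan shape
sim$free.lunch <- 38 - 0.00022 * sim$income +
  rnorm(n, sd = 30 - 0.00015 * sim$income)

fit <- lm(free.lunch ~ income, data = sim)

# Auxiliary regression: squared residuals on the predictor
sim$res_sq <- residuals(fit)^2
aux <- lm(res_sq ~ income, data = sim)

# Test statistic n * R^2, chi-squared with df = number of predictors (1 here)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p.value = bp_p)
```

A small p-value suggests the residual variance changes with income, which supports what a fan-shaped residual plot shows.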

# Is the issue linearity? Add a quadratic term for the predictor (income), not the outcome
lin.model2 <- lm(free.lunch ~ income + I(income^2), data = eclsk)

# Check
lin.model2.residuals <- lin.model2$residuals
qqnorm(lin.model2.residuals)

fitted.values2 <- fitted(lin.model2)
studentized.residuals2 <- rstudent(lin.model2)
plot(fitted.values2, studentized.residuals2)
abline(h=0)
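
Beyond comparing residual plots, the linear and quadratic models can be compared directly, since they are nested. The sketch below uses `anova()` for that comparison. The simulated data frame `sim` has a deliberately curved relationship with hypothetical coefficients, so the result illustrates the method rather than the ECLSK data.

```r
# Simulated stand-in for eclsk with a curved relationship (hypothetical values)
set.seed(3)
n <- 400
sim <- data.frame(income = runif(n, 5000, 150000))
sim$free.lunch <- 60 - 0.0007 * sim$income +
  0.000000003 * sim$income^2 + rnorm(n, sd = 10)

fit1 <- lm(free.lunch ~ income, data = sim)
fit2 <- lm(free.lunch ~ income + I(income^2), data = sim)

# F-test for whether the quadratic term improves on the linear model
cmp <- anova(fit1, fit2)
p_quad <- cmp[["Pr(>F)"]][2]

c(r2_linear = summary(fit1)$r.squared,
  r2_quadratic = summary(fit2)$r.squared,
  p_quad = p_quad)
```

A small p-value for the added term, along with a flatter residual plot for the second model, would point to nonlinearity as the issue.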

Andrew J. Knoblich

04-01-2025