Use the Regression Module R code and ECLSK data: pick one outcome and one predictor, then conduct a simple linear regression analysis.
This model estimates how household income relates to the percentage of students on free or reduced lunch at a particular school.
Model for predicting free lunch eligibility.
@return Model output
# Scatter plot with regression line
library(ggplot2) # needed for ggplot(), geom_point(), geom_smooth()
ggplot(eclsk, aes(x = income, y = free.lunch)) +
  geom_point(color = "steelblue", alpha = 0.6) + # Scatter plot
  geom_smooth(method = "lm", se = TRUE, color = "firebrick") + # Regression line with confidence band
  labs(
    title = "Relationship Between Household Income and Free Lunch",
    x = "Household Income",
    y = "Free Lunch"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
lin.model <- lm(free.lunch ~ income, data = eclsk)
# The return doesn't print much but we have an object of class lm; summary() provides a better picture
summary(lin.model)
##
## Call:
## lm(formula = free.lunch ~ income, data = eclsk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.788 -17.407 -6.227 13.842 138.112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.919930445 0.709861475 53.42 <0.0000000000000002 ***
## income -0.000220040 0.000009986 -22.03 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.23 on 2433 degrees of freedom
## Multiple R-squared: 0.1664, Adjusted R-squared: 0.166
## F-statistic: 485.5 on 1 and 2433 DF, p-value: < 0.00000000000000022
R-squared: 0.1664. Approximately 16.6% of the variance in free lunch eligibility is explained by income.
Residual Standard Error: 23.23. Observed values typically fall about 23 percentage points from the model's predictions.
F-statistic: 485.5, p-value < 2.2e-16. The model is statistically significant overall.
Intercept (β₀ = 37.92): When income is 0, the predicted free lunch eligibility is 37.92%. Although a household income of $0 is largely theoretical, the intercept anchors the line and gives context to the slope.
Income (β₁ = -0.000220): Each additional $1 of income is associated with a 0.00022 percentage point decrease in free lunch eligibility. Equivalently, a $10,000 increase in income is associated with a decrease of about 2.2 percentage points. This association is highly statistically significant (p < 0.001), but a simple regression on observational data does not by itself show that income causes the change.
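As a quick arithmetic check on the interpretation above, the sketch below rescales the slope and computes one fitted value by hand. The coefficients and t value are copied from the summary output; the $50,000 income level is an arbitrary illustration, not a value taken from the data.

```r
# Coefficients copied from the summary(lin.model) output above
b0 <- 37.919930445   # intercept
b1 <- -0.000220040   # slope per $1 of income

# Slope rescaled to a $10,000 change in income (percentage points)
slope.per.10k <- b1 * 10000

# Predicted free lunch percentage at an illustrative income of $50,000
# (assumed value, chosen only for this example)
pred.50k <- b0 + b1 * 50000

# With a single predictor, the overall F-statistic equals the squared
# t value of the slope, up to rounding in the printed output
t.income <- -22.03
f.from.t <- t.income^2

print(c(slope.per.10k = slope.per.10k, pred.50k = pred.50k, f.from.t = f.from.t))
```

The rescaled slope is about -2.2 percentage points per $10,000, the fitted value at $50,000 is roughly 26.9%, and the squared t value lands close to the reported F-statistic of 485.5.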
# Simple Regression: Assumption checks
# Often involve residuals
lin.model.residuals <- lin.model$residuals
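# Studentized residuals work too
studentized.residuals <- rstudent(lin.model)
# Normality
# Basic Q-Q plot using raw residuals; qqline() adds a reference line
qqnorm(lin.model.residuals)
qqline(lin.model.residuals)
qqnorm(studentized.residuals) # same shape, just on a standardized scale
qqline(studentized.residuals)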
#--- Scatter Plots --- Homogeneity of Variance and Linearity
#plot studentized residuals vs. fitted values
fitted.values <- fitted(lin.model)
plot(fitted.values, studentized.residuals)
abline(h=0)
# You are looking for points spread relatively evenly across the fitted values
# A pattern or fan shape suggests heterogeneity of variance
# Uneven grouping above and below the zero line suggests a linearity issue
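# Is the issue linearity? Add a squared income term to allow for curvature
# (the squared term must be the predictor, not the outcome)
lin.model2 <- lm(free.lunch ~ income + I(income^2), data = eclsk)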
# Check
lin.model2.residuals <- lin.model2$residuals
qqnorm(lin.model2.residuals)
fitted.values2 <- fitted(lin.model2)
studentized.residuals2 <- rstudent(lin.model2)
plot(fitted.values2, studentized.residuals2)
abline(h=0)
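Beyond eyeballing the residual plots, one way to judge whether a squared predictor term helps is to compare the two nested models with an F test. The sketch below does this on simulated data (not ECLSK), so the names sim, x, and y are hypothetical and the numbers say nothing about the real results.

```r
# Simulated data only: these values are NOT from ECLSK
set.seed(1)
sim <- data.frame(x = runif(200, 0, 10))
sim$y <- 40 - 6 * sim$x + 0.4 * sim$x^2 + rnorm(200, sd = 3)

# Straight-line fit versus a fit that adds a squared predictor term
sim.linear <- lm(y ~ x, data = sim)
sim.quad   <- lm(y ~ x + I(x^2), data = sim)

# Nested-model F test: a small p-value favors keeping the squared term
sim.compare <- anova(sim.linear, sim.quad)
print(sim.compare)
```

The same comparison could be run on the real fits with anova(lin.model, lin.model2), provided the second model adds only terms built from the predictor.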