Today’s Focus: Fitting regression models, selecting the best model, and validating model assumptions.
Important: This worksheet covers BOTH linear regression and logistic regression. Choose the section that matches your project’s response variable:
What you should have ready: - Your cleaned dataset - Summary statistics and EDA from Week 2 - Initial ideas about which variables to include
By the end of today, you should have: - Fitted at least 2-3 regression models - Selected your best model with justification - Checked model diagnostics - Identified any problems and potential solutions
# Load your dataset here
setwd("/Users/yejuseol/Desktop/BU/2026 Spring/MA214/Project 1")
data <- read.csv("cleaned_college_data.csv")
# Display the first few rows
head(data)
## College_Name State Predominant_Degree
## 1 Alaska Bible College AK Bachelor
## 2 Alaska Career College AK Certificate
## 3 Alaska Christian College AK Associate
## 4 Alaska Pacific University AK Bachelor
## 5 Alaska Vocational Technical Center AK Certificate
## 6 Charter College AK Certificate
## Institution_Type Urban_Rural Admission_Rate Average_SAT Enrollment_Total
## 1 Private Nonprofit Suburb: Small NA NA 34
## 2 Private For-Profit City: Large NA NA 255
## 3 Private Nonprofit Rural: Fringe 0.8936 NA 60
## 4 Private Nonprofit City: Large 0.8621 NA 400
## 5 Public Rural: Remote NA NA 452
## 6 Private For-Profit City: Large NA NA 2277
## Pct_White Pct_Black Pct_Hispanic Pct_Asian Net_Price_Public Net_Price_Private
## 1 0.8529 0.0294 0.0294 0.0000 NA 14093
## 2 0.2510 0.1020 0.0824 0.1490 NA 18091
## 3 0.0833 0.0000 0.0000 0.0000 NA 5427
## 4 0.3325 0.0275 0.0900 0.0700 NA 18976
## 5 0.4978 0.0155 0.0288 0.0088 12071 NA
## 6 0.3808 0.0839 0.2306 0.0413 NA 25225
## Average_Total_Cost Tuition_InState Tuition_OutState
## 1 22101 10930 10930
## 2 NA NA NA
## 3 22858 9014 9014
## 4 34964 20760 20760
## 5 NA NA NA
## 6 31188 18678 18678
## Instructional_Spending_Per_Student Avg_Faculty_Salary Pct_Full_Time_Faculty
## 1 8702 NA NA
## 2 4279 5205 0.0476
## 3 18917 8009 NA
## 4 14628 7297 0.4775
## 5 31891 NA NA
## 6 2637 5298 0.2821
## Percent_Pell_Grant Graduation_Rate Graduation_Rate_Pooled
## 1 0.4872 1.0000 0.6111
## 2 0.5810 NA NA
## 3 0.7763 NA NA
## 4 0.3442 0.4688 0.4118
## 5 0.0602 NA NA
## 6 0.5579 0.5753 0.5621
## Retention_Rate_FullTime Retention_Rate_PartTime Default_Rate_3yr
## 1 0.3333 NA 0.000
## 2 NA 0.7097 0.000
## 3 NA 0.4444 0.023
## 4 0.6818 NA 0.000
## 5 NA 0.8952 0.000
## 6 NA NA 0.000
## Median_Debt_All Median_Debt_Graduates Debt_90th_Percentile
## 1 NA NA NA
## 2 7199 7377 16291
## 3 6233 NA NA
## 4 12696 23500 36000
## 5 5500 NA 9500
## 6 11884 14176 18812
## Debt_25th_Percentile Count_Working_10yr Mean_Earnings_10yr
## 1 NA NA NA
## 2 5499 554 32000
## 3 3552 36 NA
## 4 5500 160 45300
## 5 2900 174 45600
## 6 6334 2790 46400
## Median_Earnings_10yr Pct_Earning_Over_25k_10yr Count_Working_6yr
## 1 NA NA NA
## 2 34468 0.612 506
## 3 25787 NA 65
## 4 54271 0.719 71
## 5 51083 0.578 146
## 6 35504 0.691 3803
## Mean_Earnings_6yr Median_Earnings_6yr Room_and_Board
## 1 NA NA 7000
## 2 27200 35863 NA
## 3 NA 14240 12150
## 4 37500 47803 8932
## 5 43200 42187 NA
## 6 30500 35427 NA
# Check structure
str(data)
## 'data.frame': 5757 obs. of 38 variables:
## $ College_Name : chr "Alaska Bible College" "Alaska Career College" "Alaska Christian College" "Alaska Pacific University" ...
## $ State : chr "AK" "AK" "AK" "AK" ...
## $ Predominant_Degree : chr "Bachelor" "Certificate" "Associate" "Bachelor" ...
## $ Institution_Type : chr "Private Nonprofit" "Private For-Profit" "Private Nonprofit" "Private Nonprofit" ...
## $ Urban_Rural : chr "Suburb: Small" "City: Large" "Rural: Fringe" "City: Large" ...
## $ Admission_Rate : num NA NA 0.894 0.862 NA ...
## $ Average_SAT : num NA NA NA NA NA NA NA NA NA NA ...
## $ Enrollment_Total : num 34 255 60 400 452 ...
## $ Pct_White : num 0.8529 0.251 0.0833 0.3325 0.4978 ...
## $ Pct_Black : num 0.0294 0.102 0 0.0275 0.0155 0.0839 0.0057 0.0369 0.0335 0.0109 ...
## $ Pct_Hispanic : num 0.0294 0.0824 0 0.09 0.0288 ...
## $ Pct_Asian : num 0 0.149 0 0.07 0.0088 ...
## $ Net_Price_Public : num NA NA NA NA 12071 ...
## $ Net_Price_Private : num 14093 18091 5427 18976 NA ...
## $ Average_Total_Cost : num 22101 NA 22858 34964 NA ...
## $ Tuition_InState : num 10930 NA 9014 20760 NA ...
## $ Tuition_OutState : num 10930 NA 9014 20760 NA ...
## $ Instructional_Spending_Per_Student: num 8702 4279 18917 14628 31891 ...
## $ Avg_Faculty_Salary : num NA 5205 8009 7297 NA ...
## $ Pct_Full_Time_Faculty : num NA 0.0476 NA 0.4775 NA ...
## $ Percent_Pell_Grant : num 0.4872 0.581 0.7763 0.3442 0.0602 ...
## $ Graduation_Rate : num 1 NA NA 0.469 NA ...
## $ Graduation_Rate_Pooled : num 0.611 NA NA 0.412 NA ...
## $ Retention_Rate_FullTime : num 0.333 NA NA 0.682 NA ...
## $ Retention_Rate_PartTime : num NA 0.71 0.444 NA 0.895 ...
## $ Default_Rate_3yr : num 0 0 0.023 0 0 0 NA 0 0 0 ...
## $ Median_Debt_All : num NA 7199 6233 12696 5500 ...
## $ Median_Debt_Graduates : num NA 7377 NA 23500 NA ...
## $ Debt_90th_Percentile : num NA 16291 NA 36000 9500 ...
## $ Debt_25th_Percentile : num NA 5499 3552 5500 2900 ...
## $ Count_Working_10yr : num NA 554 36 160 174 ...
## $ Mean_Earnings_10yr : num NA 32000 NA 45300 45600 46400 NA 51200 45100 39500 ...
## $ Median_Earnings_10yr : num NA 34468 25787 54271 51083 ...
## $ Pct_Earning_Over_25k_10yr : num NA 0.612 NA 0.719 0.578 0.691 0.397 0.735 0.681 0.645 ...
## $ Count_Working_6yr : num NA 506 65 71 146 ...
## $ Mean_Earnings_6yr : num NA 27200 NA 37500 43200 30500 NA 41500 36500 35200 ...
## $ Median_Earnings_6yr : num NA 35863 14240 47803 42187 ...
## $ Room_and_Board : num 7000 NA 12150 8932 NA ...
Group Discussion:
- What is your response variable? –> Tuition Cost
- Is it continuous (use Section A) or binary (use Section B)? –> Section A (continuous)
- What predictors are you considering? –> Institutional type (private vs. public), college ranking, and geographic location
- Are all variables in the correct format (numeric, factor, etc.)? –> Yes
Check here which section you’ll use:
Use this section if your response variable is continuous (numeric).
Skip to Section B if you have a binary response variable.
Objective: Start with a simple model using your most important predictor.
# Fit a simple linear regression
# Replace 'response' and 'predictor1' with your actual variable names
avg_total_cost <- data$Average_Total_Cost
inst_type <- data$Institution_Type
model1 <- lm(avg_total_cost ~ inst_type, data = data)
# View summary
summary(model1)
##
## Call:
## lm(formula = avg_total_cost ~ inst_type, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35207 -6246 -1165 6470 49039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30743.1 719.1 42.75 <2e-16 ***
## inst_typePrivate Nonprofit 14211.9 809.5 17.56 <2e-16 ***
## inst_typePublic -12650.1 794.6 -15.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13390 on 3212 degrees of freedom
## (2542 observations deleted due to missingness)
## Multiple R-squared: 0.4708, Adjusted R-squared: 0.4705
## F-statistic: 1429 on 2 and 3212 DF, p-value: < 2.2e-16
Questions to answer:
Your answer: 0.4705
How do you interpret the slope coefficient?
Your answer: Compared to the reference institution type, Private Nonprofit institutions charge about $14,212 more on average, while Public institutions charge about $12,650 less on average.
Visualize the model:
# Scatterplot with regression line
ggplot(data, aes(x = inst_type, y = avg_total_cost)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
labs(title = "Model 1: Simple Linear Regression",
x = "Institution Type",
y = "Average Total Cost") +
theme_minimal()
Objective: Add more predictors to improve the model.
# Fit a multiple regression model with main effects only
avg_faculty_salary <- data$Avg_Faculty_Salary
grad_rate <- data$Graduation_Rate
model2 <- lm(avg_total_cost ~ inst_type + avg_faculty_salary + grad_rate, data = data)
# View summary
summary(model2)
##
## Call:
## lm(formula = avg_total_cost ~ inst_type + avg_faculty_salary +
## grad_rate, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48416 -5914 670 6664 44315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.030e+03 1.045e+03 4.814 1.58e-06 ***
## inst_typePrivate Nonprofit 6.739e+03 9.221e+02 7.308 3.84e-13 ***
## inst_typePublic -2.004e+04 9.743e+02 -20.564 < 2e-16 ***
## avg_faculty_salary 2.847e+00 9.567e-02 29.761 < 2e-16 ***
## grad_rate 2.197e+04 1.357e+03 16.188 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10410 on 2109 degrees of freedom
## (3643 observations deleted due to missingness)
## Multiple R-squared: 0.703, Adjusted R-squared: 0.7024
## F-statistic: 1248 on 4 and 2109 DF, p-value: < 2.2e-16
Questions to answer:
What is the R² value? How does it compare to Model 1?
What is the Adjusted R² value? Why is this important?
Objective: Test if relationships between variables depend on each other.
# Fit a model with an interaction term
model3 <- lm(avg_total_cost ~ inst_type * avg_faculty_salary + grad_rate, data = data)
# View summary
summary(model3)
##
## Call:
## lm(formula = avg_total_cost ~ inst_type * avg_faculty_salary +
## grad_rate, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54139 -5116 196 5524 45009
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 12974.2458 2941.0499 4.411
## inst_typePrivate Nonprofit -8516.9397 2959.9454 -2.877
## inst_typePublic -5887.0932 3226.6390 -1.825
## avg_faculty_salary 1.6334 0.4182 3.905
## grad_rate 21865.2171 1273.5069 17.169
## inst_typePrivate Nonprofit:avg_faculty_salary 2.1604 0.4348 4.969
## inst_typePublic:avg_faculty_salary -1.1583 0.4503 -2.573
## Pr(>|t|)
## (Intercept) 1.08e-05 ***
## inst_typePrivate Nonprofit 0.00405 **
## inst_typePublic 0.06821 .
## avg_faculty_salary 9.70e-05 ***
## grad_rate < 2e-16 ***
## inst_typePrivate Nonprofit:avg_faculty_salary 7.27e-07 ***
## inst_typePublic:avg_faculty_salary 0.01016 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9553 on 2107 degrees of freedom
## (3643 observations deleted due to missingness)
## Multiple R-squared: 0.7502, Adjusted R-squared: 0.7495
## F-statistic: 1055 on 6 and 2107 DF, p-value: < 2.2e-16
Visualize the interaction:
# Create interaction plot
data %>%
mutate(avg_faculty_salary_group = cut(avg_faculty_salary, breaks = 3,
labels = c("Low", "Medium", "High"))) %>%
ggplot(aes(x = inst_type, y = avg_total_cost, color = avg_faculty_salary_group)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Interaction Effect") +
theme_minimal()
Objective: Test if additional model can be used
# Fit an additional model
location_type <- data$Urban_Rural
model4 <- lm(avg_total_cost ~ inst_type * grad_rate + avg_faculty_salary + location_type , data = data)
# View summary
summary(model4)
##
## Call:
## lm(formula = avg_total_cost ~ inst_type * grad_rate + avg_faculty_salary +
## location_type, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51436 -4837 497 5765 45145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.091e+04 7.174e+03 2.915 0.00360 **
## inst_typePrivate Nonprofit -4.578e+03 2.029e+03 -2.257 0.02414 *
## inst_typePublic -1.507e+04 2.055e+03 -7.334 3.18e-13 ***
## grad_rate 1.547e+04 3.526e+03 4.386 1.21e-05 ***
## avg_faculty_salary 2.860e+00 9.642e-02 29.662 < 2e-16 ***
## location_typeCity: Large -1.389e+04 6.976e+03 -1.992 0.04655 *
## location_typeCity: Midsize -1.260e+04 6.985e+03 -1.804 0.07136 .
## location_typeCity: Small -1.268e+04 6.976e+03 -1.818 0.06926 .
## location_typeRural: Distant -8.568e+03 7.062e+03 -1.213 0.22514
## location_typeRural: Fringe -1.041e+04 7.022e+03 -1.482 0.13839
## location_typeRural: Remote -8.605e+03 7.096e+03 -1.213 0.22540
## location_typeSuburb: Large -1.274e+04 6.975e+03 -1.827 0.06790 .
## location_typeSuburb: Midsize -1.255e+04 7.056e+03 -1.779 0.07546 .
## location_typeSuburb: Small -1.376e+04 7.129e+03 -1.931 0.05363 .
## location_typeTown: Distant -8.708e+03 6.985e+03 -1.247 0.21268
## location_typeTown: Fringe -9.075e+03 7.071e+03 -1.283 0.19948
## location_typeTown: Remote -7.925e+03 7.004e+03 -1.132 0.25795
## inst_typePrivate Nonprofit:grad_rate 1.973e+04 3.939e+03 5.010 5.90e-07 ***
## inst_typePublic:grad_rate -1.361e+04 4.107e+03 -3.313 0.00094 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9779 on 2095 degrees of freedom
## (3643 observations deleted due to missingness)
## Multiple R-squared: 0.7398, Adjusted R-squared: 0.7375
## F-statistic: 330.9 on 18 and 2095 DF, p-value: < 2.2e-16
Visualize the model:
# Create interaction plot
data %>%
mutate(grad_rate_group = cut(grad_rate, breaks = 3,
labels = c("Low", "Medium", "High"))) %>%
ggplot(aes(x = inst_type, y = avg_total_cost, color = grad_rate_group)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Interaction Effect") +
theme_minimal()
# Create comparison using broom::glance
library(broom)
bind_rows(
glance(model1) %>% mutate(model = "Model 1"),
glance(model2) %>% mutate(model = "Model 2"),
glance(model3) %>% mutate(model = "Model 3"),
glance(model4) %>% mutate(model = "Model 4")
) %>%
select(model, r.squared, adj.r.squared, AIC)
## # A tibble: 4 × 4
## model r.squared adj.r.squared AIC
## <chr> <dbl> <dbl> <dbl>
## 1 Model 1 0.471 0.471 70231.
## 2 Model 2 0.703 0.702 45119.
## 3 Model 3 0.750 0.750 44756.
## 4 Model 4 0.740 0.738 44867.
Model Selection Table:
| Criterion | Best Model | Value |
|---|---|---|
| Highest Adj. R² | Model 3 | 0.7495 |
| R² | Model 3 | 0.7502 |
| AIC | Model 3 | 44756.40 |
Your chosen model and justification: We’re choosing Model 3 since it has the Highest R-Squared and Adjusted R-Squared, and the lowest AIC among the four models.
The four key assumptions: 1. Linearity 2. Independence 3. Normality of residuals 4. Equal Variance
# Create all 4 diagnostic plots
par(mfrow = c(2, 2))
plot(model3) # Replace with your chosen model
par(mfrow = c(1, 1))
plot(model3, which = 1)
# Or using ggplot2
augment(model3) %>%
ggplot(aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
geom_smooth(se = TRUE, color = "blue") +
labs(title = "Residuals vs Fitted Values") +
theme_minimal()
What to look for:
- Good: Random scatter around zero
- Bad: Curved pattern (non-linearity), funnel shape (unequal variance)
Your observations: The residuals are not completely randomly scattered around zero. There is slight curvature and increasing spread at higher fitted values, suggesting potential non-linearity.
plot(model3, which = 2)
Your observations: The Q-Q plot shows slight deviations from normality in the tails, indicating the presence of some outliers, but the central portion follows the line fairly well.
# Add polynomial term
model_poly <- lm(avg_total_cost ~ inst_type * avg_faculty_salary + I(avg_faculty_salary^2) + grad_rate,data = data)
# Or log transformation
model_log <- lm(log(avg_total_cost) ~ inst_type * avg_faculty_salary + grad_rate, data = data)
# Log transformation often helps
model_log_y <- lm(log(avg_total_cost) ~ inst_type * avg_faculty_salary + grad_rate, data = data)
library(broom)
bind_rows(
glance(model1) %>% mutate(model = "Model 1"),
glance(model2) %>% mutate(model = "Model 2"),
glance(model3) %>% mutate(model = "Model 3"),
glance(model4) %>% mutate(model = "Model 4"),
glance(model_poly) %>% mutate(model = "Polynomial Model"),
glance(model_log) %>% mutate(model = "Log Model"),
glance(model_log_y) %>% mutate(model = "Log Transformation Model")
) %>%
select(model, r.squared, adj.r.squared, AIC)
## # A tibble: 7 × 4
## model r.squared adj.r.squared AIC
## <chr> <dbl> <dbl> <dbl>
## 1 Model 1 0.471 0.471 70231.
## 2 Model 2 0.703 0.702 45119.
## 3 Model 3 0.750 0.750 44756.
## 4 Model 4 0.740 0.738 44867.
## 5 Polynomial Model 0.766 0.765 44619.
## 6 Log Model 0.680 0.679 949.
## 7 Log Transformation Model 0.680 0.679 949.
# Model Selection: The polynomial model is selected as the final model because it has the highest adjusted R² and the lowest AIC among other models.
# Your final model
final_model <- lm(avg_total_cost ~ inst_type * avg_faculty_salary + I(avg_faculty_salary^2) + grad_rate,data = data)
summary(final_model)
##
## Call:
## lm(formula = avg_total_cost ~ inst_type * avg_faculty_salary +
## I(avg_faculty_salary^2) + grad_rate, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53939 -4969 57 5379 44053
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 7.793e+03 2.879e+03 2.707
## inst_typePrivate Nonprofit -1.555e+04 2.924e+03 -5.316
## inst_typePublic -1.595e+04 3.234e+03 -4.932
## avg_faculty_salary 3.483e+00 4.333e-01 8.038
## I(avg_faculty_salary^2) -1.467e-04 1.226e-05 -11.967
## grad_rate 2.161e+04 1.233e+03 17.532
## inst_typePrivate Nonprofit:avg_faculty_salary 3.226e+00 4.301e-01 7.499
## inst_typePublic:avg_faculty_salary 9.974e-02 4.483e-01 0.222
## Pr(>|t|)
## (Intercept) 0.00685 **
## inst_typePrivate Nonprofit 1.17e-07 ***
## inst_typePublic 8.78e-07 ***
## avg_faculty_salary 1.50e-15 ***
## I(avg_faculty_salary^2) < 2e-16 ***
## grad_rate < 2e-16 ***
## inst_typePrivate Nonprofit:avg_faculty_salary 9.39e-14 ***
## inst_typePublic:avg_faculty_salary 0.82395
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9246 on 2106 degrees of freedom
## (3643 observations deleted due to missingness)
## Multiple R-squared: 0.7661, Adjusted R-squared: 0.7653
## F-statistic: 985.5 on 7 and 2106 DF, p-value: < 2.2e-16
Model equation:
\[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...\]
Your equation:
\[ \begin{aligned} \widehat{AverageTotalCost} \;=\;& 7793 - 15550(\text{PrivateNonProfit}) - 15950(\text{Public}) \\ &+ 3.483(\text{AverageFacultySalary}) - 0.0001467(\text{AverageFacultySalary})^2 \\ &+ 21610(\text{GraduationRate}) + 3.226(\text{PrivateNonProfit} \cdot \text{AverageFacultySalary})\\ &+ 0.0997(\text{Public} \cdot \text{AverageFacultySalary}) \\ \end{aligned} \]
Performance:
- R² = 0.7661
- Adjusted R² = 0.7653
- RMSE = 9246
- AIC = 44619.321
Diagnostics checklist: