Overview

Today’s Focus: Fitting regression models, selecting the best model, and validating model assumptions.

Important: This worksheet covers BOTH linear regression and logistic regression. Choose the section that matches your project’s response variable:

Section A (Linear Regression): For continuous/numeric response variables (e.g., price, temperature, age)
Section B (Logistic Regression): For binary response variables (e.g., yes/no, success/failure, 0/1)

What you should have ready: - Your cleaned dataset - Summary statistics and EDA from Week 2 - Initial ideas about which variables to include

By the end of today, you should have: - Fitted at least 2-3 regression models - Selected your best model with justification - Checked model diagnostics - Identified any problems and potential solutions

Part 0: Load Your Data

# Load your dataset here
setwd("/Users/yejuseol/Desktop/BU/2026 Spring/MA214/Project 1")
data <- read.csv("cleaned_college_data.csv")

# Display the first few rows
head(data)

##                         College_Name State Predominant_Degree
## 1               Alaska Bible College    AK           Bachelor
## 2              Alaska Career College    AK        Certificate
## 3           Alaska Christian College    AK          Associate
## 4          Alaska Pacific University    AK           Bachelor
## 5 Alaska Vocational Technical Center    AK        Certificate
## 6                    Charter College    AK        Certificate
##     Institution_Type   Urban_Rural Admission_Rate Average_SAT Enrollment_Total
## 1  Private Nonprofit Suburb: Small             NA          NA               34
## 2 Private For-Profit   City: Large             NA          NA              255
## 3  Private Nonprofit Rural: Fringe         0.8936          NA               60
## 4  Private Nonprofit   City: Large         0.8621          NA              400
## 5             Public Rural: Remote             NA          NA              452
## 6 Private For-Profit   City: Large             NA          NA             2277
##   Pct_White Pct_Black Pct_Hispanic Pct_Asian Net_Price_Public Net_Price_Private
## 1    0.8529    0.0294       0.0294    0.0000               NA             14093
## 2    0.2510    0.1020       0.0824    0.1490               NA             18091
## 3    0.0833    0.0000       0.0000    0.0000               NA              5427
## 4    0.3325    0.0275       0.0900    0.0700               NA             18976
## 5    0.4978    0.0155       0.0288    0.0088            12071                NA
## 6    0.3808    0.0839       0.2306    0.0413               NA             25225
##   Average_Total_Cost Tuition_InState Tuition_OutState
## 1              22101           10930            10930
## 2                 NA              NA               NA
## 3              22858            9014             9014
## 4              34964           20760            20760
## 5                 NA              NA               NA
## 6              31188           18678            18678
##   Instructional_Spending_Per_Student Avg_Faculty_Salary Pct_Full_Time_Faculty
## 1                               8702                 NA                    NA
## 2                               4279               5205                0.0476
## 3                              18917               8009                    NA
## 4                              14628               7297                0.4775
## 5                              31891                 NA                    NA
## 6                               2637               5298                0.2821
##   Percent_Pell_Grant Graduation_Rate Graduation_Rate_Pooled
## 1             0.4872          1.0000                 0.6111
## 2             0.5810              NA                     NA
## 3             0.7763              NA                     NA
## 4             0.3442          0.4688                 0.4118
## 5             0.0602              NA                     NA
## 6             0.5579          0.5753                 0.5621
##   Retention_Rate_FullTime Retention_Rate_PartTime Default_Rate_3yr
## 1                  0.3333                      NA            0.000
## 2                      NA                  0.7097            0.000
## 3                      NA                  0.4444            0.023
## 4                  0.6818                      NA            0.000
## 5                      NA                  0.8952            0.000
## 6                      NA                      NA            0.000
##   Median_Debt_All Median_Debt_Graduates Debt_90th_Percentile
## 1              NA                    NA                   NA
## 2            7199                  7377                16291
## 3            6233                    NA                   NA
## 4           12696                 23500                36000
## 5            5500                    NA                 9500
## 6           11884                 14176                18812
##   Debt_25th_Percentile Count_Working_10yr Mean_Earnings_10yr
## 1                   NA                 NA                 NA
## 2                 5499                554              32000
## 3                 3552                 36                 NA
## 4                 5500                160              45300
## 5                 2900                174              45600
## 6                 6334               2790              46400
##   Median_Earnings_10yr Pct_Earning_Over_25k_10yr Count_Working_6yr
## 1                   NA                        NA                NA
## 2                34468                     0.612               506
## 3                25787                        NA                65
## 4                54271                     0.719                71
## 5                51083                     0.578               146
## 6                35504                     0.691              3803
##   Mean_Earnings_6yr Median_Earnings_6yr Room_and_Board
## 1                NA                  NA           7000
## 2             27200               35863             NA
## 3                NA               14240          12150
## 4             37500               47803           8932
## 5             43200               42187             NA
## 6             30500               35427             NA

# Check structure
str(data)

## 'data.frame':    5757 obs. of  38 variables:
##  $ College_Name                      : chr  "Alaska Bible College" "Alaska Career College" "Alaska Christian College" "Alaska Pacific University" ...
##  $ State                             : chr  "AK" "AK" "AK" "AK" ...
##  $ Predominant_Degree                : chr  "Bachelor" "Certificate" "Associate" "Bachelor" ...
##  $ Institution_Type                  : chr  "Private Nonprofit" "Private For-Profit" "Private Nonprofit" "Private Nonprofit" ...
##  $ Urban_Rural                       : chr  "Suburb: Small" "City: Large" "Rural: Fringe" "City: Large" ...
##  $ Admission_Rate                    : num  NA NA 0.894 0.862 NA ...
##  $ Average_SAT                       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Enrollment_Total                  : num  34 255 60 400 452 ...
##  $ Pct_White                         : num  0.8529 0.251 0.0833 0.3325 0.4978 ...
##  $ Pct_Black                         : num  0.0294 0.102 0 0.0275 0.0155 0.0839 0.0057 0.0369 0.0335 0.0109 ...
##  $ Pct_Hispanic                      : num  0.0294 0.0824 0 0.09 0.0288 ...
##  $ Pct_Asian                         : num  0 0.149 0 0.07 0.0088 ...
##  $ Net_Price_Public                  : num  NA NA NA NA 12071 ...
##  $ Net_Price_Private                 : num  14093 18091 5427 18976 NA ...
##  $ Average_Total_Cost                : num  22101 NA 22858 34964 NA ...
##  $ Tuition_InState                   : num  10930 NA 9014 20760 NA ...
##  $ Tuition_OutState                  : num  10930 NA 9014 20760 NA ...
##  $ Instructional_Spending_Per_Student: num  8702 4279 18917 14628 31891 ...
##  $ Avg_Faculty_Salary                : num  NA 5205 8009 7297 NA ...
##  $ Pct_Full_Time_Faculty             : num  NA 0.0476 NA 0.4775 NA ...
##  $ Percent_Pell_Grant                : num  0.4872 0.581 0.7763 0.3442 0.0602 ...
##  $ Graduation_Rate                   : num  1 NA NA 0.469 NA ...
##  $ Graduation_Rate_Pooled            : num  0.611 NA NA 0.412 NA ...
##  $ Retention_Rate_FullTime           : num  0.333 NA NA 0.682 NA ...
##  $ Retention_Rate_PartTime           : num  NA 0.71 0.444 NA 0.895 ...
##  $ Default_Rate_3yr                  : num  0 0 0.023 0 0 0 NA 0 0 0 ...
##  $ Median_Debt_All                   : num  NA 7199 6233 12696 5500 ...
##  $ Median_Debt_Graduates             : num  NA 7377 NA 23500 NA ...
##  $ Debt_90th_Percentile              : num  NA 16291 NA 36000 9500 ...
##  $ Debt_25th_Percentile              : num  NA 5499 3552 5500 2900 ...
##  $ Count_Working_10yr                : num  NA 554 36 160 174 ...
##  $ Mean_Earnings_10yr                : num  NA 32000 NA 45300 45600 46400 NA 51200 45100 39500 ...
##  $ Median_Earnings_10yr              : num  NA 34468 25787 54271 51083 ...
##  $ Pct_Earning_Over_25k_10yr         : num  NA 0.612 NA 0.719 0.578 0.691 0.397 0.735 0.681 0.645 ...
##  $ Count_Working_6yr                 : num  NA 506 65 71 146 ...
##  $ Mean_Earnings_6yr                 : num  NA 27200 NA 37500 43200 30500 NA 41500 36500 35200 ...
##  $ Median_Earnings_6yr               : num  NA 35863 14240 47803 42187 ...
##  $ Room_and_Board                    : num  7000 NA 12150 8932 NA ...

Group Discussion:

- What is your response variable? –> Tuition Cost

- Is it continuous (use Section A) or binary (use Section B)? –> Section A (continuous)

- What predictors are you considering? –> Institutional type (private vs. public), college ranking, and geographic location

- Are all variables in the correct format (numeric, factor, etc.)? –> Yes

Check here which section you’ll use:

Section A: Linear Regression
Section B: Logistic Regression

SECTION A: LINEAR REGRESSION

Use this section if your response variable is continuous (numeric).

Skip to Section B if you have a binary response variable.

A1: Fitting Linear Regression Models

Model 1: Simple Linear Regression

Objective: Start with a simple model using your most important predictor.

# Fit a simple linear regression
# Replace 'response' and 'predictor1' with your actual variable names
avg_total_cost <- data$Average_Total_Cost
inst_type <- data$Institution_Type

model1 <- lm(avg_total_cost ~ inst_type, data = data)

# View summary
summary(model1)

## 
## Call:
## lm(formula = avg_total_cost ~ inst_type, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -35207  -6246  -1165   6470  49039 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 30743.1      719.1   42.75   <2e-16 ***
## inst_typePrivate Nonprofit  14211.9      809.5   17.56   <2e-16 ***
## inst_typePublic            -12650.1      794.6  -15.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13390 on 3212 degrees of freedom
##   (2542 observations deleted due to missingness)
## Multiple R-squared:  0.4708, Adjusted R-squared:  0.4705 
## F-statistic:  1429 on 2 and 3212 DF,  p-value: < 2.2e-16

Questions to answer:

What is the adj R² value? What does it mean?

Your answer: 0.4705

How do you interpret the slope coefficient?

Your answer: Compared to the reference institution type, Private Nonprofit institutions charge about $14,212 more on average, while Public institutions charge about $12,650 less on average.

Visualize the model:

# Scatterplot with regression line
ggplot(data, aes(x = inst_type, y = avg_total_cost)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  labs(title = "Model 1: Simple Linear Regression",
       x = "Institution Type",
       y = "Average Total Cost") +
  theme_minimal()

Model 2: Multiple Linear Regression (Main Effects)

Objective: Add more predictors to improve the model.

# Fit a multiple regression model with main effects only
avg_faculty_salary <- data$Avg_Faculty_Salary
grad_rate <- data$Graduation_Rate
model2 <- lm(avg_total_cost ~ inst_type + avg_faculty_salary + grad_rate, data = data)

# View summary
summary(model2)

## 
## Call:
## lm(formula = avg_total_cost ~ inst_type + avg_faculty_salary + 
##     grad_rate, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -48416  -5914    670   6664  44315 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 5.030e+03  1.045e+03   4.814 1.58e-06 ***
## inst_typePrivate Nonprofit  6.739e+03  9.221e+02   7.308 3.84e-13 ***
## inst_typePublic            -2.004e+04  9.743e+02 -20.564  < 2e-16 ***
## avg_faculty_salary          2.847e+00  9.567e-02  29.761  < 2e-16 ***
## grad_rate                   2.197e+04  1.357e+03  16.188  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10410 on 2109 degrees of freedom
##   (3643 observations deleted due to missingness)
## Multiple R-squared:  0.703,  Adjusted R-squared:  0.7024 
## F-statistic:  1248 on 4 and 2109 DF,  p-value: < 2.2e-16

Questions to answer:

What is the R² value? How does it compare to Model 1?
- Your answer: 0.703 (Significantly higher than Model 1)
What is the Adjusted R² value? Why is this important?
- Your answer: 0.7024, after accounting for the number of predictors, the model explains about 70.24% of the variation in average total cost and provides a more reliable measure.

Model 3: Model with Interactions

Objective: Test if relationships between variables depend on each other.

# Fit a model with an interaction term
model3 <- lm(avg_total_cost ~ inst_type * avg_faculty_salary + grad_rate, data = data)

# View summary
summary(model3)

## 
## Call:
## lm(formula = avg_total_cost ~ inst_type * avg_faculty_salary + 
##     grad_rate, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54139  -5116    196   5524  45009 
## 
## Coefficients:
##                                                 Estimate Std. Error t value
## (Intercept)                                   12974.2458  2941.0499   4.411
## inst_typePrivate Nonprofit                    -8516.9397  2959.9454  -2.877
## inst_typePublic                               -5887.0932  3226.6390  -1.825
## avg_faculty_salary                                1.6334     0.4182   3.905
## grad_rate                                     21865.2171  1273.5069  17.169
## inst_typePrivate Nonprofit:avg_faculty_salary     2.1604     0.4348   4.969
## inst_typePublic:avg_faculty_salary               -1.1583     0.4503  -2.573
##                                               Pr(>|t|)    
## (Intercept)                                   1.08e-05 ***
## inst_typePrivate Nonprofit                     0.00405 ** 
## inst_typePublic                                0.06821 .  
## avg_faculty_salary                            9.70e-05 ***
## grad_rate                                      < 2e-16 ***
## inst_typePrivate Nonprofit:avg_faculty_salary 7.27e-07 ***
## inst_typePublic:avg_faculty_salary             0.01016 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9553 on 2107 degrees of freedom
##   (3643 observations deleted due to missingness)
## Multiple R-squared:  0.7502, Adjusted R-squared:  0.7495 
## F-statistic:  1055 on 6 and 2107 DF,  p-value: < 2.2e-16

Visualize the interaction:

# Create interaction plot
data %>%
  mutate(avg_faculty_salary_group = cut(avg_faculty_salary, breaks = 3, 
                                 labels = c("Low", "Medium", "High"))) %>%
  ggplot(aes(x = inst_type, y = avg_total_cost, color = avg_faculty_salary_group)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Interaction Effect") +
  theme_minimal()

Model 4: Additional model to compare

Objective: Test if additional model can be used

# Fit an additional model 
location_type <- data$Urban_Rural

model4 <- lm(avg_total_cost ~ inst_type * grad_rate + avg_faculty_salary + location_type , data = data)

# View summary
summary(model4)

## 
## Call:
## lm(formula = avg_total_cost ~ inst_type * grad_rate + avg_faculty_salary + 
##     location_type, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51436  -4837    497   5765  45145 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           2.091e+04  7.174e+03   2.915  0.00360 ** 
## inst_typePrivate Nonprofit           -4.578e+03  2.029e+03  -2.257  0.02414 *  
## inst_typePublic                      -1.507e+04  2.055e+03  -7.334 3.18e-13 ***
## grad_rate                             1.547e+04  3.526e+03   4.386 1.21e-05 ***
## avg_faculty_salary                    2.860e+00  9.642e-02  29.662  < 2e-16 ***
## location_typeCity: Large             -1.389e+04  6.976e+03  -1.992  0.04655 *  
## location_typeCity: Midsize           -1.260e+04  6.985e+03  -1.804  0.07136 .  
## location_typeCity: Small             -1.268e+04  6.976e+03  -1.818  0.06926 .  
## location_typeRural: Distant          -8.568e+03  7.062e+03  -1.213  0.22514    
## location_typeRural: Fringe           -1.041e+04  7.022e+03  -1.482  0.13839    
## location_typeRural: Remote           -8.605e+03  7.096e+03  -1.213  0.22540    
## location_typeSuburb: Large           -1.274e+04  6.975e+03  -1.827  0.06790 .  
## location_typeSuburb: Midsize         -1.255e+04  7.056e+03  -1.779  0.07546 .  
## location_typeSuburb: Small           -1.376e+04  7.129e+03  -1.931  0.05363 .  
## location_typeTown: Distant           -8.708e+03  6.985e+03  -1.247  0.21268    
## location_typeTown: Fringe            -9.075e+03  7.071e+03  -1.283  0.19948    
## location_typeTown: Remote            -7.925e+03  7.004e+03  -1.132  0.25795    
## inst_typePrivate Nonprofit:grad_rate  1.973e+04  3.939e+03   5.010 5.90e-07 ***
## inst_typePublic:grad_rate            -1.361e+04  4.107e+03  -3.313  0.00094 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9779 on 2095 degrees of freedom
##   (3643 observations deleted due to missingness)
## Multiple R-squared:  0.7398, Adjusted R-squared:  0.7375 
## F-statistic: 330.9 on 18 and 2095 DF,  p-value: < 2.2e-16

Visualize the model:

# Create interaction plot
data %>%
  mutate(grad_rate_group = cut(grad_rate, breaks = 3, 
                                 labels = c("Low", "Medium", "High"))) %>%
  ggplot(aes(x = inst_type, y = avg_total_cost, color = grad_rate_group)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Interaction Effect") +
  theme_minimal()

A2: Model Selection (Linear)

# Create comparison using broom::glance
library(broom)
bind_rows(
  glance(model1) %>% mutate(model = "Model 1"),
  glance(model2) %>% mutate(model = "Model 2"),
  glance(model3) %>% mutate(model = "Model 3"),
  glance(model4) %>% mutate(model = "Model 4")
) %>%
  select(model, r.squared, adj.r.squared, AIC)

## # A tibble: 4 × 4
##   model   r.squared adj.r.squared    AIC
##   <chr>       <dbl>         <dbl>  <dbl>
## 1 Model 1     0.471         0.471 70231.
## 2 Model 2     0.703         0.702 45119.
## 3 Model 3     0.750         0.750 44756.
## 4 Model 4     0.740         0.738 44867.

Model Selection Table:

Criterion	Best Model	Value
Highest Adj. R²	Model 3	0.7495
R²	Model 3	0.7502
AIC	Model 3	44756.40

Your chosen model and justification: We’re choosing Model 3 since it has the Highest R-Squared and Adjusted R-Squared, and the lowest AIC among the four models.

A3: Model Diagnostics (Linear)

The four key assumptions: 1. Linearity 2. Independence 3. Normality of residuals 4. Equal Variance

All diagnostic plots at once:

# Create all 4 diagnostic plots
par(mfrow = c(2, 2))
plot(model3)  # Replace with your chosen model

par(mfrow = c(1, 1))

Diagnostic Plot 1: Residuals vs Fitted

plot(model3, which = 1)

# Or using ggplot2
augment(model3) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  geom_smooth(se = TRUE, color = "blue") +
  labs(title = "Residuals vs Fitted Values") +
  theme_minimal()

What to look for:

- Good: Random scatter around zero

- Bad: Curved pattern (non-linearity), funnel shape (unequal variance)

Your observations: The residuals are not completely randomly scattered around zero. There is slight curvature and increasing spread at higher fitted values, suggesting potential non-linearity.

Diagnostic Plot 2: Normal Q-Q Plot

plot(model3, which = 2)

Your observations: The Q-Q plot shows slight deviations from normality in the tails, indicating the presence of some outliers, but the central portion follows the line fairly well.

A4: Addressing Issues (Linear)

If you found non-linearity:

# Add polynomial term
model_poly <- lm(avg_total_cost ~ inst_type * avg_faculty_salary + I(avg_faculty_salary^2) + grad_rate,data = data)

# Or log transformation
model_log <- lm(log(avg_total_cost) ~ inst_type * avg_faculty_salary + grad_rate, data = data)

If you found unequal variance:

# Log transformation often helps
model_log_y <- lm(log(avg_total_cost) ~ inst_type * avg_faculty_salary + grad_rate, data = data)

library(broom)
bind_rows(
  glance(model1) %>% mutate(model = "Model 1"),
  glance(model2) %>% mutate(model = "Model 2"),
  glance(model3) %>% mutate(model = "Model 3"),
  glance(model4) %>% mutate(model = "Model 4"),
  glance(model_poly) %>% mutate(model = "Polynomial Model"),
  glance(model_log) %>% mutate(model = "Log Model"),
  glance(model_log_y) %>% mutate(model = "Log Transformation Model")
) %>%
  select(model, r.squared, adj.r.squared, AIC)

## # A tibble: 7 × 4
##   model                    r.squared adj.r.squared    AIC
##   <chr>                        <dbl>         <dbl>  <dbl>
## 1 Model 1                      0.471         0.471 70231.
## 2 Model 2                      0.703         0.702 45119.
## 3 Model 3                      0.750         0.750 44756.
## 4 Model 4                      0.740         0.738 44867.
## 5 Polynomial Model             0.766         0.765 44619.
## 6 Log Model                    0.680         0.679   949.
## 7 Log Transformation Model     0.680         0.679   949.

# Model Selection: The polynomial model is selected as the final model because it has the highest adjusted R² and the lowest AIC among other models.

A5: Final Linear Model

# Your final model
final_model <- lm(avg_total_cost ~ inst_type * avg_faculty_salary + I(avg_faculty_salary^2) + grad_rate,data = data)

summary(final_model)

## 
## Call:
## lm(formula = avg_total_cost ~ inst_type * avg_faculty_salary + 
##     I(avg_faculty_salary^2) + grad_rate, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -53939  -4969     57   5379  44053 
## 
## Coefficients:
##                                                 Estimate Std. Error t value
## (Intercept)                                    7.793e+03  2.879e+03   2.707
## inst_typePrivate Nonprofit                    -1.555e+04  2.924e+03  -5.316
## inst_typePublic                               -1.595e+04  3.234e+03  -4.932
## avg_faculty_salary                             3.483e+00  4.333e-01   8.038
## I(avg_faculty_salary^2)                       -1.467e-04  1.226e-05 -11.967
## grad_rate                                      2.161e+04  1.233e+03  17.532
## inst_typePrivate Nonprofit:avg_faculty_salary  3.226e+00  4.301e-01   7.499
## inst_typePublic:avg_faculty_salary             9.974e-02  4.483e-01   0.222
##                                               Pr(>|t|)    
## (Intercept)                                    0.00685 ** 
## inst_typePrivate Nonprofit                    1.17e-07 ***
## inst_typePublic                               8.78e-07 ***
## avg_faculty_salary                            1.50e-15 ***
## I(avg_faculty_salary^2)                        < 2e-16 ***
## grad_rate                                      < 2e-16 ***
## inst_typePrivate Nonprofit:avg_faculty_salary 9.39e-14 ***
## inst_typePublic:avg_faculty_salary             0.82395    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9246 on 2106 degrees of freedom
##   (3643 observations deleted due to missingness)
## Multiple R-squared:  0.7661, Adjusted R-squared:  0.7653 
## F-statistic: 985.5 on 7 and 2106 DF,  p-value: < 2.2e-16

Model equation:

\[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...\]

Your equation:

\[ \begin{aligned} \widehat{AverageTotalCost} \;=\;& 7793 - 15550(\text{PrivateNonProfit}) - 15950(\text{Public}) \\ &+ 3.483(\text{AverageFacultySalary}) - 0.0001467(\text{AverageFacultySalary})^2 \\ &+ 21610(\text{GraduationRate}) + 3.226(\text{PrivateNonProfit} \cdot \text{AverageFacultySalary})\\ &+ 0.0997(\text{Public} \cdot \text{AverageFacultySalary}) \\ \end{aligned} \]

Performance:

- R² = 0.7661

- Adjusted R² = 0.7653

- RMSE = 9246

- AIC = 44619.321

Diagnostics checklist:

Linearity
Normality
Independence
Equal variance –> has been improved but heteroscedasticity may still remain

MA214 Applied Statistics - Project 1 Week 3 Group 1

Modeling, Model Selection, and Diagnostics

Section C1 | Group 1

2026-02-19