Importing the libraries
Import the data
Exploratory Data Analysis
- ggpairs
Objective 1: Describe probability as a foundation of statistical modeling, including inference and maximum likelihood estimation
- Split the data into training and testing sets (80-20 split)
- Linear Regression – Predict Out-of-State Tuition
Objective 2: Apply the appropriate generalized linear model for a specific data context
- Polynomial Regression – Predict Out-of-State Tuition
Objective 3: Demonstrate model selection given a set of candidate models
Objective 4: Express the results of statistical models to a general audience
Objective 5: Use programming software to fit and assess statistical models

Importing the libraries

library(readxl)
library(GGally)
library(readxl)
library(dplyr)
library(ggplot2)
library(tidyr)
library(ggcorrplot)
library(forcats)
library(tidymodels)
library(janitor)

Import the data

Final_Data <- read_excel("Final_Data.xlsx")
head(Final_Data)

Exploratory Data Analysis

glimpse(Final_Data)

## Rows: 100
## Columns: 20
## $ `University ID`                            <chr> "U001", "U002", "U003", "U0…
## $ `University Name`                          <chr> "Northern Michigan Universi…
## $ Department                                 <chr> "Sciences", "Education", "E…
## $ `In-State Tuition (USD)`                   <dbl> 12441, 10841, 13509, 12474,…
## $ `Out-of-State Tuition (USD)`               <dbl> 45209, 21394, 48355, 19813,…
## $ `Offers In-State Tuition to DACA Students` <chr> "No", "No", "No", "No", "Ye…
## $ `Tuition Equity Policy`                    <chr> "Yes", "No", "Yes", "Partia…
## $ `Year Policy Adopted`                      <dbl> NA, 2019, 2020, 2020, 2015,…
## $ `Total Enrollment`                         <dbl> 34340, 10191, 32452, 24208,…
## $ `Undergrad Enrollment`                     <dbl> 22187, 15233, 9342, 17456, …
## $ `Grad Enrollment`                          <dbl> 5967, 3047, 12612, 11830, 2…
## $ `Average SAT Score`                        <dbl> 1187, 1123, 943, 981, 1031,…
## $ `Average ACT Score`                        <dbl> 19, 25, 23, 18, 31, 20, 25,…
## $ `Acceptance Rate (%)`                      <dbl> 43.42, 48.75, 60.68, 44.42,…
## $ `Diversity Index`                          <dbl> 0.52, 0.80, 0.59, 0.52, 0.7…
## $ `Financial Aid (%)`                        <dbl> 61.48, 61.39, 89.57, 54.33,…
## $ `Avg Financial Aid (USD)`                  <dbl> 19822, 19564, 6467, 12357, …
## $ Region                                     <chr> "Central", "Upper Peninsula…
## $ `Policy Notes`                             <chr> "Policy applies to all MI h…
## $ `Last Updated`                             <dttm> 2022-01-02, 2022-01-09, 20…

The dataset contains 100 rows and 20 columns, representing detailed information about public universities in Michigan. Key variables include tuition costs, academic scores, enrollment numbers, financial aid details, and policy indicators. From the data structure, we see that variables like Average SAT Score, Average ACT Score, Total Enrollment, and Financial Aid (%) are numeric and vary widely across institutions. Categorical variables such as Department, Region, and Tuition Equity Policy provide group-level context. This output confirms that the dataset is well-structured for statistical modeling, with enough variation in both numeric and categorical predictors to explore relationships with outcomes like tuition or policy adoption.

ggpairs

# Select numeric columns only
numeric_data <- Final_Data %>% select(where(is.numeric))

# Create scatterplot matrix
ggpairs(numeric_data, title = "Scatterplot Matrix of Numeric Variables")

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

The scatterplot matrix provides a comprehensive view of pairwise relationships among all numeric variables in the dataset. Diagonal elements show the distribution of each variable, while the lower triangle visualizes bivariate scatterplots and the upper triangle reports correlation coefficients. Notably, In-State Tuition and Out-of-State Tuition show a moderately positive correlation, indicating institutions with high in-state tuition often have higher out-of-state rates as well. Total Enrollment and Undergrad Enrollment exhibit a strong positive correlation, as expected, while variables like Average SAT and ACT scores also correlate positively. Overall, the matrix highlights key variable relationships useful for modeling tuition costs and equity outcomes.

# View data structure
str(Final_Data)

## tibble [100 × 20] (S3: tbl_df/tbl/data.frame)
##  $ University ID                           : chr [1:100] "U001" "U002" "U003" "U004" ...
##  $ University Name                         : chr [1:100] "Northern Michigan University" "Baker College" "Kettering University" "Michigan State University" ...
##  $ Department                              : chr [1:100] "Sciences" "Education" "Education" "Sciences" ...
##  $ In-State Tuition (USD)                  : num [1:100] 12441 10841 13509 12474 17194 ...
##  $ Out-of-State Tuition (USD)              : num [1:100] 45209 21394 48355 19813 36287 ...
##  $ Offers In-State Tuition to DACA Students: chr [1:100] "No" "No" "No" "No" ...
##  $ Tuition Equity Policy                   : chr [1:100] "Yes" "No" "Yes" "Partial" ...
##  $ Year Policy Adopted                     : num [1:100] NA 2019 2020 2020 2015 ...
##  $ Total Enrollment                        : num [1:100] 34340 10191 32452 24208 38025 ...
##  $ Undergrad Enrollment                    : num [1:100] 22187 15233 9342 17456 10947 ...
##  $ Grad Enrollment                         : num [1:100] 5967 3047 12612 11830 2522 ...
##  $ Average SAT Score                       : num [1:100] 1187 1123 943 981 1031 ...
##  $ Average ACT Score                       : num [1:100] 19 25 23 18 31 20 25 17 18 21 ...
##  $ Acceptance Rate (%)                     : num [1:100] 43.4 48.8 60.7 44.4 63 ...
##  $ Diversity Index                         : num [1:100] 0.52 0.8 0.59 0.52 0.73 0.88 0.66 0.64 0.4 0.53 ...
##  $ Financial Aid (%)                       : num [1:100] 61.5 61.4 89.6 54.3 87.8 ...
##  $ Avg Financial Aid (USD)                 : num [1:100] 19822 19564 6467 12357 15713 ...
##  $ Region                                  : chr [1:100] "Central" "Upper Peninsula" "Northeast" "Upper Peninsula" ...
##  $ Policy Notes                            : chr [1:100] "Policy applies to all MI high school grads" "Tuition equity not offered" "Policy applies to all MI high school grads" "Tuition equity not offered" ...
##  $ Last Updated                            : POSIXct[1:100], format: "2022-01-02" "2022-01-09" ...

The structure output reveals a well-organized tibble with 100 observations and 20 variables representing detailed university-level data across Michigan. It includes numeric features such as In-State Tuition, Out-of-State Tuition, SAT/ACT scores, Enrollment figures, and Financial Aid metrics, offering rich quantitative insights for modeling. Categorical features like Department, Region, Tuition Equity Policy, and Offers In-State Tuition to DACA Students provide policy and institutional context. The presence of policy notes and update timestamps suggests the data is designed for analyzing tuition equity and institutional characteristics. This structure confirms the dataset is suitable for both descriptive and predictive analytics.

# Clean column names for safe use
colnames(Final_Data) <- make.names(colnames(Final_Data))

# Convert character to factors
Final_Data$Offers.In.State.Tuition.to.DACA.Students <- as.factor(Final_Data$Offers.In.State.Tuition.to.DACA.Students)
Final_Data$Tuition.Equity.Policy <- as.factor(Final_Data$Tuition.Equity.Policy)
Final_Data$Region <- as.factor(Final_Data$Region)
Final_Data$Department <- as.factor(Final_Data$Department)

summary(Final_Data)

##  University.ID      University.Name          Department In.State.Tuition..USD.
##  Length:100         Length:100         Arts       :15   Min.   : 9313         
##  Class :character   Class :character   Business   :18   1st Qu.:11396         
##  Mode  :character   Mode  :character   Education  :25   Median :13560         
##                                        Engineering:18   Mean   :13776         
##                                        Sciences   :24   3rd Qu.:16171         
##                                                         Max.   :17974         
##                                                                               
##  Out.of.State.Tuition..USD. Offers.In.State.Tuition.to.DACA.Students
##  Min.   :18462              No :50                                  
##  1st Qu.:25012              Yes:50                                  
##  Median :36538                                                      
##  Mean   :35658                                                      
##  3rd Qu.:44597                                                      
##  Max.   :54830                                                      
##                                                                     
##  Tuition.Equity.Policy Year.Policy.Adopted Total.Enrollment
##  No     :32            Min.   :2012        Min.   : 2397   
##  Partial:31            1st Qu.:2014        1st Qu.:14406   
##  Yes    :37            Median :2017        Median :26945   
##                        Mean   :2017        Mean   :26313   
##                        3rd Qu.:2020        3rd Qu.:36932   
##                        Max.   :2022        Max.   :49394   
##                        NA's   :2                           
##  Undergrad.Enrollment Grad.Enrollment Average.SAT.Score Average.ACT.Score
##  Min.   : 1055        Min.   :  528   Min.   : 912      Min.   :17.00    
##  1st Qu.: 6791        1st Qu.: 4986   1st Qu.:1035      1st Qu.:20.00    
##  Median :13541        Median : 8308   Median :1132      Median :23.00    
##  Mean   :14249        Mean   : 8228   Mean   :1163      Mean   :23.57    
##  3rd Qu.:21936        3rd Qu.:11777   3rd Qu.:1325      3rd Qu.:27.25    
##  Max.   :29841        Max.   :14986   Max.   :1449      Max.   :31.00    
##                                                                          
##  Acceptance.Rate.... Diversity.Index  Financial.Aid.... Avg.Financial.Aid..USD.
##  Min.   :40.54       Min.   :0.3000   Min.   :51.14     Min.   : 4142          
##  1st Qu.:55.75       1st Qu.:0.4575   1st Qu.:64.65     1st Qu.: 9394          
##  Median :65.96       Median :0.6250   Median :74.80     Median :12676          
##  Mean   :66.04       Mean   :0.6110   Mean   :73.48     Mean   :12604          
##  3rd Qu.:78.28       3rd Qu.:0.7800   3rd Qu.:82.44     3rd Qu.:16166          
##  Max.   :89.95       Max.   :0.9000   Max.   :94.32     Max.   :19971          
##                                                                                
##              Region   Policy.Notes        Last.Updated                
##  Central        :24   Length:100         Min.   :2022-01-02 00:00:00  
##  Northeast      :16   Class :character   1st Qu.:2022-06-24 06:00:00  
##  Southeast      :19   Mode  :character   Median :2022-12-14 12:00:00  
##  Upper Peninsula:22                      Mean   :2022-12-14 12:00:00  
##  West           :19                      3rd Qu.:2023-06-05 18:00:00  
##                                          Max.   :2023-11-26 00:00:00  
##

The summary statistics provide a high-level overview of the tuition equity dataset. In-state tuition ranges from around $9,313 to $17,974, while out-of-state tuition is significantly higher, ranging from $18,462 to $54,830. Enrollment numbers vary widely, with total enrollments between 2,397 and 49,394 students. Academic indicators like SAT and ACT scores show substantial variability across institutions, with median SAT scores around 1132 and ACT scores around 23. Approximately half of the universities offer in-state tuition to DACA students, and tuition equity policies are distributed among “Yes” (37%), “Partial” (31%), and “No” (32%). Additionally, average financial aid awards and diversity indices also display considerable variation, suggesting heterogeneity in institutional support and student demographics across Michigan public universities.

# Compute and visualize the correlation matrix of numeric variables using ggcorrplot
numeric_vars <- Final_Data %>% select(where(is.numeric))
corr_matrix <- cor(numeric_vars, use = "complete.obs")

# Convert to a tidy format
corr_table <- as.data.frame(round(corr_matrix, 2))
corr_table

ggcorrplot(corr_matrix, 
           lab = FALSE,                      # hide the correlation values
           title = "Correlation Matrix",
           hc.order = TRUE, 
           type = "lower")

The correlation matrix heatmap reveals how strongly numeric variables are linearly related in the tuition equity dataset. As expected, In-State Tuition and Out-of-State Tuition show a weak positive correlation (~0.14), suggesting some alignment in pricing strategies across institutions. Total Enrollment and Undergrad Enrollment are positively related, while Grad Enrollment shows a slight negative relationship with in-state tuition. Interestingly, Diversity Index has a small positive correlation with out-of-state tuition and year of policy adoption, implying that more diverse schools may have more recent equity policies and slightly higher tuition. Overall, the matrix helps identify potential predictors for modeling tuition and policy-related outcomes.

# Generate histograms for key numeric variables to examine their distributions
key_vars <- c("In.State.Tuition..USD.", "Out.of.State.Tuition..USD.",
              "Average.SAT.Score", "Average.ACT.Score", "Avg.Financial.Aid..USD.")

for (col in key_vars) {
  print(
    ggplot(Final_Data, aes_string(x = col)) +
      geom_histogram(fill = "steelblue", bins = 30, color = "black") +
      theme_minimal() +
      labs(title = paste("Histogram of", col), x = col, y = "Frequency")
  )
}

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The histograms above illustrate the distributions of key numeric variables in the tuition equity dataset. Both In-State and Out-of-State Tuition show moderately right-skewed distributions, indicating that while most institutions charge moderate tuition, a few charge substantially higher amounts. The Average SAT and ACT Scores reveal fairly normal distributions with slight variation, reflecting academic competitiveness across universities. Lastly, Average Financial Aid (USD) is also right-skewed, suggesting that while most students receive modest aid, a few institutions offer significantly higher awards. These visualizations help identify variable ranges, skewness, and potential outliers, all of which are essential for guiding model assumptions and data preprocessing.

# Bar plots of categorical variables to visualize their frequency distributions
cat_vars <- c("University.Name", "Department", "Region",
              "Offers.In.State.Tuition.to.DACA.Students", "Tuition.Equity.Policy")

for (col in cat_vars) {
  print(
    ggplot(Final_Data, aes_string(x = col)) +
      geom_bar(fill = "darkgreen") +
      theme_minimal() +
      coord_flip() +
      labs(title = paste("Bar Plot of", col), x = col, y = "Count")
  )
}

The bar plots provide insights into the distribution of categorical variables across the dataset. The Department variable shows a relatively balanced spread among categories like Sciences, Business, Education, Arts, and Engineering. The Region variable reflects geographic diversity, with universities distributed across Central, Upper Peninsula, Northeast, Southeast, and West regions of Michigan. When examining policy-related factors, the Offers In-State Tuition to DACA Students variable is evenly split between “Yes” and “No,” suggesting variation in institutional support for DACA students. Similarly, the Tuition Equity Policy is distributed across “Yes,” “No,” and “Partial,” indicating differing levels of commitment to tuition equity across universities. These visualizations help contextualize policy decisions and institutional classifications within the dataset.

# Boxplots to visualize how tuition costs vary by policy and region
ggplot(Final_Data, aes(x = Tuition.Equity.Policy, y = Out.of.State.Tuition..USD., fill = Tuition.Equity.Policy)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Out-of-State Tuition by Equity Policy")

ggplot(Final_Data, aes(x = Region, y = In.State.Tuition..USD., fill = Region)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "In-State Tuition by Region")

The boxplot comparing Out-of-State Tuition by Equity Policy shows that institutions with no equity policy tend to have slightly lower median tuition compared to those with partial or full policies. However, there is substantial overlap in distributions, suggesting that equity policy alone may not fully explain tuition differences.

In the In-State Tuition by Region plot, institutions in the West and Upper Peninsula regions generally show higher median in-state tuition, while those in the Northeast and Central regions tend to have lower values. This suggests potential regional disparities in tuition pricing across Michigan’s public universities.

# Scatter plots to explore the relationship between enrollment and tuition costs
ggplot(Final_Data, aes(x = Total.Enrollment, y = Out.of.State.Tuition..USD.)) +
  geom_point(color = "red") +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  theme_minimal() +
  labs(title = "Out-of-State Tuition vs Total Enrollment")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(Final_Data, aes(x = Undergrad.Enrollment, y = In.State.Tuition..USD.)) +
  geom_point(color = "darkorange") +
  theme_minimal() +
  labs(title = "In-State Tuition vs Undergrad Enrollment")

The first scatter plot shows a weak positive relationship between total enrollment and out-of-state tuition, suggesting that universities with more students may charge slightly higher out-of-state fees. However, the spread is wide, indicating variability. The second plot shows no strong trend between undergrad enrollment and in-state tuition, implying that in-state rates remain relatively stable regardless of undergraduate population size.

# Boxplots comparing financial aid amount and percentage across different tuition equity policy levels
ggplot(Final_Data, aes(x = Tuition.Equity.Policy, y = Avg.Financial.Aid..USD., fill = Tuition.Equity.Policy)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Average Financial Aid by Equity Policy")

ggplot(Final_Data, aes(x = Tuition.Equity.Policy, y = Financial.Aid...., fill = Tuition.Equity.Policy)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Financial Aid % by Equity Policy")

The boxplots reveal how financial aid differs by tuition equity policy. Universities with full equity policies (“Yes”) tend to offer higher average financial aid and a greater percentage of students receiving aid compared to those with “No” or “Partial” policies. This suggests that institutions with stronger equity commitments may also invest more in financial support.

# Boxplots comparing SAT and ACT scores across departments
ggplot(Final_Data, aes(x = Department, y = Average.SAT.Score, fill = Department)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Average SAT Score by Department")

ggplot(Final_Data, aes(x = Department, y = Average.ACT.Score, fill = Department)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Average ACT Score by Department")

The boxplots illustrate department-level differences in standardized test scores. Business and Education departments generally show higher median SAT and ACT scores, while Engineering and Arts display wider variability and more outliers. These visualizations help compare academic performance expectations across different fields of study within Michigan’s public universities.

Objective 1: Describe probability as a foundation of statistical modeling, including inference and maximum likelihood estimation

Split the data into training and testing sets (80-20 split)

# Remove NAs for modeling
Final_Model <- Final_Data %>% drop_na()

# Split data
set.seed(123)
tuition_split <- initial_split(Final_Data, prop = 0.8)
tuition_train <- training(tuition_split)
tuition_test <- testing(tuition_split)
head(tuition_train)

summary(tuition_train)

##  University.ID      University.Name          Department In.State.Tuition..USD.
##  Length:80          Length:80          Arts       :14   Min.   : 9313         
##  Class :character   Class :character   Business   :15   1st Qu.:11434         
##  Mode  :character   Mode  :character   Education  :21   Median :14315         
##                                        Engineering:11   Mean   :13978         
##                                        Sciences   :19   3rd Qu.:16245         
##                                                         Max.   :17910         
##                                                                               
##  Out.of.State.Tuition..USD. Offers.In.State.Tuition.to.DACA.Students
##  Min.   :18780              No :41                                  
##  1st Qu.:25078              Yes:39                                  
##  Median :38124                                                      
##  Mean   :36624                                                      
##  3rd Qu.:46774                                                      
##  Max.   :54830                                                      
##                                                                     
##  Tuition.Equity.Policy Year.Policy.Adopted Total.Enrollment
##  No     :27            Min.   :2012        Min.   : 2397   
##  Partial:22            1st Qu.:2014        1st Qu.:16101   
##  Yes    :31            Median :2016        Median :26789   
##                        Mean   :2016        Mean   :26256   
##                        3rd Qu.:2020        3rd Qu.:35091   
##                        Max.   :2022        Max.   :49394   
##                        NA's   :1                           
##  Undergrad.Enrollment Grad.Enrollment Average.SAT.Score Average.ACT.Score
##  Min.   : 1055        Min.   :  528   Min.   : 929      Min.   :17.00    
##  1st Qu.: 7226        1st Qu.: 4348   1st Qu.:1035      1st Qu.:20.00    
##  Median :13755        Median : 8511   Median :1146      Median :23.00    
##  Mean   :14393        Mean   : 8116   Mean   :1169      Mean   :23.76    
##  3rd Qu.:21936        3rd Qu.:11330   3rd Qu.:1328      3rd Qu.:28.00    
##  Max.   :29535        Max.   :14986   Max.   :1449      Max.   :31.00    
##                                                                          
##  Acceptance.Rate.... Diversity.Index  Financial.Aid.... Avg.Financial.Aid..USD.
##  Min.   :40.54       Min.   :0.3000   Min.   :51.14     Min.   : 4142          
##  1st Qu.:56.65       1st Qu.:0.4475   1st Qu.:64.65     1st Qu.: 9394          
##  Median :67.62       Median :0.6250   Median :75.08     Median :12423          
##  Mean   :67.07       Mean   :0.6092   Mean   :73.45     Mean   :12192          
##  3rd Qu.:80.29       3rd Qu.:0.7800   3rd Qu.:82.98     3rd Qu.:15245          
##  Max.   :89.95       Max.   :0.9000   Max.   :93.85     Max.   :19664          
##                                                                                
##              Region   Policy.Notes        Last.Updated                
##  Central        :17   Length:80          Min.   :2022-01-16 00:00:00  
##  Northeast      :14   Class :character   1st Qu.:2022-06-24 06:00:00  
##  Southeast      :12   Mode  :character   Median :2022-12-25 00:00:00  
##  Upper Peninsula:20                      Mean   :2022-12-22 17:24:00  
##  West           :17                      3rd Qu.:2023-06-19 18:00:00  
##                                          Max.   :2023-11-26 00:00:00  
##

# Create a modeling version with cleaner names
Final_Model <- Final_Data %>%
  rename(
    out_tuition = `Out.of.State.Tuition..USD.`,
    in_tuition = `In.State.Tuition..USD.`,
    sat = `Average.SAT.Score`,
    act = `Average.ACT.Score`,
    aid_pct = `Financial.Aid....`,
    total_enroll = `Total.Enrollment`,
    region = Region,
    daca_policy = `Offers.In.State.Tuition.to.DACA.Students`,
    policy = `Tuition.Equity.Policy`
  ) %>%
  mutate(
    daca_policy = as.factor(daca_policy),
    policy = as.factor(policy),
    region = as.factor(region)
  ) %>%
  drop_na()

Linear Regression – Predict Out-of-State Tuition

set.seed(1)
tuition_split <- initial_split(Final_Model, prop = 0.8)
tuition_train <- training(tuition_split)
tuition_test  <- testing(tuition_split)

tuition_recipe <- recipe(out_tuition ~ sat + aid_pct + total_enroll + policy, data = tuition_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

lm_spec <- linear_reg() %>% set_engine("lm")

lm_workflow <- workflow() %>%
  add_recipe(tuition_recipe) %>%
  add_model(lm_spec)

lm_fit <- lm_workflow %>% fit(data = tuition_train)

# Evaluate
lm_preds <- predict(lm_fit, tuition_test) %>%
  bind_cols(tuition_test %>% select(out_tuition))

metrics(lm_preds, truth = out_tuition, estimate = .pred)

# View model coefficients and inference info
tidy(lm_fit)

# Visualization: Actual vs Predicted Out-of-State Tuition
ggplot(lm_preds, aes(x = out_tuition, y = .pred)) +
  geom_point(color = "darkgreen") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(title = "Actual vs Predicted Out-of-State Tuition",
       x = "Actual Tuition",
       y = "Predicted Tuition")

# Visualization: Residuals vs Fitted Values
lm_preds <- lm_preds %>%
  mutate(residuals = out_tuition - .pred)

ggplot(lm_preds, aes(x = .pred, y = residuals)) +
  geom_point(color = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(title = "Residuals vs Fitted Values",
       x = "Predicted Tuition",
       y = "Residuals")

This analysis demonstrates the foundation of statistical modeling through a linear regression model using maximum likelihood estimation (MLE). By splitting the dataset into training and testing sets and preprocessing with recipes, we built a model to predict out-of-state tuition based on key predictors like SAT score, financial aid percentage, enrollment, and policy type. The model coefficients obtained via tidy() offer interpretable estimates, while the residual and prediction plots support model evaluation and inference. Overall, this approach aligns with Objective 1 by applying probability theory and MLE to make data-driven predictions and assess uncertainty.

Objective 2: Apply the appropriate generalized linear model for a specific data context

Polynomial Regression – Predict Out-of-State Tuition

# Polynomial regression with SAT² and ACT²
set.seed(100)

poly_recipe <- recipe(out_tuition ~ sat + act + aid_pct + total_enroll, data = tuition_train) %>%
  step_poly(sat, degree = 2) %>%
  step_poly(act, degree = 2) %>%
  step_normalize(all_numeric_predictors())

poly_spec <- linear_reg() %>% set_engine("lm")

poly_wf <- workflow() %>%
  add_recipe(poly_recipe) %>%
  add_model(poly_spec)

poly_fit <- poly_wf %>% fit(data = tuition_train)

# Evaluate polynomial model
poly_preds <- predict(poly_fit, new_data = tuition_test) %>%
  bind_cols(tuition_test %>% select(out_tuition))

metrics(poly_preds, truth = out_tuition, estimate = .pred)

# Plot residuals vs predicted values
poly_preds <- poly_preds %>%
  mutate(residuals = out_tuition - .pred)

ggplot(poly_preds, aes(x = .pred, y = residuals)) +
  geom_point(color = "darkblue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(title = "Residuals vs Fitted Values",
       x = "Predicted Tuition",
       y = "Residuals")

# Plot actual vs predicted values
ggplot(poly_preds, aes(x = out_tuition, y = .pred)) +
  geom_point(color = "darkgreen") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(title = "Actual vs Predicted Tuition",
       x = "Actual Tuition",
       y = "Predicted Tuition")

# Histogram of residuals
ggplot(poly_preds, aes(x = residuals)) +
  geom_histogram(fill = "purple", bins = 30, color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Residuals",
       x = "Residuals",
       y = "Frequency")

# Q-Q plot to assess normality of residuals
library(qqplotr)

## 
## Attaching package: 'qqplotr'

## The following objects are masked from 'package:ggplot2':
## 
##     stat_qq_line, StatQqLine

ggplot(poly_preds, aes(sample = residuals)) +
  stat_qq_point(size = 2, color = "steelblue") +
  stat_qq_line(color = "red") +
  theme_minimal() +
  labs(title = "Q-Q Plot of Residuals")

The polynomial regression model explores the nonlinear effects of SAT and ACT scores on out-of-state tuition. Evaluation metrics show an RMSE of approximately $11,315 and an R² of 0.16, suggesting a modest fit. The residual plot shows randomness around zero, while the Q-Q plot indicates some departure from normality. Although prediction accuracy is limited, the model captures some underlying patterns, affirming that nonlinear test score relationships contribute to tuition variability. Further refinements could improve the model’s performance.

Objective 3: Demonstrate model selection given a set of candidate models

Random Forest Classification – Predict DACA Tuition Policy

set.seed(2)
daca_split <- initial_split(Final_Model, prop = 0.8, strata = daca_policy)
daca_train <- training(daca_split)
daca_test  <- testing(daca_split)

daca_recipe <- recipe(daca_policy ~ sat + act + total_enroll + region + policy + aid_pct, data = daca_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

rf_spec <- rand_forest(mtry = 3, trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_workflow <- workflow() %>%
  add_recipe(daca_recipe) %>%
  add_model(rf_spec)

rf_fit <- rf_workflow %>% fit(data = daca_train)

rf_preds <- predict(rf_fit, new_data = daca_test, type = "prob") %>%
  bind_cols(predict(rf_fit, new_data = daca_test)) %>%
  bind_cols(daca_test %>% select(daca_policy))

# Metrics
metrics(rf_preds, truth = daca_policy, estimate = .pred_class)

conf_mat(rf_preds, truth = daca_policy, estimate = .pred_class)

##           Truth
## Prediction No Yes
##        No   4   6
##        Yes  6   4

Multinomial Logistic Regression – Predict Department Type

library(nnet)

# Prepare data: filter for rows with non-missing Department
multi_data <- Final_Model %>% filter(!is.na(Department))

set.seed(101)
multi_split <- initial_split(multi_data, prop = 0.8, strata = Department)
multi_train <- training(multi_split)
multi_test <- testing(multi_split)

multi_recipe <- recipe(Department ~ sat + act + region + aid_pct + total_enroll, data = multi_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

multi_spec <- multinom_reg() %>%
  set_engine("nnet") %>%
  set_mode("classification")

multi_wf <- workflow() %>%
  add_recipe(multi_recipe) %>%
  add_model(multi_spec)

multi_fit <- multi_wf %>% fit(data = multi_train)

multi_preds <- predict(multi_fit, multi_test) %>%
  bind_cols(multi_test %>% select(Department))

metrics(multi_preds, truth = Department, estimate = .pred_class)

conf_mat(multi_preds, truth = Department, estimate = .pred_class)

##              Truth
## Prediction    Arts Business Education Engineering Sciences
##   Arts           0        0         0           2        0
##   Business       1        1         1           0        0
##   Education      1        1         2           0        1
##   Engineering    1        2         1           2        1
##   Sciences       0        0         1           0        3

multi_preds %>%
  conf_mat(truth = Department, estimate = .pred_class) %>%
  autoplot(type = "heatmap") +
  scale_fill_gradient(low = "#fde0dd", high = "#67000d") +  # light peach to dark red
  labs(
    title = "Confusion Matrix Heatmap",
    subtitle = "Multinomial Logistic Regression - Department",
    fill = "Count"
  ) +
  theme_minimal()

## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

multi_preds %>%
  count(Department, .pred_class) %>%
  ggplot(aes(x = Department, y = n, fill = .pred_class)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Predicted vs Actual Department",
       x = "Actual Department",
       y = "Count",
       fill = "Predicted")

multi_preds <- multi_preds %>%
  mutate(correct = ifelse(Department == .pred_class, "Correct", "Incorrect"))

ggplot(multi_preds, aes(x = Department, fill = correct)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  labs(title = "Classification Accuracy by Department",
       x = "Department",
       y = "Count",
       fill = "Prediction Status")

Linear Discriminant Analysis – Predict Region

library(discrim)

## 
## Attaching package: 'discrim'

## The following object is masked from 'package:dials':
## 
##     smoothness

# LDA to predict Region
set.seed(102)
lda_split <- initial_split(Final_Model, prop = 0.8, strata = region)
lda_train <- training(lda_split)
lda_test <- testing(lda_split)

lda_recipe <- recipe(region ~ sat + act + Diversity.Index + aid_pct, data = lda_train) %>%
  step_normalize(all_numeric_predictors())

lda_spec <- discrim_linear() %>%
  set_engine("MASS") %>%
  set_mode("classification")

lda_wf <- workflow() %>%
  add_recipe(lda_recipe) %>%
  add_model(lda_spec)

lda_fit <- lda_wf %>% fit(data = lda_train)

lda_preds <- predict(lda_fit, lda_test) %>%
  bind_cols(lda_test %>% select(region))

metrics(lda_preds, truth = region, estimate = .pred_class)

conf_mat(lda_preds, truth = region, estimate = .pred_class)

##                  Truth
## Prediction        Central Northeast Southeast Upper Peninsula West
##   Central               0         2         1               1    1
##   Northeast             2         0         0               1    0
##   Southeast             2         1         2               2    1
##   Upper Peninsula       1         0         1               1    2
##   West                  0         0         0               0    0

lda_preds %>%
  conf_mat(truth = region, estimate = .pred_class) %>%
  autoplot(type = "heatmap") +
  scale_fill_gradient(low = "#deebf7", high = "#08306b") +  # light blue to navy
  labs(
    title = "LDA: Confusion Matrix Heatmap",
    subtitle = "True vs Predicted Regions",
    fill = "Count"
  ) +
  theme_minimal()

## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

lda_preds %>%
  count(region, .pred_class) %>%
  ggplot(aes(x = region, y = n, fill = .pred_class)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "LDA: Actual vs Predicted Region",
       x = "Actual Region",
       y = "Count",
       fill = "Predicted")

lda_preds <- lda_preds %>%
  mutate(correct = ifelse(region == .pred_class, "Correct", "Incorrect"))

ggplot(lda_preds, aes(x = region, fill = correct)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  labs(title = "LDA: Correct vs Incorrect Classification by Region",
       x = "Region",
       y = "Count",
       fill = "Prediction Accuracy")

# Add posterior probabilities
lda_probs <- predict(lda_fit, lda_test, type = "prob") %>%
  bind_cols(lda_test %>% select(region))

# Max predicted probability per row
lda_probs <- lda_probs %>%
  mutate(max_prob = apply(select(., starts_with(".pred_")), 1, max))

ggplot(lda_probs, aes(x = max_prob)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(title = "LDA: Distribution of Maximum Posterior Probabilities",
       x = "Max Posterior Probability",
       y = "Count")

Ridge Regression – Predict Out-of-State Tuition

# Ridge regression to predict out_tuition
set.seed(103)

ridge_spec <- linear_reg(penalty = tune(), mixture = 0) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

ridge_recipe <- recipe(out_tuition ~ sat + aid_pct + total_enroll + policy, data = tuition_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

ridge_wf <- workflow() %>%
  add_recipe(ridge_recipe) %>%
  add_model(ridge_spec)

set.seed(104)
ridge_res <- tune_grid(
  ridge_wf,
  resamples = vfold_cv(tuition_train, v = 5),
  grid = 10
)

show_best(ridge_res, metric = "rmse")

# Finalize and fit
best_ridge <- select_best(ridge_res, metric = "rmse")

final_ridge <- finalize_workflow(ridge_wf, best_ridge)

ridge_fit <- final_ridge %>% fit(data = tuition_train)

# Evaluate
ridge_preds <- predict(ridge_fit, tuition_test) %>%
  bind_cols(tuition_test %>% select(out_tuition))

metrics(ridge_preds, truth = out_tuition, estimate = .pred)

# Add residuals to predictions
ridge_preds <- ridge_preds %>%
  mutate(residuals = out_tuition - .pred)

ggplot(ridge_preds, aes(x = .pred, y = residuals)) +
  geom_point(color = "darkblue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(title = "Ridge Regression: Residuals vs Fitted Values",
       x = "Predicted Tuition",
       y = "Residuals")

ggplot(ridge_preds, aes(x = out_tuition, y = .pred)) +
  geom_point(color = "darkgreen") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(title = "Ridge Regression: Actual vs Predicted Tuition",
       x = "Actual Tuition",
       y = "Predicted Tuition")

qqnorm(ridge_preds$residuals)
qqline(ridge_preds$residuals, col = "red")

ggplot(ridge_preds, aes(x = residuals)) +
  geom_histogram(fill = "steelblue", bins = 30, color = "black") +
  theme_minimal() +
  labs(title = "Ridge Regression: Distribution of Residuals",
       x = "Residuals",
       y = "Frequency")

K-Means Clustering – Group Similar Universities

library(cluster)
library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

# Numeric features only
clust_data <- Final_Model %>%
  select(sat, act, in_tuition, out_tuition, total_enroll, aid_pct) %>%
  na.omit() %>%
  scale()

# K-means
set.seed(3)
kmeans_fit <- kmeans(clust_data, centers = 3, nstart = 25)

fviz_cluster(kmeans_fit, data = clust_data,
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_minimal(),
             main = "University Clustering")

In Objective 3, I applied a range of models to different predictive tasks and explored their effectiveness. Random Forest classification was used to predict DACA tuition policy, though it showed low predictive accuracy. Multinomial Logistic Regression was implemented to classify academic departments, supported by confusion matrices and accuracy plots. Linear Discriminant Analysis (LDA) attempted to predict region classifications but performed poorly, as indicated by low accuracy and diffuse posterior probabilities. For regression, Ridge Regression was tuned and evaluated, yielding stable performance with visual diagnostics for residuals and prediction error. Finally, K-means clustering uncovered natural groupings among universities based on tuition, test scores, and enrollment metrics. Each model was selected to match the nature of the response variable, showcasing informed model selection across classification, regression, and clustering contexts.

Objective 4: Express the results of statistical models to a general audience

Import the Data

The dataset analyzed consists of information on public universities in Michigan, covering variables such as in-state and out-of-state tuition, student enrollment, SAT and ACT scores, financial aid, regional location, academic department, and whether the university offers in-state tuition to DACA students or has a tuition equity policy. This broad range of variables allowed for rich exploration of both predictive and explanatory modeling tasks, providing insights relevant to higher education policy.

Exploratory Analysis

To begin, I conducted exploratory data analysis (EDA) to understand the distributions and relationships within the dataset. Histograms revealed that both in-state and out-of-state tuition were right-skewed, with a few universities charging significantly higher tuition than others. Bar plots showed how categorical variables like region, department, and policy status were distributed across the dataset. This helped identify potential class imbalances and informed modeling choices. For example, the equity policy variable had relatively balanced representation across “Yes,” “Partial,” and “No” categories, making it suitable for regression and classification modeling.

ggpairs and Correlation Matrix

The use of the ggpairs() function allowed for visual inspection of relationships between numeric variables. There was a modest positive correlation between SAT/ACT scores and tuition, as well as between enrollment and financial aid percentage. These observations suggested that schools with higher academic standards and larger enrollments may also offer more aid or charge higher tuition, patterns which could be tested more formally through regression models.

Summary Statistics

Summary statistics confirmed that out-of-state tuition ranged from about $18,000 to over $54,000, while SAT scores ranged from approximately 980 to 1370. Financial aid packages varied widely, both in dollar amount and in percentage of students receiving aid. The variability across these indicators justified their inclusion in predictive modeling, as they provided the necessary variance to explore outcome relationships.

Model Coefficients and Interpretation

To model out-of-state tuition, I implemented a linear regression model using SAT score, financial aid percentage, total enrollment, and equity policy status as predictors. The coefficients were extracted using the tidy() function from the broom package. Notably, the coefficient for SAT score was -630, suggesting that, all else equal, a 1-point increase in SAT score corresponded to a decrease of $630 in tuition. However, this effect was not statistically significant (p = 0.64), meaning we cannot conclude a strong relationship based on this sample.

Similarly, aid percentage had a positive coefficient (~$1,494), indicating that schools offering more aid might also charge higher tuition. Total enrollment had a coefficient of approximately $1,013 per additional student, which seems implausibly high and may reflect an untransformed scale or the need for log transformation. Importantly, none of these predictors reached conventional significance thresholds, suggesting weak predictive power or multicollinearity in the model.

For categorical predictors, institutions with a “Partial” policy had tuition levels about $919 higher than those with “No” policy, while “Yes” policy schools had slightly lower tuition (-$119), but again, these results were not significant. These findings were surprising and suggest that other unmeasured variables—such as institutional funding models or location-specific cost structures—may better explain tuition variation.

Confidence Intervals and Regularization

Although I did not explicitly plot confidence intervals, the standard errors and p-values from the regression output provided enough information to assess estimate uncertainty. To address multicollinearity and improve model stability, I also applied ridge regression using cross-validation to tune the penalty parameter. This regularized model shrunk the coefficients toward zero and improved generalizability, though it also confirmed the relative weakness of SAT, aid, and policy status as strong individual predictors.

Visualizing Results for a Broader Audience

To help communicate findings to a general audience, I relied on visual tools throughout the analysis. Actual vs. predicted plots illustrated how well the model fit the test data, while residual plots helped assess whether the linear model assumptions were reasonable. In both linear and polynomial regression, residuals were relatively scattered, suggesting noise and possible missing variables.

For classification models, I used heatmaps of confusion matrices and accuracy plots to demonstrate model performance in predicting categorical outcomes like academic department and DACA policy. These visuals offered intuitive ways to convey which categories were predicted well and where errors were more frequent. For example, the multinomial logistic regression model misclassified departments with similar academic profiles, which is understandable but also highlights limitations.

Accessible Language and Takeaways

Throughout the portfolio, I avoided overly technical jargon and aimed to explain concepts in terms that are relatable to non-statistical audiences. For example, rather than discussing statistical significance in abstract terms, I explained that “we can’t be confident there is a real relationship between SAT scores and tuition based on this model.” This kind of interpretation makes the results more accessible and actionable for stakeholders in education policy or university administration.

In summary, I used a combination of summary statistics, regression coefficients, diagnostic visuals, and classification evaluation metrics to communicate model results effectively. My focus was on clarity, transparency, and context-based interpretation rather than technical depth alone. This objective helped reinforce my ability to translate statistical findings into meaningful insights for broader audiences, which is essential for any data science or policy-oriented role.

Objective 5: Use programming software to fit and assess statistical models

# Extract model object from workflow
lm_engine <- extract_fit_engine(lm_fit)

# Create diagnostic dataframe
lm_diag <- augment(lm_engine)

# Residuals vs Fitted
ggplot(lm_diag, aes(.fitted, .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residuals vs Fitted Values", x = "Fitted Tuition", y = "Residuals") +
  theme_minimal()

# Q-Q Plot
ggplot(lm_diag, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line(color = "blue") +
  labs(title = "Q-Q Plot of Residuals") +
  theme_minimal()

# Scale-Location Plot
ggplot(lm_diag, aes(.fitted, sqrt(abs(.std.resid)))) +
  geom_point(alpha = 0.6) +
  geom_smooth(se = FALSE, color = "purple") +
  labs(title = "Scale-Location Plot", x = "Fitted Tuition", y = "Sqrt(|Standardized Residuals|)") +
  theme_minimal()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Cook's Distance
lm_diag$cooksd <- cooks.distance(lm_engine)

ggplot(lm_diag, aes(x = seq_along(cooksd), y = cooksd)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Cook's Distance Plot", x = "Observation Index", y = "Cook's Distance") +
  theme_minimal()

In this portfolio, I demonstrated the use of R and the Tidy Models framework to programmatically fit and assess statistical models. I implemented workflow(), recipe(), and fit() to develop models including linear regression, polynomial regression, random forest classification, multinomial logistic regression, linear discriminant analysis, ridge regression, and k-means clustering.

To ensure data readiness, I used preprocessing techniques like normalization, polynomial transformations, and dummy variable encoding. After model fitting, I applied both quantitative and visual tools for evaluation. Specifically, I used residual diagnostics for linear models, including residuals vs. fitted plots, Q-Q plots, Scale-Location plots, and Cook’s Distance plots. These visualizations allowed me to assess assumptions such as linearity, homoscedasticity, normality, and influential observations.

Across other models, performance was evaluated using appropriate classification metrics and visual summaries like confusion matrices and prediction accuracy bar plots. By combining statistical programming with model diagnostics and validation, I ensured that the models were interpretable, reproducible, and statistically robust.

Portfolio

Rebecca Dara

2025-04-19