library(readxl)
library(GGally)
library(readxl)
library(dplyr)
library(ggplot2)
library(tidyr)
library(ggcorrplot)
library(forcats)
library(tidymodels)
library(janitor)
Final_Data <- read_excel("Final_Data.xlsx")
head(Final_Data)
glimpse(Final_Data)
## Rows: 100
## Columns: 20
## $ `University ID` <chr> "U001", "U002", "U003", "U0…
## $ `University Name` <chr> "Northern Michigan Universi…
## $ Department <chr> "Sciences", "Education", "E…
## $ `In-State Tuition (USD)` <dbl> 12441, 10841, 13509, 12474,…
## $ `Out-of-State Tuition (USD)` <dbl> 45209, 21394, 48355, 19813,…
## $ `Offers In-State Tuition to DACA Students` <chr> "No", "No", "No", "No", "Ye…
## $ `Tuition Equity Policy` <chr> "Yes", "No", "Yes", "Partia…
## $ `Year Policy Adopted` <dbl> NA, 2019, 2020, 2020, 2015,…
## $ `Total Enrollment` <dbl> 34340, 10191, 32452, 24208,…
## $ `Undergrad Enrollment` <dbl> 22187, 15233, 9342, 17456, …
## $ `Grad Enrollment` <dbl> 5967, 3047, 12612, 11830, 2…
## $ `Average SAT Score` <dbl> 1187, 1123, 943, 981, 1031,…
## $ `Average ACT Score` <dbl> 19, 25, 23, 18, 31, 20, 25,…
## $ `Acceptance Rate (%)` <dbl> 43.42, 48.75, 60.68, 44.42,…
## $ `Diversity Index` <dbl> 0.52, 0.80, 0.59, 0.52, 0.7…
## $ `Financial Aid (%)` <dbl> 61.48, 61.39, 89.57, 54.33,…
## $ `Avg Financial Aid (USD)` <dbl> 19822, 19564, 6467, 12357, …
## $ Region <chr> "Central", "Upper Peninsula…
## $ `Policy Notes` <chr> "Policy applies to all MI h…
## $ `Last Updated` <dttm> 2022-01-02, 2022-01-09, 20…
The dataset contains 100 rows and 20 columns, representing detailed information about public universities in Michigan. Key variables include tuition costs, academic scores, enrollment numbers, financial aid details, and policy indicators. From the data structure, we see that variables like Average SAT Score, Average ACT Score, Total Enrollment, and Financial Aid (%) are numeric and vary widely across institutions. Categorical variables such as Department, Region, and Tuition Equity Policy provide group-level context. This output confirms that the dataset is well-structured for statistical modeling, with enough variation in both numeric and categorical predictors to explore relationships with outcomes like tuition or policy adoption.
# Select numeric columns only
numeric_data <- Final_Data %>% select(where(is.numeric))
# Create scatterplot matrix
ggpairs(numeric_data, title = "Scatterplot Matrix of Numeric Variables")
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
The scatterplot matrix provides a comprehensive view of pairwise relationships among all numeric variables in the dataset. Diagonal elements show the distribution of each variable, while the lower triangle visualizes bivariate scatterplots and the upper triangle reports correlation coefficients. Notably, In-State Tuition and Out-of-State Tuition show a moderately positive correlation, indicating institutions with high in-state tuition often have higher out-of-state rates as well. Total Enrollment and Undergrad Enrollment exhibit a strong positive correlation, as expected, while variables like Average SAT and ACT scores also correlate positively. Overall, the matrix highlights key variable relationships useful for modeling tuition costs and equity outcomes.
# View data structure
str(Final_Data)
## tibble [100 × 20] (S3: tbl_df/tbl/data.frame)
## $ University ID : chr [1:100] "U001" "U002" "U003" "U004" ...
## $ University Name : chr [1:100] "Northern Michigan University" "Baker College" "Kettering University" "Michigan State University" ...
## $ Department : chr [1:100] "Sciences" "Education" "Education" "Sciences" ...
## $ In-State Tuition (USD) : num [1:100] 12441 10841 13509 12474 17194 ...
## $ Out-of-State Tuition (USD) : num [1:100] 45209 21394 48355 19813 36287 ...
## $ Offers In-State Tuition to DACA Students: chr [1:100] "No" "No" "No" "No" ...
## $ Tuition Equity Policy : chr [1:100] "Yes" "No" "Yes" "Partial" ...
## $ Year Policy Adopted : num [1:100] NA 2019 2020 2020 2015 ...
## $ Total Enrollment : num [1:100] 34340 10191 32452 24208 38025 ...
## $ Undergrad Enrollment : num [1:100] 22187 15233 9342 17456 10947 ...
## $ Grad Enrollment : num [1:100] 5967 3047 12612 11830 2522 ...
## $ Average SAT Score : num [1:100] 1187 1123 943 981 1031 ...
## $ Average ACT Score : num [1:100] 19 25 23 18 31 20 25 17 18 21 ...
## $ Acceptance Rate (%) : num [1:100] 43.4 48.8 60.7 44.4 63 ...
## $ Diversity Index : num [1:100] 0.52 0.8 0.59 0.52 0.73 0.88 0.66 0.64 0.4 0.53 ...
## $ Financial Aid (%) : num [1:100] 61.5 61.4 89.6 54.3 87.8 ...
## $ Avg Financial Aid (USD) : num [1:100] 19822 19564 6467 12357 15713 ...
## $ Region : chr [1:100] "Central" "Upper Peninsula" "Northeast" "Upper Peninsula" ...
## $ Policy Notes : chr [1:100] "Policy applies to all MI high school grads" "Tuition equity not offered" "Policy applies to all MI high school grads" "Tuition equity not offered" ...
## $ Last Updated : POSIXct[1:100], format: "2022-01-02" "2022-01-09" ...
The structure output reveals a well-organized tibble with 100 observations and 20 variables representing detailed university-level data across Michigan. It includes numeric features such as In-State Tuition, Out-of-State Tuition, SAT/ACT scores, Enrollment figures, and Financial Aid metrics, offering rich quantitative insights for modeling. Categorical features like Department, Region, Tuition Equity Policy, and Offers In-State Tuition to DACA Students provide policy and institutional context. The presence of policy notes and update timestamps suggests the data is designed for analyzing tuition equity and institutional characteristics. This structure confirms the dataset is suitable for both descriptive and predictive analytics.
# Clean column names for safe use
colnames(Final_Data) <- make.names(colnames(Final_Data))
# Convert character to factors
Final_Data$Offers.In.State.Tuition.to.DACA.Students <- as.factor(Final_Data$Offers.In.State.Tuition.to.DACA.Students)
Final_Data$Tuition.Equity.Policy <- as.factor(Final_Data$Tuition.Equity.Policy)
Final_Data$Region <- as.factor(Final_Data$Region)
Final_Data$Department <- as.factor(Final_Data$Department)
summary(Final_Data)
## University.ID University.Name Department In.State.Tuition..USD.
## Length:100 Length:100 Arts :15 Min. : 9313
## Class :character Class :character Business :18 1st Qu.:11396
## Mode :character Mode :character Education :25 Median :13560
## Engineering:18 Mean :13776
## Sciences :24 3rd Qu.:16171
## Max. :17974
##
## Out.of.State.Tuition..USD. Offers.In.State.Tuition.to.DACA.Students
## Min. :18462 No :50
## 1st Qu.:25012 Yes:50
## Median :36538
## Mean :35658
## 3rd Qu.:44597
## Max. :54830
##
## Tuition.Equity.Policy Year.Policy.Adopted Total.Enrollment
## No :32 Min. :2012 Min. : 2397
## Partial:31 1st Qu.:2014 1st Qu.:14406
## Yes :37 Median :2017 Median :26945
## Mean :2017 Mean :26313
## 3rd Qu.:2020 3rd Qu.:36932
## Max. :2022 Max. :49394
## NA's :2
## Undergrad.Enrollment Grad.Enrollment Average.SAT.Score Average.ACT.Score
## Min. : 1055 Min. : 528 Min. : 912 Min. :17.00
## 1st Qu.: 6791 1st Qu.: 4986 1st Qu.:1035 1st Qu.:20.00
## Median :13541 Median : 8308 Median :1132 Median :23.00
## Mean :14249 Mean : 8228 Mean :1163 Mean :23.57
## 3rd Qu.:21936 3rd Qu.:11777 3rd Qu.:1325 3rd Qu.:27.25
## Max. :29841 Max. :14986 Max. :1449 Max. :31.00
##
## Acceptance.Rate.... Diversity.Index Financial.Aid.... Avg.Financial.Aid..USD.
## Min. :40.54 Min. :0.3000 Min. :51.14 Min. : 4142
## 1st Qu.:55.75 1st Qu.:0.4575 1st Qu.:64.65 1st Qu.: 9394
## Median :65.96 Median :0.6250 Median :74.80 Median :12676
## Mean :66.04 Mean :0.6110 Mean :73.48 Mean :12604
## 3rd Qu.:78.28 3rd Qu.:0.7800 3rd Qu.:82.44 3rd Qu.:16166
## Max. :89.95 Max. :0.9000 Max. :94.32 Max. :19971
##
## Region Policy.Notes Last.Updated
## Central :24 Length:100 Min. :2022-01-02 00:00:00
## Northeast :16 Class :character 1st Qu.:2022-06-24 06:00:00
## Southeast :19 Mode :character Median :2022-12-14 12:00:00
## Upper Peninsula:22 Mean :2022-12-14 12:00:00
## West :19 3rd Qu.:2023-06-05 18:00:00
## Max. :2023-11-26 00:00:00
##
The summary statistics provide a high-level overview of the tuition equity dataset. In-state tuition ranges from around $9,313 to $17,974, while out-of-state tuition is significantly higher, ranging from $18,462 to $54,830. Enrollment numbers vary widely, with total enrollments between 2,397 and 49,394 students. Academic indicators like SAT and ACT scores show substantial variability across institutions, with median SAT scores around 1132 and ACT scores around 23. Approximately half of the universities offer in-state tuition to DACA students, and tuition equity policies are distributed among “Yes” (37%), “Partial” (31%), and “No” (32%). Additionally, average financial aid awards and diversity indices also display considerable variation, suggesting heterogeneity in institutional support and student demographics across Michigan public universities.
# Compute and visualize the correlation matrix of numeric variables using ggcorrplot
numeric_vars <- Final_Data %>% select(where(is.numeric))
corr_matrix <- cor(numeric_vars, use = "complete.obs")
# Convert to a tidy format
corr_table <- as.data.frame(round(corr_matrix, 2))
corr_table
ggcorrplot(corr_matrix,
lab = FALSE, # hide the correlation values
title = "Correlation Matrix",
hc.order = TRUE,
type = "lower")
The correlation matrix heatmap reveals how strongly numeric variables
are linearly related in the tuition equity dataset. As expected,
In-State Tuition and Out-of-State Tuition show a weak positive
correlation (~0.14), suggesting some alignment in pricing strategies
across institutions. Total Enrollment and Undergrad Enrollment are
positively related, while Grad Enrollment shows a slight negative
relationship with in-state tuition. Interestingly, Diversity Index has a
small positive correlation with out-of-state tuition and year of policy
adoption, implying that more diverse schools may have more recent equity
policies and slightly higher tuition. Overall, the matrix helps identify
potential predictors for modeling tuition and policy-related
outcomes.
# Generate histograms for key numeric variables to examine their distributions
key_vars <- c("In.State.Tuition..USD.", "Out.of.State.Tuition..USD.",
"Average.SAT.Score", "Average.ACT.Score", "Avg.Financial.Aid..USD.")
for (col in key_vars) {
print(
ggplot(Final_Data, aes_string(x = col)) +
geom_histogram(fill = "steelblue", bins = 30, color = "black") +
theme_minimal() +
labs(title = paste("Histogram of", col), x = col, y = "Frequency")
)
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The histograms above illustrate the distributions of key numeric
variables in the tuition equity dataset. Both In-State and Out-of-State
Tuition show moderately right-skewed distributions, indicating that
while most institutions charge moderate tuition, a few charge
substantially higher amounts. The Average SAT and ACT Scores reveal
fairly normal distributions with slight variation, reflecting academic
competitiveness across universities. Lastly, Average Financial Aid (USD)
is also right-skewed, suggesting that while most students receive modest
aid, a few institutions offer significantly higher awards. These
visualizations help identify variable ranges, skewness, and potential
outliers, all of which are essential for guiding model assumptions and
data preprocessing.
# Bar plots of categorical variables to visualize their frequency distributions
cat_vars <- c("University.Name", "Department", "Region",
"Offers.In.State.Tuition.to.DACA.Students", "Tuition.Equity.Policy")
for (col in cat_vars) {
print(
ggplot(Final_Data, aes_string(x = col)) +
geom_bar(fill = "darkgreen") +
theme_minimal() +
coord_flip() +
labs(title = paste("Bar Plot of", col), x = col, y = "Count")
)
}
The bar plots provide insights into the distribution of categorical
variables across the dataset. The Department variable shows a relatively
balanced spread among categories like Sciences, Business, Education,
Arts, and Engineering. The Region variable reflects geographic
diversity, with universities distributed across Central, Upper
Peninsula, Northeast, Southeast, and West regions of Michigan. When
examining policy-related factors, the Offers In-State Tuition to DACA
Students variable is evenly split between “Yes” and “No,” suggesting
variation in institutional support for DACA students. Similarly, the
Tuition Equity Policy is distributed across “Yes,” “No,” and “Partial,”
indicating differing levels of commitment to tuition equity across
universities. These visualizations help contextualize policy decisions
and institutional classifications within the dataset.
# Boxplots to visualize how tuition costs vary by policy and region
ggplot(Final_Data, aes(x = Tuition.Equity.Policy, y = Out.of.State.Tuition..USD., fill = Tuition.Equity.Policy)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Out-of-State Tuition by Equity Policy")
ggplot(Final_Data, aes(x = Region, y = In.State.Tuition..USD., fill = Region)) +
geom_boxplot() +
theme_minimal() +
labs(title = "In-State Tuition by Region")
The boxplot comparing Out-of-State Tuition by Equity Policy shows that
institutions with no equity policy tend to have slightly lower median
tuition compared to those with partial or full policies. However, there
is substantial overlap in distributions, suggesting that equity policy
alone may not fully explain tuition differences.
In the In-State Tuition by Region plot, institutions in the West and Upper Peninsula regions generally show higher median in-state tuition, while those in the Northeast and Central regions tend to have lower values. This suggests potential regional disparities in tuition pricing across Michigan’s public universities.
# Scatter plots to explore the relationship between enrollment and tuition costs
ggplot(Final_Data, aes(x = Total.Enrollment, y = Out.of.State.Tuition..USD.)) +
geom_point(color = "red") +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
theme_minimal() +
labs(title = "Out-of-State Tuition vs Total Enrollment")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(Final_Data, aes(x = Undergrad.Enrollment, y = In.State.Tuition..USD.)) +
geom_point(color = "darkorange") +
theme_minimal() +
labs(title = "In-State Tuition vs Undergrad Enrollment")
The first scatter plot shows a weak positive relationship between total
enrollment and out-of-state tuition, suggesting that universities with
more students may charge slightly higher out-of-state fees. However, the
spread is wide, indicating variability. The second plot shows no strong
trend between undergrad enrollment and in-state tuition, implying that
in-state rates remain relatively stable regardless of undergraduate
population size.
# Boxplots comparing financial aid amount and percentage across different tuition equity policy levels
ggplot(Final_Data, aes(x = Tuition.Equity.Policy, y = Avg.Financial.Aid..USD., fill = Tuition.Equity.Policy)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Average Financial Aid by Equity Policy")
ggplot(Final_Data, aes(x = Tuition.Equity.Policy, y = Financial.Aid...., fill = Tuition.Equity.Policy)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Financial Aid % by Equity Policy")
The boxplots reveal how financial aid differs by tuition equity policy.
Universities with full equity policies (“Yes”) tend to offer higher
average financial aid and a greater percentage of students receiving aid
compared to those with “No” or “Partial” policies. This suggests that
institutions with stronger equity commitments may also invest more in
financial support.
# Boxplots comparing SAT and ACT scores across departments
ggplot(Final_Data, aes(x = Department, y = Average.SAT.Score, fill = Department)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Average SAT Score by Department")
ggplot(Final_Data, aes(x = Department, y = Average.ACT.Score, fill = Department)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Average ACT Score by Department")
The boxplots illustrate department-level differences in standardized
test scores. Business and Education departments generally show higher
median SAT and ACT scores, while Engineering and Arts display wider
variability and more outliers. These visualizations help compare
academic performance expectations across different fields of study
within Michigan’s public universities.
# Remove NAs for modeling
Final_Model <- Final_Data %>% drop_na()
# Split data
set.seed(123)
tuition_split <- initial_split(Final_Data, prop = 0.8)
tuition_train <- training(tuition_split)
tuition_test <- testing(tuition_split)
head(tuition_train)
summary(tuition_train)
## University.ID University.Name Department In.State.Tuition..USD.
## Length:80 Length:80 Arts :14 Min. : 9313
## Class :character Class :character Business :15 1st Qu.:11434
## Mode :character Mode :character Education :21 Median :14315
## Engineering:11 Mean :13978
## Sciences :19 3rd Qu.:16245
## Max. :17910
##
## Out.of.State.Tuition..USD. Offers.In.State.Tuition.to.DACA.Students
## Min. :18780 No :41
## 1st Qu.:25078 Yes:39
## Median :38124
## Mean :36624
## 3rd Qu.:46774
## Max. :54830
##
## Tuition.Equity.Policy Year.Policy.Adopted Total.Enrollment
## No :27 Min. :2012 Min. : 2397
## Partial:22 1st Qu.:2014 1st Qu.:16101
## Yes :31 Median :2016 Median :26789
## Mean :2016 Mean :26256
## 3rd Qu.:2020 3rd Qu.:35091
## Max. :2022 Max. :49394
## NA's :1
## Undergrad.Enrollment Grad.Enrollment Average.SAT.Score Average.ACT.Score
## Min. : 1055 Min. : 528 Min. : 929 Min. :17.00
## 1st Qu.: 7226 1st Qu.: 4348 1st Qu.:1035 1st Qu.:20.00
## Median :13755 Median : 8511 Median :1146 Median :23.00
## Mean :14393 Mean : 8116 Mean :1169 Mean :23.76
## 3rd Qu.:21936 3rd Qu.:11330 3rd Qu.:1328 3rd Qu.:28.00
## Max. :29535 Max. :14986 Max. :1449 Max. :31.00
##
## Acceptance.Rate.... Diversity.Index Financial.Aid.... Avg.Financial.Aid..USD.
## Min. :40.54 Min. :0.3000 Min. :51.14 Min. : 4142
## 1st Qu.:56.65 1st Qu.:0.4475 1st Qu.:64.65 1st Qu.: 9394
## Median :67.62 Median :0.6250 Median :75.08 Median :12423
## Mean :67.07 Mean :0.6092 Mean :73.45 Mean :12192
## 3rd Qu.:80.29 3rd Qu.:0.7800 3rd Qu.:82.98 3rd Qu.:15245
## Max. :89.95 Max. :0.9000 Max. :93.85 Max. :19664
##
## Region Policy.Notes Last.Updated
## Central :17 Length:80 Min. :2022-01-16 00:00:00
## Northeast :14 Class :character 1st Qu.:2022-06-24 06:00:00
## Southeast :12 Mode :character Median :2022-12-25 00:00:00
## Upper Peninsula:20 Mean :2022-12-22 17:24:00
## West :17 3rd Qu.:2023-06-19 18:00:00
## Max. :2023-11-26 00:00:00
##
# Create a modeling version with cleaner names
Final_Model <- Final_Data %>%
rename(
out_tuition = `Out.of.State.Tuition..USD.`,
in_tuition = `In.State.Tuition..USD.`,
sat = `Average.SAT.Score`,
act = `Average.ACT.Score`,
aid_pct = `Financial.Aid....`,
total_enroll = `Total.Enrollment`,
region = Region,
daca_policy = `Offers.In.State.Tuition.to.DACA.Students`,
policy = `Tuition.Equity.Policy`
) %>%
mutate(
daca_policy = as.factor(daca_policy),
policy = as.factor(policy),
region = as.factor(region)
) %>%
drop_na()
set.seed(1)
tuition_split <- initial_split(Final_Model, prop = 0.8)
tuition_train <- training(tuition_split)
tuition_test <- testing(tuition_split)
tuition_recipe <- recipe(out_tuition ~ sat + aid_pct + total_enroll + policy, data = tuition_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
lm_spec <- linear_reg() %>% set_engine("lm")
lm_workflow <- workflow() %>%
add_recipe(tuition_recipe) %>%
add_model(lm_spec)
lm_fit <- lm_workflow %>% fit(data = tuition_train)
# Evaluate
lm_preds <- predict(lm_fit, tuition_test) %>%
bind_cols(tuition_test %>% select(out_tuition))
metrics(lm_preds, truth = out_tuition, estimate = .pred)
# View model coefficients and inference info
tidy(lm_fit)
# Visualization: Actual vs Predicted Out-of-State Tuition
ggplot(lm_preds, aes(x = out_tuition, y = .pred)) +
geom_point(color = "darkgreen") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
theme_minimal() +
labs(title = "Actual vs Predicted Out-of-State Tuition",
x = "Actual Tuition",
y = "Predicted Tuition")
# Visualization: Residuals vs Fitted Values
lm_preds <- lm_preds %>%
mutate(residuals = out_tuition - .pred)
ggplot(lm_preds, aes(x = .pred, y = residuals)) +
geom_point(color = "steelblue") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
theme_minimal() +
labs(title = "Residuals vs Fitted Values",
x = "Predicted Tuition",
y = "Residuals")
This analysis demonstrates the foundation of statistical modeling through a linear regression model using maximum likelihood estimation (MLE). By splitting the dataset into training and testing sets and preprocessing with recipes, we built a model to predict out-of-state tuition based on key predictors like SAT score, financial aid percentage, enrollment, and policy type. The model coefficients obtained via tidy() offer interpretable estimates, while the residual and prediction plots support model evaluation and inference. Overall, this approach aligns with Objective 1 by applying probability theory and MLE to make data-driven predictions and assess uncertainty.
# Polynomial regression with SAT² and ACT²
set.seed(100)
poly_recipe <- recipe(out_tuition ~ sat + act + aid_pct + total_enroll, data = tuition_train) %>%
step_poly(sat, degree = 2) %>%
step_poly(act, degree = 2) %>%
step_normalize(all_numeric_predictors())
poly_spec <- linear_reg() %>% set_engine("lm")
poly_wf <- workflow() %>%
add_recipe(poly_recipe) %>%
add_model(poly_spec)
poly_fit <- poly_wf %>% fit(data = tuition_train)
# Evaluate polynomial model
poly_preds <- predict(poly_fit, new_data = tuition_test) %>%
bind_cols(tuition_test %>% select(out_tuition))
metrics(poly_preds, truth = out_tuition, estimate = .pred)
# Plot residuals vs predicted values
poly_preds <- poly_preds %>%
mutate(residuals = out_tuition - .pred)
ggplot(poly_preds, aes(x = .pred, y = residuals)) +
geom_point(color = "darkblue") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
theme_minimal() +
labs(title = "Residuals vs Fitted Values",
x = "Predicted Tuition",
y = "Residuals")
# Plot actual vs predicted values
ggplot(poly_preds, aes(x = out_tuition, y = .pred)) +
geom_point(color = "darkgreen") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
theme_minimal() +
labs(title = "Actual vs Predicted Tuition",
x = "Actual Tuition",
y = "Predicted Tuition")
# Histogram of residuals
ggplot(poly_preds, aes(x = residuals)) +
geom_histogram(fill = "purple", bins = 30, color = "black") +
theme_minimal() +
labs(title = "Distribution of Residuals",
x = "Residuals",
y = "Frequency")
# Q-Q plot to assess normality of residuals
library(qqplotr)
##
## Attaching package: 'qqplotr'
## The following objects are masked from 'package:ggplot2':
##
## stat_qq_line, StatQqLine
ggplot(poly_preds, aes(sample = residuals)) +
stat_qq_point(size = 2, color = "steelblue") +
stat_qq_line(color = "red") +
theme_minimal() +
labs(title = "Q-Q Plot of Residuals")
The polynomial regression model explores the nonlinear effects of SAT
and ACT scores on out-of-state tuition. Evaluation metrics show an RMSE
of approximately $11,315 and an R² of 0.16, suggesting a modest fit. The
residual plot shows randomness around zero, while the Q-Q plot indicates
some departure from normality. Although prediction accuracy is limited,
the model captures some underlying patterns, affirming that nonlinear
test score relationships contribute to tuition variability. Further
refinements could improve the model’s performance.
set.seed(2)
daca_split <- initial_split(Final_Model, prop = 0.8, strata = daca_policy)
daca_train <- training(daca_split)
daca_test <- testing(daca_split)
daca_recipe <- recipe(daca_policy ~ sat + act + total_enroll + region + policy + aid_pct, data = daca_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
rf_spec <- rand_forest(mtry = 3, trees = 500) %>%
set_engine("ranger") %>%
set_mode("classification")
rf_workflow <- workflow() %>%
add_recipe(daca_recipe) %>%
add_model(rf_spec)
rf_fit <- rf_workflow %>% fit(data = daca_train)
rf_preds <- predict(rf_fit, new_data = daca_test, type = "prob") %>%
bind_cols(predict(rf_fit, new_data = daca_test)) %>%
bind_cols(daca_test %>% select(daca_policy))
# Metrics
metrics(rf_preds, truth = daca_policy, estimate = .pred_class)
conf_mat(rf_preds, truth = daca_policy, estimate = .pred_class)
## Truth
## Prediction No Yes
## No 4 6
## Yes 6 4
library(nnet)
# Prepare data: filter for rows with non-missing Department
multi_data <- Final_Model %>% filter(!is.na(Department))
set.seed(101)
multi_split <- initial_split(multi_data, prop = 0.8, strata = Department)
multi_train <- training(multi_split)
multi_test <- testing(multi_split)
multi_recipe <- recipe(Department ~ sat + act + region + aid_pct + total_enroll, data = multi_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
multi_spec <- multinom_reg() %>%
set_engine("nnet") %>%
set_mode("classification")
multi_wf <- workflow() %>%
add_recipe(multi_recipe) %>%
add_model(multi_spec)
multi_fit <- multi_wf %>% fit(data = multi_train)
multi_preds <- predict(multi_fit, multi_test) %>%
bind_cols(multi_test %>% select(Department))
metrics(multi_preds, truth = Department, estimate = .pred_class)
conf_mat(multi_preds, truth = Department, estimate = .pred_class)
## Truth
## Prediction Arts Business Education Engineering Sciences
## Arts 0 0 0 2 0
## Business 1 1 1 0 0
## Education 1 1 2 0 1
## Engineering 1 2 1 2 1
## Sciences 0 0 1 0 3
multi_preds %>%
conf_mat(truth = Department, estimate = .pred_class) %>%
autoplot(type = "heatmap") +
scale_fill_gradient(low = "#fde0dd", high = "#67000d") + # light peach to dark red
labs(
title = "Confusion Matrix Heatmap",
subtitle = "Multinomial Logistic Regression - Department",
fill = "Count"
) +
theme_minimal()
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
multi_preds %>%
count(Department, .pred_class) %>%
ggplot(aes(x = Department, y = n, fill = .pred_class)) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
labs(title = "Predicted vs Actual Department",
x = "Actual Department",
y = "Count",
fill = "Predicted")
multi_preds <- multi_preds %>%
mutate(correct = ifelse(Department == .pred_class, "Correct", "Incorrect"))
ggplot(multi_preds, aes(x = Department, fill = correct)) +
geom_bar(position = "dodge") +
theme_minimal() +
labs(title = "Classification Accuracy by Department",
x = "Department",
y = "Count",
fill = "Prediction Status")
library(discrim)
##
## Attaching package: 'discrim'
## The following object is masked from 'package:dials':
##
## smoothness
# LDA to predict Region
set.seed(102)
lda_split <- initial_split(Final_Model, prop = 0.8, strata = region)
lda_train <- training(lda_split)
lda_test <- testing(lda_split)
lda_recipe <- recipe(region ~ sat + act + Diversity.Index + aid_pct, data = lda_train) %>%
step_normalize(all_numeric_predictors())
lda_spec <- discrim_linear() %>%
set_engine("MASS") %>%
set_mode("classification")
lda_wf <- workflow() %>%
add_recipe(lda_recipe) %>%
add_model(lda_spec)
lda_fit <- lda_wf %>% fit(data = lda_train)
lda_preds <- predict(lda_fit, lda_test) %>%
bind_cols(lda_test %>% select(region))
metrics(lda_preds, truth = region, estimate = .pred_class)
conf_mat(lda_preds, truth = region, estimate = .pred_class)
## Truth
## Prediction Central Northeast Southeast Upper Peninsula West
## Central 0 2 1 1 1
## Northeast 2 0 0 1 0
## Southeast 2 1 2 2 1
## Upper Peninsula 1 0 1 1 2
## West 0 0 0 0 0
lda_preds %>%
conf_mat(truth = region, estimate = .pred_class) %>%
autoplot(type = "heatmap") +
scale_fill_gradient(low = "#deebf7", high = "#08306b") + # light blue to navy
labs(
title = "LDA: Confusion Matrix Heatmap",
subtitle = "True vs Predicted Regions",
fill = "Count"
) +
theme_minimal()
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
lda_preds %>%
count(region, .pred_class) %>%
ggplot(aes(x = region, y = n, fill = .pred_class)) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
labs(title = "LDA: Actual vs Predicted Region",
x = "Actual Region",
y = "Count",
fill = "Predicted")
lda_preds <- lda_preds %>%
mutate(correct = ifelse(region == .pred_class, "Correct", "Incorrect"))
ggplot(lda_preds, aes(x = region, fill = correct)) +
geom_bar(position = "dodge") +
theme_minimal() +
labs(title = "LDA: Correct vs Incorrect Classification by Region",
x = "Region",
y = "Count",
fill = "Prediction Accuracy")
# Add posterior probabilities
lda_probs <- predict(lda_fit, lda_test, type = "prob") %>%
bind_cols(lda_test %>% select(region))
# Max predicted probability per row
lda_probs <- lda_probs %>%
mutate(max_prob = apply(select(., starts_with(".pred_")), 1, max))
ggplot(lda_probs, aes(x = max_prob)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
theme_minimal() +
labs(title = "LDA: Distribution of Maximum Posterior Probabilities",
x = "Max Posterior Probability",
y = "Count")
# Ridge regression to predict out_tuition
set.seed(103)
ridge_spec <- linear_reg(penalty = tune(), mixture = 0) %>%
set_engine("glmnet") %>%
set_mode("regression")
ridge_recipe <- recipe(out_tuition ~ sat + aid_pct + total_enroll + policy, data = tuition_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
ridge_wf <- workflow() %>%
add_recipe(ridge_recipe) %>%
add_model(ridge_spec)
set.seed(104)
ridge_res <- tune_grid(
ridge_wf,
resamples = vfold_cv(tuition_train, v = 5),
grid = 10
)
show_best(ridge_res, metric = "rmse")
# Finalize and fit
best_ridge <- select_best(ridge_res, metric = "rmse")
final_ridge <- finalize_workflow(ridge_wf, best_ridge)
ridge_fit <- final_ridge %>% fit(data = tuition_train)
# Evaluate
ridge_preds <- predict(ridge_fit, tuition_test) %>%
bind_cols(tuition_test %>% select(out_tuition))
metrics(ridge_preds, truth = out_tuition, estimate = .pred)
# Add residuals to predictions
ridge_preds <- ridge_preds %>%
mutate(residuals = out_tuition - .pred)
ggplot(ridge_preds, aes(x = .pred, y = residuals)) +
geom_point(color = "darkblue") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
theme_minimal() +
labs(title = "Ridge Regression: Residuals vs Fitted Values",
x = "Predicted Tuition",
y = "Residuals")
ggplot(ridge_preds, aes(x = out_tuition, y = .pred)) +
geom_point(color = "darkgreen") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
theme_minimal() +
labs(title = "Ridge Regression: Actual vs Predicted Tuition",
x = "Actual Tuition",
y = "Predicted Tuition")
qqnorm(ridge_preds$residuals)
qqline(ridge_preds$residuals, col = "red")
ggplot(ridge_preds, aes(x = residuals)) +
geom_histogram(fill = "steelblue", bins = 30, color = "black") +
theme_minimal() +
labs(title = "Ridge Regression: Distribution of Residuals",
x = "Residuals",
y = "Frequency")
library(cluster)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Numeric features only
clust_data <- Final_Model %>%
select(sat, act, in_tuition, out_tuition, total_enroll, aid_pct) %>%
na.omit() %>%
scale()
# K-means
set.seed(3)
kmeans_fit <- kmeans(clust_data, centers = 3, nstart = 25)
fviz_cluster(kmeans_fit, data = clust_data,
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal(),
main = "University Clustering")
In Objective 3, I applied a range of models to different predictive tasks and explored their effectiveness. Random Forest classification was used to predict DACA tuition policy, though it showed low predictive accuracy. Multinomial Logistic Regression was implemented to classify academic departments, supported by confusion matrices and accuracy plots. Linear Discriminant Analysis (LDA) attempted to predict region classifications but performed poorly, as indicated by low accuracy and diffuse posterior probabilities. For regression, Ridge Regression was tuned and evaluated, yielding stable performance with visual diagnostics for residuals and prediction error. Finally, K-means clustering uncovered natural groupings among universities based on tuition, test scores, and enrollment metrics. Each model was selected to match the nature of the response variable, showcasing informed model selection across classification, regression, and clustering contexts.
The dataset analyzed consists of information on public universities in Michigan, covering variables such as in-state and out-of-state tuition, student enrollment, SAT and ACT scores, financial aid, regional location, academic department, and whether the university offers in-state tuition to DACA students or has a tuition equity policy. This broad range of variables allowed for rich exploration of both predictive and explanatory modeling tasks, providing insights relevant to higher education policy.
To begin, I conducted exploratory data analysis (EDA) to understand the distributions and relationships within the dataset. Histograms revealed that both in-state and out-of-state tuition were right-skewed, with a few universities charging significantly higher tuition than others. Bar plots showed how categorical variables like region, department, and policy status were distributed across the dataset. This helped identify potential class imbalances and informed modeling choices. For example, the equity policy variable had relatively balanced representation across “Yes,” “Partial,” and “No” categories, making it suitable for regression and classification modeling.
The use of the ggpairs() function allowed for visual inspection of relationships between numeric variables. There was a modest positive correlation between SAT/ACT scores and tuition, as well as between enrollment and financial aid percentage. These observations suggested that schools with higher academic standards and larger enrollments may also offer more aid or charge higher tuition, patterns which could be tested more formally through regression models.
Summary statistics confirmed that out-of-state tuition ranged from about $18,000 to over $54,000, while SAT scores ranged from approximately 980 to 1370. Financial aid packages varied widely, both in dollar amount and in percentage of students receiving aid. The variability across these indicators justified their inclusion in predictive modeling, as they provided the necessary variance to explore outcome relationships.
To model out-of-state tuition, I implemented a linear regression model using SAT score, financial aid percentage, total enrollment, and equity policy status as predictors. The coefficients were extracted using the tidy() function from the broom package. Notably, the coefficient for SAT score was -630, suggesting that, all else equal, a 1-point increase in SAT score corresponded to a decrease of $630 in tuition. However, this effect was not statistically significant (p = 0.64), meaning we cannot conclude a strong relationship based on this sample.
Similarly, aid percentage had a positive coefficient (~$1,494), indicating that schools offering more aid might also charge higher tuition. Total enrollment had a coefficient of approximately $1,013 per additional student, which seems implausibly high and may reflect an untransformed scale or the need for log transformation. Importantly, none of these predictors reached conventional significance thresholds, suggesting weak predictive power or multicollinearity in the model.
For categorical predictors, institutions with a “Partial” policy had tuition levels about $919 higher than those with “No” policy, while “Yes” policy schools had slightly lower tuition (-$119), but again, these results were not significant. These findings were surprising and suggest that other unmeasured variables—such as institutional funding models or location-specific cost structures—may better explain tuition variation.
Although I did not explicitly plot confidence intervals, the standard errors and p-values from the regression output provided enough information to assess estimate uncertainty. To address multicollinearity and improve model stability, I also applied ridge regression using cross-validation to tune the penalty parameter. This regularized model shrunk the coefficients toward zero and improved generalizability, though it also confirmed the relative weakness of SAT, aid, and policy status as strong individual predictors.
To help communicate findings to a general audience, I relied on visual tools throughout the analysis. Actual vs. predicted plots illustrated how well the model fit the test data, while residual plots helped assess whether the linear model assumptions were reasonable. In both linear and polynomial regression, residuals were relatively scattered, suggesting noise and possible missing variables.
For classification models, I used heatmaps of confusion matrices and accuracy plots to demonstrate model performance in predicting categorical outcomes like academic department and DACA policy. These visuals offered intuitive ways to convey which categories were predicted well and where errors were more frequent. For example, the multinomial logistic regression model misclassified departments with similar academic profiles, which is understandable but also highlights limitations.
Throughout the portfolio, I avoided overly technical jargon and aimed to explain concepts in terms that are relatable to non-statistical audiences. For example, rather than discussing statistical significance in abstract terms, I explained that “we can’t be confident there is a real relationship between SAT scores and tuition based on this model.” This kind of interpretation makes the results more accessible and actionable for stakeholders in education policy or university administration.
In summary, I used a combination of summary statistics, regression coefficients, diagnostic visuals, and classification evaluation metrics to communicate model results effectively. My focus was on clarity, transparency, and context-based interpretation rather than technical depth alone. This objective helped reinforce my ability to translate statistical findings into meaningful insights for broader audiences, which is essential for any data science or policy-oriented role.
# Extract model object from workflow
lm_engine <- extract_fit_engine(lm_fit)
# Create diagnostic dataframe
lm_diag <- augment(lm_engine)
# Residuals vs Fitted
ggplot(lm_diag, aes(.fitted, .resid)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(title = "Residuals vs Fitted Values", x = "Fitted Tuition", y = "Residuals") +
theme_minimal()
# Q-Q Plot
ggplot(lm_diag, aes(sample = .resid)) +
stat_qq() +
stat_qq_line(color = "blue") +
labs(title = "Q-Q Plot of Residuals") +
theme_minimal()
# Scale-Location Plot
ggplot(lm_diag, aes(.fitted, sqrt(abs(.std.resid)))) +
geom_point(alpha = 0.6) +
geom_smooth(se = FALSE, color = "purple") +
labs(title = "Scale-Location Plot", x = "Fitted Tuition", y = "Sqrt(|Standardized Residuals|)") +
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# Cook's Distance
lm_diag$cooksd <- cooks.distance(lm_engine)
ggplot(lm_diag, aes(x = seq_along(cooksd), y = cooksd)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Cook's Distance Plot", x = "Observation Index", y = "Cook's Distance") +
theme_minimal()
In this portfolio, I demonstrated the use of R and the Tidy Models framework to programmatically fit and assess statistical models. I implemented workflow(), recipe(), and fit() to develop models including linear regression, polynomial regression, random forest classification, multinomial logistic regression, linear discriminant analysis, ridge regression, and k-means clustering.
To ensure data readiness, I used preprocessing techniques like normalization, polynomial transformations, and dummy variable encoding. After model fitting, I applied both quantitative and visual tools for evaluation. Specifically, I used residual diagnostics for linear models, including residuals vs. fitted plots, Q-Q plots, Scale-Location plots, and Cook’s Distance plots. These visualizations allowed me to assess assumptions such as linearity, homoscedasticity, normality, and influential observations.
Across other models, performance was evaluated using appropriate classification metrics and visual summaries like confusion matrices and prediction accuracy bar plots. By combining statistical programming with model diagnostics and validation, I ensured that the models were interpretable, reproducible, and statistically robust.