Imagine a stack of 777 college brochures spread across a desk. Without opening one, can we tell which schools are private and which are public — using only the numbers printed on the back?
That is the question this report answers. Using the `College` dataset from the `ISLR` package, we build a logistic regression model with the `glm()` function in R that predicts whether a university is private (`Yes`) or public (`No`) from a handful of numeric features such as out-of-state tuition, student-to-faculty ratio, and graduation rate.
Private and public institutions differ in funding, mission, and student experience. A model that distinguishes them from observable characteristics helps counsellors route students toward the right resources and lets analysts flag directory records whose listed type looks inconsistent with the numbers.
We move through five stages: explore the data, split it 70/30, fit a logistic regression with `glm()`, evaluate it with a confusion matrix and the metrics Accuracy / Precision / Recall / Specificity, and finally judge generalisation with a ROC curve and AUC on the test set.
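The report's setup chunk is not shown; a minimal sketch of what it likely contains, inferred from the function calls used throughout:

```r
# Packages assumed by the code below (inferred, not from the original chunk)
library(ISLR)        # College dataset
library(dplyr)       # %>%, count(), mutate(), ...
library(plotly)      # interactive charts
library(corrplot)    # correlation heatmap
library(caret)       # createDataPartition(), confusionMatrix()
library(pROC)        # roc(), auc()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

df <- College        # work on a copy of the ISLR College data
```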
```r
dim(df)             # rows, columns
## [1] 777  18
table(df$Private)   # class balance
## 
##  No Yes 
## 212 565
```
The dataset contains 777 observations and 18 variables, with 565 private and 212 public universities — roughly a 73 / 27 split. Private schools dominate, which is something we must keep in mind when we read the confusion matrix later.
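Because of that imbalance, plain accuracy needs a baseline. A one-line sanity check (a sketch, using the `df` copy of the data) computes the no-information rate, i.e. the accuracy of always guessing the majority class:

```r
# Share of each class; the larger one is the floor any model must beat
prop.table(table(df$Private))       # No ~0.273, Yes ~0.727
max(prop.table(table(df$Private)))  # ~0.727: accuracy of always guessing "Yes"
```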
```r
balance <- df %>% count(Private)

plot_ly(balance, x = ~Private, y = ~n, type = "bar",
        color = ~Private,
        colors = c("No" = "#E45756", "Yes" = "#4C78A8"),
        text = ~n, textposition = "outside",
        hovertemplate = "<b>%{x}</b><br>Count: %{y}<extra></extra>") %>%
  layout(title = "How many private vs. public colleges?",
         yaxis = list(title = "Number of colleges"),
         xaxis = list(title = "Private"),
         showlegend = FALSE)
```

If you suspected that private schools cost more, the data agrees loudly. Out-of-state tuition is the single most discriminating variable in this dataset.
```r
plot_ly(df, y = ~Outstate, color = ~Private, type = "box",
        colors = c("No" = "#E45756", "Yes" = "#4C78A8"),
        boxpoints = "outliers") %>%
  layout(title = "Out-of-state Tuition by School Type",
         yaxis = list(title = "Out-of-state tuition (USD)"),
         xaxis = list(title = "Private"))
```

The two boxes barely overlap. Public colleges cluster around the lower end while private colleges sit much higher — a gap of roughly $5,000. Any reasonable model is going to lean heavily on this column.
Private schools also tend to be smaller and more attentive, with fewer students per faculty member.
```r
plot_ly(df, y = ~S.F.Ratio, color = ~Private, type = "violin",
        box = list(visible = TRUE), meanline = list(visible = TRUE),
        colors = c("No" = "#E45756", "Yes" = "#4C78A8")) %>%
  layout(title = "Student/Faculty Ratio Distribution",
         yaxis = list(title = "Students per faculty"),
         xaxis = list(title = "Private"))
```

The violin shape for public colleges sits noticeably higher (more crowded classrooms), reinforcing the intuition that this variable carries useful classification signal.
A scatterplot of tuition against graduation rate makes the two clouds of points visually distinct. Hover over any dot to see the underlying values.
```r
plot_ly(df, x = ~Outstate, y = ~Grad.Rate, color = ~Private,
        colors = c("No" = "#E45756", "Yes" = "#4C78A8"),
        type = "scatter", mode = "markers",
        marker = list(size = 7, opacity = 0.6),
        text = ~paste0("Tuition: $", Outstate,
                       "<br>Grad rate: ", Grad.Rate, "%",
                       "<br>Type: ", Private),
        hoverinfo = "text") %>%
  layout(title = "Out-of-state Tuition vs. Graduation Rate",
         xaxis = list(title = "Out-of-state tuition (USD)"),
         yaxis = list(title = "Graduation rate (%)"))
```

Private schools (blue) drift toward the upper-right quadrant — higher tuition and higher graduation rates — while public schools (red) sit lower and to the left.
```r
num_df <- df %>% select(where(is.numeric))
cor_mat <- cor(num_df)

corrplot(cor_mat, method = "color", type = "lower",
         tl.cex = 0.7, tl.col = "black",
         addCoef.col = "black", number.cex = 0.55,
         col = colorRampPalette(c("#E45756", "white", "#4C78A8"))(200),
         title = "Correlation Matrix of Numeric Predictors",
         mar = c(0, 0, 2, 0))
```

A few clusters of strong correlation jump out — `Apps`, `Accept`, `Enroll`, and `F.Undergrad` all measure school size and move together. We will avoid piling all of them into the model to limit multicollinearity.
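One quick way to check that the six predictors we are about to use avoid this trap is to look at their pairwise correlations directly (a sketch; the column list anticipates the model fitted below, and `car::vif()` on the fitted model would be an equivalent post-fit check):

```r
# Pairwise correlations among the six predictors used in the model below.
# Unlike the Apps/Accept/Enroll/F.Undergrad size cluster, none of these
# pairs should be correlated strongly enough to destabilise the fit.
round(cor(df[, c("Outstate", "F.Undergrad", "S.F.Ratio",
                 "perc.alumni", "Grad.Rate", "Expend")]), 2)
```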
A 70/30 stratified split keeps the original class balance in both subsets, which is important when the classes are imbalanced.
```r
set.seed(3456)

# Stratified split: sampling within each level of Private preserves the
# original class balance in both subsets
trainIndex <- createDataPartition(df$Private, p = 0.7,
                                  list = FALSE, times = 1)
train <- df[ trainIndex, ]
test  <- df[-trainIndex, ]

data.frame(
  Set = c("Train", "Test"),
  N = c(nrow(train), nrow(test)),
  `% Private` = c(round(mean(train$Private == "Yes") * 100, 1),
                  round(mean(test$Private == "Yes") * 100, 1)),
  check.names = FALSE
) %>%
  kable(caption = "Train / Test sizes and class balance") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```

| Set | N | % Private |
|---|---|---|
| Train | 545 | 72.7 |
| Test | 232 | 72.8 |
The train set has 545 colleges, the test set 232 — and both keep the roughly 73 % private share, so neither side is starved of one class.
We fit a logistic regression with six predictors chosen for a mix of price, size, prestige, and resources: `Outstate`, `F.Undergrad`, `S.F.Ratio`, `perc.alumni`, `Grad.Rate`, and `Expend`.
```r
# Logistic regression: Private (Yes/No) modelled on six numeric predictors
model <- glm(Private ~ Outstate + F.Undergrad + S.F.Ratio +
               perc.alumni + Grad.Rate + Expend,
             data = train, family = binomial(link = "logit"))
summary(model)
```

```
## 
## Call:
## glm(formula = Private ~ Outstate + F.Undergrad + S.F.Ratio + 
##     perc.alumni + Grad.Rate + Expend, family = binomial(link = "logit"), 
##     data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.052e+00  1.602e+00  -1.281   0.2003    
## Outstate     5.912e-04  9.444e-05   6.260 3.85e-10 ***
## F.Undergrad -5.645e-04  7.431e-05  -7.597 3.03e-14 ***
## S.F.Ratio   -9.754e-02  6.717e-02  -1.452   0.1465    
## perc.alumni  4.616e-02  2.112e-02   2.186   0.0288 *  
## Grad.Rate    8.589e-03  1.138e-02   0.755   0.4504    
## Expend       1.585e-05  1.038e-04   0.153   0.8787    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 639.40  on 544  degrees of freedom
## Residual deviance: 204.71  on 538  degrees of freedom
## AIC: 218.71
## 
## Number of Fisher Scoring iterations: 7
```
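The deviance drop from 639.40 (null) to 204.71 (residual) is itself informative. One common way to summarise it is McFadden's pseudo-R² — a sketch using quantities the fitted `glm` object already stores:

```r
# McFadden's pseudo-R^2: 1 - residual deviance / null deviance
1 - model$deviance / model$null.deviance   # 1 - 204.71/639.40 ~ 0.68
```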
Raw `glm()` coefficients are hard to compare because each predictor lives on its own scale (tuition in dollars, ratio in students-per-faculty, etc.). The plot below standardises each coefficient by multiplying it by the standard deviation of its predictor, so we can see, on a single axis, which variables actually move the prediction. Significance stars come from the model's own p-values.
```r
# Pull coefficients (dropping the intercept) with their SEs and p-values
coef_df <- data.frame(
  term = names(coef(model))[-1],
  estimate = coef(model)[-1],
  stderr = summary(model)$coefficients[-1, "Std. Error"],
  pval = summary(model)$coefficients[-1, "Pr(>|z|)"]
)

# Scale each coefficient by its predictor's SD: the effect of a 1 SD change
sds <- sapply(train[, coef_df$term], sd)
coef_df$std_estimate <- coef_df$estimate * sds
coef_df$std_se <- coef_df$stderr * sds
coef_df$sig <- cut(coef_df$pval,
                   breaks = c(-Inf, 0.001, 0.01, 0.05, Inf),
                   labels = c("***", "**", "*", ""))

coef_df <- coef_df %>% arrange(std_estimate)
coef_df$term <- factor(coef_df$term, levels = coef_df$term)
coef_df$direction <- ifelse(coef_df$std_estimate > 0,
                            "Pushes toward Private",
                            "Pushes toward Public")

plot_ly(coef_df, x = ~std_estimate, y = ~term, type = "bar",
        orientation = "h",
        color = ~direction,
        colors = c("Pushes toward Private" = "#4C78A8",
                   "Pushes toward Public" = "#E45756"),
        error_x = list(type = "data",
                       array = ~1.96 * std_se,
                       color = "#444"),
        text = ~paste0(term, " ", sig,
                       "<br>Std. effect: ", round(std_estimate, 3),
                       "<br>p-value: ", signif(pval, 3)),
        hoverinfo = "text") %>%
  layout(title = "Standardised Logistic Coefficients (per 1 SD)",
         xaxis = list(title = "Change in log-odds(Private)",
                      zeroline = TRUE, zerolinecolor = "black"),
         yaxis = list(title = ""))
```

What this shows. `Outstate` is by far the strongest pull toward Private; `F.Undergrad` is the strongest pull toward Public. `perc.alumni` helps a bit, while `S.F.Ratio`, `Grad.Rate` and `Expend` contribute very little once tuition and undergraduate size are in the model. This matches the EDA story exactly.
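Log-odds are abstract to read, so it can help to exponentiate the standardised effects into per-1-SD odds ratios (a short sketch reusing the `coef_df` built above):

```r
# Per-1-SD odds ratios: values above 1 push toward Private, values below 1
# toward Public; exact numbers depend on the fitted model
setNames(round(exp(coef_df$std_estimate), 2), as.character(coef_df$term))
```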
```r
# Predicted probabilities on the training set, thresholded at 0.5
prob_train <- predict(model, newdata = train, type = "response")
pred_train <- factor(ifelse(prob_train >= 0.5, "Yes", "No"),
                     levels = c("No", "Yes"))
cm_train <- confusionMatrix(pred_train, train$Private, positive = "Yes")
cm_train
```

```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  129  10
##        Yes  20 386
## 
##                Accuracy : 0.945
##                  95% CI : (0.9223, 0.9626)
##     No Information Rate : 0.7266
##     P-Value [Acc > NIR] : <2e-16
## 
##                   Kappa : 0.8585
## 
##  Mcnemar's Test P-Value : 0.1003
## 
##             Sensitivity : 0.9747
##             Specificity : 0.8658
##          Pos Pred Value : 0.9507
##          Neg Pred Value : 0.9281
##              Prevalence : 0.7266
##          Detection Rate : 0.7083
##    Detection Prevalence : 0.7450
##       Balanced Accuracy : 0.9203
## 
##        'Positive' Class : Yes
## 
```
```r
# Visualise the training confusion matrix as a heatmap
cm_df <- as.data.frame(cm_train$table)

plot_ly(cm_df, x = ~Reference, y = ~Prediction, z = ~Freq, type = "heatmap",
        colorscale = list(c(0, "#FFFFFF"), c(1, "#4C78A8")),
        text = ~Freq, texttemplate = "%{text}",
        hovertemplate = "Actual: %{x}<br>Predicted: %{y}<br>Count: %{z}<extra></extra>") %>%
  layout(title = "Train Confusion Matrix",
         xaxis = list(title = "Actual"),
         yaxis = list(title = "Predicted"))
```

Out of 545 training colleges, the model is wrong on only 30. The diagonal — 129 true negatives (correctly identified public) and 386 true positives (correctly identified private) — is overwhelmingly populated. Off the diagonal we have:

- 20 false positives: public colleges the model calls private
- 10 false negatives: private colleges the model calls public

So which error should worry us more?
That depends on what the prediction is used for.
False Positives are the more damaging error in this context. The dataset is dominated by private schools (73 %), so a model that over-predicts the majority class can look artificially accurate while quietly mislabelling the smaller, harder-to-detect public group. If a counsellor used the model to route applicants toward private-school financial-aid resources, false positives would send students down the wrong aid pathway — potentially costing them tuition discounts they actually qualify for. False negatives are the mirror mistake but generally less costly.
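If false positives are the costlier mistake, the 0.5 cutoff is a choice, not a law. The sketch below (reusing `prob_train` from above; the report itself stays with 0.5) shows how stricter thresholds trade false positives for false negatives:

```r
# Confusion counts at stricter cutoffs: higher thresholds predict "Yes"
# less often, reducing false positives at the cost of false negatives
for (cut in c(0.5, 0.7, 0.9)) {
  pred <- factor(ifelse(prob_train >= cut, "Yes", "No"),
                 levels = c("No", "Yes"))
  cat("threshold =", cut, "\n")
  print(table(Predicted = pred, Actual = train$Private))
}
```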
```r
# Collect the four headline metrics from the caret output
metrics_train <- data.frame(
  Metric = c("Accuracy", "Precision (PPV)", "Recall (Sensitivity)", "Specificity"),
  Value = c(cm_train$overall["Accuracy"],
            cm_train$byClass["Pos Pred Value"],
            cm_train$byClass["Sensitivity"],
            cm_train$byClass["Specificity"])
) %>% mutate(Value = round(Value, 4))

kable(metrics_train, caption = "Training-set classification metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```

| Metric | Value |
|---|---|
| Accuracy | 0.9450 |
| Precision (PPV) | 0.9507 |
| Recall (Sensitivity) | 0.9747 |
| Specificity | 0.8658 |
Training accuracy of 94.5 % comfortably beats the 73 % floor of always predicting private. Recall of 97.5 % means only ~2.5 % of true private colleges are missed. Specificity is the lower number at 86.6 % — confirming that public colleges are the slightly harder class.
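All four metrics fall straight out of the confusion-matrix cells, so they can be cross-checked by hand with the counts printed above:

```r
# From the train confusion matrix: TP = 386, TN = 129, FP = 20, FN = 10
TP <- 386; TN <- 129; FP <- 20; FN <- 10
(TP + TN) / (TP + TN + FP + FN)  # Accuracy    = 515/545 ~ 0.9450
TP / (TP + FP)                   # Precision   = 386/406 ~ 0.9507
TP / (TP + FN)                   # Recall      = 386/396 ~ 0.9747
TN / (TN + FP)                   # Specificity = 129/149 ~ 0.8658
```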
The real test of a model is fresh data.
```r
# Same pipeline on the held-out test set
prob_test <- predict(model, newdata = test, type = "response")
pred_test <- factor(ifelse(prob_test >= 0.5, "Yes", "No"),
                    levels = c("No", "Yes"))
cm_test <- confusionMatrix(pred_test, test$Private, positive = "Yes")
cm_test
```

```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No   56   8
##        Yes   7 161
## 
##                Accuracy : 0.9353
##                  95% CI : (0.8956, 0.9634)
##     No Information Rate : 0.7284
##     P-Value [Acc > NIR] : 7.952e-16
## 
##                   Kappa : 0.8374
## 
##  Mcnemar's Test P-Value : 1
## 
##             Sensitivity : 0.9527
##             Specificity : 0.8889
##          Pos Pred Value : 0.9583
##          Neg Pred Value : 0.8750
##              Prevalence : 0.7284
##          Detection Rate : 0.6940
##    Detection Prevalence : 0.7241
##       Balanced Accuracy : 0.9208
## 
##        'Positive' Class : Yes
## 
```
```r
# Test-set confusion matrix as a heatmap
cm_df2 <- as.data.frame(cm_test$table)

plot_ly(cm_df2, x = ~Reference, y = ~Prediction, z = ~Freq, type = "heatmap",
        colorscale = list(c(0, "#FFFFFF"), c(1, "#E45756")),
        text = ~Freq, texttemplate = "%{text}",
        hovertemplate = "Actual: %{x}<br>Predicted: %{y}<br>Count: %{z}<extra></extra>") %>%
  layout(title = "Test Confusion Matrix",
         xaxis = list(title = "Actual"),
         yaxis = list(title = "Predicted"))
```

```r
# Same four headline metrics, now on the test set
metrics_test <- data.frame(
  Metric = c("Accuracy", "Precision (PPV)", "Recall (Sensitivity)", "Specificity"),
  Value = c(cm_test$overall["Accuracy"],
            cm_test$byClass["Pos Pred Value"],
            cm_test$byClass["Sensitivity"],
            cm_test$byClass["Specificity"])
) %>% mutate(Value = round(Value, 4))

kable(metrics_test, caption = "Test-set classification metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```

| Metric | Value |
|---|---|
| Accuracy | 0.9353 |
| Precision (PPV) | 0.9583 |
| Recall (Sensitivity) | 0.9527 |
| Specificity | 0.8889 |
Test accuracy of 93.5 % — barely a percentage point lower than training accuracy. The closeness of the two means the model is not overfitting: it generalises gracefully to colleges it has never seen.
A confusion matrix collapses everything to a single threshold (0.5). The plot below shows the full distribution of predicted probabilities for the test set, separated by actual class. Misclassifications are exactly the bars that cross the dashed threshold line into the wrong half.
```r
prob_df <- data.frame(prob = prob_test, actual = test$Private)

plot_ly(alpha = 0.7) %>%
  add_histogram(data = subset(prob_df, actual == "No"),
                x = ~prob, name = "Actual: Public",
                marker = list(color = "#E45756"),
                xbins = list(start = 0, end = 1, size = 0.033)) %>%
  add_histogram(data = subset(prob_df, actual == "Yes"),
                x = ~prob, name = "Actual: Private",
                marker = list(color = "#4C78A8"),
                xbins = list(start = 0, end = 1, size = 0.033)) %>%
  layout(barmode = "overlay",
         title = "Predicted P(Private) on the Test Set",
         xaxis = list(title = "Predicted P(Private)"),
         yaxis = list(title = "Number of colleges"),
         shapes = list(list(type = "line",
                            x0 = 0.5, x1 = 0.5, y0 = 0, y1 = 50,
                            line = list(color = "black", dash = "dash",
                                        width = 2))),
         annotations = list(list(x = 0.5, y = 1.05, xref = "x", yref = "paper",
                                 text = "Decision threshold = 0.5",
                                 showarrow = FALSE, font = list(size = 11))))
```

The two distributions are strongly separated. Public colleges (red) cluster at low probabilities, private colleges (blue) at high probabilities, and only a small handful spill across the line — those are our 7 false positives and 8 false negatives.
The Receiver Operating Characteristic (ROC) curve sweeps every possible threshold and shows the trade-off between true-positive and false-positive rates.
```r
# ROC on the test set: sweep every threshold, recording TPR and FPR at each
roc_obj <- roc(response = test$Private, predictor = prob_test,
               levels = c("No", "Yes"), direction = "<")
auc_val <- as.numeric(auc(roc_obj))

roc_df <- data.frame(FPR = 1 - roc_obj$specificities,
                     TPR = roc_obj$sensitivities,
                     Threshold = roc_obj$thresholds)

# Build the diagonal reference line as its own data frame so its column
# lengths never clash with the ROC trace
diag_df <- data.frame(x = c(0, 1), y = c(0, 1))

plot_ly() %>%
  add_trace(data = roc_df, x = ~FPR, y = ~TPR,
            type = "scatter", mode = "lines",
            line = list(color = "#4C78A8", width = 3),
            name = "Logistic model",
            text = ~paste0("Threshold: ", round(Threshold, 3),
                           "<br>TPR: ", round(TPR, 3),
                           "<br>FPR: ", round(FPR, 3)),
            hoverinfo = "text") %>%
  add_trace(data = diag_df, x = ~x, y = ~y,
            type = "scatter", mode = "lines",
            line = list(color = "grey", dash = "dash"),
            name = "Random classifier",
            hoverinfo = "none") %>%
  layout(title = paste0("ROC Curve (Test Set) — AUC = ", round(auc_val, 3)),
         xaxis = list(title = "False Positive Rate (1 − Specificity)"),
         yaxis = list(title = "True Positive Rate (Recall)"))
```

The curve hugs the top-left corner, exactly where we want it. The diagonal grey line represents a coin-flip classifier — the model is far above it everywhere.
Area Under the Curve = 0.976.
AUC is the probability that the model ranks a randomly chosen private college higher than a randomly chosen public college. A value of 0.5 is random guessing; 1.0 is perfect ranking. Anything above 0.9 is generally considered excellent. Our value of 0.976 says the logistic regression separates the two classes with very little overlap.
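That ranking interpretation can be verified directly by comparing every (private, public) pair of test colleges. A sketch, using the standard Mann-Whitney convention of counting ties as half:

```r
# AUC as a ranking probability: fraction of (private, public) pairs where
# the private college receives the higher predicted probability
p_yes <- prob_test[test$Private == "Yes"]
p_no  <- prob_test[test$Private == "No"]
pairs <- outer(p_yes, p_no, FUN = function(a, b) (a > b) + 0.5 * (a == b))
mean(pairs)   # should reproduce auc(roc_obj) ~ 0.976
```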
The logistic regression model built with `glm()` answered our opening question convincingly. Using just six numeric features — tuition, full-time undergraduate enrolment, student-to-faculty ratio, alumni-giving rate, graduation rate, and per-student expenditure — we identify private vs. public colleges with ~93.5 % test accuracy and an AUC of 0.976.

Three takeaways stand out:

1. `Outstate` (`***`) and `F.Undergrad` (`***`) dominate, and `perc.alumni` adds a small but significant bump (`*`).
2. The other three predictors are not statistically meaningful once those three are in the model.
3. A leaner model with three predictors would likely perform almost identically.

For an analyst, the practical next step is to deploy the model as a quick sanity check: feed any directory of colleges through it, flag every record where the model's prediction disagrees with the listed type, and review those cases by hand. That is exactly the kind of high-leverage work a small, well-calibrated logistic regression model can do; a sketch of that audit loop follows below.
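A minimal sketch of that workflow, assuming a hypothetical `directory` data frame with the model's predictor columns plus an illustrative `ListedType` label (neither name comes from the original report):

```r
# Hypothetical directory audit: flag records where the model disagrees
# with the listed type, so a human can review them
audit_directory <- function(directory, model, threshold = 0.5) {
  p <- predict(model, newdata = directory, type = "response")
  predicted <- ifelse(p >= threshold, "Yes", "No")
  directory$flagged <- predicted != directory$ListedType
  directory[directory$flagged, ]   # rows worth a second look
}

# Example: re-audit the test set against its own labels
# audit_directory(transform(test, ListedType = Private), model)
```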