1. Introduction

The question

Imagine a stack of 777 college brochures spread across a desk. Without opening one, can we tell which schools are private and which are public — using only the numbers printed on the back?

That is the question this report answers. Using the College dataset from the ISLR package, we use R's glm() function to build a logistic regression model that predicts whether a university is private (Yes) or public (No) from a handful of numeric features such as out-of-state tuition, student-to-faculty ratio, and graduation rate.

Why it matters

Private and public institutions differ in funding, mission, and student experience. A model that distinguishes them from observable characteristics helps:

  • prospective students and counselors quickly classify unfamiliar schools,
  • policy analysts spot universities whose profile does not match their official type (a sign of an unusual funding model), and
  • data analysts learning GLMs see how a binary classifier is built, evaluated, and interpreted end-to-end.

Roadmap

We move through five stages: explore the data, split it 70/30, fit a logistic regression with glm(), evaluate it with a confusion matrix and the metrics Accuracy / Precision / Recall / Specificity, and finally judge generalisation with a ROC curve and AUC on the test set.
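
All code below assumes the following packages are loaded; a setup chunk of this shape (typically hidden with include=FALSE in the knitted report) is enough to reproduce the analysis:

library(ISLR)        # the College dataset
library(dplyr)       # data wrangling: count(), select(), mutate()
library(plotly)      # interactive charts
library(corrplot)    # correlation heatmap
library(caret)       # createDataPartition(), confusionMatrix()
library(pROC)        # roc(), auc()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()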


2. Exploratory Data Analysis

data("College")
df <- College
dim(df); table(df$Private)
## [1] 777  18
## 
##  No Yes 
## 212 565

The dataset contains 777 observations and 18 variables, with 565 private and 212 public universities — roughly a 73 / 27 split. Private schools dominate, which is something we must keep in mind when we read the confusion matrix later.
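
The proportions are quick to confirm:

round(prop.table(table(df$Private)), 2)
##   No  Yes 
## 0.27 0.73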

2.1 Class balance

balance <- df %>% count(Private)
plot_ly(balance, x = ~Private, y = ~n, type = "bar",
        color = ~Private,
        colors = c("No" = "#E45756", "Yes" = "#4C78A8"),
        text = ~n, textposition = "outside",
        hovertemplate = "<b>%{x}</b><br>Count: %{y}<extra></extra>") %>%
  layout(title = "How many private vs. public colleges?",
         yaxis = list(title = "Number of colleges"),
         xaxis = list(title = "Private"),
         showlegend = FALSE)

2.2 Tuition tells the story

If you suspected that private schools cost more, the data agrees loudly. Out-of-state tuition is the single most discriminating variable in this dataset.

plot_ly(df, y = ~Outstate, color = ~Private, type = "box",
        colors = c("No" = "#E45756", "Yes" = "#4C78A8"),
        boxpoints = "outliers") %>%
  layout(title = "Out-of-state Tuition by School Type",
         yaxis = list(title = "Out-of-state tuition (USD)"),
         xaxis = list(title = "Private"))

The two boxes barely overlap. Public colleges cluster around the lower end while private colleges sit much higher — a gap of roughly $5,000. Any reasonable model is going to lean heavily on this column.
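
A quick numeric check of that gap (a sketch using dplyr's group summaries; it reports the medians and spread behind the boxplot):

df %>%
  group_by(Private) %>%
  summarise(median_outstate = median(Outstate),
            iqr_outstate    = IQR(Outstate))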

2.3 Student-to-faculty ratio

Private schools also tend to be smaller and more attentive, with fewer students per faculty member.

plot_ly(df, y = ~S.F.Ratio, color = ~Private, type = "violin",
        box = list(visible = TRUE), meanline = list(visible = TRUE),
        colors = c("No" = "#E45756", "Yes" = "#4C78A8")) %>%
  layout(title = "Student/Faculty Ratio Distribution",
         yaxis = list(title = "Students per faculty"),
         xaxis = list(title = "Private"))

The violin shape for public colleges sits noticeably higher (more crowded classrooms), reinforcing the intuition that this variable carries useful classification signal.

2.4 Tuition vs. graduation rate

A scatterplot of tuition against graduation rate makes the two clouds of points visually distinct. Hover over any dot to see the underlying values.

plot_ly(df, x = ~Outstate, y = ~Grad.Rate, color = ~Private,
        colors = c("No" = "#E45756", "Yes" = "#4C78A8"),
        type = "scatter", mode = "markers",
        marker = list(size = 7, opacity = 0.6),
        text = ~paste0("Tuition: $", Outstate,
                       "<br>Grad rate: ", Grad.Rate, "%",
                       "<br>Type: ", Private),
        hoverinfo = "text") %>%
  layout(title = "Out-of-state Tuition vs. Graduation Rate",
         xaxis = list(title = "Out-of-state tuition (USD)"),
         yaxis = list(title = "Graduation rate (%)"))

Private schools (blue) drift toward the upper-right quadrant — higher tuition and higher graduation rates — while public schools (red) sit lower and to the left.

2.5 Correlation heatmap

num_df  <- df %>% select(where(is.numeric))
cor_mat <- cor(num_df)
corrplot(cor_mat, method = "color", type = "lower",
         tl.cex = 0.7, tl.col = "black",
         addCoef.col = "black", number.cex = 0.55,
         col = colorRampPalette(c("#E45756","white","#4C78A8"))(200),
         title = "Correlation Matrix of Numeric Predictors",
         mar = c(0,0,2,0))

A few clusters of strong correlation jump out — Apps, Accept, Enroll, and F.Undergrad all measure school size and move together. We will avoid piling all of them into the model to limit multicollinearity.
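
That cluster is easy to quantify directly; the pairwise correlations among the four size variables are all strongly positive:

round(cor(df[, c("Apps", "Accept", "Enroll", "F.Undergrad")]), 2)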


3. Train / Test Split

A 70/30 stratified split keeps the original class balance in both subsets, which is important when the classes are imbalanced.

set.seed(3456)
trainIndex <- createDataPartition(df$Private, p = 0.7,
                                  list = FALSE, times = 1)
train <- df[ trainIndex, ]
test  <- df[-trainIndex, ]

data.frame(
  Set        = c("Train","Test"),
  N          = c(nrow(train), nrow(test)),
  `% Private`= c(round(mean(train$Private == "Yes")*100,1),
                 round(mean(test$Private  == "Yes")*100,1)),
  check.names = FALSE
) %>%
  kable(caption = "Train / Test sizes and class balance") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Train / Test sizes and class balance

  Set     N    % Private
  Train   545  72.7
  Test    232  72.8

The train set has 545 colleges, the test set 232 — and both keep the roughly 73 % private share, so neither side is starved of one class.
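
For contrast, a plain unstratified split (an illustrative sketch, separate from the pipeline above) offers no such guarantee:

idx <- sample(nrow(df), size = round(0.7 * nrow(df)))
mean(df$Private[idx] == "Yes")   # hovers near 0.727 but is not pinned there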


4. Logistic Regression with glm()

We fit a logistic regression with six predictors chosen for a mix of price, size, prestige, and resources: Outstate, F.Undergrad, S.F.Ratio, perc.alumni, Grad.Rate, and Expend.

model <- glm(Private ~ Outstate + F.Undergrad + S.F.Ratio +
                       perc.alumni + Grad.Rate + Expend,
             data = train, family = binomial(link = "logit"))
summary(model)
## 
## Call:
## glm(formula = Private ~ Outstate + F.Undergrad + S.F.Ratio + 
##     perc.alumni + Grad.Rate + Expend, family = binomial(link = "logit"), 
##     data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.052e+00  1.602e+00  -1.281   0.2003    
## Outstate     5.912e-04  9.444e-05   6.260 3.85e-10 ***
## F.Undergrad -5.645e-04  7.431e-05  -7.597 3.03e-14 ***
## S.F.Ratio   -9.754e-02  6.717e-02  -1.452   0.1465    
## perc.alumni  4.616e-02  2.112e-02   2.186   0.0288 *  
## Grad.Rate    8.589e-03  1.138e-02   0.755   0.4504    
## Expend       1.585e-05  1.038e-04   0.153   0.8787    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 639.40  on 544  degrees of freedom
## Residual deviance: 204.71  on 538  degrees of freedom
## AIC: 218.71
## 
## Number of Fisher Scoring iterations: 7
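
Before standardising anything, the raw estimates can be read as odds ratios by exponentiating. A worked example for the two strongest predictors, using the coefficients printed above:

# Log-odds coefficients become odds ratios on the exp scale.
exp(coef(model)["Outstate"]    * 1000)  # ~1.81: +$1,000 tuition raises the odds of Private ~81%
exp(coef(model)["F.Undergrad"] * 1000)  # ~0.57: +1,000 undergrads cuts the odds of Private ~43%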

4.1 Standardised coefficient plot

Raw glm() coefficients are hard to compare because each predictor lives on its own scale (tuition in dollars, ratio in students-per-faculty, etc.). The plot below standardises each coefficient by the standard deviation of its predictor so we can see, on a single axis, which variables actually move the prediction. Significance stars come from the model’s own p-values.

coef_df <- data.frame(
  term     = names(coef(model))[-1],
  estimate = coef(model)[-1],
  stderr   = summary(model)$coefficients[-1, "Std. Error"],
  pval     = summary(model)$coefficients[-1, "Pr(>|z|)"]
)
sds <- sapply(train[, coef_df$term], sd)
coef_df$std_estimate <- coef_df$estimate * sds
coef_df$std_se       <- coef_df$stderr   * sds
coef_df$sig <- cut(coef_df$pval,
                   breaks = c(-Inf, 0.001, 0.01, 0.05, Inf),
                   labels = c("***","**","*",""))
coef_df <- coef_df %>% arrange(std_estimate)
coef_df$term <- factor(coef_df$term, levels = coef_df$term)
coef_df$direction <- ifelse(coef_df$std_estimate > 0,
                            "Pushes toward Private",
                            "Pushes toward Public")

plot_ly(coef_df, x = ~std_estimate, y = ~term, type = "bar",
        orientation = "h",
        color = ~direction,
        colors = c("Pushes toward Private" = "#4C78A8",
                   "Pushes toward Public"  = "#E45756"),
        error_x = list(type = "data",
                       array = ~1.96*std_se,
                       color = "#444"),
        text = ~paste0(term, "  ", sig,
                       "<br>Std. effect: ", round(std_estimate, 3),
                       "<br>p-value: ", signif(pval, 3)),
        hoverinfo = "text") %>%
  layout(title = "Standardised Logistic Coefficients (per 1 SD)",
         xaxis = list(title = "Change in log-odds(Private)",
                      zeroline = TRUE, zerolinecolor = "black"),
         yaxis = list(title = ""))

What this shows. Outstate is by far the strongest pull toward Private; F.Undergrad is the strongest pull toward Public. perc.alumni helps a bit, while S.F.Ratio, Grad.Rate and Expend contribute very little once tuition and undergraduate size are in the model. This matches the EDA story exactly.
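
One hedged way to test that claim is a likelihood-ratio comparison against a three-predictor model (a sketch; model_small is not fitted elsewhere in this report):

model_small <- glm(Private ~ Outstate + F.Undergrad + perc.alumni,
                   data = train, family = binomial)
anova(model_small, model, test = "Chisq")  # non-significant -> the dropped terms add little
AIC(model_small, model)                    # similar AIC would favour the leaner model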


5. Confusion Matrix on the Training Set

prob_train <- predict(model, newdata = train, type = "response")
pred_train <- factor(ifelse(prob_train >= 0.5, "Yes","No"),
                     levels = c("No","Yes"))

cm_train <- confusionMatrix(pred_train, train$Private, positive = "Yes")
cm_train
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  129  10
##        Yes  20 386
##                                           
##                Accuracy : 0.945           
##                  95% CI : (0.9223, 0.9626)
##     No Information Rate : 0.7266          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8585          
##                                           
##  Mcnemar's Test P-Value : 0.1003          
##                                           
##             Sensitivity : 0.9747          
##             Specificity : 0.8658          
##          Pos Pred Value : 0.9507          
##          Neg Pred Value : 0.9281          
##              Prevalence : 0.7266          
##          Detection Rate : 0.7083          
##    Detection Prevalence : 0.7450          
##       Balanced Accuracy : 0.9203          
##                                           
##        'Positive' Class : Yes             
## 
cm_df <- as.data.frame(cm_train$table)
plot_ly(cm_df, x = ~Reference, y = ~Prediction, z = ~Freq, type = "heatmap",
        colorscale = list(c(0,"#FFFFFF"), c(1,"#4C78A8")),
        text = ~Freq, texttemplate = "%{text}",
        hovertemplate = "Actual: %{x}<br>Predicted: %{y}<br>Count: %{z}<extra></extra>") %>%
  layout(title = "Train Confusion Matrix",
         xaxis = list(title = "Actual"),
         yaxis = list(title = "Predicted"))

Interpreting the matrix

Out of 545 training colleges, the model is wrong on only 30. The diagonal — 129 true negatives (correctly identified public) and 386 true positives (correctly identified private) — is overwhelmingly populated. Off the diagonal we have:

  • False Positives (20): public colleges the model labels as private.
  • False Negatives (10): private colleges the model labels as public.

Which mistake is worse?

That depends on what the prediction is used for.

False Positives are the more damaging error in this context. The dataset is dominated by private schools (73 %), so a model that over-predicts the majority class can look artificially accurate while quietly mislabelling the smaller, harder-to-detect public group. If a counsellor used the model to route applicants toward private-school financial-aid resources, false positives would send students down the wrong aid pathway — potentially costing them tuition discounts they actually qualify for. False negatives are the mirror mistake but generally less costly.

Headline metrics — Train

metrics_train <- data.frame(
  Metric = c("Accuracy","Precision (PPV)","Recall (Sensitivity)","Specificity"),
  # unname() stops the vector names leaking into the data frame's row names
  # (and from there into a duplicate column in the rendered table)
  Value  = unname(c(cm_train$overall["Accuracy"],
                    cm_train$byClass["Pos Pred Value"],
                    cm_train$byClass["Sensitivity"],
                    cm_train$byClass["Specificity"]))
) %>% mutate(Value = round(Value, 4))

kable(metrics_train, caption = "Training-set classification metrics") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Training-set classification metrics

  Metric                 Value
  Accuracy               0.9450
  Precision (PPV)        0.9507
  Recall (Sensitivity)   0.9747
  Specificity            0.8658

Training accuracy of 94.5 % comfortably beats the 73 % floor of always predicting private. Recall of 97.5 % means only ~2.5 % of true private colleges are missed. Specificity is the lower number at 86.6 % — confirming that public colleges are the slightly harder class.
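
All four numbers can be reproduced by hand from the matrix cells (TP = 386, FP = 20, FN = 10, TN = 129):

TP <- 386; FP <- 20; FN <- 10; TN <- 129
c(Accuracy    = (TP + TN) / (TP + TN + FP + FN),  # 515/545 = 0.9450
  Precision   = TP / (TP + FP),                   # 386/406 = 0.9507
  Recall      = TP / (TP + FN),                   # 386/396 = 0.9747
  Specificity = TN / (TN + FP))                   # 129/149 = 0.8658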


6. Confusion Matrix on the Test Set

The real test of a model is fresh data.

prob_test <- predict(model, newdata = test, type = "response")
pred_test <- factor(ifelse(prob_test >= 0.5, "Yes","No"),
                    levels = c("No","Yes"))

cm_test <- confusionMatrix(pred_test, test$Private, positive = "Yes")
cm_test
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No   56   8
##        Yes   7 161
##                                           
##                Accuracy : 0.9353          
##                  95% CI : (0.8956, 0.9634)
##     No Information Rate : 0.7284          
##     P-Value [Acc > NIR] : 7.952e-16       
##                                           
##                   Kappa : 0.8374          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9527          
##             Specificity : 0.8889          
##          Pos Pred Value : 0.9583          
##          Neg Pred Value : 0.8750          
##              Prevalence : 0.7284          
##          Detection Rate : 0.6940          
##    Detection Prevalence : 0.7241          
##       Balanced Accuracy : 0.9208          
##                                           
##        'Positive' Class : Yes             
## 
cm_df2 <- as.data.frame(cm_test$table)
plot_ly(cm_df2, x = ~Reference, y = ~Prediction, z = ~Freq, type = "heatmap",
        colorscale = list(c(0,"#FFFFFF"), c(1,"#E45756")),
        text = ~Freq, texttemplate = "%{text}",
        hovertemplate = "Actual: %{x}<br>Predicted: %{y}<br>Count: %{z}<extra></extra>") %>%
  layout(title = "Test Confusion Matrix",
         xaxis = list(title = "Actual"),
         yaxis = list(title = "Predicted"))
metrics_test <- data.frame(
  Metric = c("Accuracy","Precision (PPV)","Recall (Sensitivity)","Specificity"),
  # unname() again keeps the vector names out of the row names
  Value  = unname(c(cm_test$overall["Accuracy"],
                    cm_test$byClass["Pos Pred Value"],
                    cm_test$byClass["Sensitivity"],
                    cm_test$byClass["Specificity"]))
) %>% mutate(Value = round(Value, 4))

kable(metrics_test, caption = "Test-set classification metrics") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Test-set classification metrics

  Metric                 Value
  Accuracy               0.9353
  Precision (PPV)        0.9583
  Recall (Sensitivity)   0.9527
  Specificity            0.8889

Test accuracy of 93.5 % is barely a percentage point lower than training accuracy. That closeness suggests the model is not overfitting: it generalises gracefully to colleges it has never seen.


7. Predicted Probability Distribution

A confusion matrix collapses everything to a single threshold (0.5). The plot below shows the full distribution of predicted probabilities for the test set, separated by actual class. Misclassifications are exactly the bars that cross the dashed threshold line into the wrong half.

prob_df <- data.frame(prob = prob_test, actual = test$Private)

plot_ly(alpha = 0.7) %>%
  add_histogram(data = subset(prob_df, actual == "No"),
                x = ~prob, name = "Actual: Public",
                marker = list(color = "#E45756"),
                xbins = list(start = 0, end = 1, size = 0.033)) %>%
  add_histogram(data = subset(prob_df, actual == "Yes"),
                x = ~prob, name = "Actual: Private",
                marker = list(color = "#4C78A8"),
                xbins = list(start = 0, end = 1, size = 0.033)) %>%
  layout(barmode = "overlay",
         title = "Predicted P(Private) on the Test Set",
         xaxis = list(title = "Predicted P(Private)"),
         yaxis = list(title = "Number of colleges"),
         shapes = list(list(type = "line",
                            x0 = 0.5, x1 = 0.5, y0 = 0, y1 = 50,
                            line = list(color = "black", dash = "dash",
                                        width = 2))),
         annotations = list(list(x = 0.5, y = 1.05, xref = "x", yref = "paper",
                                 text = "Decision threshold = 0.5",
                                 showarrow = FALSE, font = list(size = 11))))

The two distributions are strongly separated. Public colleges (red) cluster at low probabilities, private colleges (blue) at high probabilities, and only a small handful spill across the line — those are our 7 false positives and 8 false negatives.
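
To make the trade-off concrete, a small threshold sweep (an illustrative sketch) shows how the two error counts move as the cutoff rises:

for (thr in c(0.5, 0.6, 0.7)) {
  pred <- factor(ifelse(prob_test >= thr, "Yes", "No"), levels = c("No", "Yes"))
  tab  <- table(Predicted = pred, Actual = test$Private)
  cat("threshold =", thr, " FP =", tab["Yes", "No"], " FN =", tab["No", "Yes"], "\n")
}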


8. ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve sweeps every possible threshold and shows the trade-off between true-positive and false-positive rates.

roc_obj <- roc(response = test$Private, predictor = prob_test,
               levels = c("No","Yes"), direction = "<")
auc_val <- as.numeric(auc(roc_obj))

# The diagonal reference line gets its own two-point data frame, so its
# column lengths can never clash with the ROC trace's columns.
roc_df <- data.frame(FPR = 1 - roc_obj$specificities,
                     TPR = roc_obj$sensitivities,
                     Threshold = roc_obj$thresholds)
diag_df <- data.frame(x = c(0, 1), y = c(0, 1))

plot_ly() %>%
  add_trace(data = roc_df, x = ~FPR, y = ~TPR,
            type = "scatter", mode = "lines",
            line = list(color = "#4C78A8", width = 3),
            name = "Logistic model",
            text = ~paste0("Threshold: ", round(Threshold, 3),
                           "<br>TPR: ", round(TPR, 3),
                           "<br>FPR: ", round(FPR, 3)),
            hoverinfo = "text") %>%
  add_trace(data = diag_df, x = ~x, y = ~y,
            type = "scatter", mode = "lines",
            line = list(color = "grey", dash = "dash"),
            name = "Random classifier",
            hoverinfo = "none") %>%
  layout(title = paste0("ROC Curve (Test Set) — AUC = ", round(auc_val, 3)),
         xaxis = list(title = "False Positive Rate (1 − Specificity)"),
         yaxis = list(title = "True Positive Rate (Recall)"))

The curve hugs the top-left corner, exactly where we want it. The diagonal grey line represents a coin-flip classifier — the model is far above it everywhere.

Interpreting AUC

Area Under the Curve = 0.976.

AUC is the probability that the model ranks a randomly chosen private college higher than a randomly chosen public college. A value of 0.5 is random guessing; 1.0 is perfect ranking. Anything above 0.9 is generally considered excellent. Our value of 0.976 says the logistic regression separates the two classes with very little overlap.
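
That ranking interpretation can be checked empirically by comparing every private/public pair of test-set predictions (ties counted as half):

p_priv <- prob_test[test$Private == "Yes"]
p_pub  <- prob_test[test$Private == "No"]
mean(outer(p_priv, p_pub, ">") + 0.5 * outer(p_priv, p_pub, "=="))  # matches AUC ~0.976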


9. Conclusion

The logistic regression model built with glm() answered our opening question convincingly. Using just six numeric features — tuition, full-time undergraduate enrolment, student-to-faculty ratio, alumni-giving rate, graduation rate, and per-student expenditure — we identify private vs. public colleges with ~93.5 % test accuracy and an AUC of 0.976.

Three takeaways stand out:

  1. Tuition and undergraduate size do almost all the work. The coefficient plot makes this brutally clear — Outstate (***) and F.Undergrad (***) dominate, and perc.alumni adds a small but significant bump (*). The other three predictors are not statistically meaningful once those three are in the model. A leaner model with three predictors would likely perform almost identically.
  2. The model generalises well. Train and test accuracy are within one percentage point. The 70/30 stratified split was sufficient — no regularisation or cross-validation tuning was needed for this dataset.
  3. Public colleges are the harder class. Specificity (~0.89) lags recall (~0.95), reflecting the 73 / 27 imbalance. If the cost of a false positive were genuinely high, raising the decision threshold above 0.5 or rebalancing the training set would help.

For an analyst, the practical next step is to deploy the model as a quick sanity check: feed any directory of colleges through it, flag every record where the model’s prediction disagrees with the listed type, and review those cases by hand. That is exactly the kind of high-leverage work a small, well-calibrated logistic regression model can do.
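
A minimal sketch of that workflow, assuming the fitted model and full data frame from above (the pred column is added purely for illustration):

df$pred <- ifelse(predict(model, newdata = df, type = "response") >= 0.5, "Yes", "No")
flagged <- df[df$pred != df$Private, ]   # records where the model disagrees with the label
rownames(flagged)                        # college names worth a manual second look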


References

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning with applications in R. Springer.
  • Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1–26.
  • R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
  • Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J. C., & Müller, M. (2011). pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77.
  • Sievert, C. (2020). Interactive web-based data visualization with R, plotly, and shiny. Chapman and Hall/CRC.
  • Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media.