ALY6015_ZeeshanAhmadAnsari_WEEK_3_FALL

M3 Project Report
ALY6015_71821:Intermediate Analytics
SEC_09_Fall_2023_CPS
Northeastern University
Professor: Vladimir Shapiro

By: Zeeshan Ahmad Ansari
Date of Submission: 27 November, 2023

Introduction

This research explores data from 777 colleges in the United States. The aim is to create a computer program that predicts whether a college is private or public based on various factors such as the number of students applying, enrolling, and graduating, along with financial details. The study involves using mathematical and computer techniques to build and test a model for this prediction.

The initial step involves thoroughly examining the data using both numerical summaries and visual representations to gain insights into its patterns. Subsequently, the data is divided into two parts, one for building the model and the other for testing its accuracy. The chosen technique for building the prediction model is logistic regression, a method that estimates the probability of a college being private.

To evaluate the model’s performance, various metrics are employed, including accuracy, precision, recall, and specificity, which are derived from a confusion matrix. Additionally, the study incorporates the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) to assess the model’s ability to distinguish between private and public colleges.
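As a minimal sketch of how these four metrics follow from a confusion matrix, the snippet below uses hypothetical counts (they are not results from this report) and treats "Private" as the positive class.

```r
# Hypothetical confusion-matrix counts, for illustration only.
# "Private" is treated as the positive class.
tp <- 150  # private colleges predicted as private
fn <- 19   # private colleges predicted as public
fp <- 10   # public colleges predicted as private
tn <- 53   # public colleges predicted as public

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision   <- tp / (tp + fp)                   # share of predicted private that are private
recall      <- tp / (tp + fn)                   # share of actual private that are found (sensitivity)
specificity <- tn / (tn + fp)                   # share of actual public that are found

round(c(accuracy = accuracy, precision = precision,
        recall = recall, specificity = specificity), 3)
```

With imbalanced classes, specificity is worth reading alongside accuracy, because a model that labels every college as private would still be right most of the time.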

The report delves into the interpretation of these metrics, providing insights into the effectiveness of the logistic regression model in predicting the nature of universities. It concludes by discussing the significance of specific data attributes in distinguishing between private and public institutions. Essentially, the study illustrates the process of teaching a computer to make informed predictions about a college’s classification based on given information.

Analysis

Library

# Load the libraries used for data processing, modeling, and visualization

library(ISLR)
library(ggplot2)
library(dplyr)
library(knitr)
library(kableExtra)
library(psych)
library(reshape2)
library(corrplot)
library(caret)
library(leaps)
library(GGally)
library(pROC)
library(MASS) 
library(gridExtra)

Task 1

Import the dataset and perform Exploratory Data Analysis by using descriptive statistics and plots to describe the dataset.

In this analysis, we explore the College dataset, aiming to gain insights into various aspects of higher education institutions. The dataset encompasses diverse information about colleges in the United States, including features such as the number of applications received, acceptance rates, graduation rates, and financial characteristics.

Our investigative journey begins with a comprehensive overview of the dataset, examining its structure, the initial rows of data, and summary statistics. We utilize descriptive statistics and visualizations to unravel patterns and characteristics inherent in the data.

Furthermore, we delve into data exploration, conducting a thorough examination of missing values and deploying graphical representations. Box plots illustrate the distribution of out-of-state tuition fees across different types of universities, offering insights into the financial landscape. Bar plots provide a categorical view of university types, enabling a quick grasp of the dataset’s composition.

Histograms shed light on the distribution of key variables, such as the number of applications received and graduation rates, providing a nuanced understanding of their frequency distribution.

To unveil potential relationships between numerical variables, a pairwise scatter plot is constructed, revealing patterns and potential associations. The correlation matrix and a corresponding visualization further illuminate the interplay between variables, providing a quantitative perspective on their relationships.

Our analysis aims to foster a holistic understanding of the College dataset, setting the stage for subsequent tasks, such as predictive modeling and feature selection. This exploratory phase is pivotal in laying the groundwork for informed decision-making and identifying key factors that contribute to the diversity and dynamics of higher education institutions.

# Load the College dataset
data("College")

# Display the structure of the dataset
str(College)

## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...

# Viewing the first few rows of the dataset
head(College)

# Summary statistics
summary(College)

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

# Summary statistics of the dataset
data_summary <- summary(College)

kable(data_summary,caption = "<center>Table 1: Summary</center>", format = "html", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "400px")

Table 1: Summary
Private	Apps	Accept	Enroll	Top10perc	Top25perc	F.Undergrad	P.Undergrad	Outstate	Room.Board	Books	Personal	PhD	Terminal	S.F.Ratio	perc.alumni	Expend	Grad.Rate
No :212	Min. : 81	Min. : 72	Min. : 35	Min. : 1.00	Min. : 9.0	Min. : 139	Min. : 1.0	Min. : 2340	Min. :1780	Min. : 96.0	Min. : 250	Min. : 8.00	Min. : 24.0	Min. : 2.50	Min. : 0.00	Min. : 3186	Min. : 10.00
Yes:565	1st Qu.: 776	1st Qu.: 604	1st Qu.: 242	1st Qu.:15.00	1st Qu.: 41.0	1st Qu.: 992	1st Qu.: 95.0	1st Qu.: 7320	1st Qu.:3597	1st Qu.: 470.0	1st Qu.: 850	1st Qu.: 62.00	1st Qu.: 71.0	1st Qu.:11.50	1st Qu.:13.00	1st Qu.: 6751	1st Qu.: 53.00
NA	Median : 1558	Median : 1110	Median : 434	Median :23.00	Median : 54.0	Median : 1707	Median : 353.0	Median : 9990	Median :4200	Median : 500.0	Median :1200	Median : 75.00	Median : 82.0	Median :13.60	Median :21.00	Median : 8377	Median : 65.00
NA	Mean : 3002	Mean : 2019	Mean : 780	Mean :27.56	Mean : 55.8	Mean : 3700	Mean : 855.3	Mean :10441	Mean :4358	Mean : 549.4	Mean :1341	Mean : 72.66	Mean : 79.7	Mean :14.09	Mean :22.74	Mean : 9660	Mean : 65.46
NA	3rd Qu.: 3624	3rd Qu.: 2424	3rd Qu.: 902	3rd Qu.:35.00	3rd Qu.: 69.0	3rd Qu.: 4005	3rd Qu.: 967.0	3rd Qu.:12925	3rd Qu.:5050	3rd Qu.: 600.0	3rd Qu.:1700	3rd Qu.: 85.00	3rd Qu.: 92.0	3rd Qu.:16.50	3rd Qu.:31.00	3rd Qu.:10830	3rd Qu.: 78.00
NA	Max. :48094	Max. :26330	Max. :6392	Max. :96.00	Max. :100.0	Max. :31643	Max. :21836.0	Max. :21700	Max. :8124	Max. :2340.0	Max. :6800	Max. :103.00	Max. :100.0	Max. :39.80	Max. :64.00	Max. :56233	Max. :118.00

describe(College) %>%
  kable(caption = "<center>Table 2: Descriptive Statistics</center>", format = "html", align = "l") %>%
  kable_styling("bordered", full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "400px")

Table 2: Descriptive Statistics
	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
Private*	1	777	1.727156	0.4457084	2.0	1.783307	0.00000	1.0	2.0	1.0	-1.0179902	-0.9649328	0.0159897
Apps	2	777	3001.638353	3870.2014844	1558.0	2193.008026	1463.32620	81.0	48094.0	48013.0	3.7093849	26.5184313	138.8427049
Accept	3	777	2018.804376	2451.1139710	1110.0	1510.287319	1008.16800	72.0	26330.0	26258.0	3.4045428	18.7526403	87.9332239
Enroll	4	777	779.972973	929.1761901	434.0	575.953451	354.34140	35.0	6392.0	6357.0	2.6800857	8.7368340	33.3340101
Top10perc	5	777	27.558559	17.6403644	23.0	25.130016	13.34340	1.0	96.0	95.0	1.4077650	2.1728286	0.6328445
Top25perc	6	777	55.796654	19.8047776	54.0	55.121990	20.75640	9.0	100.0	91.0	0.2583399	-0.5744647	0.7104924
F.Undergrad	7	777	3699.907336	4850.4205309	1707.0	2574.884430	1441.08720	139.0	31643.0	31504.0	2.6003876	7.6120676	174.0078673
P.Undergrad	8	777	855.298584	1522.4318873	353.0	536.361156	449.22780	1.0	21836.0	21835.0	5.6703938	54.5249401	54.6169397
Outstate	9	777	10440.669241	4023.0164841	9990.0	10181.658106	4121.62800	2340.0	21700.0	19360.0	0.5073133	-0.4255258	144.3249124
Room.Board	10	777	4357.526383	1096.6964156	4200.0	4301.704655	1005.20280	1780.0	8124.0	6344.0	0.4755141	-0.2012779	39.3437648
Books	11	777	549.380952	165.1053601	500.0	535.219904	148.26000	96.0	2340.0	2244.0	3.4715806	28.0632782	5.9231218
Personal	12	777	1340.642214	677.0714536	1200.0	1268.345104	593.04000	250.0	6800.0	6550.0	1.7357745	7.0446395	24.2898031
PhD	13	777	72.660232	16.3281547	75.0	73.922954	17.79120	8.0	103.0	95.0	-0.7652067	0.5442923	0.5857693
Terminal	14	777	79.702703	14.7223585	82.0	81.102729	14.82600	24.0	100.0	76.0	-0.8133924	0.2244365	0.5281617
S.F.Ratio	15	777	14.089704	3.9583491	13.6	13.935795	3.40998	2.5	39.8	37.3	0.6648606	2.5228017	0.1420050
perc.alumni	16	777	22.743887	12.3918015	21.0	21.857143	13.34340	0.0	64.0	64.0	0.6045500	-0.1113466	0.4445534
Expend	17	777	9660.171171	5221.7684399	8377.0	8823.704655	2730.94920	3186.0	56233.0	53047.0	3.4459767	18.5875365	187.3298993
Grad.Rate	18	777	65.463320	17.1777099	65.0	65.601926	17.79120	10.0	118.0	108.0	-0.1133384	-0.2187930	0.6162469

# Check for missing values
sum(is.na(College))

## [1] 0

# Box plot of out-of-state tuition (Outstate) by university type (Private)
boxplot(Outstate ~ Private, data = College,
        main = "Out-of-State Tuition by University Type",
        xlab = "Private University",
        ylab = "Out-of-State Tuition",
        col = c("lightblue", "lightgreen"),
        border = "black")

# Add a caption below the x-axis label (a larger 'line' value moves it further down)
mtext("Figure 1: Box plot of out-of-state tuition by university type",
      side = 1, line = 4, adj = 0.5, cex = 0.8, font = 3)

# Adjusting the caption style
par(mar = c(5, 4, 4, 2) + 0.1)  # Adjusting margins

The box plot shows that out-of-state tuition is noticeably higher at private universities than at public ones. For reference, across all 777 colleges the median out-of-state tuition is $9,990 and the maximum is $21,700 (Table 1); these are overall figures, not group medians. This suggests that private universities are generally more expensive to attend than public universities.
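The group medians behind the box plot can be computed directly rather than read off the figure. The following is a small sketch using base R's tapply on the same College data; the variable name group_medians is introduced here for illustration.

```r
library(ISLR)
data("College")

# Median out-of-state tuition within each university type ("No" = public, "Yes" = private)
group_medians <- tapply(College$Outstate, College$Private, median)
group_medians

# Difference between the private and public medians
unname(group_medians["Yes"] - group_medians["No"])
```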

# Count the occurrences of each type of university
counts <- table(College$Private)

# Create a bar plot
barplot(counts, 
        main = "Bar Plot of University Types",
        col = c("lightblue", "lightgreen"),  # You can customize colors here
        xlab = "Type of University",
        ylab = "Count",
        ylim= c(0,600),
        names.arg = c("Public", "Private"),
        cex.names = 0.8,  # Adjust the size of axis labels
        border = "black"  # Add border to the bars for better visibility
)

# Add caption
mtext("Figure 2: Bar plot showing types of universities with counts", side = 1, line = 4, cex = 0.8, font = 3)

# Adjust margins
par(mar = c(5, 4, 4, 2) + 0.1)

The bar plot shows that most universities in the dataset are private (565 of 777, about 73%), while 212 (about 27%) are public. Both types are well represented, but the classes are imbalanced, which should be kept in mind when judging the accuracy of a classifier.

# Create a histogram
hist(College$Apps, 
     breaks = 50, 
     col = "slateblue", 
     border = "black",
     main = "Histogram of Applications",
     xlab = "Number of Applications",
     ylab = "Frequency",
     ylim = c(0,300))

# Add a caption
mtext("Figure 3: Histogram showing number of applications", side = 1, line = 4, cex = 0.8, font = 3)

The above histogram is positively skewed (right-skewed), consistent with the skewness of 3.71 reported in Table 2. Most colleges receive relatively few applications (the median is 1,558), while a long right tail extends to a maximum of 48,094.
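The direction of the skew can be checked numerically. The sketch below computes a simple moment-based skewness in base R (this formula may differ slightly from the estimator used by psych::describe) and, as one possible follow-up, redraws the histogram on a log10 scale.

```r
library(ISLR)
data("College")

apps <- College$Apps

# Moment-based sample skewness: a positive value indicates a long right tail
skew_apps <- mean((apps - mean(apps))^3) / mean((apps - mean(apps))^2)^1.5
skew_apps

# A mean well above the median is another sign of right skew
mean(apps) > median(apps)

# On a log10 scale the distribution is easier to inspect
hist(log10(apps), breaks = 30, col = "slateblue", border = "black",
     main = "Histogram of log10(Applications)",
     xlab = "log10(Number of Applications)")
```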

# Histogram for Graduation Rate (Grad.Rate)
hist(College$Grad.Rate, 
     breaks = 50, 
     col = "slateblue", 
     border = "black",
     main = "Histogram of Graduation Rates",
     xlab = "Graduation Rate",
     ylab = "Frequency")

# Add a caption
mtext("Figure 4: Histogram showing graduation rate", side = 1, line = 4, cex = 0.8, font = 3, col = "black")

The above histogram is a visualization of the distribution of graduation rates at colleges in the United States. The horizontal axis shows the graduation rate, and the vertical axis shows the frequency of colleges with that graduation rate.

The histogram shows that the distribution of graduation rates is roughly symmetric, with only a slight negative skew (about -0.11 in Table 2), and is centered near 65%. The maximum of 118 exceeds 100%, which is not a valid rate and points to a data-entry error in at least one record.
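Records with an impossible graduation rate can be listed with a short filter. This is a sketch only; capping the values at 100 is one possible treatment and is shown as an assumption, not as a step taken in this report.

```r
library(ISLR)
data("College")

# Colleges whose reported graduation rate exceeds 100%
over_100 <- College[College$Grad.Rate > 100, "Grad.Rate", drop = FALSE]
over_100

# One possible treatment (an assumption, not applied elsewhere in this report): cap at 100
grad_rate_capped <- pmin(College$Grad.Rate, 100)
summary(grad_rate_capped)
```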

# Pairwise scatter plot for numerical variables
pairs(College[, 2:11])

# Add figure caption
title(sub = "Figure 5: Pairwise Scatter Plot Matrix for numerical variables")

# Correlation matrix
cor_matrix <- cor(College[, 2:18])
cor_matrix

##                    Apps      Accept      Enroll  Top10perc   Top25perc
## Apps         1.00000000  0.94345057  0.84682205  0.3388337  0.35163990
## Accept       0.94345057  1.00000000  0.91163666  0.1924469  0.24747574
## Enroll       0.84682205  0.91163666  1.00000000  0.1812935  0.22674511
## Top10perc    0.33883368  0.19244693  0.18129353  1.0000000  0.89199497
## Top25perc    0.35163990  0.24747574  0.22674511  0.8919950  1.00000000
## F.Undergrad  0.81449058  0.87422328  0.96463965  0.1412887  0.19944466
## P.Undergrad  0.39826427  0.44127073  0.51306860 -0.1053563 -0.05357664
## Outstate     0.05015903 -0.02575455 -0.15547734  0.5623305  0.48939383
## Room.Board   0.16493896  0.09089863 -0.04023168  0.3714804  0.33148989
## Books        0.13255860  0.11352535  0.11271089  0.1188584  0.11552713
## Personal     0.17873085  0.20098867  0.28092946 -0.0933164 -0.08081027
## PhD          0.39069733  0.35575788  0.33146914  0.5318280  0.54586221
## Terminal     0.36949147  0.33758337  0.30827407  0.4911350  0.52474884
## S.F.Ratio    0.09563303  0.17622901  0.23727131 -0.3848745 -0.29462884
## perc.alumni -0.09022589 -0.15998987 -0.18079413  0.4554853  0.41786429
## Expend       0.25959198  0.12471701  0.06416923  0.6609134  0.52744743
## Grad.Rate    0.14675460  0.06731255 -0.02234104  0.4949892  0.47728116
##             F.Undergrad P.Undergrad    Outstate  Room.Board        Books
## Apps         0.81449058  0.39826427  0.05015903  0.16493896  0.132558598
## Accept       0.87422328  0.44127073 -0.02575455  0.09089863  0.113525352
## Enroll       0.96463965  0.51306860 -0.15547734 -0.04023168  0.112710891
## Top10perc    0.14128873 -0.10535628  0.56233054  0.37148038  0.118858431
## Top25perc    0.19944466 -0.05357664  0.48939383  0.33148989  0.115527130
## F.Undergrad  1.00000000  0.57051219 -0.21574200 -0.06889039  0.115549761
## P.Undergrad  0.57051219  1.00000000 -0.25351232 -0.06132551  0.081199521
## Outstate    -0.21574200 -0.25351232  1.00000000  0.65425640  0.038854868
## Room.Board  -0.06889039 -0.06132551  0.65425640  1.00000000  0.127962970
## Books        0.11554976  0.08119952  0.03885487  0.12796297  1.000000000
## Personal     0.31719954  0.31988162 -0.29908690 -0.19942818  0.179294764
## PhD          0.31833697  0.14911422  0.38298241  0.32920228  0.026905731
## Terminal     0.30001894  0.14190357  0.40798320  0.37453955  0.099954700
## S.F.Ratio    0.27970335  0.23253051 -0.55482128 -0.36262774 -0.031929274
## perc.alumni -0.22946222 -0.28079236  0.56626242  0.27236345 -0.040207736
## Expend       0.01865162 -0.08356842  0.67277862  0.50173942  0.112409075
## Grad.Rate   -0.07877313 -0.25700099  0.57128993  0.42494154  0.001060894
##                Personal         PhD    Terminal   S.F.Ratio perc.alumni
## Apps         0.17873085  0.39069733  0.36949147  0.09563303 -0.09022589
## Accept       0.20098867  0.35575788  0.33758337  0.17622901 -0.15998987
## Enroll       0.28092946  0.33146914  0.30827407  0.23727131 -0.18079413
## Top10perc   -0.09331640  0.53182802  0.49113502 -0.38487451  0.45548526
## Top25perc   -0.08081027  0.54586221  0.52474884 -0.29462884  0.41786429
## F.Undergrad  0.31719954  0.31833697  0.30001894  0.27970335 -0.22946222
## P.Undergrad  0.31988162  0.14911422  0.14190357  0.23253051 -0.28079236
## Outstate    -0.29908690  0.38298241  0.40798320 -0.55482128  0.56626242
## Room.Board  -0.19942818  0.32920228  0.37453955 -0.36262774  0.27236345
## Books        0.17929476  0.02690573  0.09995470 -0.03192927 -0.04020774
## Personal     1.00000000 -0.01093579 -0.03061311  0.13634483 -0.28596808
## PhD         -0.01093579  1.00000000  0.84958703 -0.13053011  0.24900866
## Terminal    -0.03061311  0.84958703  1.00000000 -0.16010395  0.26713029
## S.F.Ratio    0.13634483 -0.13053011 -0.16010395  1.00000000 -0.40292917
## perc.alumni -0.28596808  0.24900866  0.26713029 -0.40292917  1.00000000
## Expend      -0.09789189  0.43276168  0.43879922 -0.58383204  0.41771172
## Grad.Rate   -0.26934396  0.30503785  0.28952723 -0.30671041  0.49089756
##                  Expend    Grad.Rate
## Apps         0.25959198  0.146754600
## Accept       0.12471701  0.067312550
## Enroll       0.06416923 -0.022341039
## Top10perc    0.66091341  0.494989235
## Top25perc    0.52744743  0.477281164
## F.Undergrad  0.01865162 -0.078773129
## P.Undergrad -0.08356842 -0.257000991
## Outstate     0.67277862  0.571289928
## Room.Board   0.50173942  0.424941541
## Books        0.11240908  0.001060894
## Personal    -0.09789189 -0.269343964
## PhD          0.43276168  0.305037850
## Terminal     0.43879922  0.289527232
## S.F.Ratio   -0.58383204 -0.306710405
## perc.alumni  0.41771172  0.490897562
## Expend       1.00000000  0.390342696
## Grad.Rate    0.39034270  1.000000000

# Visualize the correlation matrix using pie-chart glyphs
corrplot(cor_matrix, method = "pie",
         main = "Figure 6 : Correlation Plot with Pie Charts",
         mar = c(0, 0, 2, 0))

The correlation plot with pie charts (Figure 6) provides a visual representation of these relationships. Variables with strong correlations are depicted with larger pie slices, and the direction of the correlation is indicated by the color (positive in blue, negative in red). For example, Apps, Accept, Enroll, and F.Undergrad are strongly positively correlated with one another (r between 0.81 and 0.96), which signals multicollinearity among the size-related variables. This exploration lays the groundwork for understanding the interconnections between various numerical variables in the dataset.
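To complement the plot, the strongest pairs can be listed explicitly. The sketch below keeps the upper triangle of the correlation matrix and filters on an absolute correlation above 0.8; the threshold and the helper names (cor_upper, pairs_long, strong_pairs) are choices made here for illustration.

```r
library(ISLR)
data("College")

cor_matrix <- cor(College[, 2:18])

# Keep each pair once by blanking the diagonal and the lower triangle
cor_upper <- cor_matrix
cor_upper[!upper.tri(cor_upper)] <- NA

# Reshape to long format: one row per pair of variables
pairs_long <- as.data.frame(as.table(cor_upper), stringsAsFactors = FALSE)
names(pairs_long) <- c("var1", "var2", "r")

# Pairs with an absolute correlation above 0.8, strongest first
strong_pairs <- subset(pairs_long, !is.na(r) & abs(r) > 0.8)
strong_pairs[order(-abs(strong_pairs$r)), ]
```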

Task 2

Split the data into a train and test set – refer to the ALY6015_Feature_Selection_R.pdf document for information on how to split a dataset.

In the process of modeling, it is common and essential practice to partition the dataset into two distinct subsets: a training dataset and a testing dataset. The training dataset is used to fit the model, while the testing dataset is used to assess the model’s performance. This division is crucial for verifying that the model generalizes effectively to new, unseen data, guarding against the risk of overfitting (Brownlee, 2020).

A frequently employed strategy is the 70/30 split, where 70% of the dataset is allocated to the training set, and the remaining 30% is designated for testing. This approach strikes a balance between providing the model with sufficient data for learning and ensuring a robust evaluation on independent test data.

# Split the Data into Training and Testing Sets
set.seed(123)  # For reproducibility
splitIndex <- createDataPartition(College$Private, p = .70, list = FALSE, times = 1)
College_train <- College[splitIndex,]
College_test <- College[-splitIndex,]
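Because createDataPartition samples within each level of the outcome, the private/public proportions should be similar in both subsets. The sketch below repeats the split under the same seed and compares sizes and class proportions; the names train_set and test_set are introduced here only for the check.

```r
library(ISLR)
library(caret)
data("College")

set.seed(123)  # same seed as the split above
split_index <- createDataPartition(College$Private, p = 0.70, list = FALSE)
train_set <- College[split_index, ]
test_set  <- College[-split_index, ]

# Number of rows in each subset
c(train = nrow(train_set), test = nrow(test_set))

# Class proportions should be close in both subsets
train_prop <- prop.table(table(train_set$Private))
test_prop  <- prop.table(table(test_set$Private))
round(rbind(train = train_prop, test = test_prop), 3)
```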

Task 3

Fit logistic regression model

In our endeavor to construct a predictive logistic regression model for discerning whether a university is private or public, we leverage the versatile Generalized Linear Model (GLM) framework (Kabacoff, 2015). GLM serves as the cornerstone of our modeling approach, providing a flexible and powerful tool for analyzing binary outcomes. The glm function acts as our guide in fitting logistic regression models, facilitating the estimation of coefficients that best describe the relationship between predictor variables and the binary response variable.

Our modeling journey unfolds with Forward Selection, a methodical process of iteratively adding predictors to optimize model fit. We also employ Backward Selection, which starts from the full model and systematically eliminates predictors for enhanced model simplicity. The stepAIC function from the MASS package performs both searches, using the Akaike Information Criterion (AIC) to decide which term to add or drop at each step. The resulting models are then compared using both AIC and the Bayesian Information Criterion (BIC).

Furthermore, our exploration extends to Best Subset Selection, facilitated by the regsubsets function from the leaps package. This exhaustive examination of model configurations enables us to pinpoint the combination of predictors with the lowest Bayesian Information Criterion (BIC). Note that regsubsets fits linear (least squares) models, so here it serves as a screening step for choosing predictors; the selected predictors are then refit with glm as a logistic regression.
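Since regsubsets works on a numeric response, one hedged way to make the screening step explicit is to code the outcome as 0/1 before the search. The sketch below does this on the full College data for brevity (the report itself uses the training set); college_num, PrivateNum, and the other names are introduced here for illustration.

```r
library(ISLR)
library(leaps)
data("College")

# Explicit 0/1 coding of the outcome (1 = private); the factor column is dropped
college_num <- College[, names(College) != "Private"]
college_num$PrivateNum <- as.numeric(College$Private == "Yes")

# Exhaustive least-squares search over subsets of up to 17 predictors
subset_fit <- regsubsets(PrivateNum ~ ., data = college_num, nvmax = 17)
subset_bic <- summary(subset_fit)$bic

# Subset size with the lowest BIC and the predictors it contains
best_size <- which.min(subset_bic)
best_size
names(coef(subset_fit, best_size))
```

The ranking comes from a linear probability approximation, so it is best treated as a shortlist to be confirmed by refitting with glm.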

# Forward Selection
forward_model <- glm(Private ~ 1, data = College_train, family = binomial)
forward_step <- stepAIC(forward_model,
                        scope = list(lower = formula(forward_model),
                                     upper = ~ .),
                        direction = "forward", trace = FALSE)
summary(forward_step)

## 
## Call:
## glm(formula = Private ~ 1, family = binomial, data = College_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6105  -1.6105   0.7992   0.7992   0.7992  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.97747    0.09611   10.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 639.4  on 544  degrees of freedom
## Residual deviance: 639.4  on 544  degrees of freedom
## AIC: 641.4
## 
## Number of Fisher Scoring iterations: 4

The forward selection process is meant to begin with a null model containing only the intercept and then add, one at a time, the predictor that lowers the AIC the most.

In this run, however, no predictor was added: the model returned by stepAIC is still Private ~ 1. The reason lies in the scope argument. In a stepAIC scope formula, the dot in upper = ~ . stands for the terms already in the current model, so the upper scope was identical to the null model and the search had no candidate predictors to consider.

The output above is therefore a summary of the null model itself. Its intercept of 0.97747 is the log-odds of a college being private in the training set, which corresponds to a probability of about 0.73. Its residual deviance equals the null deviance (639.4 on 544 degrees of freedom), and its AIC is 641.4.

These values serve as a baseline: any useful model should achieve a much lower deviance and AIC.
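A forward search that can actually add predictors needs an upper scope that names them explicitly. The sketch below builds that formula with reformulate and runs stepAIC on the full College data for brevity (substituting the training set is straightforward); its results are not reported here, and the names null_model, upper_formula, and forward_fixed are introduced for illustration.

```r
library(ISLR)
library(MASS)
data("College")

# Intercept-only starting model
null_model <- glm(Private ~ 1, data = College, family = binomial)

# Upper scope listing every candidate predictor by name
upper_formula <- reformulate(setdiff(names(College), "Private"))

# Forward search: at each step, add the term that lowers AIC the most.
# glm may warn about fitted probabilities of 0 or 1 along the way.
forward_fixed <- stepAIC(null_model,
                         scope = list(lower = ~ 1, upper = upper_formula),
                         direction = "forward", trace = FALSE)

formula(forward_fixed)
AIC(null_model, forward_fixed)
```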

# Backward Selection

backward_model <- glm(Private ~ ., data = College_train, family = binomial)
backward_step <- stepAIC(backward_model, 
                         direction = "backward", 
                         trace = FALSE)
summary(backward_step)

## 
## Call:
## glm(formula = Private ~ Apps + Enroll + F.Undergrad + Outstate + 
##     PhD + perc.alumni + Expend + Grad.Rate, family = binomial, 
##     data = College_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8647  -0.0159   0.0473   0.1561   3.0302  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.8963348  1.3926982  -1.362 0.173316    
## Apps        -0.0003739  0.0002027  -1.845 0.065102 .  
## Enroll       0.0019856  0.0010665   1.862 0.062636 .  
## F.Undergrad -0.0007323  0.0002338  -3.132 0.001738 ** 
## Outstate     0.0007395  0.0001351   5.473 4.44e-08 ***
## PhD         -0.0766800  0.0200659  -3.821 0.000133 ***
## perc.alumni  0.0359692  0.0259347   1.387 0.165468    
## Expend       0.0002574  0.0001424   1.807 0.070735 .  
## Grad.Rate    0.0274096  0.0162281   1.689 0.091215 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 639.40  on 544  degrees of freedom
## Residual deviance: 151.78  on 536  degrees of freedom
## AIC: 169.78
## 
## Number of Fisher Scoring iterations: 8

The backward selection process starts with a model containing all 17 predictors. At each step, stepAIC drops the predictor whose removal lowers the AIC the most, and it stops when no removal improves the AIC.

The null deviance is 639.4 on 544 degrees of freedom. The final model has a residual deviance of 151.78 on 536 degrees of freedom, a large reduction relative to the intercept-only model.

The final model retains eight predictors: Apps, Enroll, F.Undergrad, Outstate, PhD, perc.alumni, Expend, and Grad.Rate. Of these, F.Undergrad, Outstate, and PhD are significant at the 0.01 level; the others are retained because they improve the AIC, even though their individual p-values exceed 0.05.
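Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios that are easier to interpret. The sketch below uses two estimates copied from the summary above; rescaling Outstate to $1,000 steps is a presentation choice made here.

```r
# Estimates copied from the backward-selection summary above
beta_outstate <- 0.0007395   # per dollar of out-of-state tuition
beta_phd      <- -0.0766800  # per percentage point of faculty with a PhD

# Odds multiplier for a $1,000 increase in out-of-state tuition
or_outstate_1000 <- exp(beta_outstate * 1000)

# Odds multiplier for a one-point increase in the PhD percentage
or_phd <- exp(beta_phd)

round(c(outstate_per_1000 = or_outstate_1000, phd_per_point = or_phd), 3)
```

Holding the other predictors fixed, each additional $1,000 of out-of-state tuition roughly doubles the odds that a college is private, while each additional percentage point of PhD faculty lowers those odds by roughly 7%.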

# Model Comparison
AIC(forward_step, backward_step)

BIC(forward_step, backward_step)

#logLik(forward_step, backward_step)

Comparing AIC and BIC between the two models shows that the backward model fits far better: its AIC is 169.78, against 641.4 for the intercept-only model returned by the forward search, and its BIC is 208.48.
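For a logistic regression on individual (ungrouped) binary outcomes, the residual deviance equals minus twice the log-likelihood, so both criteria can be reproduced by hand from the summary output. The sketch below uses the rounded figures printed above, which is why the BIC may differ from the reported value in the second decimal place.

```r
# Figures taken from the backward-selection summary above
residual_deviance <- 151.78
n_obs <- 545  # 544 null degrees of freedom + 1
k     <- 9    # intercept + 8 retained predictors

# AIC = deviance + 2k ; BIC = deviance + k * log(n)
aic_manual <- residual_deviance + 2 * k
bic_manual <- residual_deviance + k * log(n_obs)

round(c(AIC = aic_manual, BIC = bic_manual), 2)
```

BIC penalizes each parameter by log(545), roughly 6.3, instead of 2, which is why it favors smaller models than AIC on this training set.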

# Best Subset Selection
sub_model <- regsubsets(Private ~ ., data = College_train, nbest = 1, nvmax = NULL, method = "exhaustive")
summary(sub_model)

## Subset selection object
## Call: regsubsets.formula(Private ~ ., data = College_train, nbest = 1, 
##     nvmax = NULL, method = "exhaustive")
## 17 Variables  (and intercept)
##             Forced in Forced out
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Outstate        FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: exhaustive
##           Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1  ( 1 )  " "  " "    " "    " "       " "       "*"         " "        
## 2  ( 1 )  " "  " "    " "    " "       " "       "*"         " "        
## 3  ( 1 )  " "  " "    " "    " "       " "       "*"         " "        
## 4  ( 1 )  " "  " "    " "    " "       " "       "*"         " "        
## 5  ( 1 )  " "  " "    " "    " "       " "       "*"         " "        
## 6  ( 1 )  " "  " "    " "    " "       " "       "*"         " "        
## 7  ( 1 )  "*"  "*"    " "    " "       " "       "*"         " "        
## 8  ( 1 )  "*"  "*"    " "    " "       " "       "*"         " "        
## 9  ( 1 )  "*"  "*"    " "    " "       " "       "*"         " "        
## 10  ( 1 ) "*"  "*"    " "    " "       " "       "*"         " "        
## 11  ( 1 ) "*"  "*"    " "    " "       " "       "*"         " "        
## 12  ( 1 ) "*"  "*"    " "    "*"       " "       "*"         " "        
## 13  ( 1 ) "*"  "*"    " "    "*"       "*"       "*"         " "        
## 14  ( 1 ) "*"  "*"    " "    "*"       "*"       "*"         " "        
## 15  ( 1 ) "*"  "*"    " "    "*"       "*"       "*"         "*"        
## 16  ( 1 ) "*"  "*"    "*"    "*"       "*"       "*"         "*"        
## 17  ( 1 ) "*"  "*"    "*"    "*"       "*"       "*"         "*"        
##           Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 1  ( 1 )  " "      " "        " "   " "      " " " "      " "       " "        
## 2  ( 1 )  "*"      " "        " "   " "      " " " "      " "       " "        
## 3  ( 1 )  "*"      " "        " "   " "      "*" " "      " "       " "        
## 4  ( 1 )  "*"      " "        " "   " "      "*" " "      "*"       " "        
## 5  ( 1 )  "*"      " "        " "   " "      "*" " "      "*"       " "        
## 6  ( 1 )  "*"      " "        " "   " "      "*" " "      "*"       " "        
## 7  ( 1 )  "*"      " "        " "   " "      "*" " "      "*"       " "        
## 8  ( 1 )  "*"      " "        " "   " "      "*" "*"      "*"       " "        
## 9  ( 1 )  "*"      " "        " "   " "      "*" "*"      "*"       "*"        
## 10  ( 1 ) "*"      " "        "*"   " "      "*" "*"      "*"       "*"        
## 11  ( 1 ) "*"      " "        "*"   " "      "*" "*"      "*"       "*"        
## 12  ( 1 ) "*"      "*"        " "   " "      "*" "*"      "*"       "*"        
## 13  ( 1 ) "*"      "*"        " "   " "      "*" "*"      "*"       "*"        
## 14  ( 1 ) "*"      "*"        "*"   " "      "*" "*"      "*"       "*"        
## 15  ( 1 ) "*"      "*"        "*"   " "      "*" "*"      "*"       "*"        
## 16  ( 1 ) "*"      "*"        "*"   " "      "*" "*"      "*"       "*"        
## 17  ( 1 ) "*"      "*"        "*"   "*"      "*" "*"      "*"       "*"        
##           Expend Grad.Rate
## 1  ( 1 )  " "    " "      
## 2  ( 1 )  " "    " "      
## 3  ( 1 )  " "    " "      
## 4  ( 1 )  " "    " "      
## 5  ( 1 )  " "    "*"      
## 6  ( 1 )  "*"    "*"      
## 7  ( 1 )  " "    "*"      
## 8  ( 1 )  " "    "*"      
## 9  ( 1 )  " "    "*"      
## 10  ( 1 ) " "    "*"      
## 11  ( 1 ) "*"    "*"      
## 12  ( 1 ) "*"    "*"      
## 13  ( 1 ) "*"    "*"      
## 14  ( 1 ) "*"    "*"      
## 15  ( 1 ) "*"    "*"      
## 16  ( 1 ) "*"    "*"      
## 17  ( 1 ) "*"    "*"

sub_summary <- summary(sub_model)

# BIC values for all the models
bic_values <- sub_summary$bic

# The index of the model with the lowest BIC
best_bic_index <- which.min(bic_values)

# Logical vector indicating whether a variable is included in the best model
best_model_vars <- sub_summary$which[best_bic_index, ]

# Names of the variables included in the best model
best_var_names <- names(which(best_model_vars))

# Ensure 'Intercept' is not in the list of predictors
best_var_names <- best_var_names[best_var_names != "(Intercept)"]

# Check the predictors
print(best_var_names)

## [1] "Apps"        "Accept"      "F.Undergrad" "Outstate"    "PhD"        
## [6] "S.F.Ratio"   "Grad.Rate"

# Construct the formula
best_formula_str <- paste("Private ~", paste(best_var_names, collapse = " + "))
print(best_formula_str)

## [1] "Private ~ Apps + Accept + F.Undergrad + Outstate + PhD + S.F.Ratio + Grad.Rate"

# Convert the string to a formula
best_formula <- as.formula(best_formula_str)

# Fit the logistic regression model
best_model <- glm(best_formula, data = College_train, family = "binomial")
summary(best_model)

## 
## Call:
## glm(formula = best_formula, family = "binomial", data = College_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8496  -0.0166   0.0576   0.1559   3.0368  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.3556121  1.9628415   0.691 0.489793    
## Apps        -0.0003305  0.0002726  -1.212 0.225393    
## Accept       0.0004010  0.0004959   0.809 0.418716    
## F.Undergrad -0.0005493  0.0001757  -3.126 0.001774 ** 
## Outstate     0.0007682  0.0001209   6.355 2.08e-10 ***
## PhD         -0.0689944  0.0184782  -3.734 0.000189 ***
## S.F.Ratio   -0.1279705  0.0729849  -1.753 0.079536 .  
## Grad.Rate    0.0344966  0.0158932   2.171 0.029967 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 639.40  on 544  degrees of freedom
## Residual deviance: 156.75  on 537  degrees of freedom
## AIC: 172.75
## 
## Number of Fisher Scoring iterations: 8

Best subset selection explores all possible combinations of predictor variables to identify the model with the lowest BIC.

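The exhaustive search identifies the seven-predictor model as the one with the lowest BIC. It includes the variables Apps, Accept, F.Undergrad, Outstate, PhD, S.F.Ratio, and Grad.Rate.

Because BIC penalizes every additional parameter, the selected model uses only 7 of the 17 candidate predictors. In the fitted logistic regression, Outstate, PhD, F.Undergrad and Grad.Rate are significant at the 5% level, while Apps, Accept and S.F.Ratio are not.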

All the selection methods provide insights into the importance of certain predictors in predicting whether a university is private or public.

Variable F.Undergrad appears consistently in the selected models, highlighting its significance in discerning university types.
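
To make the BIC comparison concrete, the following sketch ranks a few candidate logistic regressions by BIC using base R only. It is an illustration under stated assumptions, not the code used in this report: the built-in mtcars data stands in for the College data, am stands in for the binary response, and the three formulas are hand-picked examples.

```r
# Illustrative only: mtcars and these formulas are stand-ins, not the College data.
candidates <- list(
  wt_only   = am ~ wt,
  hp_only   = am ~ hp,
  wt_and_hp = am ~ wt + hp
)

# Fit each candidate as a logistic regression and record its BIC.
# BIC = -2 * logLik + log(n) * (number of estimated parameters)
fits <- lapply(candidates, glm, data = mtcars, family = binomial)
bic_by_model <- sapply(fits, BIC)

# Lower BIC is better: an extra predictor must improve the fit enough
# to offset the log(n) penalty it adds.
best_name <- names(which.min(bic_by_model))
print(round(bic_by_model, 2))
print(best_name)
```

The same idea underlies the which.min(bic_values) step above: every candidate is scored with one number, and the smallest score wins.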

Task 4

Create a confusion matrix and report the results of your model predictions on the train set. Interpret and discuss the confusion matrix. Which misclassifications are more damaging for the analysis, False Positives or False Negatives?

In this segment of our report, we delve into the practical aspect of applying our logistic regression model. Following the training phase, the subsequent step involves leveraging our model to make predictions. We aim to assess the model’s proficiency in determining whether a university is private or not based on the provided features.

Our logistic regression model is employed to predict outcomes on the training dataset. These predictions are then juxtaposed with the actual classifications of universities in the training set. The ensuing analysis is encapsulated in a confusion matrix, a pivotal tool for gauging the model’s performance.

A confusion matrix is a table used to describe the performance of a classification model on a set of data for which the true values are known. It summarizes the model’s predictions against the actual classifications (Narkhede, 2021).

The confusionMatrix function from the caret package is used to create the confusion matrix and calculate the associated statistics.

# Make predictions on the training set
pred <- predict(best_model, newdata = College_train, type = "response")
pred_class <- ifelse(pred > 0.5, "Yes", "No")

# Convert the predictions and the actual classes to factors with the same levels
pred_factor <- factor(pred_class, levels = c("No", "Yes"))
actual_factor <- factor(College_train$Private, levels = c("No", "Yes"))

# the confusion matrix
conf_matrix <- confusionMatrix(pred_factor, actual_factor)
print(conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  133  13
##        Yes  16 383
##                                           
##                Accuracy : 0.9468          
##                  95% CI : (0.9245, 0.9641)
##     No Information Rate : 0.7266          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8652          
##                                           
##  Mcnemar's Test P-Value : 0.7103          
##                                           
##             Sensitivity : 0.8926          
##             Specificity : 0.9672          
##          Pos Pred Value : 0.9110          
##          Neg Pred Value : 0.9599          
##              Prevalence : 0.2734          
##          Detection Rate : 0.2440          
##    Detection Prevalence : 0.2679          
##       Balanced Accuracy : 0.9299          
##                                           
##        'Positive' Class : No              
##

The confusion matrix and associated statistics provide a detailed evaluation of the logistic regression model’s performance on the training dataset. The model correctly identified 383 private colleges (Yes) and 133 non-private colleges (No). It misclassified 13 private colleges as non-private and 16 non-private colleges as private. Because caret reports No as the 'Positive' class, the 13 cases are false positives and the 16 cases are false negatives in its output. The resulting accuracy of 94.68% is well above the no-information rate of 72.66%, which is the accuracy obtained by always predicting the majority class (private).

Under the same convention, sensitivity is the share of non-private colleges that are correctly identified (133 of 149, or 89.26%), and specificity is the share of private colleges that are correctly identified (383 of 396, or 96.72%). The positive predictive value (precision) is 91.10%: when the model predicts that a college is non-private, it is correct 133 times out of 146.

The Kappa statistic, which corrects for the agreement expected by chance, is 0.8652, reflecting strong agreement beyond chance. The balanced accuracy of 92.99%, the mean of sensitivity and specificity, accounts for the class imbalance.

These metrics collectively suggest that the logistic regression model distinguishes well between private and non-private colleges on the training dataset.
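
The summary statistics in the caret output can be reproduced by hand from the four cell counts. The sketch below does this in base R for the no-information rate, Kappa and balanced accuracy. The counts are copied from the training confusion matrix above; the variable names are illustrative.

```r
# Training-set counts copied from the confusion matrix above
# (rows = prediction, columns = reference; filled column by column).
tab <- matrix(c(133, 16, 13, 383), nrow = 2,
              dimnames = list(Prediction = c("No", "Yes"),
                              Reference  = c("No", "Yes")))
n <- sum(tab)

# Observed agreement: share of colleges on the diagonal.
accuracy <- sum(diag(tab)) / n

# No-information rate: accuracy of always predicting the larger class.
no_info_rate <- max(colSums(tab)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(tab) * colSums(tab)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# caret's convention in this report: "No" is the positive class.
sensitivity <- tab["No", "No"] / sum(tab[, "No"])
specificity <- tab["Yes", "Yes"] / sum(tab[, "Yes"])
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, no_info_rate = no_info_rate,
              kappa = cohen_kappa, balanced_accuracy = balanced_accuracy), 4))
```

The printed values agree with the Accuracy, No Information Rate, Kappa and Balanced Accuracy lines of the output above.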

Interpretation and Discussion of the Confusion Matrix:

False Positives and False Negatives have different implications, and which error is which depends on the class treated as positive. In the caret output above, the positive class is No (non-private), so a False Positive is a private college predicted as non-private (13 cases) and a False Negative is a non-private college predicted as private (16 cases).

False Negatives lead to an overestimation of the number of private colleges, which could affect resource allocation or other decisions based on this prediction. False Positives lead to an underestimation of the number of private colleges, potentially missing opportunities or neglecting characteristics unique to private institutions.

The balance between the two types of errors depends on the specific goals and consequences of the analysis. If one misclassification is more costly than the other, the decision threshold of the model (0.5 here) can be adjusted to trade one type of error for the other. The sensitivity (True Positive Rate) and specificity (True Negative Rate) metrics in the statistics summary provide additional context for evaluating the model’s performance with respect to these misclassifications.

In summary, the nature of the problem and the specific consequences associated with misclassifications determine whether False Positives or False Negatives are more damaging. Understanding the potential impact of each type of error is crucial for making informed decisions based on the model’s predictions.
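
The effect of moving the decision threshold can be seen in a small base R sketch. The eight probabilities and labels below are made-up toy values, not taken from the College data, and errors_at is a hypothetical helper written for this illustration.

```r
# Toy data (hypothetical): predicted probability of "Yes" and the true class.
prob_private <- c(0.95, 0.85, 0.65, 0.55, 0.45, 0.35, 0.15, 0.05)
actual <- c("Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No")

# Count both error types at a given cutoff.
errors_at <- function(threshold) {
  predicted <- ifelse(prob_private > threshold, "Yes", "No")
  c(threshold = threshold,
    private_called_public = sum(actual == "Yes" & predicted == "No"),
    public_called_private = sum(actual == "No" & predicted == "Yes"))
}

# A lower cutoff labels more colleges private; a higher cutoff labels fewer.
error_table <- t(sapply(c(0.3, 0.5, 0.7), errors_at))
print(error_table)
```

On these toy values, lowering the cutoff removes one kind of error at the price of more of the other, which is the trade-off described above.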

Task 5

Report and interpret metrics for Accuracy, Precision, Recall, and Specificity.

# Extracting the metrics from the confusion matrix

# Accuracy
accuracy <- conf_matrix$overall['Accuracy']
accuracy

## Accuracy 
## 0.946789

Accuracy represents the overall correctness of the model’s predictions. In this case, the model is approximately 94.68% accurate, meaning it correctly predicts whether a college is private or not in nearly 95% of cases.

High accuracy is generally positive, but it’s crucial to consider the context of the problem. The training set is imbalanced: about 73% of the colleges are private, so a model that always predicts private would already reach 72.66% accuracy. Accuracy alone therefore does not provide a complete picture.

#Precision
precision <- conf_matrix$byClass['Pos Pred Value'][1]
precision

## Pos Pred Value 
##      0.9109589

Precision, also known as Positive Predictive Value, measures the accuracy of the positive predictions made by the model. Because caret treats No as the positive class, the value refers to non-private predictions: when the model predicts a college to be non-private, it is correct approximately 91.10% of the time (133 of 146). The corresponding figure for private predictions is the Negative Predictive Value, 95.99% (383 of 399).

High precision is important when resources or decisions are directly linked to the predicted class.

#Recall
recall <- conf_matrix$byClass['Sensitivity'][1]
recall

## Sensitivity 
##   0.8926174

Recall, or Sensitivity, measures the ability of the model to correctly identify positive instances. With No as the positive class, the model captures about 89.26% of actual non-private colleges (133 of 149).

A high recall is valuable when it is crucial not to miss positive instances. In this context, it implies that the model identifies most non-private colleges, while about 11% of them are labeled private.

#Specificity
specificity <- conf_matrix$byClass['Specificity'][1]
specificity

## Specificity 
##   0.9671717

Specificity measures the ability of the model to correctly identify negative instances. With Yes as the negative class, the model correctly identifies approximately 96.72% of private colleges (383 of 396).

High specificity is desirable when failing to recognize a private college carries specific consequences.
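
Precision, recall and specificity all depend on which class is treated as positive. The base R sketch below computes them from the training counts for both choices. The helper class_metrics is a hypothetical function written for this illustration and is not part of caret.

```r
# Training-set counts copied from the confusion matrix in Task 4
# (rows = prediction, columns = reference).
tab <- matrix(c(133, 16, 13, 383), nrow = 2,
              dimnames = list(Prediction = c("No", "Yes"),
                              Reference  = c("No", "Yes")))

# Hypothetical helper: metrics for whichever class is called "positive".
class_metrics <- function(tab, positive) {
  negative <- setdiff(colnames(tab), positive)
  tp <- tab[positive, positive]   # predicted positive, actually positive
  fp <- tab[positive, negative]   # predicted positive, actually negative
  fn <- tab[negative, positive]   # predicted negative, actually positive
  tn <- tab[negative, negative]   # predicted negative, actually negative
  c(accuracy    = (tp + tn) / sum(tab),
    precision   = tp / (tp + fp),
    recall      = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

metrics_no  <- class_metrics(tab, positive = "No")
metrics_yes <- class_metrics(tab, positive = "Yes")
print(round(rbind(positive_No = metrics_no, positive_Yes = metrics_yes), 4))
```

Accuracy is the same in both rows, while recall and specificity swap places. In caret, the positive argument of confusionMatrix selects the positive class directly, so it should be possible to obtain the second row from caret as well.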

Task 6

Create a confusion matrix and report the results of your model for the test set. Compare the results with the train set and interpret.

In this section, we extend our analysis to evaluate the predictive performance of the logistic regression model on a previously unseen dataset, the test set. The primary objective is to assess how well the model, trained on the training set, generalizes to new data. The confusion matrix and associated metrics provide insights into the model’s ability to classify colleges as private or non-private on the test set. By comparing the results between the training and test sets, we aim to understand the model’s robustness and identify any potential overfitting or underfitting issues. This evaluation is crucial for gauging the model’s reliability in real-world scenarios beyond the training data.

# Predictions on the test set
test_pred_prob <- predict(best_model, newdata = College_test, type = "response")
test_pred_class <- ifelse(test_pred_prob > 0.5, "Yes", "No")

# Convert the predictions and the actual classes to factors with the same levels
test_pred_factor <- factor(test_pred_class, levels = c("No", "Yes"))
test_actual_factor <- factor(College_test$Private, levels = c("No", "Yes"))

# Create the confusion matrix for the test set
test_conf_matrix <- confusionMatrix(test_pred_factor,test_actual_factor)

print(test_conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No   54   7
##        Yes   9 162
##                                           
##                Accuracy : 0.931           
##                  95% CI : (0.8904, 0.9601)
##     No Information Rate : 0.7284          
##     P-Value [Acc > NIR] : 4.085e-15       
##                                           
##                   Kappa : 0.8239          
##                                           
##  Mcnemar's Test P-Value : 0.8026          
##                                           
##             Sensitivity : 0.8571          
##             Specificity : 0.9586          
##          Pos Pred Value : 0.8852          
##          Neg Pred Value : 0.9474          
##              Prevalence : 0.2716          
##          Detection Rate : 0.2328          
##    Detection Prevalence : 0.2629          
##       Balanced Accuracy : 0.9079          
##                                           
##        'Positive' Class : No              
##

The evaluation of the logistic regression model on the test set shows how well it generalizes. As in the training output, caret treats No (non-private) as the positive class. The key observations are:

Accuracy and Precision:

The model achieves an accuracy of approximately 93.1% on the test set (216 of 232 colleges classified correctly), well above the no-information rate of 72.8%.

The positive predictive value (precision) is 88.5%: of the 61 colleges predicted as non-private, 54 are actually non-private.

Sensitivity and Specificity:

Sensitivity, also known as recall, is the model’s capability to correctly identify the positive class, here non-private colleges. It stands at 85.7% (54 of 63).

Specificity is 95.9% (162 of 169), the share of private colleges that are correctly identified.

Balanced Accuracy and Kappa:

The balanced accuracy, the mean of sensitivity and specificity, is around 90.8%.

Cohen’s Kappa coefficient, a measure of agreement beyond chance, is 0.824, indicating strong agreement between predicted and actual classes.

Comparison with Training Set:

Compared with the training set, accuracy falls from 94.7% to 93.1%, sensitivity from 89.3% to 85.7%, specificity from 96.7% to 95.9%, and Kappa from 0.865 to 0.824. A small drop is expected when the model encounters new, unseen data, and its size here suggests that the model generalizes well.

Prevalence and Detection Rate:

The prevalence, the proportion of the positive class (non-private colleges) in the test set, is approximately 27.2%.

The detection rate, the proportion of all test colleges that are non-private and correctly predicted as such, stands at 23.3% (54 of 232).

In summary, every metric is slightly lower on the test set than on the training set, but the differences are small (about 1.6 percentage points in accuracy and 3.5 points in sensitivity). This indicates that the model is not substantially overfit and that it maintains its predictive performance on unseen colleges.
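
The train and test comparison can be laid out side by side with a few lines of base R. The counts are copied from the two confusion matrices in this report; make_tab and summarise_tab are hypothetical helpers written for this sketch, and No is kept as the positive class to match the caret output.

```r
# Build a 2x2 table from counts given column by column
# (rows = prediction, columns = reference).
make_tab <- function(counts) {
  matrix(counts, nrow = 2,
         dimnames = list(Prediction = c("No", "Yes"),
                         Reference  = c("No", "Yes")))
}
train_tab <- make_tab(c(133, 16, 13, 383))
test_tab  <- make_tab(c(54, 9, 7, 162))

# Accuracy, sensitivity and specificity with "No" as the positive class.
summarise_tab <- function(tab) {
  c(accuracy    = sum(diag(tab)) / sum(tab),
    sensitivity = tab["No", "No"] / sum(tab[, "No"]),
    specificity = tab["Yes", "Yes"] / sum(tab[, "Yes"]))
}

comparison <- rbind(train = summarise_tab(train_tab),
                    test  = summarise_tab(test_tab))

# Positive gaps mean the metric is lower on the test set.
comparison <- rbind(comparison,
                    gap = comparison["train", ] - comparison["test", ])
print(round(comparison, 4))
```

A large gap in any column would point to overfitting; here all three gaps are a few percentage points or less.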

Task 7

Plot and interpret the ROC curve.

In this analysis, we use the Receiver Operating Characteristic (ROC) curve to evaluate the performance of our logistic regression model on the test set. The ROC curve is a graphical plot that shows the diagnostic ability of a binary classifier by illustrating the trade-off between the true positive rate (sensitivity) and the false positive rate across various probability thresholds (Khal, 2021).

# Generating ROC curve data from the test set predictions
roc_data <- roc(response = College_test$Private, predictor = test_pred_prob)

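# Plotting the ROC curve with the false positive rate (1 - specificity) on the x-axis,
# so that the line y = x drawn below is the random-guess reference
plot(roc_data, main = "Figure 7: ROC Curve", col = "blue", lwd = 2, legacy.axes = TRUE)

# Reference line for a classifier that guesses at random
abline(a = 0, b = 1, lty = 2, col = "red")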

The logistic regression model distinguishes well between private and non-private colleges and clearly outperforms random guessing:

The model’s ROC curve lies well above the diagonal reference line, indicating that it classifies colleges better than chance.
The curve’s proximity to the upper left corner shows that the model identifies true positives while keeping false positives low.
The steep rise of the curve towards the upper left corner implies that the model achieves high sensitivity without giving up much specificity.
The visual assessment suggests a high AUC; the value itself is calculated in Task 8.

Overall, the ROC curve analysis supports the logistic regression model’s ability to distinguish between private and non-private colleges on unseen data.
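
Each point on a ROC curve corresponds to one probability threshold. The base R sketch below builds those points by hand for eight made-up toy scores (not College data) to show how the curve is constructed; all names are illustrative.

```r
# Toy data (hypothetical): predicted probability of "private" and the true class.
prob_private <- c(0.95, 0.85, 0.65, 0.55, 0.45, 0.35, 0.15, 0.05)
is_private   <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)

# Use every observed score as a cutoff, plus Inf so the curve starts at (0, 0).
thresholds <- c(Inf, sort(unique(prob_private), decreasing = TRUE))

# At each cutoff, a college is called private when its score is >= the cutoff.
roc_points <- data.frame(
  threshold = thresholds,
  tpr = sapply(thresholds, function(t) mean(prob_private[is_private] >= t)),
  fpr = sapply(thresholds, function(t) mean(prob_private[!is_private] >= t))
)
print(roc_points)
```

Plotting tpr against fpr and joining the points gives the ROC curve; a curve that climbs to a high tpr while fpr is still near zero hugs the upper left corner.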

Task 8

Calculate and interpret the AUC.

In this section, we aim to assess the performance of our classification model using the Area Under the Receiver Operating Characteristic (ROC) Curve, commonly referred to as AUC. The ROC curve is a graphical representation of the model’s ability to discriminate between positive and negative classes across various thresholds. The AUC condenses the information from the ROC curve into a single numerical value, providing a concise measure of the model’s overall performance. We will calculate and interpret the AUC to gain insights into the discriminative power of our model.

# Calculate the AUC from the ROC object
auc_value <- auc(roc_data)

# Print the AUC value
print(auc_value)

## Area under the curve: 0.9756

The AUC value of 0.9756 signifies that the model has demonstrated an exceptional ability to differentiate between private and non-private universities. The ROC curve, from which the AUC is derived, illustrates the trade-off between sensitivity (true positive rate) and specificity (true negative rate) across various probability thresholds. The closer the AUC is to 1, the better the model distinguishes between the two classes.

In our case, the AUC can be read as a probability: if one private and one non-private university are drawn at random from the test set, the model assigns the higher predicted probability of being private to the private university about 97.6% of the time. This indicates a strong ability to rank private universities above non-private ones, regardless of the threshold chosen.

The robust AUC, along with other evaluation metrics, reinforces the confidence in our logistic regression model’s performance in predicting the private or non-private status of universities. It is a valuable measure for assessing the overall discriminatory power of the model and is particularly useful in binary classification tasks like ours.
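
The ranking interpretation of the AUC can be checked with a short base R calculation. The sketch below uses eight made-up toy scores (not College data) and computes the AUC in two equivalent ways: by comparing every private score with every non-private score, and through the rank-sum (Mann-Whitney) formula.

```r
# Toy data (hypothetical): predicted probability of "private" and the true class.
prob_private <- c(0.95, 0.85, 0.65, 0.55, 0.45, 0.35, 0.15, 0.05)
is_private   <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)

pos <- prob_private[is_private]    # scores of private colleges
neg <- prob_private[!is_private]   # scores of non-private colleges

# Compare every private score with every non-private score; ties count half.
score_diff <- outer(pos, neg, FUN = "-")
auc_pairwise <- mean((score_diff > 0) + 0.5 * (score_diff == 0))

# Equivalent rank-based (Mann-Whitney) form.
r <- rank(prob_private)
auc_rank <- (sum(r[is_private]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

print(c(pairwise = auc_pairwise, rank_based = auc_rank))
```

Here 15 of the 16 private and non-private pairs are ordered correctly, so both calculations give 0.9375.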

Conclusion

The data was split into training and test sets using a 70/30 ratio (545 and 232 colleges). This enabled model fitting on the training set and evaluation on unseen test data, laying the groundwork for assessing generalizability.

Multiple logistic regression models were developed using forward selection, backward elimination and best subset selection techniques. The model with the lowest BIC contained 7 predictors: applications received, applications accepted, full-time undergraduates, out-of-state tuition, PhD percentage, student-faculty ratio and graduation rate. This balanced model simplicity with predictive capability.

Evaluation on the training set yielded 94.7% accuracy. With non-private as the positive class, sensitivity was 89.3%, specificity 96.7% and precision 91.1%, so the model identifies private colleges somewhat more reliably than non-private ones. The kappa statistic of 0.865 confirmed agreement well beyond chance. Overall, the results indicated strong performance on the data the model was fit on.

Critically, the classification metrics on the unseen test data dropped only slightly compared to the training set. The test accuracy remained above 93% and the AUC reached 0.976, highlighting the model’s capacity to discriminate between classes. Sensitivity (85.7%), specificity (95.9%) and precision (88.5%) stayed close to their training values, reflecting the model’s generalization ability.

In conclusion, the structured workflow incorporating statistical modeling and rigorous evaluation analyses resulted in a logistic regression model with remarkable accuracy, discrimination capability and generalizability in distinguishing between private and public universities. Its consistent performance across metrics on both train and test data cements its reliability for practical usage on new data.

References

Brownlee, J. (2020, August 26). Train-test split for evaluating machine learning algorithms. MachineLearningMastery.com. https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/
Kabacoff, R. (2015). R in action: Data analysis and graphics with R (2nd ed.). Manning Publications.
Khal, Y. E. (2021, March 18). Confusion matrix, AUC and ROC curve and Gini clearly explained. Medium. https://yassineelkhal.medium.com/confusion-matrix-auc-and-roc-curve-and-gini-clearly-explained-221788618eb2
Narkhede, S. (2021, June 15). Understanding confusion matrix. Medium. https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62

Appendix
This report is accompanied by an R Markdown file named ALY6015_ZeeshanAhmadAnsari_WEEK_3_FALL_B_2023.Rmd.