M3 Project
Report
ALY6015_71821:Intermediate Analytics
SEC_09_Fall_2023_CPS
Northeastern University
Professor: Vladimir Shapiro
By: Zeeshan Ahmad Ansari
Date of Submission: 27 November, 2023
Introduction
This research explores data from 777 colleges in the United States. The aim is to create a computer program that predicts whether a college is private or public based on various factors such as the number of students applying, enrolling, and graduating, along with financial details. The study involves using mathematical and computer techniques to build and test a model for this prediction.
The initial step involves thoroughly examining the data using both numerical summaries and visual representations to gain insights into its patterns. Subsequently, the data is divided into two parts, one for building the model and the other for testing its accuracy. The chosen technique for building the prediction model is logistic regression, a method that estimates the probability of a college being private.
To evaluate the model’s performance, various metrics are employed, including accuracy, precision, recall, and specificity, which are derived from a confusion matrix. Additionally, the study incorporates the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) to assess the model’s ability to distinguish between private and public colleges.
The report delves into the interpretation of these metrics, providing insights into the effectiveness of the logistic regression model in predicting the nature of universities. It concludes by discussing the significance of specific data attributes in distinguishing between private and public institutions. Essentially, the study illustrates the process of teaching a computer to make informed predictions about a college’s classification based on given information.
Analysis
Library
#The report utilizes a set of libraries for various data processing and visualization tasks.
library(ISLR)
library(ggplot2)
library(dplyr)
library(knitr)
library(kableExtra)
library(psych)
library(reshape2)
library(corrplot)
library(caret)
library(leaps)
library(GGally)
library(pROC)
library(MASS)
library(gridExtra)
Task 1
Import the dataset and perform Exploratory Data Analysis by using descriptive statistics and plots to describe the dataset.
In this analysis, we explore the College dataset, aiming to
gain insights into various aspects of higher education institutions. The
dataset encompasses diverse information about colleges in the United
States, including features such as the number of applications received,
acceptance rates, graduation rates, and financial characteristics.
Our investigative journey begins with a comprehensive overview of the dataset, examining its structure, the initial rows of data, and summary statistics. We utilize descriptive statistics and visualizations to unravel patterns and characteristics inherent in the data.
Furthermore, we delve into data exploration, conducting a thorough examination of missing values and deploying graphical representations. Box plots illustrate the distribution of out-of-state tuition fees across different types of universities, offering insights into the financial landscape. Bar plots provide a categorical view of university types, enabling a quick grasp of the dataset’s composition.
Histograms shed light on the distribution of key variables, such as the number of applications received and graduation rates, providing a nuanced understanding of their frequency distribution.
To unveil potential relationships between numerical variables, a pairwise scatter plot is constructed, revealing patterns and potential associations. The correlation matrix and a corresponding visualization further illuminate the interplay between variables, providing a quantitative perspective on their relationships.
Our analysis aims to foster a holistic understanding of the College dataset, setting the stage for subsequent tasks, such as predictive modeling and feature selection. This exploratory phase is pivotal in laying the groundwork for informed decision-making and identifying key factors that contribute to the diversity and dynamics of higher education institutions.
# Load the College dataset
data("College")
# Display the structure of the dataset
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
# Viewing the first few rows of the dataset
head(College)
# Summary statistics
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
# Summary statistics of the dataset
data_summary <- summary(College)
kable(data_summary,caption = "<center>Table 1: Summary</center>", format = "html", align = "l") %>%
column_spec(1, bold = TRUE)%>%
kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
scroll_box(width = "100%", height = "400px")
| Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No :212 | Min. : 81 | Min. : 72 | Min. : 35 | Min. : 1.00 | Min. : 9.0 | Min. : 139 | Min. : 1.0 | Min. : 2340 | Min. :1780 | Min. : 96.0 | Min. : 250 | Min. : 8.00 | Min. : 24.0 | Min. : 2.50 | Min. : 0.00 | Min. : 3186 | Min. : 10.00 | |
| Yes:565 | 1st Qu.: 776 | 1st Qu.: 604 | 1st Qu.: 242 | 1st Qu.:15.00 | 1st Qu.: 41.0 | 1st Qu.: 992 | 1st Qu.: 95.0 | 1st Qu.: 7320 | 1st Qu.:3597 | 1st Qu.: 470.0 | 1st Qu.: 850 | 1st Qu.: 62.00 | 1st Qu.: 71.0 | 1st Qu.:11.50 | 1st Qu.:13.00 | 1st Qu.: 6751 | 1st Qu.: 53.00 | |
| NA | Median : 1558 | Median : 1110 | Median : 434 | Median :23.00 | Median : 54.0 | Median : 1707 | Median : 353.0 | Median : 9990 | Median :4200 | Median : 500.0 | Median :1200 | Median : 75.00 | Median : 82.0 | Median :13.60 | Median :21.00 | Median : 8377 | Median : 65.00 | |
| NA | Mean : 3002 | Mean : 2019 | Mean : 780 | Mean :27.56 | Mean : 55.8 | Mean : 3700 | Mean : 855.3 | Mean :10441 | Mean :4358 | Mean : 549.4 | Mean :1341 | Mean : 72.66 | Mean : 79.7 | Mean :14.09 | Mean :22.74 | Mean : 9660 | Mean : 65.46 | |
| NA | 3rd Qu.: 3624 | 3rd Qu.: 2424 | 3rd Qu.: 902 | 3rd Qu.:35.00 | 3rd Qu.: 69.0 | 3rd Qu.: 4005 | 3rd Qu.: 967.0 | 3rd Qu.:12925 | 3rd Qu.:5050 | 3rd Qu.: 600.0 | 3rd Qu.:1700 | 3rd Qu.: 85.00 | 3rd Qu.: 92.0 | 3rd Qu.:16.50 | 3rd Qu.:31.00 | 3rd Qu.:10830 | 3rd Qu.: 78.00 | |
| NA | Max. :48094 | Max. :26330 | Max. :6392 | Max. :96.00 | Max. :100.0 | Max. :31643 | Max. :21836.0 | Max. :21700 | Max. :8124 | Max. :2340.0 | Max. :6800 | Max. :103.00 | Max. :100.0 | Max. :39.80 | Max. :64.00 | Max. :56233 | Max. :118.00 |
describe(College) %>%
kable(caption = "<center>Table 2: Descriptive Statistics</center>", format = "html", align = "l") %>%
kable_styling("bordered", full_width = TRUE, "striped",font_size = 14) %>%
row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
scroll_box(width = "100%", height = "400px")
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Private* | 1 | 777 | 1.727156 | 0.4457084 | 2.0 | 1.783307 | 0.00000 | 1.0 | 2.0 | 1.0 | -1.0179902 | -0.9649328 | 0.0159897 |
| Apps | 2 | 777 | 3001.638353 | 3870.2014844 | 1558.0 | 2193.008026 | 1463.32620 | 81.0 | 48094.0 | 48013.0 | 3.7093849 | 26.5184313 | 138.8427049 |
| Accept | 3 | 777 | 2018.804376 | 2451.1139710 | 1110.0 | 1510.287319 | 1008.16800 | 72.0 | 26330.0 | 26258.0 | 3.4045428 | 18.7526403 | 87.9332239 |
| Enroll | 4 | 777 | 779.972973 | 929.1761901 | 434.0 | 575.953451 | 354.34140 | 35.0 | 6392.0 | 6357.0 | 2.6800857 | 8.7368340 | 33.3340101 |
| Top10perc | 5 | 777 | 27.558559 | 17.6403644 | 23.0 | 25.130016 | 13.34340 | 1.0 | 96.0 | 95.0 | 1.4077650 | 2.1728286 | 0.6328445 |
| Top25perc | 6 | 777 | 55.796654 | 19.8047776 | 54.0 | 55.121990 | 20.75640 | 9.0 | 100.0 | 91.0 | 0.2583399 | -0.5744647 | 0.7104924 |
| F.Undergrad | 7 | 777 | 3699.907336 | 4850.4205309 | 1707.0 | 2574.884430 | 1441.08720 | 139.0 | 31643.0 | 31504.0 | 2.6003876 | 7.6120676 | 174.0078673 |
| P.Undergrad | 8 | 777 | 855.298584 | 1522.4318873 | 353.0 | 536.361156 | 449.22780 | 1.0 | 21836.0 | 21835.0 | 5.6703938 | 54.5249401 | 54.6169397 |
| Outstate | 9 | 777 | 10440.669241 | 4023.0164841 | 9990.0 | 10181.658106 | 4121.62800 | 2340.0 | 21700.0 | 19360.0 | 0.5073133 | -0.4255258 | 144.3249124 |
| Room.Board | 10 | 777 | 4357.526383 | 1096.6964156 | 4200.0 | 4301.704655 | 1005.20280 | 1780.0 | 8124.0 | 6344.0 | 0.4755141 | -0.2012779 | 39.3437648 |
| Books | 11 | 777 | 549.380952 | 165.1053601 | 500.0 | 535.219904 | 148.26000 | 96.0 | 2340.0 | 2244.0 | 3.4715806 | 28.0632782 | 5.9231218 |
| Personal | 12 | 777 | 1340.642214 | 677.0714536 | 1200.0 | 1268.345104 | 593.04000 | 250.0 | 6800.0 | 6550.0 | 1.7357745 | 7.0446395 | 24.2898031 |
| PhD | 13 | 777 | 72.660232 | 16.3281547 | 75.0 | 73.922954 | 17.79120 | 8.0 | 103.0 | 95.0 | -0.7652067 | 0.5442923 | 0.5857693 |
| Terminal | 14 | 777 | 79.702703 | 14.7223585 | 82.0 | 81.102729 | 14.82600 | 24.0 | 100.0 | 76.0 | -0.8133924 | 0.2244365 | 0.5281617 |
| S.F.Ratio | 15 | 777 | 14.089704 | 3.9583491 | 13.6 | 13.935795 | 3.40998 | 2.5 | 39.8 | 37.3 | 0.6648606 | 2.5228017 | 0.1420050 |
| perc.alumni | 16 | 777 | 22.743887 | 12.3918015 | 21.0 | 21.857143 | 13.34340 | 0.0 | 64.0 | 64.0 | 0.6045500 | -0.1113466 | 0.4445534 |
| Expend | 17 | 777 | 9660.171171 | 5221.7684399 | 8377.0 | 8823.704655 | 2730.94920 | 3186.0 | 56233.0 | 53047.0 | 3.4459767 | 18.5875365 | 187.3298993 |
| Grad.Rate | 18 | 777 | 65.463320 | 17.1777099 | 65.0 | 65.601926 | 17.79120 | 10.0 | 118.0 | 108.0 | -0.1133384 | -0.2187930 | 0.6162469 |
# Check for missing values
sum(is.na(College))
## [1] 0
# Box plot of the Private variable by Out state (Out-of-state tuition)
# Split the Outstate variable by the Private variable
boxplot(Outstate ~ Private, data = College,
main = "Out of State University type",
xlab = "Private University",
ylab = "Out-of-State Tuition",
col = c("lightblue", "lightgreen"),
border = "black")
# Adding a caption (moving it down by increasing the 'line' parameter)
mtext("Figure 1: Box plot of the Private variable by Out state",
side = 1, line = 4, adj = 0.5, cex = 0.8, font = 3)
# Adjusting the caption style
par(mar = c(5, 4, 4, 2) + 0.1) # Adjusting margins
The box plot shows a clear difference in out-of-state tuition between private and public universities, with the private group sitting visibly higher. The figures $21,700 and $9,990 should not be read as group medians: $9,990 is the median across all 777 colleges and $21,700 is the overall maximum (Table 1). Group-specific medians can be obtained by summarizing Outstate separately for each level of Private. Overall, the plot suggests that private universities are generally more expensive to attend than public universities.
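The group medians behind a box plot can be checked directly rather than read off the figure. The following is a minimal sketch on a made-up toy data frame that only mimics the column names Private and Outstate; the helper name group_medians and the toy values are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper: median of Outstate for each level of Private
group_medians <- function(df) {
  aggregate(Outstate ~ Private, data = df, FUN = median)
}

# Toy data (made-up values) with the same column names as College
toy <- data.frame(
  Private  = factor(c("No", "No", "No", "Yes", "Yes", "Yes"),
                    levels = c("No", "Yes")),
  Outstate = c(5000, 6000, 7000, 10000, 12000, 14000)
)

med <- group_medians(toy)
print(med)
```

With the report's data, the equivalent call would be group_medians(College).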
# Count the occurrences of each type of university
counts <- table(College$Private)
# Create a bar plot
barplot(counts,
main = "Bar Plot of University Types",
col = c("lightblue", "lightgreen"), # You can customize colors here
xlab = "Type of University",
ylab = "Count",
ylim= c(0,600),
names.arg = c("Public", "Private"),
cex.names = 0.8, # Adjust the size of axis labels
border = "black" # Add border to the bars for better visibility
)
# Add caption
mtext("Figure 2: Bar plot showing types of universities with counts", side = 1, line = 4, cex = 0.8, font = 3)
# Adjust margins
par(mar = c(5, 4, 4, 2) + 0.1)
The bar plot shows that most universities in the dataset are private (565, about 73%), with 212 public universities (about 27%). Both classes are well represented, although the imbalance means that a model predicting "private" for every college would already be about 73% accurate, which is a useful baseline when judging model accuracy.
# Create a histogram
hist(College$Apps,
breaks = 50,
col = "slateblue",
border = "black",
main = "Histogram of Applications",
xlab = "Number of Applications",
ylab = "Frequency",
ylim = c(0,300))
# Add a caption
mtext("Figure 3: Histogram showing number of applications", side = 1, line = 4, cex = 0.8, font = 3)
The above histogram is positively skewed (right-skewed): most colleges receive relatively few applications, and a long right tail extends to the maximum of 48,094. This agrees with the skewness of about 3.71 reported for Apps in Table 2 and with the mean (3,002) lying well above the median (1,558).
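The direction of skew can be confirmed numerically: a positive skewness coefficient indicates a long right tail, a negative one a long left tail. The sketch below uses a simple moment-based formula on made-up counts; the helper name skewness_moment and the toy vector are assumptions, and packages such as psych may apply a small-sample adjustment, so their values can differ slightly.

```r
# Moment-based sample skewness: m3 / m2^(3/2)
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy counts (made up): many small values and a few very large ones
apps_toy <- c(100, 150, 200, 250, 300, 400, 500, 800, 5000, 20000)

skew_raw <- skewness_moment(apps_toy)       # positive: long right tail
skew_log <- skewness_moment(log(apps_toy))  # log scale pulls the tail in
print(c(raw = skew_raw, log = skew_log))
```

For count-like variables such as Apps, a log transformation is one common way to reduce right skew before plotting or modeling.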
# Histogram for Graduation Rate (Grad.Rate)
hist(College$Grad.Rate,
breaks = 50,
col = "slateblue",
border = "black",
main = "Histogram of Graduation Rates",
xlab = "Graduation Rate",
ylab = "Frequency")
# Add a caption
mtext("Figure 4: Histogram showing graduation rate", side = 1, line = 4, cex = 0.8, font = 3, col = "black")
The above histogram is a visualization of the distribution of graduation rates at colleges in the United States. The horizontal axis shows the graduation rate, and the vertical axis shows the frequency of colleges with that graduation rate.
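The histogram shows a roughly symmetric distribution centered near 65%, with only a very slight negative skew (about -0.11 in Table 2), so colleges are spread fairly evenly on either side of the median. The maximum of 118 exceeds 100%, which is not a valid graduation rate and most likely reflects a data-entry error.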
# Pairwise scatter plot for numerical variables
pairs(College[, 2:11])
# Add figure caption
title(sub = "Figure 5: Pairwise Scatter Plot Matrix for numerical variables")
# Correlation matrix
cor_matrix <- cor(College[, 2:18])
cor_matrix
## Apps Accept Enroll Top10perc Top25perc
## Apps 1.00000000 0.94345057 0.84682205 0.3388337 0.35163990
## Accept 0.94345057 1.00000000 0.91163666 0.1924469 0.24747574
## Enroll 0.84682205 0.91163666 1.00000000 0.1812935 0.22674511
## Top10perc 0.33883368 0.19244693 0.18129353 1.0000000 0.89199497
## Top25perc 0.35163990 0.24747574 0.22674511 0.8919950 1.00000000
## F.Undergrad 0.81449058 0.87422328 0.96463965 0.1412887 0.19944466
## P.Undergrad 0.39826427 0.44127073 0.51306860 -0.1053563 -0.05357664
## Outstate 0.05015903 -0.02575455 -0.15547734 0.5623305 0.48939383
## Room.Board 0.16493896 0.09089863 -0.04023168 0.3714804 0.33148989
## Books 0.13255860 0.11352535 0.11271089 0.1188584 0.11552713
## Personal 0.17873085 0.20098867 0.28092946 -0.0933164 -0.08081027
## PhD 0.39069733 0.35575788 0.33146914 0.5318280 0.54586221
## Terminal 0.36949147 0.33758337 0.30827407 0.4911350 0.52474884
## S.F.Ratio 0.09563303 0.17622901 0.23727131 -0.3848745 -0.29462884
## perc.alumni -0.09022589 -0.15998987 -0.18079413 0.4554853 0.41786429
## Expend 0.25959198 0.12471701 0.06416923 0.6609134 0.52744743
## Grad.Rate 0.14675460 0.06731255 -0.02234104 0.4949892 0.47728116
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Apps 0.81449058 0.39826427 0.05015903 0.16493896 0.132558598
## Accept 0.87422328 0.44127073 -0.02575455 0.09089863 0.113525352
## Enroll 0.96463965 0.51306860 -0.15547734 -0.04023168 0.112710891
## Top10perc 0.14128873 -0.10535628 0.56233054 0.37148038 0.118858431
## Top25perc 0.19944466 -0.05357664 0.48939383 0.33148989 0.115527130
## F.Undergrad 1.00000000 0.57051219 -0.21574200 -0.06889039 0.115549761
## P.Undergrad 0.57051219 1.00000000 -0.25351232 -0.06132551 0.081199521
## Outstate -0.21574200 -0.25351232 1.00000000 0.65425640 0.038854868
## Room.Board -0.06889039 -0.06132551 0.65425640 1.00000000 0.127962970
## Books 0.11554976 0.08119952 0.03885487 0.12796297 1.000000000
## Personal 0.31719954 0.31988162 -0.29908690 -0.19942818 0.179294764
## PhD 0.31833697 0.14911422 0.38298241 0.32920228 0.026905731
## Terminal 0.30001894 0.14190357 0.40798320 0.37453955 0.099954700
## S.F.Ratio 0.27970335 0.23253051 -0.55482128 -0.36262774 -0.031929274
## perc.alumni -0.22946222 -0.28079236 0.56626242 0.27236345 -0.040207736
## Expend 0.01865162 -0.08356842 0.67277862 0.50173942 0.112409075
## Grad.Rate -0.07877313 -0.25700099 0.57128993 0.42494154 0.001060894
## Personal PhD Terminal S.F.Ratio perc.alumni
## Apps 0.17873085 0.39069733 0.36949147 0.09563303 -0.09022589
## Accept 0.20098867 0.35575788 0.33758337 0.17622901 -0.15998987
## Enroll 0.28092946 0.33146914 0.30827407 0.23727131 -0.18079413
## Top10perc -0.09331640 0.53182802 0.49113502 -0.38487451 0.45548526
## Top25perc -0.08081027 0.54586221 0.52474884 -0.29462884 0.41786429
## F.Undergrad 0.31719954 0.31833697 0.30001894 0.27970335 -0.22946222
## P.Undergrad 0.31988162 0.14911422 0.14190357 0.23253051 -0.28079236
## Outstate -0.29908690 0.38298241 0.40798320 -0.55482128 0.56626242
## Room.Board -0.19942818 0.32920228 0.37453955 -0.36262774 0.27236345
## Books 0.17929476 0.02690573 0.09995470 -0.03192927 -0.04020774
## Personal 1.00000000 -0.01093579 -0.03061311 0.13634483 -0.28596808
## PhD -0.01093579 1.00000000 0.84958703 -0.13053011 0.24900866
## Terminal -0.03061311 0.84958703 1.00000000 -0.16010395 0.26713029
## S.F.Ratio 0.13634483 -0.13053011 -0.16010395 1.00000000 -0.40292917
## perc.alumni -0.28596808 0.24900866 0.26713029 -0.40292917 1.00000000
## Expend -0.09789189 0.43276168 0.43879922 -0.58383204 0.41771172
## Grad.Rate -0.26934396 0.30503785 0.28952723 -0.30671041 0.49089756
## Expend Grad.Rate
## Apps 0.25959198 0.146754600
## Accept 0.12471701 0.067312550
## Enroll 0.06416923 -0.022341039
## Top10perc 0.66091341 0.494989235
## Top25perc 0.52744743 0.477281164
## F.Undergrad 0.01865162 -0.078773129
## P.Undergrad -0.08356842 -0.257000991
## Outstate 0.67277862 0.571289928
## Room.Board 0.50173942 0.424941541
## Books 0.11240908 0.001060894
## Personal -0.09789189 -0.269343964
## PhD 0.43276168 0.305037850
## Terminal 0.43879922 0.289527232
## S.F.Ratio -0.58383204 -0.306710405
## perc.alumni 0.41771172 0.490897562
## Expend 1.00000000 0.390342696
## Grad.Rate 0.39034270 1.000000000
# Visualize the correlation matrix using pie-chart glyphs
corrplot(cor_matrix, method = "pie",
main = "Figure 6 : Correlation Plot with Pie Charts",
mar = c(0, 0, 2, 0))
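The correlation plot with pie charts (Figure 6) provides a visual representation of these relationships. The filled portion of each pie is proportional to the absolute correlation, and the direction of the correlation is indicated by the color (positive in blue, negative in red). The strongest associations are among the size-related variables, for example Apps and Accept (0.94) and Enroll and F.Undergrad (0.96), as well as between Top10perc and Top25perc (0.89) and between PhD and Terminal (0.85). This exploration lays the groundwork for understanding the interconnections between various numerical variables in the dataset.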
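Strongly correlated pairs can also be listed programmatically, which is useful for spotting potential multicollinearity before modeling. The sketch below is illustrative only: the helper name strong_pairs and the 0.8 cutoff are assumptions, and the small matrix reuses a few rounded values from the correlation output above.

```r
# Hypothetical helper: list variable pairs with |r| at or above a cutoff
strong_pairs <- function(cm, cutoff = 0.8) {
  idx <- which(abs(cm) >= cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             r    = cm[idx],
             stringsAsFactors = FALSE)
}

# Small symmetric matrix built from rounded values in the output above
vars <- c("Apps", "Accept", "Enroll", "Outstate")
cm <- matrix(c( 1.000,  0.943,  0.847,  0.050,
                0.943,  1.000,  0.912, -0.026,
                0.847,  0.912,  1.000, -0.155,
                0.050, -0.026, -0.155,  1.000),
             nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

pairs_found <- strong_pairs(cm, cutoff = 0.8)
print(pairs_found)
```

On the full matrix, the call would be strong_pairs(cor_matrix).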
Task 2
Split the data into a train and test set – refer to the ALY6015_Feature_Selection_R.pdf document for information on how to split a dataset.
In the process of modeling, it is a common and essential practice to partition the dataset into two distinct subsets: a training dataset and a testing dataset. The training dataset serves as the foundation upon which the model is trained, while the testing dataset is employed to assess the model’s performance. This division is crucial for verifying that the model generalizes effectively to new, unseen data, guarding against the risk of overfitting (Brownlee, 2020).
A frequently employed strategy is the 70/30 split, where 70% of the dataset is allocated to the training set, and the remaining 30% is designated for testing. This approach strikes a balance between providing the model with sufficient data for learning and ensuring a robust evaluation on independent test data.
# Split the Data into Training and Testing Sets
set.seed(123) # For reproducibility
splitIndex <- createDataPartition(College$Private, p = .70, list = FALSE, times = 1)
College_train <- College[splitIndex,]
College_test <- College[-splitIndex,]
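For a factor outcome, createDataPartition draws the sample within each class so that the class proportions are preserved in both subsets. The base-R sketch below mimics that idea on made-up labels; the helper name stratified_index is an assumption, and the code illustrates the principle rather than caret's exact algorithm.

```r
# Hypothetical helper: sample a fraction p of row indices within each class
stratified_index <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(123)
# Toy labels (made up): 20 "No" and 50 "Yes"
y <- factor(rep(c("No", "Yes"), times = c(20, 50)), levels = c("No", "Yes"))

train_idx <- stratified_index(y, p = 0.7)
train_y <- y[train_idx]
test_y  <- y[-train_idx]

print(table(train_y))
print(table(test_y))
```

Checking table(College_train$Private) and table(College_test$Private) after the split is a quick way to confirm that both subsets keep a similar share of private colleges.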
Task 3
Fit logistic regression model
In our endeavor to construct a predictive logistic regression model for
discerning whether a university is private or public, we leverage the
versatile Generalized Linear Model (GLM) framework (Kabacoff,
2015). GLM serves as the cornerstone of our modeling approach,
providing a flexible and powerful tool for analyzing binary outcomes.
The glm function acts as our guide in fitting logistic
regression models, facilitating the estimation of coefficients that best
describe the relationship between predictor variables and the binary
response variable.
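To make the link between coefficients and probabilities concrete, the sketch below fits a logistic regression to simulated data and shows that the fitted probability is the inverse logit of the linear predictor. The data and variable names are made up for illustration and do not come from the College dataset.

```r
set.seed(1)
# Toy data (made up): one predictor and a binary outcome
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
toy <- data.frame(x = x, y = y)

fit <- glm(y ~ x, data = toy, family = binomial)

# Linear predictor (log-odds) and fitted probabilities
eta   <- predict(fit, type = "link")
p_hat <- predict(fit, type = "response")

# The probability is the inverse logit of the log-odds
p_manual <- 1 / (1 + exp(-eta))
print(all.equal(unname(p_hat), unname(p_manual)))
```

The same relationship holds for the models in this report: each coefficient shifts the log-odds of a college being private, and type = "response" converts the log-odds to a probability.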
Our model selection starts with Forward Selection, which adds predictors one at a time to improve model fit, and continues with Backward Selection, which starts from the full model and removes predictors one at a time to simplify it. The stepAIC function drives both procedures using the Akaike Information Criterion (AIC); the Bayesian Information Criterion (BIC) is then computed for the selected models to compare their performance.
Furthermore, our exploration extends to Best Subset Selection, facilitated by the regsubsets function. This exhaustive examination of various model configurations enables us to pinpoint the optimal combination of predictors based on the Bayesian Information Criterion (BIC). Throughout this process, we meticulously assess model performance and intricately navigate the interplay between predictor variables.
# Forward Selection
forward_model <- glm(Private ~ 1, data = College_train, family = binomial)
forward_step <- stepAIC(forward_model,
scope = list(lower = formula(forward_model),
upper = ~ .),
direction = "forward", trace = FALSE)
summary(forward_step)
##
## Call:
## glm(formula = Private ~ 1, family = binomial, data = College_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6105 -1.6105 0.7992 0.7992 0.7992
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.97747 0.09611 10.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 639.4 on 544 degrees of freedom
## Residual deviance: 639.4 on 544 degrees of freedom
## AIC: 641.4
##
## Number of Fisher Scoring iterations: 4
The forward selection process begins with a null model containing only the intercept. The intercept estimate of 0.97747 is the log-odds of a college being private in the training set, which corresponds to a proportion of about 73%.
The null deviance of 639.4 on 544 degrees of freedom measures the fit of this intercept-only model and leaves considerable room for improvement.
Forward selection is meant to add, at each step, the variable that lowers the AIC the most. Here, however, the Call in the output is still Private ~ 1, so no predictor was added. The reason is the scope: with upper = ~ ., the dot is expanded against the current intercept-only model, so the upper scope contains no candidate terms. Supplying the full set of predictors as the upper scope, for example the formula of the model fitted with Private ~ ., would allow the search to proceed.
As a result, the model returned by forward_step is the null model itself: its residual deviance equals the null deviance (639.4) and its AIC is 641.4. It therefore serves only as a baseline for the comparisons that follow.
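Since the Call in the output above is still Private ~ 1, the search did not add any predictor. A likely cause is that the dot in upper = ~ . is expanded against the current model, which has no terms. The sketch below illustrates this on simulated data and shows one way to give the search an explicit upper scope. It uses step from base R, which accepts the same scope argument as stepAIC; the data and variable names are made up.

```r
set.seed(42)
# Toy data (made up): x1 drives the outcome, x2 and x3 are noise
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * toy$x1))

null_fit <- glm(y ~ 1, data = toy, family = binomial)
full_fit <- glm(y ~ ., data = toy, family = binomial)

# For an intercept-only model, "~ ." expands to the same intercept-only formula
stuck_scope <- update.formula(formula(null_fit), ~ .)
print(stuck_scope)

# Explicit upper scope taken from the full model
fwd <- step(null_fit,
            scope = list(lower = formula(null_fit),
                         upper = formula(full_fit)),
            direction = "forward", trace = 0)
print(formula(fwd))
```

Applied to the report's data, the analogous change would be to pass the formula of the full model (the one fitted with Private ~ .) as the upper scope.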
# Backward Selection
backward_model <- glm(Private ~ ., data = College_train, family = binomial)
backward_step <- stepAIC(backward_model,
direction = "backward",
trace = FALSE)
summary(backward_step)
##
## Call:
## glm(formula = Private ~ Apps + Enroll + F.Undergrad + Outstate +
## PhD + perc.alumni + Expend + Grad.Rate, family = binomial,
## data = College_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8647 -0.0159 0.0473 0.1561 3.0302
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8963348 1.3926982 -1.362 0.173316
## Apps -0.0003739 0.0002027 -1.845 0.065102 .
## Enroll 0.0019856 0.0010665 1.862 0.062636 .
## F.Undergrad -0.0007323 0.0002338 -3.132 0.001738 **
## Outstate 0.0007395 0.0001351 5.473 4.44e-08 ***
## PhD -0.0766800 0.0200659 -3.821 0.000133 ***
## perc.alumni 0.0359692 0.0259347 1.387 0.165468
## Expend 0.0002574 0.0001424 1.807 0.070735 .
## Grad.Rate 0.0274096 0.0162281 1.689 0.091215 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 639.40 on 544 degrees of freedom
## Residual deviance: 151.78 on 536 degrees of freedom
## AIC: 169.78
##
## Number of Fisher Scoring iterations: 8
The backward selection process starts with a model containing all 17 predictors. At each step the algorithm removes the variable whose removal lowers the AIC the most, and it stops when no removal improves the AIC.
The null deviance of 639.4 on 544 degrees of freedom belongs to the intercept-only model and serves as the reference point. The final model has a residual deviance of 151.78 on 536 degrees of freedom, a substantial reduction achieved with eight predictors.
The final model includes Apps, Enroll, F.Undergrad, Outstate, PhD, perc.alumni, Expend, and Grad.Rate. Of these, F.Undergrad, Outstate, and PhD are significant at the 0.01 level, while the remaining predictors are retained because they lower the AIC even though their individual p-values exceed 0.05.
# Model Comparison
AIC(forward_step, backward_step)
BIC(forward_step, backward_step)
#logLik(forward_step, backward_step)
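Comparing AIC and BIC between the forward and backward models shows that the backward model has a much lower AIC (169.78 versus 641.4) and a BIC of 208.48, indicating a far better fit. Because forward_step is the intercept-only model (its Call is Private ~ 1), this comparison is effectively between the selected model and the null baseline.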
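Both criteria can be reproduced by hand from the reported deviance. For ungrouped binary data the residual deviance equals minus twice the log-likelihood, so AIC is the deviance plus 2k and BIC is the deviance plus k log(n), where k counts the estimated coefficients. The sketch below applies this to the backward model's figures; the helper names are assumptions for illustration.

```r
# Hypothetical helpers: information criteria from a binary-GLM deviance
aic_from_deviance <- function(dev, k) dev + 2 * k
bic_from_deviance <- function(dev, k, n) dev + k * log(n)

dev_backward <- 151.78  # residual deviance reported for the backward model
k_backward   <- 9       # intercept + 8 predictors
n_train      <- 545     # 544 null degrees of freedom + 1

aic_b <- aic_from_deviance(dev_backward, k_backward)
bic_b <- bic_from_deviance(dev_backward, k_backward, n_train)
print(c(AIC = aic_b, BIC = bic_b))
```

BIC penalizes each extra coefficient by log(545), roughly 6.3, instead of 2, which is why it tends to favor smaller models than AIC.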
# Best Subset Selection
sub_model <- regsubsets(Private ~ ., data = College_train, nbest = 1, nvmax = NULL, method = "exhaustive")
summary(sub_model)
## Subset selection object
## Call: regsubsets.formula(Private ~ ., data = College_train, nbest = 1,
## nvmax = NULL, method = "exhaustive")
## 17 Variables (and intercept)
## Forced in Forced out
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Outstate FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: exhaustive
## Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1 ( 1 ) " " " " " " " " " " "*" " "
## 2 ( 1 ) " " " " " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " "*" " "
## 4 ( 1 ) " " " " " " " " " " "*" " "
## 5 ( 1 ) " " " " " " " " " " "*" " "
## 6 ( 1 ) " " " " " " " " " " "*" " "
## 7 ( 1 ) "*" "*" " " " " " " "*" " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " "
## 9 ( 1 ) "*" "*" " " " " " " "*" " "
## 10 ( 1 ) "*" "*" " " " " " " "*" " "
## 11 ( 1 ) "*" "*" " " " " " " "*" " "
## 12 ( 1 ) "*" "*" " " "*" " " "*" " "
## 13 ( 1 ) "*" "*" " " "*" "*" "*" " "
## 14 ( 1 ) "*" "*" " " "*" "*" "*" " "
## 15 ( 1 ) "*" "*" " " "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " " " "*" " " "*" " "
## 5 ( 1 ) "*" " " " " " " "*" " " "*" " "
## 6 ( 1 ) "*" " " " " " " "*" " " "*" " "
## 7 ( 1 ) "*" " " " " " " "*" " " "*" " "
## 8 ( 1 ) "*" " " " " " " "*" "*" "*" " "
## 9 ( 1 ) "*" " " " " " " "*" "*" "*" "*"
## 10 ( 1 ) "*" " " "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" " " "*" " " "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" " " " " "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" " " " " "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## Expend Grad.Rate
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) " " "*"
## 10 ( 1 ) " " "*"
## 11 ( 1 ) "*" "*"
## 12 ( 1 ) "*" "*"
## 13 ( 1 ) "*" "*"
## 14 ( 1 ) "*" "*"
## 15 ( 1 ) "*" "*"
## 16 ( 1 ) "*" "*"
## 17 ( 1 ) "*" "*"
sub_summary <- summary(sub_model)
# BIC values for all the models
bic_values <- sub_summary$bic
# The index of the model with the lowest BIC
best_bic_index <- which.min(bic_values)
# Logical vector indicating whether a variable is included in the best model
best_model_vars <- sub_summary$which[best_bic_index, ]
# Names of the variables included in the best model
best_var_names <- names(which(sub_summary$which[best_bic_index, ]))
# Ensure 'Intercept' is not in the list of predictors
best_var_names <- best_var_names[best_var_names != "(Intercept)"]
# Check the predictors
print(best_var_names)
## [1] "Apps" "Accept" "F.Undergrad" "Outstate" "PhD"
## [6] "S.F.Ratio" "Grad.Rate"
# Construct the formula
best_formula_str <- paste("Private ~", paste(best_var_names, collapse = " + "))
print(best_formula_str)
## [1] "Private ~ Apps + Accept + F.Undergrad + Outstate + PhD + S.F.Ratio + Grad.Rate"
# Convert the string to a formula
best_formula <- as.formula(best_formula_str)
# Fit the logistic regression model
best_model <- glm(best_formula, data = College_train, family = "binomial")
summary(best_model)
##
## Call:
## glm(formula = best_formula, family = "binomial", data = College_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8496 -0.0166 0.0576 0.1559 3.0368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3556121 1.9628415 0.691 0.489793
## Apps -0.0003305 0.0002726 -1.212 0.225393
## Accept 0.0004010 0.0004959 0.809 0.418716
## F.Undergrad -0.0005493 0.0001757 -3.126 0.001774 **
## Outstate 0.0007682 0.0001209 6.355 2.08e-10 ***
## PhD -0.0689944 0.0184782 -3.734 0.000189 ***
## S.F.Ratio -0.1279705 0.0729849 -1.753 0.079536 .
## Grad.Rate 0.0344966 0.0158932 2.171 0.029967 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 639.40 on 544 degrees of freedom
## Residual deviance: 156.75 on 537 degrees of freedom
## AIC: 172.75
##
## Number of Fisher Scoring iterations: 8
Best subset selection performs an exhaustive search over combinations of predictor variables and keeps the model with the lowest BIC. Here that model contains seven predictors: Apps, Accept, F.Undergrad, Outstate, PhD, S.F.Ratio, and Grad.Rate.
Because BIC penalizes every additional parameter, the selected model stays small while still fitting well: the deviance falls from 639.40 (null) to 156.75 (residual). Not every retained predictor is individually significant, however. Outstate, PhD, F.Undergrad, and Grad.Rate are significant at the 5% level, whereas Apps, Accept, and S.F.Ratio are not.
All the selection methods provide insight into which predictors matter for predicting whether a university is private or public. F.Undergrad appears consistently in the selected models, highlighting its importance in distinguishing university types.
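As a rough check on how the information criteria relate to the fit, the AIC reported in the summary can be reproduced from the residual deviance and the number of estimated coefficients, and the same ingredients give a BIC for this logistic model. This is a sketch using figures copied from the output above. It assumes ungrouped binary data, where the residual deviance equals minus twice the log-likelihood, and the sample size of 545 is inferred from the null degrees of freedom (544 + 1). The BIC computed this way is for the fitted glm and may not match the values stored in bic_values.

```r
resid_deviance <- 156.75   # residual deviance from summary(best_model)
n_obs <- 544 + 1           # null deviance degrees of freedom + 1 (inferred)
n_params <- 8              # intercept + 7 predictors

# AIC charges 2 per parameter; BIC charges log(n) per parameter
aic <- resid_deviance + 2 * n_params
bic <- resid_deviance + log(n_obs) * n_params

print(aic)
print(round(bic, 2))
```

With 545 observations, log(n) is about 6.3, so BIC penalizes each extra predictor roughly three times as heavily as AIC, which is why it tends to favor smaller models.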
Task 4
Create a confusion matrix and report the results of your model
predictions on the train set. Interpret and discuss the confusion
matrix. Which misclassifications are more damaging for the analysis,
False Positives or False Negatives?
In this segment of our report, we delve into the practical aspect of applying our logistic regression model. Following the training phase, the subsequent step involves leveraging our model to make predictions. We aim to assess the model’s proficiency in determining whether a university is private or not based on the provided features.
Our logistic regression model is employed to predict outcomes on the training dataset. These predictions are then juxtaposed with the actual classifications of universities in the training set. The ensuing analysis is encapsulated in a confusion matrix, a pivotal tool for gauging the model’s performance.
A confusion matrix is a table used to describe the performance of a classification model on a set of data for which the true values are known. It summarizes the model’s predictions against the actual classifications (Narkhede, 2021).
The confusionMatrix function from the caret package is used to create the confusion matrix and calculate the associated statistics.
# Make predictions on the training set
pred <- predict(best_model, newdata = College_train, type = "response")
pred_class <- ifelse(pred > 0.5, "Yes", "No")
# Convert the predictions and the actual classes to factors with the same levels
pred_factor <- factor(pred_class, levels = c("No", "Yes"))
actual_factor <- factor(College_train$Private, levels = c("No", "Yes"))
# the confusion matrix
conf_matrix <- confusionMatrix(pred_factor, actual_factor)
print(conf_matrix)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  133  13
##        Yes  16 383
##
## Accuracy : 0.9468
## 95% CI : (0.9245, 0.9641)
## No Information Rate : 0.7266
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8652
##
## Mcnemar's Test P-Value : 0.7103
##
## Sensitivity : 0.8926
## Specificity : 0.9672
## Pos Pred Value : 0.9110
## Neg Pred Value : 0.9599
## Prevalence : 0.2734
## Detection Rate : 0.2440
## Detection Prevalence : 0.2679
## Balanced Accuracy : 0.9299
##
## 'Positive' Class : No
##
The confusion matrix and associated statistics provide a detailed evaluation of the logistic regression model’s performance on the training dataset. The model correctly identified 383 private colleges (Yes) and 133 non-private colleges (No). Because caret treats the first factor level, No, as the positive class, the 13 private colleges predicted as non-private are the false positives and the 16 non-private colleges predicted as private are the false negatives. Accuracy is 94.68%, well above the no-information rate of 72.66%.
Sensitivity, the share of actual positive (non-private) cases correctly identified, is 89.26%, while specificity, the share of actual negative (private) cases correctly identified, is 96.72%. The positive predictive value (precision) is 91.10%, meaning that 91.10% of the colleges predicted as non-private really are non-private.
The Kappa statistic, which adjusts for agreement expected by chance, is 0.8652, reflecting substantial agreement beyond chance. Additionally, the balanced accuracy of 92.99%, the mean of sensitivity and specificity, is less affected by the class imbalance than raw accuracy.
These metrics collectively suggest that the logistic regression model separates private from non-private colleges well on the training dataset, although performance measured on training data tends to be optimistic.
Interpretation and Discussion of the Confusion Matrix:
False Positives and False Negatives have distinct implications for the analysis, and which is which depends on the positive class. The model predicts whether a college is private, but the confusion matrix above uses No (non-private) as the positive class. A False Positive is therefore a private college classified as non-private, and a False Negative is a non-private college classified as private.
False Positives lead to an underestimation of the number of private colleges, potentially missing opportunities or neglecting characteristics unique to private institutions. False Negatives lead to an overestimation of the number of private colleges, which could have consequences for resource allocation or decision-making based on this prediction.
The balance between the two types of errors depends on the specific goals and consequences of the analysis. If one type of error is costlier than the other, the 0.5 decision threshold of the model can be adjusted to trade one for the other. The sensitivity (True Positive Rate) and specificity (True Negative Rate) metrics in the statistics summary show how the model currently performs with respect to each error.
In summary, the nature of the problem and the specific consequences associated with misclassifications determine whether False Positives or False Negatives are more damaging. In this model the two error counts are close (13 versus 16), so neither dominates, and the choice comes down to which mistake costs more in the intended application.
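The effect of moving the decision threshold can be sketched on a handful of made-up predicted probabilities. Everything in this example is hypothetical (the probabilities, the labels, and the helper name count_errors); it is meant only to show how the two error types trade off, not to reproduce the report’s results.

```r
# Hypothetical predicted probabilities of being private, with true labels
prob_private <- c(0.10, 0.30, 0.45, 0.55, 0.70, 0.90)
actual <- c("No", "No", "Yes", "No", "Yes", "Yes")

# Count both error types at a given threshold (hypothetical helper)
count_errors <- function(threshold) {
  predicted <- ifelse(prob_private > threshold, "Yes", "No")
  c(
    threshold = threshold,
    private_called_nonprivate = sum(predicted == "No" & actual == "Yes"),
    nonprivate_called_private = sum(predicted == "Yes" & actual == "No")
  )
}

# One row per candidate threshold
error_table <- t(sapply(c(0.4, 0.5, 0.6), count_errors))
print(error_table)
```

In this toy data, lowering the threshold removes the private college that was called non-private, while raising it removes the non-private college that was called private.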
Task 5
Report and interpret metrics for Accuracy, Precision, Recall, and
Specificity.
# Extracting the metrics from the confusion matrix
# Accuracy
accuracy <- conf_matrix$overall['Accuracy']
accuracy
## Accuracy
## 0.946789
Accuracy represents the overall correctness of the model’s predictions. In this case, the model is approximately 94.68% accurate, meaning it correctly predicts whether a college is private or not in nearly 95% of cases.
High accuracy is generally positive, but it should be read against the class balance. This dataset is imbalanced (about 73% of the training colleges are private), so always predicting private would already be about 73% accurate, and accuracy alone does not give a complete picture.
#Precision
precision <- conf_matrix$byClass['Pos Pred Value'][1]
precision
## Pos Pred Value
## 0.9109589
Precision, also known as Positive Predictive Value, measures how often the model’s positive predictions are correct. Because caret treats No (non-private) as the positive class here, this means that when the model predicts a college to be non-private, it is correct approximately 91.10% of the time.
The corresponding figure for private predictions is the Negative Predictive Value, 95.99%. High values on both matter when resources or decisions depend directly on the predicted class.
#Recall
recall <- conf_matrix$byClass['Sensitivity'][1]
recall
## Sensitivity
## 0.8926174
Recall, or Sensitivity, measures the ability of the model to correctly identify positive instances. With No as the positive class, the model captures about 89.26% of the actual non-private colleges.
A high recall is valuable when it is crucial not to miss positive instances. Here it implies that the model identifies most non-private colleges, although about 11% of them are mislabeled as private.
#Specificity
specificity <- conf_matrix$byClass['Specificity'][1]
specificity
## Specificity
## 0.9671717
Specificity measures the ability of the model to correctly identify negative instances. With No as the positive class, the negative class is private colleges, so the model correctly identifies approximately 96.72% of private colleges.
High specificity is desirable when mislabeling a private college as non-private carries specific consequences.
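The four metrics above can be reproduced directly from the cell counts of the training confusion matrix. The sketch below assumes the positive class reported by caret (No) and copies the counts from the printed matrix; the variable names are chosen here for illustration.

```r
# Training-set counts from the confusion matrix, with "No" as the positive class
tp <- 133  # predicted No,  actually No
fp <- 13   # predicted No,  actually Yes
fn <- 16   # predicted Yes, actually No
tn <- 383  # predicted Yes, actually Yes

metrics <- c(
  accuracy    = (tp + tn) / (tp + fp + fn + tn),
  precision   = tp / (tp + fp),
  recall      = tp / (tp + fn),
  specificity = tn / (tn + fp)
)
print(round(metrics, 4))
```

If private colleges should be the class of interest, confusionMatrix accepts a positive argument (for example positive = "Yes"), which swaps the roles of sensitivity and specificity and of the two predictive values.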
Task 6
Create a confusion matrix and report the results of your model for
the test set. Compare the results with the train set and interpret
In this section, we extend our analysis to evaluate the predictive performance of the logistic regression model on a previously unseen dataset, the test set. The primary objective is to assess how well the model, trained on the training set, generalizes to new data. The confusion matrix and associated metrics show how accurately the model classifies colleges as private or non-private on the test set. By comparing the results between the training and test sets, we aim to understand the model’s robustness and identify any potential overfitting or underfitting. This evaluation is crucial for gauging the model’s reliability beyond the training data.
# Predictions on the test set
test_pred_prob <- predict(best_model, newdata = College_test, type = "response")
test_pred_class <- ifelse(test_pred_prob > 0.5, "Yes", "No")
# Convert the predictions and the actual classes to factors with the same levels
test_pred_factor <- factor(test_pred_class, levels = c("No", "Yes"))
test_actual_factor <- factor(College_test$Private, levels = c("No", "Yes"))
# Create the confusion matrix for the test set
test_conf_matrix <- confusionMatrix(test_pred_factor,test_actual_factor)
print(test_conf_matrix)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No   54   7
##        Yes   9 162
##
## Accuracy : 0.931
## 95% CI : (0.8904, 0.9601)
## No Information Rate : 0.7284
## P-Value [Acc > NIR] : 4.085e-15
##
## Kappa : 0.8239
##
## Mcnemar's Test P-Value : 0.8026
##
## Sensitivity : 0.8571
## Specificity : 0.9586
## Pos Pred Value : 0.8852
## Neg Pred Value : 0.9474
## Prevalence : 0.2716
## Detection Rate : 0.2328
## Detection Prevalence : 0.2629
## Balanced Accuracy : 0.9079
##
## 'Positive' Class : No
##
The evaluation of the logistic regression model on the test set reveals valuable insights into its performance and generalization capabilities. Let’s delve into key observations:
Accuracy and Precision:
The model achieves an accuracy of approximately 93.1% on the test set, showing that it classifies most colleges correctly as private or non-private.
The positive predictive value (precision) is 88.5%. Because No is the positive class, this is the proportion of colleges predicted as non-private that are actually non-private.
Sensitivity and Specificity:
Sensitivity, also known as recall, is the model’s ability to correctly identify the positive class, which here is non-private colleges. It stands at 85.7%, so the model captures most of the actual non-private institutions.
Specificity is 95.9%, the proportion of private colleges that the model correctly identifies.
Balanced Accuracy and Kappa:
The balanced accuracy, which considers both sensitivity and
specificity, is around 90.8%, reinforcing the model’s
overall balanced performance.
Cohen’s Kappa, a measure of agreement beyond chance, is 0.824, indicating substantial agreement between predicted and actual classes.
Comparison with Training Set:
Comparing these results with those obtained from the training set, we observe a slight decrease in accuracy (94.7% to 93.1%) and in the other metrics. This is common when a model encounters new, unseen data. However, the drop is marginal, which suggests that the model generalizes well.
Prevalence and Detection Rate:
The prevalence, representing the proportion of the positive class (non-private colleges) in the test set, is approximately 27.2%.
The detection rate, the proportion of all test-set colleges that are non-private and correctly identified as such, stands at 23.3%. Since the detection rate cannot exceed the prevalence, this shows that the model detects most of the positive class.
In summary, while there is a marginal decrease in accuracy and detection rate on the test set compared to the training set, the overall consistency in sensitivity, specificity, balanced accuracy, and Kappa coefficient indicates that the model generalizes well and maintains its predictive performance across different datasets. The observed variations are within an acceptable range, affirming the reliability and robustness of the logistic regression model.
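The comparison between the two sets can be laid out side by side. The sketch below simply copies the reported train and test statistics into a data frame and computes the drop for each; the data frame and column names are illustrative.

```r
# Statistics copied from the two confusionMatrix outputs above
comparison <- data.frame(
  metric = c("Accuracy", "Kappa", "Sensitivity", "Specificity", "Balanced Accuracy"),
  train  = c(0.9468, 0.8652, 0.8926, 0.9672, 0.9299),
  test   = c(0.9310, 0.8239, 0.8571, 0.9586, 0.9079)
)

# Positive values mean the metric is lower on the test set
comparison$drop <- comparison$train - comparison$test
print(comparison)
```

Every metric falls on the test set, but each drop is below five percentage points, with Kappa and sensitivity moving the most.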
Task 7
Plot and interpret the ROC curve.
In this analysis, we use the Receiver Operating Characteristic (ROC) curve to evaluate the performance of our logistic regression model on the test set. The ROC curve is a graphical plot of the diagnostic ability of a binary classifier: it illustrates the trade-off between the true positive rate (sensitivity) and the false positive rate across probability thresholds (KHAL, 2021).
# Generating ROC curve data from the test set predictions
roc_data <- roc(response = College_test$Private, predictor = test_pred_prob)
# Plotting the ROC curve
plot(roc_data, main = "Figure 7: ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "red")
The ROC curve lies well above the dashed diagonal reference line, which represents random guessing, so the model distinguishes private from non-private colleges far better than chance.
The curve’s proximity to the upper left corner shows that the model can identify most true positives while producing few false positives.
Its steep initial rise means that high sensitivity is reached with little loss of specificity.
The plot itself does not report the AUC, but the shape of the curve suggests a value close to 1; the exact value is calculated in Task 8.
Overall, the ROC analysis supports the logistic regression model’s ability to distinguish between private and non-private colleges on unseen data.
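To make the construction of the curve concrete, the sketch below computes ROC points by hand for six made-up scores. The scores and labels are hypothetical, and private is taken as the event of interest; the report’s own curve comes from the roc() call above, not from this code.

```r
# Hypothetical predicted probabilities and true classes
scores <- c(0.95, 0.85, 0.70, 0.60, 0.40, 0.20)
is_private <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Sweep the threshold from "predict nothing" (Inf) down to the lowest score
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))

# True positive rate and false positive rate at each threshold
tpr <- sapply(thresholds, function(t) mean(scores[is_private] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[!is_private] >= t))

roc_points <- data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
print(roc_points)
```

Plotting tpr against fpr traces the curve from (0, 0) to (1, 1). A model whose points hug the top-left corner reaches a high true positive rate before the false positive rate starts to climb.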
Task 8
Calculate and interpret the AUC.
In this section, we aim to assess the performance of our classification model using the Area Under the Receiver Operating Characteristic (ROC) Curve, commonly referred to as AUC. The ROC curve is a graphical representation of the model’s ability to discriminate between positive and negative classes across various thresholds. The AUC condenses the information from the ROC curve into a single numerical value, providing a concise measure of the model’s overall performance. We will calculate and interpret the AUC to gain insights into the discriminative power of our model.
# Calculate the AUC from the ROC object
auc_value <- auc(roc_data)
# Print the AUC value
print(auc_value)
## Area under the curve: 0.9756
The AUC value of 0.9756 signifies that the model has
demonstrated an exceptional ability to differentiate between private and
non-private universities. The ROC curve, from which the AUC is derived,
illustrates the trade-off between sensitivity (true positive rate) and
specificity (true negative rate) across various probability thresholds.
The closer the AUC is to 1, the better the model distinguishes between
the two classes.
In our case, an AUC of 0.9756 means that a randomly chosen private university receives a higher predicted probability than a randomly chosen non-private university about 97.6% of the time. The model therefore ranks positive instances (private universities) above negative instances (non-private universities) very reliably, regardless of the threshold chosen.
The robust AUC, along with other evaluation metrics, reinforces the confidence in our logistic regression model’s performance in predicting the private or non-private status of universities. It is a valuable measure for assessing the overall discriminatory power of the model and is particularly useful in binary classification tasks like ours.
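The ranking interpretation of the AUC can be illustrated with a small base R calculation. This is a sketch on hypothetical scores, not the report’s data: it counts the share of (private, non-private) pairs in which the private college has the higher score, with ties counted as half.

```r
# Hypothetical predicted probabilities for each class
private_scores    <- c(0.95, 0.85, 0.60)
nonprivate_scores <- c(0.70, 0.40, 0.20)

# Compare every private score with every non-private score
wins <- outer(private_scores, nonprivate_scores, ">")
ties <- outer(private_scores, nonprivate_scores, "==")

# AUC as the proportion of correctly ordered pairs
auc_by_ranking <- mean(wins + 0.5 * ties)
print(auc_by_ranking)
```

Here eight of the nine pairs are ordered correctly, so the AUC is 8/9. An AUC of 1 would mean every private college outranks every non-private one, and 0.5 would be no better than random ordering.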
Conclusion
The data was split into comprehensive training and test sets using a 70/30 ratio. This enabled proper model fitting on the train set and rigorous evaluation on the unseen test data, laying the groundwork for assessing generalizability.
Multiple logistic regression models were developed using forward selection, backward elimination and best subset selection. The optimal model, identified by the lowest BIC, contained 7 predictors: applications received, applications accepted, full-time undergraduates, out-of-state tuition, percentage of faculty with PhDs, student-faculty ratio and graduation rate. This balanced model simplicity with predictive capability.
Evaluation on the training set yielded 94.7% accuracy, with sensitivity of 89.3% and specificity of 96.7%. Because caret treated No as the positive class, the precision of 91.1% describes how reliable the model’s non-private predictions are. The kappa statistic of 0.865 further confirmed agreement well beyond chance. Overall, the results indicated strong performance on the data the model was fit on.
Critically, the classification metrics on the unseen test data were
consistently high and dropped only slightly compared to the training
set. The test accuracy remained above 93% and AUC reached
0.976, highlighting capacity to discriminate between
classes. Well-preserved sensitivity, specificity and precision reflected
the model’s generalization ability.
In conclusion, the structured workflow incorporating statistical modeling and rigorous evaluation analyses resulted in a logistic regression model with remarkable accuracy, discrimination capability and generalizability in distinguishing between private and public universities. Its consistent performance across metrics on both train and test data cements its reliability for practical usage on new data.
References
Brownlee, J. (2020, August 26). Train-test split for Evaluating Machine Learning Algorithms. MachineLearningMastery.com. https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/
Kabacoff, R. (2015). R in action : data analysis and graphics with R (Second edition.). Manning Publications.
Narkhede, S. (2021, June 15). Understanding confusion matrix. Medium. https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62
KHAL, Y. E. (2021, March 18). Confusion matrix, AUC and ROC curve and Gini clearly explained. Medium. https://yassineelkhal.medium.com/confusion-matrix-auc-and-roc-curve-and-gini-clearly-explained-221788618eb2
Appendix
This report contains an R Markdown file named as follows
ALY6015_ZeeshanAhmadAnsari_WEEK_3_FALL_B_2023.Rmd