Introduction
A family of regression models known as generalized linear models (GLMs) enables
us to adapt the linear regression methodology to a wide range of
dependent variables. A generalized linear model does not require a
continuous or normally distributed dependent variable (Gero, 2023). The common families are:
- Gaussian family – continuous data with a normal distribution, such as weight or length.
- Binomial – binary data (0/1) or proportions, such as survival vs. death counts or positive vs. negative frequencies.
- Poisson – counts or frequencies.
In contrast to linear regression, whose outcome is continuous with an unlimited range of possible values, the outcome of logistic regression is categorical and takes one of a limited set of values, such as “yes” or “no,” or whether someone has brown, blue, or green eyes. Classification is the prediction of such a label or categorical variable.
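As a minimal sketch of how these families map onto glm() calls in R (the data frame df and its columns are hypothetical placeholders, not part of this project's dataset):
fit_gaussian = glm(weight ~ height, data = df, family = gaussian)  #continuous outcome
fit_binomial = glm(survived ~ dose, data = df, family = binomial)  #binary 0/1 outcome
fit_poisson = glm(counts ~ exposure, data = df, family = poisson)  #count outcome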
#EDA######
Task1data = Project3_data #College dataset
describe(Task1data) %>% #statistical description via psych::describe(), rendered as a table
kable(caption = "<center>Table 1: Statistical Description of College Dataset</center>",
align = "c")%>%
kable_styling(bootstrap_options = c("hover",
"bordered"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "US Colleges from the 1995 issue of US News and World Report.\n",
general_title = "Data Source:",
symbol = c("Categorical Variable"))
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Private* | 1 | 777 | 1.727156 | 0.4457084 | 2.0 | 1.783307 | 0.00000 | 1.0 | 2.0 | 1.0 | -1.0179902 | -0.9649328 | 0.0159897 |
| Apps | 2 | 777 | 3001.638353 | 3870.2014844 | 1558.0 | 2193.008026 | 1463.32620 | 81.0 | 48094.0 | 48013.0 | 3.7093849 | 26.5184313 | 138.8427049 |
| Accept | 3 | 777 | 2018.804376 | 2451.1139710 | 1110.0 | 1510.287319 | 1008.16800 | 72.0 | 26330.0 | 26258.0 | 3.4045428 | 18.7526403 | 87.9332239 |
| Enroll | 4 | 777 | 779.972973 | 929.1761901 | 434.0 | 575.953451 | 354.34140 | 35.0 | 6392.0 | 6357.0 | 2.6800857 | 8.7368340 | 33.3340101 |
| Top10perc | 5 | 777 | 27.558559 | 17.6403644 | 23.0 | 25.130016 | 13.34340 | 1.0 | 96.0 | 95.0 | 1.4077650 | 2.1728286 | 0.6328445 |
| Top25perc | 6 | 777 | 55.796654 | 19.8047776 | 54.0 | 55.121990 | 20.75640 | 9.0 | 100.0 | 91.0 | 0.2583399 | -0.5744647 | 0.7104924 |
| F.Undergrad | 7 | 777 | 3699.907336 | 4850.4205309 | 1707.0 | 2574.884430 | 1441.08720 | 139.0 | 31643.0 | 31504.0 | 2.6003876 | 7.6120676 | 174.0078673 |
| P.Undergrad | 8 | 777 | 855.298584 | 1522.4318873 | 353.0 | 536.361156 | 449.22780 | 1.0 | 21836.0 | 21835.0 | 5.6703938 | 54.5249401 | 54.6169397 |
| Outstate | 9 | 777 | 10440.669241 | 4023.0164841 | 9990.0 | 10181.658106 | 4121.62800 | 2340.0 | 21700.0 | 19360.0 | 0.5073133 | -0.4255258 | 144.3249124 |
| Room.Board | 10 | 777 | 4357.526383 | 1096.6964156 | 4200.0 | 4301.704655 | 1005.20280 | 1780.0 | 8124.0 | 6344.0 | 0.4755141 | -0.2012779 | 39.3437648 |
| Books | 11 | 777 | 549.380952 | 165.1053601 | 500.0 | 535.219904 | 148.26000 | 96.0 | 2340.0 | 2244.0 | 3.4715806 | 28.0632782 | 5.9231218 |
| Personal | 12 | 777 | 1340.642214 | 677.0714536 | 1200.0 | 1268.345104 | 593.04000 | 250.0 | 6800.0 | 6550.0 | 1.7357745 | 7.0446395 | 24.2898031 |
| PhD | 13 | 777 | 72.660232 | 16.3281547 | 75.0 | 73.922954 | 17.79120 | 8.0 | 103.0 | 95.0 | -0.7652067 | 0.5442923 | 0.5857693 |
| Terminal | 14 | 777 | 79.702703 | 14.7223585 | 82.0 | 81.102729 | 14.82600 | 24.0 | 100.0 | 76.0 | -0.8133924 | 0.2244365 | 0.5281617 |
| S.F.Ratio | 15 | 777 | 14.089704 | 3.9583491 | 13.6 | 13.935795 | 3.40998 | 2.5 | 39.8 | 37.3 | 0.6648606 | 2.5228017 | 0.1420050 |
| perc.alumni | 16 | 777 | 22.743887 | 12.3918015 | 21.0 | 21.857143 | 13.34340 | 0.0 | 64.0 | 64.0 | 0.6045500 | -0.1113466 | 0.4445534 |
| Expend | 17 | 777 | 9660.171171 | 5221.7684399 | 8377.0 | 8823.704655 | 2730.94920 | 3186.0 | 56233.0 | 53047.0 | 3.4459767 | 18.5875365 | 187.3298993 |
| Grad.Rate | 18 | 777 | 65.463320 | 17.1777099 | 65.0 | 65.601926 | 17.79120 | 10.0 | 118.0 | 108.0 | -0.1133384 | -0.2187930 | 0.6162469 |
#DescTools::Desc(Task1data$Private)
##Statistical Charts#####
ggplot(Task1data, aes(y=F.Undergrad, x=Private)) +
geom_bar(stat = "identity", position = "dodge")+
ggtitle("Full-time Undergrad Students in Universities") +
ylab("Number of Undergrads")+
theme(plot.title = element_text(hjust = 0.5))
Task1data = Task1data %>%
mutate(Acceptrate = Accept / Apps) #add acceptance rate column
#private univ subset
Privaterate = subset(Task1data, Private == "Yes",
select = c(Acceptrate))
Privaterate = mean(Privaterate$Acceptrate) #average private acceptance rate
#public univ subset
Publicrate = subset(Task1data, Private == "No",
select = c(Acceptrate))
Publicrate = mean(Publicrate$Acceptrate) #average public acceptance rate
#data combining
acceptrate = data.frame(Private = Privaterate,
Public = Publicrate)
#avg acceptance rate table representation
kable(round(acceptrate,3)*100,
caption = "<center>Table 2: Private vs Public Acceptance rate </center>",
align = "c",
col.names = c("Private University Acceptance Rate %",
"Public University Acceptance Rate %")) %>%
kable_styling(bootstrap_options = c("bordered",
"responsive",
"hover"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "US Colleges from the 1995 issue of US News and World Report.\n",
general_title = "Data Source:")
| Private University Acceptance Rate % | Public University Acceptance Rate % |
|---|---|
| 75.5 | 72.7 |
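As a side note, the same group averages can be computed more compactly with dplyr (assuming the same Task1data frame with the Acceptrate column added above):
#Equivalent group-wise mean acceptance rate using dplyr
Task1data %>%
group_by(Private) %>%
summarise(mean_accept_rate = mean(Acceptrate)) #"Yes" = private, "No" = public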
#Public vs Private Out-of-State Tuition
cdplot(Private ~ Outstate, data=Task1data, ylevels = 2:1,
main = "Out-of-State Tuition of Private, Public",
xlab = "Tuition Fee (USD)")
#Finding universities per student expenditure above country mean
Abovemeanexp = Task1data[which(Task1data$Expend > mean(Task1data$Expend) ),]
meancount <- function(y, uplim = max(Abovemeanexp$Expend) * 1.15) {
return(data.frame(
y = 0.95 * uplim,
label = paste(
"Count =", length(y), "\n"
)
))
}
ggplot(Abovemeanexp, aes(y=Expend, x=Private)) +
geom_bar(position="dodge", stat="identity")+
ggtitle("Expenditure per Student greater than country mean") +
ylab("Expenditure in USD")+
theme(plot.title = element_text(hjust = 0.5),
plot.caption = element_text(size = 7))+
#scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6))+
labs(caption = "Country Mean per student expenditure, $9660") +
stat_summary(
fun.data = meancount,
geom = "text", hjust = 0.5, vjust = 0.9, size = 2)
#Highest expenditure per student
Mostperstudentexp = Abovemeanexp %>%
filter(Expend == max(Expend))
kable(Mostperstudentexp,
caption = "<center>Table 3: Highest per student expenditure(USD) university </center>",
align = "c")%>%
kable_styling(bootstrap_options = c("hover",
"bordered"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "US Colleges from the 1995 issue of US News and World Report.\n",
general_title = "Data Source:")
| | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate | Acceptrate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Johns Hopkins University | Yes | 8474 | 3446 | 911 | 75 | 94 | 3566 | 1569 | 18800 | 6740 | 500 | 1040 | 96 | 97 | 3.3 | 38 | 56233 | 90 | 0.4066557 |
#Avg Student to Faculty Ratio in Private Univ
PrivateSFR = subset(Task1data, Private == "Yes",
select = c(S.F.Ratio))
PrivateSFR = mean(PrivateSFR$S.F.Ratio)
#Avg Student to Faculty Ratio in Public Univ
PublicSFR = subset(Task1data, Private == "No",
select = c(S.F.Ratio))
PublicSFR = mean(PublicSFR$S.F.Ratio)
#Data merging
FinSFR = data.frame(Public_S.F.R = PublicSFR,
Private_S.F.R = PrivateSFR)
kable(round(FinSFR,2),
caption = "<center>Table 4: Private vs Public Student:Faculty </center>",
align = "c",
col.names = c("Private University Student to Faculty Ratio",
"Public University Student to Faculty Ratio")) %>%
kable_styling(bootstrap_options = c("bordered",
"responsive",
"hover"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "US Colleges from the 1995 issue of US News and World Report.",
general_title = "Data Source: ")
| Public University Student to Faculty Ratio | Private University Student to Faculty Ratio |
|---|---|
| 17.14 | 12.95 |
testfit = glm(Private ~ S.F.Ratio + Grad.Rate + Outstate, data = Task1data, family = binomial)
testfit2 = glm(Private ~ Top10perc + F.Undergrad + PhD, data = Task1data, family = binomial)
#summary(testfit)
#summary(testfit2)
anov = anova(testfit, testfit2)
anov %>%
kable(caption = "<center>Table 5: Best fit Model</center>")%>%
kable_styling(bootstrap_options = c("bordered",
"responsive",
"hover"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "Test 1: Private ~ S.F.Ratio + Grad.Rate + Outstate and \n Test 2: Private ~ Top10perc + F.Undergrad + PhD\n",
general_title = "Test fit: ")
| Resid. Df | Resid. Dev | Df | Deviance |
|---|---|---|---|
| 773 | 541.1257 | NA | NA |
| 773 | 482.9046 | 0 | 58.22114 |
#tab_model()
AIC(testfit, testfit2)%>%
kable(caption = "<center>Table 6: AIC Best fit Model</center>",
align = "c")%>%
kable_styling(bootstrap_options = c("bordered",
"responsive",
"hover"),
font_size = 11) %>%
scroll_box(width = "100%", height = "50%") %>%
footnote(general = "Test 1: Private ~ S.F.Ratio + Grad.Rate + Outstate and \n Test 2: Private ~ Top10perc + F.Undergrad + PhD\n",
general_title = "Test fit: ")
| | df | AIC |
|---|---|---|
| testfit | 4 | 549.1257 |
| testfit2 | 4 | 490.9046 |
The College dataset, sourced from the 1995 issue of US News and World
Report, contains information on various attributes of American colleges
and universities. It includes variables such as the school’s name,
private/public status, enrollment size, percentage of students from the
top 10% and top 25% of their high school class, acceptance rate, and
expenditure per student. The dataset can be used to explore
relationships between attributes, such as that between a school’s
expenditure per student and its percentage of students from the top 10%
or 25% of their high school class. Additionally, statistical techniques
such as regression analysis can identify the most significant
predictors of a school’s overall success, as determined by US News and
World Report. The dataset can be used to understand trends among
colleges in 1995, to compare against more recent data, and to understand
how the factors affecting the success of colleges and universities have
changed over time.
Table 1 provides an overview of the dataset and descriptive statistics
for each variable. The variable Private is categorical, as it takes
Yes/No values; in Table 1 it is denoted by \(*\) to flag its categorical
type. The remaining variables are discrete or continuous.
Bar chart
The chart of full-time undergraduate students shows the students in US
universities who were enrolled full time. It shows that public
universities had the most full-time undergraduate students enrolled in
1995. However, the acceptance rate of private universities is greater
than that of public universities: overall, private universities have an
acceptance rate of 75.5% and public universities 72.7%. Furthermore,
out-of-state tuition is higher at private universities than at public
universities: almost 99% of public universities charge out-of-state
tuition below USD 15,000, whereas only about 20% of private universities
do. Another contributing factor in the success of universities is
instructional expenditure per student, the amount of money a college or
university spends on instruction-related activities for each enrolled
student. This typically includes expenses such as faculty salaries,
instructional materials, and classroom technology, and it is calculated
by dividing total instructional spending by the number of students
enrolled. The metric is often used as an indicator of a school’s
investment in its educational offerings, is a useful tool for comparing
the resources available at different institutions, and is also used to
study the relationship between expenditure and academic outcomes. The
country-wide mean expenditure per student is USD 9,660; 226 private and
36 public universities spend more than this mean, and Johns Hopkins
University has the highest instructional expenditure at USD 56,233.
Another interesting result from the data is the student-to-faculty
ratio: public universities average roughly 17:1, while private
universities average about 13:1 (Table 4).
After careful consideration, I decided to use the student-to-faculty
ratio, out-of-state tuition, graduation rate, the percentage of students
from the top 10% of their high school class, and full-time undergraduate
enrollment to measure the success of the universities. Model 1 relates
Private to the student-to-faculty ratio, out-of-state tuition, and
graduation rate; Model 2 uses the percentage of students from the top
10% of their high school class, full-time undergraduate enrollment, and
the percentage of faculty with PhDs. Based on this preliminary analysis
(Tables 5 and 6), Model 2 is the better fit. I model different
regression specifications in the next tasks.
#Train and Test#######
set.seed(130)
traintestind <- createDataPartition(Task1data$Private, p=0.70, list = FALSE)
traindata <- Task1data[traintestind,] #assigning 70% of data to train
testdata <- Task1data[-traintestind,] #assigning remaining 30% of data to test
#nrow(traindata) + nrow(testdata)
In this task, I split the College dataset into two parts: a training
set (70%) and a test set (30%). The training set is used to train the
model, while the test set is used to evaluate the model’s performance.
The main reason for a train/test split is to evaluate the performance
of a machine learning model: performing well on the training set is not
necessarily a good indicator of how well the model will perform on
unseen data.
When a model is trained and tested on the same data, it can result in
overfitting, where the model performs well on the training data but
poorly on new, unseen data. By splitting the data into a training set
and a test set, we can evaluate the model on data it has not seen and
get a more realistic estimate of how it will perform in practice.
Another reason for doing a train and test split is to be able to compare
the performance of different models. By training and testing several
models on the same data, we can compare their performance and select the
best model.
In addition, a train/test split helps in finding optimal model
settings: hyperparameters can be tuned on the training set and the
result validated on the test set, as sketched below.
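As a minimal sketch of such tuning (assuming the caret and glmnet packages are installed; this step is illustrative only and not one of the project's fitted models), five-fold cross-validation on the training set can tune a penalized logistic regression:
#Illustrative hyperparameter tuning via 5-fold cross-validation (caret + glmnet)
ctrl = trainControl(method = "cv", number = 5)
tuned = train(Private ~ F.Undergrad + perc.alumni + Outstate,
data = traindata,
method = "glmnet",   #elastic-net logistic regression
family = "binomial",
tuneLength = 5,      #small grid of alpha/lambda values
trControl = ctrl)
tuned$bestTune #hyperparameters chosen by cross-validation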
Once the dataset has been split into training and test sets, the model
can be trained on the training set using a function such as lm() (linear
regression) or glm() (logistic regression). The model’s performance can
then be evaluated on the test set using confusionMatrix().
In summary, train and test split is an important step in the machine
learning process as it allows us to evaluate the performance of a model,
compare the performance of different models, and ensure that the model
is generalizing well to new, unseen data.
allglm = glm(Private ~ ., data = traindata, family = binomial(link = "logit"))
summary(allglm)
##
## Call:
## glm(formula = Private ~ ., family = binomial(link = "logit"),
## data = traindata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3564 -0.0070 0.0396 0.1596 3.0606
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.493e-01 3.112e+00 0.177 0.85988
## Apps -8.661e-04 4.930e-04 -1.757 0.07893 .
## Accept 1.465e-03 7.582e-04 1.933 0.05328 .
## Enroll 9.076e-05 1.168e-03 0.078 0.93808
## Top10perc 1.542e-02 3.607e-02 0.428 0.66891
## Top25perc -5.867e-04 2.430e-02 -0.024 0.98074
## F.Undergrad -8.399e-04 2.879e-04 -2.918 0.00353 **
## P.Undergrad 2.247e-04 1.904e-04 1.181 0.23775
## Outstate 9.145e-04 1.617e-04 5.654 1.57e-08 ***
## Room.Board -4.767e-04 3.529e-04 -1.351 0.17673
## Books 6.782e-04 1.511e-03 0.449 0.65358
## Personal -1.564e-04 3.275e-04 -0.477 0.63301
## PhD -3.799e-02 3.604e-02 -1.054 0.29183
## Terminal -4.040e-02 3.442e-02 -1.174 0.24047
## S.F.Ratio 3.935e-03 8.349e-02 0.047 0.96241
## perc.alumni 4.128e-02 2.692e-02 1.533 0.12522
## Expend 2.253e-04 1.480e-04 1.522 0.12798
## Grad.Rate 1.496e-02 1.486e-02 1.007 0.31413
## Acceptrate -1.605e+00 2.347e+00 -0.684 0.49417
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 639.40 on 544 degrees of freedom
## Residual deviance: 154.73 on 526 degrees of freedom
## AIC: 192.73
##
## Number of Fisher Scoring iterations: 8
task3fit1 = glm(Private ~ F.Undergrad + perc.alumni + Outstate, data = traindata, family = binomial(link = "logit"))
task3fit2 = glm(Private ~ Personal + Enroll, data = traindata, family = binomial(link = "logit"))
#summary(task3fit1)
#summary(task3fit2)
exp(coef(task3fit1)) %>%
kable(caption = "<center>Table 7: Coefficinent of Model</center>",
col.names = "Coefficient of Model",
align = "c") %>%
kable_styling(bootstrap_options = c("bordered",
"responsive",
"hover"),
font_size = 11) %>%
scroll_box(width = "100%", height = "50%") %>%
footnote(general = "Test 1: Private ~ F.Undergrad + perc.alumni + Outstate",
general_title = "Test fit: ")
| | Odds Ratio |
|---|---|
| (Intercept) | 0.0148729 |
| F.Undergrad | 0.9992869 |
| perc.alumni | 1.0323057 |
| Outstate | 1.0008423 |
GLMs are a class of statistical models that extend the linear model
framework to allow response variables with error distributions other
than the normal distribution. glm() can fit a wide range of models,
including linear regression, logistic regression, Poisson regression,
and others. It handles both continuous and categorical predictor
variables and can include interactions and non-linear terms, as the
sketch below shows.
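As a minimal illustration of that formula syntax on this dataset (demo_fit is a hypothetical model for illustration, not one used in this analysis):
#Interaction term via * and a quadratic term via I(); illustration only
demo_fit = glm(Private ~ Outstate * S.F.Ratio + I(Expend^2),
data = Task1data, family = binomial)
summary(demo_fit)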
In this task, I first ran glm() on the training data with all
predictors and Private as the dependent variable. Once the model was
fitted, summary() was used to extract information and make inferences.
The summary shows that Outstate and F.Undergrad are significant at the
5% level, while Apps and Accept are only marginally significant at the
10% level. These significance codes indicate only whether the
coefficients are statistically significant; they say nothing about
effect size, i.e., the magnitude of the coefficients, which matters when
judging the practical significance of the results.
I then fit two candidate models on the training data: (1) F.Undergrad,
the percentage of alumni who donate, and Outstate; and (2) personal
expenditure and new enrollment. Of these two, the model with
F.Undergrad, percentage of alumni who donate, and Outstate is the better
fit and is selected for further analysis; an AIC comparison (sketched
below) supports this choice.
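A quick way to verify that choice is to compare the two candidate fits by AIC, mirroring Table 6:
#Compare the two candidate models on the training data by AIC
#(the lower-AIC model offers the better fit/complexity trade-off)
AIC(task3fit1, task3fit2)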
Probtrain = predict(task3fit1, newdata = traindata, type = "response")
predclassmin <- as.factor(ifelse(Probtrain >= 0.5, "Yes", "No"))
head(Probtrain,10) %>%
kable(caption = "<center>Table 8: Predicated Probability of Train Data </center>",
col.names = "Probabilities",
align = "c") %>%
kable_styling(bootstrap_options = c("bordered",
"responsive",
"hover"),
font_size = 11) %>%
scroll_box(width = "100%", height = "50%") %>%
footnote(general = "Predict model ran on train data shows first 10 observations",
general_title = "Probability: ")
| | Probabilities |
|---|---|
| Adelphi University | 0.9912108 |
| Adrian College | 0.9958409 |
| Agnes Scott College | 0.9994559 |
| Alaska Pacific University | 0.8852125 |
| Albertson College | 0.9991105 |
| Albertus Magnus College | 0.9994532 |
| Albion College | 0.9994511 |
| Albright College | 0.9998714 |
| Alderson-Broaddus College | 0.9891433 |
| Alfred University | 0.9999176 |
confusionMatrix(predclassmin, traindata$Private, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 130 16
## Yes 19 380
##
## Accuracy : 0.9358
## 95% CI : (0.9118, 0.9549)
## No Information Rate : 0.7266
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8373
##
## Mcnemar's Test P-Value : 0.7353
##
## Sensitivity : 0.9596
## Specificity : 0.8725
## Pos Pred Value : 0.9524
## Neg Pred Value : 0.8904
## Prevalence : 0.7266
## Detection Rate : 0.6972
## Detection Prevalence : 0.7321
## Balanced Accuracy : 0.9160
##
## 'Positive' Class : Yes
##
The predict() function is used to make predictions for new data from a
fitted model. The predictions can serve a variety of purposes:
- Model evaluation: by comparing predicted values to actual values for new data, we can evaluate the model’s performance and determine how well it generalizes.
- Model comparison: by comparing the predicted values from different models, we can determine which model best fits the data.
- Forecasting: we can forecast future values of a response variable based on a model fit to historical data.
- Understanding model behavior: predictions for different input values help reveal the relationship between the predictor variables and the response variable.
- Decision making: in some cases, the predictions drive decisions directly; for example, a model that predicts the probability that a university is private can be used to classify institutions as public or private and inform actions accordingly.
Overall, predict() is a key function in data modeling, as it allows
testing the model on unseen data and evaluating its performance.
Furthermore, as.factor() is applied to the output of predict() to
convert probabilities of 0.5 or more to “Yes” and all others to “No”.
This arrangement makes the data compatible with confusionMatrix().
confusionMatrix() creates a confusion matrix, a table used to evaluate
the performance of a classification model. The matrix is created by
comparing the predicted class labels to the actual class labels. It has
four main elements:
- True positives (TP): observations correctly classified as the positive class. In our training model, TP = 380.
- False positives (FP): observations incorrectly classified as the positive class. In our training model, FP = 19.
- True negatives (TN): observations correctly classified as the negative class. In our training model, TN = 130.
- False negatives (FN): observations incorrectly classified as the negative class. In our training model, FN = 16.
The confusion matrix is useful for understanding the performance of a
classification model in terms of precision, recall, accuracy, and
specificity.
The confusionMatrix() function can be used with various classification
models, such as logistic regression, decision trees, and random forests,
and also with multiclass classification problems. It is an important
step in evaluating the performance of a model and helps identify errors
or bias in the model.
In general, false positives and false negatives have different
consequences, and the specific context of the analysis determines which
misclassification is more damaging. Since the goal here is to identify
private schools from full-time undergraduate enrollment, the percentage
of alumni who donate, and out-of-state tuition, a false negative
(failing to identify a private school, which tends to have high
out-of-state tuition) might be more damaging than a false positive
(labeling a public school as private). If the goal were instead to
identify schools with low out-of-state tuition per student, a false
positive might be more damaging.
By counting the number of observations in each of these cells, we can
calculate various metrics such as accuracy, precision, recall, and
F1-score, as the sketch below demonstrates.
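These metrics can be reproduced by hand from the four cells of the training confusion matrix above; a minimal sketch:
#Metrics computed directly from the training confusion-matrix counts
TP = 380; FP = 19; TN = 130; FN = 16
accuracy = (TP + TN) / (TP + FP + TN + FN)         #0.9358
precision = TP / (TP + FP)                         #0.9524
recall = TP / (TP + FN)                            #0.9596 (sensitivity)
specificity = TN / (TN + FP)                       #0.8725
f1 = 2 * precision * recall / (precision + recall) #0.9560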
#Report and interpret metrics for Accuracy, Precision, Recall, and Specificity.
cmt5 = confusionMatrix(predclassmin, traindata$Private, positive = "Yes")
cmt5$byClass
## Sensitivity Specificity Pos Pred Value
## 0.9595960 0.8724832 0.9523810
## Neg Pred Value Precision Recall
## 0.8904110 0.9523810 0.9595960
## F1 Prevalence Detection Rate
## 0.9559748 0.7266055 0.6972477
## Detection Prevalence Balanced Accuracy
## 0.7321101 0.9160396
cmt5$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 9.357798e-01 8.373370e-01 9.118140e-01 9.548642e-01 7.266055e-01
## AccuracyPValue McnemarPValue
## 8.161411e-36 7.353167e-01
Accuracy of the model on the training data is 93.6%. Sensitivity (recall) is almost 96%, meaning 96 of every 100 actual private universities are predicted correctly. Precision is 95%, meaning 95% of the universities predicted as private actually are. Specificity measures the proportion of actual negatives that are correctly identified; in our model it is 87%, meaning 87 of every 100 public universities are correctly labeled public while 13 are incorrectly labeled private.
#Confusion Matrix for test data#####
Probtest = predict(task3fit1, newdata = testdata, type = "response")
predclassmintest <- as.factor(ifelse(Probtest >= 0.5, "Yes", "No"))
head(Probtest,10) %>%
kable(caption = "<center>Table 9: Predicated Probabilites of Test Data</center>",
col.names = "Probabilities for Test data",
align = "c") %>%
kable_styling(bootstrap_options = c("bordered",
"responsive",
"hover"),
font_size = 11) %>%
scroll_box(width = "100%", height = "50%") %>%
footnote(general = "Predict model ran on test data shows first 10 observations",
general_title = "Probability: ")
| | Probabilities for Test data |
|---|---|
| Abilene Christian University | 0.5936725 |
| Amherst College | 0.9999983 |
| Antioch University | 0.9999194 |
| Appalachian State University | 0.0059170 |
| Aquinas College | 0.9941216 |
| Arkansas Tech University | 0.0239912 |
| Assumption College | 0.9964255 |
| Barat College | 0.9913350 |
| Barnard College | 0.9999702 |
| Barry University | 0.9873203 |
cm = confusionMatrix(predclassmintest, testdata$Private, positive = "Yes")
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 58 9
## Yes 5 160
##
## Accuracy : 0.9397
## 95% CI : (0.9008, 0.9666)
## No Information Rate : 0.7284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8504
##
## Mcnemar's Test P-Value : 0.4227
##
## Sensitivity : 0.9467
## Specificity : 0.9206
## Pos Pred Value : 0.9697
## Neg Pred Value : 0.8657
## Prevalence : 0.7284
## Detection Rate : 0.6897
## Detection Prevalence : 0.7112
## Balanced Accuracy : 0.9337
##
## 'Positive' Class : Yes
##
#precision <- cm$byClass['Pos Pred Value']
#recall <- cm$byClass['Sensitivity']
#f_measure <- 2 * ((precision * recall) / (precision + recall))
fval = cm$byClass["F1"]
error = data.frame("error" = c(1-fval))
error = error$error
round(error,4) %>%
kable(caption = "<center>Table 10: Error Probability in the model</center>",
col.names = "Error Probability",
align = "c") %>%
kable_styling(bootstrap_options = c("bordered",
"responsive",
"hover"),
font_size = 11) %>%
scroll_box(width = "100%", height = "50%") %>%
footnote(general = "Private ~ F.Undergrad + perc.alumni + Outstate",
general_title = "Error Probability: ")
| Error Probability |
|---|
| 0.0419 |
- True positives (TP): observations correctly classified as the positive class. In our test model, TP = 160.
- False positives (FP): observations incorrectly classified as the positive class. In our test model, FP = 5.
- True negatives (TN): observations correctly classified as the negative class. In our test model, TN = 58.
- False negatives (FN): observations incorrectly classified as the negative class. In our test model, FN = 9.
Accuracy of the model on the test data is 94%. Sensitivity (recall) is
almost 95%, meaning about 95 of every 100 actual private universities
are correctly predicted as private. Precision is 97%, meaning 97% of the
universities predicted as private actually are. Specificity measures the
proportion of actual negatives correctly identified; in our model it is
92%, meaning about 92 of every 100 public universities are correctly
labeled public while 8 are incorrectly labeled private. The error
probability of the model on the test dataset (defined here as 1 − F1) is
about 4%, meaning roughly 4 of every 100 universities are misclassified.
#Receiver Operator Characteristic Curve######
rocdata = roc(testdata$Private, Probtest)
ggroc(rocdata, colour = 'steelblue', size = 2) +
ggtitle("ROC Chart")+
xlab(" Specificity: FP Rate")+
ylab("Sensitivity: TP Rate")+
theme(plot.title = element_text(hjust = 0.5))
The roc() function is used to create a receiver operating
characteristic (ROC) curve, which is a graphical representation of the
performance of a binary classifier system as the discrimination
threshold is varied. The ROC curve is created by plotting the true
positive rate (sensitivity) against the false positive rate
(1-specificity) at various threshold settings. The area under the ROC
curve (AUC) is a measure of the overall performance of the classifier,
with a value of 1 indicating perfect classification and a value of 0.5
indicating a classifier that performs no better than chance.
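To connect the curve to the concrete 0.5 cutoff used for classification above, pROC's coords() can report the sensitivity/specificity pair at any threshold; a minimal sketch:
#Sensitivity and specificity at the 0.5 probability cutoff used above
coords(rocdata, x = 0.5, input = "threshold",
ret = c("threshold", "sensitivity", "specificity"))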
#Area Under Curve####
auc(rocdata)
## Area under the curve: 0.9714
The ROC chart shows that the model is well suited to making accurate
predictions, as the AUC (area under the curve) is 97%.
In this project, we used glm() to fit a logistic regression on the categorical variable Private, which takes Yes/No values indicating university category. Running glm() on the full set of predictors guided model selection, and we chose three independent variables, full-time undergraduate enrollment, the percentage of alumni who donate, and out-of-state tuition, to train and test a model that predicts whether a university is private or public. The model achieves 94% overall accuracy on the test set, and the error rate for mislabeling private as public (1 − F1) is roughly 4%.
Bluman, A. (2014). Elementary statistics: A step by step approach. McGraw-Hill Education.
Gero, E. (2023). ALY6015_Module3_Logistic_Regression [Lecture recording]. Canvas@Northeastern University. https://canvas.northeastern.edu/
Kabacoff, R. (2015). R in Action. Manning Publications Co.
Zach. (2020). How to Calculate AUC (Area Under Curve) in R. Statology. https://www.statology.org/auc-in-r/