GLM and Logistic Regression


Introduction

A family of regression models known as generalized linear models (GLMs) enables us to adapt the linear regression methodology to a wide range of dependent variables. A generalized linear model does not require a continuous or normally distributed dependent variable (Gero, 2023).

Three components of a GLM:
Random component – the probability distribution of the response variable.
Systematic component – specifies the explanatory variables (X1, X2, …, Xk) in the model, more specifically their linear combination, which forms the so-called linear predictor.
Link function – specifies the link between the random and systematic components.

Model Families (each is illustrated in the sketch after this list):


Gaussian family – continuous data with a normal distribution, such as weight or length.

Binomial – binary data (0/1) or proportions, such as survival vs. death counts or positive vs. negative frequencies.

Poisson – used for counts or frequencies.

Gamma – used for time data, such as the duration until an event occurs.
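As a quick, hedged illustration of how these families map onto the family argument of glm(), the sketch below fits each one to small simulated vectors (all data here are made up purely for demonstration):

#Illustrative only: simulated responses, hypothetical relationships
set.seed(1)
x = 1:20
glm(rnorm(20, mean = 2 + 0.5*x) ~ x, family = gaussian())            #continuous
glm(rbinom(20, size = 1, prob = 0.5) ~ x, family = binomial())       #binary
glm(rpois(20, lambda = 3) ~ x, family = poisson())                   #counts
glm(rgamma(20, shape = 2, rate = 0.5) ~ x, family = Gamma("log"))    #positive durations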

In contrast to linear regression, whose outcome is continuous, the outcome of logistic regression is categorical: it takes one of a limited set of values, such as “yes” or “no,” or whether someone has brown, blue, or green eyes. Classification is the prediction of such a label or categorical variable.
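To make the three GLM components concrete for the logistic case, here is a minimal sketch with simulated data and hypothetical coefficients: the random component is a binomial response, the systematic component is the linear predictor, and the logit link connects the two.

set.seed(2)
x = rnorm(200)                          #predictor
eta = -0.5 + 1.2*x                      #systematic component: linear predictor
p = 1/(1 + exp(-eta))                   #inverse logit link turns eta into a probability
y = rbinom(200, size = 1, prob = p)     #random component: binomial (0/1) response
fit = glm(y ~ x, family = binomial(link = "logit"))
coef(fit)                               #estimates should be near -0.5 and 1.2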

Task 1: College dataset EDA
#EDA######
#Packages used throughout this report (assumed loaded in a setup chunk):
library(psych)       #describe()
library(dplyr)       #%>%, mutate(), filter()
library(knitr)       #kable()
library(kableExtra)  #kable_styling(), scroll_box(), footnote()
library(ggplot2)     #plots
library(caret)       #createDataPartition(), confusionMatrix()
library(pROC)        #roc(), ggroc(), auc()

Task1data = Project3_data #College data loaded earlier as Project3_data

describe(Task1data)%>% #table representation of the statistical description
  kable(caption = "<center>Table 1: Statistical Description of College Dataset</center>",
        align = "c")%>%
  kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "US Colleges from the 1995 issue of US News and World Report.\n",
           general_title = "Data Source:",
           symbol = c("Categorical Variable"))
Table 1: Statistical Description of College Dataset
Variable vars n mean sd median trimmed mad min max range skew kurtosis se
Private* 1 777 1.727156 0.4457084 2.0 1.783307 0.00000 1.0 2.0 1.0 -1.0179902 -0.9649328 0.0159897
Apps 2 777 3001.638353 3870.2014844 1558.0 2193.008026 1463.32620 81.0 48094.0 48013.0 3.7093849 26.5184313 138.8427049
Accept 3 777 2018.804376 2451.1139710 1110.0 1510.287319 1008.16800 72.0 26330.0 26258.0 3.4045428 18.7526403 87.9332239
Enroll 4 777 779.972973 929.1761901 434.0 575.953451 354.34140 35.0 6392.0 6357.0 2.6800857 8.7368340 33.3340101
Top10perc 5 777 27.558559 17.6403644 23.0 25.130016 13.34340 1.0 96.0 95.0 1.4077650 2.1728286 0.6328445
Top25perc 6 777 55.796654 19.8047776 54.0 55.121990 20.75640 9.0 100.0 91.0 0.2583399 -0.5744647 0.7104924
F.Undergrad 7 777 3699.907336 4850.4205309 1707.0 2574.884430 1441.08720 139.0 31643.0 31504.0 2.6003876 7.6120676 174.0078673
P.Undergrad 8 777 855.298584 1522.4318873 353.0 536.361156 449.22780 1.0 21836.0 21835.0 5.6703938 54.5249401 54.6169397
Outstate 9 777 10440.669241 4023.0164841 9990.0 10181.658106 4121.62800 2340.0 21700.0 19360.0 0.5073133 -0.4255258 144.3249124
Room.Board 10 777 4357.526383 1096.6964156 4200.0 4301.704655 1005.20280 1780.0 8124.0 6344.0 0.4755141 -0.2012779 39.3437648
Books 11 777 549.380952 165.1053601 500.0 535.219904 148.26000 96.0 2340.0 2244.0 3.4715806 28.0632782 5.9231218
Personal 12 777 1340.642214 677.0714536 1200.0 1268.345104 593.04000 250.0 6800.0 6550.0 1.7357745 7.0446395 24.2898031
PhD 13 777 72.660232 16.3281547 75.0 73.922954 17.79120 8.0 103.0 95.0 -0.7652067 0.5442923 0.5857693
Terminal 14 777 79.702703 14.7223585 82.0 81.102729 14.82600 24.0 100.0 76.0 -0.8133924 0.2244365 0.5281617
S.F.Ratio 15 777 14.089704 3.9583491 13.6 13.935795 3.40998 2.5 39.8 37.3 0.6648606 2.5228017 0.1420050
perc.alumni 16 777 22.743887 12.3918015 21.0 21.857143 13.34340 0.0 64.0 64.0 0.6045500 -0.1113466 0.4445534
Expend 17 777 9660.171171 5221.7684399 8377.0 8823.704655 2730.94920 3186.0 56233.0 53047.0 3.4459767 18.5875365 187.3298993
Grad.Rate 18 777 65.463320 17.1777099 65.0 65.601926 17.79120 10.0 118.0 108.0 -0.1133384 -0.2187930 0.6162469
Data Source: US Colleges from the 1995 issue of US News and World Report.
* Categorical Variable
#DescTools::Desc(Task1data$Private)

##Statistical Charts#####

ggplot(Task1data, aes(y=F.Undergrad, x=Private)) + 
    geom_bar(stat = "identity", position = "dodge")+ #bars show summed F.Undergrad within each Private level
  ggtitle("Full-time Undergrad Students in Universities") +
  ylab("Number of Undergrads")+
  theme(plot.title = element_text(hjust = 0.5))

Task1data = Task1data %>%
  mutate(Acceptrate = Accept / Apps) #add acceptance rate column

#private univ subset
Privaterate = subset(Task1data, Private == "Yes", 
         select=c(Acceptrate))
Privaterate = mean(Privaterate$Acceptrate) #mean acceptance rate, private universities

#public univ subset
Publicrate = subset(Task1data, Private == "No", 
         select=c(Acceptrate))
Publicrate = mean(Publicrate$Acceptrate) #mean acceptance rate, public universities

#data combining
acceptrate = data.frame(Private = Privaterate,
                        Public = Publicrate)

#avg acceptance rate table representation
kable(round(acceptrate,3)*100,
      caption = "<center>Table 2: Private vs Public Acceptance rate </center>",
      align = "c",
      col.names = c("Private University Acceptance Rate %",
                    "Public University Acceptance Rate %")) %>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "100%") %>%
   footnote(general = "US Colleges from the 1995 issue of US News and World Report.\n",
           general_title = "Data Source:")
Table 2: Private vs Public Acceptance rate
Private University Acceptance Rate % Public University Acceptance Rate %
75.5 72.7
Data Source: US Colleges from the 1995 issue of US News and World Report.
#Public vs Private Out-of-State Tuition
cdplot(Private ~ Outstate, data=Task1data, ylevels = 2:1,
       main = "Out-of-State Tuition of Private, Public",
       xlab = "Tuition Fee (USD)")

#Universities with per-student expenditure above the country mean
Abovemeanexp = Task1data[which(Task1data$Expend > mean(Task1data$Expend) ),]

meancount <- function(y, uplim = max(Abovemeanexp$Expend) * 1.15) { #label text and y-position for stat_summary()
  return(data.frame(
    y = 0.95 * uplim,
    label = paste(
      "Count =", length(y), "\n"
    )
  ))
}
ggplot(Abovemeanexp, aes(y=Expend, x=Private)) + 
    geom_bar(position="dodge", stat="identity")+
  ggtitle("Expenditure per Student  greater than country mean") +
  ylab("Expenditure in USD")+
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(size = 7))+
  #scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6))+
  labs(caption = "Country Mean per student expenditure, $9660") +
  stat_summary( 
               fun.data = meancount, 
               geom = "text", hjust = 0.5, vjust = 0.9, size = 2)

#Highest expenditure per student 
Mostperstudentexp = Abovemeanexp %>%
  filter(Expend == max(Expend)) #refer to columns directly inside dplyr verbs

kable(Mostperstudentexp,
      caption = "<center>Table 3: Highest per student expenditure (USD) university</center>",
      align = "c")%>%
  kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "US Colleges from the 1995 issue of US News and World Report.\n",
           general_title = "Data Source:")
Table 3: Highest per student expenditure (USD) university
Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate Acceptrate
Johns Hopkins University Yes 8474 3446 911 75 94 3566 1569 18800 6740 500 1040 96 97 3.3 38 56233 90 0.4066557
Data Source: US Colleges from the 1995 issue of US News and World Report.
#Avg Student to Faculty Ratio in Private Univ
PrivateSFR = subset(Task1data, Private == "Yes", 
         select=c(S.F.Ratio))
PrivateSFR = mean(PrivateSFR$S.F.Ratio)

#Avg Student to Faculty Ratio in Public Univ
PublicSFR = subset(Task1data, Private == "No", 
         select=c(S.F.Ratio))
PublicSFR = mean(PublicSFR$S.F.Ratio)

#Data merging; column order must match the col.names labels below
FinSFR = data.frame(Private_S.F.R = PrivateSFR,
                    Public_S.F.R = PublicSFR)

kable(round(FinSFR,2),
      caption = "<center>Table 4: Private vs Public Student:Faculty </center>",
      align = "c",
      col.names = c("Private University Student to Faculty Ratio",
                    "Public University Student to Faculty Ratio")) %>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "100%") %>%
   footnote(general = "US Colleges from the 1995 issue of US News and World Report.",
           general_title = "Data Source: ")
Table 4: Private vs Public Student:Faculty
Private University Student to Faculty Ratio Public University Student to Faculty Ratio
12.95 17.14
Data Source: US Colleges from the 1995 issue of US News and World Report.
testfit = glm(Private ~ S.F.Ratio + Grad.Rate + Outstate, data = Task1data, family = binomial)

testfit2 = glm(Private ~ Top10perc + F.Undergrad + PhD, data = Task1data, family = binomial)

#summary(testfit)
#summary(testfit2)

anov = anova(testfit, testfit2) #the two models are not nested, so this deviance comparison is informal; AIC below is the better criterion
anov %>%
  kable(caption = "<center>Table 5: Best fit Model</center>")%>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "100%") %>%
   footnote(general = "Test 1: Private ~ S.F.Ratio + Grad.Rate + Outstate and \n Test 2: Private ~ Top10perc + F.Undergrad + PhD\n",
           general_title = "Test fit: ")
Table 5: Best fit Model
Resid. Df Resid. Dev Df Deviance
773 541.1257 NA NA
773 482.9046 0 58.22114
Test fit: Test 1: Private ~ S.F.Ratio + Grad.Rate + Outstate and
Test 2: Private ~ Top10perc + F.Undergrad + PhD
#tab_model()

AIC(testfit, testfit2)%>%
  kable(caption = "<center>Table 6: AIC Best fit Model</center>",
        align = "c")%>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "50%") %>%
   footnote(general = "Test 1: Private ~ S.F.Ratio + Grad.Rate + Outstate and \n Test 2: Private ~ Top10perc + F.Undergrad + PhD\n",
           general_title = "Test fit: ")
Table 6: AIC Best fit Model
df AIC
testfit 4 549.1257
testfit2 4 490.9046
Test fit: Test 1: Private ~ S.F.Ratio + Grad.Rate + Outstate and
Test 2: Private ~ Top10perc + F.Undergrad + PhD
Observation:

The College dataset, sourced from the 1995 issue of US News and World Report, contains information on various attributes of American colleges and universities. The dataset includes variables such as the school’s name, private/public status, enrollment size, percentage of students from the top 10% and top 25% of their high school class, acceptance rate, and expenditure per student. The dataset can be used to explore relationships between different attributes, such as the relationship between a school’s expenditure per student and the percentage of students from the top of their high school class. Additionally, statistical techniques such as regression analysis can be used to identify the most significant predictors of a school’s overall success, as determined by US News and World Report. This dataset can be used to understand the state of colleges in 1995, to compare against recent data, and to study how the factors affecting the success of colleges and universities have changed over time.

Table 1 provides an overview of the dataset and descriptive statistics of its variables. One variable, Private, is categorical, as it takes Yes/No values; in Table 1 it is denoted by \(*\) to mark the categorical type. The remaining variables are discrete or continuous.
The bar chart of full-time students shows the undergraduates in US universities who were enrolled full time. It shows that public universities had the most full-time undergraduate students enrolled in 1995. However, the acceptance rate of private universities is higher than that of public universities: overall, private universities have an acceptance rate of 75.5% versus 72.7% for public universities.
Furthermore, out-of-state tuition is higher at private universities than at public ones. Almost 99% of public universities charge less than USD 15,000, whereas only about 20% of private universities do. Another contributing factor in the success of universities is instructional expenditure per student: the amount a college or university spends on instruction-related activities (faculty salaries, instructional materials, classroom technology, and so on) divided by the number of students enrolled. This metric is often used as an indicator of a school’s investment in its educational offerings, is useful for comparing the resources available at different institutions, and is also used to study the relationship between expenditure and academic outcomes. The country mean expenditure per student across universities is USD 9,660; 226 private and 36 public universities spend more than this mean, and Johns Hopkins University has the highest instructional expenditure at USD 56,233.
Another interesting result from the data is the student-to-faculty ratio: public universities average roughly 17:1, while private universities average about 13:1.

After consideration, I decided to use the student-to-faculty ratio, Outstate, Grad.Rate, the percentage of students from the top 10% of their high school class, and full-time undergraduate enrollment to measure the success of the universities. Model 1 regresses Private on S.F.Ratio, Outstate, and Grad.Rate; Model 2 regresses it on Top10perc, F.Undergrad, and PhD. Model 2 is the better fit, having the lower AIC (490.9 vs. 549.1) in Table 6. This selection is based on preliminary analysis; I fit different regression models in the next tasks.

Task 2:
#Train and Test#######

set.seed(130)

traintestind <- createDataPartition(Task1data$Private, p=0.70, list = FALSE)

traindata <- Task1data[traintestind,] #assigning 70% of data to train
testdata <- Task1data[-traintestind,] #assigning remaining 30% of data to test

#nrow(traindata) + nrow(testdata)
Observation:

In this task, I split the College dataset into two parts: a training set (70%) and a test set (30%). The training set is used to train the model, while the test set is used to evaluate the model’s performance.
The main reason for doing a train/test split is to evaluate the performance of a machine learning model: performance on the training set is not necessarily a good indicator of how well the model will perform on unseen data.

When a model is trained and tested on the same data, it can result in overfitting, where the model performs well on the training data but poorly on new unseen data. By splitting the data into a training set and a test set, we can evaluate the model’s performance on unseen data and get a better idea of how well the model is likely to perform on new, unseen data.

Another reason for doing a train and test split is to be able to compare the performance of different models. By training and testing several models on the same data, we can compare their performance and select the best model.

In addition, a train/test split helps in finding the optimal parameters of the model: hyperparameters are tuned using the training set and then evaluated on the test set.

Once the dataset has been split into training and test sets, the model can be trained on the training set using a function such as lm() (linear regression) or glm() (logistic regression). Then, the model’s performance can be evaluated on the test set using confusionMatrix().

In summary, train and test split is an important step in the machine learning process as it allows us to evaluate the performance of a model, compare the performance of different models, and ensure that the model is generalizing well to new, unseen data.
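As a small, hedged check using the objects created above: caret’s createDataPartition() samples within the levels of the outcome, so the proportion of private universities should be nearly identical in the two sets.

#Stratified split check: Yes/No proportions should closely match
prop.table(table(traindata$Private))
prop.table(table(testdata$Private))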

Task 3:
allglm = glm(Private ~ ., data = traindata, family = binomial(link = "logit"))

summary(allglm)
## 
## Call:
## glm(formula = Private ~ ., family = binomial(link = "logit"), 
##     data = traindata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3564  -0.0070   0.0396   0.1596   3.0606  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.493e-01  3.112e+00   0.177  0.85988    
## Apps        -8.661e-04  4.930e-04  -1.757  0.07893 .  
## Accept       1.465e-03  7.582e-04   1.933  0.05328 .  
## Enroll       9.076e-05  1.168e-03   0.078  0.93808    
## Top10perc    1.542e-02  3.607e-02   0.428  0.66891    
## Top25perc   -5.867e-04  2.430e-02  -0.024  0.98074    
## F.Undergrad -8.399e-04  2.879e-04  -2.918  0.00353 ** 
## P.Undergrad  2.247e-04  1.904e-04   1.181  0.23775    
## Outstate     9.145e-04  1.617e-04   5.654 1.57e-08 ***
## Room.Board  -4.767e-04  3.529e-04  -1.351  0.17673    
## Books        6.782e-04  1.511e-03   0.449  0.65358    
## Personal    -1.564e-04  3.275e-04  -0.477  0.63301    
## PhD         -3.799e-02  3.604e-02  -1.054  0.29183    
## Terminal    -4.040e-02  3.442e-02  -1.174  0.24047    
## S.F.Ratio    3.935e-03  8.349e-02   0.047  0.96241    
## perc.alumni  4.128e-02  2.692e-02   1.533  0.12522    
## Expend       2.253e-04  1.480e-04   1.522  0.12798    
## Grad.Rate    1.496e-02  1.486e-02   1.007  0.31413    
## Acceptrate  -1.605e+00  2.347e+00  -0.684  0.49417    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 639.40  on 544  degrees of freedom
## Residual deviance: 154.73  on 526  degrees of freedom
## AIC: 192.73
## 
## Number of Fisher Scoring iterations: 8
task3fit1 = glm(Private ~ F.Undergrad + perc.alumni + Outstate, data = traindata, family = binomial(link = "logit"))

task3fit2 = glm(Private ~ Personal + Enroll, data = traindata, family = binomial(link = "logit"))

#summary(task3fit1)
#summary(task3fit2)

exp(coef(task3fit1)) %>%
  kable(caption = "<center>Table 7: Coefficinent of Model</center>",
        col.names = "Coefficient of Model",
        align = "c") %>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "50%") %>%
   footnote(general = "Test 1: Private ~ F.Undergrad + perc.alumni + Outstate",
           general_title = "Test fit: ")
Table 7: Coefficient of Model
Coefficient of Model
(Intercept) 0.0148729
F.Undergrad 0.9992869
perc.alumni 1.0323057
Outstate 1.0008423
Test fit: Test 1: Private ~ F.Undergrad + perc.alumni + Outstate
Observation:

GLMs are a class of statistical models that extend the linear model framework to allow for response variables with error distributions other than the normal distribution. glm() can fit a wide range of models, including linear regression, logistic regression, Poisson regression, and others. It handles both continuous and categorical predictor variables and can include interactions and non-linear terms.

In this task, I ran glm() on the full set of predictors, with Private as the dependent variable. Once the model is fitted, summary() is used to extract information and make inferences. The summary shows that F.Undergrad and Outstate are significant at the 0.05 level, with Apps and Accept marginal (p < 0.1). These significance codes only indicate whether the coefficients are statistically significant; they do not convey the effect size, or the magnitude of the coefficient, which matters for judging the practical significance of the results.
I then fit two candidate models on the training data: one with F.Undergrad, perc.alumni (percentage of alumni who donate), and Outstate, and one with Personal (personal spending) and Enroll (newly enrolled students). Of these two, the F.Undergrad + perc.alumni + Outstate model is the better fit and is selected for further analysis.
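As a hedged follow-up to Table 7: the exponentiated coefficients are odds ratios, i.e., the multiplicative change in the odds of Private = “Yes” for a one-unit increase in each predictor. A minimal sketch adds Wald 95% confidence intervals via confint.default(), which uses the normal approximation:

#Odds ratios with Wald 95% CIs; values near 1 reflect small per-unit
#effects (e.g., per additional student or per additional dollar)
round(exp(cbind(OddsRatio = coef(task3fit1), confint.default(task3fit1))), 4)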

Task 4:
Probtrain = predict(task3fit1, newdata = traindata, type = "response")
predclassmin <- as.factor(ifelse(Probtrain >= 0.5, "Yes", "No"))

head(Probtrain,10) %>%
  kable(caption = "<center>Table 8: Predicated Probability of Train Data </center>",
    col.names = "Probabilities",
        align = "c") %>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "50%") %>%
   footnote(general = "Predict model ran on train data shows first 10 observations",
           general_title = "Probability: ")
Table 8: Predicted Probability of Train Data
Probabilities
Adelphi University 0.9912108
Adrian College 0.9958409
Agnes Scott College 0.9994559
Alaska Pacific University 0.8852125
Albertson College 0.9991105
Albertus Magnus College 0.9994532
Albion College 0.9994511
Albright College 0.9998714
Alderson-Broaddus College 0.9891433
Alfred University 0.9999176
Probability: Predict model ran on train data shows first 10 observations
confusionMatrix(predclassmin, traindata$Private, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  130  16
##        Yes  19 380
##                                           
##                Accuracy : 0.9358          
##                  95% CI : (0.9118, 0.9549)
##     No Information Rate : 0.7266          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8373          
##                                           
##  Mcnemar's Test P-Value : 0.7353          
##                                           
##             Sensitivity : 0.9596          
##             Specificity : 0.8725          
##          Pos Pred Value : 0.9524          
##          Neg Pred Value : 0.8904          
##              Prevalence : 0.7266          
##          Detection Rate : 0.6972          
##    Detection Prevalence : 0.7321          
##       Balanced Accuracy : 0.9160          
##                                           
##        'Positive' Class : Yes             
## 
Observation:

The predict() function is used to make predictions for new data using a fitted model. The predictions can be used for a variety of purposes, such as:
Model evaluation: By comparing the predicted values to the actual values for a set of new data, we can evaluate the performance of the model and determine how well it is able to generalize to new data.
Model comparison: By comparing the predicted values from different models, we can determine which model is the best fit for the data.
Forecasting: We can use the predict function to forecast future values of a response variable based on the model that has been fit to historical data.
Understanding model behavior: The predict function can also be used to understand how the model behaves for different input values. This can be useful for understanding the relationship between predictor variables and the response variable.
Decision making: In some cases, the predictions can be used to make decisions; here, for example, the predicted probabilities identify whether a university is private or public. Overall, predict() is a key function in data modeling, as it allows us to test the model on unseen data and evaluate its performance.

Furthermore, ifelse() and as.factor() are applied to the output of predict() to convert probabilities of at least 0.5 to “Yes” and the rest to “No.” This makes the predictions compatible with confusionMatrix().

confusionMatrix() is used to create a confusion matrix, which is a table used to evaluate the performance of a classification model. The matrix is created by comparing the predicted class labels to the actual class labels for a set of test data.
The confusion matrix has four main elements: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).

True positives (TP) are the number of observations that are correctly classified as belonging to the positive class. In our train model, it is 380.
False positives (FP) are the number of observations that are incorrectly classified as belonging to the positive class. In our train model, it is 19.
True negatives (TN) are the number of observations that are correctly classified as belonging to the negative class. In our train model, it is 130.
False negatives (FN) are the number of observations that are incorrectly classified as belonging to the negative class. In our train model, it is 16.

Type 1 Error: False Positives (FP)
Type 2 Error: False Negatives (FN)

The confusion matrix is useful for understanding the performance of a classification model in terms of precision, recall, accuracy, and specificity.
confusionMatrix() can be used with various classification models, such as logistic regression, decision trees, random forests, etc., and also with multiclass classification problems. It is an important step in evaluating the performance of a model and helps in identifying any errors or bias present in the model.

In general, false positives and false negatives have different consequences, and it is important to consider the specific context of the analysis when deciding which misclassification is more damaging. Here the positive class is “private,” so a false negative (labeling a private university as public) may be more damaging if the goal is to identify private institutions, while a false positive (labeling a public university as private) would matter more if the goal were to identify public ones.

By counting the number of observations in each of these cells, we can calculate various metrics such as accuracy, precision, recall, and F1-score; the sketch after the formulas below checks them against the reported output.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
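A minimal sketch, plugging the train-set counts from the confusion matrix above into these formulas; the results should reproduce the caret output (accuracy 0.9358, precision 0.9524, recall 0.9596, F1 0.9560).

#Metrics computed by hand from the train confusion matrix counts
TP = 380; TN = 130; FP = 19; FN = 16
accuracy = (TP + TN)/(TP + TN + FP + FN)
precision = TP/(TP + FP)
recall = TP/(TP + FN)
f1 = 2*precision*recall/(precision + recall)
round(c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1), 4)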

Task 5: Interpreting confusionMatrix()
#Report and interpret metrics for Accuracy, Precision, Recall, and Specificity.

cmt5 = confusionMatrix(predclassmin, traindata$Private, positive = "Yes")
cmt5$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.9595960            0.8724832            0.9523810 
##       Neg Pred Value            Precision               Recall 
##            0.8904110            0.9523810            0.9595960 
##                   F1           Prevalence       Detection Rate 
##            0.9559748            0.7266055            0.6972477 
## Detection Prevalence    Balanced Accuracy 
##            0.7321101            0.9160396
cmt5$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   9.357798e-01   8.373370e-01   9.118140e-01   9.548642e-01   7.266055e-01 
## AccuracyPValue  McnemarPValue 
##   8.161411e-36   7.353167e-01
Observation:

True positives (TP) are the number of observations that are correctly classified as belonging to the positive class. In our train model, it is 380.
False positives (FP) are the number of observations that are incorrectly classified as belonging to the positive class. In our train model, it is 19.
True negatives (TN) are the number of observations that are correctly classified as belonging to the negative class. In our train model, it is 130.
False negatives (FN) are the number of observations that are incorrectly classified as belonging to the negative class. In our train model, it is 16.

Type 1 Error: False Positives (FP)
Type 2 Error: False Negatives (FN)

The accuracy of the model is about 94%. Sensitivity (recall) is almost 96%, meaning about 96 of every 100 actual private universities are predicted correctly. Precision is 95%, meaning 95% of the universities predicted to be private actually are. Specificity is the proportion of actual negatives that are correctly identified; in our model it is 87%, meaning 87% of actual public universities are correctly labeled public, while the remaining 13% are incorrectly labeled private.

Task 6:
#Confusion Matrix for test data#####


Probtest = predict(task3fit1, newdata = testdata, type = "response")
predclassmintest <- as.factor(ifelse(Probtest >= 0.5, "Yes", "No"))

head(Probtest,10) %>%
  kable(caption = "<center>Table 9: Predicated Probabilites of Test Data</center>",
        col.names = "Probabilities for Test data",
        align = "c") %>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "50%") %>%
   footnote(general = "Predict model ran on test data shows first 10 observations",
           general_title = "Probability: ")
Table 9: Predicted Probabilities of Test Data
Probabilities for Test data
Abilene Christian University 0.5936725
Amherst College 0.9999983
Antioch University 0.9999194
Appalachian State University 0.0059170
Aquinas College 0.9941216
Arkansas Tech University 0.0239912
Assumption College 0.9964255
Barat College 0.9913350
Barnard College 0.9999702
Barry University 0.9873203
Probability: Predict model ran on test data shows first 10 observations
cm = confusionMatrix(predclassmintest, testdata$Private, positive = "Yes") 
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No   58   9
##        Yes   5 160
##                                           
##                Accuracy : 0.9397          
##                  95% CI : (0.9008, 0.9666)
##     No Information Rate : 0.7284          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8504          
##                                           
##  Mcnemar's Test P-Value : 0.4227          
##                                           
##             Sensitivity : 0.9467          
##             Specificity : 0.9206          
##          Pos Pred Value : 0.9697          
##          Neg Pred Value : 0.8657          
##              Prevalence : 0.7284          
##          Detection Rate : 0.6897          
##    Detection Prevalence : 0.7112          
##       Balanced Accuracy : 0.9337          
##                                           
##        'Positive' Class : Yes             
## 
#precision <- cm$byClass['Pos Pred Value']    
#recall <- cm$byClass['Sensitivity']

#f_measure <- 2 * ((precision * recall) / (precision + recall))

fval = cm$byClass["F1"]

#note: 1 - F1 is an F1-based error, not the misclassification rate
#1 - accuracy, which here is about 0.0603
error = data.frame("error" = c(1-fval))
error = error$error

round(error,4) %>%
  kable(caption = "<center>Table 10: Error Probability in the model</center>",
    col.names = "Error Probability",
        align = "c") %>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "50%") %>%
   footnote(general = "Private ~ F.Undergrad + perc.alumni + Outstate",
           general_title = "Error Probability: ")
Table 10: Error Probability in the model
Error Probability
0.0419
Error Probability: Private ~ F.Undergrad + perc.alumni + Outstate
Observation:

True positives (TP) are the number of observations that are correctly classified as belonging to the positive class. In our test model, it is 160.
False positives (FP) are the number of observations that are incorrectly classified as belonging to the positive class. In our test model, it is 5.
True negatives (TN) are the number of observations that are correctly classified as belonging to the negative class. In our test model, it is 58.
False negatives (FN) are the number of observations that are incorrectly classified as belonging to the negative class. In our test model, it is 9.

Type 1 Error: False Positives (FP)
Type 2 Error: False Negatives (FN)

The accuracy of the model is 94%. Sensitivity (recall) is almost 95%, meaning about 95 of every 100 actual private universities are predicted correctly as private. Precision is 97%, meaning 97% of the universities predicted to be private actually are. Specificity is the proportion of actual negatives that are correctly identified; in our model it is 92%, meaning 92% of actual public universities are correctly labeled public and the remaining 8% are incorrectly labeled private.
The F1-based error of the model on the test dataset is about 4%; the raw misclassification rate (1 − accuracy) is about 6%, i.e., roughly 6 of every 100 universities are mislabeled.
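A brief check of the two error figures, reusing the cm object computed above:

1 - cm$overall["Accuracy"]  #misclassification rate, about 0.0603
1 - cm$byClass["F1"]        #F1-based error reported in Table 10, about 0.0419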

Task 7:
#Receiver Operator Characteristic Curve######

rocdata = roc(testdata$Private, Probtest)

ggroc(rocdata, colour = 'steelblue', size = 2) +
  ggtitle("ROC Chart")+
  xlab(" Specificity: FP Rate")+
  ylab("Sensitivity: TP Rate")+
  theme(plot.title = element_text(hjust = 0.5))

Observation:

The roc() function from the pROC package is used to create a receiver operating characteristic (ROC) curve, which is a graphical representation of the performance of a binary classifier as the discrimination threshold is varied. The ROC curve is created by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various threshold settings. The area under the ROC curve (AUC) is a measure of the overall performance of the classifier, with a value of 1 indicating perfect classification and a value of 0.5 indicating a classifier that performs no better than chance.
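As a hedged extension using the same rocdata object, pROC::coords() can report the probability threshold that maximizes Youden’s J statistic (sensitivity + specificity − 1), an alternative to the fixed 0.5 cutoff used earlier:

#Threshold that maximizes Youden's J on the test-set ROC curve
coords(rocdata, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))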

Task 8:
#Area Under Curve####

auc(rocdata)
## Area under the curve: 0.9714
Observation:

The ROC chart shows the model is well suited to making accurate predictions, as the AUC (area under the curve) is about 0.97.

Conclusion:

In this project, we used glm() to fit a logistic regression on the categorical variable Private, which takes Yes/No values indicating the university category. Running glm() on the dataset guided model selection, and we chose three independent variables to train and test a model that predicts private versus public status from full-time undergraduate enrollment, the percentage of alumni who donate, and out-of-state tuition. The model shows roughly 94% accuracy overall, and its F1-based error is nearly 4%.

References:


Bluman, A. (2014). Elementary statistics: A step by step approach. McGraw-Hill Education.
Gero, E. (2023). ALY6015_Module3_Logistic_Regression [Lecture recording]. University. https://canvas.northeastern.edu/
Kabacoff, R. (2015). R in action. Manning Publications Co.
Zach. (2020). How to calculate AUC (area under curve) in R. Statology. https://www.statology.org/auc-in-r/