GLM and Logistic Regression


Introduction

A family of regression models known as generalized linear models (GLMs) enables us to adapt the linear regression methodology to a wide range of dependent variables. A generalized linear model does not require a continuous or normally distributed dependent variable (Gero, 2023).

Three components of a GLM:
Random component – the probability distribution of the response variable.
Systematic component – specifies the explanatory variables (X1, X2, …, Xk) in the model, more specifically their linear combination, which forms the so-called linear predictor.
Link function – specifies the link between the random and systematic components.

Model Families (each is illustrated in the sketch after this list):


Gaussian family – continuous data with a normal distribution, such as weight or length.

Binomial – binary data (0/1) or proportions, such as survival vs. death counts or positive vs. negative frequencies.

Poisson – used for counts or frequencies.

Gamma – used for time data, such as the duration until an event occurs.
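As a quick, hedged illustration of how these families map onto the family argument of glm(), the sketch below fits each one to small simulated vectors (all data here are made up purely for demonstration):

#Illustrative only: simulated responses, hypothetical relationships
set.seed(1)
x = 1:20
glm(rnorm(20, mean = 2 + 0.5*x) ~ x, family = gaussian())            #continuous
glm(rbinom(20, size = 1, prob = 0.5) ~ x, family = binomial())       #binary
glm(rpois(20, lambda = 3) ~ x, family = poisson())                   #counts
glm(rgamma(20, shape = 2, rate = 0.5) ~ x, family = Gamma("log"))    #positive durations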

In contrast to linear regression, whose outcome is continuous, the outcome of logistic regression is categorical: it takes one of a limited set of values, such as “yes” or “no,” or whether someone has brown, blue, or green eyes. Classification is the prediction of such a label or categorical variable.
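To make the three GLM components concrete for the logistic case, here is a minimal sketch with simulated data and hypothetical coefficients: the random component is a binomial response, the systematic component is the linear predictor, and the logit link connects the two.

set.seed(2)
x = rnorm(200)                          #predictor
eta = -0.5 + 1.2*x                      #systematic component: linear predictor
p = 1/(1 + exp(-eta))                   #inverse logit link turns eta into a probability
y = rbinom(200, size = 1, prob = p)     #random component: binomial (0/1) response
fit = glm(y ~ x, family = binomial(link = "logit"))
coef(fit)                               #estimates should be near -0.5 and 1.2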

Task 1: College dataset EDA
#EDA######
#Packages used throughout this report (assumed loaded in a setup chunk):
library(psych)       #describe()
library(dplyr)       #%>%, mutate(), filter()
library(knitr)       #kable()
library(kableExtra)  #kable_styling(), scroll_box(), footnote()
library(ggplot2)     #plots
library(caret)       #createDataPartition(), confusionMatrix()
library(pROC)        #roc(), ggroc(), auc()

Task1data = Project3_data #College data loaded earlier as Project3_data

describe(Task1data)%>% #table representation of the statistical description
  kable(caption = "<center>Table 1: Statistical Description of College Dataset</center>",
        align = "c")%>%
  kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "US Colleges from the 1995 issue of US News and World Report.\n",
           general_title = "Data Source:",
           symbol = c("Categorical Variable"))
Table 1: Statistical Description of College Dataset
Variable vars n mean sd median trimmed mad min max range skew kurtosis se
Private* 1 777 1.727156 0.4457084 2.0 1.783307 0.00000 1.0 2.0 1.0 -1.0179902 -0.9649328 0.0159897
Apps 2 777 3001.638353 3870.2014844 1558.0 2193.008026 1463.32620 81.0 48094.0 48013.0 3.7093849 26.5184313 138.8427049
Accept 3 777 2018.804376 2451.1139710 1110.0 1510.287319 1008.16800 72.0 26330.0 26258.0 3.4045428 18.7526403 87.9332239
Enroll 4 777 779.972973 929.1761901 434.0 575.953451 354.34140 35.0 6392.0 6357.0 2.6800857 8.7368340 33.3340101
Top10perc 5 777 27.558559 17.6403644 23.0 25.130016 13.34340 1.0 96.0 95.0 1.4077650 2.1728286 0.6328445
Top25perc 6 777 55.796654 19.8047776 54.0 55.121990 20.75640 9.0 100.0 91.0 0.2583399 -0.5744647 0.7104924
F.Undergrad 7 777 3699.907336 4850.4205309 1707.0 2574.884430 1441.08720 139.0 31643.0 31504.0 2.6003876 7.6120676 174.0078673
P.Undergrad 8 777 855.298584 1522.4318873 353.0 536.361156 449.22780 1.0 21836.0 21835.0 5.6703938 54.5249401 54.6169397
Outstate 9 777 10440.669241 4023.0164841 9990.0 10181.658106 4121.62800 2340.0 21700.0 19360.0 0.5073133 -0.4255258 144.3249124
Room.Board 10 777 4357.526383 1096.6964156 4200.0 4301.704655 1005.20280 1780.0 8124.0 6344.0 0.4755141 -0.2012779 39.3437648
Books 11 777 549.380952 165.1053601 500.0 535.219904 148.26000 96.0 2340.0 2244.0 3.4715806 28.0632782 5.9231218
Personal 12 777 1340.642214 677.0714536 1200.0 1268.345104 593.04000 250.0 6800.0 6550.0 1.7357745 7.0446395 24.2898031
PhD 13 777 72.660232 16.3281547 75.0 73.922954 17.79120 8.0 103.0 95.0 -0.7652067 0.5442923 0.5857693
Terminal 14 777 79.702703 14.7223585 82.0 81.102729 14.82600 24.0 100.0 76.0 -0.8133924 0.2244365 0.5281617
S.F.Ratio 15 777 14.089704 3.9583491 13.6 13.935795 3.40998 2.5 39.8 37.3 0.6648606 2.5228017 0.1420050
perc.alumni 16 777 22.743887 12.3918015 21.0 21.857143 13.34340 0.0 64.0 64.0 0.6045500 -0.1113466 0.4445534
Expend 17 777 9660.171171 5221.7684399 8377.0 8823.704655 2730.94920 3186.0 56233.0 53047.0 3.4459767 18.5875365 187.3298993
Grad.Rate 18 777 65.463320 17.1777099 65.0 65.601926 17.79120 10.0 118.0 108.0 -0.1133384 -0.2187930 0.6162469
Data Source: US Colleges from the 1995 issue of US News and World Report.
* Categorical Variable
#DescTools::Desc(Task1data$Private)

##Statistical Charts#####

ggplot(Task1data, aes(y=F.Undergrad, x=Private)) + 
    geom_bar(stat = "identity", position = "dodge")+ #bars show summed F.Undergrad within each Private level
  ggtitle("Full-time Undergrad Students in Universities") +
  ylab("Number of Undergrads")+
  theme(plot.title = element_text(hjust = 0.5))

Task1data = Task1data %>%
  mutate(Acceptrate = Accept / Apps) #add acceptance rate column

#private univ subset
Privaterate = subset(Task1data, Private == "Yes", 
         select=c(Acceptrate))
Privaterate = mean(Privaterate$Acceptrate) #mean acceptance rate, private universities

#public univ subset
Publicrate = subset(Task1data, Private == "No", 
         select=c(Acceptrate))
Publicrate = mean(Publicrate$Acceptrate) #mean acceptance rate, public universities

#data combining
acceptrate = data.frame(Private = Privaterate,
                        Public = Publicrate)

#avg acceptance rate table representation
kable(round(acceptrate,3)*100,
      caption = "<center>Table 2: Private vs Public Acceptance rate </center>",
      align = "c",
      col.names = c("Private University Acceptance Rate %",
                    "Public University Acceptance Rate %")) %>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "100%") %>%
   footnote(general = "US Colleges from the 1995 issue of US News and World Report.\n",
           general_title = "Data Source:")
Table 2: Private vs Public Acceptance rate
Private University Acceptance Rate % Public University Acceptance Rate %
75.5 72.7
Data Source: US Colleges from the 1995 issue of US News and World Report.
#Public vs Private Out-of-State Tuition
cdplot(Private ~ Outstate, data=Task1data, ylevels = 2:1,
       main = "Out-of-State Tuition of Private, Public",
       xlab = "Tuition Fee (USD)")

#Universities with per-student expenditure above the country mean
Abovemeanexp = Task1data[which(Task1data$Expend > mean(Task1data$Expend) ),]

meancount <- function(y, uplim = max(Abovemeanexp$Expend) * 1.15) { #label text and y-position for stat_summary()
  return(data.frame(
    y = 0.95 * uplim,
    label = paste(
      "Count =", length(y), "\n"
    )
  ))
}
ggplot(Abovemeanexp, aes(y=Expend, x=Private)) + 
    geom_bar(position="dodge", stat="identity")+
  ggtitle("Expenditure per Student  greater than country mean") +
  ylab("Expenditure in USD")+
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(size = 7))+
  #scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6))+
  labs(caption = "Country Mean per student expenditure, $9660") +
  stat_summary( 
               fun.data = meancount, 
               geom = "text", hjust = 0.5, vjust = 0.9, size = 2)

#Highest expenditure per student 
Mostperstudentexp = Abovemeanexp %>%
  filter(Expend == max(Expend)) #refer to columns directly inside dplyr verbs

kable(Mostperstudentexp,
      caption = "<center>Table 3: Highest per student expenditure (USD) university</center>",
      align = "c")%>%
  kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "US Colleges from the 1995 issue of US News and World Report.\n",
           general_title = "Data Source:")
Table 3: Highest per student expenditure (USD) university
Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate Acceptrate
Johns Hopkins University Yes 8474 3446 911 75 94 3566 1569 18800 6740 500 1040 96 97 3.3 38 56233 90 0.4066557
Data Source: US Colleges from the 1995 issue of US News and World Report.
#Avg Student to Faculty Ratio in Private Univ
PrivateSFR = subset(Task1data, Private == "Yes", 
         select=c(S.F.Ratio))
PrivateSFR = mean(PrivateSFR$S.F.Ratio)

#Avg Student to Faculty Ratio in Public Univ
PublicSFR = subset(Task1data, Private == "No", 
         select=c(S.F.Ratio))
PublicSFR = mean(PublicSFR$S.F.Ratio)

#Data merging; column order must match the col.names labels below
FinSFR = data.frame(Private_S.F.R = PrivateSFR,
                    Public_S.F.R = PublicSFR)

kable(round(FinSFR,2),
      caption = "<center>Table 4: Private vs Public Student:Faculty </center>",
      align = "c",
      col.names = c("Private University Student to Faculty Ratio",
                    "Public University Student to Faculty Ratio")) %>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "100%") %>%
   footnote(general = "US Colleges from the 1995 issue of US News and World Report.",
           general_title = "Data Source: ")
Table 4: Private vs Public Student:Faculty
Private University Student to Faculty Ratio Public University Student to Faculty Ratio
12.95 17.14
Data Source: US Colleges from the 1995 issue of US News and World Report.
testfit = glm(Private ~ S.F.Ratio + Grad.Rate + Outstate, data = Task1data, family = binomial)

testfit2 = glm(Private ~ Top10perc + F.Undergrad + PhD, data = Task1data, family = binomial)

#summary(testfit)
#summary(testfit2)

anov = anova(testfit, testfit2) #the two models are not nested, so this deviance comparison is informal; AIC below is the better criterion
anov %>%
  kable(caption = "<center>Table 5: Best fit Model</center>")%>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "100%") %>%
   footnote(general = "Test 1: Private ~ S.F.Ratio + Grad.Rate + Outstate and \n Test 2: Private ~ Top10perc + F.Undergrad + PhD\n",
           general_title = "Test fit: ")
Table 5: Best fit Model
Resid. Df Resid. Dev Df Deviance
773 541.1257 NA NA
773 482.9046 0 58.22114
Test fit: Test 1: Private ~ S.F.Ratio + Grad.Rate + Outstate and
Test 2: Private ~ Top10perc + F.Undergrad + PhD
#tab_model()

AIC(testfit, testfit2)%>%
  kable(caption = "<center>Table 6: AIC Best fit Model</center>",
        align = "c")%>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "50%") %>%
   footnote(general = "Test 1: Private ~ S.F.Ratio + Grad.Rate + Outstate and \n Test 2: Private ~ Top10perc + F.Undergrad + PhD\n",
           general_title = "Test fit: ")
Table 6: AIC Best fit Model
df AIC
testfit 4 549.1257
testfit2 4 490.9046
Test fit: Test 1: Private ~ S.F.Ratio + Grad.Rate + Outstate and
Test 2: Private ~ Top10perc + F.Undergrad + PhD
Observation:

The College dataset, sourced from the 1995 issue of US News and World Report, contains information on various attributes of American colleges and universities. The dataset includes variables such as the school’s name, private/public status, enrollment size, percentage of students from the top 10% and top 25% of their high school class, acceptance rate, and expenditure per student. The dataset can be used to explore relationships between different attributes, such as the relationship between a school’s expenditure per student and the percentage of students from the top of their high school class. Additionally, statistical techniques such as regression analysis can be used to identify the most significant predictors of a school’s overall success, as determined by US News and World Report. This dataset can be used to understand the state of colleges in 1995, to compare against recent data, and to study how the factors affecting the success of colleges and universities have changed over time.

Table 1 provides an overview of the dataset and descriptive statistics of its variables. One variable, Private, is categorical, as it takes Yes/No values; in Table 1 it is denoted by \(*\) to mark the categorical type. The remaining variables are discrete or continuous.
The bar chart of full-time students shows the undergraduates in US universities who were enrolled full time. It shows that public universities had the most full-time undergraduate students enrolled in 1995. However, the acceptance rate of private universities is higher than that of public universities: overall, private universities have an acceptance rate of 75.5% versus 72.7% for public universities.
Furthermore, out-of-state tuition is higher at private universities than at public ones. Almost 99% of public universities charge less than USD 15,000, whereas only about 20% of private universities do. Another contributing factor in the success of universities is instructional expenditure per student: the amount a college or university spends on instruction-related activities (faculty salaries, instructional materials, classroom technology, and so on) divided by the number of students enrolled. This metric is often used as an indicator of a school’s investment in its educational offerings, is useful for comparing the resources available at different institutions, and is also used to study the relationship between expenditure and academic outcomes. The country mean expenditure per student across universities is USD 9,660; 226 private and 36 public universities spend more than this mean, and Johns Hopkins University has the highest instructional expenditure at USD 56,233.
Another interesting result from the data is the student-to-faculty ratio: public universities average roughly 17:1, while private universities average about 13:1.

After consideration, I decided to use the student-to-faculty ratio, Outstate, Grad.Rate, the percentage of students from the top 10% of their high school class, and full-time undergraduate enrollment to measure the success of the universities. Model 1 regresses Private on S.F.Ratio, Outstate, and Grad.Rate; Model 2 regresses it on Top10perc, F.Undergrad, and PhD. Model 2 is the better fit, having the lower AIC (490.9 vs. 549.1) in Table 6. This selection is based on preliminary analysis; I fit different regression models in the next tasks.

Task 2:
#Train and Test#######

set.seed(130)

traintestind <- createDataPartition(Task1data$Private, p=0.70, list = FALSE)

traindata <- Task1data[traintestind,] #assigning 70% of data to train
testdata <- Task1data[-traintestind,] #assigning remaining 30% of data to test

#nrow(traindata) + nrow(testdata)
Observation:

In this task, I split the College dataset into two parts: a training set (70%) and a test set (30%). The training set is used to train the model, while the test set is used to evaluate the model’s performance.
The main reason for doing a train/test split is to evaluate the performance of a machine learning model: performance on the training set is not necessarily a good indicator of how well the model will perform on unseen data.

When a model is trained and tested on the same data, it can result in overfitting, where the model performs well on the training data but poorly on new unseen data. By splitting the data into a training set and a test set, we can evaluate the model’s performance on unseen data and get a better idea of how well the model is likely to perform on new, unseen data.

Another reason for doing a train and test split is to be able to compare the performance of different models. By training and testing several models on the same data, we can compare their performance and select the best model.

In addition, a train/test split helps in finding the optimal parameters of the model: hyperparameters are tuned using the training set and then evaluated on the test set.

Once the dataset has been split into training and test sets, the model can be trained on the training set using a function such as lm() (linear regression) or glm() (logistic regression). Then, the model’s performance can be evaluated on the test set using confusionMatrix().

In summary, train and test split is an important step in the machine learning process as it allows us to evaluate the performance of a model, compare the performance of different models, and ensure that the model is generalizing well to new, unseen data.
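As a small, hedged check using the objects created above: caret’s createDataPartition() samples within the levels of the outcome, so the proportion of private universities should be nearly identical in the two sets.

#Stratified split check: Yes/No proportions should closely match
prop.table(table(traindata$Private))
prop.table(table(testdata$Private))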

Task 3:
allglm = glm(Private ~ ., data = traindata, family = binomial(link = "logit"))

summary(allglm)
## 
## Call:
## glm(formula = Private ~ ., family = binomial(link = "logit"), 
##     data = traindata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3564  -0.0070   0.0396   0.1596   3.0606  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.493e-01  3.112e+00   0.177  0.85988    
## Apps        -8.661e-04  4.930e-04  -1.757  0.07893 .  
## Accept       1.465e-03  7.582e-04   1.933  0.05328 .  
## Enroll       9.076e-05  1.168e-03   0.078  0.93808    
## Top10perc    1.542e-02  3.607e-02   0.428  0.66891    
## Top25perc   -5.867e-04  2.430e-02  -0.024  0.98074    
## F.Undergrad -8.399e-04  2.879e-04  -2.918  0.00353 ** 
## P.Undergrad  2.247e-04  1.904e-04   1.181  0.23775    
## Outstate     9.145e-04  1.617e-04   5.654 1.57e-08 ***
## Room.Board  -4.767e-04  3.529e-04  -1.351  0.17673    
## Books        6.782e-04  1.511e-03   0.449  0.65358    
## Personal    -1.564e-04  3.275e-04  -0.477  0.63301    
## PhD         -3.799e-02  3.604e-02  -1.054  0.29183    
## Terminal    -4.040e-02  3.442e-02  -1.174  0.24047    
## S.F.Ratio    3.935e-03  8.349e-02   0.047  0.96241    
## perc.alumni  4.128e-02  2.692e-02   1.533  0.12522    
## Expend       2.253e-04  1.480e-04   1.522  0.12798    
## Grad.Rate    1.496e-02  1.486e-02   1.007  0.31413    
## Acceptrate  -1.605e+00  2.347e+00  -0.684  0.49417    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 639.40  on 544  degrees of freedom
## Residual deviance: 154.73  on 526  degrees of freedom
## AIC: 192.73
## 
## Number of Fisher Scoring iterations: 8
task3fit1 = glm(Private ~ F.Undergrad + perc.alumni + Outstate, data = traindata, family = binomial(link = "logit"))

task3fit2 = glm(Private ~ Personal + Enroll, data = traindata, family = binomial(link = "logit"))

#summary(task3fit1)
#summary(task3fit2)

exp(coef(task3fit1)) %>%
  kable(caption = "<center>Table 7: Coefficinent of Model</center>",
        col.names = "Coefficient of Model",
        align = "c") %>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "50%") %>%
   footnote(general = "Test 1: Private ~ F.Undergrad + perc.alumni + Outstate",
           general_title = "Test fit: ")
Table 7: Coefficient of Model
Coefficient of Model
(Intercept) 0.0148729
F.Undergrad 0.9992869
perc.alumni 1.0323057
Outstate 1.0008423
Test fit: Test 1: Private ~ F.Undergrad + perc.alumni + Outstate
Observation:

GLMs are a class of statistical models that extend the linear model framework to allow for response variables with error distributions other than the normal distribution. glm() can fit a wide range of models, including linear regression, logistic regression, Poisson regression, and others. It handles both continuous and categorical predictor variables and can include interactions and non-linear terms.

In this task, I ran glm() on the full set of predictors, with Private as the dependent variable. Once the model is fitted, summary() is used to extract information and make inferences. The summary shows that F.Undergrad and Outstate are significant at the 0.05 level, with Apps and Accept marginal (p < 0.1). These significance codes only indicate whether the coefficients are statistically significant; they do not convey the effect size, or the magnitude of the coefficient, which matters for judging the practical significance of the results.
I then fit two candidate models on the training data: one with F.Undergrad, perc.alumni (percentage of alumni who donate), and Outstate, and one with Personal (personal spending) and Enroll (newly enrolled students). Of these two, the F.Undergrad + perc.alumni + Outstate model is the better fit and is selected for further analysis.
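As a hedged follow-up to Table 7: the exponentiated coefficients are odds ratios, i.e., the multiplicative change in the odds of Private = “Yes” for a one-unit increase in each predictor. A minimal sketch adds Wald 95% confidence intervals via confint.default(), which uses the normal approximation:

#Odds ratios with Wald 95% CIs; values near 1 reflect small per-unit
#effects (e.g., per additional student or per additional dollar)
round(exp(cbind(OddsRatio = coef(task3fit1), confint.default(task3fit1))), 4)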

Task 4:
Probtrain = predict(task3fit1, newdata = traindata, type = "response")
predclassmin <- as.factor(ifelse(Probtrain >= 0.5, "Yes", "No"))

head(Probtrain,10) %>%
  kable(caption = "<center>Table 8: Predicated Probability of Train Data </center>",
    col.names = "Probabilities",
        align = "c") %>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "50%") %>%
   footnote(general = "Predict model ran on train data shows first 10 observations",
           general_title = "Probability: ")
Table 8: Predicted Probability of Train Data
Probabilities
Adelphi University 0.9912108
Adrian College 0.9958409
Agnes Scott College 0.9994559
Alaska Pacific University 0.8852125
Albertson College 0.9991105
Albertus Magnus College 0.9994532
Albion College 0.9994511
Albright College 0.9998714
Alderson-Broaddus College 0.9891433
Alfred University 0.9999176
Probability: Predict model ran on train data shows first 10 observations
confusionMatrix(predclassmin, traindata$Private, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  130  16
##        Yes  19 380
##                                           
##                Accuracy : 0.9358          
##                  95% CI : (0.9118, 0.9549)
##     No Information Rate : 0.7266          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8373          
##                                           
##  Mcnemar's Test P-Value : 0.7353          
##                                           
##             Sensitivity : 0.9596          
##             Specificity : 0.8725          
##          Pos Pred Value : 0.9524          
##          Neg Pred Value : 0.8904          
##              Prevalence : 0.7266          
##          Detection Rate : 0.6972          
##    Detection Prevalence : 0.7321          
##       Balanced Accuracy : 0.9160          
##                                           
##        'Positive' Class : Yes             
## 
Observation:

The predict() function is used to make predictions for new data using a fitted model. The predictions can be used for a variety of purposes, such as:
Model evaluation: By comparing the predicted values to the actual values for a set of new data, we can evaluate the performance of the model and determine how well it is able to generalize to new data.
Model comparison: By comparing the predicted values from different models, we can determine which model is the best fit for the data.
Forecasting: We can use the predict function to forecast future values of a response variable based on the model that has been fit to historical data.
Understanding model behavior: The predict function can also be used to understand how the model behaves for different input values. This can be useful for understanding the relationship between predictor variables and the response variable.
Decision making: In some cases, the predictions can be used to make decisions; here, for example, the predicted probabilities identify whether a university is private or public. Overall, predict() is a key function in data modeling, as it allows us to test the model on unseen data and evaluate its performance.

Furthermore, ifelse() and as.factor() are applied to the output of predict() to convert probabilities of at least 0.5 to “Yes” and the rest to “No.” This makes the predictions compatible with confusionMatrix().

confusionMatrix() is used to create a confusion matrix, which is a table used to evaluate the performance of a classification model. The matrix is created by comparing the predicted class labels to the actual class labels for a set of test data.
The confusion matrix has four main elements: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).

True positives (TP) are the number of observations that are correctly classified as belonging to the positive class. In our train model, it is 380.
False positives (FP) are the number of observations that are incorrectly classified as belonging to the positive class. In our train model, it is 19.
True negatives (TN) are the number of observations that are correctly classified as belonging to the negative class. In our train model, it is 130.
False negatives (FN) are the number of observations that are incorrectly classified as belonging to the negative class. In our train model, it is 16.

Type 1 Error: False Positives (FP)
Type 2 Error: False Negatives (FN)

The confusion matrix is useful for understanding the performance of a classification model in terms of precision, recall, accuracy, and specificity.
confusionMatrix() can be used with various classification models, such as logistic regression, decision trees, random forests, etc., and also with multiclass classification problems. It is an important step in evaluating the performance of a model and helps in identifying any errors or bias present in the model.

In general, false positives and false negatives have different consequences, and it is important to consider the specific context of the analysis when deciding which misclassification is more damaging. Here the positive class is “private,” so a false negative (labeling a private university as public) may be more damaging if the goal is to identify private institutions, while a false positive (labeling a public university as private) would matter more if the goal were to identify public ones.

By counting the number of observations in each of these cells, we can calculate various metrics such as accuracy, precision, recall, and F1-score; the sketch after the formulas below checks them against the reported output.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
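A minimal sketch, plugging the train-set counts from the confusion matrix above into these formulas; the results should reproduce the caret output (accuracy 0.9358, precision 0.9524, recall 0.9596, F1 0.9560).

#Metrics computed by hand from the train confusion matrix counts
TP = 380; TN = 130; FP = 19; FN = 16
accuracy = (TP + TN)/(TP + TN + FP + FN)
precision = TP/(TP + FP)
recall = TP/(TP + FN)
f1 = 2*precision*recall/(precision + recall)
round(c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1), 4)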

Task 5: Interpreting confusionMatrix()
#Report and interpret metrics for Accuracy, Precision, Recall, and Specificity.

cmt5 = confusionMatrix(predclassmin, traindata$Private, positive = "Yes")
cmt5$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.9595960            0.8724832            0.9523810 
##       Neg Pred Value            Precision               Recall 
##            0.8904110            0.9523810            0.9595960 
##                   F1           Prevalence       Detection Rate 
##            0.9559748            0.7266055            0.6972477 
## Detection Prevalence    Balanced Accuracy 
##            0.7321101            0.9160396
cmt5$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   9.357798e-01   8.373370e-01   9.118140e-01   9.548642e-01   7.266055e-01 
## AccuracyPValue  McnemarPValue 
##   8.161411e-36   7.353167e-01
Observation:

True positives (TP) are the number of observations that are correctly classified as belonging to the positive class. In our train model, it is 380.
False positives (FP) are the number of observations that are incorrectly classified as belonging to the positive class. In our train model, it is 19.
True negatives (TN) are the number of observations that are correctly classified as belonging to the negative class. In our train model, it is 130.
False negatives (FN) are the number of observations that are incorrectly classified as belonging to the negative class. In our train model, it is 16.

Type 1 Error: False Positives (FP)
Type 2 Error: False Negatives (FN)

The accuracy of the model is about 94%. Sensitivity (recall) is almost 96%, meaning about 96 of every 100 actual private universities are predicted correctly. Precision is 95%, meaning 95% of the universities predicted to be private actually are. Specificity is the proportion of actual negatives that are correctly identified; in our model it is 87%, meaning 87% of actual public universities are correctly labeled public, while the remaining 13% are incorrectly labeled private.

Task 6:
#Confusion Matrix for test data#####


Probtest = predict(task3fit1, newdata = testdata, type = "response")
predclassmintest <- as.factor(ifelse(Probtest >= 0.5, "Yes", "No"))

head(Probtest,10) %>%
  kable(caption = "<center>Table 9: Predicated Probabilites of Test Data</center>",
        col.names = "Probabilities for Test data",
        align = "c") %>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "50%") %>%
   footnote(general = "Predict model ran on test data shows first 10 observations",
           general_title = "Probability: ")
Table 9: Predicted Probabilities of Test Data
Probabilities for Test data
Abilene Christian University 0.5936725
Amherst College 0.9999983
Antioch University 0.9999194
Appalachian State University 0.0059170
Aquinas College 0.9941216
Arkansas Tech University 0.0239912
Assumption College 0.9964255
Barat College 0.9913350
Barnard College 0.9999702
Barry University 0.9873203
Probability: Predict model ran on test data shows first 10 observations
cm = confusionMatrix(predclassmintest, testdata$Private, positive = "Yes") 
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No   58   9
##        Yes   5 160
##                                           
##                Accuracy : 0.9397          
##                  95% CI : (0.9008, 0.9666)
##     No Information Rate : 0.7284          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8504          
##                                           
##  Mcnemar's Test P-Value : 0.4227          
##                                           
##             Sensitivity : 0.9467          
##             Specificity : 0.9206          
##          Pos Pred Value : 0.9697          
##          Neg Pred Value : 0.8657          
##              Prevalence : 0.7284          
##          Detection Rate : 0.6897          
##    Detection Prevalence : 0.7112          
##       Balanced Accuracy : 0.9337          
##                                           
##        'Positive' Class : Yes             
## 
#precision <- cm$byClass['Pos Pred Value']    
#recall <- cm$byClass['Sensitivity']

#f_measure <- 2 * ((precision * recall) / (precision + recall))

fval = cm$byClass["F1"]

#note: 1 - F1 is an F1-based error, not the misclassification rate
#1 - accuracy, which here is about 0.0603
error = data.frame("error" = c(1-fval))
error = error$error

round(error,4) %>%
  kable(caption = "<center>Table 10: Error Probability in the model</center>",
    col.names = "Error Probability",
        align = "c") %>%
  kable_styling(bootstrap_options = c("bordered",
                                      "responsive",
                                      "hover"),
                font_size = 11) %>%
 scroll_box(width = "100%", height = "50%") %>%
   footnote(general = "Private ~ F.Undergrad + perc.alumni + Outstate",
           general_title = "Error Probability: ")
Table 10: Error Probability in the model
Error Probability
0.0419
Error Probability: Private ~ F.Undergrad + perc.alumni + Outstate
Observation:

True positives (TP) are the number of observations that are correctly classified as belonging to the positive class. In our test model, it is 160.
False positives (FP) are the number of observations that are incorrectly classified as belonging to the positive class. In our test model, it is 5.
True negatives (TN) are the number of observations that are correctly classified as belonging to the negative class. In our test model, it is 58.
False negatives (FN) are the number of observations that are incorrectly classified as belonging to the negative class. In our test model, it is 9.

Type 1 Error: False Positives (FP)
Type 2 Error: False Negatives (FN)

The accuracy of the model is 94%. Sensitivity (recall) is almost 95%, meaning about 95 of every 100 actual private universities are predicted correctly as private. Precision is 97%, meaning 97% of the universities predicted to be private actually are. Specificity is the proportion of actual negatives that are correctly identified; in our model it is 92%, meaning 92% of actual public universities are correctly labeled public and the remaining 8% are incorrectly labeled private.
The F1-based error of the model on the test dataset is about 4%; the raw misclassification rate (1 − accuracy) is about 6%, i.e., roughly 6 of every 100 universities are mislabeled.
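A brief check of the two error figures, reusing the cm object computed above:

1 - cm$overall["Accuracy"]  #misclassification rate, about 0.0603
1 - cm$byClass["F1"]        #F1-based error reported in Table 10, about 0.0419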

Task 7:
#Receiver Operator Characteristic Curve######

rocdata = roc(testdata$Private, Probtest)

ggroc(rocdata, colour = 'steelblue', size = 2) +
  ggtitle("ROC Chart")+
  xlab(" Specificity: FP Rate")+
  ylab("Sensitivity: TP Rate")+
  theme(plot.title = element_text(hjust = 0.5))

Observation:

The roc() function from the pROC package is used to create a receiver operating characteristic (ROC) curve, which is a graphical representation of the performance of a binary classifier as the discrimination threshold is varied. The ROC curve is created by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various threshold settings. The area under the ROC curve (AUC) is a measure of the overall performance of the classifier, with a value of 1 indicating perfect classification and a value of 0.5 indicating a classifier that performs no better than chance.
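As a hedged extension using the same rocdata object, pROC::coords() can report the probability threshold that maximizes Youden’s J statistic (sensitivity + specificity − 1), an alternative to the fixed 0.5 cutoff used earlier:

#Threshold that maximizes Youden's J on the test-set ROC curve
coords(rocdata, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))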

Task 8:
#Area Under Curve####

auc(rocdata)
## Area under the curve: 0.9714
Observation:

The ROC chart shows the model is well suited to making accurate predictions, as the AUC (area under the curve) is about 0.97.

Conclusion:

In this project, we used glm() to fit a logistic regression on the categorical variable Private, which takes Yes/No values indicating the university category. Running glm() on the dataset guided model selection, and we chose three independent variables to train and test a model that predicts private versus public status from full-time undergraduate enrollment, the percentage of alumni who donate, and out-of-state tuition. The model shows roughly 94% accuracy overall, and its F1-based error is nearly 4%.

References:


Bluman, A. (2014). Elementary statistics: A step by step approach. McGraw-Hill Education.
Gero, E. (2023). ALY6015_Module3_Logistic_Regression [Lecture recording]. University. https://canvas.northeastern.edu/
Kabacoff, R. (2015). R in action. Manning Publications Co.
Zach. (2020). How to calculate AUC (area under curve) in R. Statology. https://www.statology.org/auc-in-r/