# INFO 180 PS3
# November 20 2024 - Due December 4 2024
# 1 Problem 1 [5pt] Confusion Matrix
# 1.1 Confusion Matrix [1 pt]
# Calculate the Accuracy from the following confusion matrix:
#                   Predicted Positive   Predicted Negative
# Actual Positive          140                   10
# Actual Negative           90                  240
# Show your work (math formula) for partial credit.
# Accuracy = (TP + TN) / (TP + TN + FN + FP)
(140 + 240) / (140 + 240 + 10 + 90)
## [1] 0.7916667
# 1.2 Recall [2pt]
# Calculate the recall from the confusion matrix above. Show your work (math
# formula) for partial credit.
# Recall = TP / (TP + FN)
140 / (140 + 10)
## [1] 0.9333333
# 1.3 Precision [2pt]
# Calculate the precision from the confusion matrix above. Show your work (math formula) for partial credit.
# Precision = TP / (TP + FP)
140 / (140 + 90)
## [1] 0.6086957
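# The three metrics above can be reproduced from the matrix itself. A minimal
# sketch, assuming rows are actual classes and columns are predicted classes;
# the object names (cm, tp, fn, fp, tn) are illustrative, not from the assignment.
cm <- matrix(c(140, 10,
               90, 240),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("positive", "negative"),
                             predicted = c("positive", "negative")))
tp <- cm["positive", "positive"]  # true positives
fn <- cm["positive", "negative"]  # false negatives
fp <- cm["negative", "positive"]  # false positives
tn <- cm["negative", "negative"]  # true negatives

accuracy  <- (tp + tn) / sum(cm)  # share of all cases classified correctly
recall    <- tp / (tp + fn)       # share of actual positives that were found
precision <- tp / (tp + fp)       # share of predicted positives that were right
round(c(accuracy = accuracy, recall = recall, precision = precision), 4)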
# 2 Problem 2 [10pt] Logit, Probit, AIC
# 2.1 Logit - Titanic Data [3 pt]
# Using the code we developed in class, run a logistic regression model for the
# outcome variable Survived from the titanic csv file provided in class materials
# and predictors Pclass, Age, Fare, and Parents.Children.Aboard. Write the
# relationship in the form Y ∼ X1 + X2 + X3 + X4 like we’ve done throughout
# the term. Share that formula here for partial credit.
# All code should be included for this question.
# Ensure that the data is imported correctly and that categorical data is
# correctly converted. Print the summary statistics table (1 pt), then run the
# logit model using glm and print its results (e.g., summary(glm(Y ∼ X1 + X2 +
# X3 + X4, family = "binomial", data = titanic))). Provide the resulting results
# table printout. [2pt]
titanic_data = read.csv("~/Documents/INFO180/Problem Set 3/titanic.csv")
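# TODO: the assignment asks for a fourth predictor (Parents.Children.Aboard) after fare;
# the output below is for the three-predictor model as fitted.
model1 = glm(survived ~ pclass + age + fare, data = titanic_data, family = "binomial")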
summary(model1)
## 
## Call:
## glm(formula = survived ~ pclass + age + fare, family = "binomial", 
##     data = titanic_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.657877   0.381203   6.972 3.12e-12 ***
## pclass      -0.962862   0.112240  -8.579  < 2e-16 ***
## age         -0.036253   0.005500  -6.592 4.34e-11 ***
## fare         0.003741   0.001693   2.210   0.0271 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1413.6  on 1044  degrees of freedom
## Residual deviance: 1250.5  on 1041  degrees of freedom
##   (264 observations deleted due to missingness)
## AIC: 1258.5
## 
## Number of Fisher Scoring iterations: 4
# 2.2 Probit [2pt]
# Repeat the analysis with the same Y ∼ X1 + X2 + X3 + X4 formula but using
# the probit model (hint: same binomial family, but use probit link like in class
# or in slides). Provide the resulting table printout [2pt]
# TODO: add the same fourth predictor here once it is added to the logit model,
# and compare this call against the class notes.
model2 = glm(survived ~ pclass + age + fare, data = titanic_data, family = binomial(link = "probit"))
summary(model2)
## 
## Call:
## glm(formula = survived ~ pclass + age + fare, family = binomial(link = "probit"), 
##     data = titanic_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.6049464  0.2258435   7.106 1.19e-12 ***
## pclass      -0.5846497  0.0663973  -8.805  < 2e-16 ***
## age         -0.0217088  0.0032619  -6.655 2.83e-11 ***
## fare         0.0022485  0.0009828   2.288   0.0221 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1413.6  on 1044  degrees of freedom
## Residual deviance: 1250.9  on 1041  degrees of freedom
##   (264 observations deleted due to missingness)
## AIC: 1258.9
## 
## Number of Fisher Scoring iterations: 4
# 2.3 AIC scores [1pt]
# Compare the AIC scores of the logit and probit models - which model would
# you consider better? Explain your decision.

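# Logit  (model1) AIC: 1258.5
# Probit (model2) AIC: 1258.9
# The logit model has the lower AIC, so it is the preferred model here,
# although a gap of 0.4 is small enough that the two fit almost equally well.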
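# Where the two AIC values come from: for a 0/1 outcome the residual deviance
# equals -2 * log-likelihood, so AIC = residual deviance + 2 * k, with k the
# number of estimated coefficients. A small check using only the numbers printed
# in the two summaries above (variable names here are illustrative):
k <- 4                        # intercept + pclass + age + fare
aic_logit  <- 1250.5 + 2 * k  # residual deviance of the logit fit + penalty
aic_probit <- 1250.9 + 2 * k  # residual deviance of the probit fit + penalty
c(logit = aic_logit, probit = aic_probit)
aic_probit - aic_logit        # positive, so the logit has the lower AIC
# Both models estimate the same number of coefficients, so the AIC gap is just
# the gap in residual deviance.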


# 2.4 Marginal Effects [5pt]
# Run the proper marginal effects function using the logitmfx or probitmfx for the
# better model you identified previously (as per the AIC score analysis above).
# Print the marginal effects results table and interpret which relationships appear
# to be statistically significant and what each of the coefficients represents in terms
# of increase or decrease in the probability of Survival.
library(mfx)
## Loading required package: sandwich
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: MASS
## Loading required package: betareg
logitmfx(survived ~ pclass + age + fare, data = titanic_data)
## Call:
## logitmfx(formula = survived ~ pclass + age + fare, data = titanic_data)
## 
## Marginal Effects:
##              dF/dx   Std. Err.       z     P>|z|    
## pclass -0.23078863  0.02671346 -8.6394 < 2.2e-16 ***
## age    -0.00868960  0.00131499 -6.6081 3.892e-11 ***
## fare    0.00089659  0.00040691  2.2034   0.02757 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
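# Reading the table above: all three effects are significant at the 5% level
# (pclass and age at the 0.1% level). Evaluated at the sample means, which is
# assumed here to be the logitmfx default (atmean = TRUE), a one-unit increase
# in pclass (e.g. 1st to 2nd class) is associated with a survival probability
# about 23 percentage points lower, each extra year of age with about 0.87
# points lower, and each extra unit of fare with about 0.09 points higher.
#
# For a logit, each such effect is the coefficient times one common factor,
# p * (1 - p) at the evaluation point. A rough check using only the printed
# numbers (object names are illustrative):
logit_coef <- c(pclass = -0.962862, age = -0.036253, fare = 0.003741)
marg_eff   <- c(pclass = -0.23078863, age = -0.00868960, fare = 0.00089659)
scale_factor <- marg_eff / logit_coef  # should be nearly identical across predictors
round(scale_factor, 3)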
# 3 Problem 3 [4pt] Classification
# 3.1 Logit, Probit models [2pt]
# In class we have also used logit and probit for classification. Using the best
# model of the two you just analyzed (either the logit or the probit) based on
# your AIC score analysis in Problem 2, let’s run a classification task on Survived
# where you modify the threshold for the predicted class to be a "1" (Survived)
# to 0.6. What is your confusion matrix output and what is
# the computed accuracy of this type of classifier?
# Hint: Use the code from November 20 2021 class and modify it.
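# Classify as survived (1) when the predicted probability exceeds 0.6
model1_eval = data.frame(actual = titanic_data$survived, predicted = ifelse(predict(model1, titanic_data, type = "response") > 0.6, 1, 0))

# Confusion matrix; table() drops rows whose prediction is NA (missing predictors)
model1_cm = table(model1_eval)

# Accuracy = correct predictions / all classified rows
sum(diag(model1_cm))/sum(model1_cm)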
## [1] 0.6889952
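# The thresholding step above can be tried on data that is easy to regenerate.
# A minimal sketch on simulated data (toy_x, toy_y, cutoff and the other names
# are illustrative; the numbers it prints are not Titanic results). Fixing the
# factor levels keeps the table 2 x 2 even if one class is never predicted,
# which the diag() accuracy formula relies on.
set.seed(180)
n <- 300
toy_x <- rnorm(n)
toy_y <- rbinom(n, 1, plogis(0.5 + toy_x))       # simulated 0/1 outcome
toy_fit <- glm(toy_y ~ toy_x, family = "binomial")

cutoff <- 0.6
toy_prob <- predict(toy_fit, type = "response")  # fitted probabilities
toy_pred <- factor(ifelse(toy_prob > cutoff, 1, 0), levels = c(0, 1))
toy_cm <- table(actual = factor(toy_y, levels = c(0, 1)), predicted = toy_pred)
toy_cm                                           # confusion matrix
toy_acc <- sum(diag(toy_cm)) / sum(toy_cm)       # share classified correctly
toy_acc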
# 3.2 Support Vector Machine [2pt]
# Using the same relationship between Survived and the predictors from Problem
# 2 and the first half of Problem 3, Y ∼ X1 + X2 + X3 + X4, calculate the accuracy
# of a linear support vector machine model and the accuracy of a 2nd degree
# polynomial support vector machine. Share the result of the better performing
# model between these two SVM models, as well as the output confusion matrix.
# Provide code for this question that we can run for partial credit.
# 
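library(e1071)
# NOTE: survived is numeric and no kernel is set for svm1, so both calls fit
# eps-regression and svm1 uses the default radial kernel (see the summaries below).
# For the linear classifier the question asks for, use as.factor(survived) and kernel = "linear".
svm1 = svm(survived~pclass + age + fare, data = titanic_data)
svm2 = svm(survived~pclass + age + fare, data = titanic_data, kernel = "polynomial", degree = 2)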
summary(svm1)
## 
## Call:
## svm(formula = survived ~ pclass + age + fare, data = titanic_data)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.3333333 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  703
summary(svm2)
## 
## Call:
## svm(formula = survived ~ pclass + age + fare, data = titanic_data, 
##     kernel = "polynomial", degree = 2)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  2 
##       gamma:  0.3333333 
##      coef.0:  0 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  899
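# For comparison, a sketch of the classification setup the question describes,
# on simulated data rather than the Titanic file (toy, svm_lin, svm_poly and
# svm_accuracy are illustrative names). Passing a factor outcome makes e1071's
# svm() fit a classifier; kernel = "linear" is set explicitly because the
# default kernel is radial. The accuracy computed here is in-sample.
library(e1071)
set.seed(180)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 0, 1, 0))

svm_lin  <- svm(y ~ x1 + x2, data = toy, kernel = "linear")
svm_poly <- svm(y ~ x1 + x2, data = toy, kernel = "polynomial", degree = 2)

# Print the confusion matrix for a fitted model and return its accuracy
svm_accuracy <- function(fit) {
  cm <- table(actual = toy$y, predicted = predict(fit, toy))
  print(cm)
  sum(diag(cm)) / sum(cm)
}
c(linear = svm_accuracy(svm_lin), poly2 = svm_accuracy(svm_poly))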
# 4 Problem 4 [1pt] Boston 311
# The Boston 311 CSV dataset represents service calls related to city utilities, and
# other complaints sent to Boston’s 311 service (a number residents may call).
# This dataset includes rich information about the problem reported, resolution
# time, and location.
# Using a visualization tool of your choice between R ggplot and Tableau,
# plot the relationship between zip code (geography) and count of complaints
# (irrespective of category). For example, a map representation with hue for
# quantity or a barplot might work well here.
# Please include a brief description of the figure as well as a screenshot of the
# visualization.
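# One way the barplot option could look in ggplot2. This is only a sketch: the
# data frame `calls` and its `zip` column are made-up stand-ins, since the
# column names of the Boston 311 CSV are not given here, and the six rows are
# toy values rather than real complaint counts.
library(ggplot2)
calls <- data.frame(zip = c("02118", "02118", "02124", "02125", "02124", "02118"))

# Count complaints per zip code, irrespective of category
counts <- as.data.frame(table(zip = calls$zip), responseName = "complaints")

# Bars ordered from most to fewest complaints
p <- ggplot(counts, aes(x = reorder(zip, -complaints), y = complaints)) +
  geom_col() +
  labs(x = "Zip code", y = "Number of 311 complaints")
print(p)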
# 5 Rules reminder
# Per the course syllabus you can work on this assignment with collaborators, as
# long as the writeup is individually turned in and each of you submits a copy
# of the file including the code write up in one PDF or Word document with the
# names of all collaborators listed. If you use outside sources, please list them.
# If you need to take extra time, we have 3 late days for flexibility you can use
# as you wish through the term. You do not need to ask TAs for extensions or
# permission if you’re just using your late days. Our TAs and I have combined
# about 30 hours of office hours per week - please use these office hours and start
# work early.