# INFO 180 PS3
# November 20 2024 - Due December 4 2024
# 1 Problem 1 [5pt] Confusion Matrix
# 1.1 Confusion Matrix [1 pt]
# Calculate the Accuracy from the following confusion matrix:
#                    Predicted Positive    Predicted Negative
# Actual Positive           140                    10
# Actual Negative            90                   240
# Show your work (math formula) for partial credit.
(140+240)/(140+240+10+90)
## [1] 0.7916667
# 1.2 Recall [2pt]
# Calculate the recall from the confusion matrix above. Show your work (math
# formula) for partial credit.
140/(140 +10)
## [1] 0.9333333
# 1.3 Precision [2pt]
# Calculate the precision from the confusion matrix above. Show your work (math formula) for partial credit.
140/(140 +90)
## [1] 0.6086957
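# A minimal sketch (not part of the assignment text) that recomputes the three
# metrics above from the confusion matrix in one place. The object name `cm`
# and the TP/FN/FP/TN variables are illustrative assumptions; the result should
# agree with the values printed above.
cm = matrix(c(140, 10,
              90, 240),
            nrow = 2, byrow = TRUE,
            dimnames = list(actual = c("positive", "negative"),
                            predicted = c("positive", "negative")))
TP = cm["positive", "positive"]  # actual positive, predicted positive
FN = cm["positive", "negative"]  # actual positive, predicted negative
FP = cm["negative", "positive"]  # actual negative, predicted positive
TN = cm["negative", "negative"]  # actual negative, predicted negative
accuracy  = (TP + TN) / sum(cm)  # share of all cases classified correctly
recall    = TP / (TP + FN)       # share of actual positives that were found
precision = TP / (TP + FP)       # share of predicted positives that were right
round(c(accuracy = accuracy, recall = recall, precision = precision), 4)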
# 2 Problem 2 [10pt] Logit, Probit, AIC
# 2.1 Logit - Titanic Data [3 pt]
# Using the code we developed in class, run a logistic regression model for the
# outcome variable Survived from the titanic csv file provided in class materials
# and predictors Pclass, Age, Fare, and Parents.Children.Aboard. Write the
# relationship in the form Y ∼ X1 + X2 + X3 + X4 like we’ve done throughout
# the term. Share that formula here for partial credit.
# All code should be included for this question.
# Ensure that the data is imported correctly and that categorical data is
# correctly converted. Print the summary statistics table (1 pt), then run the
# logit model using glm and print its results (e.g.,
# summary(glm(Y ∼ X1 + X2 + X3 + X4, family = "binomial", data = titanic))).
# Provide the resulting table printout. [2pt]
titanic_data = read.csv("~/Documents/INFO180/Problem Set 3/titanic.csv")
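# TODO: the fourth predictor requested above (Parents.Children.Aboard) is not
# yet in this formula, so the output below is for three predictors only.
model1 = glm(survived ~ pclass + age + fare, data = titanic_data, family = "binomial")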
summary(model1)
##
## Call:
## glm(formula = survived ~ pclass + age + fare, family = "binomial",
## data = titanic_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.657877 0.381203 6.972 3.12e-12 ***
## pclass -0.962862 0.112240 -8.579 < 2e-16 ***
## age -0.036253 0.005500 -6.592 4.34e-11 ***
## fare 0.003741 0.001693 2.210 0.0271 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1413.6 on 1044 degrees of freedom
## Residual deviance: 1250.5 on 1041 degrees of freedom
## (264 observations deleted due to missingness)
## AIC: 1258.5
##
## Number of Fisher Scoring iterations: 4
# 2.2 Probit [2pt]
# Repeat the analysis with the same Y ∼ X1 + X2 + X3 + X4 formula but using
# the probit model (hint: same binomial family, but use probit link like in class
# or in slides). Provide the resulting table printout [2pt]
# TODO: add the fourth predictor after fare, and compare this code with the class notes.
model2 = glm(survived ~ pclass + age + fare, data = titanic_data, family = binomial(link = "probit"))
summary(model2)
##
## Call:
## glm(formula = survived ~ pclass + age + fare, family = binomial(link = "probit"),
## data = titanic_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.6049464 0.2258435 7.106 1.19e-12 ***
## pclass -0.5846497 0.0663973 -8.805 < 2e-16 ***
## age -0.0217088 0.0032619 -6.655 2.83e-11 ***
## fare 0.0022485 0.0009828 2.288 0.0221 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1413.6 on 1044 degrees of freedom
## Residual deviance: 1250.9 on 1041 degrees of freedom
## (264 observations deleted due to missingness)
## AIC: 1258.9
##
## Number of Fisher Scoring iterations: 4
# 2.3 AIC scores [1pt]
# Compare the AIC scores of the logit and probit models - which model would
# you consider better? Explain your decision.
# Model 1 (logit) AIC: 1258.5
# Model 2 (probit) AIC: 1258.9
# Model 1 (logit) has the lower AIC, so it is the preferred model, although the
# difference of 0.4 is small.
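# A short check of where the two AIC values come from, assuming (as holds for
# ungrouped 0/1 outcomes) that the residual deviance equals -2 * log-likelihood,
# so that AIC = residual deviance + 2k for k estimated coefficients. The
# variable names are illustrative; the deviances are copied from the summaries
# above.
k = 4                            # intercept + pclass + age + fare
aic_logit  = 1250.5 + 2 * k      # residual deviance of the logit model + penalty
aic_probit = 1250.9 + 2 * k      # residual deviance of the probit model + penalty
c(logit = aic_logit, probit = aic_probit)
aic_probit - aic_logit           # both models share k, so this is the deviance gap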
# 2.4 Marginal Effects [5pt]
# Run the proper marginal effects function using the logitmfx or probitmfx for the
# better model you identified previously (as per the AIC score analysis above).
# Print the marginal effects results table and interpret which relationships appear
# to be statistically significant and what each of the coefficients represents in terms
# of increase or decrease in the probability of Survival.
library(mfx)
## Loading required package: sandwich
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: MASS
## Loading required package: betareg
logitmfx(survived ~ pclass + age + fare, data = titanic_data)
## Call:
## logitmfx(formula = survived ~ pclass + age + fare, data = titanic_data)
##
## Marginal Effects:
## dF/dx Std. Err. z P>|z|
## pclass -0.23078863 0.02671346 -8.6394 < 2.2e-16 ***
## age -0.00868960 0.00131499 -6.6081 3.892e-11 ***
## fare 0.00089659 0.00040691 2.2034 0.02757 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
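# A hedged reading of the table above, as a sketch: by default logitmfx()
# evaluates marginal effects at the sample means of the predictors, so each
# dF/dx is the approximate change in the survival probability for a one-unit
# increase in that predictor with the others held at their means. The vector
# name `mfx_est` is illustrative; the values are copied from the printout.
mfx_est = c(pclass = -0.23078863, age = -0.00868960, fare = 0.00089659)
mfx_pp = round(100 * mfx_est, 2)  # same effects expressed in percentage points
mfx_pp
# pclass: about 23.1 points lower survival probability per one-class increase
# age:    about 0.87 points lower per additional year of age
# fare:   about 0.09 points higher per additional unit of fare
# All three are significant at the 5% level; pclass and age also at 0.1%.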
# 3 Problem 3 [4pt] Classification
# 3.1 Logit, Probit models [2pt]
# In class we have also used logit and probit for classification. Using the best
# model of the two you just analyzed (either the logit or the probit) based on
# your AIC score analysis in Problem 2, let’s run a classification task on
# Survived where you modify the threshold for the predicted class to be a "1"
# (Survived) to 0.6. What is your confusion matrix output and what is
# the computed accuracy of this type of classifier?
# Hint: Use the code from November 20 2021 class and modify it.
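# Predicted class is 1 when the fitted survival probability exceeds 0.6.
model1_prob = predict(model1, titanic_data, type = "response")
model1_eval = data.frame(actual = titanic_data$survived,
                         predicted = ifelse(model1_prob > 0.6, 1, 0))
model1_cm = table(model1_eval)  # rows with missing predictions are dropped
sum(diag(model1_cm)) / sum(model1_cm)  # accuracy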
## [1] 0.6889952
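# The confusion matrix itself is not printed above. A minimal sketch of one way
# to build and print it, on toy data: the vectors `actual_toy` and `prob_toy`
# are made up for illustration and are not Titanic results. Fixing the factor
# levels keeps the table 2 x 2 even if one class is never predicted.
actual_toy = c(1, 0, 1, 1, 0, 0, 1, 0)
prob_toy   = c(0.82, 0.35, 0.55, 0.91, 0.64, 0.10, 0.70, 0.48)
pred_toy   = ifelse(prob_toy > 0.6, 1, 0)  # same 0.6 threshold as above
cm_toy = table(actual    = factor(actual_toy, levels = c(0, 1)),
               predicted = factor(pred_toy,   levels = c(0, 1)))
cm_toy                                     # print the confusion matrix
acc_toy = sum(diag(cm_toy)) / sum(cm_toy)  # accuracy on the toy data
acc_toy
# For the model above, printing model1_cm would show the corresponding matrix.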
# 3.2 Support Vector Machine [2pt]
# Using the same relationship between Survived and the predictors from Problem
# 2 and the first half of Problem 3, Y ∼ X1+X2+X3+X4, calculate the accuracy
# of a linear support vector machine model and the accuracy of a 2nd degree
# polynomial support vector machine. Share the result of the better performing
# model between these two SVM models, as well as the output confusion matrix.
# Provide code for this question that we can run for partial credit.
#
library(e1071)
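# NOTE: svm1 uses the default radial kernel (not linear), and because survived
# is numeric, svm() fits a regression (eps-regression) rather than a classifier.
svm1 = svm(survived ~ pclass + age + fare, data = titanic_data)
svm2 = svm(survived ~ pclass + age + fare, data = titanic_data, kernel = "polynomial", degree = 2)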
summary(svm1)
##
## Call:
## svm(formula = survived ~ pclass + age + fare, data = titanic_data)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.3333333
## epsilon: 0.1
##
##
## Number of Support Vectors: 703
summary(svm2)
##
## Call:
## svm(formula = survived ~ pclass + age + fare, data = titanic_data,
## kernel = "polynomial", degree = 2)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: polynomial
## cost: 1
## degree: 2
## gamma: 0.3333333
## coef.0: 0
## epsilon: 0.1
##
##
## Number of Support Vectors: 899
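# A minimal sketch of one way to compute classification accuracy for a linear
# and a degree-2 polynomial SVM, assuming the outcome is converted to a factor
# so that svm() performs classification. The helper name `svm_accuracy` and the
# toy data frame are hypothetical; the numbers it prints are not Titanic results.
library(e1071)
svm_accuracy = function(formula, data, kernel, degree = 3) {
  data = data[complete.cases(data[, all.vars(formula)]), ]  # drop rows with NAs
  response = all.vars(formula)[1]
  data[[response]] = factor(data[[response]])               # factor outcome
  fit = svm(formula, data = data, kernel = kernel, degree = degree)
  cm = table(actual = data[[response]], predicted = predict(fit, data))
  list(confusion = cm, accuracy = sum(diag(cm)) / sum(cm))
}
set.seed(1)
toy = data.frame(x1 = rnorm(60), x2 = rnorm(60))
toy$y = ifelse(toy$x1 + toy$x2 > 0, 1, 0)
res_linear = svm_accuracy(y ~ x1 + x2, toy, kernel = "linear")
res_poly2  = svm_accuracy(y ~ x1 + x2, toy, kernel = "polynomial", degree = 2)
c(linear = res_linear$accuracy, poly2 = res_poly2$accuracy)
# On the assignment data the analogous call would be, for example:
# svm_accuracy(survived ~ pclass + age + fare, titanic_data, kernel = "linear")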
# 4 Problem 4 [1pt] Boston 311
# The Boston 311 CSV dataset represents service calls related to city utilities, and
# other complaints sent to Boston’s 311 service (a number residents may call).
# This dataset includes rich information about the problem reported, resolution
# time, and location.
# Using a visualization tool of your choice between R ggplot and Tableau,
# plot the relationship between zip code (geography) and count of complaints
# (irrespective of category). For example, a map representation with hue for
# quantity or a barplot might work well here.
# Please include a brief description of the figure as well as a screenshot of the
# visualization.
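# A minimal ggplot2 sketch of the bar-plot option, on a made-up data frame.
# The object `calls_toy` and the column name `zip` are hypothetical; the real
# Boston 311 CSV may name its zip code column differently.
library(ggplot2)
calls_toy = data.frame(zip = c("02118", "02118", "02124", "02127", "02124", "02118"))
# Count complaints per zip code, irrespective of category.
zip_counts = as.data.frame(table(zip = calls_toy$zip), responseName = "complaints")
ggplot(zip_counts, aes(x = reorder(zip, -complaints), y = complaints)) +
  geom_col() +  # one bar per zip code, tallest first
  labs(x = "Zip code", y = "Number of complaints",
       title = "311 complaints by zip code (toy data)")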
# 5 Rules reminder
# Per the course syllabus you can work on this assignment with collaborators, as
# long as the writeup is individually turned in and each of you submits a copy
# of the file including the code write up in one PDF or word document with the
# names of all collaborators listed. If you use outside sources, please list them.
# If you need to take extra time, we have 3 late days for flexibility you can use
# as you wish through the term. You do not need to ask TAs for extensions or
# permission if you’re just using your late days. Our TAs and I have combined
# about 30 hours of office hours per week - please use these office hours and start
# work early.