1)

\[ \text{Starting with the Bayes formula from slide 23 of lecture 6, the denominator will be the same regardless of class.} \\ f'(x) = \pi_k \frac{\operatorname{e}^{-\frac{1}{2}\Big(\frac{x-\mu_k}{\sigma_k}\Big)^2}}{\sqrt{2\pi\sigma^{2}_k}}\\ \text{Taking the log:}\\ \delta=\ln \Bigg(\frac{\pi_k}{\sqrt{2\pi\sigma^2_k}} \Bigg) - \frac{1}{2}\bigg(\frac{x-\mu_k}{\sigma_k}\bigg) \\ \text{We can see that the first term in this function is constant for every class k, and that} \ \delta \text{(x)is a quadratic function of x.} \]

2)

a)

\[ \text{Note: The top row displays all data. The bottom row displays a random selection of 'No' data that is equal in size to} \\ \text{the 'Yes' data. This is for display purposes only. We use the complete data set to construct every model.} \]

b)

\[ \begin{aligned} \operatorname{E}[\text{balance | default = Yes}] & \approx \text{\$1747.88} \\ \operatorname{E}[\text{balance | default = No}] & \approx \text{\$803.94} \\ \operatorname{\hat{\sigma}^{2}} &= \frac{1}{n-K} \sum_{k=1}^K(n_k-1)\operatorname{\hat\sigma^2_k}\\ &=\frac{n_1-1}{n-K} \operatorname{Var}[\text{balance | default = Yes}] + \frac{n_2-1}{n-K} \operatorname{Var}[\text{balance | default = No}] \\ &=\frac{332}{9998} 116,113.3 + \frac{9666}{9998} 208,349\\ &\approx \text{\$205,286.16}\\ \operatorname{\hat{\pi}_{Y}} &= \frac{333}{10,000} = \text{3.33%} \\ \operatorname{\hat{\pi}_{N}} &= \frac{9,667}{10,000} = \text{96.67%} \end{aligned} \]

c)

d)

\[ \text{The decision boundry shifted right from \$1,276.7 to \$2,007.3 (approximate intercepts) when we incorporated priors. That is,}\\ \text{We require a higher balance of credit card debt before the probability of defaulting exceeds the probability of not defaulting} \\ \text{when we adjust for priors. We prefer the adjusted model because it does a better job accounting for the entire population} \\ \text{as you can see from the top and bottom data rows.} \]

e)

f)

b)

\[ \begin{aligned} \operatorname{E}[\text{balance | default = Yes}] & \approx \text{\$1747.88} \\ \operatorname{E}[\text{balance | default = No}] & \approx \text{\$803.94} \\ \operatorname{\hat{\sigma}^{2}_Y} &= \operatorname{Var}[\text{balance | default = Yes}] \approx \text{\$116,113}\\ \operatorname{\hat{\sigma}^{2}_N} &= \operatorname{Var}[\text{balance | default = Yes}] \approx \text{\$208,349}\\ \operatorname{\hat{\pi}_{Y}} &= \frac{333}{10,000} = \text{3.33%} \\ \operatorname{\hat{\pi}_{N}} &= \frac{9,667}{10,000} = \text{96.67%} \end{aligned} \]

c)

d)

\[ \text{The decision boundry shifted right from \$1,297.9 to \$1,978.1 (approximate intercepts) when we incorporated priors. Our assessment}\\ \text{is effectively the same as when we used pooled variance, albiet the QDA method produces a marginally smaller interval between} \\ \text{the shifts. This suggests that both sets of data have the same population variance.} \]

e)

3)

a)

\[ \text{The data appears to be randomly distributed within donut-shaped bounds when we plot X against Y. When we control for Variable 3 } \\ \text{we see that the V3 = 0 corresponds to (X,Y) coordinates that are closer to the center of the donut than V3 = 1 (X,Y) coordinates. It} \\ \text{appears that V3 = 1 and V3 = 0 do not represent data drawn from the same population that was divided along arbitrary bounds.} \\ \text{We made this assessment because some of the V3 = 1 data fall within the bounds of the V3 = 0 data.} \]

b)

##     pred
## true FALSE TRUE
##    0    35   25
##    1    28   30
##                Estimate Std. Error     z value  Pr(>|z|)
## (Intercept) -0.01418412  0.1980570 -0.07161635 0.9429072
## X           -0.30352113  0.3780411 -0.80287875 0.4220448
## Y           -0.01813178  0.3599832 -0.05036842 0.9598288

\[ \text{Note: our assessment is that the basic logistic model does a poor job of predicting the third variable and the coefficients} \\ \text{are poorly fitted. Furthermore, we will analyze our results graphically. We start by calculating the decision boundry.} \\ \text{The logit is given by}\\ \ln \bigg(\frac{\operatorname{p}}{1-\operatorname{p}}\bigg) = \hat\beta_0 + \hat\beta_1X+\hat\beta_2Y \\ \text{We solve for the decision boundry by setting p = 0 giving us} \\ 0 = \hat\beta_0 + \hat\beta_1X+\hat\beta_2Y \\ Y = -\frac{\hat\beta_0+\hat\beta_1X}{\hat\beta_2} \\ \text{The decision boundry as a function of Y, so} \]

\[ \text{It is very clear that this fails because a linear boundry is unsuitable for these distributions.} \\ \text{We will experiment with higher-degree logits.} \]

c)

##     pred
## true FALSE TRUE
##    0    49   11
##    1    12   46
##               Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)  -0.129438  0.2483533 -0.5211850 6.022379e-01
## poly(X, 2)1  -1.291508  2.8176253 -0.4583676 6.466884e-01
## poly(X, 2)2 -21.032885  3.9887675 -5.2730285 1.341907e-07
## poly(Y, 2)1  -1.304788  2.6960758 -0.4839581 6.284156e-01
## poly(Y, 2)2 -19.250876  3.9385001 -4.8878699 1.019329e-06

\[ \text{We see that our predictive capabilities improved greatly from the confusion matrix. Furthermore, p-values indicate two} \\ \text{signficant coefficients, both relating to the polynomials. We will include the terms anyway so that we do not have to use} \\ \text{complex numbers.* This gives us} \\ \ln \bigg(\frac{\operatorname{p}}{1-\operatorname{p}}\bigg) =\hat\beta_0 + \hat\beta_1X + \hat\beta_2X^2 + \hat\beta_3Y + \hat\beta_4Y^2 = 0 \\ \implies Y= -\frac{\beta_{3}}{2\beta_{4}} \pm \sqrt{\frac{\beta_{3}^{2}}{4}-\frac{\beta_{0}+\beta_{1}X+\beta_{2}X^{2}}{\beta_{4}}}\\ \text{Visually we see a superior decision boundry.}\\ \text{ }\\ \text{*The function} \ 0=\hat\beta_2X^2+\hat\beta_4Y^2 \ \text{does not have a real solution.} \]

4)

a)

##         y
## glm_pred    0    1
##    FALSE 1703  321
##    TRUE    97  879

\[ \text{The p values are all small enough to not reject their intercepts in the model.}\\ \text{"wor.freq.remove","word.freq.meeting", and "char.freq.exclaimation" all have high absolute values for thier coefficients.}\\ \text{Furthermore when we observe their correllation matrices, their correllations between each variable is close to zero so they}\\ \text{are all good individually. } \]

\[ \operatorname{TPR} = \frac{\text{True Positive}}{\text{True Positive + False Negative}} = \frac{879}{879 + 321} = 73.25\% \\ \operatorname{FPR} = \frac{\text{False Positive}}{\text{False Positive + True Negative}} =\frac{97}{97+1703}=5.3\bar{8}\% \]

b)

##    y
##        0    1
##   0 1731  539
##   1   69  661

\[ \operatorname{TPR} = \frac{\text{True Positive}}{\text{True Positive + False Negative}} = \frac{661}{661 + 539} = 55.08\bar{3}\% \\ \operatorname{FPR} = \frac{\text{False Positive}}{\text{False Positive + True Negative}} =\frac{69}{69+1731}=3.8\bar{3}\% \]

c)

##    y
##        0    1
##   0  747   94
##   1 1053 1106

\[ \operatorname{TPR} = \frac{\text{True Positive}}{\text{True Positive + False Negative}} = \frac{1106}{1106 + 94} = 92.1\bar{6}\% \\ \operatorname{FPR} = \frac{\text{False Positive}}{\text{False Positive + True Negative}} =\frac{1053}{1053+747}=58.5\% \]

d)

n = nrow(dataset)
p = ncol(dataset)
p = p-1

X_small <- dataset[,c(1,2,3,5,9)]
# X_small[c(1:300),]
spam = dataset[c(1:300),'y'] == 1
real = dataset[c(1:300),'y'] == 0
sum_spam = colSums(X_small[spam,])
sum_real = colSums(X_small[real,])

pHat_spam <- sum_spam
pHat_spam <- pHat_spam/sum(pHat_spam)
pHat_spam
##           word.freq.remove            word.freq.order 
##                 0.01966351                 0.01430304 
##             word.freq.free               word.freq.re 
##                 0.04602601                 0.04079357 
## capital.run.length.average 
##                 0.87921388
pHat_real <- sum_real
pHat_real <- pHat_real/sum(pHat_real)
pHat_real
##           word.freq.remove            word.freq.order 
##                 0.02281627                 0.01985226 
##             word.freq.free               word.freq.re 
##                 0.04448558                 0.06304275 
## capital.run.length.average 
##                 0.84980314
X = dataset[,c(1:9)]+(1/p)
y = dataset$y
pvec_0 <- colSums(X[y == 0,])
pvec_0 <- pvec_0/sum(pvec_0)
pvec_1 <- colSums(X[y == 1,])
pvec_1 <- pvec_1/sum(pvec_1)

barplot(head(sort(pvec_0, decreasing = TRUE),10), las = 2, cex.names = 0.6)

barplot(head(sort(pvec_1, decreasing = TRUE),10), las = 2, cex.names = 0.6)

priors <- table(y)/length(y)
i <- 1
test_doc <- X[i,]
round(sort(test_doc, decreasing = T)[1:9], 4)
##   capital.run.length.average word.freq.re word.freq.remove word.freq.order
## 1                     1.2531       0.6411           0.1111          0.1111
##   word.freq.free word.freq.meeting word.freq.edu char.freq.semicolon
## 1         0.1111            0.1111        0.1111              0.1111
##   char.freq.exclamation
## 1                0.1111
sum(test_doc * log(pvec_0)) + log(priors[1])
##         0 
## -5.072113
sum(test_doc * log(pvec_1)) + log(priors[2])
##         1 
## -6.622042
y[i]
## [1] 0
yhat <- array(NA, dim = n)
for (i in 1:n){
  test_doc <- X[i,]
  logp0 <- sum(test_doc * log(pvec_0)) + log(priors[1])
  logp1 <- sum(test_doc * log(pvec_1)) + log(priors[2])
  yhat[i] <- 0 + {logp1 > logp0}
}

confusion_matrix <- table(y, yhat)
confusion_matrix
##    yhat
## y      0    1
##   0 1595  205
##   1  416  784
sum(diag(confusion_matrix))/sum(confusion_matrix)
## [1] 0.793

e)

\[ \text{Did not complete} \]