1)

\[ \text{Starting with the Bayes formula from slide 23 of lecture 6, the denominator will be the same regardless of class.} \\ f'(x) = \pi_k \frac{\operatorname{e}^{-\frac{1}{2}\Big(\frac{x-\mu_k}{\sigma_k}\Big)^2}}{\sqrt{2\pi\sigma^{2}_k}}\\ \text{Taking the log:}\\ \delta=\ln \Bigg(\frac{\pi_k}{\sqrt{2\pi\sigma^2_k}} \Bigg) - \frac{1}{2}\bigg(\frac{x-\mu_k}{\sigma_k}\bigg) \\ \text{We can see that the first term in this function is constant for every class k, and that} \ \delta \text{(x)is a quadratic function of x.} \]

2)

a)

\[ \text{Note: The top row displays all data. The bottom row displays a random selection of 'No' data that is equal in size to} \\ \text{the 'Yes' data. This is for display purposes only. We use the complete data set to construct every model.} \]

b)

\[ \begin{aligned} \operatorname{E}[\text{balance | default = Yes}] & \approx \text{\$1747.88} \\ \operatorname{E}[\text{balance | default = No}] & \approx \text{\$803.94} \\ \operatorname{\hat{\sigma}^{2}} &= \frac{1}{n-K} \sum_{k=1}^K(n_k-1)\operatorname{\hat\sigma^2_k}\\ &=\frac{n_1-1}{n-K} \operatorname{Var}[\text{balance | default = Yes}] + \frac{n_2-1}{n-K} \operatorname{Var}[\text{balance | default = No}] \\ &=\frac{332}{9998} 116,113.3 + \frac{9666}{9998} 208,349\\ &\approx \text{\$205,286.16}\\ \operatorname{\hat{\pi}_{Y}} &= \frac{333}{10,000} = \text{3.33%} \\ \operatorname{\hat{\pi}_{N}} &= \frac{9,667}{10,000} = \text{96.67%} \end{aligned} \]

c)

d)

\[ \text{The decision boundry shifted right from \$1,276.7 to \$2,007.3 (approximate intercepts) when we incorporated priors. That is,}\\ \text{We require a higher balance of credit card debt before the probability of defaulting exceeds the probability of not defaulting} \\ \text{when we adjust for priors. We prefer the adjusted model because it does a better job accounting for the entire population} \\ \text{as you can see from the top and bottom data rows.} \]

e)

f)

b)

\[ \begin{aligned} \operatorname{E}[\text{balance | default = Yes}] & \approx \text{\$1747.88} \\ \operatorname{E}[\text{balance | default = No}] & \approx \text{\$803.94} \\ \operatorname{\hat{\sigma}^{2}_Y} &= \operatorname{Var}[\text{balance | default = Yes}] \approx \text{\$116,113}\\ \operatorname{\hat{\sigma}^{2}_N} &= \operatorname{Var}[\text{balance | default = Yes}] \approx \text{\$208,349}\\ \operatorname{\hat{\pi}_{Y}} &= \frac{333}{10,000} = \text{3.33%} \\ \operatorname{\hat{\pi}_{N}} &= \frac{9,667}{10,000} = \text{96.67%} \end{aligned} \]

c)

d)

\[ \text{The decision boundry shifted right from \$1,297.9 to \$1,978.1 (approximate intercepts) when we incorporated priors. Our assessment}\\ \text{is effectively the same as when we used pooled variance, albiet the QDA method produces a marginally smaller interval between} \\ \text{the shifts. This suggests that both sets of data have the same population variance.} \]

e)

3)

a)

\[ \text{The data appears to be randomly distributed within donut-shaped bounds when we plot X against Y. When we control for Variable 3 } \\ \text{we see that the V3 = 0 corresponds to (X,Y) coordinates that are closer to the center of the donut than V3 = 1 (X,Y) coordinates. It} \\ \text{appears that V3 = 1 and V3 = 0 do not represent data drawn from the same population that was divided along arbitrary bounds.} \\ \text{We made this assessment because some of the V3 = 1 data fall within the bounds of the V3 = 0 data.} \]

b)

##     pred
## true FALSE TRUE
##    0    35   25
##    1    28   30

##                Estimate Std. Error     z value  Pr(>|z|)
## (Intercept) -0.01418412  0.1980570 -0.07161635 0.9429072
## X           -0.30352113  0.3780411 -0.80287875 0.4220448
## Y           -0.01813178  0.3599832 -0.05036842 0.9598288

\[ \text{Note: our assessment is that the basic logistic model does a poor job of predicting the third variable and the coefficients} \\ \text{are poorly fitted. Furthermore, we will analyze our results graphically. We start by calculating the decision boundry.} \\ \text{The logit is given by}\\ \ln \bigg(\frac{\operatorname{p}}{1-\operatorname{p}}\bigg) = \hat\beta_0 + \hat\beta_1X+\hat\beta_2Y \\ \text{We solve for the decision boundry by setting p = 0 giving us} \\ 0 = \hat\beta_0 + \hat\beta_1X+\hat\beta_2Y \\ Y = -\frac{\hat\beta_0+\hat\beta_1X}{\hat\beta_2} \\ \text{The decision boundry as a function of Y, so} \]

\[ \text{It is very clear that this fails because a linear boundry is unsuitable for these distributions.} \\ \text{We will experiment with higher-degree logits.} \]

c)

##     pred
## true FALSE TRUE
##    0    49   11
##    1    12   46

##               Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)  -0.129438  0.2483533 -0.5211850 6.022379e-01
## poly(X, 2)1  -1.291508  2.8176253 -0.4583676 6.466884e-01
## poly(X, 2)2 -21.032885  3.9887675 -5.2730285 1.341907e-07
## poly(Y, 2)1  -1.304788  2.6960758 -0.4839581 6.284156e-01
## poly(Y, 2)2 -19.250876  3.9385001 -4.8878699 1.019329e-06

\[ \text{We see that our predictive capabilities improved greatly from the confusion matrix. Furthermore, p-values indicate two} \\ \text{signficant coefficients, both relating to the polynomials. We will include the terms anyway so that we do not have to use} \\ \text{complex numbers.* This gives us} \\ \ln \bigg(\frac{\operatorname{p}}{1-\operatorname{p}}\bigg) =\hat\beta_0 + \hat\beta_1X + \hat\beta_2X^2 + \hat\beta_3Y + \hat\beta_4Y^2 = 0 \\ \implies Y= -\frac{\beta_{3}}{2\beta_{4}} \pm \sqrt{\frac{\beta_{3}^{2}}{4}-\frac{\beta_{0}+\beta_{1}X+\beta_{2}X^{2}}{\beta_{4}}}\\ \text{Visually we see a superior decision boundry.}\\ \text{ }\\ \text{*The function} \ 0=\hat\beta_2X^2+\hat\beta_4Y^2 \ \text{does not have a real solution.} \]

4)

a)

##         y
## glm_pred    0    1
##    FALSE 1703  321
##    TRUE    97  879

\[ \text{The p values are all small enough to not reject their intercepts in the model.}\\ \text{"wor.freq.remove","word.freq.meeting", and "char.freq.exclaimation" all have high absolute values for thier coefficients.}\\ \text{Furthermore when we observe their correllation matrices, their correllations between each variable is close to zero so they}\\ \text{are all good individually. } \]

\[ \operatorname{TPR} = \frac{\text{True Positive}}{\text{True Positive + False Negative}} = \frac{879}{879 + 321} = 73.25\% \\ \operatorname{FPR} = \frac{\text{False Positive}}{\text{False Positive + True Negative}} =\frac{97}{97+1703}=5.3\bar{8}\% \]

b)

##    y
##        0    1
##   0 1731  539
##   1   69  661

\[ \operatorname{TPR} = \frac{\text{True Positive}}{\text{True Positive + False Negative}} = \frac{661}{661 + 539} = 55.08\bar{3}\% \\ \operatorname{FPR} = \frac{\text{False Positive}}{\text{False Positive + True Negative}} =\frac{69}{69+1731}=3.8\bar{3}\% \]

c)

##    y
##        0    1
##   0  747   94
##   1 1053 1106

\[ \operatorname{TPR} = \frac{\text{True Positive}}{\text{True Positive + False Negative}} = \frac{1106}{1106 + 94} = 92.1\bar{6}\% \\ \operatorname{FPR} = \frac{\text{False Positive}}{\text{False Positive + True Negative}} =\frac{1053}{1053+747}=58.5\% \]

d)

n = nrow(dataset)
p = ncol(dataset)
p = p-1

X_small <- dataset[,c(1,2,3,5,9)]
# X_small[c(1:300),]
spam = dataset[c(1:300),'y'] == 1
real = dataset[c(1:300),'y'] == 0
sum_spam = colSums(X_small[spam,])
sum_real = colSums(X_small[real,])

pHat_spam <- sum_spam
pHat_spam <- pHat_spam/sum(pHat_spam)
pHat_spam

##           word.freq.remove            word.freq.order 
##                 0.01966351                 0.01430304 
##             word.freq.free               word.freq.re 
##                 0.04602601                 0.04079357 
## capital.run.length.average 
##                 0.87921388

pHat_real <- sum_real
pHat_real <- pHat_real/sum(pHat_real)
pHat_real

##           word.freq.remove            word.freq.order 
##                 0.02281627                 0.01985226 
##             word.freq.free               word.freq.re 
##                 0.04448558                 0.06304275 
## capital.run.length.average 
##                 0.84980314

X = dataset[,c(1:9)]+(1/p)
y = dataset$y
pvec_0 <- colSums(X[y == 0,])
pvec_0 <- pvec_0/sum(pvec_0)
pvec_1 <- colSums(X[y == 1,])
pvec_1 <- pvec_1/sum(pvec_1)

barplot(head(sort(pvec_0, decreasing = TRUE),10), las = 2, cex.names = 0.6)

barplot(head(sort(pvec_1, decreasing = TRUE),10), las = 2, cex.names = 0.6)

priors <- table(y)/length(y)
i <- 1
test_doc <- X[i,]
round(sort(test_doc, decreasing = T)[1:9], 4)

##   capital.run.length.average word.freq.re word.freq.remove word.freq.order
## 1                     1.2531       0.6411           0.1111          0.1111
##   word.freq.free word.freq.meeting word.freq.edu char.freq.semicolon
## 1         0.1111            0.1111        0.1111              0.1111
##   char.freq.exclamation
## 1                0.1111

sum(test_doc * log(pvec_0)) + log(priors[1])

##         0 
## -5.072113

sum(test_doc * log(pvec_1)) + log(priors[2])

##         1 
## -6.622042

y[i]

## [1] 0

yhat <- array(NA, dim = n)
for (i in 1:n){
  test_doc <- X[i,]
  logp0 <- sum(test_doc * log(pvec_0)) + log(priors[1])
  logp1 <- sum(test_doc * log(pvec_1)) + log(priors[2])
  yhat[i] <- 0 + {logp1 > logp0}
}

confusion_matrix <- table(y, yhat)
confusion_matrix

##    yhat
## y      0    1
##   0 1595  205
##   1  416  784

sum(diag(confusion_matrix))/sum(confusion_matrix)

## [1] 0.793

e)

\[ \text{Did not complete} \]

HW2

William Chance, Shray Mittal, Benjamin Leu

October 10, 2019

1)

2)

a)

b)

c)

d)

e)

f)

b)

c)

d)

e)

3)

a)

b)

c)

4)

a)

b)

c)

d)

e)