\[ \text{Starting with the Bayes formula from slide 23 of lecture 6, the denominator will be the same regardless of class.} \\ f'(x) = \pi_k \frac{\operatorname{e}^{-\frac{1}{2}\Big(\frac{x-\mu_k}{\sigma_k}\Big)^2}}{\sqrt{2\pi\sigma^{2}_k}}\\ \text{Taking the log:}\\ \delta=\ln \Bigg(\frac{\pi_k}{\sqrt{2\pi\sigma^2_k}} \Bigg) - \frac{1}{2}\bigg(\frac{x-\mu_k}{\sigma_k}\bigg) \\ \text{We can see that the first term in this function is constant for every class k, and that} \ \delta \text{(x)is a quadratic function of x.} \]
\[
\text{Note: The top row displays all data. The bottom row displays a random selection of 'No' data that is equal in size to} \\
\text{the 'Yes' data. This is for display purposes only. We use the complete data set to construct every model.}
\]
\[ \begin{aligned} \operatorname{E}[\text{balance | default = Yes}] & \approx \text{\$1747.88} \\ \operatorname{E}[\text{balance | default = No}] & \approx \text{\$803.94} \\ \operatorname{\hat{\sigma}^{2}} &= \frac{1}{n-K} \sum_{k=1}^K(n_k-1)\operatorname{\hat\sigma^2_k}\\ &=\frac{n_1-1}{n-K} \operatorname{Var}[\text{balance | default = Yes}] + \frac{n_2-1}{n-K} \operatorname{Var}[\text{balance | default = No}] \\ &=\frac{332}{9998} 116,113.3 + \frac{9666}{9998} 208,349\\ &\approx \text{\$205,286.16}\\ \operatorname{\hat{\pi}_{Y}} &= \frac{333}{10,000} = \text{3.33%} \\ \operatorname{\hat{\pi}_{N}} &= \frac{9,667}{10,000} = \text{96.67%} \end{aligned} \]
\[ \text{The decision boundry shifted right from \$1,276.7 to \$2,007.3 (approximate intercepts) when we incorporated priors. That is,}\\ \text{We require a higher balance of credit card debt before the probability of defaulting exceeds the probability of not defaulting} \\ \text{when we adjust for priors. We prefer the adjusted model because it does a better job accounting for the entire population} \\ \text{as you can see from the top and bottom data rows.} \]
\[ \begin{aligned} \operatorname{E}[\text{balance | default = Yes}] & \approx \text{\$1747.88} \\ \operatorname{E}[\text{balance | default = No}] & \approx \text{\$803.94} \\ \operatorname{\hat{\sigma}^{2}_Y} &= \operatorname{Var}[\text{balance | default = Yes}] \approx \text{\$116,113}\\ \operatorname{\hat{\sigma}^{2}_N} &= \operatorname{Var}[\text{balance | default = Yes}] \approx \text{\$208,349}\\ \operatorname{\hat{\pi}_{Y}} &= \frac{333}{10,000} = \text{3.33%} \\ \operatorname{\hat{\pi}_{N}} &= \frac{9,667}{10,000} = \text{96.67%} \end{aligned} \]
\[ \text{The decision boundry shifted right from \$1,297.9 to \$1,978.1 (approximate intercepts) when we incorporated priors. Our assessment}\\ \text{is effectively the same as when we used pooled variance, albiet the QDA method produces a marginally smaller interval between} \\ \text{the shifts. This suggests that both sets of data have the same population variance.} \]
\[ \text{The data appears to be randomly distributed within donut-shaped bounds when we plot X against Y. When we control for Variable 3 } \\ \text{we see that the V3 = 0 corresponds to (X,Y) coordinates that are closer to the center of the donut than V3 = 1 (X,Y) coordinates. It} \\ \text{appears that V3 = 1 and V3 = 0 do not represent data drawn from the same population that was divided along arbitrary bounds.} \\ \text{We made this assessment because some of the V3 = 1 data fall within the bounds of the V3 = 0 data.} \]
## pred
## true FALSE TRUE
## 0 35 25
## 1 28 30
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.01418412 0.1980570 -0.07161635 0.9429072
## X -0.30352113 0.3780411 -0.80287875 0.4220448
## Y -0.01813178 0.3599832 -0.05036842 0.9598288
\[ \text{Note: our assessment is that the basic logistic model does a poor job of predicting the third variable and the coefficients} \\ \text{are poorly fitted. Furthermore, we will analyze our results graphically. We start by calculating the decision boundry.} \\ \text{The logit is given by}\\ \ln \bigg(\frac{\operatorname{p}}{1-\operatorname{p}}\bigg) = \hat\beta_0 + \hat\beta_1X+\hat\beta_2Y \\ \text{We solve for the decision boundry by setting p = 0 giving us} \\ 0 = \hat\beta_0 + \hat\beta_1X+\hat\beta_2Y \\ Y = -\frac{\hat\beta_0+\hat\beta_1X}{\hat\beta_2} \\ \text{The decision boundry as a function of Y, so} \]
\[
\text{It is very clear that this fails because a linear boundry is unsuitable for these distributions.} \\
\text{We will experiment with higher-degree logits.}
\]
## pred
## true FALSE TRUE
## 0 49 11
## 1 12 46
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.129438 0.2483533 -0.5211850 6.022379e-01
## poly(X, 2)1 -1.291508 2.8176253 -0.4583676 6.466884e-01
## poly(X, 2)2 -21.032885 3.9887675 -5.2730285 1.341907e-07
## poly(Y, 2)1 -1.304788 2.6960758 -0.4839581 6.284156e-01
## poly(Y, 2)2 -19.250876 3.9385001 -4.8878699 1.019329e-06
\[ \text{We see that our predictive capabilities improved greatly from the confusion matrix. Furthermore, p-values indicate two} \\ \text{signficant coefficients, both relating to the polynomials. We will include the terms anyway so that we do not have to use} \\ \text{complex numbers.* This gives us} \\ \ln \bigg(\frac{\operatorname{p}}{1-\operatorname{p}}\bigg) =\hat\beta_0 + \hat\beta_1X + \hat\beta_2X^2 + \hat\beta_3Y + \hat\beta_4Y^2 = 0 \\ \implies Y= -\frac{\beta_{3}}{2\beta_{4}} \pm \sqrt{\frac{\beta_{3}^{2}}{4}-\frac{\beta_{0}+\beta_{1}X+\beta_{2}X^{2}}{\beta_{4}}}\\ \text{Visually we see a superior decision boundry.}\\ \text{ }\\ \text{*The function} \ 0=\hat\beta_2X^2+\hat\beta_4Y^2 \ \text{does not have a real solution.} \]
## y
## glm_pred 0 1
## FALSE 1703 321
## TRUE 97 879
\[ \text{The p values are all small enough to not reject their intercepts in the model.}\\ \text{"wor.freq.remove","word.freq.meeting", and "char.freq.exclaimation" all have high absolute values for thier coefficients.}\\ \text{Furthermore when we observe their correllation matrices, their correllations between each variable is close to zero so they}\\ \text{are all good individually. } \]
\[ \operatorname{TPR} = \frac{\text{True Positive}}{\text{True Positive + False Negative}} = \frac{879}{879 + 321} = 73.25\% \\ \operatorname{FPR} = \frac{\text{False Positive}}{\text{False Positive + True Negative}} =\frac{97}{97+1703}=5.3\bar{8}\% \]
## y
## 0 1
## 0 1731 539
## 1 69 661
\[ \operatorname{TPR} = \frac{\text{True Positive}}{\text{True Positive + False Negative}} = \frac{661}{661 + 539} = 55.08\bar{3}\% \\ \operatorname{FPR} = \frac{\text{False Positive}}{\text{False Positive + True Negative}} =\frac{69}{69+1731}=3.8\bar{3}\% \]
## y
## 0 1
## 0 747 94
## 1 1053 1106
\[ \operatorname{TPR} = \frac{\text{True Positive}}{\text{True Positive + False Negative}} = \frac{1106}{1106 + 94} = 92.1\bar{6}\% \\ \operatorname{FPR} = \frac{\text{False Positive}}{\text{False Positive + True Negative}} =\frac{1053}{1053+747}=58.5\% \]
n = nrow(dataset)
p = ncol(dataset)
p = p-1
X_small <- dataset[,c(1,2,3,5,9)]
# X_small[c(1:300),]
spam = dataset[c(1:300),'y'] == 1
real = dataset[c(1:300),'y'] == 0
sum_spam = colSums(X_small[spam,])
sum_real = colSums(X_small[real,])
pHat_spam <- sum_spam
pHat_spam <- pHat_spam/sum(pHat_spam)
pHat_spam
## word.freq.remove word.freq.order
## 0.01966351 0.01430304
## word.freq.free word.freq.re
## 0.04602601 0.04079357
## capital.run.length.average
## 0.87921388
pHat_real <- sum_real
pHat_real <- pHat_real/sum(pHat_real)
pHat_real
## word.freq.remove word.freq.order
## 0.02281627 0.01985226
## word.freq.free word.freq.re
## 0.04448558 0.06304275
## capital.run.length.average
## 0.84980314
X = dataset[,c(1:9)]+(1/p)
y = dataset$y
pvec_0 <- colSums(X[y == 0,])
pvec_0 <- pvec_0/sum(pvec_0)
pvec_1 <- colSums(X[y == 1,])
pvec_1 <- pvec_1/sum(pvec_1)
barplot(head(sort(pvec_0, decreasing = TRUE),10), las = 2, cex.names = 0.6)
barplot(head(sort(pvec_1, decreasing = TRUE),10), las = 2, cex.names = 0.6)
priors <- table(y)/length(y)
i <- 1
test_doc <- X[i,]
round(sort(test_doc, decreasing = T)[1:9], 4)
## capital.run.length.average word.freq.re word.freq.remove word.freq.order
## 1 1.2531 0.6411 0.1111 0.1111
## word.freq.free word.freq.meeting word.freq.edu char.freq.semicolon
## 1 0.1111 0.1111 0.1111 0.1111
## char.freq.exclamation
## 1 0.1111
sum(test_doc * log(pvec_0)) + log(priors[1])
## 0
## -5.072113
sum(test_doc * log(pvec_1)) + log(priors[2])
## 1
## -6.622042
y[i]
## [1] 0
yhat <- array(NA, dim = n)
for (i in 1:n){
test_doc <- X[i,]
logp0 <- sum(test_doc * log(pvec_0)) + log(priors[1])
logp1 <- sum(test_doc * log(pvec_1)) + log(priors[2])
yhat[i] <- 0 + {logp1 > logp0}
}
confusion_matrix <- table(y, yhat)
confusion_matrix
## yhat
## y 0 1
## 0 1595 205
## 1 416 784
sum(diag(confusion_matrix))/sum(confusion_matrix)
## [1] 0.793
\[ \text{Did not complete} \]