I think the variable most strongly associated with an email being spam will be the “winner” variable. Many spam emails try to entice recipients by claiming that they have won a prize, and often include a link where the recipient can supposedly “claim their prize” but which in reality leads to an unrelated website.
| Var1 | Freq |
|---|---|
| 0 | 3554 |
| 1 | 367 |
There were 3921 total emails received. Of these emails, 367 were labeled as spam, or 9.36%.
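The percentage can be checked directly from the frequency table. This is a minimal sketch that types the two counts in by hand rather than reading them from the `email` data; the names `counts`, `total`, and `spam_pct` are my own.

```r
# Counts copied from the frequency table (0 = not spam, 1 = spam)
counts <- c(not_spam = 3554, spam = 367)

total <- sum(counts)                        # total number of emails
spam_pct <- 100 * counts["spam"] / total    # percentage labeled as spam

total
round(spam_pct, 2)
```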
Based on this mosaic plot, there does seem to be a relationship between whether an email contains the word “winner” and whether it is spam. A higher proportion of spam emails than non-spam emails contain the word “winner.”
##
## Call:
## glm(formula = spam ~ winner, family = binomial, data = email)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8657 -0.4342 -0.4342 -0.4342 2.1947
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.31405 0.05627 -41.121 < 2e-16 ***
## winneryes 1.52559 0.27549 5.538 3.06e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2437.2 on 3920 degrees of freedom
## Residual deviance: 2412.7 on 3919 degrees of freedom
## AIC: 2416.7
##
## Number of Fisher Scoring iterations: 5
The logistic regression line in log odds form is \(\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -2.314 + 1.526 Winner\).
If the email contains “winner,” we expect the log odds of the email being spam to be 1.526 higher than for an email that does not contain it.
If the email contains “winner,” we expect the odds of the email being spam to be multiplied by \(e^{1.526} \approx 4.598\) relative to an email that does not contain it.
\(\hat{\pi} = \frac{e^{-2.314 + 1.526 Winner}}{1+ e^{-2.314 + 1.526 Winner}}\).
According to this model, if an email contains the word “winner,” we expect the probability of that email being spam to be 0.313.
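These values can be reproduced from the coefficients in the summary output. The sketch below enters the two estimates by hand (it does not refit the model) and uses base R's `plogis()`, which is the inverse logit; the variable names are my own.

```r
# Coefficients copied from the glm summary for spam ~ winner
b0 <- -2.31405   # intercept
b1 <-  1.52559   # winneryes

odds_ratio  <- exp(b1)          # multiplicative change in odds when "winner" is present
p_no_winner <- plogis(b0)       # predicted probability without "winner"
p_winner    <- plogis(b0 + b1)  # predicted probability with "winner"

round(c(odds_ratio = odds_ratio,
        p_no_winner = p_no_winner,
        p_winner = p_winner), 3)
```

Comparing `p_winner` with `p_no_winner` puts the 0.313 in context: the model's predicted spam probability for an email without “winner” is much lower.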
I do not feel comfortable claiming that the shape condition is satisfied. The data does not seem to follow a linear pattern and there are multiple outliers that may be skewing the data.
##Question 12: Write down the code you would need to use to create the empirical logit plot with the explanatory variable as the log of the number of line breaks.
emplogit(x = log(email$line_breaks), y = email$spam, xlab = "Log Line Breaks", ylab = "Empirical Logit", main = "Figure 4")
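To show what an empirical logit plot is built from, here is a rough base R sketch. It is not the `emplogit()` function itself, and it runs on simulated stand-in data rather than the `email` dataset. The ten equal-sized groups and the 0.5 adjustment are assumptions on my part about how such a plot is commonly constructed.

```r
# Simulated stand-in data (NOT the email dataset), only to make the sketch runnable
set.seed(1)
x <- log(rexp(500, rate = 1 / 100) + 1)          # hypothetical log line-break counts
y <- rbinom(500, 1, plogis(0.65 - 0.71 * x))     # hypothetical 0/1 spam labels

# Split x into 10 groups of roughly equal size
groups <- cut(x, breaks = quantile(x, probs = seq(0, 1, by = 0.1)),
              include.lowest = TRUE)

n_group   <- tapply(y, groups, length)   # emails per group
n_spam    <- tapply(y, groups, sum)      # spam emails per group
x_mean    <- tapply(x, groups, mean)     # average x within each group

# Empirical logit with a 0.5 adjustment so groups with 0 or all spam stay finite
emp_logit <- log((n_spam + 0.5) / (n_group - n_spam + 0.5))

plot(x_mean, emp_logit, xlab = "Log Line Breaks", ylab = "Empirical Logit")
```

If the points fall roughly along a straight line, the linearity (shape) condition for logistic regression looks reasonable for that predictor.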
Figure 4 shows a moderate negative linear pattern. Because it is more linear than Figure 3, I would choose log(line_breaks) as the explanatory variable for this model.
##
## Call:
## glm(formula = spam ~ log(line_breaks), family = binomial, data = email)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2431 -0.4743 -0.3174 -0.2183 3.2321
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.64755 0.17197 3.766 0.000166 ***
## log(line_breaks) -0.71319 0.04481 -15.915 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2437.2 on 3920 degrees of freedom
## Residual deviance: 2138.5 on 3919 degrees of freedom
## AIC: 2142.5
##
## Number of Fisher Scoring iterations: 6
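\(\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = 0.64755 - 0.713\log(Line Breaks)\).
Because the explanatory variable is the log of the number of line breaks, the coefficient describes a one-unit increase in log(line_breaks), not one additional line break. We expect that for every one-unit increase in log(line_breaks), that is, each time the number of line breaks is multiplied by \(e \approx 2.718\), the odds of the email being spam are multiplied by \(e^{-0.713} \approx 0.49\).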
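Since the predictor in this model is log(line_breaks), the 0.49 multiplier applies per unit on the log scale. A short sketch, with the slope typed in from the summary output and variable names of my own, shows that multiplier alongside the equivalent multiplier for doubling the number of line breaks, which may be easier to interpret.

```r
# Slope copied from the glm summary for spam ~ log(line_breaks)
b1 <- -0.71319

per_log_unit <- exp(b1)   # odds multiplier when log(line_breaks) rises by 1
per_doubling <- 2^b1      # odds multiplier when line_breaks doubles (same as exp(b1 * log(2)))

round(c(per_log_unit = per_log_unit, per_doubling = per_doubling), 2)
```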
## 1
## 0.02594421
The predicted probability that a message with 400 line breaks is spam is 0.0259.
Based on the probability of 0.0259, as found in question 16, I do not think that a message with 400 line breaks is likely to be classified as spam by our second spam filter. The probability is very low, and it also makes sense logically: spam emails are typically short and simple, while 400 line breaks suggests a long and complex email.
## 1
## 0.3774729
The predicted probability of being spam for a message with 5 line breaks is 0.377.
Our spam filter is more likely to classify an email with 5 line breaks as spam than a message with 400 line breaks: the predicted probabilities are 0.377 and 0.0259, respectively. The log odds tell the same story. The log odds of a 400 line break message is \(\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = 0.64755 - 0.713\log(400) \approx -3.62\). The log odds of a 5 line break message is \(\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = 0.64755 - 0.713\log(5) \approx -0.50\). Probability increases with log odds, so the more negative the log odds, the smaller the probability. The 5 line break message has the larger (less negative) log odds, and therefore the spam filter is more likely to classify it as spam than the 400 line break message.
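The comparison can be checked by hand, taking the natural log of the line-break count as the fitted formula log(line_breaks) requires. This sketch enters the full-precision coefficients from the summary output instead of calling `predict()` on the model, so the log odds may differ in the second decimal from values computed with rounded coefficients; the variable names are my own.

```r
# Coefficients copied from the glm summary for spam ~ log(line_breaks)
b0 <-  0.64755
b1 <- -0.71319

line_breaks <- c(5, 400)
log_odds <- b0 + b1 * log(line_breaks)   # linear predictor on the log odds scale
prob     <- plogis(log_odds)             # convert log odds to probabilities

round(data.frame(line_breaks, log_odds, prob), 4)
```

The probabilities agree with the 0.377 and 0.0259 reported above, and the message with the higher log odds is the one with the higher probability.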
My code is: sample(c("spam", "notspam"), 1, prob = c(0.0259, 0.9741))
## [1] "notspam"
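A single draw will almost always come back "notspam", so it says little on its own. As an illustration, the same `sample()` call can be repeated many times to see how often "spam" comes up. The seed and the number of draws here are arbitrary choices of mine.

```r
set.seed(42)       # arbitrary seed, only so the result is reproducible
n_draws <- 10000   # arbitrary number of simulated classifications

draws <- sample(c("spam", "notspam"), n_draws, replace = TRUE,
                prob = c(0.0259, 0.9741))

spam_share <- mean(draws == "spam")   # fraction of draws labeled "spam"
spam_share
```

The share of "spam" draws should land close to 0.0259, which is what the predicted probability means in the long run.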