STA 112 Lab 5

Question 1: Which variable do you think might be most highly associated with an email being spam? Why?

I think that the variable most highly associated with an email being spam is the “winner” variable. Many spam emails try to entice recipients by claiming that they have won a prize, and often include a link to “claim their prize” that in reality leads to an unrelated website. Therefore, I expect the “winner” variable to be highly associated with spam emails.

Var1	Freq
0	3554
1	367

Question 2: How many emails are spam? What percent of the emails are spam?

There were 3921 total emails received. Of these emails, 367 (9.36%) were labeled as spam.
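As a check on the arithmetic, here is a minimal sketch that recomputes the percentage from the frequency table. The counts are copied from the table above; the variable names are my own and are not part of the lab code.

```r
# Counts copied from the frequency table above (0 = not spam, 1 = spam)
spam_counts <- c(not_spam = 3554, spam = 367)

total_emails <- sum(spam_counts)                          # 3921
pct_spam <- unname(100 * spam_counts["spam"] / total_emails)

round(pct_spam, 2)                                        # 9.36
```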

Question 3: Create an appropriate graph of Y = spam. Label your graph Figure 1.

Question 4: Make a plot to explore the relationship between Y = whether or not an email is spam and X = whether or not the email contains the word “winner”.

Question 5: Based on this plot, does there seem to be a relationship between whether or not an email contains the word “winner” and whether or not it is spam?

Based on this mosaic plot, there does seem to be a relationship between whether or not an email contains the word “winner” and whether or not it is spam. A higher proportion of emails containing the word “winner” are spam than of emails that do not contain it.

## 
## Call:
## glm(formula = spam ~ winner, family = binomial, data = email)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8657  -0.4342  -0.4342  -0.4342   2.1947  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.31405    0.05627 -41.121  < 2e-16 ***
## winneryes    1.52559    0.27549   5.538 3.06e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2437.2  on 3920  degrees of freedom
## Residual deviance: 2412.7  on 3919  degrees of freedom
## AIC: 2416.7
## 
## Number of Fisher Scoring iterations: 5

Question 6: Write down the logistic regression line in log odds form.

The logistic regression line in log odds form is \(log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -2.314 + 1.526 Winner\).

Question 7: Interpret the slope in log odds form.

If the email contains “winner,” we expect the log odds of the email being spam to be 1.526 higher than for an email that does not contain “winner.”

Question 8: Interpret the slope in odds form.

If the email contains “winner,” we expect that the odds of the email being spam are multiplied by 4.598.
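The multiplier comes from exponentiating the slope. A minimal sketch, with the slope copied from the glm summary above and variable names of my own choosing:

```r
# winneryes estimate copied from the glm summary above
slope_winner <- 1.52559

# Exponentiating a log odds slope gives the odds ratio
odds_ratio <- exp(slope_winner)
round(odds_ratio, 3)   # 4.598
```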

Question 9: Write down the fitted model in probability form.

\(\hat{\pi} = \frac{e^{-2.314 + 1.526 Winner}}{1+ e^{-2.314 + 1.526 Winner}}\).

Question 10: According to your model, what is the probability that an email that contains the word “winner” is spam?

According to this model, if an email contains the word “winner,” we expect the probability of that email being spam to be 0.313.
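This probability can be reproduced by plugging Winner = 1 into the fitted model. The sketch below assumes the rounded coefficients from the equation in Question 9; `plogis()` is base R's inverse logit, and the variable names are hypothetical.

```r
# Rounded coefficients from the fitted model in Question 9
b0 <- -2.314
b1 <- 1.526

# Log odds for an email that contains "winner" (Winner = 1)
log_odds_winner <- b0 + b1 * 1

# plogis(x) computes exp(x) / (1 + exp(x))
prob_winner <- plogis(log_odds_winner)
round(prob_winner, 3)   # about 0.313
```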

Question 11: Create the plot using the code above. Do you feel comfortable claiming that the shape condition is satisfied? Why or why not?

I do not feel comfortable claiming that the shape condition is satisfied. The data does not seem to follow a linear pattern and there are multiple outliers that may be skewing the data.

Question 12: Write down the code you would need to use to create the empirical logit plot with the explanatory variable as the log of the number of line breaks.

emplogit(x = log(email$line_breaks), y = email$spam, xlab = "Log Line Breaks", ylab = "Empirical Logit", main = "Figure 4")

Question 13: Run your code and show the plot. Which of the empirical logit plots (using line_breaks or using log(line_breaks)) is more linear? Based on the plots, choose one of the two explanatory variables to use in the model.

Figure 4 shows a moderate negative linear pattern and is more linear than Figure 3, so I would choose log(line_breaks) for this model.

Question 14: Fit your chosen model and call it spamModel2. Write down the logistic regression line in log odds form.

## 
## Call:
## glm(formula = spam ~ log(line_breaks), family = binomial, data = email)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2431  -0.4743  -0.3174  -0.2183   3.2321  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.64755    0.17197   3.766 0.000166 ***
## log(line_breaks) -0.71319    0.04481 -15.915  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2437.2  on 3920  degrees of freedom
## Residual deviance: 2138.5  on 3919  degrees of freedom
## AIC: 2142.5
## 
## Number of Fisher Scoring iterations: 6

\(log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = 0.64755 - 0.71319 log(Line Breaks)\).

Question 15: Interpret the slope in odds form.

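Because the explanatory variable is log(line_breaks), the slope describes a one-unit increase on the log scale. For every one-unit increase in the log of the number of line breaks (that is, when the number of line breaks is multiplied by about 2.718), we expect the odds of the email being spam to be multiplied by 0.49.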
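Since the model's explanatory variable is log(line_breaks), the 0.49 multiplier belongs to a one-unit change on the log scale. As a rough sketch (slope copied from the spamModel2 summary above, variable names my own), the same slope can also be restated for a doubling of the number of line breaks:

```r
# log(line_breaks) estimate copied from the spamModel2 summary above
slope_log_lb <- -0.71319

# Odds multiplier for a one-unit increase in log(line_breaks)
per_log_unit <- exp(slope_log_lb)
round(per_log_unit, 2)    # 0.49

# Odds multiplier when the number of line breaks doubles:
# exp(slope * log(2)) is the same as 2^slope
per_doubling <- 2^slope_log_lb
round(per_doubling, 2)    # 0.61
```

Under this model, doubling the number of line breaks multiplies the odds of spam by roughly 0.61.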

Question 16: Run the function above, and interpret the result.

##          1 
## 0.02594421

A message with 400 line breaks has a predicted probability of 0.0259 of being a spam email.

Question 17: Based on the prediction we have obtained so far, do you think a message with 400 line breaks is likely to be classified as spam by our second spam filter? Explain.

Based on the probability of 0.0259 found in Question 16, I do not think that a message with 400 line breaks is likely to be classified as spam by our second spam filter. The probability is very low, and it also makes sense logically: spam emails are typically short and simple, while 400 line breaks suggests a long and complex email.

Question 18: What is the predicted probability of being spam for a message with 5 line breaks?

##         1 
## 0.3774729

The predicted probability of being spam for a message with 5 line breaks is 0.377.
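The two predicted probabilities can also be reproduced by hand from the coefficients. This is only a sketch: `predict_spam_prob` is a hypothetical helper, not the function used in the lab, and the default coefficients are copied from the spamModel2 summary above.

```r
# Hypothetical helper; coefficients copied from the spamModel2 summary
predict_spam_prob <- function(line_breaks, b0 = 0.64755, b1 = -0.71319) {
  # The model is linear in log(line_breaks); plogis() is the inverse logit
  plogis(b0 + b1 * log(line_breaks))
}

round(predict_spam_prob(400), 4)   # 0.0259
round(predict_spam_prob(5), 3)     # 0.377
```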

Question 19: Is our spam filter more or less likely to classify a message with 5 line breaks as spam than a message with 400 line breaks? Answer using both the probability and the log odds as evidence.

Our spam filter is more likely to classify a message with 5 line breaks as spam than a message with 400 line breaks. The predicted probability of spam for a message with 5 line breaks, 0.377, is higher than the predicted probability for a message with 400 line breaks, 0.0259. The log odds give the same answer. Because the model uses the natural log of the number of line breaks, the log odds for a message with 400 line breaks are \(log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = 0.64755 - 0.71319 log(400) = -3.63\), and the log odds for a message with 5 line breaks are \(log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = 0.64755 - 0.71319 log(5) = -0.50\). Higher log odds correspond to a higher probability, and -0.50 is greater than -3.63. Therefore, the spam filter is more likely to classify a message with 5 line breaks as spam than a message with 400 line breaks.
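As a check on the log odds, here is a short sketch that plugs the spamModel2 coefficients into the linear predictor. It takes the natural log of the line-break counts, since log(line_breaks) is the model's explanatory variable. The variable names are my own.

```r
# Coefficients copied from the spamModel2 summary above
b0 <- 0.64755
b1 <- -0.71319

line_breaks <- c(5, 400)

# Linear predictor (log odds) on the log(line_breaks) scale
log_odds <- b0 + b1 * log(line_breaks)
round(log_odds, 2)   # -0.50 -3.63

# Log odds and probability always move in the same direction
plogis(log_odds[1]) > plogis(log_odds[2])   # TRUE
```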

Question 20: Suppose we want to make a decision about our email with 400 line breaks. We already know the probabilities. Adapt the code above. Show your code, and your result.

My code is: sample(c("spam", "notspam"), 1, prob = c(.0259, .9741))

## [1] "notspam"
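A single draw can come out either way, so it may help to see how often "spam" appears over many draws. The sketch below repeats the same sampling 10,000 times. The seed and the number of draws are arbitrary choices of mine, not part of the lab.

```r
set.seed(112)   # arbitrary seed, chosen only for reproducibility

# Repeat the spam / notspam decision many times with the same probabilities
draws <- sample(c("spam", "notspam"), size = 10000, replace = TRUE,
                prob = c(0.0259, 0.9741))

# Share of draws labeled spam; should land close to 0.0259
spam_share <- mean(draws == "spam")
spam_share
```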