Data set: SPAM.
Peter Derby works as a cyber security analyst at a private equity firm. He has been asked to implement a spam detection system on the company’s email server. He has access to a sample of emails with two variables: spam (1 if spam, 0 otherwise) and the number of hyperlinks in the message. Before implementing the system, he wants to better understand the company’s emails. In a report, conduct a hypothesis test at the 5% significance level to determine whether more than 50% of the company’s emails are spam.
head(Spam)
##   Spam Hyperlinks
## 1    0          1
## 2    0          1
## 3    1         11
## 4    1         11
## 5    0          1
## 6    0          2
table(Spam$Spam)
##
##  0  1
## 41 59
Identify and find sample statistics:
# number of successes
x = sum(Spam$Spam == 1)
# sample size
n = nrow(Spam)
n
## [1] 100
# sample proportion
p = x/n
p
## [1] 0.59
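Because the variable is coded 0/1, the sample proportion is also the mean of the column. The following is a minimal sketch of that shortcut. It uses a hypothetical vector `spam_flag` rebuilt from the tabulated counts above (41 zeros, 59 ones) as a stand-in for `Spam$Spam`, and the `_alt` names are introduced here only for illustration.

```r
# Hypothetical stand-in for Spam$Spam, rebuilt from the tabulated counts
spam_flag <- rep(c(0, 1), times = c(41, 59))

x_alt <- sum(spam_flag == 1)   # number of spam messages
n_alt <- length(spam_flag)     # sample size
p_alt <- mean(spam_flag)       # mean of a 0/1 variable is the sample proportion

c(x = x_alt, n = n_alt, p = p_alt)
```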
Hypotheses:
\(H_0:\pi \le 0.50\); \(H_1:\pi > 0.50\)
pi_0 = 0.50
n*(1-p)
## [1] 41
n*p
## [1] 59
Since both \(np = 59 \ge 5\) and \(n(1-p) = 41 \ge 5\), the condition for the normal approximation is met.
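For a hypothesis test, this condition is often checked at the hypothesized proportion rather than the sample proportion, because the sampling distribution of the test statistic is derived under the null hypothesis. Here is a short sketch of that variant, with the sample size and hypothesized value from above entered by hand. The `_chk` and `expected_*` names are illustrative, not part of the original solution.

```r
n_chk <- 100       # sample size reported above
pi_0_chk <- 0.50   # hypothesized proportion under H0

expected_spam <- n_chk * pi_0_chk            # expected spam count under H0
expected_not_spam <- n_chk * (1 - pi_0_chk)  # expected non-spam count under H0

# TRUE when both expected counts are at least 5
expected_spam >= 5 && expected_not_spam >= 5
```

Both expected counts equal 50, so the condition holds under either version of the check.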
\(\alpha = 0.05\).
# standard error of the sample proportion under the null hypothesis
SE = sqrt(pi_0*(1-pi_0)/n)
SE
## [1] 0.05
# Z statistic
Zstat = (p - pi_0)/SE
Zstat
## [1] 1.8
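The same decision can also be reached with the critical value approach, which compares the Z statistic to the upper-tail cutoff of the standard normal distribution. The sketch below assumes the 0.05 significance level stated in this solution and re-enters the numbers from above by hand. The names `alpha_cv`, `z_stat`, and `z_crit` are illustrative.

```r
alpha_cv <- 0.05                 # significance level assumed from this solution
z_stat <- (0.59 - 0.50) / 0.05   # Z statistic, using p, pi_0 and SE from above
z_crit <- qnorm(1 - alpha_cv)    # upper-tail critical value, about 1.645

z_stat > z_crit                  # TRUE means the statistic falls in the rejection region
```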
This is an upper-tailed test, so the p-value is the area under the standard normal curve to the right of the Z statistic.
pvalue = pnorm(Zstat,lower.tail=FALSE)
pvalue
## [1] 0.03593032
Since the p-value (0.03593032) < \(\alpha\) (0.05), we reject the null hypothesis. At the 5% significance level, we can conclude that the proportion of emails that are spam is greater than 0.50.
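As a cross-check on the manual calculation, base R's `prop.test()` runs the same one-sample test when the continuity correction is switched off. This is a sketch rather than part of the original solution. The counts are entered by hand from the table above, and `res` is an illustrative name.

```r
# Counts taken from the table above: 59 spam messages out of 100
res <- prop.test(x = 59, n = 100, p = 0.50,
                 alternative = "greater", correct = FALSE)

res$statistic   # chi-squared statistic, equal to the square of the Z statistic
res$p.value     # upper-tail p-value from the normal approximation
```

With `correct = FALSE`, the reported p-value should agree with the `pnorm()` result above. Leaving the default `correct = TRUE` applies Yates' continuity correction and gives a slightly larger p-value.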