Question

Data set: SPAM.

Peter Derby works as a cyber security analyst at a private equity firm. He has been asked to implement a spam detection system on the company’s email server. He has access to a sample of emails with two variables: spam (1 if spam, 0 otherwise) and the number of hyperlinks in the message. Before implementing a spam detection system, he wants to better understand the company’s emails. In a report, conduct a hypothesis test at the 1% significance level to determine whether more than 50% of the company’s emails are spam.

head(Spam)
##   Spam Hyperlinks
## 1    0          1
## 2    0          1
## 3    1         11
## 4    1         11
## 5    0          1
## 6    0          2
table(Spam$Spam)
## 
##  0  1 
## 41 59

Identify and find sample statistics:

# number of successes
x = sum(Spam$Spam == 1)
# sample size
n = nrow(Spam)
n
## [1] 100
# sample proportion
p = x/n
p
## [1] 0.59
Step 1. Prepare

Hypothesis:

\(H_0:\pi \le 0.50\); \(H_1:\pi > 0.50\)

pi_0 = 0.50
Step 2. Check
n*(1-p)
## [1] 41
n*p
## [1] 59

Since both \(np \ge 5\) and \(n(1-p) \ge 5\), the condition for using the normal approximation is met.
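Because the test statistic is computed under the null hypothesis, this condition is often checked with the hypothesized proportion \(\pi_0\) instead of the sample proportion. The following is a minimal sketch of that variant, assuming the values from the output above (n = 100, pi_0 = 0.50) are typed in directly rather than read from the Spam data frame; the names `expected_spam`, `expected_not_spam`, and `condition_met` are illustrative only.

```r
# Values taken from the output above; hard-coded so the block runs on its own
n <- 100
pi_0 <- 0.50

# Expected counts of spam and non-spam emails if the null value were true
expected_spam <- n * pi_0
expected_not_spam <- n * (1 - pi_0)

# The normal approximation is treated as reasonable when both counts are at least 5
condition_met <- expected_spam >= 5 && expected_not_spam >= 5
condition_met
```

Both expected counts equal 50 here, so the condition holds under either version of the check.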

Step 3. Significance level

\(\alpha = 0.01\), as specified in the question.

Step 4. Calculate
#Standard Error
SE = sqrt(pi_0*(1-pi_0)/n)
SE
## [1] 0.05
#Z statistic
Zstat = (p - pi_0)/SE
Zstat
## [1] 1.8

This is an upper-tailed test.

pvalue = pnorm(Zstat,lower.tail=FALSE)
pvalue
## [1] 0.03593032
Step 5. Conclude

Since the p-value (0.03593032) > \(\alpha\) (0.01), do not reject the null hypothesis. At the 1% significance level, we cannot conclude that the proportion of emails that are spam is greater than 0.50. The result would be significant at the 5% level, but not at the 1% level the question asks for.
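The question asks for a 1% significance level, and the decision at that level can be cross-checked in two ways: by comparing the Z statistic to the critical value, and by calling base R's `prop.test()` without continuity correction, which should reproduce the same one-sided p-value. This is a sketch under the assumption that the counts from the output above (59 spam emails out of 100) are entered by hand; the names `z_stat`, `z_crit`, `reject_h0`, and `pt` are illustrative.

```r
# Values taken from the output above; hard-coded so the block runs on its own
x <- 59        # number of spam emails in the sample
n <- 100       # sample size
pi_0 <- 0.50   # hypothesized proportion
alpha <- 0.01  # significance level stated in the question

# Critical value approach: reject H0 only if the Z statistic exceeds z_alpha
z_stat <- (x / n - pi_0) / sqrt(pi_0 * (1 - pi_0) / n)
z_crit <- qnorm(1 - alpha)
reject_h0 <- z_stat > z_crit

# One-sample proportion test; correct = FALSE turns off the continuity
# correction so the p-value matches the plain Z test
pt <- prop.test(x, n, p = pi_0, alternative = "greater", correct = FALSE)

z_crit
reject_h0
pt$p.value
```

With these inputs the critical value is about 2.326, which is larger than the Z statistic of 1.8, so `reject_h0` is `FALSE` and the null hypothesis is not rejected at the 1% level.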