Abstract:

Is there an association between spam and the length of an email? You could imagine a story either way:

“Spam is more likely to be a short message tempting me to click on a link,” or “My normal email is likely shorter, since I exchange brief emails with my friends all the time.” Well, let’s find out!

We first load the necessary libraries:

library(openintro) #Contains the dataset email
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
## 
##     cars, trees
library(dplyr) #For data manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2) #For fancy plots
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:openintro':
## 
##     diamonds

We start by plotting the number of characters to check for outliers before starting our center and spread analysis.

plot(email$num_char, col = "gold") #Plot to check outliers

1- Center and Spread Analysis

Looking at the plot of the number of characters in an email (num_char), we see a lot of outliers, so we summarize the center and spread of the data using the median and IQR, which are robust to extreme values. We then create side-by-side boxplots of the data, applying a log transformation to the number of characters to reduce its skew.

#Compute summary statistics
email %>% 
  group_by(spam) %>%
  summarize(median(num_char), IQR(num_char))
## # A tibble: 2 x 3
##    spam `median(num_char)` `IQR(num_char)`
##   <dbl>              <dbl>           <dbl>
## 1     0              6.831        13.58225
## 2     1              1.046         2.81800
#Create plot
email %>% 
  mutate(log_num_char = log(num_char)) %>%
  ggplot(aes(x = spam, y = log_num_char, fill=spam)) + geom_boxplot() + facet_wrap(~spam)
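To see why the median and IQR are preferred over the mean and standard deviation when outliers are present, here is a minimal sketch on made-up numbers. The vectors `x` and `x_out` and the helper `center_spread` are hypothetical and are not taken from the email data.

```r
# Hypothetical toy values, NOT from the email dataset
x     <- c(1, 2, 3, 4, 5)   # well-behaved sample
x_out <- c(x, 100)          # same sample plus one extreme value

# Helper (illustrative name) returning both pairs of summary statistics
center_spread <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), IQR = IQR(v))
}

# Compare the summaries with and without the outlier
comparison <- rbind(clean = center_spread(x), with_outlier = center_spread(x_out))
print(round(comparison, 2))
```

In this sketch the single extreme value pulls the mean and standard deviation far away from the bulk of the data, while the median and IQR barely move.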

2-Association Between Spam Emails and The Number of Exclamation Points

The boxplots above show that the typical spam message is shorter than the typical non-spam message. Now let’s look at the variable exclaim_mess, which is the number of exclamation points in an email. Boxplots allow us to clearly see the distribution of a continuous variable and spot potential outliers.

boxplot(email$exclaim_mess) #Quick outlier check for the studied variable

# Compute center and spread for exclaim_mess by spam
email %>%
  group_by(spam) %>%
  summarize(median(exclaim_mess), IQR(exclaim_mess)) #Use median and IQR due to outliers
## # A tibble: 2 x 3
##    spam `median(exclaim_mess)` `IQR(exclaim_mess)`
##   <dbl>                  <dbl>               <dbl>
## 1     0                      1                   5
## 2     1                      0                   1

The boxplot of exclaim_mess above clearly reveals multiple outliers in the distribution, so we plot it on a log scale, adding 0.01 to avoid taking the log of zero.

# Create plot for spam and exclaim_mess
email %>%
  mutate(log_exclaim_mess = log(exclaim_mess + 0.01)) %>%
  ggplot(aes(x = log_exclaim_mess, color = spam)) +
  geom_histogram(fill = "yellow") +
  facet_wrap(~spam)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

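We notice that the distribution above is still heavily skewed to the right in both groups, even after the log transformation. The raw counts in the two histograms are not directly comparable, because there are far more non-spam emails than spam emails. Going by the medians computed earlier, the typical non-spam email actually contains slightly more exclamation points (median 1) than the typical spam email (median 0), even though spammers want you to click on their link at all costs!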
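The small offset of 0.01 in the log transformation above matters because many emails contain no exclamation points at all. A minimal sketch on made-up counts (the vector `exclaim_toy` is hypothetical, not taken from the email data) shows what happens with and without it:

```r
# Hypothetical counts of exclamation points, including zeros
exclaim_toy <- c(0, 0, 1, 5, 40)

# Without an offset, zeros map to -Inf, which ggplot2 would drop with a warning
log_raw <- log(exclaim_toy)

# With the offset, zeros map to log(0.01), roughly -4.6, and stay on the plot
log_shifted <- log(exclaim_toy + 0.01)

print(log_raw)
print(round(log_shifted, 2))
```

The size of the offset is a judgment call: a smaller value pushes the spike of zeros further to the left of the rest of the distribution.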

3-Association Between Images and Spams

Now let’s analyze the number of images in an email, stored in the variable image.

table(email$image) #Count of emails by number of images
## 
##    0    1    2    3    4    5    9   20 
## 3811   76   17   11    2    2    1    1

Because very few emails contain any images, we will treat image as a categorical variable that indicates whether or not the email has at least one image. We create this new variable and explore its association with spam.

# Create plot of proportion of spam by image
email %>%
  mutate(has_image = image > 0) %>%
  ggplot(aes(x = has_image, fill = factor(spam))) + #Fill by spam so each bar is split by class
  geom_bar(position = "fill") #Rescale each bar to proportions

We can see that an email with no image is more likely to be not-spam than spam.
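The same kind of comparison can be read off a table of conditional proportions instead of a plot. The following is a minimal sketch on made-up data; the data frame `toy` and its values are hypothetical and are not taken from the email dataset.

```r
# Hypothetical toy data, NOT the email dataset
toy <- data.frame(
  has_image = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  spam      = c(0, 0, 1, 0, 0, 0, 1, 0)
)

# Cross-tabulate, then convert counts to row-wise proportions:
# within each has_image group, the share of not-spam (0) and spam (1)
counts <- table(has_image = toy$has_image, spam = toy$spam)
props  <- prop.table(counts, margin = 1)

print(round(props, 2))
```

With `margin = 1` each row sums to one, so the two groups can be compared even when one of them contains far fewer emails than the other.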

Now let’s consider the variables image and attach. We are trying to determine whether images count as attached files.

sum(email$image > email$attach) #Count emails where the number of images exceeds the number of attached files
## [1] 0

We see that the number of images is never greater than the number of attached files, which is consistent with images being counted as attached files.
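The logic of this check can be sketched on made-up vectors (`image_toy` and `attach_toy` are hypothetical, not taken from the email data). A count of zero is the same as saying that every email has at most as many images as attached files:

```r
# Hypothetical per-email counts, NOT from the email dataset
image_toy  <- c(0, 1, 2, 0)
attach_toy <- c(0, 1, 3, 2)

# Number of emails where images outnumber attached files
n_violations <- sum(image_toy > attach_toy)

# Equivalent statement: images never exceed attached files
never_exceeds <- all(image_toy <= attach_toy)

print(n_violations)
print(never_exceeds)
```

Note that this only shows the data are compatible with images being counted as attachments; it does not prove it on its own.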

4-Association Between The “Dollar Sign ($)” and Spams

Let’s take a look at the variable dollar. Among emails that mention it at least once, let’s find out whether or not a typical spam email contains more occurrences of the word dollar than the typical non-spam email.

email %>%
  filter(dollar > 0) %>%
  group_by(spam) %>%
  summarize(median(dollar))
## # A tibble: 2 x 2
##    spam `median(dollar)`
##   <dbl>            <dbl>
## 1     0                4
## 2     1                2

If we encounter an email with more than 10 occurrences of the word “dollar”, is it more likely to be spam or not? Let’s find out!

email %>%
  filter(dollar > 10) %>%
  group_by(spam) %>%
  ggplot(aes(x = spam, color = spam)) + geom_bar(fill = "pink")

5-Association Between The Size of a Number and Spams

Now let’s take a look at the variable number and explore its association with spam. number is a categorical variable, so we will use side-by-side bar charts to analyze it.

#Reorder levels
email$number <- factor(email$number, levels = c("none","small","big"))

# Construct plot of number
email %>%
  group_by(spam) %>%
  ggplot(aes(x = number)) +
  geom_bar(fill = "blue") +
  facet_wrap(~spam)

From the bar chart we can clearly see that given that an email contains a big number, it’s more likely to be not-spam.

Conclusion:

To conclude, there are many more variables that could have contributed to our analysis; this is just one angle from which to look at the problem. From our statistical exploration, we conclude that the more unusual items (high occurrences of images, dollar signs,…) an email has, the more likely it is to be spam.