Nathaniel Lilly’s Final Project

Overview

The goal of my final project is to use logistic regression to predict whether an email is spam or not.

Introduction

Internet users receive millions of spam emails every day. Every large email account provider has a spam detector that places unwanted messages in their own category, but some always seem to sneak into your inbox. The goal of this project is to find the most efficient way to get as many spam emails into their rightful place as possible. The data set “email” can be found in the openintro package, and comes from the email account of the author of the textbook.

Exploring the Data

openintro::email

This pulls the email data set from the openintro package and displays it in the document.

email <- openintro::email

This stores the data set in the environment so that it is easy to access.

summary(email)

This is to view a summary of the email data set and become familiar with it.

sapply(email,function(x) sum(is.na(x)))
sapply(email, function(x) length(unique(x)))

The first line counts the missing values in each variable, and the second counts how many distinct values each variable takes. Together they show what categories are available.

library(Amelia)
missmap(email, main = "Missing values vs observed")

Amelia is a package I downloaded in order to use the missmap function, which maps out the data to show whether any values are missing. After using it, I came to the conclusion that there are no missing values.
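The same check can be sketched without an extra package. This is a minimal illustration using base R only; the name na_counts is introduced here for illustration, and the snippet assumes the openintro package is installed.

```r
# Load the data straight from the package namespace (assumes openintro is installed).
email <- openintro::email

# Number of missing values in each column.
na_counts <- colSums(is.na(email))

# Show only the columns that contain at least one NA;
# an empty result means no column has missing values.
na_counts[na_counts > 0]

# One TRUE/FALSE answer for the whole data set.
any(is.na(email))
```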

Analysis

sspam = glm(spam ~ winner, data = email, family = binomial)
summary(sspam)

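This is a model of the relationship between spam and winner. The p-value is 3.06e-08, which shows that the association between the two is highly statistically significant. The p-value measures the strength of the evidence, not the size of the effect.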
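To see the size of the effect rather than only its significance, the coefficients can be moved from the log-odds scale to odds ratios. The following is a rough sketch, not part of the original analysis; odds_ratios and ci are names made up for this example, and the intervals are simple Wald intervals from confint.default.

```r
# Refit the single-predictor model so the example is self-contained
# (assumes openintro is installed).
email <- openintro::email
sspam <- glm(spam ~ winner, data = email, family = binomial)

# glm coefficients are on the log-odds scale; exponentiating gives odds ratios.
odds_ratios <- exp(coef(sspam))

# Wald 95% confidence intervals, also converted to the odds-ratio scale.
ci <- exp(confint.default(sspam))

# Odds ratios side by side with their interval bounds.
cbind(odds_ratio = odds_ratios, ci)
```

An odds ratio above 1 for the winner term would mean that emails containing the word have higher odds of being spam than emails without it.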

fspam = glm(spam ~ ., data = email, family = binomial)
summary(fspam)

This is a full analysis showing all the potential spam detectors and their p-values.
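Because the full summary is long, it can help to pull out only the predictors with small p-values. This is a sketch under an assumption: the 0.05 cutoff is a common convention chosen here for illustration, not something taken from the analysis above, and coef_table and significant are hypothetical names.

```r
# Refit the full model so the example is self-contained
# (assumes openintro is installed).
email <- openintro::email
fspam <- glm(spam ~ ., data = email, family = binomial)

# Coefficient table: estimate, standard error, z value, and p-value per term.
coef_table <- coef(summary(fspam))

# Keep the rows whose p-value is below the illustrative 0.05 cutoff.
keep <- which(coef_table[, "Pr(>|z|)"] < 0.05)
significant <- coef_table[keep, , drop = FALSE]

# Sort so the smallest p-values come first.
significant[order(significant[, "Pr(>|z|)"]), , drop = FALSE]
```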

anova(fspam, test="Chisq")

This is to see the deviance, residual deviance, and degrees of freedom for each variable as it is added to the model, as well as the p-values.

library(pscl)
pR2(fspam)

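This is to measure how well the spam model fits. pR2 reports pseudo-R² values, which compare the model with one that has no predictors. The value is 0.39937, or 39.937%. This describes the improvement in fit, not the share of spam emails that get caught.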
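One of the values pR2 returns is McFadden's pseudo-R², which can be written as one minus the ratio of the model's log-likelihood to the log-likelihood of an intercept-only model. The sketch below computes it by hand to show where such a number comes from; null_model and mcfadden are names chosen for this example, and it is an assumption that this is the value quoted above.

```r
# Refit the full model so the example is self-contained
# (assumes openintro is installed).
email <- openintro::email
fspam <- glm(spam ~ ., data = email, family = binomial)

# Intercept-only model: predicts the same spam probability for every email.
null_model <- glm(spam ~ 1, data = email, family = binomial)

# McFadden's pseudo-R-squared: 1 - logLik(full model) / logLik(null model).
mcfadden <- 1 - as.numeric(logLik(fspam)) / as.numeric(logLik(null_model))
mcfadden
```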

Conclusions

In conclusion, the model built using logistic regression has a pseudo-R² of 39.937%, which is the measure used here for how effectively it marks spam.
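A more direct way to see how many spam emails the model catches is to compare its predictions with the true labels. This is a minimal sketch, not part of the original analysis: the 0.5 cutoff is an arbitrary choice for illustration, the variable names are made up here, and the rates are computed on the same data used to fit the model, so they are likely optimistic.

```r
# Refit the full model so the example is self-contained
# (assumes openintro is installed).
email <- openintro::email
fspam <- glm(spam ~ ., data = email, family = binomial)

# Predicted probability of spam for every email in the data set.
p_hat <- predict(fspam, type = "response")

# Flag an email as spam when its probability exceeds the illustrative 0.5 cutoff.
predicted_spam <- p_hat > 0.5

# True labels; as.character handles spam stored as either a factor or a number.
actual_spam <- as.character(email$spam) == "1"

# Confusion matrix: rows are the truth, columns are the model's call.
confusion <- table(actual = actual_spam, predicted = predicted_spam)
confusion

# Share of actual spam emails that the model flagged.
sensitivity <- sum(predicted_spam & actual_spam) / sum(actual_spam)

# Share of all emails that were classified correctly.
accuracy <- mean(predicted_spam == actual_spam)

c(sensitivity = sensitivity, accuracy = accuracy)
```

Lowering the cutoff would catch more spam at the cost of flagging more legitimate emails, so the choice depends on which mistake is worse.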

Limitations

There are definitely limitations to this method. This model is supposed to detect spam, and its pseudo-R² is only 39.937%. This is not a great result. A computerized model that is not regulated by humans is the best option in this case, but millions of spam emails are sent every day, all with the goal of getting around spam filters, so the reality is that it is impossible to catch them all. Computer models are extremely rigid, so one tweak to punctuation, spelling, key words, or format by the spammers allows an email to go untagged and make its way to your inbox. It is an uphill battle, so any spam filter is better than none, and that is what this model provides.

This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.
The course was led by Professor Billy Jackson.
Student Name: Nathaniel Lilly Semester: Spring 2018