Data 607 Project 4 (Training Documents)

# Assignment

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

# Selected dataset and preparation for this project

These are the packages I’ll be using for this project.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.0 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(magrittr)

## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

library(data.table)

## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose

For this assignment I am required to build a predictive classifier that labels an incoming email as spam or ham. The machine learning model is built from datasets posted at https://spamassassin.apache.org/old/publiccorpus/ and I focused on two of the files there: 20050311_spam_2.tar.bz2 and 20030228_easy_ham.tar.bz2.

# Selected Datasets

We will start with the spam emails.

# URL of the spam archive, plus its local file names (compressed and uncompressed)
url_spam <- "http://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2"
file_spam1 <- "20050311_spam_2.tar.bz2"
file_spam2 <- "20050311_spam_2.tar"

Next come the ham emails.

url_ham <- "http://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2"
file_ham1 <- "20030228_easy_ham.tar.bz2"
file_ham2 <- "20030228_easy_ham.tar"
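
The file names above are typed by hand, which makes it easy for a URL and its local file name to drift apart. As a minimal sketch, base R's `basename()` can derive both names from the URL instead. The helper name `archive_names` is hypothetical and not part of the original project.

```r
# Hypothetical helper: derive local file names from an archive URL.
archive_names <- function(url) {
  bz2 <- basename(url)            # last path component, e.g. "20050311_spam_2.tar.bz2"
  tar <- sub("\\.bz2$", "", bz2)  # name the archive would have once decompressed
  list(bz2 = bz2, tar = tar)
}

spam_names <- archive_names("http://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2")
ham_names  <- archive_names("http://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2")
```

Here `spam_names$bz2` and `ham_names$bz2` could stand in for `file_spam1` and `file_ham1`.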

# Downloading the Dataset

With the datasets selected, we can download and unpack them to begin building the training documents.

# Download both compressed archives
download.file(url_spam, destfile = file_spam1)
download.file(url_ham, destfile = file_ham1)

# Unpack the downloaded .tar.bz2 archives directly into "spamham".
# file_spam2 (the bare .tar) is never created by this script, so passing it
# to untar() fails with a "returned error code 1" warning.
untar(file_ham1, exdir = "spamham")
untar(file_spam1, exdir = "spamham")
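
Once the archives are unpacked, a natural next step toward training is to load each message together with its label. The sketch below is one possible way to do that in base R, not the project's actual method. `read_corpus` is a hypothetical helper, and the demo runs on two throwaway files in a temporary directory so that it does not depend on the download.

```r
# Hypothetical helper: read every file in `dir` into one row of a data frame.
read_corpus <- function(dir, label) {
  files <- list.files(dir, full.names = TRUE)
  files <- files[!dir.exists(files)]  # keep regular files, skip subdirectories
  text <- vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
    character(1)
  )
  data.frame(
    file = basename(files),
    label = rep(label, length(files)),
    text = unname(text),
    stringsAsFactors = FALSE
  )
}

# Demo on a temporary directory with two made-up messages
demo_dir <- file.path(tempdir(), "corpus_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Subject: hello", "", "See you at lunch."), file.path(demo_dir, "msg1"))
writeLines(c("Subject: WIN NOW", "", "Click here."), file.path(demo_dir, "msg2"))

demo <- read_corpus(demo_dir, "ham")
```

For the real data, calling `read_corpus()` once per extracted folder with the labels "spam" and "ham" and combining the results with `rbind()` would give a single labeled table. The folder names under `spamham` are not shown in this document, so check them with `list.files("spamham")` first.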

Wilson Chau

2022-11-19