Data 607 Project 4: Document Classifier (Thought Process)

#Assignment
It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/

For this assignment, I had some trouble getting a few of the inputs to run in RStudio. I updated all my libraries and packages, but some of the inputs still aren't working properly. Below I explain my thought process for this project in detail.

#Step 1: Libraries
I tried to update my libraries, and these are the recommended ones I went with: tm, wordcloud, RTextTools, and e1071. I need these libraries to help separate the spam from the ham e-mails.

#library(tm)

#library(wordcloud)

#library(RTextTools)

#library(e1071)
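The "there is no package called" message quoted in Step 3 is what R prints when a package is not installed in the library it is searching, so a quick availability check may help narrow the problem down. This is only a sketch: the package list is taken from the code that is actually used in this write-up, and the variable names are made up for illustration.

```r
# Packages whose functions are called later in this document
required_pkgs <- c("tm", "wordcloud", "e1071")

# requireNamespace() returns TRUE/FALSE instead of stopping with an error
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to try installing from CRAN
} else {
  message("All required packages are available")
}
```

If any package shows up as missing, installing it and restarting the R session is the usual first thing to try before uncommenting the `library()` calls.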

#Step 2: Reading the spam/ham data and converting it into data frames

In Step 2, I downloaded 20021010_easy_ham.tar.bz2 and 20050311_spam_2.tar.bz2 to my desktop and pointed the directory paths below at the extracted folders. I then converted the spam/ham data into data frames for us to work with.

# Create Ham Dataframe
ham_dir='/Users/wilsonchau/Desktop/Project4/easy_ham'
hamFileNames = list.files(ham_dir)

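# List of docs: one collapsed string per ham file
ham_docs_list <- list()
for(i in seq_along(hamFileNames))
{
  # path to the i-th file in the ham directory
  filepath<-paste0(ham_dir, "/", hamFileNames[i])
  text <-readLines(filepath)
  # join the lines of the e-mail into a single string
  list1<- list(paste(text, collapse="\n"))
  ham_docs_list = c(ham_docs_list,list1)
  
}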

# ham data frame
hamDF <-as.data.frame(unlist(ham_docs_list),stringsAsFactors = FALSE)
hamDF$type <- "ham"
colnames(hamDF) <- c("text","type")

# Create Spam Dataframe
spam_dir='/Users/wilsonchau/Desktop/Project4/spam_2'
spamFileNames = list.files(spam_dir)

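# List of docs: one collapsed string per spam file
spam_docs_list <- list()
for(i in seq_along(spamFileNames))
{
  # path to the i-th file in the spam directory
  filepath<-paste0(spam_dir, "/", spamFileNames[i])
  text <-readLines(filepath)
  # join the lines of the e-mail into a single string
  list1<- list(paste(text, collapse="\n"))
  spam_docs_list = c(spam_docs_list,list1)
  
}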

spamDF <-as.data.frame(unlist(spam_docs_list),stringsAsFactors = FALSE)
spamDF$type <- "spam"
colnames(spamDF) <- c("text","type")


# creating combined data frame of spam and ham
spam_ham_df <- rbind(hamDF, spamDF)
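The two loops above do the same job for different folders, so they could be folded into one helper. The sketch below is an assumption about how that might look, not the code used for this project: `read_email_dir` is a hypothetical name, and the demo reads two tiny stand-in files from a temporary directory instead of the real corpus.

```r
# Hypothetical helper: read every file in a directory into a data frame
# with one row per document and a fixed label in the "type" column.
read_email_dir <- function(dir_path, label) {
  file_paths <- list.files(dir_path, full.names = TRUE)
  texts <- vapply(
    file_paths,
    function(p) paste(readLines(p, warn = FALSE), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
  data.frame(text = texts, type = rep(label, length(texts)),
             stringsAsFactors = FALSE)
}

# Demo on a temporary directory holding two small stand-in files
demo_dir <- file.path(tempdir(), "demo_ham")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Subject: lunch", "See you at noon"), file.path(demo_dir, "0001"))
writeLines(c("Subject: notes", "Attached"), file.path(demo_dir, "0002"))

demo_df <- read_email_dir(demo_dir, "ham")
str(demo_df)
```

With the real data, the same function would be called once with `ham_dir` and once with `spam_dir`, and the two results combined with `rbind`.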

#Step 3: Corpus preparation
I create an email corpus after the data frames are cleaned up. After cleaning my data frame, I tried to set up the corpus and got these errors:

Error in library(tm) : there is no package called ‘tm’
Error in Corpus(VectorSource(spam_ham_df$text)) : could not find function "Corpus"

I am showing my code work, but I don't understand why Corpus isn't working properly. This is my code input:

1) Create the corpus dataset: emailCorpus <- Corpus(VectorSource(spam_ham_df$text))
2) Remove numbers: cleanCorpus <- tm_map(emailCorpus, removeNumbers)
3) Remove punctuation: cleanCorpus <- tm_map(cleanCorpus, removePunctuation)
4) Remove non-related words (stop words): cleanCorpus <- tm_map(cleanCorpus, removeWords, stopwords())
5) Remove excess white space: cleanCorpus <- tm_map(cleanCorpus, stripWhitespace)

#emailCorpus <- Corpus(VectorSource(spam_ham_df$text))
#cleanCorpus <- tm_map(emailCorpus, removeNumbers)
#cleanCorpus <- tm_map(cleanCorpus, removePunctuation)
#cleanCorpus <- tm_map(cleanCorpus, removeWords, stopwords())
#cleanCorpus <- tm_map(cleanCorpus, stripWhitespace)
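Since the tm calls could not be run here, the intent of the four cleaning steps can be illustrated with base R string functions. This is a rough approximation and not tm itself: `clean_text` and `demo_stop_words` are hypothetical names, the stop-word list is a tiny made-up sample, and the stop-word matching here ignores case, which may differ from what tm does.

```r
# Base R stand-ins for the cleaning steps (an approximation, not tm itself)
clean_text <- function(x, stop_words) {
  x <- gsub("[[:digit:]]+", "", x)    # remove numbers
  x <- gsub("[[:punct:]]+", "", x)    # remove punctuation
  # remove whole stop words, ignoring case
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, ignore.case = TRUE, perl = TRUE)
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of white space
  trimws(x)
}

demo_stop_words <- c("the", "a", "is", "to")  # hypothetical short list
cleaned <- clean_text("Win $1000 now!!!  The offer is open to you.", demo_stop_words)
print(cleaned)
```

The order matters a little: punctuation is stripped before stop words so that a word followed by a period or comma is still recognized.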

#Step 4: Trying to create a document-term matrix
Creating a document-term matrix for the spam/ham emails helps describe how often each word is repeated across the collection of documents. I am unsure why Corpus isn’t working for this input. I also worked on creating a word cloud to show the most frequently used words in the spam and ham emails, but wordcloud isn’t working in my RStudio.

#email_dtm <- DocumentTermMatrix(cleanCorpus)
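To make the idea of a document-term matrix concrete without tm, here is a small base R sketch on three made-up documents. It is only meant to show the shape of the result (one row per document, one column per word, counts in the cells); the variable names are illustrative and the whitespace tokenization is much simpler than what tm would do.

```r
# Three made-up documents
docs <- c("free money now", "meeting notes attached", "free free offer")

# Split each document into lowercase words
tokens <- strsplit(tolower(docs), "[[:space:]]+")

# Vocabulary: every distinct word, sorted
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary word in each document; rows = documents, columns = terms
dtm <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

print(dtm)
colSums(dtm)  # total frequency of each word across all documents
```

The column sums are the word frequencies that a word cloud would be drawn from.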


# spam word cloud
#spam_indices <- which(spam_ham_df$type == "spam")
#suppressWarnings(wordcloud(cleanCorpus[spam_indices], min.freq=40))

# ham word cloud
#ham_indices <- which(spam_ham_df$type == "ham")
#suppressWarnings(wordcloud(cleanCorpus[ham_indices], min.freq=50))

#Step 5: Prepare training and test data
In this section I was able to prepare the training and test data. I take a random sample of 70% of the rows for training, and the remaining 30% is held out for prediction.

# Model to assess spam and ham

# sample 70% data training and 30 % for prediction

sample_size <- floor(0.70 * nrow(spam_ham_df))

# set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(spam_ham_df)), size = sample_size)

train_spam_ham <- spam_ham_df[train_ind, ]
test_spam_ham <- spam_ham_df[-train_ind, ]

# spam and ham rows in the train data set (use nrow() on each to get the counts)
spam<-subset(train_spam_ham,train_spam_ham$type == "spam")
ham<-subset(train_spam_ham,train_spam_ham$type == "ham")
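A quick way to sanity-check a split like this is to look at the class counts on each side. The sketch below repeats the same 70/30 sampling on a made-up ten-row data frame, so the names `toy_df`, `toy_train`, and `toy_test` are assumptions for illustration only.

```r
set.seed(123)

# Made-up data: 7 ham and 3 spam messages
toy_df <- data.frame(
  text = paste("message", 1:10),
  type = rep(c("ham", "spam"), times = c(7, 3)),
  stringsAsFactors = FALSE
)

# Same 70/30 sampling as above
train_idx <- sample(seq_len(nrow(toy_df)), size = floor(0.70 * nrow(toy_df)))
toy_train <- toy_df[train_idx, ]
toy_test  <- toy_df[-train_idx, ]

# Class counts in each split
table(toy_train$type)
table(toy_test$type)
```

Because the sample is purely random, the spam share in the two parts can drift apart, especially on small data, so it may be worth looking at these tables before training.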

#Step 5.5: Unable to create corpus for training/test data
My RStudio wasn’t able to create a corpus for the training and test data. I wrote a count function that converts each word count into a Yes/No flag for whether the word appears in an email.

# Create corpus for training and test data
#train_email_corpus <- Corpus(VectorSource(train_spam_ham$text))
#test_email_corpus <- Corpus(VectorSource(test_spam_ham$text))

#train_clean_corpus <- tm_map(train_email_corpus ,removeNumbers)
#test_clean_corpus <- tm_map(test_email_corpus, removeNumbers)

#train_clean_corpus <- tm_map(train_clean_corpus, removePunctuation)
#test_clean_corpus <- tm_map(test_clean_corpus, removePunctuation)

#train_clean_corpus <- tm_map(train_clean_corpus, removeWords, stopwords())
#test_clean_corpus  <- tm_map(test_clean_corpus, removeWords, stopwords())

#train_clean_corpus<- tm_map(train_clean_corpus, stripWhitespace)
#test_clean_corpus<- tm_map(test_clean_corpus, stripWhitespace)

#train_email_dtm <- DocumentTermMatrix(train_clean_corpus)
#test_email_dtm <- DocumentTermMatrix(test_clean_corpus)
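One thing that may matter once these lines run: two document-term matrices built separately can end up with different columns, because the training and test emails do not contain exactly the same words. A classifier trained on one set of columns generally expects the same columns at prediction time. The base R sketch below shows one possible way to line them up; `align_to_vocab` and the toy matrices are hypothetical. tm's `DocumentTermMatrix` also accepts a `dictionary` entry in its `control` list, which may be the more direct route once the package loads.

```r
# Toy count matrices standing in for separately built train/test DTMs
train_m <- matrix(c(1, 0, 2,
                    0, 1, 0), nrow = 2, byrow = TRUE,
                  dimnames = list(NULL, c("free", "meeting", "money")))
test_m <- matrix(c(1, 3,
                   0, 1), nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("free", "offer")))

# Rebuild a matrix on a given vocabulary: shared terms are copied,
# terms missing from m become 0, and terms not in vocab are dropped.
align_to_vocab <- function(m, vocab) {
  out <- matrix(0, nrow = nrow(m), ncol = length(vocab),
                dimnames = list(NULL, vocab))
  shared <- intersect(colnames(m), vocab)
  out[, shared] <- m[, shared, drop = FALSE]
  out
}

test_aligned <- align_to_vocab(test_m, colnames(train_m))
print(test_aligned)
```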

# count function
convert_count <- function(x) {
  y <- ifelse(x > 0, 1,0)
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}
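To show what the count function does, here it is applied column by column to a small made-up count matrix. The function body is repeated so the example runs on its own; `toy_counts` and `toy_flags` are illustrative names.

```r
# Same logic as the count function above, repeated so this runs on its own
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  factor(y, levels = c(0, 1), labels = c("No", "Yes"))
}

# Made-up word counts: 2 documents (rows) by 3 terms (columns)
toy_counts <- matrix(c(2, 0, 0,
                       0, 1, 3), nrow = 2, byrow = TRUE,
                     dimnames = list(NULL, c("free", "meeting", "money")))

# MARGIN = 2 applies the function to each column (term)
toy_flags <- apply(toy_counts, 2, convert_count)
print(toy_flags)
```

Any count above zero becomes "Yes" regardless of its size, so the classifier only sees whether a word is present, not how often it occurs.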

#train_sms <- apply(train_email_dtm, 2, convert_count)
#test_sms <- apply(test_email_dtm, 2, convert_count)

# classification of email
#classifier <- naiveBayes(train_sms, factor(train_spam_ham$type))

#Step 6: Predict using test data
I used predict to run the classifier on the test data, then tabulated the predictions against the actual labels.

#test_pred <- predict(classifier, newdata=test_sms)

#table(test_pred, test_spam_ham$type)
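Since the prediction step could not be run, the sketch below uses made-up label vectors to show how the resulting table might be read. The `actual` and `predicted` values are invented for illustration and are not results from this project.

```r
# Made-up labels for six test emails (not real results)
actual    <- factor(c("ham", "ham", "spam", "spam", "ham", "spam"),
                    levels = c("ham", "spam"))
predicted <- factor(c("ham", "spam", "spam", "spam", "ham", "ham"),
                    levels = c("ham", "spam"))

# Rows = predicted class, columns = actual class
conf_mat <- table(predicted = predicted, actual = actual)
print(conf_mat)

# Share of all emails classified correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Of the emails flagged as spam, how many really were spam
spam_precision <- conf_mat["spam", "spam"] / sum(conf_mat["spam", ])

# Of the real spam emails, how many were caught
spam_recall <- conf_mat["spam", "spam"] / sum(conf_mat[, "spam"])

c(accuracy = accuracy, precision = spam_precision, recall = spam_recall)
```

For a spam filter, precision is often worth watching alongside accuracy, because a ham email wrongly flagged as spam is usually the more costly mistake.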

#Conclusion
My RStudio wasn’t working correctly. I have tried to update it many times, but it still isn’t working properly. I will look for technical help to make sure this works properly on my MacBook. I have set up the code I believe is needed and laid out my reasoning, and I want to see if this is the right methodology for this project.


Wilson Chau

2022-11-20