Assignment 2

Kyle Kirkpatrick October 8th, 2020

Building the Corpus

First, 100 text files containing movie reviews were gathered from IMDB. Each review is either positive or negative, depending on whether the reviewer liked or disliked the movie, and this is reflected in the connotation of the words used. The vector space built from these files will be used to distinguish negative reviews from positive ones.

Next, a corpus is created by reading in each of the text files, which will later be represented as vectors. The corpus is cleaned by removing punctuation, stop words, and extra whitespace, and by stemming words to reduce the number of distinct terms. This makes the corpus easier to analyze.

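# loads the libraries used throughout the analysis
library(tm)         # corpus creation and cleaning
library(SnowballC)  # stemming backend for stemDocument
library(dplyr)      # %>% pipe and data manipulation
library(wordcloud)  # word cloud plot
library(caret)      # data partitioning and model training
library(tidytext)   # sentiment lexicons
library(ggplot2)    # plotting

# sets the working folder
folder <- "C:/Users/kylek/OneDrive/Desktop/College/CST-435/textAnalysis"

# builds a corpus with one document per text file using the tm library
# (pattern is a regular expression, so the dot is escaped)
corpus2 <- VCorpus(DirSource(directory = folder, pattern = "\\.txt$"))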

# cleans up the corpus
cleanCorpus <- corpus2 %>%
               tm_map(removePunctuation) %>%
               tm_map(content_transformer(tolower)) %>%
               tm_map(removeWords, c("film", "movie", "ill", stopwords("english"))) %>%
               tm_map(stripWhitespace)


# stems each word so that related forms (e.g. plurals) collapse into one term
stemCorpus <- tm_map(cleanCorpus, stemDocument)
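
To make the cleaning steps concrete, here is a minimal base R sketch that applies the same order of operations (punctuation, lowercase, stop words, whitespace) to one made-up sentence. The `clean_text` helper and the short stop-word list are assumptions for illustration only; the real pipeline uses the tm functions above and the full `stopwords("english")` list.

```r
# Toy review text (made up for illustration, not taken from the IMDB files)
review <- "This movie was GREAT!  I'll watch it again, and again."

# Hypothetical mini stop-word list; tm's stopwords("english") is much longer
stop_words <- c("this", "was", "it", "and", "i", "movie", "ill")

# Hypothetical helper mirroring the order of the tm pipeline
clean_text <- function(text, stop_words) {
  text <- gsub("[[:punct:]]", "", text)            # remove punctuation
  text <- tolower(text)                            # convert to lowercase
  tokens <- strsplit(text, "[[:space:]]+")[[1]]    # split on runs of whitespace
  tokens <- tokens[!(tokens %in% stop_words)]      # drop stop words
  tokens <- tokens[tokens != ""]                   # drop empty tokens
  paste(tokens, collapse = " ")                    # rejoin with single spaces
}

cleaned <- clean_text(review, stop_words)
print(cleaned)
```

Because punctuation is removed first, "I'll" becomes "ill", which would explain why "ill" appears in the list of removed words.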

The word cloud shows up to 100 of the most frequently used words in the corpus, restricted to words that appear at least 10 times. It gives a general visual impression of the words most common across all of the movie reviews.

The corpus is then converted into a document-term matrix, in which each row is a review and each column holds the count of one term.

# outputs a word cloud
wordcloud(stemCorpus, min.freq = 10, max.words = 100, random.order = F, colors = brewer.pal(8, "Set2"))

# converts the corpus into a matrix
Cmatrix <- DocumentTermMatrix(stemCorpus)

# removes sparse terms (terms absent from more than 99.5% of the documents)
CMsparse <- removeSparseTerms(Cmatrix, 0.995)

# makes the matrix into a data frame for future manipulation
cleanDF <- as.data.frame(as.matrix(CMsparse))

# outputs the counts of the 50 most frequently used terms
frequent <- colSums(cleanDF)
frequent <- sort(frequent, decreasing = TRUE)
frequent[1:50]

##       one      like   charact      just       see     great    realli      make 
##        93        82        75        63        60        58        58        57 
##     scene     stori      play      good      time      look      well      much 
##        57        57        56        55        54        50        45        43 
##       act       can       get     actor     watch      even       way   costner 
##        42        42        42        41        41        40        40        38 
##       end   perform     think      know     music      role      mani     never 
##        38        36        36        35        35        34        33        33 
##      juli   kutcher     littl      will      also      film      movi      seem 
##        32        32        32        32        31        31        31        31 
##      seen       two      made       bad      dont      love       tri      work 
##        31        30        29        28        28        28        28        28 
##      come courtenay 
##        27        27
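
The matrix and sparsity steps can be sketched on three made-up documents in base R. This is only an illustration of the idea behind `DocumentTermMatrix` and `removeSparseTerms`, not their actual implementation; the documents and the 0.5 threshold are assumptions chosen for the toy data.

```r
# Three toy documents, already cleaned and stemmed (made up for illustration)
docs <- c(d1 = "great stori great act",
          d2 = "bad stori",
          d3 = "great scene")

tokens <- strsplit(docs, " ")
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per term; each cell holds a term count
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab

# Total count of each term across all documents, most frequent first
term_totals <- sort(colSums(dtm), decreasing = TRUE)

# Sparsity of a term: the share of documents in which it does not appear
sparsity <- colMeans(dtm == 0)

# Keep only terms whose sparsity is below the threshold
kept <- dtm[, sparsity < 0.5, drop = FALSE]

print(dtm)
print(colnames(kept))
```

By the same arithmetic, with 100 documents a term that occurs in only one review has a sparsity of 0.99, so a threshold of 0.995 would be expected to remove few terms, if any.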

Creating the Training/Testing Sets Using Vectors

Each review is hand-tagged as either negative (0) or positive (1) by adding a new vector column to the matrix. A training set and a testing set are then created by splitting the data randomly, with 75% of the reviews used for training.

# hand tag each of the reviews
# (the first 50 reviews are negative, the last 50 are positive)
cleanDF$handtag <- rep(c(0, 1), each = 50)

set.seed(123)

# selects 75% of the rows for training, balanced on the hand tags
index <- createDataPartition(cleanDF$handtag, p=0.75, list=FALSE)

# creation of data sets
training <- cleanDF[index,]
testing <- cleanDF[-index,]
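
The idea of a split that keeps both classes balanced can be sketched in base R. This is a simplified stand-in for `createDataPartition`, not a reproduction of it: the sampling within each class and the use of `ceiling` are assumptions, and caret may select different rows.

```r
set.seed(123)

# Toy labels matching the report's layout: 50 negative (0), then 50 positive (1)
labels <- rep(c(0, 1), each = 50)

# Sample 75% of the row indices within each class so both classes stay balanced
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(rows) {
  sample(rows, size = ceiling(0.75 * length(rows)))
}))

# Everything not chosen for training goes to the testing set
test_idx <- setdiff(seq_along(labels), train_idx)

print(table(labels[train_idx]))
print(table(labels[test_idx]))
```

Sampling within each class guarantees that the training set holds the same number of negative and positive reviews, which a plain random sample of rows would not.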

Finding Term Frequency

After the training and testing sets are created, a frequency analysis is performed using the tidytext library. The sentiment categories of the built-in NRC lexicon are used, and the most common words in each category are shown in a chart below.

# retrieves the nrc sentiment categories
nrc <- get_sentiments("nrc")

# creates a table of the most frequent words and joins it to the lexicon
temp_table <- data.frame(word = names(frequent),
                         word_count = frequent) %>%
  inner_join(nrc)

## Joining, by = "word"

temp_table %>% 
  group_by(sentiment) %>%
  top_n(10, word_count) %>%
  ungroup() %>%
  mutate(word = reorder(word,word_count)) %>%
  ggplot(aes(x = word, # creates a plot (number of words vs frequent words used)
             y = word_count, fill = sentiment)) +
  geom_col() +
  facet_wrap(~sentiment, scales = "free")+
  coord_flip() +
  theme(axis.text.y = element_text(size = 7), 
        axis.text.x = element_text(size = 5))

The chart shows the ten most frequent words within each sentiment category. We can predict whether a review will be positive or negative based on the column vectors of the matrix.

The most important categories are ‘positive’ and ‘negative’. The chart shows that the positive words carry nearly double the weight of the negative words, which is interesting since 50 reviews were positive and 50 were negative. For example, the stem ‘good’ was used 55 times, whereas ‘bad’ was used 28 times. One possible explanation is that reviewers use as many strongly positive words as they can when describing a good movie.
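
The join between word counts and a sentiment lexicon can be illustrated with base R's `merge`. The word counts below come from the frequency output earlier in this report, but `toy_lexicon` is a hypothetical stand-in: the real NRC lexicon is far larger and its word-to-sentiment assignments may differ.

```r
# Term counts taken from the frequency output earlier in this report
word_counts <- data.frame(word = c("good", "bad", "love", "great", "music"),
                          word_count = c(55, 28, 28, 58, 35))

# Hypothetical mini lexicon for illustration only (not the real NRC entries)
toy_lexicon <- data.frame(
  word = c("good", "good", "bad", "love", "love", "music"),
  sentiment = c("positive", "joy", "negative", "positive", "joy", "positive"))

# Inner join: keeps only words found in both tables,
# producing one row per word-sentiment pair
joined <- merge(word_counts, toy_lexicon, by = "word")

# Most frequent word within each sentiment category
top_by_sentiment <- do.call(rbind, lapply(split(joined, joined$sentiment),
  function(g) g[which.max(g$word_count), ]))

print(joined)
print(top_by_sentiment)
```

Two properties of the join are worth noting: a word missing from the lexicon (here "great") drops out entirely, and a word tagged with several sentiments is counted once in each of them.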

We can see how many matched words fall into each category by summarizing the table.

# counts the number of matched words in each sentiment category
select(temp_table, sentiment, word_count) %>%
  group_by(sentiment) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 10 x 2
##    sentiment    count
##    <chr>        <int>
##  1 positive       263
##  2 negative       253
##  3 trust          140
##  4 fear           124
##  5 sadness        118
##  6 anticipation   107
##  7 anger          105
##  8 joy             91
##  9 disgust         74
## 10 surprise        70

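The summary shows that the ‘positive’ count (263) is not much greater than the ‘negative’ count (253). This summary counts the distinct matched words in each category, not how often they occur, so it does not contradict the chart: the two categories contain a similar number of distinct words, but the positive words are used more often.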
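
The difference between counting distinct words and counting occurrences can be shown on a small table. This is a sketch with partly made-up data: the counts for "good", "love", and "bad" come from the frequency output earlier in the report, while "poor", "dull", their counts, and all sentiment labels are assumptions.

```r
# Toy joined table (sentiment labels and the last two rows are hypothetical)
toy <- data.frame(
  word = c("good", "love", "bad", "poor", "dull"),
  word_count = c(55, 28, 28, 3, 2),
  sentiment = c("positive", "positive", "negative", "negative", "negative"))

# Number of distinct matched words per sentiment
# (the same quantity that summarise(count = n()) reports)
distinct_words <- table(toy$sentiment)

# Total occurrences per sentiment, weighting each word by its count
occurrences <- tapply(toy$word_count, toy$sentiment, sum)

print(distinct_words)
print(occurrences)
```

In this toy table the negative category has more distinct words but far fewer total occurrences, which is the same kind of gap seen between the summary and the chart.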

Another analysis tool that can be used to predict an outcome is a decision tree. Below is the output of a decision tree (fit with the caret package and plotted with the rattle package) that uses 50 word-count columns (column vectors) from the training dataset as predictors.

# fits a decision tree on training columns 1300 to 1350,
# with the count of the term "good" as the response and the other columns as predictors
modfit <- train(good ~ ., method = "rpart", data = training[, c(1300:1350)])
rattle::fancyRpartPlot(modfit$finalModel)

The diagram above splits the reviews according to the count of each variable (word) within a review. A prediction model could then be created, with the hand tag as the response, to predict whether a review is positive or negative based on which words are present.
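
As a rough illustration of how such a prediction model works, here is a one-split decision stump in base R. It is a deliberately simplified stand-in for an rpart tree, not the model fit above: the data, the `bad_count` column, and the threshold search are all assumptions for illustration.

```r
# Toy data: count of one word stem per review and a hand tag (made up)
toy <- data.frame(bad_count = c(3, 2, 4, 1, 0, 0, 1, 0),
                  handtag   = c(0, 0, 0, 0, 1, 1, 1, 1))

# Try each candidate threshold and count the misclassifications,
# predicting negative (0) when the count is at or above the threshold
thresholds <- sort(unique(toy$bad_count))
errors <- sapply(thresholds, function(th) {
  predicted <- ifelse(toy$bad_count >= th, 0, 1)
  sum(predicted != toy$handtag)
})

# Keep the first threshold with the fewest errors
best_threshold <- thresholds[which.min(errors)]

# Predict with the chosen split and compare against the hand tags
predicted <- ifelse(toy$bad_count >= best_threshold, 0, 1)
confusion <- table(predicted = predicted, actual = toy$handtag)
accuracy <- mean(predicted == toy$handtag)

print(best_threshold)
print(confusion)
print(accuracy)
```

The accuracy here is measured on the same rows used to choose the split. In the report's setting, the held-out testing set would give a fairer estimate.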