Text Analysis

Kyle Kirkpatrick April 29, 2020

Building the Corpus

First, 100 text files containing short movie reviews were gathered from IMDB. Each review is either positive or negative, depending on whether the reviewer liked the movie.

Next, a corpus is created by reading in each of the text files. The corpus is cleaned up by removing unnecessary words and whitespace, and the vocabulary is consolidated by stemming, so that different forms of the same word are counted as one term.

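# loads the packages used throughout the analysis
library(tm)            # corpus handling and document-term matrices
library(wordcloud)     # word cloud
library(RColorBrewer)  # brewer.pal colour palettes
library(caret)         # createDataPartition and train
library(dplyr)         # %>% and the data-manipulation verbs
library(ggplot2)       # plotting
library(tidytext)      # get_sentiments

# sets the working folder
folder <- "C:/Users/kylek/OneDrive/Desktop/College/CST-425/textAnalysis"

# builds the corpus with the tm library from every .txt file in the folder
# (pattern is a regular expression, not a shell glob)
corpus2 <- VCorpus(DirSource(directory = folder, pattern = "\\.txt$"))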

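# cleans up the corpus: strips punctuation, lower-cases the text,
# removes stop words, and collapses repeated whitespace
# ("ill" is presumably listed because "I'll" becomes "ill" once punctuation is stripped)
cleanCorpus <- corpus2 %>%
  tm_map(removePunctuation) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, c("film", "movie", "ill", stopwords("english"))) %>%
  tm_map(stripWhitespace)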


# simplifies the corpus by stemming, so that inflected forms of a word share one term
stemCorpus <- tm_map(cleanCorpus, stemDocument)
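
To make the effect of each cleaning step concrete, here is a minimal base-R sketch that applies comparable transformations to a single sentence. The sentence and the short stop-word list are made up for illustration, and the regular expressions only approximate what the tm functions do.

```r
# Made-up review text (an assumption for illustration, not from the corpus)
review <- "I'll say it: the movie was GREAT,   and the acting was great too!"

# 1. Strip punctuation; note that "I'll" collapses to "Ill"
step1 <- gsub("[[:punct:]]+", "", review)

# 2. Lower-case everything
step2 <- tolower(step1)

# 3. Split on whitespace and drop a small, hypothetical stop-word list
toy_stopwords <- c("film", "movie", "ill", "the", "was", "and", "it", "say", "too")
tokens <- strsplit(step2, "[[:space:]]+")[[1]]
tokens <- tokens[!(tokens %in% toy_stopwords)]

# 4. Re-join with single spaces, which also removes the extra whitespace
cleaned <- paste(tokens, collapse = " ")
print(cleaned)
```

Only the content-bearing words survive, which is the form the stemmer and the later counts work on.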

The word cloud below shows up to one hundred of the most frequent terms in the stemmed corpus, restricted to terms that occur at least ten times. It gives a general visual impression of the most commonly used words.

The corpus is then converted into a document-term matrix so that the most commonly used words can be counted.

# outputs a word cloud
wordcloud(stemCorpus, min.freq = 10, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Set2"))

# converts the corpus into a matrix
Cmatrix <- DocumentTermMatrix(stemCorpus)

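# drops sparse terms, i.e. terms absent from more than 99.5% of the documents
# (with 100 documents, a term used in a single review has sparsity 0.99, so it is kept)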
CMsparse <- removeSparseTerms(Cmatrix, 0.995)

# makes the matrix into a data frame for future manipulation
cleanDF <- as.data.frame(as.matrix(CMsparse))

# outputs the counts of the 50 most frequent terms
frequent <- colSums(cleanDF)
frequent <- sort(frequent, decreasing = TRUE)
frequent[1:50]

##       one      like   charact      just       see     great    realli      make 
##        93        82        75        63        60        58        58        57 
##     scene     stori      play      good      time      look      well      much 
##        57        57        56        55        54        50        45        43 
##       act       can       get     actor     watch      even       way   costner 
##        42        42        42        41        41        40        40        38 
##       end   perform     think      know     music      role      mani     never 
##        38        36        36        35        35        34        33        33 
##      juli   kutcher     littl      will      also      film      movi      seem 
##        32        32        32        32        31        31        31        31 
##      seen       two      made       bad      dont      love       tri      work 
##        31        30        29        28        28        28        28        28 
##      come courtenay 
##        27        27
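
The matrix and frequency steps above can be sketched in base R on a toy example. The three documents below are made up, and the sparsity calculation is an assumption about how the threshold in removeSparseTerms is usually interpreted (the share of documents in which a term does not occur), not a copy of the tm implementation.

```r
# Three made-up, already-cleaned documents (assumptions for illustration)
docs <- c(d1 = "great stori great act",
          d2 = "bad stori",
          d3 = "great scene")

# Tokenise and build a document-by-term count matrix
tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = terms))),
                integer(length(terms))))
colnames(dtm) <- terms

# Total count per term, most frequent first (mirrors colSums + sort)
frequent <- sort(colSums(dtm), decreasing = TRUE)
print(frequent)

# Sparsity of a term = share of documents in which it does not occur
sparsity <- colMeans(dtm == 0)
print(round(sparsity, 2))
```

Each row is a document and each column a term, so column sums give corpus-wide counts. By the same arithmetic, a term found in only one of 100 documents has a sparsity of 0.99.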

Creating the Training/Testing Sets

Each review is hand-tagged as either negative (0) or positive (1) by adding a new column to the data frame. A training set and a testing set are then created by splitting the data randomly, with 75% of the reviews going to training.

# hand tags each of the reviews: the first 50 are negative (0), the last 50 positive (1)
cleanDF$handtag <- rep(c(0, 1), each = 50)

set.seed(123)

# draws the row indices for a 75% training sample based on the hand tags
index <- createDataPartition(cleanDF$handtag, p=0.75, list=FALSE)

# creation of data sets
training <- cleanDF[index,]
testing <- cleanDF[-index,]
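
As a rough illustration of what a class-balanced 75/25 split looks like, the following base-R sketch samples the same share of rows from each class. It is not caret's implementation, and createDataPartition may pick slightly different group sizes; the variable names training_rows and testing_rows are hypothetical.

```r
set.seed(123)

# Labels laid out like the hand tags: 50 negative (0) then 50 positive (1)
handtag <- rep(c(0, 1), each = 50)
p <- 0.75

# Sample the same share of row numbers from each class
index <- unlist(lapply(split(seq_along(handtag), handtag), function(rows) {
  sample(rows, size = ceiling(p * length(rows)))
}))

training_rows <- sort(unname(index))
testing_rows  <- setdiff(seq_along(handtag), training_rows)

# Class balance in each part
print(table(handtag[training_rows]))
print(table(handtag[testing_rows]))
```

Sampling within each class keeps the training and testing sets balanced between negative and positive reviews, which a plain random draw over all 100 rows does not guarantee.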

Sentiment Analysis

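After the training and testing sets are created, a sentiment analysis is performed using the tidytext library. The terms in the frequency table are matched against the NRC sentiment categories, and the most common words in each category are shown in the chart below. Because the frequency table holds stemmed terms, only stems that are also lexicon entries (such as ‘good’ or ‘bad’) are matched; stems such as ‘charact’ or ‘realli’ drop out of the join.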

# retrieves the nrc sentiment categories
nrc <- get_sentiments("nrc")

# creates a table of the most frequent words and their sentiments
temp_table <- data.frame(word = names(frequent), 
             word_count = frequent)%>% 
             inner_join(nrc)

## Joining, by = "word"
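
The join above can be pictured with a small base-R sketch. The counts are copied from the frequency output earlier in this report, but toy_lexicon is a made-up stand-in and not the real NRC data; it only mimics the one-row-per-word-and-sentiment layout.

```r
# A few term counts taken from the frequency output above
counts <- data.frame(word = c("good", "bad", "love", "charact", "realli"),
                     word_count = c(55, 28, 28, 75, 58),
                     stringsAsFactors = FALSE)

# Tiny made-up lexicon; NOT the real NRC data
toy_lexicon <- data.frame(
  word      = c("good", "good", "bad", "love", "love", "character"),
  sentiment = c("positive", "joy", "negative", "positive", "joy", "positive"),
  stringsAsFactors = FALSE)

# merge() with the default all = FALSE behaves like an inner join on "word"
joined <- merge(counts, toy_lexicon, by = "word")
print(joined)

# Stems with no exact lexicon entry are silently dropped
print(setdiff(counts$word, joined$word))
```

A word with several sentiments appears once per sentiment in the result, while a stem such as "charact" is lost even though "character" is in the lexicon.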

temp_table %>% 
  group_by(sentiment) %>% # groups the words by sentiment
  top_n(10, word_count) %>%
  ungroup() %>%
  mutate(word = reorder(word,word_count)) %>%
  ggplot(aes(x = word, # creates a plot (number of words vs frequent words used)
             y = word_count, fill = sentiment)) +
  geom_col() +
  facet_wrap(~sentiment, scales = "free")+
  coord_flip() +
  theme(axis.text.y = element_text(size = 7), 
        axis.text.x = element_text(size = 5))

The ten most frequent words within each sentiment category are shown. Each sentiment is present to some degree. The ‘surprise’ category has the least impact, which may suggest that the reviews mention few plot twists or unexpected endings. The ‘anticipation’ category has a high impact, which suggests that the reviewers were likely excited to watch the movie.

However, the most important sentiments are ‘positive’ and ‘negative’. The chart shows that the positive impact is nearly double the negative impact, which is interesting given that 50 reviews were positive and 50 were negative. The word ‘good’ was used 55 times, whereas the word ‘bad’ was used 28 times. This could be because reviewers pack as many impactful positive words as they can into the review of a good movie.

We can see how many matched terms fall under each sentiment by summarizing the joined table.

# counts how many matched terms fall under each sentiment
select(temp_table, sentiment, word_count) %>%
  group_by(sentiment) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

## # A tibble: 10 x 2
##    sentiment    count
##    <chr>        <int>
##  1 positive       263
##  2 negative       253
##  3 trust          140
##  4 fear           124
##  5 sadness        118
##  6 anticipation   107
##  7 anger          105
##  8 joy             91
##  9 disgust         74
## 10 surprise        70

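The summary shows that the ‘positive’ count (263) is not much greater than the ‘negative’ count (253). This summary counts the distinct matched terms in each category rather than how often those terms occur, which is why the gap is much smaller than in the frequency-based chart. Sentiment analysis is a tool that can show the connotations of key words and how people choose words to make their argument as persuasive as possible.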
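
The difference between counting terms and counting occurrences can be shown with a small base-R sketch. The table below uses illustrative values (the rows for "kill" and "hate" are made up), and the names distinct_words and total_uses are hypothetical.

```r
# Illustrative joined table: one row per (word, sentiment) pair
toy <- data.frame(word = c("good", "love", "bad", "kill", "hate"),
                  sentiment = c("positive", "positive",
                                "negative", "negative", "negative"),
                  word_count = c(55, 28, 28, 5, 4))

# Number of distinct matched words per sentiment (what n() reports)
distinct_words <- table(toy$sentiment)

# Total occurrences per sentiment (word counts summed)
total_uses <- tapply(toy$word_count, toy$sentiment, sum)

print(distinct_words)
print(total_uses)
```

In this toy table the negative category has more distinct words, yet the positive words are used far more often, so the two summaries can point in different directions.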

Another analysis tool that can be used to predict an outcome is a decision tree. Below is a decision tree, fitted with the caret package and plotted with the rattle package, that uses the count of the term ‘good’ as the response and the 50 other terms in columns 1300 to 1350 of the training dataset as predictors.

# fits a decision tree on the training set and plots it
# (response: count of the term "good"; predictors: the other 50 columns in 1300:1350)
modfit <- train(good ~ ., method = "rpart", data = training[, 1300:1350])
rattle::fancyRpartPlot(modfit$finalModel)

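The diagram above splits the reviews according to how often each variable (word) occurs within a review. Because the response here is the count of ‘good’ rather than the hand tag, the tree is only a proxy for sentiment. A prediction model that uses the hand tag as the response could then be created to predict whether a review is positive or negative based on which words are present.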
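
To show how a tree chooses a split on word counts, here is a minimal base-R sketch of a single-split tree (a decision stump) that predicts the hand tag. It is a simplification under stated assumptions: the data are made up, Gini impurity is used as the split criterion, and only one split is searched, whereas rpart grows and prunes a full tree.

```r
# Made-up term counts for eight reviews and their hand tags (1 = positive)
toy <- data.frame(great   = c(0, 0, 1, 0, 2, 3, 1, 2),
                  bad     = c(2, 1, 1, 3, 0, 0, 0, 0),
                  handtag = c(0, 0, 0, 0, 1, 1, 1, 1))

# Gini impurity of a vector of 0/1 labels
gini <- function(y) {
  if (length(y) == 0) return(0)
  p <- mean(y)
  2 * p * (1 - p)
}

# Try every term and threshold; keep the split with the lowest weighted impurity
best <- NULL
for (term in setdiff(names(toy), "handtag")) {
  for (cut in sort(unique(toy[[term]]))) {
    left  <- toy$handtag[toy[[term]] <  cut]
    right <- toy$handtag[toy[[term]] >= cut]
    if (length(left) == 0 || length(right) == 0) next
    score <- (length(left) * gini(left) + length(right) * gini(right)) / nrow(toy)
    if (is.null(best) || score < best$score) {
      best <- list(term = term, cut = cut, score = score)
    }
  }
}

# Predict the majority class on each side of the chosen split
go_right <- toy[[best$term]] >= best$cut
pred <- ifelse(go_right,
               as.numeric(mean(toy$handtag[go_right]) >= 0.5),
               as.numeric(mean(toy$handtag[!go_right]) >= 0.5))

print(best)
print(table(predicted = pred, actual = toy$handtag))
```

In this toy data the search settles on whether "bad" occurs at least once, which separates the two classes completely. On real reviews no single word is likely to do so, which is why a full tree keeps splitting on further terms.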