The code for this is online at https://github.com/luizfelipebrito/Natural-language-processing-and-Text-Mining-with-R

Natural Language Processing and Text Mining with R

Techniques covered in this script:

  • Sentiment Analysis
  • Frequency: tf-idf
  • Tokenization
  • Stemming
  • Word Clouds

What is the main objective? What am I trying to infer?

This dataset consists of reviews of fine foods from Amazon. It allows us to analyze which words are used most frequently in reviews. Furthermore, we can use the tools of text mining to approach the emotional content of text programmatically and infer whether a review is positive or negative, or perhaps characterized by some more nuanced emotion such as surprise or disgust.

Data Set Information:

Amazon Fine Food Reviews: the data span a period of more than 10 years and comprise roughly 500,000 food reviews from Amazon. https://www.kaggle.com/snap/amazon-fine-food-reviews/downloads/amazon-fine-food-reviews.zip/2

1st Step - Clear the workspace

rm(list = ls())   

2nd Step - Clear the console

cat("\014")      

3rd Step - The packages below must be installed (an installation sketch is shown after the list). Once installed, you can comment out that code chunk.

  • dplyr: A Grammar of Data Manipulation
  • ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics
  • tidytext: Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools
  • stringr: Simple, Consistent Wrappers for Common String Operations
  • tidyr: Easily Tidy Data with 'spread()' and 'gather()' Functions
  • wordcloud: Word Clouds
  • reshape2: Flexibly Reshape Data: A Reboot of the Reshape Package
  • hunspell: High-Performance Stemmer, Tokenizer, and Spell Checker
  • SnowballC: Stemmer Based on the C 'libstemmer' UTF-8 Library
  • xtable: Export Tables to LaTeX or HTML
  • knitr: A General-Purpose Package for Dynamic Report Generation in R
  • kableExtra: Construct Complex Tables with 'kable' and Pipe Syntax
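
If the packages are not installed yet, a one-time installation along these lines can be run first (a minimal sketch using the package names listed above):

install.packages(c("dplyr", "ggplot2", "tidytext", "stringr", "tidyr",
                   "wordcloud", "reshape2", "hunspell", "SnowballC",
                   "xtable", "knitr", "kableExtra"))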

4th Step - Load libraries.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidytext)
library(stringr) 
library(tidyr)   
library(wordcloud)
## Loading required package: RColorBrewer
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(hunspell)
library(SnowballC)
library(xtable)
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows

5th Step - Set the working directory.

setwd("D:\\Text_Mining")

6th Step - Read the dataset.

raw_text <- read.csv("Reviews.csv", header = TRUE)

7th Step - Of the attributes below, only three columns are kept: the review id, the summary, and the review text. The columns are renamed to make the exploratory analysis easier.

  1. Id (Row id)
  2. ProductId (Unique identifier for the product)
  3. UserId (Unique identifier for the user)
  4. ProfileName (Profile name of the user)
  5. HelpfulnessNumerator (Number of users who found the review helpful)
  6. HelpfulnessDenominator (Number of users who indicated whether they found the review helpful or not)
  7. Score (Rating between 1 and 5)
  8. Time (Timestamp for the review)
  9. Summary (Brief summary of the review)
  10. Text (Text of the review)

names(raw_text)[names(raw_text) == "Id"]      <- "id_review"
names(raw_text)[names(raw_text) == "Summary"] <- "summary_review"
names(raw_text)[names(raw_text) == "Text"]    <- "text_review"
raw_text <- raw_text %>% select(id_review, summary_review, text_review)

8th Step - Text Preprocessing

The raw reviews contain markup (such as HTML line breaks) and other extra text that we do not want to include in the analysis.

cleaned_text <- raw_text %>%
  filter(str_detect(text_review, "^[^>]+[A-Za-z\\d]") | text_review != "")   # keep only reviews that contain some text

Every raw text dataset will require different steps for data cleaning, which will often involve some trial and error, and exploration on unusual cases in the dataset.

cleaned_text$text_review <- gsub("[_]", "", cleaned_text$text_review)     # remove underscores
cleaned_text$text_review <- gsub("<br />", "", cleaned_text$text_review)  # remove HTML line-break tags

9th Step - Tokenization - We need to break the text into individual tokens.

A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis; tokenization is the process of splitting text into tokens.

text_df <- tibble(id_review = cleaned_text$id_review , text_review = cleaned_text$text_review)
text_df <- text_df %>%  unnest_tokens(word, text_review)
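
To see what unnest_tokens does, here is a minimal sketch on a single made-up review (the sentence is hypothetical, not drawn from the dataset):

toy_review <- tibble(id_review = 1, text_review = "This coffee tastes great!")
toy_review %>% unnest_tokens(word, text_review)
# yields one lowercase token per row: "this", "coffee", "tastes", "great";
# punctuation is dropped and the id_review column is carried along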

10th Step - Stemming Words - After tokenization, we analyze each word by breaking it down into its root (the stem) and its conjugation affix.

getStemLanguages() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
x
danish
dutch
english
finnish
french
german
hungarian
italian
norwegian
porter
portuguese
romanian
russian
spanish
swedish
turkish

We have split each row so that there is one token (word) in each row of the new data frame.

text_df$word <- wordStem(text_df$word,  language = "english")
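
For intuition, here is a minimal sketch of what the Snowball English stemmer does to a few hypothetical words (chosen for illustration, not taken from the pipeline above):

wordStem(c("tastes", "tasted", "flavors", "flavored"), language = "english")
# "tastes" and "tasted" both reduce to "tast"; "flavors" and "flavored" both
# reduce to "flavor", matching the stemmed forms that appear in the word counts later on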

Punctuation has been stripped. The words were converted to lowercase, which makes them easier to compare or combine with other datasets.

head(table(text_df$word)) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
Var1      Freq
0          154
0,45         1
0.0006mg     1
0.035        1
0.05         3
0.09         1

11th Step - Stop Words - Often in text analysis we want to remove stop words, which are words that are not useful for an analysis, typically extremely common words such as "the", "of", and "to" in English.

data(stop_words)
text_df <- text_df %>% 
  anti_join(stop_words, "word")
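
The stop_words object bundled with tidytext combines several stop word lexicons into one tidy data frame; a quick peek (a minimal sketch):

head(stop_words)
# a tibble with two columns, word and lexicon, where lexicon indicates
# which stop word list each entry comes from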

12th Step - We can find the most common words across all the reviews and create a visualization of them.

xtable(head(text_df %>% 
       count(word, sort = TRUE))) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
word         n
tast     15974
flavor   13385
product  11914
love     11598
coffe    11554
tri      10440

Plot_01_word_count

text_df %>% 
  count(word, sort = TRUE) %>% 
  filter(n > 3000) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) + 
  geom_col() + 
  xlab(NULL) + 
  coord_flip()

13th Step - Sentiment Analysis - We can use the tools of text mining to approach the emotional content of text programmatically.

Sentiment_Analysis <- text_df %>% 
  inner_join(get_sentiments("bing"), "word") %>% 
  count(id_review, sentiment) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)

One way to analyze the sentiment of a text is to consider the text as a combination of its individual words, and the sentiment content of the whole text as the sum of the sentiment contents of the individual words.
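
As a minimal sketch of this word-by-word scoring (the three words below are hypothetical, not taken from a real review), a text with one positive and two negative Bing terms gets a score of -1:

toy_words <- tibble(id_review = 1, word = c("good", "bad", "bad"))
toy_words %>% 
  inner_join(get_sentiments("bing"), by = "word") %>% 
  count(id_review, sentiment) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)
# id_review negative positive sentiment
#         1        2        1        -1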

head(Sentiment_Analysis)%>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
id_review  negative  positive  sentiment
        1         3         0         -3
        2         1         0         -1
        3         0         2          2
        5         0         1          1
        6         2         3          1
        7         1         3          2

14th Step - Most Common Positive and Negative Words - Now we can analyze the word counts that contribute to each sentiment.

Sentiment_Analysis_Word_Count <- text_df %>% 
  inner_join(get_sentiments("bing"), "word") %>% 
  count(word, sentiment, sort = TRUE) %>% 
  ungroup()

Plot_02_word_count

Sentiment_Analysis_Word_Count %>% 
  group_by(sentiment) %>% 
  top_n(12, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) + 
  geom_col(show.legend = FALSE) + 
  facet_wrap(~sentiment, scales = "free_y") + 
  labs(y = "Contribution to Sentiment", x = NULL) + 
  coord_flip()

15th Step - Words with the greatest contributions to positive/negative sentiment scores in the reviews.

# Note: newer releases of the AFINN lexicon name its score column "value";
# with those versions, use contribution = sum(value) instead of sum(score).
Sentiment_Analysis_Word_Contribution <- text_df %>% 
  inner_join(get_sentiments("afinn"), by = "word") %>% 
  group_by(word) %>% 
  summarize(occurrences = n(), contribution = sum(score))

Plot_03_word_contribution

Sentiment_Analysis_Word_Contribution %>% 
  top_n(50, abs(contribution)) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(word, contribution, fill = contribution > 0)) + 
  geom_col(show.legend = FALSE) + 
  coord_flip()

16th Step - Word Clouds

Plot_04_word_cloud

text_df %>% 
  anti_join(stop_words, "word") %>%
  count(word) %>% 
  with(wordcloud(word, n, max.words = 100))

Plot_05_word_cloud

text_df %>% 
  inner_join(get_sentiments("bing"), "word") %>%
  count(word, sentiment, sort = TRUE) %>% 
  acast(word ~ sentiment, value.var = "n", fill = 0) %>% 
  comparison.cloud(colors = c("gray20", "gray80"), max.words = 100)

17th Step - tf-idf - The statistic tf-idf is intended to measure how important a word is to a document in a collection (corpus) of documents.

Term frequency (tf) is one measure of how important a word may be: how frequently it occurs in a document. Inverse document frequency (idf) decreases the weight of commonly used words and increases the weight of words that occur rarely across a collection of documents. Multiplying the two, tf-idf attempts to find the words that are important in a text but not too common overall.
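
Concretely, for a term t in document d, tf(t, d) is the count of t in d divided by the total number of words in d; idf(t) = ln(number of documents / number of documents containing t); and tf-idf is their product. A minimal two-document sketch (the toy corpus below is made up for illustration):

toy_corpus <- tibble(document = c("a", "a", "a", "b", "b"),
                     word     = c("coffee", "coffee", "tea", "tea", "tea"))
toy_corpus %>% 
  count(document, word) %>% 
  bind_tf_idf(word, document, n)
# "tea" appears in both documents, so idf = ln(2/2) = 0 and its tf-idf is 0;
# "coffee" appears only in document a: tf = 2/3, idf = ln(2/1), tf-idf ~ 0.46

Note that the code below treats the whole set of reviews as a single document named "Review", so idf is 0 for every word and the final plot effectively ranks words by term frequency.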

term_frequency_review <- text_df %>% count(word, sort = TRUE)
term_frequency_review$total_words <- as.numeric(term_frequency_review %>% summarize(total = sum(n)))
term_frequency_review$document <- as.character("Review")
term_frequency_review <- term_frequency_review %>% 
  bind_tf_idf(word, document, n)

Plot_06_tf_idf

term_frequency_review %>% 
  arrange(desc(tf)) %>% 
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(document) %>% 
  top_n(15, tf) %>% 
  ungroup() %>% 
  ggplot(aes(word, tf, fill = document)) +   # with a single document, idf = 0, so rank by tf
  geom_col(show.legend = FALSE) + 
  labs(x = NULL, y = "term frequency (tf)") + 
  facet_wrap(~document, ncol = 2, scales = "free") + 
  coord_flip()