knitr::opts_chunk$set(warning=FALSE, message=FALSE)

Introduction

To perform natural language processing and text mining, I have selected the “100K Coursera’s Course Reviews Dataset” from Kaggle. The CSV file is named reviews_by_course.csv and contains 140,321 rows and 3 columns. The Kaggle page is https://www.kaggle.com/septa97/100k-courseras-course-reviews-dataset.

Field Description

  • CourseId: The name of the course (course tag)
  • Review: The review text for the course
  • Label: The rating given for each course review

Load Packages

Load the necessary packages required for this analysis.

library(widyr)
library(textdata)
library(readr)
library(tidytext) 
library(stringr) 
library(tidyverse)
library(data.table)
library(knitr)
library(kableExtra)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(wordcloud)
library(reshape2)
library(hunspell)
library(SnowballC)
library(xtable)
library(NLP)
library(tm)
library(broom)

Reading Data

The data from the CSV file is read into the coursera data frame.

coursera<-as.data.frame(fread("reviews_by_course.csv"))
coursera <- as_tibble(coursera)
x <- head(coursera, n = 10)
kable(x)%>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
CourseId Review Label
2-speed-it BOring 1
2-speed-it Bravo ! 5
2-speed-it Very goo 5
2-speed-it Great course - I recommend it for all, especially IT and Business Managers! 5
2-speed-it One of the most useful course on IT Management! 5
2-speed-it I was disappointed because the name is misleading. The course provides a good introduction & overview of the responsibilities of the CTO, but has very little specifically digital content. It deals with two-speed IT in a single short lecture, so of course the treatment is superficial. It is easy to find more in-depth material freely available, on the McKinsey website for example. 3
2-speed-it Super content. I’ll definitely re-do the course 5
2-speed-it Etant contrôleur de gestion pour le département IT (HQ + Locale), le cours est vraiment intéressant et de très bonne qualité.J’insiste que la qualité et le professionnalisme des professeurs.I’m a controller for an IT department, the courses is very good and very helpful for my job. I recommand you to follow the training. 5
2-speed-it One of the excellent courses at Coursera for information technology bosses and managers. 5
2-speed-it Is there any reason why you should not apply the course by BCG?)It’s content is pretty unique and includes a high level analysis and a wide range of knowledge needed to cover all detailed aspects.Best regards,Oleg Serov 5

Data Preprocessing

Removing all unwanted special characters from the review text.

# Keep only letters, digits and spaces
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)
coursera$Review <- sapply(coursera$Review, removeSpecialChars)

# Transliterate any remaining non-ASCII characters to ASCII
coursera$Review <- iconv(coursera$Review, from = 'UTF-8', to = 'ASCII//TRANSLIT')
# Additional clean-up (largely redundant after removeSpecialChars, kept as a safety net)
coursera$Review <- gsub("!", "", coursera$Review)
coursera$Review <- gsub("[_]", "", coursera$Review)
coursera$Review <- gsub("<br />", "", coursera$Review)
head(coursera$Review,10) %>% kable()%>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
x
BOring BOring
Bravo ! Bravo
Very goo Very goo
Great course - I recommend it for all, especially IT and Business Managers! Great course I recommend it for all especially IT and Business Managers
One of the most useful course on IT Management! One of the most useful course on IT Management
I was disappointed because the name is misleading. The course provides a good introduction & overview of the responsibilities of the CTO, but has very little specifically digital content. It deals with two-speed IT in a single short lecture, so of course the treatment is superficial. It is easy to find more in-depth material freely available, on the McKinsey website for example. I was disappointed because the name is misleading The course provides a good introduction overview of the responsibilities of the CTO but has very little specifically digital content It deals with twospeed IT in a single short lecture so of course the treatment is superficial It is easy to find more indepth material freely available on the McKinsey website for example
Super content. I’ll definitely re-do the course Super content Ill definitely redo the course
Etant contrôleur de gestion pour le département IT (HQ + Locale), le cours est vraiment intéressant et de très bonne qualité.J’insiste que la qualité et le professionnalisme des professeurs.I’m a controller for an IT department, the courses is very good and very helpful for my job. I recommand you to follow the training. Etant contrleur de gestion pour le dpartement IT HQ Locale le cours est vraiment intressant et de trs bonne qualitJinsiste que la qualit et le professionnalisme des professeursIm a controller for an IT department the courses is very good and very helpful for my job I recommand you to follow the training
One of the excellent courses at Coursera for information technology bosses and managers. One of the excellent courses at Coursera for information technology bosses and managers
Is there any reason why you should not apply the course by BCG?)It’s content is pretty unique and includes a high level analysis and a wide range of knowledge needed to cover all detailed aspects.Best regards,Oleg Serov Is there any reason why you should not apply the course by BCGIts content is pretty unique and includes a high level analysis and a wide range of knowledge needed to cover all detailed aspectsBest regardsOleg Serov

Tokenization

A token is a meaningful unit of text (most often a word) that can be used for further text analysis. Tokenization is the process of splitting sentences into words (tokens).
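
As a minimal sketch of what unnest_tokens() does (using a made-up sentence rather than the dataset), note that it also lowercases tokens and strips punctuation by default:

library(tidytext)  # already loaded above
library(tibble)
toy <- tibble(id = 1, text = "Great course - I recommend it!")
toy %>% unnest_tokens(word, text)
# expected tokens: great, course, i, recommend, it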

tokens_df <- coursera %>%  unnest_tokens(word, Review)
head(tokens_df,5) %>% kable()%>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
CourseId Label word
2-speed-it 1 boring
2-speed-it 5 bravo
2-speed-it 5 very
2-speed-it 5 goo
2-speed-it 5 great

Stemming Words

After tokenization, we analyze each word by reducing it to its root form (stemming), stripping the conjugation and inflection affixes.
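
As a quick illustration (toy words rather than the pipeline data), SnowballC::wordStem() reduces words to their stems; the stems in the comment match the stemmed tokens shown later in this report:

library(SnowballC)  # already loaded above
# e.g. "courses" -> "cours", "learning" -> "learn", "really" -> "realli", "boring" -> "bore"
wordStem(c("courses", "learning", "really", "boring"), language = "english")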

getStemLanguages() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
x
danish
dutch
english
finnish
french
german
hungarian
italian
norwegian
porter
portuguese
romanian
russian
spanish
swedish
turkish
tokens_df$word <- wordStem(tokens_df$word,  language = "english")

Punctuation has been removed and the tokens converted to lowercase by unnest_tokens(). A first look at the token frequency table shows that numeric tokens are still present:

head(table(tokens_df$word)) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
Var1 Freq
0 78
00 3
0007364 1
01 2
010 1
0137 1

Removing Stopwords

Stopwords are words that are not useful for text analysis, so it is essential to remove them before performing any analysis. Examples of stopwords are ‘to’, ‘a’, ‘of’, and ‘the’.

get_stopwords()
## # A tibble: 175 x 2
##    word      lexicon 
##    <chr>     <chr>   
##  1 i         snowball
##  2 me        snowball
##  3 my        snowball
##  4 myself    snowball
##  5 we        snowball
##  6 our       snowball
##  7 ours      snowball
##  8 ourselves snowball
##  9 you       snowball
## 10 your      snowball
## # ... with 165 more rows
tokens_df <- tokens_df %>%  anti_join(get_stopwords(),"word")
head(tokens_df,5) %>% kable()%>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
CourseId Label word
2-speed-it 1 bore
2-speed-it 5 bravo
2-speed-it 5 veri
2-speed-it 5 goo
2-speed-it 5 great

Removing Numbers

Removing the numbers, which are not needed for text analysis.

nums <- tokens_df %>%   filter(str_detect(word, "^[0-9]")) %>%   select(word) %>% unique() 
head(nums) %>% kable() 
word
2dcadautocad
3d
3ds
101
3
1
tokens_df <- tokens_df %>%   anti_join(nums, by = "word")

Removing Rare words

Removing the words that do not occur often. There are almost 54,000 unique words:

length(unique(tokens_df$word))
## [1] 53824

But most of these words appear only rarely, as the histogram of word frequencies below shows.

tokens_df %>%
  count(word, sort = T) %>%
  rename(word_freq = n) %>%
  ggplot(aes(x = word_freq)) +
  geom_histogram(aes(y = ..count..), color = "black", fill = "blue", alpha = 0.3) +
  scale_x_continuous(breaks = c(0:5, 10, 100, 500, 10e3), trans = "log1p", expand = c(0, 0)) +
  scale_y_continuous(breaks = c(0, 100, 1000, 5e3, 10e3, 5e4, 10e4, 4e4), expand = c(0, 0)) +
  theme_bw()

So it makes sense to remove rare words to improve the performance of the text analytics. Here we remove words that have fewer than 10 appearances.

rare <- tokens_df %>%   count(word) %>%  filter(n<10) %>%  select(word) %>% unique()
head(rare) %>% kable() 
word
a1
a2
a3
a4
a65
aa
tokens_df <- tokens_df %>%   filter(!word %in% rare$word) 
length(unique(tokens_df$word))
## [1] 6644

Most common words

Here we find the most common words across all the reviews.

xtable(head(tokens_df %>% 
              count(word, sort = TRUE))) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
word n
cours 90511
veri 38573
great 26875
learn 25410
good 22942
realli 14010

Visualization 1: Most Common Words

The visualization below gives an idea of the most frequently used words across the reviews. We can see that cours (the stem of “course”) is by far the most common word, with more than 90,000 occurrences.

tokens_df %>% 
  count(word, sort = TRUE) %>% 
  filter(n > 5000) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) + 
  geom_col() + 
  xlab(NULL) + 
  coord_flip()

Sentiment Analysis

Sentiment analysis is typically performed using a lexicon of sentiment keywords. There are three such sentiment lexicons available through tidytext (a quick lookup example follows the list):

  • The nrc lexicon: words and their sentiment categories
  • The bing lexicon: words and their polarity (negative or positive)
  • The AFINN lexicon: words and their numeric sentiment scores
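
As a minimal illustration (a single lookup, not part of the pipeline below), the same word can be checked against each lexicon; note that nrc and bing store a sentiment column, while afinn stores a numeric value column:

get_sentiments("bing")  %>% filter(word == "great")   # polarity: positive
get_sentiments("afinn") %>% filter(word == "great")   # numeric value: 3
get_sentiments("nrc")   %>% filter(word == "great")   # zero or more rows, one per category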

sent_reviews <- tokens_df %>%
  left_join(get_sentiments("nrc")) %>%
  rename(nrc = sentiment) %>%
  left_join(get_sentiments("bing")) %>%
  rename(bing = sentiment) %>%
  left_join(get_sentiments("afinn")) %>%
  rename(afinn = value)
head(sent_reviews) %>% kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
CourseId Label word nrc bing afinn
2-speed-it 1 bore negative negative -2
2-speed-it 5 bravo NA positive NA
2-speed-it 5 veri NA NA NA
2-speed-it 5 goo disgust NA NA
2-speed-it 5 goo negative NA NA
2-speed-it 5 great NA positive 3

Using Bing to find the emotional content of text

Sentiment_Analysis <- tokens_df %>% 
  inner_join(get_sentiments("bing"), "word") %>% 
  count(CourseId, sentiment) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)

One way to analyze the sentiment of a text is to treat the text as a combination of its individual words, and the sentiment of the whole text as the sum of the sentiment contributions of those individual words. For example, in the first row below, the course 2-speed-it has 27 positive and 5 negative word occurrences, giving a net sentiment of 27 - 5 = 22.

head(Sentiment_Analysis)%>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
CourseId negative positive sentiment
2-speed-it 5 27 22
20cnwm 0 1 1
2d-cad 0 4 4
3d-cad 0 3 3
3d-printing 0 1 1
3d-printing-applications 4 56 52

Visualization 2: Most Common Positive and Negative Words Based on Sentiment

The visualization below shows the top 10 positive and negative words according to the bing sentiment lexicon. We can see that great is the most frequent positive word, while poor appears among the top negative words.

Sentiment_Analysis_Word_Count <- tokens_df %>% 
  inner_join(get_sentiments("bing"), "word") %>% 
  count(word, sentiment, sort = TRUE) %>% 
  ungroup()

Sentiment_Analysis_Word_Count %>% 
  group_by(sentiment) %>% 
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) + 
  geom_col(show.legend = FALSE) + 
  facet_wrap(~sentiment, scales = "free_y") + 
  labs(y = "Contribution to Sentiment", x = NULL) + 
  coord_flip()

Visualization 3: Words with the greatest contributions to positive/negative sentiment scores in the reviews

From the visualization below we can infer that good carries positive sentiment and occurs very frequently across the review text; in the same way, problem carries negative sentiment and is among the most frequent negative words.

bing_word_counts <- sent_reviews %>%  filter(!is.na(bing)) %>%  count(word, bing, sort = TRUE) 
bing_word_counts
## # A tibble: 431 x 3
##    word      bing          n
##    <chr>     <chr>     <int>
##  1 good      positive 114710
##  2 excel     positive  69025
##  3 great     positive  26875
##  4 enjoy     positive  24692
##  5 recommend positive  15278
##  6 thank     positive  13237
##  7 love      positive  11244
##  8 well      positive  10493
##  9 like      positive   9426
## 10 fun       positive   8157
## # ... with 421 more rows
bing_word_counts %>%
  filter(n > 800) %>%
  mutate(n = ifelse(bing == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = bing)) +
  geom_col() +
  coord_flip() +
  labs(y = "Contribution to sentiment")

Bi-grams

A bigram is an n-gram with n = 2; it is simply a pair of consecutive words.
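
As a minimal sketch (using a made-up sentence, not the dataset), unnest_tokens() with token = "ngrams" and n = 2 builds the pairs of consecutive words:

tibble(id = 1, text = "great course really enjoyed the material") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# expected bigrams: "great course", "course really", "really enjoyed",
# "enjoyed the", "the material"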

bigrams <- tokens_df %>%  unnest_tokens(bigram, word,token = "ngrams", n = 2) 
bigrams %>% select(bigram)
## # A tibble: 1,682,589 x 1
##    bigram           
##    <chr>            
##  1 <NA>             
##  2 cours doe        
##  3 doe say          
##  4 say anyth        
##  5 anyth digit      
##  6 digit core       
##  7 core subject     
##  8 subject digit    
##  9 digit wave       
## 10 disappoint becaus
## # ... with 1,682,579 more rows
head(bigrams)%>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
CourseId Label bigram
2-speed-it 1 NA
2-speed-it 2 cours doe
2-speed-it 2 doe say
2-speed-it 2 say anyth
2-speed-it 2 anyth digit
2-speed-it 2 digit core

Removing stop words in bigrams

bigrams_separated <- bigrams %>%  separate(bigram, c("word1", "word2"), sep = " ") 
bigrams_filtered <- bigrams_separated %>%  filter(!word1 %in% stop_words$word) %>%  filter(!word2 %in% stop_words$word)
bigrams_filtered %>%   count(word1, word2, sort = TRUE) 
## # A tibble: 346,393 x 3
##    word1     word2          n
##    <chr>     <chr>      <int>
##  1 cours     veri        6103
##  2 excel     cours       5365
##  3 machin    learn       3460
##  4 learn     lot         3069
##  5 excelent  curso       2621
##  6 cours     realli      2396
##  7 recommend cours       2074
##  8 easi      understand  1930
##  9 cours     excel       1875
## 10 realli    enjoy       1853
## # ... with 346,383 more rows

Word correlation

Word correlation measures how strongly two words tend to appear in reviews of the same course (computed here with widyr::pairwise_cor() on word co-occurrence within CourseId). To reduce the complexity of the computation, uncommon words are removed first.

uncommon <- tokens_df %>%
  count(word) %>%
  filter(n<1000) %>% #remove uncommon words
  # < 1000 reviews
  select(word) %>% distinct()

word_cor = tokens_df %>%
  filter(!word %in% uncommon$word) %>%
  widyr::pairwise_cor(word, CourseId) %>%
  filter(!is.na(correlation),
         correlation > .25)

head(word_cor) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
item1 item2 correlation
great veri 0.5384340
cours veri 0.6239834
recommend veri 0.4976756
especi veri 0.3282563
manag veri 0.2502092
one veri 0.4856276
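
As a usage example (an illustrative query, not part of the original analysis), we can look up the words most strongly correlated with a given term, e.g. learn:

word_cor %>%
  filter(item1 == "learn") %>%   # keep pairs whose first word is "learn"
  arrange(desc(correlation)) %>%
  head()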

Document term matrix

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.

word_counts_by_course_id <- tokens_df %>%  group_by(CourseId) %>%  count(word, sort = TRUE)
review_dtm <- word_counts_by_course_id %>%  cast_dtm(CourseId, word, n) 
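
As a quick sanity check (the exact numbers depend on the preprocessing above), the resulting object is a DocumentTermMatrix with one row per course and one column per term:

dim(review_dtm)   # rows = courses (documents), columns = terms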

Topic modeling

Topic models are algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents. Latent Dirichlet Allocation (LDA) is a particularly popular method for fitting a topic model.

I fit an LDA model with 5 topics and display the top 10 terms for each topic.

library(topicmodels)
lda5 <- LDA(review_dtm, k = 5, control = list(seed = 1234)) 
terms(lda5, 10)
##       Topic 1    Topic 2      Topic 3  Topic 4  Topic 5  
##  [1,] "de"       "cours"      "cours"  "de"     "cours"  
##  [2,] "curso"    "veri"       "learn"  "et"     "veri"   
##  [3,] "y"        "great"      "veri"   "cour"   "good"   
##  [4,] "que"      "interest"   "great"  "trs"    "great"  
##  [5,] "muy"      "good"       "help"   "les"    "learn"  
##  [6,] "excelent" "excel"      "use"    "la"     "use"    
##  [7,] "el"       "inform"     "good"   "le"     "realli" 
##  [8,] "la"       "thank"      "thank"  "des"    "excel"  
##  [9,] "en"       "realli"     "realli" "un"     "program"
## [10,] "para"     "understand" "lot"    "negoti" "assign"

For each topic-term combination, the model estimates beta, the probability of that term being generated from that topic.

lda5_betas <- broom::tidy(lda5) 
head(lda5_betas) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
topic term beta
1 cours 0.0007422
2 cours 0.0609491
3 cours 0.0655209
4 cours 0.0039582
5 cours 0.0629040
1 learn 0.0001602
top_terms_in_topics <- lda5_betas %>%  group_by(topic) %>%  top_n(5, beta) %>%  ungroup() %>%  arrange(topic, -beta)
head(top_terms_in_topics) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
topic term beta
1 de 0.0476018
1 curso 0.0420936
1 y 0.0385821
1 que 0.0320571
1 muy 0.0316005
2 cours 0.0609491

TF-IDF

Term frequency (tf) is one measure of how important a word may be: how frequently the word occurs in a document. Inverse document frequency (idf) decreases the weight of commonly used words and increases the weight of words that are rarely used across a collection of documents; it is computed as idf(term) = ln(n_documents / n_documents containing the term). The statistic tf-idf (the two quantities multiplied together) attempts to find words that are important in a text but not too common, and so measures how important a word is to a document within a collection of documents.

term_frequency_review <- tokens_df %>% count(word, sort = TRUE)

term_frequency_review$total_words <- as.numeric(term_frequency_review %>% summarize(total = sum(n)))

term_frequency_review$document <- as.character("Review")

term_frequency_review <- term_frequency_review %>% 
  bind_tf_idf(word, document, n)
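
Note that the code above treats the entire review corpus as a single document (“Review”), so idf is zero for every term and tf-idf is zero as well; the visualization below therefore effectively shows plain term frequency. As an alternative sketch (my own variation, not part of the original pipeline), each CourseId could be treated as a separate document to obtain non-trivial tf-idf values:

# Alternative (illustrative): one document per course, so idf varies across terms
course_tf_idf <- tokens_df %>%
  count(CourseId, word, sort = TRUE) %>%
  bind_tf_idf(word, CourseId, n) %>%
  arrange(desc(tf_idf))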

Visualization 4: TF-IDF

The plot below shows the 15 most important words by term frequency, with cours at the top of the chart.

term_frequency_review %>% 
  arrange(desc(tf)) %>% 
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(document) %>% 
  top_n(15, tf) %>% 
  ungroup() %>% 
  ggplot(aes(word, tf, fill = document)) + 
  geom_col(show.legend = FALSE) + 
  labs(x = NULL, y = "term frequency (tf)") + 
  facet_wrap(~document, ncol = 2, scales = "free") + 
  coord_flip()