knitr::opts_chunk$set(warning=FALSE, message=FALSE)
For Natural Language Processing and text mining I have selected the “100K Coursera’s Course Reviews Dataset” from Kaggle. The CSV file is named reviews_by_course.csv and contains 140,321 rows and 3 columns. The Kaggle page is at “https://www.kaggle.com/septa97/100k-courseras-course-reviews-dataset”.
Load the necessary packages required for this assignment.
library(widyr)
library(textdata)
library(readr)
library(tidytext)
library(stringr)
library(tidyverse)
library(data.table)
library(knitr)
library(kableExtra)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(wordcloud)
library(reshape2)
library(hunspell)
library(SnowballC)
library(xtable)
library(NLP)
library(tm)
library(broom)
The data from the CSV file is read into the coursera data frame and converted to a tibble.
coursera<-as.data.frame(fread("reviews_by_course.csv"))
coursera <- as_tibble(coursera)
x <- head(coursera, n = 10)
kable(x)%>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| CourseId | Review | Label |
|---|---|---|
| 2-speed-it | BOring | 1 |
| 2-speed-it | Bravo ! | 5 |
| 2-speed-it | Very goo | 5 |
| 2-speed-it | Great course - I recommend it for all, especially IT and Business Managers! | 5 |
| 2-speed-it | One of the most useful course on IT Management! | 5 |
| 2-speed-it | I was disappointed because the name is misleading. The course provides a good introduction & overview of the responsibilities of the CTO, but has very little specifically digital content. It deals with two-speed IT in a single short lecture, so of course the treatment is superficial. It is easy to find more in-depth material freely available, on the McKinsey website for example. | 3 |
| 2-speed-it | Super content. I’ll definitely re-do the course | 5 |
| 2-speed-it | Etant contrôleur de gestion pour le département IT (HQ + Locale), le cours est vraiment intéressant et de très bonne qualité.J’insiste que la qualité et le professionnalisme des professeurs.I’m a controller for an IT department, the courses is very good and very helpful for my job. I recommand you to follow the training. | 5 |
| 2-speed-it | One of the excellent courses at Coursera for information technology bosses and managers. | 5 |
| 2-speed-it | Is there any reason why you should not apply the course by BCG?)It’s content is pretty unique and includes a high level analysis and a wide range of knowledge needed to cover all detailed aspects.Best regards,Oleg Serov | 5 |
Removing all unwanted special characters from the Review text.
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x) # keep only letters, digits and spaces
coursera$Review <- sapply(coursera$Review, removeSpecialChars)
coursera$Review <- iconv(coursera$Review, from = 'UTF-8', to = 'ASCII//TRANSLIT') # transliterate any remaining non-ASCII characters
coursera$Review <- gsub("!", "", coursera$Review)
coursera$Review <- gsub("[_]", "", coursera$Review)
coursera$Review <- gsub("<br />", "", coursera$Review) # drop leftover HTML line breaks
head(coursera$Review,10) %>% kable()%>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| Original Review | Cleaned Review |
|---|---|
| BOring | BOring |
| Bravo ! | Bravo |
| Very goo | Very goo |
| Great course - I recommend it for all, especially IT and Business Managers! | Great course I recommend it for all especially IT and Business Managers |
| One of the most useful course on IT Management! | One of the most useful course on IT Management |
| I was disappointed because the name is misleading. The course provides a good introduction & overview of the responsibilities of the CTO, but has very little specifically digital content. It deals with two-speed IT in a single short lecture, so of course the treatment is superficial. It is easy to find more in-depth material freely available, on the McKinsey website for example. | I was disappointed because the name is misleading The course provides a good introduction overview of the responsibilities of the CTO but has very little specifically digital content It deals with twospeed IT in a single short lecture so of course the treatment is superficial It is easy to find more indepth material freely available on the McKinsey website for example |
| Super content. I’ll definitely re-do the course | Super content Ill definitely redo the course |
| Etant contrôleur de gestion pour le département IT (HQ + Locale), le cours est vraiment intéressant et de très bonne qualité.J’insiste que la qualité et le professionnalisme des professeurs.I’m a controller for an IT department, the courses is very good and very helpful for my job. I recommand you to follow the training. | Etant contrleur de gestion pour le dpartement IT HQ Locale le cours est vraiment intressant et de trs bonne qualitJinsiste que la qualit et le professionnalisme des professeursIm a controller for an IT department the courses is very good and very helpful for my job I recommand you to follow the training |
| One of the excellent courses at Coursera for information technology bosses and managers. | One of the excellent courses at Coursera for information technology bosses and managers |
| Is there any reason why you should not apply the course by BCG?)It’s content is pretty unique and includes a high level analysis and a wide range of knowledge needed to cover all detailed aspects.Best regards,Oleg Serov | Is there any reason why you should not apply the course by BCGIts content is pretty unique and includes a high level analysis and a wide range of knowledge needed to cover all detailed aspectsBest regardsOleg Serov |
A token is a meaningful unit of text (most often a word) that can be used for further text analysis. Tokenization is the process of splitting sentences into words (tokens).
tokens_df <- coursera %>% unnest_tokens(word, Review)
head(tokens_df,5) %>% kable()%>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| CourseId | Label | word |
|---|---|---|
| 2-speed-it | 1 | boring |
| 2-speed-it | 5 | bravo |
| 2-speed-it | 5 | very |
| 2-speed-it | 5 | goo |
| 2-speed-it | 5 | great |
After tokenization, we analyze each word by reducing it to its root (stemming) and stripping its conjugation affixes.
getStemLanguages() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
| Language |
|---|
| danish |
| dutch |
| english |
| finnish |
| french |
| german |
| hungarian |
| italian |
| norwegian |
| porter |
| portuguese |
| romanian |
| russian |
| spanish |
| swedish |
| turkish |
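Before stemming the full token list, a minimal sketch (using the wordStem() function from the SnowballC package loaded above) shows how a few sample words are reduced to their stems:
wordStem(c("courses", "learning", "really", "boring", "very"), language = "english")
## Expected stems under the Snowball English rules (consistent with the stemmed tokens shown later): "cours" "learn" "realli" "bore" "veri"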
tokens_df$word <- wordStem(tokens_df$word, language = "english")
head(table(tokens_df$word)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
| Var1 | Freq |
|---|---|
| 0 | 78 |
| 00 | 3 |
| 0007364 | 1 |
| 01 | 2 |
| 010 | 1 |
| 0137 | 1 |
Stopwords are words that carry little value for text analysis, so it is essential to remove them before performing any analysis. Examples of stopwords are ‘to’, ‘a’, ‘of’ and ‘the’.
get_stopwords()
## # A tibble: 175 x 2
## word lexicon
## <chr> <chr>
## 1 i snowball
## 2 me snowball
## 3 my snowball
## 4 myself snowball
## 5 we snowball
## 6 our snowball
## 7 ours snowball
## 8 ourselves snowball
## 9 you snowball
## 10 your snowball
## # ... with 165 more rows
tokens_df <- tokens_df %>% anti_join(get_stopwords(),"word")
head(tokens_df,5) %>% kable()%>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| CourseId | Label | word |
|---|---|---|
| 2-speed-it | 1 | bore |
| 2-speed-it | 5 | bravo |
| 2-speed-it | 5 | veri |
| 2-speed-it | 5 | goo |
| 2-speed-it | 5 | great |
Removing tokens that begin with a number, since they are not needed for text analysis.
nums <- tokens_df %>% filter(str_detect(word, "^[0-9]")) %>% select(word) %>% unique()
head(nums) %>% kable()
| word |
|---|
| 2dcadautocad |
| 3d |
| 3ds |
| 101 |
| 3 |
| 1 |
tokens_df <- tokens_df %>% anti_join(nums, by = "word")
Next, remove words that rarely occur. There are nearly 54,000 unique words in total.
length(unique(tokens_df$word))
## [1] 53824
Most of these words appear only rarely, as the histogram below shows.
tokens_df %>%
count(word, sort = T) %>%
rename(word_freq = n) %>%
ggplot(aes(x = word_freq)) +
geom_histogram(aes(y = ..count..), color = "black", fill = "blue", alpha = 0.3) +
scale_x_continuous(breaks = c(0:5, 10, 100, 500, 10e3), trans = "log1p", expand = c(0, 0)) +
scale_y_continuous(breaks = c(0, 100, 1000, 5e3, 10e3, 5e4, 10e4, 4e4), expand = c(0, 0)) +
theme_bw()
It therefore makes sense to remove rare words to improve the performance of the text analytics. Here we remove words with fewer than 10 appearances.
rare <- tokens_df %>% count(word) %>% filter(n<10) %>% select(word) %>% unique()
head(rare) %>% kable()
| word |
|---|
| a1 |
| a2 |
| a3 |
| a4 |
| a65 |
| aa |
tokens_df <- tokens_df %>% filter(!word %in% rare$word)
length(unique(tokens_df$word))
## [1] 6644
Here we find the most common words across all the reviews.
xtable(head(tokens_df %>%
count(word, sort = TRUE))) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
| word | n |
|---|---|
| cours | 90511 |
| veri | 38573 |
| great | 26875 |
| learn | 25410 |
| good | 22942 |
| realli | 14010 |
The visualization below shows the most frequently used words across the reviews. We can see that cours (the stem of “course”) is by far the most common word, with over 90,000 occurrences as the table above shows.
tokens_df %>%
count(word, sort = TRUE) %>%
filter(n > 5000) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
Sentiment analysis is typically performed with a lexicon of sentiment keywords. tidytext provides access to three such lexicons:
- The nrc lexicon: words and their sentiment categories
- The bing lexicon: words and their polarity (negative or positive)
- The afinn lexicon: words and their numeric sentiment scores
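As a quick, illustrative sketch of how the three lexicons differ, the same word can be looked up in each (the nrc and afinn lexicons may prompt for a one-time download via the textdata package):
get_sentiments("nrc") %>% filter(word == "love") # sentiment categories such as joy and positive
get_sentiments("bing") %>% filter(word == "love") # polarity: positive or negative
get_sentiments("afinn") %>% filter(word == "love") # numeric score on roughly a -5 to +5 scale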
sent_reviews <- tokens_df %>%
left_join(get_sentiments("nrc")) %>%
rename(nrc = sentiment) %>%
left_join(get_sentiments("bing")) %>%
rename(bing = sentiment) %>%
left_join(get_sentiments("afinn")) %>%
rename(afinn = value)
head(sent_reviews) %>% kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
| CourseId | Label | word | nrc | bing | afinn |
|---|---|---|---|---|---|
| 2-speed-it | 1 | bore | negative | negative | -2 |
| 2-speed-it | 5 | bravo | NA | positive | NA |
| 2-speed-it | 5 | veri | NA | NA | NA |
| 2-speed-it | 5 | goo | disgust | NA | NA |
| 2-speed-it | 5 | goo | negative | NA | NA |
| 2-speed-it | 5 | great | NA | positive | 3 |
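Aggregating the bing polarities per course gives a simple net sentiment score for each course: the number of positive words minus the number of negative words.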
Sentiment_Analysis <- tokens_df %>%
inner_join(get_sentiments("bing"), "word") %>%
count(CourseId, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
head(Sentiment_Analysis)%>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
| CourseId | negative | positive | sentiment |
|---|---|---|---|
| 2-speed-it | 5 | 27 | 22 |
| 20cnwm | 0 | 1 | 1 |
| 2d-cad | 0 | 4 | 4 |
| 3d-cad | 0 | 3 | 3 |
| 3d-printing | 0 | 1 | 1 |
| 3d-printing-applications | 4 | 56 | 52 |
The visualization below shows the top 10 positive and negative words according to the bing lexicon. We can see that great is the top positive word and poor the top negative word.
Sentiment_Analysis_Word_Count <- tokens_df %>%
inner_join(get_sentiments("bing"), "word") %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
Sentiment_Analysis_Word_Count %>%
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to Sentiment", x = NULL) +
coord_flip()
From the visualization below we can infer that good is the positive word with the highest number of occurrences in the review text; likewise, problem is a negative word that occurs frequently.
bing_word_counts <- sent_reviews %>% filter(!is.na(bing)) %>% count(word, bing, sort = TRUE)
bing_word_counts
## # A tibble: 431 x 3
## word bing n
## <chr> <chr> <int>
## 1 good positive 114710
## 2 excel positive 69025
## 3 great positive 26875
## 4 enjoy positive 24692
## 5 recommend positive 15278
## 6 thank positive 13237
## 7 love positive 11244
## 8 well positive 10493
## 9 like positive 9426
## 10 fun positive 8157
## # ... with 421 more rows
bing_word_counts %>% filter(n > 800) %>% mutate(n = ifelse(bing == "negative", -n, n)) %>% mutate(word = reorder(word, n)) %>% ggplot(aes(word, n, fill = bing)) + geom_col() + coord_flip() + labs(y = "Contribution to sentiment")
A bigram is an n-gram with n = 2: a pair of consecutively occurring words.
bigrams <- tokens_df %>% unnest_tokens(bigram, word,token = "ngrams", n = 2)
bigrams %>% select(bigram)
## # A tibble: 1,682,589 x 1
## bigram
## <chr>
## 1 <NA>
## 2 cours doe
## 3 doe say
## 4 say anyth
## 5 anyth digit
## 6 digit core
## 7 core subject
## 8 subject digit
## 9 digit wave
## 10 disappoint becaus
## # ... with 1,682,579 more rows
head(bigrams)%>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
| CourseId | Label | bigram |
|---|---|---|
| 2-speed-it | 1 | NA |
| 2-speed-it | 2 | cours doe |
| 2-speed-it | 2 | doe say |
| 2-speed-it | 2 | say anyth |
| 2-speed-it | 2 | anyth digit |
| 2-speed-it | 2 | digit core |
bigrams_separated <- bigrams %>% separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>% filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word)
bigrams_filtered %>% count(word1, word2, sort = TRUE)
## # A tibble: 346,393 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 cours veri 6103
## 2 excel cours 5365
## 3 machin learn 3460
## 4 learn lot 3069
## 5 excelent curso 2621
## 6 cours realli 2396
## 7 recommend cours 2074
## 8 easi understand 1930
## 9 cours excel 1875
## 10 realli enjoy 1853
## # ... with 346,383 more rows
To reduce complexity, uncommon words (those with fewer than 1,000 occurrences) are removed before computing pairwise word correlations. widyr::pairwise_cor() then measures the correlation (phi coefficient) between pairs of words based on how often they co-occur within the same course.
uncommon <- tokens_df %>%
count(word) %>%
filter(n < 1000) %>% # words appearing fewer than 1,000 times are treated as uncommon
select(word) %>% distinct()
word_cor = tokens_df %>%
filter(!word %in% uncommon$word) %>%
widyr::pairwise_cor(word, CourseId) %>%
filter(!is.na(correlation),
correlation > .25)
head(word_cor) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
| item1 | item2 | correlation |
|---|---|---|
| great | veri | 0.5384340 |
| cours | veri | 0.6239834 |
| recommend | veri | 0.4976756 |
| especi | veri | 0.3282563 |
| manag | veri | 0.2502092 |
| one | veri | 0.4856276 |
A document-term matrix describes the frequency of terms occurring in a collection of documents: rows correspond to documents in the collection and columns correspond to terms (a term-document matrix is simply its transpose). Here each CourseId is treated as a document.
word_counts_by_course_id <- tokens_df %>% group_by(CourseId) %>% count(word, sort = TRUE)
review_dtm <- word_counts_by_course_id %>% cast_dtm(CourseId, word, n)
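Printing the resulting object summarises its dimensions and sparsity, and a small corner of the matrix can be viewed with inspect() from the tm package loaded above (a sketch only; the exact dimensions depend on the filtered vocabulary):
review_dtm # number of documents, number of terms and sparsity
inspect(review_dtm[1:5, 1:8]) # counts for the first few courses and terms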
Topic models are algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents. Latent Dirichlet Allocation (LDA) is a particularly popular method for fitting a topic model. Here we fit a five-topic model to the review document-term matrix.
library(topicmodels)
lda5 <- LDA(review_dtm, k = 5, control = list(seed = 1234))
terms(lda5, 10)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "de" "cours" "cours" "de" "cours"
## [2,] "curso" "veri" "learn" "et" "veri"
## [3,] "y" "great" "veri" "cour" "good"
## [4,] "que" "interest" "great" "trs" "great"
## [5,] "muy" "good" "help" "les" "learn"
## [6,] "excelent" "excel" "use" "la" "use"
## [7,] "el" "inform" "good" "le" "realli"
## [8,] "la" "thank" "thank" "des" "excel"
## [9,] "en" "realli" "realli" "un" "program"
## [10,] "para" "understand" "lot" "negoti" "assign"
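Each topic is characterised by a probability distribution over words. These per-topic word probabilities (the beta values) can be extracted in tidy form with broom: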
lda5_betas <- broom::tidy(lda5)
head(lda5_betas) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
| topic | term | beta |
|---|---|---|
| 1 | cours | 0.0007422 |
| 2 | cours | 0.0609491 |
| 3 | cours | 0.0655209 |
| 4 | cours | 0.0039582 |
| 5 | cours | 0.0629040 |
| 1 | learn | 0.0001602 |
top_terms_in_topics <- lda5_betas %>% group_by(topic) %>% top_n(5, beta) %>% ungroup() %>% arrange(topic, -beta)
head(top_terms_in_topics) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left")
| topic | term | beta |
|---|---|---|
| 1 | de | 0.0476018 |
| 1 | curso | 0.0420936 |
| 1 | y | 0.0385821 |
| 1 | que | 0.0320571 |
| 1 | muy | 0.0316005 |
| 2 | cours | 0.0609491 |
Term frequency (tf) measures how frequently a word occurs in a document and is one indicator of how important the word may be. Inverse document frequency (idf) decreases the weight of commonly used words and increases the weight of words that are rarely used across a collection of documents. The statistic tf-idf (the two quantities multiplied together) attempts to find words that are important in a text but not too common, and so measures how important a word is to a document within a collection of documents.
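In formula terms (a sketch following the natural-log convention used by tidytext's bind_tf_idf()), for a term t in document d within a collection of N documents:

$$\mathrm{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \times \ln\frac{N}{n_t}$$

where tf(t, d) is the proportion of tokens in d that are t, and n_t is the number of documents containing t.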
term_frequency_review <- tokens_df %>% count(word, sort = TRUE)
term_frequency_review$total_words <- as.numeric(term_frequency_review %>% summarize(total = sum(n)))
term_frequency_review$document <- as.character("Review")
term_frequency_review <- term_frequency_review %>%
bind_tf_idf(word, document, n)
The plot below shows the 15 words with the highest term frequency, with cours at the top of the chart. Note that because all reviews are combined into a single document here, the idf component is zero for every word, so the ranking is driven entirely by term frequency.
term_frequency_review %>%
arrange(desc(tf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(document) %>%
top_n(15, tf) %>%
ungroup() %>%
ggplot(aes(word, tf, fill = document)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "term frequency (tf)") +
facet_wrap(~document, ncol = 2, scales = "free") +
coord_flip()