Acknowledgements

We would like to thank Dr. Karolina Kuligowska of the University of Warsaw for teaching the course “Text Mining & Social Media Mining”, and Mr. Roshan Sharma for scraping the Coursera course reviews data set and sharing it on the Kaggle platform. This project could not have been completed without their valuable contributions.

1. Introduction

Since the bulk of data generated today is unstructured text, it is important that organizations find ways to manage and analyze it so that they can act on it and make informed business decisions. Online course platforms, which are becoming more and more important in closing gaps in education, also have to deal with a large amount of text data every day. They need to understand what customers think about their courses and what needs to be improved. Coursera is one of the earliest and best-known online course providers in the world. In this project, we apply techniques from the course “Text Mining & Social Media Mining” to the sentiment analysis of Coursera course reviews.

2. Purposes of the project

In this project, we focus on the following purposes:

  • Applying sentiment analysis and text categorization techniques from the course “Text Mining & Social Media Mining” to classify course review data.
  • Investigating patterns in positive and negative reviews.
  • Analyzing which course content should be improved, based on information extracted from the course id and sentiment analysis.

3. Main assumptions

In this project, we apply the following assumptions:

  • Customers rate courses from 1 to 5. We divided reviews into 2 groups: (1) the positive or neutral group includes reviews with a rating greater than or equal to 3, and (2) the negative group includes reviews with a rating less than 3.
  • The course content information is partially reflected in the course id (for example, the course “3D Printing Applications”, provided by the University of Illinois at Urbana-Champaign, has the course id “3d-printing-applications” in the data set).
  • Courses with fewer than 10 reviews are assumed to have too few reviews to draw conclusions from. These courses are not included in the project.
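
The first two assumptions can be illustrated with a minimal base R sketch. The ratings vector is made up for illustration, and splitting the course id on hyphens is only an approximation of the tokenization applied later in the project.

```r
# Toy ratings, not taken from the data set.
ratings <- c(1, 2, 3, 4, 5)

# Assumption 1: ratings of 3 or more are neutral or positive, the rest negative.
sentiment <- ifelse(ratings >= 3, "neutral_positive", "negative")

# Assumption 2: the course id carries content words, separated by hyphens.
course_id <- "3d-printing-applications"
course_words <- strsplit(course_id, "-", fixed = TRUE)[[1]]

print(sentiment)
print(course_words)
```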

4. Description of the data set

We use the Coursera course reviews data set shared by Mr. Roshan Sharma on the Kaggle platform. The data set can be found at: https://www.kaggle.com/roshansharma/coursera-course-reviews/data. The data set contains 140320 reviews of 1835 Coursera courses, scraped in May 2017. Each review row has 3 fields:

  • CourseId: The id of courses in URL, in string format
  • Review: The learner’s reviews about course, in string format
  • Label: The learner’s rating in range (1-5), numeric format

CourseId | Review | Label
2-speed-it | BOring | 1
2-speed-it | Bravo ! | 5
2-speed-it | Very goo | 5
2-speed-it | Great course - I recommend it for all, especially IT and Business Managers! | 5
2-speed-it | One of the most useful course on IT Management! | 5
2-speed-it | I was disappointed because the name is misleading. The course provides a good introduction & overview of the responsibilities of the CTO, but has very little specifically digital content. It deals with two-speed IT in a single short lecture, so of course the treatment is superficial. It is easy to find more in-depth material freely available, on the McKinsey website for example. | 3

5. Preparation of data for modeling

a. Data processing

We read the data from the CSV input file and performed the following steps:

  • Remove all reviews containing non-ASCII characters
  • Remove all courses with fewer than 10 reviews
  • Add columns for the review id within each course and the review id in the whole data set
  • Add a column to classify learners’ sentiment into 2 groups: ‘neutral_positive’ for labels from 3 to 5 and ‘negative’ for labels from 1 to 2
  • Add a sentiment encoding column: 1 for labels from 3 to 5 and 0 for labels from 1 to 2
  • Convert the review text to lowercase
  • Remove all non-alphabetic characters from reviews
  • Strip extra white space in reviews
  • Remove English, French and Spanish stop words
  • View the word cloud after the previous step and choose additional stop words for course reviews: “course”, “learn”, “learning”
  • Add a column of stemmed reviews for training the machine learning models

The processed data is saved in the data frame reviews_df.

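# Packages used in this chunk
library(stringi)    # stri_enc_isascii
library(tidyverse)  # dplyr verbs and pipes
library(tm)         # stripWhitespace, removeWords, stopwords, stemDocument

# Keep only ASCII reviews and courses with 10+ reviews; add a review id within each course
reviews$IsASCII <- stri_enc_isascii(reviews$Review)
reviews_df <- reviews %>% 
  filter(IsASCII == TRUE) %>%
  group_by(CourseId) %>% 
  mutate(NoReview = n()) %>% 
  filter(NoReview >= 10) %>% 
  mutate(IdInCourse = row_number()) %>% 
  ungroup() %>%   # drop the grouping so that the later Id = row_number() is global
  select(-c(IsASCII, NoReview))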

# Add learner's sentiment based on rating
reviews_df <- reviews_df %>% 
  mutate(Sentiment = case_when(Label >= 3 ~ "neutral_positive",
                               Label < 3 ~ "negative",
                               TRUE ~ "neutral_positive"),
         SentimentEncode = case_when(Label >= 3 ~ 1,
                                     Label < 3 ~ 0,
                                     TRUE ~ 1),
         Id = row_number())

# Remove non-alphabet characters, strip extra white spaces, remove stop words
clean <- function(text){
  return (gsub("[^a-z[:space:]]","",tolower(text)))
}
reviews_df$Review <- sapply(reviews_df$Review, clean)
reviews_df$Review <- stripWhitespace(reviews_df$Review)
reviews_df$Review <- removeWords(reviews_df$Review, stopwords("english"))
reviews_df$Review <- removeWords(reviews_df$Review, stopwords("spanish"))
reviews_df$Review <- removeWords(reviews_df$Review, stopwords("french"))
reviews_df$Review <- removeWords(reviews_df$Review, c("course","learn","learning"))

# Add a stemmed review column
reviews_df$ReviewStem <- stemDocument(reviews_df$Review)
  • We also split the review column into word tokens and saved them in the data frame reviews_words.
  • The word tokens after processing are visualized in a word cloud.
# Split reviews into word tokens
reviews_words <- reviews_df %>% 
  select(-c(Label, ReviewStem)) %>% 
  group_by(CourseId, IdInCourse) %>% 
  ungroup() %>%
  unnest_tokens(word, Review)

# Visualize reviews in a word cloud
reviews_words %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

b. Train test split

We split the data set into 2 sets, a training set and a test set, by row id. The test set consists of a randomly selected 20% of the reviews.

set.seed(284)
n_reviews <- dim(reviews_df)[1]
ids <- c(1:n_reviews)
train_id <- sample(ids, round(n_reviews*0.8,0), replace = F)
test_id <- ids[!ids %in% train_id]
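
As a quick sanity check of this kind of split, the following sketch repeats the same steps on a hypothetical data set of 100 rows (the size is an assumption chosen for illustration) and verifies that the two sets are disjoint and together cover every row.

```r
# Hypothetical small data set size, used only for illustration.
n_reviews <- 100
set.seed(284)
ids <- seq_len(n_reviews)
train_id <- sample(ids, round(n_reviews * 0.8), replace = FALSE)
test_id <- setdiff(ids, train_id)

# The two sets should not overlap and should cover every row exactly once.
stopifnot(length(intersect(train_id, test_id)) == 0)
stopifnot(setequal(c(train_id, test_id), ids))

c(train = length(train_id), test = length(test_id))
```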

6. Modeling

6.1. Rule-based approaches

6.1.1. Afinn

We joined the data frame including word tokens (reviews_words) with the Afinn dictionary to get the sentiment value of word tokens. Then, we summed up sentiment value for each review id and course id.

  • A review whose total sentiment value is greater than or equal to 0 is classified as a neutral or positive one
  • A review whose total sentiment value is less than 0 is classified as a negative review
  • Because positive or neutral reviews dominate the data set, a review that has no word token in the Afinn dictionary is classified as a neutral or positive review

The result is saved in the data frame reviews_sentiment_afinn.
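
As a toy-scale illustration of the classification rule described in this section, the sketch below scores three short reviews in base R. The mini-lexicon and its scores are hypothetical and are not taken from the Afinn dictionary; the helper names score_review and classify_review are also introduced only for this example.

```r
# Hypothetical mini-lexicon; the scores are made up for illustration and
# are not taken from the Afinn dictionary.
lexicon <- c(great = 3, useful = 2, boring = -3, misleading = -2)

score_review <- function(review) {
  tokens <- strsplit(tolower(review), "[^a-z]+")[[1]]
  values <- lexicon[tokens]          # NA for tokens outside the lexicon
  sum(values, na.rm = TRUE)          # no matching token gives a total of 0
}

classify_review <- function(review) {
  if (score_review(review) >= 0) "neutral_positive" else "negative"
}

reviews_toy <- c("Great and useful", "Boring and misleading", "Bravo")
labels_toy <- sapply(reviews_toy, classify_review, USE.NAMES = FALSE)
print(labels_toy)
```

The third review contains no lexicon word, so its total is 0 and it falls into the neutral or positive group, mirroring the third rule above.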

# Inner join the word token data frame with the Afinn dictionary to get the sentiment score of each word
reviews_keywords_sentiment_afinn <- reviews_words %>% 
  inner_join(get_sentiments("afinn")) %>%
  group_by(CourseId, IdInCourse) %>%
  summarise(Sentiment_afinn_score = sum(value))%>% 
  mutate(Sentiment_afinn = case_when(
    Sentiment_afinn_score >= 0 ~ "neutral_positive",
    Sentiment_afinn_score < 0 ~ "negative",
    TRUE ~ "neutral_positive"))

# Left join reviews_df with reviews_keywords_sentiment_afinn; reviews with no words in the Afinn dictionary are classified as neutral or positive, with a total sentiment score of 0
reviews_sentiment_afinn <- reviews_df %>%
  left_join(reviews_keywords_sentiment_afinn, by = c('CourseId','IdInCourse')) %>% 
  select(-c(ReviewStem)) %>% 
  mutate(Sentiment_afinn = replace_na(Sentiment_afinn, 'neutral_positive')) %>% 
  mutate(Sentiment_afinn_score = replace_na(Sentiment_afinn_score, 0))

# View the reviews_sentiment_afinn data frame
kbl(head(reviews_sentiment_afinn,4) %>%  select(Id, IdInCourse, CourseId, Review, Label, Sentiment, Sentiment_afinn)) %>%
  kable_paper(full_width = F) %>%
  column_spec(3, width = "12em")

Id | IdInCourse | CourseId | Review | Label | Sentiment | Sentiment_afinn
1 | 1 | 2-speed-it | boring | 1 | negative | negative
2 | 2 | 2-speed-it | bravo | 5 | neutral_positive | neutral_positive
3 | 3 | 2-speed-it | goo | 5 | neutral_positive | neutral_positive
4 | 4 | 2-speed-it | great recommend especially business managers | 5 | neutral_positive | neutral_positive

We created the Confusion matrix on the training data set. The balanced accuracy of this method is 0.64453.

cm_afinn_train <- caret::confusionMatrix(as.factor(reviews_sentiment_afinn$Sentiment_afinn[train_id]), as.factor(reviews_sentiment_afinn$Sentiment[train_id]))

cm_afinn_train
## Confusion Matrix and Statistics
## 
##                   Reference
## Prediction         negative neutral_positive
##   negative             1156             1518
##   neutral_positive     2633            93122
##                                           
##                Accuracy : 0.9578          
##                  95% CI : (0.9566, 0.9591)
##     No Information Rate : 0.9615          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3366          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.30509         
##             Specificity : 0.98396         
##          Pos Pred Value : 0.43231         
##          Neg Pred Value : 0.97250         
##              Prevalence : 0.03849         
##          Detection Rate : 0.01174         
##    Detection Prevalence : 0.02717         
##       Balanced Accuracy : 0.64453         
##                                           
##        'Positive' Class : negative        
## 
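
The balanced accuracy reported above can be recomputed by hand from the four cells of the confusion matrix. The sketch below does this in base R, with ‘negative’ as the positive class as in the caret output; the variable names are chosen for this example only.

```r
# Counts copied from the Afinn confusion matrix above.
tp <- 1156   # predicted negative, actually negative
fp <- 1518   # predicted negative, actually neutral_positive
fn <- 2633   # predicted neutral_positive, actually negative
tn <- 93122  # predicted neutral_positive, actually neutral_positive

sensitivity <- tp / (tp + fn)   # share of negative reviews that were found
specificity <- tn / (tn + fp)   # share of neutral/positive reviews kept
accuracy <- (tp + tn) / (tp + fp + fn + tn)
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced_accuracy = balanced_accuracy), 5)
```

Because balanced accuracy averages the two per-class rates, the low sensitivity pulls it well below the plain accuracy.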

6.1.2. Bing

We joined the data frame including word tokens (reviews_words) with the Bing dictionary to get the sentiment value of word tokens. Then, we subtracted total number of negative tokens from total number of positive tokens to get the Bing sentiment value for each review id and course id.

  • A review whose total sentiment value is greater than or equal to 0 is classified as a neutral or positive one
  • A review whose total sentiment value is less than 0 is classified as a negative review
  • Because positive or neutral reviews dominate the data set, a review that has no word token in the Bing dictionary is classified as a neutral or positive review

The result is saved in the data frame reviews_sentiment_bing.

# Inner join the word token data frame with the Bing dictionary to get the sentiment of each word
reviews_keywords_sentiment_bing <- reviews_words %>% 
  inner_join(get_sentiments("bing")) %>%
  count(CourseId, IdInCourse, sentiment) %>% 
  spread(sentiment, n, fill = 0) %>%
  mutate(Sentiment_bing_score = positive - negative) %>% 
  mutate(Sentiment_bing = case_when(
    Sentiment_bing_score >= 0 ~ "neutral_positive",
    Sentiment_bing_score < 0 ~ "negative",
    TRUE ~ "neutral_positive")) %>% 
  select(-c(positive, negative))

# Left join reviews_df with reviews_keywords_sentiment_bing; reviews with no words in the Bing dictionary are classified as neutral or positive, with a total sentiment score of 0
reviews_sentiment_bing <- reviews_df %>%
  left_join(reviews_keywords_sentiment_bing, by = c('CourseId','IdInCourse')) %>% 
  select(-c(ReviewStem)) %>% 
  mutate(Sentiment_bing = replace_na(Sentiment_bing, 'neutral_positive')) %>% 
  mutate(Sentiment_bing_score = replace_na(Sentiment_bing_score, 0)) 

# View the reviews_sentiment_bing data frame
kbl(head(reviews_sentiment_bing,4) %>%  select(Id, IdInCourse, CourseId, Review, Label, Sentiment, Sentiment_bing)) %>%
  kable_paper(full_width = F) %>%
  column_spec(3, width = "12em")

Id | IdInCourse | CourseId | Review | Label | Sentiment | Sentiment_bing
1 | 1 | 2-speed-it | boring | 1 | negative | negative
2 | 2 | 2-speed-it | bravo | 5 | neutral_positive | neutral_positive
3 | 3 | 2-speed-it | goo | 5 | neutral_positive | neutral_positive
4 | 4 | 2-speed-it | great recommend especially business managers | 5 | neutral_positive | neutral_positive

We created the Confusion matrix on the training data set. The balanced accuracy of this method is 0.65547.

cm_bing_train <- caret::confusionMatrix(as.factor(reviews_sentiment_bing$Sentiment_bing[train_id]), as.factor(reviews_sentiment_bing$Sentiment[train_id]))

cm_bing_train
## Confusion Matrix and Statistics
## 
##                   Reference
## Prediction         negative neutral_positive
##   negative             1251             1820
##   neutral_positive     2538            92820
##                                          
##                Accuracy : 0.9557         
##                  95% CI : (0.9544, 0.957)
##     No Information Rate : 0.9615         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.342          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.33017        
##             Specificity : 0.98077        
##          Pos Pred Value : 0.40736        
##          Neg Pred Value : 0.97338        
##              Prevalence : 0.03849        
##          Detection Rate : 0.01271        
##    Detection Prevalence : 0.03120        
##       Balanced Accuracy : 0.65547        
##                                          
##        'Positive' Class : negative       
## 

6.2. Automatic approaches


Text vectorization
We vectorized the stemmed reviews of the whole data set into a document-term matrix, using the hashing trick with 2^5 = 32 features and unigrams only.

# Generate word tokenizer
tok_fun <- word_tokenizer

# Iterating tokenizing over Stemmed reviews 
it <- itoken(reviews_df$ReviewStem, 
             tokenizer = tok_fun, progressbar = FALSE)

# Create the vocabulary
vocab <- create_vocabulary(it)

# Vectorize stemmed reviews
h_vectorizer <- hash_vectorizer(hash_size = 2 ^ 5, ngram = c(1L, 1L))
dtm <- create_dtm(it, h_vectorizer)

# Train and test features, labels from Encoded Sentiment and the Document-Term Matrix
train_labels <- as.array(reviews_df$SentimentEncode[train_id])
test_labels <- as.array(reviews_df$SentimentEncode[test_id])
train_features <- as.matrix(dtm[train_id,])
test_features <- as.matrix(dtm[test_id,])
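
The idea behind the hashing trick used above can be sketched in a few lines of base R. The hash function toy_hash below is a deliberately simple stand-in written for this example; it is not the hash function used by text2vec, and hash_vectorize is likewise a hypothetical helper.

```r
# Toy hash for illustration only; it is not the hash function text2vec uses.
toy_hash <- function(token, hash_size) {
  sum(utf8ToInt(token)) %% hash_size + 1   # bucket index in 1..hash_size
}

# Count tokens per bucket, giving a fixed-length feature vector.
hash_vectorize <- function(tokens, hash_size = 2^5) {
  counts <- numeric(hash_size)
  for (token in tokens) {
    bucket <- toy_hash(token, hash_size)
    counts[bucket] <- counts[bucket] + 1
  }
  counts
}

tokens <- c("great", "teacher", "great", "content")
vec <- hash_vectorize(tokens)
length(vec)   # always hash_size, whatever the vocabulary size
sum(vec)      # equals the number of tokens

# Different tokens can share a bucket (a collision); anagrams always do here.
toy_hash("act", 2^5) == toy_hash("cat", 2^5)
```

With only 32 buckets, many distinct words are likely to share a column, which may limit how much a model can learn from the resulting features.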

6.2.1. Naive Bayes

We trained the Naive Bayes model:

modelNB <- naiveBayes(x = train_features, y = as.factor(train_labels))
summary(modelNB)
##           Length Class  Mode     
## apriori    2     table  numeric  
## tables    32     -none- list     
## levels     2     -none- character
## isnumeric 32     -none- logical  
## call       3     -none- call

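We created the confusion matrix on the training data set. Note that train_labels already contains only the training rows, so it must not be indexed by train_id a second time: doing so misaligns labels and predictions and turns part of the labels into NA. The output below was produced with the doubly indexed labels (it covers only 78673 of the 98429 training reviews), so the reported balanced accuracy of 0.49372 should be treated as unreliable until the corrected call is re-run.

prediction_NB_train <- predict(modelNB, as.matrix(dtm[train_id,]))

# Compare predictions with the training labels (already subset by train_id)
cm_NB_train <- caret::confusionMatrix(as.factor(prediction_NB_train), as.factor(train_labels))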
cm_NB_train
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0   281  7994
##          1  2737 67661
##                                          
##                Accuracy : 0.8636         
##                  95% CI : (0.8612, 0.866)
##     No Information Rate : 0.9616         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : -0.0068        
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.093108       
##             Specificity : 0.894336       
##          Pos Pred Value : 0.033958       
##          Neg Pred Value : 0.961121       
##              Prevalence : 0.038361       
##          Detection Rate : 0.003572       
##    Detection Prevalence : 0.105182       
##       Balanced Accuracy : 0.493722       
##                                          
##        'Positive' Class : 0              
## 

6.2.2. Neural Network

We used a neural network with 2 hidden layers.

  • The first layer uses the ‘relu’ activation function and has 128 nodes.
  • The second layer also uses the ‘relu’ activation function and has 64 nodes.
  • We used a dropout rate of 30% after each hidden layer.
model_NN <- keras_model_sequential()
## Loaded Tensorflow version 2.7.0
model_NN %>%
  layer_dense(units = 128, activation = 'relu',input_shape = c(2^5)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 2, activation = 'softmax')

summary(model_NN)
## Model: "sequential"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  dense_2 (Dense)                    (None, 128)                     4224        
##                                                                                 
##  dropout_1 (Dropout)                (None, 128)                     0           
##                                                                                 
##  dense_1 (Dense)                    (None, 64)                      8256        
##                                                                                 
##  dropout (Dropout)                  (None, 64)                      0           
##                                                                                 
##  dense (Dense)                      (None, 2)                       130         
##                                                                                 
## ================================================================================
## Total params: 12,610
## Trainable params: 12,610
## Non-trainable params: 0
## ________________________________________________________________________________
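
The parameter counts in the summary above can be verified by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers have no parameters. The helper dense_params is introduced only for this check.

```r
# A dense layer has (inputs * units) weights plus one bias per unit.
dense_params <- function(n_inputs, n_units) n_inputs * n_units + n_units

layer_params <- c(
  hidden_1 = dense_params(2^5, 128),  # 32 hashed features into 128 nodes
  hidden_2 = dense_params(128, 64),
  output   = dense_params(64, 2)
)
layer_params
sum(layer_params)
```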

We compiled the model with the ‘adam’ optimizer and the ‘sparse_categorical_crossentropy’ loss function, and measured performance by ‘accuracy’.

model_NN %>% compile(
  optimizer = 'adam', 
  loss = 'sparse_categorical_crossentropy',
  metrics = c('accuracy')
)

We trained the model for 10 epochs. The validation set is 20% of the training set.

history <- model_NN %>% keras::fit(train_features, train_labels, epochs = 10, validation_split = 0.2, verbose = T)
plot(history)
## `geom_smooth()` using formula 'y ~ x'

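In the loss history chart, we can see that the model starts to over-fit as the number of epochs increases. We then predicted on the training set and calculated the confusion matrix. As train_labels is already restricted to the training rows, it should not be indexed by train_id again; the output below was produced with the doubly indexed labels, so it covers only 78673 of the 98429 training reviews and should be re-generated with the corrected call. The reported balanced accuracy on the training set is 0.5000: the network predicts the majority class for almost every review.

prediction_NN_train <- predict(model_NN, train_features) 
prediction_NN_train <- argmax(prediction_NN_train) - 1

# Compare predictions with the training labels (already subset by train_id)
cm_NN_train <- caret::confusionMatrix(as.factor(prediction_NN_train), as.factor(train_labels))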
cm_NN_train
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0     0     7
##          1  3018 75648
##                                           
##                Accuracy : 0.9615          
##                  95% CI : (0.9602, 0.9629)
##     No Information Rate : 0.9616          
##     P-Value [Acc > NIR] : 0.5565          
##                                           
##                   Kappa : -2e-04          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.000e+00       
##             Specificity : 9.999e-01       
##          Pos Pred Value : 0.000e+00       
##          Neg Pred Value : 9.616e-01       
##              Prevalence : 3.836e-02       
##          Detection Rate : 0.000e+00       
##    Detection Prevalence : 8.898e-05       
##       Balanced Accuracy : 5.000e-01       
##                                           
##        'Positive' Class : 0               
## 

7. Evaluation

Predict on test set

predict_afinn_test <- reviews_sentiment_afinn$Sentiment_afinn[test_id]
predict_bing_test <-  reviews_sentiment_bing$Sentiment_bing[test_id]
predict_NB_test <- predict(modelNB, as.matrix(dtm[test_id,]))
predict_NN_test <- argmax(predict(model_NN, test_features)) - 1

Calculate the confusion matrix on test set

cm_afinn_test <- caret::confusionMatrix(as.factor(predict_afinn_test), 
                                        as.factor(reviews_sentiment_afinn$Sentiment[test_id]))
cm_bing_test <- caret::confusionMatrix(as.factor(predict_bing_test), 
                                        as.factor(reviews_sentiment_bing$Sentiment[test_id]))
cm_NB_test <- caret::confusionMatrix(as.factor(predict_NB_test),
                                     as.factor(test_labels))
cm_NN_test <- caret::confusionMatrix(as.factor(predict_NN_test),
                                     as.factor(test_labels))

7.1. Accuracy

accuracy_afinn_train <- cm_afinn_train$overall['Accuracy']
accuracy_bing_train <- cm_bing_train$overall['Accuracy']
accuracy_NB_train <- cm_NB_train$overall['Accuracy']
accuracy_NN_train <- cm_NN_train$overall['Accuracy']

accuracy_afinn_test <- cm_afinn_test$overall['Accuracy']
accuracy_bing_test <- cm_bing_test$overall['Accuracy']
accuracy_NB_test <- cm_NB_test$overall['Accuracy']
accuracy_NN_test <- cm_NN_test$overall['Accuracy']

accuracy_table <- data.frame(Model = c('Afinn','Bing','Naive Bayes','Neural Network'),
                             Train_accuracy = c(accuracy_afinn_train, accuracy_bing_train,
                                                accuracy_NB_train, accuracy_NN_train),
                             Test_accuracy = c(accuracy_afinn_test, accuracy_bing_test,
                                               accuracy_NB_test, accuracy_NN_test))
kbl(accuracy_table) %>%
  kable_paper(full_width = F)

Model | Train_accuracy | Test_accuracy
Afinn | 0.9578275 | 0.9566384
Bing | 0.9557244 | 0.9540375
Naive Bayes | 0.8636000 | 0.8798309
Neural Network | 0.9615497 | 0.9611493

7.2. Balanced Accuracy

balanced_accuracy_afinn_train <- cm_afinn_train$byClass['Balanced Accuracy']
balanced_accuracy_bing_train <- cm_bing_train$byClass['Balanced Accuracy']
balanced_accuracy_NB_train <- cm_NB_train$byClass['Balanced Accuracy']
balanced_accuracy_NN_train <- cm_NN_train$byClass['Balanced Accuracy']

balanced_accuracy_afinn_test <- cm_afinn_test$byClass['Balanced Accuracy']
balanced_accuracy_bing_test <- cm_bing_test$byClass['Balanced Accuracy']
balanced_accuracy_NB_test <- cm_NB_test$byClass['Balanced Accuracy']
balanced_accuracy_NN_test <- cm_NN_test$byClass['Balanced Accuracy']

balanced_accuracy_table <- data.frame(Model = c('Afinn','Bing','Naive Bayes','Neural Network'),
                    Train_balanced_accuracy = c(balanced_accuracy_afinn_train,balanced_accuracy_bing_train,
                                                balanced_accuracy_NB_train, balanced_accuracy_NN_train),
                    Test_balanced_accuracy = c(balanced_accuracy_afinn_test, balanced_accuracy_bing_test,
                                               balanced_accuracy_NB_test, balanced_accuracy_NN_test))
kbl(balanced_accuracy_table) %>%
  kable_paper(full_width = F)

Model | Train_balanced_accuracy | Test_balanced_accuracy
Afinn | 0.6445270 | 0.6368028
Bing | 0.6554678 | 0.6480103
Naive Bayes | 0.4937221 | 0.6139307
Neural Network | 0.4999537 | 0.5004813


Discussion

  • Accuracy is high for all 4 models, but balanced accuracy is much lower. This is due to the imbalanced data set: negative reviews account for only 3.8558% of the whole data set.
  • For the 2 rule-based models (using the Afinn and Bing dictionaries), there is no substantial difference between training and test results in either accuracy or balanced accuracy.
  • Comparing models by balanced accuracy, the rule-based models outperformed the machine learning models. We chose the Bing model, which has the highest training and test balanced accuracy, to investigate the data set further in the next section.
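
The gap between accuracy and balanced accuracy on an imbalanced data set can be illustrated with a small base R sketch. The label vector is hypothetical: 10000 reviews with roughly the share of negative reviews reported above, scored by a classifier that always predicts the majority class.

```r
# Hypothetical label vector with roughly the class balance reported above
# (about 3.86% negative); 0 = negative, 1 = neutral or positive.
actual <- c(rep(0, 386), rep(1, 9614))

# A classifier that ignores the text and always predicts the majority class.
predicted <- rep(1, length(actual))

accuracy <- mean(predicted == actual)
sensitivity <- sum(predicted == 0 & actual == 0) / sum(actual == 0)  # negatives found
specificity <- sum(predicted == 1 & actual == 1) / sum(actual == 1)
balanced_accuracy <- (sensitivity + specificity) / 2

c(accuracy = accuracy, balanced_accuracy = balanced_accuracy)
```

Such a classifier finds no negative review at all, yet its accuracy stays above 96%, which is why balanced accuracy is the more informative measure for this data set.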

8. Further text analysis

8.1. Patterns of reviews of review sentiment groups

## join sentiment by bing with data frame of word tokens 
reviews_words_sentiment_bing <- reviews_words %>%
  inner_join(reviews_sentiment_bing, by = c('CourseId','IdInCourse')) %>%
  count(word, Sentiment_bing, sort = TRUE) %>%
  ungroup()

## visualize top 20 words
p1 <- reviews_words_sentiment_bing %>%
  filter(Sentiment_bing == "neutral_positive") %>% 
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ungroup() %>% 
  ggplot(aes(x = word, y = n)) +
  geom_col(show.legend = FALSE, fill = "tomato") +
  labs(y = "No. of appearance in sentiment group",
       x = NULL, title = "Top 20 words of Neutral & Positive \nreviews based on Bing") +
  coord_flip()

p2 <- reviews_words_sentiment_bing %>%
  filter(Sentiment_bing == "negative") %>% 
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ungroup() %>% 
  ggplot(aes(x = word, y = n)) +
  geom_col(show.legend = FALSE, fill = "darkcyan") +
  labs(y = "No. of appearance in sentiment group",
       x = NULL,
       title = "Top 20 words of Negative \nreviews based on Bing") +
  coord_flip()

grid.arrange(p1, p2, ncol= 2, nrow = 1, heights= 1)


Discussion
Based on the above figure, we can see:

  • Learners feel positive about courses with regard to their material; they might find the content easy to understand. They are thankful for these courses and find them useful and interesting.
  • Learners' negative feelings about courses might be due to difficult or hard assignments and lectures. There may also be issues regarding course materials, videos and time.

8.2. Course content of review sentiment groups

# Count number of each course id and sentiment in bing
reviews_course_sentiment_bing <- reviews_sentiment_bing %>% 
  count(CourseId, Sentiment_bing)

# remove stop words in course id
reviews_course_sentiment_bing$CourseId <- removeWords(reviews_course_sentiment_bing$CourseId, c("course","learn","learning"))
reviews_course_sentiment_bing$CourseId <- removeWords(reviews_course_sentiment_bing$CourseId, stopwords("english"))

# create the comparison cloud to visualize the difference in course id words between the 2 sentiment groups by Bing
reviews_course_sentiment_bing %>% 
  group_by(CourseId, Sentiment_bing) %>%
  ungroup() %>%
  unnest_tokens(word, CourseId) %>% 
  group_by(word, Sentiment_bing) %>% 
  summarise(count_sentiment = sum(n)) %>% 
  acast(word ~ Sentiment_bing, value.var = "count_sentiment", fill = 0) %>%
  comparison.cloud(colors = c("darkcyan", "tomato"),title.size=2.5,
                   max.words = 50, random.order=FALSE)


Discussion
Based on the above figure, we can see:

  • Machine learning, Python, HTML and JavaScript courses are the main contributors to the group receiving positive or neutral sentiment reviews.
  • Programming, analytics, regression, R and robotics courses are the main contributors to the group receiving negative sentiment reviews.

9. Summary

In summary, we come to the following conclusions:

  • Rule-based models might outperform simple machine learning models in sentiment analysis of the course review data set.
  • The neural network model might be improved by using a more complex network or by vectorizing with a larger hash size in the hashing trick.
  • Thanks to text analysis techniques, we can analyze which aspects of online courses learners feel negative or positive about.
  • We can also combine information from the course title, course id or course introduction with reviews to find relations between course content and negative or positive reviews. This is one of the first steps towards further course content improvements.

10. References

List of packages used in the project:

  • tidytext
  • dplyr
  • textdata
  • wordcloud2
  • wordcloud
  • tidyverse
  • tm
  • stringi
  • e1071
  • caTools
  • caret
  • text2vec
  • keras
  • reshape2
  • ramify
  • kableExtra
  • gridExtra