1. Introduction

This study deploys sentiment analysis and topic modelling approaches to investigate learner characteristics in the user comments of a data science YouTube channel. Since its launch in June 2005, YouTube has become one of the most widely used video-sharing and content-based social media platforms in the world. By giving users the ability to create content freely and share it publicly, it also offers convenient access to these materials across devices ranging from phones and tablets to personal computers. In addition to its popularity for entertainment, music, vlogs, and promotional videos, empirical research has linked the platform to benefits for educational content (Clifton & Mann, 2011; Snelson, 2011). A critical question researchers have sought to understand is the knowledge co-construction argued to take place in the user comments on these videos (Dubovi & Tabak, 2020). According to Greenhow and Lewin (2016), informal learning and participatory digital cultures are becoming more influential in how people learn and acquire knowledge. It has even been argued that social media has the potential to bridge the gap that has long existed between formal and informal learning (Clifton & Mann, 2011).

As data science is an emerging field, there is a pressing need to examine how it is learned and inquired into. Investigating how learners learn, engage, and construct knowledge in designated learning environments is critical (Greenhow & Lewin, 2016). In recent years, there has been increased development and sharing of YouTube content that covers data science and its application to real-world challenges. Despite the potential that YouTube presents for both formal and informal learning, there is still a gap in research focused on learner behaviors and the process of constructing knowledge on these platforms.

Therefore, this study explores how text mining approaches can support the understanding of data science learners, their perspectives, and their areas of inquiry on these open learning platforms.

2. Research Questions

The main goal of this study is to investigate the opinions, perspectives, and areas of inquiry demonstrated by data science learners through YouTube user comments. The following subquestions guide the analysis:

  • What are the main keywords found in learners’ comments on data science YouTube videos?

  • What is the overall learners’ sentiment toward data science in YouTube comments?

  • What key themes are prevalent in the user comments?

3. Data Source

The dataset used in this research project is a corpus of YouTube comments from the Ken Jee YouTube channel. The channel has over 199,000 subscribers, with videos covering topics ranging from data science fundamentals to projects and career advice. The dataset was sourced from Kaggle and originally contains a collection of over 10,136 unique comments from the posted videos. For the purpose of this analysis, I pared the corpus down to 1,000 randomized observations.

4. Target Audience

The findings of this study can benefit various stakeholders depending on their level of interest and involvement in studying learning processes. For instance, data science educators, content creators, and instructional designers can gain insights into the general perspectives and opinions of online users and learners. Topic modeling can reveal which areas or themes are most prevalent on these platforms, which can guide how they design and improve learning experiences for their learners. Researchers can build on these contributions to develop better methodologies for assessing and investigating learning on open, public social media platforms. Students can benefit from the meta-learning process by identifying factors and patterns associated with understanding data science and hence build motivation to study it further.

Methods

The study follows the Data-Intensive Research workflow proposed by Krumm et al. (2018). In the exploring phase, I analyze term frequency (tf) and use a word cloud and frequency plots to highlight the most common keywords in the comments. The anti_join() function is used to remove stop words and tidy the data. Sentiment analysis uses the AFINN and Bing lexicons to establish the polarity of the sampled user comments. For topic modelling, the analysis involves casting a document-term matrix, which is used as input to Latent Dirichlet Allocation (LDA) to create a seven-topic LDA model. Additionally, I compute word-topic probabilities to establish the common terms in each topic, which supports a more efficient interpretation of results.

#Loading Libraries
library(dplyr)
library(readr)
library(tidyr)
library(rtweet)
library(writexl)
library(readxl)
library(tidytext)
library(textdata)
library(ggplot2)
library(scales)
library(wordcloud)
library(wordcloud2)
library(SnowballC)
library(topicmodels)
library(stm)
library(ldatuning)
library(knitr)
library(LDAvis)
library(purrr)
library(RColorBrewer)

In the wrangling phase, I imported the dataset, which had been pared down to 1,000 observations. For the scope of this analysis, I selected two primary variables of interest: commentID and comments, the latter containing the actual text for analysis.

#reading csv file into r
youtubecomments_data <- read_csv("data/final_youtubecomments1.csv", 
     col_types = cols(comments = col_character(),
                   commentID = col_character(), 
                   replycount = col_integer(),
                   likecount = col_integer(),
                   date = col_skip(),
                   vidid = col_character(),
                   user_id=col_character()
                   )
    )
#selecting variables of interest 
youtubecomments_text <- select(youtubecomments_data,commentID,comments)

Explore

Prior to tokenization, I explored a sample of comments to establish initial context and frame expectations for the analysis.

sample_comments <- youtubecomments_text %>%
  select(comments)  
sample_n(sample_comments,10)
## # A tibble: 10 x 1
##    comments                                                                     
##    <chr>                                                                        
##  1 "Title of the background instrumental pls ðŸ™\u008fðŸ™\u008fðŸ™\u008f"       
##  2 "I would rather listen to this without background music. It's so relaxing th~
##  3 "Really clear description!\nUntil now i have found the model deployment step~
##  4 "Awesome explanation with some great practical examples"                     
##  5 "Thank you so much.\nI have a quick question.\nI am mid-career and switching~
##  6 "This is the video I have been looking for!  Thank you!  I am going to come ~
##  7 "This is a fantastic video, Ken!"                                            
##  8 "Thank you for sharing!!"                                                    
##  9 "WOWZER!"                                                                    
## 10 "Hi! I am new to Data Science. I have a Civil Engineering Degree and a MBA. ~
#Tokenizing and using anti_join to remove stop words
youtubecomments_tidy <- youtubecomments_text %>%
  unnest_tokens(output = word, input = comments) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
youtubecomments_tidy
## # A tibble: 3,568 x 2
##    word         n
##    <chr>    <int>
##  1 data       466
##  2 science    319
##  3 ken        247
##  4 video      226
##  5 ðÿ         159
##  6 learning   106
##  7 videos      82
##  8 time        78
##  9 projects    75
## 10 project     72
## # ... with 3,558 more rows
#customized stop words
my_stopwords <- c("data", "science","ðŸ","amp", "â", "jee", "1","2","3","4", "iâ", "dont", "ive", "im", "gt","1","2","3","ds", "ken", "hey")
youtubecomments_tidy2 <-
  youtubecomments_tidy %>%
  filter(!word %in% my_stopwords)
youtubecomments_tidy2
## # A tibble: 3,554 x 2
##    word         n
##    <chr>    <int>
##  1 video      226
##  2 ðÿ         159
##  3 learning   106
##  4 videos      82
##  5 time        78
##  6 projects    75
##  7 project     72
##  8 python      60
##  9 lot         59
## 10 content     58
## # ... with 3,544 more rows

An interesting observation from the word list is the emoticon represented by the symbols ðÿ. I chose to leave the symbol on the word list because of the meaning it carries. On social media platforms, an emoticon is a form of language that expresses meaning that cannot be conveyed in formal text; it is essentially a sentiment-bearing facial expression that formal language cannot effectively capture.
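
As a follow-up, a minimal sketch of how the encoding of this token could be inspected is shown below (this assumes the stringi package, which is not loaded above, and is not part of the original analysis); it would help determine whether ðÿ stands for a single mis-decoded emoji or several characters run together.

#sketch only: inspect tokens containing non-ASCII characters (assumes stringi is installed)
library(stringi)
nonascii_tokens <- youtubecomments_tidy %>%
  filter(stri_detect_regex(word, "[^\\x00-\\x7F]"))
head(nonascii_tokens)
stri_escape_unicode(head(nonascii_tokens$word))  #show the underlying Unicode escape sequences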

WORDCLOUD

I created a word cloud to show additional words that occur frequently in the comments and could be of interest for further investigation. To keep the word cloud neat and meaningful, only words appearing at least five times are included in the visualization.

set.seed(1234)
wordcloud(words = youtubecomments_tidy2$word, freq = youtubecomments_tidy2$n, min.freq = 5,
          max.words=100, random.order=FALSE, rot.per=0.40, 
          colors=brewer.pal(8, "Dark2"))

youtubecomments_tidy2 %>%
  filter(n > 40) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col(fill = "lightblue") +
  labs(x = "Word Counts", y = NULL, title = "Most Frequent Words Appearing in Youtube Comments on Data Science") + 
  theme_minimal()

The bar graph reveals additional words such as “job”, “experience”, and “career”, which indicate that some of the comments were geared towards careers and jobs in data science. Further validation against the video titles could confirm whether this was one of the topics of the videos themselves.

SENTIMENT ANALYSIS

After exploring the frequent words through the word cloud and bar chart, I loaded the AFINN and Bing lexicons in order to compute the sentiment of the sampled comments. Using two lexicons enhances the validity of the study, especially in answering the second research question.

## Getting AFINN and BING Lexicons 
afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
sentiment_afinn <- inner_join(youtubecomments_tidy2, afinn, by = "word")
sentiment_bing <- inner_join(youtubecomments_tidy2, bing, by = "word")
summary_bing <- count(sentiment_bing, sentiment, sort = TRUE)
summary_bing
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 positive    235
## 2 negative    191
summary_bing <- sentiment_bing %>% 
  count(sentiment, sort = TRUE) %>% 
  spread(sentiment, n) %>%
  mutate(sentiment = positive - negative) %>%
  mutate(lexicon = "bing") %>%
  relocate(lexicon)

summary_bing
## # A tibble: 1 x 4
##   lexicon negative positive sentiment
##   <chr>      <int>    <int>     <int>
## 1 bing         191      235        44
summary_afinn <- sentiment_afinn %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(lexicon = "AFINN") %>%
  relocate(lexicon)

summary_afinn
## # A tibble: 1 x 2
##   lexicon sentiment
##   <chr>       <dbl>
## 1 AFINN         161
commentsmodel_text <-
  youtubecomments_text %>%
  select(commentID, comments)

commentsmodel_text
## # A tibble: 1,000 x 2
##    commentID                  comments                                          
##    <chr>                      <chr>                                             
##  1 Ugw--6D9DvDgJa8QCIR4AaABAg "I think part of it is the pressure for the next ~
##  2 Ugw6_yWvhD50HEBzAP14AaABAg "This is awesome @Ken. Would love to see your not~
##  3 UgzdzXoiVlBqVrNl97N4AaABAg "... Well, this is super exciting. Looking forwar~
##  4 Ugz-IebRr6iAsvhN3i14AaABAg "I'm definitely struggling with imposter syndrome~
##  5 UgxlyRB9jLNyU31q_2R4AaABAg "Amazing job! I built my portfolio in just one da~
##  6 Ugw1XexXp_ejbiKcE554AaABAg "Bhai"                                            
##  7 UgxcxJS1fot8DNPhkbN4AaABAg "I GIT IT NOW 😮"                               
##  8 UgypB01BNUcfm5B7OLV4AaABAg "That last part of your video was top tier. For m~
##  9 UgxaJdZmkfJhhw6VutB4AaABAg "I would like to attend Multi-Instance GPU (MIG) ~
## 10 UgytMZWP6oJkVz7BJvp4AaABAg "Congrats Ken!!! I loved this video! Especially t~
## # ... with 990 more rows
 sentiment_afinn <- commentsmodel_text %>%
  unnest_tokens(output = word, 
                input = comments)  %>% 
  anti_join(stop_words, by = "word") %>%
  filter(!word == "amp") %>%
  inner_join(afinn, by = "word")

sentiment_afinn
## # A tibble: 1,083 x 3
##    commentID                  word       value
##    <chr>                      <chr>      <dbl>
##  1 Ugw--6D9DvDgJa8QCIR4AaABAg pressure      -1
##  2 Ugw6_yWvhD50HEBzAP14AaABAg awesome        4
##  3 Ugw6_yWvhD50HEBzAP14AaABAg love           3
##  4 UgzdzXoiVlBqVrNl97N4AaABAg super          3
##  5 UgzdzXoiVlBqVrNl97N4AaABAg exciting       3
##  6 Ugz-IebRr6iAsvhN3i14AaABAg struggling    -2
##  7 Ugz-IebRr6iAsvhN3i14AaABAg confident      2
##  8 UgxlyRB9jLNyU31q_2R4AaABAg amazing        4
##  9 UgypB01BNUcfm5B7OLV4AaABAg top            2
## 10 UgytMZWP6oJkVz7BJvp4AaABAg congrats       2
## # ... with 1,073 more rows
ggplot(sentiment_afinn,aes(word,value, fill="value"))+
  geom_bar(stat="identity")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size=1))+
  ggtitle("Sentiment Value Spread")

afinn_score <- sentiment_afinn %>% 
  group_by(commentID) %>% 
  summarise(value = sum(value))

afinn_score
## # A tibble: 562 x 2
##    commentID                  value
##    <chr>                      <dbl>
##  1 Ugw--6D9DvDgJa8QCIR4AaABAg    -1
##  2 Ugw-8wZXggcrNzq0KGN4AaABAg     3
##  3 Ugw-FIRSkQztZMy3HvZ4AaABAg     2
##  4 Ugw_JjLakkrb-IZ0E2J4AaABAg    -2
##  5 Ugw0R1C-H85Af8N2PgB4AaABAg    -2
##  6 Ugw1gPmaznecnK6-wtx4AaABAg     2
##  7 Ugw1p79nUVIPTOFcSsN4AaABAg     2
##  8 Ugw1UbdA1VzqOHUMcJ54AaABAg     7
##  9 Ugw1Wq07hI6jTXM6uBR4AaABAg     0
## 10 Ugw25hXDa9F9r47Hujp4AaABAg     7
## # ... with 552 more rows
afinn_sentiment <- afinn_score %>%
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive"))

afinn_sentiment
## # A tibble: 544 x 3
##    commentID                  value sentiment
##    <chr>                      <dbl> <chr>    
##  1 Ugw--6D9DvDgJa8QCIR4AaABAg    -1 negative 
##  2 Ugw-8wZXggcrNzq0KGN4AaABAg     3 positive 
##  3 Ugw-FIRSkQztZMy3HvZ4AaABAg     2 positive 
##  4 Ugw_JjLakkrb-IZ0E2J4AaABAg    -2 negative 
##  5 Ugw0R1C-H85Af8N2PgB4AaABAg    -2 negative 
##  6 Ugw1gPmaznecnK6-wtx4AaABAg     2 positive 
##  7 Ugw1p79nUVIPTOFcSsN4AaABAg     2 positive 
##  8 Ugw1UbdA1VzqOHUMcJ54AaABAg     7 positive 
##  9 Ugw25hXDa9F9r47Hujp4AaABAg     7 positive 
## 10 Ugw2F9yhanpO1SnzGSp4AaABAg     1 positive 
## # ... with 534 more rows
afinn_ratio <- afinn_sentiment %>% 
  count(sentiment) %>% 
  spread(sentiment, n) %>%
  mutate(ratio = negative/positive)

afinn_ratio
## # A tibble: 1 x 3
##   negative positive ratio
##      <int>    <int> <dbl>
## 1       88      456 0.193
afinn_counts <- afinn_sentiment %>%
  count(sentiment) 

afinn_counts %>%
ggplot(aes(x="", y=n, fill=sentiment)) +
  geom_bar(width = .6, stat = "identity") +
  labs(title = "Sentiment Analysis of Data Science YouTube Comments",
       subtitle = "Proportion of Positive and Negative Comments") +
  coord_polar(theta = "y") +
  theme_void()

summary_afinn2 <- sentiment_afinn %>% 
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive")) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "AFINN")

summary_bing2 <- sentiment_bing %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "bing")

#Binding rows
summary_sentiment <- bind_rows(summary_afinn2,
                               summary_bing2) %>%
  arrange(method) %>%
  relocate(method)
summary_sentiment
## # A tibble: 4 x 3
##   method sentiment     n
##   <chr>  <chr>     <int>
## 1 AFINN  positive    836
## 2 AFINN  negative    247
## 3 bing   positive    235
## 4 bing   negative    191
#Total word counts

total_counts <- summary_sentiment %>%
  group_by(method) %>%
  summarise(total = sum(n))
sentiment_counts <- left_join(summary_sentiment, total_counts, by = "method")
#calculating the percentage of positive and negative sentiments
sentiment_percents <- sentiment_counts %>%
  mutate(percent = n/total * 100)

sentiment_percents
## # A tibble: 4 x 5
##   method sentiment     n total percent
##   <chr>  <chr>     <int> <int>   <dbl>
## 1 AFINN  positive    836  1083    77.2
## 2 AFINN  negative    247  1083    22.8
## 3 bing   positive    235   426    55.2
## 4 bing   negative    191   426    44.8
#plotting the percentage by lexicons 

sentiment_percents %>%
  ggplot(aes(x = method, y = percent, fill=sentiment)) +
  geom_bar(width = .8, stat = "identity") +
  coord_flip() +
  labs(title = "Sentiment Summary on YouTube Comments", 
       subtitle = "",
       x = "Lexicon", 
       y = "Percentage of Words")

TOPIC MODELLING

The topic modeling process started with creating a document-term matrix to be used with Latent Dirichlet Allocation (LDA) to model and explore the potential themes in the comments. In this case, each comment was treated as a document; this is justified because each comment is independent and has a unique ID in the dataset.

#cast Document Term Matrix

comments_tidy <- youtubecomments_data %>%
  unnest_tokens(output = word, input = comments) %>%
  anti_join(stop_words, by = "word")

comments_dtm <- comments_tidy %>%
  count(commentID, word) %>%
  cast_dtm(commentID, word, n)
comments_dtm
## <<DocumentTermMatrix (documents: 986, terms: 3568)>>
## Non-/sparse entries: 10666/3507382
## Sparsity           : 100%
## Maximal term length: 818
## Weighting          : term frequency (tf)

The comments_dtm object contains a total of 986 documents and 3,568 terms. The reported sparsity of 100% refers to the proportion of empty (zero) cells in the document-term matrix, rounded to the nearest percent.
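
As a quick check on that rounded figure, the sparsity can be recomputed from the non-/sparse entry counts printed above:

#sparsity = zero cells / total cells; the DTM print rounds this to 100%
sparse_cells    <- 3507382
nonsparse_cells <- 10666
sparse_cells / (sparse_cells + nonsparse_cells)  #approximately 0.997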

## [1] "DocumentTermMatrix"    "simple_triplet_matrix"
## <<DocumentTermMatrix (documents: 986, terms: 3568)>>
## Non-/sparse entries: 10666/3507382
## Sparsity           : 100%
## Maximal term length: 818
## Weighting          : term frequency (tf)

For this analysis, I opted to perform stemming in order to conflate words with related meanings and reduce the overall vocabulary. I used the textProcessor() function to preprocess the comments for later use with the structural topic model (STM) algorithm.

#textProcessor
temp <- textProcessor(youtubecomments_data$comments, 
                    metadata = youtubecomments_data,  
                    lowercase=TRUE, 
                    removestopwords=TRUE, 
                    removenumbers=TRUE,  
                    removepunctuation=TRUE, 
                    wordLengths=c(3,Inf),
                    stem=TRUE,
                    onlycharacter= FALSE, 
                    striphtml=TRUE, 
                    customstopwords=NULL)
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents

I then used the wordStem() function to create a new column of stemmed words and computed counts of each stem.

#stemming 
stemmed_comments <- youtubecomments_data %>%
  unnest_tokens(output = word, input = comments) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word))
stemmed_comments
## # A tibble: 11,996 x 7
##    commentID                  replycount likecount vidid   user_id  word   stem 
##    <chr>                           <int>     <int> <chr>   <chr>    <chr>  <chr>
##  1 Ugw--6D9DvDgJa8QCIR4AaABAg          1         2 2qVWur~ user_751 press~ pres~
##  2 Ugw--6D9DvDgJa8QCIR4AaABAg          1         2 2qVWur~ user_751 conte~ cont~
##  3 Ugw--6D9DvDgJa8QCIR4AaABAg          1         2 2qVWur~ user_751 piece  piec 
##  4 Ugw--6D9DvDgJa8QCIR4AaABAg          1         2 2qVWur~ user_751 â      â    
##  5 Ugw--6D9DvDgJa8QCIR4AaABAg          1         2 2qVWur~ user_751 data   data 
##  6 Ugw--6D9DvDgJa8QCIR4AaABAg          1         2 2qVWur~ user_751 scien~ scie~
##  7 Ugw--6D9DvDgJa8QCIR4AaABAg          1         2 2qVWur~ user_751 posts  post 
##  8 Ugw--6D9DvDgJa8QCIR4AaABAg          1         2 2qVWur~ user_751 videos video
##  9 Ugw--6D9DvDgJa8QCIR4AaABAg          1         2 2qVWur~ user_751 satur~ satur
## 10 Ugw--6D9DvDgJa8QCIR4AaABAg          1         2 2qVWur~ user_751 youâ   youâ 
## # ... with 11,986 more rows
## <<DocumentTermMatrix (documents: 986, terms: 2808)>>
## Non-/sparse entries: 10477/2758211
## Sparsity           : 100%
## Maximal term length: 818
## Weighting          : term frequency (tf)
## <<DocumentTermMatrix (documents: 986, terms: 3568)>>
## Non-/sparse entries: 10666/3507382
## Sparsity           : 100%
## Maximal term length: 818
## Weighting          : term frequency (tf)
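
The stem_counts object printed below and plotted afterwards was not echoed above; a minimal reconstruction of that step, consistent with the printed output and the stemmed_comments object created earlier, is:

#counting stems (reconstruction; the original chunk was not echoed)
stem_counts <- stemmed_comments %>%
  count(stem, sort = TRUE)
stem_counts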
## # A tibble: 2,808 x 2
##    stem        n
##    <chr>   <int>
##  1 data      466
##  2 scienc    320
##  3 video     308
##  4 ken       248
##  5 learn     174
##  6 ðÿ        159
##  7 project   148
##  8 start     109
##  9 time       85
## 10 love       73
## # ... with 2,798 more rows
barplot(stem_counts[1:10,]$n, las = 2, names.arg = stem_counts[1:10,]$stem,
        col = "lightblue", main = "Top 10 most frequent stems",
        ylab = "Word frequencies")

In the modeling phase I deploy both the LDA and STM algorithms to investigate patterns in the comments. In selecting the value of K, I chose 7 to provide concise themes for exploration.
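
The ldatuning package loaded earlier offers a more data-driven way to choose K; the sketch below shows how it could be applied to the same document-term matrix (this was not part of the original analysis, and the metrics shown are illustrative).

#sketch only: score candidate values of K on the document-term matrix
k_search <- FindTopicsNumber(comments_dtm,
                             topics = seq(2, 15, by = 1),
                             metrics = c("CaoJuan2009", "Deveaud2014"),
                             method = "Gibbs")
FindTopicsNumber_plot(k_search)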

comments_lda <- LDA(comments_dtm, 
                  k = 7, 
                  control = list(seed = 200)
                  )
comments_lda
## A LDA_VEM topic model with 7 topics.

I then fitted a structural topic model (STM), first extracting the documents, vocabulary, and metadata from the temp object.

#extracting elements from the temp object
docs <- temp$documents 
meta <- temp$meta 
vocab <- temp$vocab 
comments_stm <- stm(documents=docs, 
         data=meta,
         vocab=vocab,
         K=7,
         max.em.its=7,
         verbose = FALSE)
comments_stm
## A topic model with 7 topics, 997 documents and a 2902 word dictionary.
plot.STM(comments_stm, n = 7)

I also used the LDAvis topic browser to explore the distribution of the emergent theme words.

toLDAvis(mod = comments_stm, docs = docs)
## Loading required namespace: servr

There appears to be overlap between topics 2, 3, and 5, which may indicate that K = 7 was not an optimal choice.
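
One way to probe this further, sketched below with diagnostics from the stm package already loaded (not part of the original analysis), is to compare semantic coherence and exclusivity across the seven topics; overlapping topics typically score low on exclusivity.

#sketch only: topic quality diagnostics for the fitted STM
coh  <- semanticCoherence(comments_stm, documents = docs)
excl <- exclusivity(comments_stm)
data.frame(topic = 1:7, coherence = coh, exclusivity = excl)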

I then examined β, the probabilities of words belonging to each identified theme, and explored the top five words assigned to each topic to make further interpretations.

terms(comments_lda, 5)
##      Topic 1   Topic 2   Topic 3   Topic 4 Topic 5     Topic 6    Topic 7   
## [1,] "ken"     "ðÿ"      "data"    "3"     "data"      "data"     "video"   
## [2,] "video"   "data"    "science" "https" "ken"       "science"  "ken"     
## [3,] "amazing" "science" "day"     "2"     "time"      "learning" "projects"
## [4,] "content" "ken"     "book"    "model" "scientist" "video"    "project" 
## [5,] "analyst" "video"   "video"   "ken"   "videos"    "degree"   "series"
tidy_lda <- tidy(comments_lda)
tidy_lda
## # A tibble: 24,976 x 3
##    topic term            beta
##    <int> <chr>          <dbl>
##  1     1 â          1.20e- 57
##  2     2 â          1.50e- 12
##  3     3 â          2.49e-  3
##  4     4 â          4.28e-162
##  5     5 â          9.30e-  3
##  6     6 â          8.14e- 25
##  7     7 â          4.43e-  3
##  8     1 absorption 4.55e-235
##  9     2 absorption 6.22e-240
## 10     3 absorption 2.02e-234
## # ... with 24,966 more rows
top_terms <- tidy_lda %>%
  group_by(topic) %>%
  slice_max(beta, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%    
  arrange(desc(beta)) %>%  
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 5 terms in each identified topic",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")

td_beta <- tidy(comments_lda)
td_gamma <- tidy(comments_lda, matrix = "gamma")
td_beta
## # A tibble: 24,976 x 3
##    topic term            beta
##    <int> <chr>          <dbl>
##  1     1 â          1.20e- 57
##  2     2 â          1.50e- 12
##  3     3 â          2.49e-  3
##  4     4 â          4.28e-162
##  5     5 â          9.30e-  3
##  6     6 â          8.14e- 25
##  7     7 â          4.43e-  3
##  8     1 absorption 4.55e-235
##  9     2 absorption 6.22e-240
## 10     3 absorption 2.02e-234
## # ... with 24,966 more rows
td_gamma
## # A tibble: 6,902 x 3
##    document                   topic   gamma
##    <chr>                      <int>   <dbl>
##  1 Ugw--6D9DvDgJa8QCIR4AaABAg     1 0.00248
##  2 Ugw-8wZXggcrNzq0KGN4AaABAg     1 0.927  
##  3 Ugw-B4KThtc_ghl7-9V4AaABAg     1 0.0223 
##  4 Ugw-FIRSkQztZMy3HvZ4AaABAg     1 0.0157 
##  5 Ugw_8vkPAP2N_8MBWU14AaABAg     1 0.0121 
##  6 Ugw_JjLakkrb-IZ0E2J4AaABAg     1 0.984  
##  7 Ugw_sTWhfSF1uiDCKoR4AaABAg     1 0.00985
##  8 Ugw0R1C-H85Af8N2PgB4AaABAg     1 0.826  
##  9 Ugw1b5VErjT-yXwjGZ54AaABAg     1 0.241  
## 10 Ugw1gPmaznecnK6-wtx4AaABAg     1 0.00831
## # ... with 6,892 more rows
top_terms <- td_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>% 
  unnest(cols = c(terms))
gamma_terms <- td_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))
gamma_terms %>%
  select(topic, gamma, terms) %>%
  kable(digits = 3, 
        col.names = c("Topic", "Expected Topic Proportion", "Top 7 terms"))
Topic     Expected Topic Proportion   Top 5 terms
--------  --------------------------  ---------------------------------------
Topic 7   0.203                       video, ken, projects, project, series
Topic 2   0.154                       ðÿ, data, science, ken, video
Topic 1   0.153                       ken, video, amazing, content, analyst
Topic 6   0.148                       data, science, learning, video, degree
Topic 5   0.145                       data, ken, time, scientist, videos
Topic 3   0.099                       data, science, day, book, video
Topic 4   0.098                       3, https, 2, model, ken

5. COMMUNICATE

Findings

The main purpose of this analysis was to deploy text mining approaches to investigate learner characteristics in the user comments of a data science channel. To answer the main research questions, two approaches were used to explore and model a dataset of 1,000 observations containing user comments from the Ken Jee YouTube channel. The first phase of the study involved sentiment analysis to establish the overall polarity of the comments. The second phase used topic modeling to identify the key topics discussed in the user comments. The presentation of the findings is guided by the driving research questions, as follows.

  • What are the main keywords found in learners’ comments on data science YouTube videos?

From the exploration of the comments through the word cloud and word frequency tables, the most prominent keywords are “learning”, “time”, “projects”, and “python”. This suggests that most of the comments in the sample came from project-based videos. To validate this, I scanned the channel and explored the videos with the most views; the author regularly posts videos about his data science projects and follow-up updates as he progresses. Another insightful observation is the keyword “python”, which at a surface level suggests that many users discussed Python as the language used in the posted projects, and perhaps in their own learning and projects. Also notable is the use of an emoticon, which on social media platforms is a common way of communicating and expressing sentiment in place of formal language. The symbol ðŸ, one of the most prevalent terms, represents an emoticon; however, further analysis of the encoding is needed to determine whether it represents a single emoticon or a combination of them. Additional observations highlight words such as “job”, “career”, and “experience”, which indicate that discussions about careers are prevalent in the comments.

  • What is the overall learners’ sentiment toward data science in YouTube comments?

In analyzing users’ sentiment toward the videos, I deployed the AFINN and Bing lexicons to establish the polarity of the sampled comments. I prefer to use more than one lexicon in order to validate the results obtained from either model. The findings indicate that the overall sentiment in these videos is positive according to both lexicons. For the AFINN lexicon, 77.2% of the sentiment-bearing words were positive and 22.8% negative. The split for the Bing lexicon is closer, with 55.2% positive and 44.8% negative. The overall sentiment of the user comments on these data science videos is therefore positive.

  • What key themes are prevalent in the user comments?

Topic modeling was the most challenging analysis in this study. In the first phase of the analysis I chose K = 5, and the resulting themes were not concise, with many words overlapping between topics; this indicated that the choice of K was probably not optimal. In the second phase I chose K = 7, and the results improved slightly, but the themes still overlapped and their patterns remained unclear. Topic 7 was the most concise and reflected a theme based on projects. Topic 3 captured curiosity around the data, with words such as “think”, “look”, and “see”. Topic 4 leans towards feedback about the content, with words such as “like”, “thanks”, and “learning”. From LDAvis it can be observed that there is substantial overlap among Topics 1, 3, 5, and 6, which makes it challenging to construct a meaningful interpretation. In this case I think it is important to understand the corpus and its context in order to improve the interpretations.

Implications

These findings reveal that, in general, learners are positive about data science videos posted on YouTube, as indicated by both the keywords and the sentiment analysis. Secondly, project-based learning videos seem to stir more engagement and discussion in the comments; the topic modelling pointed this out, with Topic 7 being very distinct compared with the other identified themes. Thirdly, these approaches suggest that what is reflected in these comments is surface learning rather than deep learning. Surface learning relates to demonstrating some level of learning and appreciating new facts, while deep learning relates to the ability to apply new facts in context (Biggs, 1988).

Limitations

This pilot study explored basic sentiment analysis and topic modeling to get a preliminary picture of the perceptions, polarity, and salient themes that can be found in YouTube comments. Additionally, the source corpus for this analysis is limited to 1,000 observations collected from a single YouTube channel. A larger and more diverse corpus would likely have provided more insightful findings and enhanced the validity of the study. The generalization of these results to other contexts is also limited, given the nature of the dataset and the context of the study.

References

Biggs, J. B. (1988). Assessing student approaches to learning. Australian Psychologist, 23(2), 197-206.

Clifton, A., & Mann, C. (2011). Can YouTube enhance student nurse learning? Nurse Education Today, 31(4), 311-313.

Dubovi, I., & Tabak, I. (2020). An empirical analysis of knowledge co-construction in YouTube comments. Computers & Education, 156, 103939.

Greenhow, C., & Lewin, C. (2016). Social media and education: Reconceptualizing the boundaries of formal and informal learning. Learning, Media and Technology, 41(1), 6-30.

Muhammad, A. N., Bukhori, S., & Pandunata, P. (2019, October). Sentiment analysis of positive and negative of YouTube comments using Naïve Bayes–Support Vector Machine (NBSVM) classifier. In 2019 International Conference on Computer Science, Information Technology, and Electrical Engineering (ICOMITEE) (pp. 199-205). IEEE.

Snelson, C. (2011). YouTube across the disciplines: A review of the literature. MERLOT Journal of Online Learning and Teaching.