This study deploys sentiment analysis and topic modelling approaches to investigate learner characteristics in the user comments of a data science YouTube channel. Since its launch in 2005, YouTube has become one of the most widely used video-sharing and content-based social media platforms in the world. It gives users the ability to create content freely and share it publicly, and it offers the convenience of accessing these materials on an array of devices, from phones and tablets to personal computers. In addition to its popularity for entertainment, music, vlogs and promotional videos, empirical research has linked the platform to benefits for educational content (Clifton & Mann, 2011; Snelson, 2011). A critical factor that researchers have sought to understand is the knowledge co-construction that is argued to take place in the user comments on these videos (Dubovi & Tabak, 2020). According to Greenhow and Lewin (2016), informal learning and participatory digital cultures are becoming more influential in the way people learn and acquire knowledge. In fact, it is argued that social media has the potential to bridge the gap that has long existed between formal and informal learning (Clifton & Mann, 2011).
As data science is an emerging field, there is an urgent need to examine its processes of learning and inquiry. Investigations of how learners learn, engage and construct knowledge in designated learning environments are critical (Greenhow & Lewin, 2016). In recent years, there has been an increase in the development and sharing of YouTube content that covers data science and its application to real-world challenges. Despite the potential that YouTube presents for both formal and informal learning, there is still a gap in research that focuses on learner behaviors and the process of constructing knowledge on these platforms.
Therefore, this study seeks to explore how text mining approaches can support the understanding of data science learners, their perspectives and their areas of inquiry on these open learning platforms.
The main goal of this study is to investigate opinions, perspectives and areas of inquiry demonstrated by data science learners through YouTube user comments. The following subquestions will be guiding the analysis of this study:
The dataset used in this research project is a corpus of YouTube comments from the Ken Jee YouTube channel. The channel has over 199,000 subscribers, with videos covering topics ranging from data science fundamentals to projects and career advice. The dataset was sourced from Kaggle and originally contains 10,136 unique comments on the posted videos. For the purpose of this analysis, I pared the corpus down to a random sample of 1,000 observations.
The findings of this study can be beneficial to various stakeholders, depending on their level of interest and involvement in studying learning processes. For instance, data science educators, content creators and instructional designers can gain insights into the general perspectives and opinions of online users and learners. Topic modeling can support an understanding of which areas or themes are most prevalent on these platforms, and this can provide guidance on how to design and improve learning experiences for their learners. Researchers can use these contributions to expand their studies and develop better methodologies for assessing and investigating learning on open and public social media platforms. Students can benefit from the meta-learning process by identifying factors and patterns that are associated with understanding data, and hence build motivation to study it further.
The study follows the Data-Intensive Research workflow proposed by Krumm et al. (2018). In the exploring phase, I will analyze term frequency (tf) and use word cloud and frequency plots to highlight the most common keywords in the comments. Functions such as anti_join will be used to remove stop words and tidy the data. Sentiment analysis will involve the use of the AFINN and Bing lexicons to establish the polarity of the sampled user comments. For topic modelling, the analysis involves casting a document-term matrix, which will be used as input to Latent Dirichlet Allocation (LDA) to create a seven-topic model. Additionally, I will compute word-topic probabilities to establish the common terms in each topic, which will support more efficient interpretation of the results.
#Loading Libraries
library(dplyr)
library(readr)
library(tidyr)
library(rtweet)
library(writexl)
library(readxl)
library(tidytext)
library(textdata)
library(ggplot2)
library(scales)
library(wordcloud)
library(wordcloud2)
library(SnowballC)
library(topicmodels)
library(stm)
library(ldatuning)
library(knitr)
library(LDAvis)
library(purrr)
library(RColorBrewer)
In the wrangling process, I imported the dataset, which had been pared down to 1,000 observations. For the scope of this analysis, I selected two primary variables of interest: commentID, the comment identifier, and comments, which contains the actual text for analysis.
# Reading the CSV file into R
youtubecomments_data <- read_csv(
  "data/final_youtubecomments1.csv",
  col_types = cols(
    comments = col_character(),
    commentID = col_character(),
    replycount = col_integer(),
    likecount = col_integer(),
    date = col_skip(),
    vidid = col_character(),
    user_id = col_character()
  )
)
# Selecting the variables of interest
youtubecomments_text <- select(youtubecomments_data, commentID, comments)
Explore
Prior to starting tokenization, I explored sample comments to establish initial context of the comments and to frame expectations of future analysis.
sample_comments <- youtubecomments_text %>%
select(comments)
sample_n(sample_comments,10)
## # A tibble: 10 x 1
## comments
## <chr>
## 1 "Title of the background instrumental pls ðŸ™\u008fðŸ™\u008fðŸ™\u008f"
## 2 "I would rather listen to this without background music. It's so relaxing th~
## 3 "Really clear description!\nUntil now i have found the model deployment step~
## 4 "Awesome explanation with some great practical examples"
## 5 "Thank you so much.\nI have a quick question.\nI am mid-career and switching~
## 6 "This is the video I have been looking for! Thank you! I am going to come ~
## 7 "This is a fantastic video, Ken!"
## 8 "Thank you for sharing!!"
## 9 "WOWZER!"
## 10 "Hi! I am new to Data Science. I have a Civil Engineering Degree and a MBA. ~
# Tokenizing and using anti_join to remove stop words
youtubecomments_tidy <- youtubecomments_text %>%
  unnest_tokens(output = word, input = comments) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
youtubecomments_tidy
## # A tibble: 3,568 x 2
## word n
## <chr> <int>
## 1 data 466
## 2 science 319
## 3 ken 247
## 4 video 226
## 5 ðÿ 159
## 6 learning 106
## 7 videos 82
## 8 time 78
## 9 projects 75
## 10 project 72
## # ... with 3,558 more rows
# Customized stop words (duplicate entries removed; the filter is unchanged)
my_stopwords <- c("data", "science", "ðŸ", "amp", "â", "jee", "1", "2", "3", "4",
                  "iâ", "dont", "ive", "im", "gt", "ds", "ken", "hey")
youtubecomments_tidy2 <-
  youtubecomments_tidy %>%
  filter(!word %in% my_stopwords)
youtubecomments_tidy2
## # A tibble: 3,554 x 2
## word n
## <chr> <int>
## 1 video 226
## 2 ðÿ 159
## 3 learning 106
## 4 videos 82
## 5 time 78
## 6 projects 75
## 7 project 72
## 8 python 60
## 9 lot 59
## 10 content 58
## # ... with 3,544 more rows
An interesting observation from the word list is the emoticon represented by the symbols ðÿ. I chose to leave the symbol in the word list because of the meaning it carries. On social media platforms, an emoticon is a form of language that conveys meaning that cannot be expressed in formal text. Emoticons essentially convey sentiment through facial expressions that formal language cannot effectively capture.
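The ðÿ token can be examined at the byte level. The following is a minimal base R sketch, assuming the sequence arises because UTF-8 emoji bytes were displayed in a single-byte Windows-1252 code page (an assumption about this dataset, not something the source confirms). Under that assumption, every emoji in the range U+1F000 to U+1FFFF begins with the bytes F0 9F, which Windows-1252 renders as ð and Ÿ (ÿ after lowercasing). The helper name utf8_bytes is hypothetical.

```r
# Hypothetical helper: hex bytes of a string's UTF-8 encoding
utf8_bytes <- function(x) as.character(charToRaw(enc2utf8(x)))

folded_hands <- utf8_bytes("\U0001F64F")  # U+1F64F
open_mouth   <- utf8_bytes("\U0001F62E")  # U+1F62E, a different emoji

folded_hands
open_mouth

# Both emoji start with the same two bytes (f0 9f), so distinct emoji
# may collapse into a single "ðÿ" token once the text is tokenized
shares_prefix <- identical(folded_hands[1:2], open_mouth[1:2])
shares_prefix
```

If the assumption holds, the ðÿ count aggregates many different emoji rather than one, so it is better read as "an emoji was used" than as any particular facial expression.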
youtubecomments_tidy3 <-
youtubecomments_tidy %>%
filter(!word %in% my_stopwords)
youtubecomments_tidy3
## # A tibble: 3,554 x 2
## word n
## <chr> <int>
## 1 video 226
## 2 ðÿ 159
## 3 learning 106
## 4 videos 82
## 5 time 78
## 6 projects 75
## 7 project 72
## 8 python 60
## 9 lot 59
## 10 content 58
## # ... with 3,544 more rows
I visualized a word cloud in order to show additional words that occur frequently in the comments. This can highlight words that could be of interest for investigation. In order to have a neat and meaningful word cloud, I included only words that appear at least 5 times in the visualization.
set.seed(1234)
wordcloud(words = youtubecomments_tidy2$word, freq = youtubecomments_tidy2$n, min.freq = 5,
          max.words = 100, random.order = FALSE, rot.per = 0.40,
          colors = brewer.pal(8, "Dark2"))
youtubecomments_tidy2 %>%
  filter(n > 40) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(x = "Word Counts", y = NULL, title = "Most Frequent Words Appearing in YouTube Comments on Data Science") +
  theme_minimal()
The bar graph has revealed additional words such as “job”, “experience” and “career”, which indicate that some of the comments were geared towards careers and jobs in data science. Further validation against the video titles can be done to confirm whether this was one of the topics of interest.
After exploring the frequent words with the word cloud and bar chart, I loaded the AFINN and Bing lexicons in order to compute the sentiment of the YouTube comments. The use of two lexicons enhances the validity of the study, especially in answering the second research question.
# Getting the AFINN and Bing lexicons
afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
sentiment_afinn <- inner_join(youtubecomments_tidy2, afinn, by = "word")
sentiment_bing <- inner_join(youtubecomments_tidy2, bing, by = "word")
summary_bing <- count(sentiment_bing, sentiment, sort = TRUE)
summary_bing
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 positive 235
## 2 negative 191
summary_bing <- sentiment_bing %>%
count(sentiment, sort = TRUE) %>%
spread(sentiment, n) %>%
mutate(sentiment = positive - negative) %>%
mutate(lexicon = "bing") %>%
relocate(lexicon)
summary_bing
## # A tibble: 1 x 4
## lexicon negative positive sentiment
## <chr> <int> <int> <int>
## 1 bing 191 235 44
summary_afinn <- sentiment_afinn %>%
summarise(sentiment = sum(value)) %>%
mutate(lexicon = "AFINN") %>%
relocate(lexicon)
summary_afinn
## # A tibble: 1 x 2
## lexicon sentiment
## <chr> <dbl>
## 1 AFINN 161
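Because youtubecomments_tidy2 holds one row per distinct word, the joins above count each lexicon word once, regardless of how often it occurs in the comments. The following is a minimal sketch of the difference, using made-up toy tables (toy_counts and toy_lexicon are hypothetical stand-ins, not the study's data) and the wt argument of dplyr's count().

```r
library(dplyr)

# Toy word counts in the same shape as youtubecomments_tidy2 (word, n)
toy_counts <- tibble(
  word = c("love", "great", "hard", "bad"),
  n    = c(73L, 40L, 12L, 5L)
)

# Toy lexicon in the same shape as the Bing lexicon (word, sentiment)
toy_lexicon <- tibble(
  word      = c("love", "great", "hard", "bad"),
  sentiment = c("positive", "positive", "negative", "negative")
)

joined <- inner_join(toy_counts, toy_lexicon, by = "word")

# Counts distinct words per sentiment (one row per word)
by_type <- count(joined, sentiment)

# Counts word occurrences per sentiment by summing the n column
by_token <- count(joined, sentiment, wt = n)

by_type
by_token
```

In this toy case the distinct-word view is evenly split, while the occurrence-weighted view is strongly positive, which shows why the two bases should not be compared directly.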
commentsmodel_text <-
youtubecomments_text %>%
select(commentID, comments)
commentsmodel_text
## # A tibble: 1,000 x 2
## commentID comments
## <chr> <chr>
## 1 Ugw--6D9DvDgJa8QCIR4AaABAg "I think part of it is the pressure for the next ~
## 2 Ugw6_yWvhD50HEBzAP14AaABAg "This is awesome @Ken. Would love to see your not~
## 3 UgzdzXoiVlBqVrNl97N4AaABAg "... Well, this is super exciting. Looking forwar~
## 4 Ugz-IebRr6iAsvhN3i14AaABAg "I'm definitely struggling with imposter syndrome~
## 5 UgxlyRB9jLNyU31q_2R4AaABAg "Amazing job! I built my portfolio in just one da~
## 6 Ugw1XexXp_ejbiKcE554AaABAg "Bhai"
## 7 UgxcxJS1fot8DNPhkbN4AaABAg "I GIT IT NOW 😮"
## 8 UgypB01BNUcfm5B7OLV4AaABAg "That last part of your video was top tier. For m~
## 9 UgxaJdZmkfJhhw6VutB4AaABAg "I would like to attend Multi-Instance GPU (MIG) ~
## 10 UgytMZWP6oJkVz7BJvp4AaABAg "Congrats Ken!!! I loved this video! Especially t~
## # ... with 990 more rows
sentiment_afinn <- commentsmodel_text %>%
unnest_tokens(output = word,
input = comments) %>%
anti_join(stop_words, by = "word") %>%
filter(!word == "amp") %>%
inner_join(afinn, by = "word")
sentiment_afinn
## # A tibble: 1,083 x 3
## commentID word value
## <chr> <chr> <dbl>
## 1 Ugw--6D9DvDgJa8QCIR4AaABAg pressure -1
## 2 Ugw6_yWvhD50HEBzAP14AaABAg awesome 4
## 3 Ugw6_yWvhD50HEBzAP14AaABAg love 3
## 4 UgzdzXoiVlBqVrNl97N4AaABAg super 3
## 5 UgzdzXoiVlBqVrNl97N4AaABAg exciting 3
## 6 Ugz-IebRr6iAsvhN3i14AaABAg struggling -2
## 7 Ugz-IebRr6iAsvhN3i14AaABAg confident 2
## 8 UgxlyRB9jLNyU31q_2R4AaABAg amazing 4
## 9 UgypB01BNUcfm5B7OLV4AaABAg top 2
## 10 UgytMZWP6oJkVz7BJvp4AaABAg congrats 2
## # ... with 1,073 more rows
# Map fill to the AFINN score itself (a quoted "value" inside aes() is a constant)
ggplot(sentiment_afinn, aes(word, value, fill = value)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 1)) +
  ggtitle("Sentiment Value Spread")
afinn_score <- sentiment_afinn %>%
group_by(commentID) %>%
summarise(value = sum(value))
afinn_score
## # A tibble: 562 x 2
## commentID value
## <chr> <dbl>
## 1 Ugw--6D9DvDgJa8QCIR4AaABAg -1
## 2 Ugw-8wZXggcrNzq0KGN4AaABAg 3
## 3 Ugw-FIRSkQztZMy3HvZ4AaABAg 2
## 4 Ugw_JjLakkrb-IZ0E2J4AaABAg -2
## 5 Ugw0R1C-H85Af8N2PgB4AaABAg -2
## 6 Ugw1gPmaznecnK6-wtx4AaABAg 2
## 7 Ugw1p79nUVIPTOFcSsN4AaABAg 2
## 8 Ugw1UbdA1VzqOHUMcJ54AaABAg 7
## 9 Ugw1Wq07hI6jTXM6uBR4AaABAg 0
## 10 Ugw25hXDa9F9r47Hujp4AaABAg 7
## # ... with 552 more rows
afinn_sentiment <- afinn_score %>%
filter(value != 0) %>%
mutate(sentiment = if_else(value < 0, "negative", "positive"))
afinn_sentiment
## # A tibble: 544 x 3
## commentID value sentiment
## <chr> <dbl> <chr>
## 1 Ugw--6D9DvDgJa8QCIR4AaABAg -1 negative
## 2 Ugw-8wZXggcrNzq0KGN4AaABAg 3 positive
## 3 Ugw-FIRSkQztZMy3HvZ4AaABAg 2 positive
## 4 Ugw_JjLakkrb-IZ0E2J4AaABAg -2 negative
## 5 Ugw0R1C-H85Af8N2PgB4AaABAg -2 negative
## 6 Ugw1gPmaznecnK6-wtx4AaABAg 2 positive
## 7 Ugw1p79nUVIPTOFcSsN4AaABAg 2 positive
## 8 Ugw1UbdA1VzqOHUMcJ54AaABAg 7 positive
## 9 Ugw25hXDa9F9r47Hujp4AaABAg 7 positive
## 10 Ugw2F9yhanpO1SnzGSp4AaABAg 1 positive
## # ... with 534 more rows
afinn_ratio <- afinn_sentiment %>%
count(sentiment) %>%
spread(sentiment, n) %>%
mutate(ratio = negative/positive)
afinn_ratio
## # A tibble: 1 x 3
## negative positive ratio
## <int> <int> <dbl>
## 1 88 456 0.193
afinn_counts <- afinn_sentiment %>%
count(sentiment)
afinn_counts %>%
ggplot(aes(x="", y=n, fill=sentiment)) +
geom_bar(width = .6, stat = "identity") +
labs(title = "Sentiment Analysis of Data Science YouTube Comments",
subtitle = "Proportion of Positive and Negative Comments") +
coord_polar(theta = "y") +
theme_void()
summary_afinn2 <- sentiment_afinn %>%
filter(value != 0) %>%
mutate(sentiment = if_else(value < 0, "negative", "positive")) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "AFINN")
summary_bing2 <- sentiment_bing %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "bing")
#Binding rows
summary_sentiment <- bind_rows(summary_afinn2,
summary_bing2) %>%
arrange(method) %>%
relocate(method)
summary_sentiment
## # A tibble: 4 x 3
## method sentiment n
## <chr> <chr> <int>
## 1 AFINN positive 836
## 2 AFINN negative 247
## 3 bing positive 235
## 4 bing negative 191
#Total word counts
total_counts <- summary_sentiment %>%
group_by(method) %>%
summarise(total = sum(n))
sentiment_counts <- left_join(summary_sentiment, total_counts)
## Joining, by = "method"
#calculating the percentage of positive and negative sentiments
sentiment_percents <- sentiment_counts %>%
mutate(percent = n/total * 100)
sentiment_percents
## # A tibble: 4 x 5
## method sentiment n total percent
## <chr> <chr> <int> <int> <dbl>
## 1 AFINN positive 836 1083 77.2
## 2 AFINN negative 247 1083 22.8
## 3 bing positive 235 426 55.2
## 4 bing negative 191 426 44.8
#plotting the percentage by lexicons
sentiment_percents %>%
ggplot(aes(x = method, y = percent, fill=sentiment)) +
geom_bar(width = .8, stat = "identity") +
coord_flip() +
labs(title = "Sentiment Summary on YouTube Comments",
subtitle = "",
x = "Lexicon",
y = "Percentage of Words")
The topic modeling process started with creating a document-term matrix (DTM) that can be used with Latent Dirichlet Allocation (LDA) to model and explore the potential themes in the comments. In this case, each comment was used as a document. This is justified because each comment is independent and is given a unique ID in the dataset.
#cast Document Term Matrix
comments_tidy <- youtubecomments_data %>%
unnest_tokens(output = word, input = comments) %>%
anti_join(stop_words, by = "word")
comments_dtm <- comments_tidy %>%
count(commentID, word) %>%
cast_dtm(commentID, word, n)
comments_dtm
## <<DocumentTermMatrix (documents: 986, terms: 3568)>>
## Non-/sparse entries: 10666/3507382
## Sparsity : 100%
## Maximal term length: 818
## Weighting : term frequency (tf)
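The comments_dtm object contains a total of 986 documents and 3,568 terms. The sparsity, which is the proportion of cells in the document-term matrix that are zero, is reported as 100%. This is a rounded figure: 3,507,382 of the 3,518,048 cells (about 99.7%) are empty. The matrix has fewer than 1,000 documents presumably because some comments contain no terms once stop words are removed.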
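The reported sparsity can be checked by hand from the printed summary. The following is a short base R sketch of the arithmetic, using only the numbers shown in the output above; the variable names are illustrative.

```r
# Figures copied from the printed DocumentTermMatrix summary
non_sparse <- 10666      # cells with a non-zero count
sparse     <- 3507382    # cells equal to zero
n_docs     <- 986
n_terms    <- 3568

# Every cell is either sparse or non-sparse
total_cells <- n_docs * n_terms
stopifnot(non_sparse + sparse == total_cells)

# Proportion of empty cells, as a percentage
sparsity_pct <- 100 * sparse / total_cells
round(sparsity_pct, 1)   # 99.7
```

The printed summary rounds this to a whole percentage, which is why it appears as 100%.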
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
For this independent analysis, I opted to perform stemming in order to conflate words with related meanings and reduce the overall vocabulary. I therefore used the textProcessor function to preprocess the comments for further use with the structural topic modeling (STM) algorithm.
# textProcessor
temp <- textProcessor(youtubecomments_data$comments,
                      metadata = youtubecomments_data,
                      lowercase = TRUE,
                      removestopwords = TRUE,
                      removenumbers = TRUE,
                      removepunctuation = TRUE,
                      wordLengths = c(3, Inf),
                      stem = TRUE,
                      onlycharacter = FALSE,
                      striphtml = TRUE,
                      customstopwords = NULL)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
I then used the wordStem function to create a new column of stemmed words, and then computed the stem counts.
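The effect of stemming can be checked on a handful of words before it is applied to the full corpus. The following is a minimal sketch using SnowballC's wordStem with its default Porter stemmer; the four example words are taken from the comments shown in this report.

```r
library(SnowballC)

# A few words that occur in the comments
words <- c("videos", "posts", "piece", "science")

# wordStem() uses the Porter stemmer by default
stems <- wordStem(words)

data.frame(word = words, stem = stems)
```

Stems such as "piec" and "scienc" are not dictionary words, so any interpretation of stemmed output needs to map them back to the words they stand for.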
#stemming
stemmed_comments <- youtubecomments_data %>%
unnest_tokens(output = word, input = comments) %>%
anti_join(stop_words, by = "word") %>%
mutate(stem = wordStem(word))
stemmed_comments
## # A tibble: 11,996 x 7
## commentID replycount likecount vidid user_id word stem
## <chr> <int> <int> <chr> <chr> <chr> <chr>
## 1 Ugw--6D9DvDgJa8QCIR4AaABAg 1 2 2qVWur~ user_751 press~ pres~
## 2 Ugw--6D9DvDgJa8QCIR4AaABAg 1 2 2qVWur~ user_751 conte~ cont~
## 3 Ugw--6D9DvDgJa8QCIR4AaABAg 1 2 2qVWur~ user_751 piece piec
## 4 Ugw--6D9DvDgJa8QCIR4AaABAg 1 2 2qVWur~ user_751 â â
## 5 Ugw--6D9DvDgJa8QCIR4AaABAg 1 2 2qVWur~ user_751 data data
## 6 Ugw--6D9DvDgJa8QCIR4AaABAg 1 2 2qVWur~ user_751 scien~ scie~
## 7 Ugw--6D9DvDgJa8QCIR4AaABAg 1 2 2qVWur~ user_751 posts post
## 8 Ugw--6D9DvDgJa8QCIR4AaABAg 1 2 2qVWur~ user_751 videos video
## 9 Ugw--6D9DvDgJa8QCIR4AaABAg 1 2 2qVWur~ user_751 satur~ satur
## 10 Ugw--6D9DvDgJa8QCIR4AaABAg 1 2 2qVWur~ user_751 youâ youâ
## # ... with 11,986 more rows
## <<DocumentTermMatrix (documents: 986, terms: 2808)>>
## Non-/sparse entries: 10477/2758211
## Sparsity : 100%
## Maximal term length: 818
## Weighting : term frequency (tf)
## <<DocumentTermMatrix (documents: 986, terms: 3568)>>
## Non-/sparse entries: 10666/3507382
## Sparsity : 100%
## Maximal term length: 818
## Weighting : term frequency (tf)
# Counting stems (this chunk is not shown in the report; reconstructed from the output that follows)
stem_counts <- stemmed_comments %>%
  count(stem, sort = TRUE)
stem_counts
## # A tibble: 2,808 x 2
## stem n
## <chr> <int>
## 1 data 466
## 2 scienc 320
## 3 video 308
## 4 ken 248
## 5 learn 174
## 6 ðÿ 159
## 7 project 148
## 8 start 109
## 9 time 85
## 10 love 73
## # ... with 2,798 more rows
# The plot shows the ten most frequent stems, so the title says 10
barplot(stem_counts[1:10,]$n, las = 2, names.arg = stem_counts[1:10,]$stem,
        col = "lightblue", main = "Top 10 most frequent stems",
        ylab = "Word frequencies")
In the modeling phase, I deployed both the LDA and STM algorithms to investigate patterns in the comments. In selecting the value of K, I decided to go with 7 to provide concise themes for exploration.
comments_lda <- LDA(comments_dtm,
k = 7,
control = list(seed = 200)
)
comments_lda
## A LDA_VEM topic model with 7 topics.
I then fitted a structural topic model, starting by extracting the elements of the temp object.
#extracting elements from the temp object
docs <- temp$documents
meta <- temp$meta
vocab <- temp$vocab
comments_stm <- stm(documents=docs,
data=meta,
vocab=vocab,
K=7,
max.em.its=7,
verbose = FALSE)
comments_stm
## A topic model with 7 topics, 997 documents and a 2902 word dictionary.
plot.STM(comments_stm, n = 7)
I also used the LDAvis topic browser to explore the distribution of the emergent theme words.
toLDAvis(mod = comments_stm, docs = docs)
## Loading required namespace: servr
There seems to be overlap among topics 2, 3 and 5. This is potentially an indication that the selected value of K was not an optimal choice.
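One way to choose K less arbitrarily is to compare several candidate values with the ldatuning package, which is loaded above but not otherwise used in this report. The following is a hedged sketch rather than the procedure followed in this study: it runs on a small slice of the AssociatedPress matrix that ships with topicmodels as a stand-in for comments_dtm, and the range of K, the metrics and the seed are illustrative assumptions.

```r
library(topicmodels)
library(ldatuning)

# Stand-in corpus: a small slice of the AssociatedPress DTM bundled with
# topicmodels (in this study the input would be comments_dtm)
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:30, ]

# Fit one LDA model per candidate K and score each with two metrics
k_search <- FindTopicsNumber(
  dtm,
  topics   = 2:6,
  metrics  = c("CaoJuan2009", "Deveaud2014"),
  method   = "Gibbs",
  control  = list(seed = 200),
  mc.cores = 1L,
  verbose  = FALSE
)

k_search

# Plot the metrics against K
FindTopicsNumber_plot(k_search)
```

Lower CaoJuan2009 values and higher Deveaud2014 values point to better candidate values of K. The metrics are a guide only, and the resulting topics still need to be read for interpretability.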
I further examined β, the per-topic word probabilities, to see how likely words are to belong to each identified theme. I explored the top 5 words assigned to each topic and based further interpretations on them.
terms(comments_lda, 5)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7
## [1,] "ken" "ðÿ" "data" "3" "data" "data" "video"
## [2,] "video" "data" "science" "https" "ken" "science" "ken"
## [3,] "amazing" "science" "day" "2" "time" "learning" "projects"
## [4,] "content" "ken" "book" "model" "scientist" "video" "project"
## [5,] "analyst" "video" "video" "ken" "videos" "degree" "series"
tidy_lda <- tidy(comments_lda)
tidy_lda
## # A tibble: 24,976 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 â 1.20e- 57
## 2 2 â 1.50e- 12
## 3 3 â 2.49e- 3
## 4 4 â 4.28e-162
## 5 5 â 9.30e- 3
## 6 6 â 8.14e- 25
## 7 7 â 4.43e- 3
## 8 1 absorption 4.55e-235
## 9 2 absorption 6.22e-240
## 10 3 absorption 2.02e-234
## # ... with 24,966 more rows
top_terms <- tidy_lda %>%
group_by(topic) %>%
slice_max(beta, n = 5, with_ties = FALSE) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 5 terms in each identified topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
td_beta <- tidy(comments_lda)
td_gamma <- tidy(comments_lda, matrix = "gamma")
td_beta
## # A tibble: 24,976 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 â 1.20e- 57
## 2 2 â 1.50e- 12
## 3 3 â 2.49e- 3
## 4 4 â 4.28e-162
## 5 5 â 9.30e- 3
## 6 6 â 8.14e- 25
## 7 7 â 4.43e- 3
## 8 1 absorption 4.55e-235
## 9 2 absorption 6.22e-240
## 10 3 absorption 2.02e-234
## # ... with 24,966 more rows
td_gamma
## # A tibble: 6,902 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 Ugw--6D9DvDgJa8QCIR4AaABAg 1 0.00248
## 2 Ugw-8wZXggcrNzq0KGN4AaABAg 1 0.927
## 3 Ugw-B4KThtc_ghl7-9V4AaABAg 1 0.0223
## 4 Ugw-FIRSkQztZMy3HvZ4AaABAg 1 0.0157
## 5 Ugw_8vkPAP2N_8MBWU14AaABAg 1 0.0121
## 6 Ugw_JjLakkrb-IZ0E2J4AaABAg 1 0.984
## 7 Ugw_sTWhfSF1uiDCKoR4AaABAg 1 0.00985
## 8 Ugw0R1C-H85Af8N2PgB4AaABAg 1 0.826
## 9 Ugw1b5VErjT-yXwjGZ54AaABAg 1 0.241
## 10 Ugw1gPmaznecnK6-wtx4AaABAg 1 0.00831
## # ... with 6,892 more rows
top_terms <- td_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>%
  # unnest() requires the list-column to be named via `cols`
  unnest(cols = c(terms))
gamma_terms <- td_gamma %>%
group_by(topic) %>%
summarise(gamma = mean(gamma)) %>%
arrange(desc(gamma)) %>%
left_join(top_terms, by = "topic") %>%
mutate(topic = paste0("Topic ", topic),
topic = reorder(topic, gamma))
gamma_terms %>%
  select(topic, gamma, terms) %>%
  kable(digits = 3,
        col.names = c("Topic", "Expected Topic Proportion", "Top 5 terms"))
| Topic | Expected Topic Proportion | Top 5 terms |
|---|---|---|
| Topic 7 | 0.203 | video, ken, projects, project, series |
| Topic 2 | 0.154 | ðÿ, data, science, ken, video |
| Topic 1 | 0.153 | ken, video, amazing, content, analyst |
| Topic 6 | 0.148 | data, science, learning, video, degree |
| Topic 5 | 0.145 | data, ken, time, scientist, videos |
| Topic 3 | 0.099 | data, science, day, book, video |
| Topic 4 | 0.098 | 3, https, 2, model, ken |
The main purpose of this analysis was to deploy text mining approaches to investigate learner characteristics in the user comments of a data science channel. In order to answer the main research questions, two main approaches were used to explore and model a dataset of 1,000 observations containing user comments from the Ken Jee YouTube channel. The first phase of the study involved performing sentiment analysis to establish the overall polarity of the comments. The second phase involved using topic modeling to identify key topics discussed in the user comments. The presentation of the findings is guided by the driving research questions as follows.
From the exploration of the comments through the word cloud and word frequency tables, the most common keywords are “learning”, “time”, “projects” and “python”. This suggests that most of the comments in our sample were made on project-based videos. To validate this, I scanned the channel and explored the videos with the most views. The author posts videos about his data science projects and the corresponding updates as he progresses. Another insightful keyword is “python”, which at a high level suggests that many users discussed Python as the language used in the posted projects and perhaps in their own learning and projects. Another interesting observation is the use of emoticons, which on social media platforms are a common way of communicating and expressing sentiment in place of formal language. The symbol ðŸ, which is one of the most prevalent terms, represents an emoticon. However, further analysis is needed to examine the encoding and determine whether it represents a single emoticon or a combination of emoticons. Additional observations highlight words such as “job”, “career” and “experience”, which indicate that discussions on careers are prevalent in the comments.
In analyzing the sentiment of users on the videos, I deployed the AFINN and Bing lexicons to establish the polarity of the sampled comments. I prefer to use more than one lexicon in order to validate the results obtained from either one. The findings indicate that the overall sentiment in these videos is positive according to both lexicons. With AFINN, 77.2% of the sentiment-bearing words were positive and 22.8% were negative. With Bing the split is much closer, with 55.2% positive and 44.8% negative. The two sets of figures are not computed on the same basis, however: the AFINN percentages count every occurrence of a word, whereas the Bing percentages count each distinct word once, which may partly explain the narrower Bing margin. Overall, the sentiment of the user comments on these data science videos is positive.
Topic modeling was the most challenging analysis in this study. In the first phase of the analysis I chose K=5; the resulting themes were not concise and many words overlapped between the topics. This was an indication that the choice of K was perhaps not optimal for the analysis. In the second phase I chose K=7, and the results improved slightly, but the themes still overlapped and their patterns remained unclear. Topic 7 was very concise and reflected a theme based on projects. Topic 3 captured curiosity around the data with words such as “think”, “look” and “see”. Topic 4 leans towards feedback about the content with words such as “like”, “thanks” and “learning”. From the LDAvis browser it can be observed that there is substantial overlap among Topics 1, 3, 5 and 6, which makes it challenging to structure a meaningful interpretation. In this case I think it is important to understand the corpus and its context in order to improve the interpretations.
These interpretations of the findings reveal that, generally, learners are positive about data science videos posted on YouTube, and this has been validated by both the keywords and the sentiment analysis. Secondly, project-based learning videos seem to stir more engagement and discussion in these comments. The topic modelling pointed this out, with Topic 7 being very distinct compared to the other identified themes. Thirdly, these approaches suggest that what is reflected in these comments is surface learning rather than deep learning. Surface learning relates to demonstrating some level of learning and appreciating new facts, while deep learning relates to the ability to apply new facts in context (Biggs, 1988).
This pilot study explored very basic sentiment analysis and topic modeling to get a preliminary idea of the perceptions, polarity and salient themes that can be found in YouTube comments. Additionally, the source corpus for this analysis is limited to only 1,000 observations collected from a single YouTube channel. I believe a more diverse corpus would have provided more insightful analysis and findings that could enhance the validity of the study. The generalization and application of these results to other contexts are also limited, given the nature of the dataset and the context of the study.
This analysis made use of the Ken Jee YouTube data, which has been made openly available and licensed for use through Kaggle. The author of the dataset has acknowledged that the dataset was pre-cleaned to omit personal names and unique identifiers of the channel’s users for privacy purposes. The same protocol was observed and adhered to during the wrangling, analysis, modeling and reporting of the findings of this report.
Biggs, J. B. (1988). Assessing student approaches to learning. Australian Psychologist, 23(2), 197-206.
Clifton, A., & Mann, C. (2011). Can YouTube enhance student nurse learning? Nurse Education Today, 31(4), 311-313.
Dubovi, I., & Tabak, I. (2020). An empirical analysis of knowledge co-construction in YouTube comments. Computers & Education, 156, 103939.
Greenhow, C., & Lewin, C. (2016). Social media and education: Reconceptualizing the boundaries of formal and informal learning. Learning, Media and Technology, 41(1), 6-30.
Muhammad, A. N., Bukhori, S., & Pandunata, P. (2019, October). Sentiment analysis of positive and negative of YouTube comments using naïve Bayes–support vector machine (NBSVM) classifier. In 2019 International Conference on Computer Science, Information Technology, and Electrical Engineering (ICOMITEE) (pp. 199-205). IEEE.
Snelson, C. (2011). YouTube across the disciplines: A review of the literature. MERLOT Journal of Online Learning and Teaching.