##1 Preparation
To do the task, I will use the data from homework 3 (DZ 3). To begin with, I will load the quanteda library for analysis and create a corpus from the comments.
library(readr)
youtubeData <- read_csv("youtubeDataLab3.csv")
## New names:
## Rows: 131 Columns: 13
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (8): Comment, AuthorDisplayName, AuthorProfileImageUrl, AuthorChannelUr... dbl
## (3): ...1, ReplyCount, LikeCount dttm (2): PublishedAt, UpdatedAt
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
library("quanteda")
## Package version: 4.1.0
## Unicode version: 13.0
## ICU version: 66.1
## Parallel computing: disabled
## See https://quanteda.io for tutorials and examples.
# load the quanteda library for analysis
corp_yout <- corpus(as.character(youtubeData$Comment))
# build a new corpus from the texts
To answer this question, I will use the which.max function. First, I will use the docvars function to add “Publication_date”, “Likecount” and “Replycount” to the corp_yout corpus. Next, using the summary function, I will build a summary table of the corpus and store it in the tokeninfo variable. Finally, using the which.max function, I will find the comment with the most tokens.
docvars(corp_yout, "Publication_date") <- as.Date(youtubeData$PublishedAt)
# adding "Publication_date" from youtubeData in the "date" format
docvars(corp_yout, "Likecount") <- as.numeric(youtubeData$LikeCount)
# adding "Likecount" from youtubeData in the "number" format
docvars(corp_yout, "Replycount") <- as.numeric(youtubeData$ReplyCount)
# adding "Replycount" from youtubeData in the "number" format
tokeninfo <- summary(corp_yout)
# creating a summary table of the corpus and storing it in tokeninfo (by default summary() covers only the first 100 documents)
tokeninfo[which.max(tokeninfo$Tokens), ]
## Text Types Tokens Sentences Publication_date Likecount Replycount
## 64 text64 40 52 4 2024-10-06 2 0
#finding information about the text that has the most tokens using the which.max function
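The longest comment in the corpus is number 64 (text64), as the summary row above shows. It can be printed with the command:
as.character(corp_yout)[64]
# index 64 is the longest comment; index 62 would return the three-token comment "i miss aguero"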
The comment contains 52 tokens, was published on 2024-10-06, and has 2 likes and 0 replies.
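As a minimal sketch of the lookup above, the toy data frame below stands in for tokeninfo. Its three rows are copied from the summary output of this report; the names toy and idx are hypothetical. It shows that which.max() returns the position of the first maximum, which is then used to index the table.

```r
# Toy stand-in for tokeninfo: three rows copied from the summary output.
toy <- data.frame(
  Text   = c("text56", "text62", "text64"),
  Tokens = c(45, 3, 52)
)

# which.max() returns the position of the first maximum, not the value itself.
idx <- which.max(toy$Tokens)

# Use that position to pull the whole row and the document name.
toy[idx, ]
toy$Text[idx]
```

In the toy table the position is 3, while in the full table the row position coincides with the document number, which is why the document name is the safer thing to read off.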
For analysis, I chose the word “Mbappe”, the surname of the leader of the French national football team, who scored three goals in the World Cup final. Let me remind you that I analyzed the comments under a short review of this match. Using the kwic function, I will find all occurrences of the word “Mbappe” in the tokenized corpus with a context window of 5 tokens. The match is case-insensitive by default, so “mbappe” and “MBAPPE” are found as well.
kwic(tokens(corp_yout), pattern ="Mbappe", window = 5)
## Keyword-in-context with 18 matches.
## [text8, 1]                                 | mbappe | can cry a river in
## [text11, 3]                      I support | Mbappe | here tho, He&
## [text34, 1]                                | Mbappe | world cup version is crazy
## [text38, 1]                                | Mbappe | almost broke the script ☠️
## [text47, 18]                       ** VS** | MBAPPE | *
## [text49, 1]                                | Mbappe | , Thuram...
## [text49, 7]                    , Thuram... | MBAPPE | !!!!!
## [text51, 5]             for people who say | Mbappe | carried France in the final
## [text51, 17]  just know if Griezman played | Mbappe | wouldve had just 1 goal
## [text52, 1]                                | Mbappe | lose the world cup 2022
## [text54, 21]     having 4v3 with coman and | Mbappe | and referee decided to not
## [text67, 10]             cup, but that day | Mbappe | was the man, who
## [text69, 5]          France vs Argentina ❌ | Mbappe | vs Argentina ✅
## [text85, 7]   think Kolo muani ballpass to | mbappe | 100% goal ❤ 😮
## [text95, 4]                  In my opinion | Mbappe | shocked the world by scoring
## [text96, 1]                                | mbappe | is a certified truck,
## [text103, 9]       trying so hard not only | mbappe |
## [text107, 13]  praying in 10 religions for | mbappe | to score, in the
# find all occurrences of the word "Mbappe" in the tokenized corpus with a window of 5 tokens
I received 18 matches in 16 different comments (text49 and text51 each contain the word twice), and for each match I can see the context of up to 5 tokens before and after the word “Mbappe”.
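To make the idea of a context window more concrete, here is a rough base-R sketch of a keyword-in-context lookup. The helper kwic_sketch is hypothetical and is not part of quanteda; it only illustrates the windowing logic on one hand-tokenized comment, whose tokens are taken from the text95 line of the output above.

```r
# Hypothetical helper, not part of quanteda: a bare-bones keyword-in-context lookup.
kwic_sketch <- function(toks, pattern, window = 5) {
  # Case-insensitive exact match on single tokens.
  hits <- which(tolower(toks) == tolower(pattern))
  n <- length(toks)
  # Up to `window` tokens before each hit (empty if the hit is the first token).
  pre <- vapply(hits, function(i) {
    if (i == 1) "" else paste(toks[max(1, i - window):(i - 1)], collapse = " ")
  }, character(1))
  # Up to `window` tokens after each hit (empty if the hit is the last token).
  post <- vapply(hits, function(i) {
    if (i == n) "" else paste(toks[(i + 1):min(n, i + window)], collapse = " ")
  }, character(1))
  data.frame(position = hits, pre = pre, keyword = toks[hits], post = post)
}

# Tokens of the context shown for text95 in the kwic output.
toks <- c("In", "my", "opinion", "Mbappe", "shocked", "the", "world", "by", "scoring")
res <- kwic_sketch(toks, "mbappe", window = 2)
res
```

With window = 2 the sketch keeps only “my opinion” before and “shocked the” after the keyword, which is the same trimming that kwic applies with window = 5.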
To answer this question, I will use the dfm function as well as topfeatures. First, I create the dfmat_yout3 variable: I tokenize the corpus, remove punctuation marks, and use the stopwords command to remove common English words that carry no meaning of their own. Then I convert the result to a document-feature matrix using the dfm function. Finally, I output the 10 most frequent tokens using the topfeatures function.
dfmat_yout3 <- tokens(corp_yout, remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm()
# first, I tokenize the corpus and remove punctuation marks, then use stopwords("en") to remove English stopwords, and finally convert the result to a document-feature matrix with dfm(); the result is stored in dfmat_yout3
topfeatures(dfmat_yout3, 10)
## ❤ mbappe 😂 🇦🇷 world match cup 2024 goal france
## 24 18 17 16 12 11 10 9 9 8
#At the end, I output the 10 most popular tokens using the topfeatures function.
I think these tokens lead because they all describe the World Cup final. “mbappe” was one of the main heroes of the match; “🇦🇷” is the flag of Argentina, the winner of the World Cup; “match” and “goal” are general football words; “world” and “cup” show that commenters often mentioned that this was the World Cup final; “france” is the other finalist. “2024” appears because the most recent comments got into the dataset, so commenters probably shared the fact that they are still rewatching this final. Emojis express emotions: the heart can mean that the viewer liked the match very much, and the laughing emoji can be mockery of some players.
To answer this question, I will apply the textstat_lexdiv function from the quanteda.textstats library to the tokens of the corp_yout corpus:
library("quanteda.textstats")
tstat_lexdiv <- textstat_lexdiv(tokens(corp_yout))
tstat_lexdiv
## document TTR
## 1 text1 1.0000000
## 2 text2 0.8000000
## 3 text3 1.0000000
## 4 text4 1.0000000
## 5 text5 1.0000000
## 6 text6 1.0000000
## 7 text7 0.8333333
## 8 text8 1.0000000
## 9 text9 1.0000000
## 10 text10 1.0000000
## 11 text11 1.0000000
## 12 text12 0.2727273
## 13 text13 1.0000000
## 14 text14 1.0000000
## 15 text15 1.0000000
## 16 text16 1.0000000
## 17 text17 1.0000000
## 18 text18 1.0000000
## 19 text19 0.9411765
## 20 text20 1.0000000
## 21 text21 0.9285714
## 22 text22 1.0000000
## 23 text23 1.0000000
## 24 text24 1.0000000
## 25 text25 1.0000000
## 26 text26 1.0000000
## 27 text27 0.3333333
## 28 text28 1.0000000
## 29 text29 1.0000000
## 30 text30 1.0000000
## 31 text31 1.0000000
## 32 text32 0.4000000
## 33 text33 1.0000000
## 34 text34 1.0000000
## 35 text35 1.0000000
## 36 text36 1.0000000
## 37 text37 1.0000000
## 38 text38 1.0000000
## 39 text39 1.0000000
## 40 text40 1.0000000
## 41 text41 NA
## 42 text42 1.0000000
## 43 text43 0.8571429
## 44 text44 1.0000000
## 45 text45 1.0000000
## 46 text46 1.0000000
## 47 text47 0.6666667
## 48 text48 1.0000000
## 49 text49 0.8000000
## 50 text50 1.0000000
## 51 text51 0.9047619
## 52 text52 1.0000000
## 53 text53 0.9411765
## 54 text54 0.8611111
## 55 text55 1.0000000
## 56 text56 0.8888889
## 57 text57 1.0000000
## 58 text58 1.0000000
## 59 text59 NA
## 60 text60 1.0000000
## 61 text61 0.9230769
## 62 text62 1.0000000
## 63 text63 1.0000000
## 64 text64 0.9142857
## 65 text65 0.9090909
## 66 text66 1.0000000
## 67 text67 0.7777778
## 68 text68 1.0000000
## 69 text69 0.6666667
## 70 text70 1.0000000
## 71 text71 1.0000000
## 72 text72 1.0000000
## 73 text73 0.9473684
## 74 text74 0.8750000
## 75 text75 0.9166667
## 76 text76 1.0000000
## 77 text77 0.7500000
## 78 text78 1.0000000
## 79 text79 1.0000000
## 80 text80 0.8750000
## 81 text81 1.0000000
## 82 text82 1.0000000
## 83 text83 0.9285714
## 84 text84 0.9166667
## 85 text85 1.0000000
## 86 text86 1.0000000
## 87 text87 1.0000000
## 88 text88 0.9090909
## 89 text89 1.0000000
## 90 text90 NA
## 91 text91 1.0000000
## 92 text92 NA
## 93 text93 1.0000000
## 94 text94 0.8571429
## 95 text95 1.0000000
## 96 text96 0.9285714
## 97 text97 1.0000000
## 98 text98 0.8000000
## 99 text99 1.0000000
## 100 text100 1.0000000
## 101 text101 1.0000000
## 102 text102 1.0000000
## 103 text103 1.0000000
## 104 text104 NA
## 105 text105 0.6470588
## 106 text106 0.6666667
## 107 text107 0.8800000
## 108 text108 1.0000000
## 109 text109 1.0000000
## 110 text110 1.0000000
## 111 text111 1.0000000
## 112 text112 1.0000000
## 113 text113 1.0000000
## 114 text114 1.0000000
## 115 text115 1.0000000
## 116 text116 1.0000000
## 117 text117 1.0000000
## 118 text118 1.0000000
## 119 text119 1.0000000
## 120 text120 1.0000000
## 121 text121 1.0000000
## 122 text122 1.0000000
## 123 text123 1.0000000
## 124 text124 0.8333333
## 125 text125 1.0000000
## 126 text126 0.9090909
## 127 text127 1.0000000
## 128 text128 1.0000000
## 129 text129 1.0000000
## 130 text130 1.0000000
## 131 text131 1.0000000
The higher the TTR (type-token ratio), the greater the lexical diversity of the comment. Many comments reach the maximum: most TTR values are exactly 1, largely because short comments rarely repeat a word. To highlight the most diverse comment, I suggest multiplying the TTR by the number of tokens and choosing the comment with the highest score. This way I will get the longest comment that also has a high TTR value.
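As a simplified illustration of what the coefficient measures, the base-R function below computes TTR as the number of unique tokens divided by the total number of tokens. The function name ttr is hypothetical, and the sketch ignores the preprocessing that textstat_lexdiv applies, so its values can differ from the table above on comments with punctuation or emojis.

```r
# Type-token ratio for one comment: unique tokens divided by total tokens.
ttr <- function(toks) {
  toks <- tolower(toks)  # treat "Goal" and "goal" as the same type
  length(unique(toks)) / length(toks)
}

ttr(c("i", "miss", "aguero"))          # 3 types / 3 tokens
ttr(c("goal", "goal", "goal", "wow"))  # 2 types / 4 tokens (made-up comment)
```

The first call uses the three-token comment text62 and gives 1, while the made-up second comment gives 0.5 because one word is repeated three times.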
docvars(corp_yout, "TTR") <- as.numeric(tstat_lexdiv$TTR)
# adding "TTR" from tstat_lexdiv as a numeric document variable
summary(corp_yout, n = 126)
## Corpus consisting of 131 documents, showing 126 documents:
##
## Text Types Tokens Sentences Publication_date Likecount Replycount TTR
## text1 7 8 1 2024-10-19 1 0 1.0000000
## text2 5 7 1 2024-10-19 0 0 0.8000000
## text3 5 5 1 2024-10-18 2 0 1.0000000
## text4 7 7 1 2024-10-18 1 0 1.0000000
## text5 2 2 1 2024-10-18 1 0 1.0000000
## text6 6 8 1 2024-10-18 1 0 1.0000000
## text7 13 25 2 2024-10-18 0 0 0.8333333
## text8 9 9 1 2024-10-18 0 0 1.0000000
## text9 12 13 1 2024-10-18 0 0 1.0000000
## text10 18 22 1 2024-10-17 0 0 1.0000000
## text11 13 13 1 2024-10-16 0 1 1.0000000
## text12 4 13 1 2024-10-16 1 0 0.2727273
## text13 10 10 1 2024-10-16 2 0 1.0000000
## text14 8 8 1 2024-10-16 0 0 1.0000000
## text15 7 7 1 2024-10-16 1 0 1.0000000
## text16 8 8 1 2024-10-15 1 0 1.0000000
## text17 3 3 1 2024-10-15 0 0 1.0000000
## text18 14 17 2 2024-10-15 0 0 1.0000000
## text19 16 17 1 2024-10-14 1 0 0.9411765
## text20 7 7 1 2024-10-14 1 0 1.0000000
## text21 17 18 2 2024-10-14 0 0 0.9285714
## text22 30 33 3 2024-10-14 0 0 1.0000000
## text23 9 12 2 2024-10-13 0 0 1.0000000
## text24 4 4 1 2024-10-13 0 1 1.0000000
## text25 4 4 1 2024-10-13 0 0 1.0000000
## text26 6 6 1 2024-10-13 0 0 1.0000000
## text27 4 8 1 2024-10-13 0 0 0.3333333
## text28 6 6 1 2024-10-13 0 0 1.0000000
## text29 1 1 1 2024-10-13 0 0 1.0000000
## text30 6 6 1 2024-10-12 0 1 1.0000000
## text31 1 1 1 2024-10-12 0 0 1.0000000
## text32 2 5 1 2024-10-12 0 0 0.4000000
## text33 2 2 1 2024-10-12 0 0 1.0000000
## text34 7 7 1 2024-10-12 1 0 1.0000000
## text35 4 4 1 2024-10-12 0 0 1.0000000
## text36 13 13 1 2024-10-11 7 0 1.0000000
## text37 4 4 1 2024-10-11 0 0 1.0000000
## text38 9 9 1 2024-10-11 2 0 1.0000000
## text39 9 9 1 2024-10-11 1 0 1.0000000
## text40 6 6 1 2024-10-11 1 0 1.0000000
## text41 1 1 1 2024-10-11 0 0 NA
## text42 9 9 1 2024-10-11 1 0 1.0000000
## text43 7 8 1 2024-10-11 1 0 0.8571429
## text44 8 8 1 2024-10-11 0 0 1.0000000
## text45 13 13 1 2024-10-11 0 0 1.0000000
## text46 4 4 1 2024-10-10 0 0 1.0000000
## text47 9 19 1 2024-10-10 1 0 0.6666667
## text48 8 8 1 2024-10-10 0 0 1.0000000
## text49 8 18 3 2024-10-10 2 0 0.8000000
## text50 6 8 1 2024-10-10 1 0 1.0000000
## text51 20 22 1 2024-10-09 0 0 0.9047619
## text52 7 8 1 2024-10-09 0 0 1.0000000
## text53 18 19 1 2024-10-09 0 0 0.9411765
## text54 33 37 2 2024-10-09 0 1 0.8611111
## text55 3 3 1 2024-10-08 0 0 1.0000000
## text56 41 45 2 2024-10-08 1 0 0.8888889
## text57 7 7 1 2024-10-08 0 0 1.0000000
## text58 2 2 1 2024-10-08 1 0 1.0000000
## text59 1 3 1 2024-10-07 0 0 NA
## text60 11 11 1 2024-10-07 1 1 1.0000000
## text61 13 13 1 2024-10-07 1 0 0.9230769
## text62 3 3 1 2024-10-07 0 0 1.0000000
## text63 4 4 2 2024-10-06 0 0 1.0000000
## text64 40 52 4 2024-10-06 2 0 0.9142857
## text65 12 12 1 2024-10-06 0 0 0.9090909
## text66 2 2 1 2024-10-06 0 0 1.0000000
## text67 16 22 1 2024-10-06 0 1 0.7777778
## text68 6 6 1 2024-10-06 1 0 1.0000000
## text69 6 8 1 2024-10-05 1 0 0.6666667
## text70 12 14 1 2024-10-05 0 1 1.0000000
## text71 6 6 1 2024-10-04 58 9 1.0000000
## text72 2 2 1 2024-10-04 1 6 1.0000000
## text73 18 19 1 2024-10-04 7 0 0.9473684
## text74 7 8 1 2024-10-04 0 0 0.8750000
## text75 12 12 1 2024-10-03 6 1 0.9166667
## text76 3 3 1 2024-10-02 3 0 1.0000000
## text77 5 6 1 2024-10-02 0 0 0.7500000
## text78 12 12 1 2024-10-02 1 0 1.0000000
## text79 21 22 1 2024-10-01 0 1 1.0000000
## text80 8 8 1 2024-10-01 2 0 0.8750000
## text81 7 7 1 2024-10-01 9 0 1.0000000
## text82 2 3 1 2024-09-30 4 0 1.0000000
## text83 17 18 1 2024-09-30 2 0 0.9285714
## text84 11 12 1 2024-09-30 2 0 0.9166667
## text85 12 13 1 2024-09-29 0 0 1.0000000
## text86 9 9 1 2024-09-29 1 0 1.0000000
## text87 2 2 1 2024-09-28 1 0 1.0000000
## text88 10 11 1 2024-09-28 2 0 0.9090909
## text89 17 17 1 2024-09-28 3 1 1.0000000
## text90 1 1 1 2024-09-28 1 0 NA
## text91 8 8 1 2024-09-28 3 0 1.0000000
## text92 2 3 1 2024-09-27 1 0 NA
## text93 7 8 1 2024-09-27 1 0 1.0000000
## text94 10 13 1 2024-09-26 1 0 0.8571429
## text95 11 11 1 2024-09-26 0 2 1.0000000
## text96 15 16 1 2024-09-25 0 0 0.9285714
## text97 3 3 1 2024-09-24 1 0 1.0000000
## text98 5 6 1 2024-09-24 0 0 0.8000000
## text99 6 9 1 2024-09-24 1 4 1.0000000
## text100 6 6 1 2024-09-24 1 0 1.0000000
## text101 6 6 1 2024-10-17 0 0 1.0000000
## text102 3 3 1 2024-10-13 0 0 1.0000000
## text103 9 9 1 2024-10-14 0 0 1.0000000
## text104 1 1 1 2024-10-09 0 0 NA
## text105 11 17 1 2024-10-09 2 0 0.6470588
## text106 5 7 1 2024-10-06 2 0 0.6666667
## text107 27 33 1 2024-10-09 1 0 0.8800000
## text108 1 1 1 2024-10-18 0 0 1.0000000
## text109 1 1 1 2024-10-18 0 0 1.0000000
## text110 7 7 1 2024-10-19 1 0 1.0000000
## text111 1 1 1 2024-10-19 0 0 1.0000000
## text112 3 3 1 2024-10-19 0 0 1.0000000
## text113 4 4 1 2024-10-19 1 0 1.0000000
## text114 2 3 1 2024-10-19 0 0 1.0000000
## text115 1 1 1 2024-10-19 0 0 1.0000000
## text116 2 2 1 2024-10-19 0 0 1.0000000
## text117 21 22 1 2024-10-04 5 0 1.0000000
## text118 6 6 2 2024-10-05 1 0 1.0000000
## text119 3 3 1 2024-10-09 3 0 1.0000000
## text120 3 3 1 2024-10-09 0 0 1.0000000
## text121 8 8 1 2024-10-09 3 0 1.0000000
## text122 13 13 1 2024-10-09 1 0 1.0000000
## text123 3 3 1 2024-10-03 0 0 1.0000000
## text124 6 10 1 2024-10-13 0 0 0.8333333
## text125 4 4 1 2024-09-28 0 0 1.0000000
## text126 12 17 2 2024-10-06 0 0 0.9090909
# to see what we got
# next, I will export this table to Excel, multiply the TTR column by the Tokens column, and find the comment with the highest value
The largest value came from comment 64 with a value of 47.5429 (52 × 0.9142857), which is also the longest comment found in 2.a.
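The same multiplication can also be sketched directly in R instead of a spreadsheet. The data frame below is an assumption for illustration: it holds only the five longest comments from the summary table above, with their Tokens and TTR values copied from that output, and the names scores, Score and best are hypothetical.

```r
# Rows copied from the summary output above (the five longest comments shown there).
scores <- data.frame(
  Text   = c("text22", "text54", "text56", "text64", "text107"),
  Tokens = c(33, 37, 45, 52, 33),
  TTR    = c(1.0000000, 0.8611111, 0.8888889, 0.9142857, 0.8800000)
)

# Score used in this report: lexical diversity weighted by comment length.
scores$Score <- scores$TTR * scores$Tokens

# Comment with the highest score (which.max() skips NA values).
best <- scores[which.max(scores$Score), ]
best
```

On the full corpus, the same two columns could be taken from the data frame returned by summary(corp_yout, n = ndoc(corp_yout)) once the TTR document variable has been added.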
as.character(corp_yout)[64]
# to see the comment (index 62 would print the three-token comment "i miss aguero")
summary(corp_yout, n = 126)[which.max(tokeninfo$Tokens), ]
## Text Types Tokens Sentences Publication_date Likecount Replycount
## 64 text64 40 52 4 2024-10-06 2 0
## TTR
## 64 0.9142857
# to see the characteristics
The comment contains 52 tokens, was published on 2024-10-06, has 2 likes and 0 replies, and has TTR = 0.9142857.
To create a word cloud, I will use the “quanteda.textplots” library and the textplot_wordcloud command, which I will apply to the dfmat_yout3 variable.
set.seed(100) # setting a seed is important for functions
# that can give different results between runs because of randomness
library("quanteda.textplots")
textplot_wordcloud(dfmat_yout3, min_count = 5,
                   random_order = FALSE,
                   rotation = 0.25,
                   color = RColorBrewer::brewer.pal(8, "Dark2"))
This is a word cloud. It visualizes the most common tokens in the
comments: the larger a word is and the closer it is to the center, the
more often it occurs. I interpreted most of the words in paragraph 2.c.
The cloud also shows new words that were not included in the top 10.
To build a semantic network, I will use the textplot_network command
tag_fcm <- fcm(dfmat_yout3)
feat <- names(topfeatures(dfmat_yout3, 30))
set.seed(100)
fcm_select(tag_fcm, pattern = feat) %>%
  textplot_network(min_freq = 0.9,
                   edge_color = "violet",
                   edge_alpha = 0.8,
                   edge_size = 1,
                   vertex_color = "grey",
                   vertex_alpha = 0.5,
                   vertex_labelcolor = "black",
                   offset = NULL)
This is a semantic network of words. It visualizes which words occur
together in the comments: the thicker the line, the stronger the
connection between the words. Take the word analyzed earlier as an
example. The word “mbappe” is most often found in combination with the
words “world”, “goal”, “france” and “vs”. From this, we can assume that
the commenters discussed this player's goals and also noted his “vs”
confrontation against Argentina. The word “france” is mentioned because
it refers to his team, and the word “world” refers to the World Cup
final. It is also interesting to look at the cluster of words “final”,
“match”, “greatest”, “heart”, “world”, “cup”, “ever”, “2022”. These
words can be combined into a single sentence, which generally reflects
the commenters' delight at the 2022 final.
To do this, I will first use the kwic function to find the word “mbappe” and then visualize the result using the textplot_xray() function.
kwic(tokens(corp_yout), pattern = "mbappe") %>% textplot_xray()
This is a lexical dispersion plot. It shows in which part of each
comment the word “mbappe” occurs. According to it, the word is more
often mentioned in the first half of the comment. However, in comments
103 and 47, this word occurs almost at the very end of the comment
(token 9 of 9 and token 18 of 19).
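The position of the keyword inside a comment can also be expressed as a number. The sketch below is an illustration under assumptions: it uses only five of the matches, with the keyword positions copied from the kwic output and the comment lengths copied from the summary table; the names pos and Relative are hypothetical.

```r
# Keyword positions from the kwic output and comment lengths from the summary table.
pos <- data.frame(
  Text     = c("text8", "text11", "text47", "text95", "text103"),
  Position = c(1, 3, 18, 4, 9),
  Tokens   = c(9, 13, 19, 11, 9)
)

# Relative position: values near 0 are the start of the comment, 1 is its last token.
pos$Relative <- pos$Position / pos$Tokens
pos
```

In this small selection, three of the five matches fall in the first half of their comment, while text47 and text103 have the keyword at or next to the last token.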