Homework 4 (DZ4)

1. Preparation

To complete the task, I will use the data from Homework 3 (DZ 3). First, I will read the data, load the quanteda library for analysis, and build a corpus from the comments.

library(readr)
youtubeData <- read_csv("youtubeDataLab3.csv")

## New names:
## Rows: 131 Columns: 13
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (8): Comment, AuthorDisplayName, AuthorProfileImageUrl, AuthorChannelUr... dbl
## (3): ...1, ReplyCount, LikeCount dttm (2): PublishedAt, UpdatedAt
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

library("quanteda")

## Package version: 4.1.0
## Unicode version: 13.0
## ICU version: 66.1
## Parallel computing: disabled
## See https://quanteda.io for tutorials and examples.

# load the quanteda library for analysis
corp_yout <- corpus(as.character(youtubeData$Comment))
# build a new corpus from the texts

2.a Which document in this dataset has the maximum number of tokens?

To answer this question, I will use the which.max function. First, I use the docvars function to add “Publication_date”, “Likecount” and “Replycount” to the corp_yout corpus. Next, using the summary function, I create a summary table of the corpus and store it in the tokeninfo variable. Finally, using which.max, I find the comment with the most tokens.

docvars(corp_yout, "Publication_date") <- as.Date(youtubeData$PublishedAt)
# adding "Publication_date" from youtubeData in the "date" format
docvars(corp_yout, "Likecount") <- as.numeric(youtubeData$LikeCount)
# adding "Likecount" from youtubeData in the "number" format
docvars(corp_yout, "Replycount") <- as.numeric(youtubeData$ReplyCount)
# adding "Replycount" from youtubeData in the "number" format
tokeninfo <- summary(corp_yout)
#creating a summary table from the data corpus and storing it in tokeninfo
tokeninfo[which.max(tokeninfo$Tokens), ]

##      Text Types Tokens Sentences Publication_date Likecount Replycount
## 64 text64    40     52         4       2024-10-06         2          0

#finding information about the text that has the most tokens using the which.max function

The biggest comment in the data corpus is number 64. It can be output using the command:

as.character(corp_yout)[64]

The comment contains 52 tokens, was published on 2024-10-06, and has 2 likes and 0 replies.
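One caveat with this approach: as far as I know, summary() on a corpus reports only the first 100 documents unless n is set, so tokeninfo may not cover all 131 comments. Below is a minimal sketch of an alternative that counts tokens for every document with ntoken(). The toy corpus and its texts are made up for illustration and are not the real comments.

```r
library(quanteda)

# Toy corpus (hypothetical texts, not the real YouTube comments)
toy_corp <- corpus(c(
  text1 = "i miss aguero",
  text2 = "what a final what a match what a night",
  text3 = "goat"
))

# Count tokens for every document, without the row limit of summary()
n_tok <- ntoken(tokens(toy_corp))

# Name of the document with the most tokens, then its text
longest <- names(n_tok)[which.max(n_tok)]
as.character(toy_corp)[longest]
```

Indexing by document name rather than by a hand-typed number also avoids printing the wrong comment by mistake.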

2.b Perform kwic analysis for the one word, using window 5, what are your results?

For the analysis, I chose the word “Mbappe”, the surname of the leader of the French national football team, who scored three goals in the World Cup final. As a reminder, the comments come from a short review of this match. Using the kwic function, I will find all occurrences of the word “Mbappe” in the tokenized corpus, with a context window of 5 tokens.

kwic(tokens(corp_yout), pattern ="Mbappe", window = 5)

## Keyword-in-context with 18 matches.
##     [text8, 1]                              | mbappe | can cry a river in
##    [text11, 3]                    I support | Mbappe | here tho, He&
##    [text34, 1]                              | Mbappe | world cup version is crazy
##    [text38, 1]                              | Mbappe | almost broke the script ☠️
##   [text47, 18]                      ** VS** | MBAPPE | *
##    [text49, 1]                              | Mbappe | , Thuram...
##    [text49, 7]                  , Thuram... | MBAPPE | !!!!!
##    [text51, 5]           for people who say | Mbappe | carried France in the final
##   [text51, 17] just know if Griezman played | Mbappe | wouldve had just 1 goal
##    [text52, 1]                              | Mbappe | lose the world cup 2022
##   [text54, 21]    having 4v3 with coman and | Mbappe | and referee decided to not
##   [text67, 10]            cup, but that day | Mbappe | was the man, who
##    [text69, 5]       France vs Argentina ❌ | Mbappe | vs Argentina ✅
##    [text85, 7] think Kolo muani ballpass to | mbappe | 100% goal ❤ 😮
##    [text95, 4]                In my opinion | Mbappe | shocked the world by scoring
##    [text96, 1]                              | mbappe | is a certified truck,
##   [text103, 9]      trying so hard not only | mbappe |
##  [text107, 13]  praying in 10 religions for | mbappe | to score, in the

#I find all comments with the word "Mbappe" in the data corpus in the "tokens" column and specify a window equal to 5

I received 18 matches in 16 comments (comments 49 and 51 mention the word twice). For each match I can see up to 5 tokens of context before and after the word “Mbappe”.
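The number of matches and the number of comments can differ when a comment mentions the word more than once. A small sketch of how both can be counted from a kwic result follows; the toy texts are invented for illustration, and the matching relies on kwic being case-insensitive by default.

```r
library(quanteda)

# Toy corpus (hypothetical texts, not the real YouTube comments)
toy_corp <- corpus(c(
  text1 = "Mbappe scored and then MBAPPE scored again",
  text2 = "I support mbappe",
  text3 = "i miss aguero"
))

# One row per occurrence of the keyword
kw <- as.data.frame(kwic(tokens(toy_corp), pattern = "mbappe", window = 5))

n_matches  <- nrow(kw)                    # occurrences of the word
n_comments <- length(unique(kw$docname))  # distinct comments containing it
```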

2.c What are 10 top features in your corpus?

To answer this question, I will use the dfm and topfeatures functions. First, I create the dfmat_yout3 variable: I tokenize the corpus, remove punctuation marks and, using the stopwords command, remove English stop words, then convert the result into a document-feature matrix with the dfm function. Finally, I output the 10 most frequent tokens with the topfeatures function.

dfmat_yout3 <- tokens(corp_yout, remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm()
#First, I create the dfmat_yout3 variable: punctuation marks are removed during tokenization, English stop words are removed with the stopwords command, and the result is converted to a document-feature matrix with the dfm function.
topfeatures(dfmat_yout3, 10)

##      ❤ mbappe     😂     🇦🇷  world  match    cup   2024   goal france 
##     24     18     17     16     12     11     10      9      9      8

#At the end, I output the 10 most popular tokens using the topfeatures function.

I think these words are leading because they all describe the World Cup final. “mbappe” was one of the main heroes of the match. The 🇦🇷 emoji is the flag of Argentina, the winner of the FIFA World Cup. “match” and “goal” are words that refer to football. “world” and “cup” show that commenters often mentioned that this is the final of the World Cup, and “france” is the other finalist. “2024” probably appears because the most recent comments got into the dataset, and the commenters shared the fact that they are still rewatching this final. Emojis express emotions: the heart can mean that the commenter liked the match very much, and the laughing emoji can be a mockery of some players.
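Total counts can be inflated by a single comment that repeats one token many times, which is common with emojis. As a hedged sketch, topfeatures() also accepts scheme = "docfreq", which ranks features by the number of comments that contain them. The toy texts below are invented for illustration.

```r
library(quanteda)

# Toy document-feature matrix (hypothetical texts)
toy_dfm <- dfm(tokens(c(
  text1 = "goal goal goal goal",
  text2 = "mbappe goal",
  text3 = "mbappe world cup",
  text4 = "mbappe final"
)))

# Ranking by total number of occurrences
top_count <- topfeatures(toy_dfm, 3)

# Ranking by the number of documents that contain the feature
top_docs <- topfeatures(toy_dfm, 3, scheme = "docfreq")
```

Comparing the two rankings shows whether a top feature is widely used or concentrated in a few comments.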

2.d What comment is the most lexically diverse?

To answer this question, I will use the textstat_lexdiv function. To do this, I will use this command for tokens from the corp_yout variable:

library("quanteda.textstats")
tstat_lexdiv <-  textstat_lexdiv(tokens(corp_yout))
tstat_lexdiv

##     document       TTR
## 1      text1 1.0000000
## 2      text2 0.8000000
## 3      text3 1.0000000
## 4      text4 1.0000000
## 5      text5 1.0000000
## 6      text6 1.0000000
## 7      text7 0.8333333
## 8      text8 1.0000000
## 9      text9 1.0000000
## 10    text10 1.0000000
## 11    text11 1.0000000
## 12    text12 0.2727273
## 13    text13 1.0000000
## 14    text14 1.0000000
## 15    text15 1.0000000
## 16    text16 1.0000000
## 17    text17 1.0000000
## 18    text18 1.0000000
## 19    text19 0.9411765
## 20    text20 1.0000000
## 21    text21 0.9285714
## 22    text22 1.0000000
## 23    text23 1.0000000
## 24    text24 1.0000000
## 25    text25 1.0000000
## 26    text26 1.0000000
## 27    text27 0.3333333
## 28    text28 1.0000000
## 29    text29 1.0000000
## 30    text30 1.0000000
## 31    text31 1.0000000
## 32    text32 0.4000000
## 33    text33 1.0000000
## 34    text34 1.0000000
## 35    text35 1.0000000
## 36    text36 1.0000000
## 37    text37 1.0000000
## 38    text38 1.0000000
## 39    text39 1.0000000
## 40    text40 1.0000000
## 41    text41        NA
## 42    text42 1.0000000
## 43    text43 0.8571429
## 44    text44 1.0000000
## 45    text45 1.0000000
## 46    text46 1.0000000
## 47    text47 0.6666667
## 48    text48 1.0000000
## 49    text49 0.8000000
## 50    text50 1.0000000
## 51    text51 0.9047619
## 52    text52 1.0000000
## 53    text53 0.9411765
## 54    text54 0.8611111
## 55    text55 1.0000000
## 56    text56 0.8888889
## 57    text57 1.0000000
## 58    text58 1.0000000
## 59    text59        NA
## 60    text60 1.0000000
## 61    text61 0.9230769
## 62    text62 1.0000000
## 63    text63 1.0000000
## 64    text64 0.9142857
## 65    text65 0.9090909
## 66    text66 1.0000000
## 67    text67 0.7777778
## 68    text68 1.0000000
## 69    text69 0.6666667
## 70    text70 1.0000000
## 71    text71 1.0000000
## 72    text72 1.0000000
## 73    text73 0.9473684
## 74    text74 0.8750000
## 75    text75 0.9166667
## 76    text76 1.0000000
## 77    text77 0.7500000
## 78    text78 1.0000000
## 79    text79 1.0000000
## 80    text80 0.8750000
## 81    text81 1.0000000
## 82    text82 1.0000000
## 83    text83 0.9285714
## 84    text84 0.9166667
## 85    text85 1.0000000
## 86    text86 1.0000000
## 87    text87 1.0000000
## 88    text88 0.9090909
## 89    text89 1.0000000
## 90    text90        NA
## 91    text91 1.0000000
## 92    text92        NA
## 93    text93 1.0000000
## 94    text94 0.8571429
## 95    text95 1.0000000
## 96    text96 0.9285714
## 97    text97 1.0000000
## 98    text98 0.8000000
## 99    text99 1.0000000
## 100  text100 1.0000000
## 101  text101 1.0000000
## 102  text102 1.0000000
## 103  text103 1.0000000
## 104  text104        NA
## 105  text105 0.6470588
## 106  text106 0.6666667
## 107  text107 0.8800000
## 108  text108 1.0000000
## 109  text109 1.0000000
## 110  text110 1.0000000
## 111  text111 1.0000000
## 112  text112 1.0000000
## 113  text113 1.0000000
## 114  text114 1.0000000
## 115  text115 1.0000000
## 116  text116 1.0000000
## 117  text117 1.0000000
## 118  text118 1.0000000
## 119  text119 1.0000000
## 120  text120 1.0000000
## 121  text121 1.0000000
## 122  text122 1.0000000
## 123  text123 1.0000000
## 124  text124 0.8333333
## 125  text125 1.0000000
## 126  text126 0.9090909
## 127  text127 1.0000000
## 128  text128 1.0000000
## 129  text129 1.0000000
## 130  text130 1.0000000
## 131  text131 1.0000000

The higher the TTR (type-token ratio), the greater the lexical diversity of the comment. Many comments have the maximum value of exactly 1, which is typical of short comments that repeat no word. To single out the most diverse comment, I suggest multiplying the TTR by the number of tokens and choosing the comment with the highest score. This way I will get the longest comment that also has a high TTR value.

docvars(corp_yout, "TTR") <- as.numeric(tstat_lexdiv$TTR)
# adding "TTR" from tstat_lexdiv in the "number" format
summary(corp_yout, n = 126)

## Corpus consisting of 131 documents, showing 126 documents:
## 
##     Text Types Tokens Sentences Publication_date Likecount Replycount       TTR
##    text1     7      8         1       2024-10-19         1          0 1.0000000
##    text2     5      7         1       2024-10-19         0          0 0.8000000
##    text3     5      5         1       2024-10-18         2          0 1.0000000
##    text4     7      7         1       2024-10-18         1          0 1.0000000
##    text5     2      2         1       2024-10-18         1          0 1.0000000
##    text6     6      8         1       2024-10-18         1          0 1.0000000
##    text7    13     25         2       2024-10-18         0          0 0.8333333
##    text8     9      9         1       2024-10-18         0          0 1.0000000
##    text9    12     13         1       2024-10-18         0          0 1.0000000
##   text10    18     22         1       2024-10-17         0          0 1.0000000
##   text11    13     13         1       2024-10-16         0          1 1.0000000
##   text12     4     13         1       2024-10-16         1          0 0.2727273
##   text13    10     10         1       2024-10-16         2          0 1.0000000
##   text14     8      8         1       2024-10-16         0          0 1.0000000
##   text15     7      7         1       2024-10-16         1          0 1.0000000
##   text16     8      8         1       2024-10-15         1          0 1.0000000
##   text17     3      3         1       2024-10-15         0          0 1.0000000
##   text18    14     17         2       2024-10-15         0          0 1.0000000
##   text19    16     17         1       2024-10-14         1          0 0.9411765
##   text20     7      7         1       2024-10-14         1          0 1.0000000
##   text21    17     18         2       2024-10-14         0          0 0.9285714
##   text22    30     33         3       2024-10-14         0          0 1.0000000
##   text23     9     12         2       2024-10-13         0          0 1.0000000
##   text24     4      4         1       2024-10-13         0          1 1.0000000
##   text25     4      4         1       2024-10-13         0          0 1.0000000
##   text26     6      6         1       2024-10-13         0          0 1.0000000
##   text27     4      8         1       2024-10-13         0          0 0.3333333
##   text28     6      6         1       2024-10-13         0          0 1.0000000
##   text29     1      1         1       2024-10-13         0          0 1.0000000
##   text30     6      6         1       2024-10-12         0          1 1.0000000
##   text31     1      1         1       2024-10-12         0          0 1.0000000
##   text32     2      5         1       2024-10-12         0          0 0.4000000
##   text33     2      2         1       2024-10-12         0          0 1.0000000
##   text34     7      7         1       2024-10-12         1          0 1.0000000
##   text35     4      4         1       2024-10-12         0          0 1.0000000
##   text36    13     13         1       2024-10-11         7          0 1.0000000
##   text37     4      4         1       2024-10-11         0          0 1.0000000
##   text38     9      9         1       2024-10-11         2          0 1.0000000
##   text39     9      9         1       2024-10-11         1          0 1.0000000
##   text40     6      6         1       2024-10-11         1          0 1.0000000
##   text41     1      1         1       2024-10-11         0          0        NA
##   text42     9      9         1       2024-10-11         1          0 1.0000000
##   text43     7      8         1       2024-10-11         1          0 0.8571429
##   text44     8      8         1       2024-10-11         0          0 1.0000000
##   text45    13     13         1       2024-10-11         0          0 1.0000000
##   text46     4      4         1       2024-10-10         0          0 1.0000000
##   text47     9     19         1       2024-10-10         1          0 0.6666667
##   text48     8      8         1       2024-10-10         0          0 1.0000000
##   text49     8     18         3       2024-10-10         2          0 0.8000000
##   text50     6      8         1       2024-10-10         1          0 1.0000000
##   text51    20     22         1       2024-10-09         0          0 0.9047619
##   text52     7      8         1       2024-10-09         0          0 1.0000000
##   text53    18     19         1       2024-10-09         0          0 0.9411765
##   text54    33     37         2       2024-10-09         0          1 0.8611111
##   text55     3      3         1       2024-10-08         0          0 1.0000000
##   text56    41     45         2       2024-10-08         1          0 0.8888889
##   text57     7      7         1       2024-10-08         0          0 1.0000000
##   text58     2      2         1       2024-10-08         1          0 1.0000000
##   text59     1      3         1       2024-10-07         0          0        NA
##   text60    11     11         1       2024-10-07         1          1 1.0000000
##   text61    13     13         1       2024-10-07         1          0 0.9230769
##   text62     3      3         1       2024-10-07         0          0 1.0000000
##   text63     4      4         2       2024-10-06         0          0 1.0000000
##   text64    40     52         4       2024-10-06         2          0 0.9142857
##   text65    12     12         1       2024-10-06         0          0 0.9090909
##   text66     2      2         1       2024-10-06         0          0 1.0000000
##   text67    16     22         1       2024-10-06         0          1 0.7777778
##   text68     6      6         1       2024-10-06         1          0 1.0000000
##   text69     6      8         1       2024-10-05         1          0 0.6666667
##   text70    12     14         1       2024-10-05         0          1 1.0000000
##   text71     6      6         1       2024-10-04        58          9 1.0000000
##   text72     2      2         1       2024-10-04         1          6 1.0000000
##   text73    18     19         1       2024-10-04         7          0 0.9473684
##   text74     7      8         1       2024-10-04         0          0 0.8750000
##   text75    12     12         1       2024-10-03         6          1 0.9166667
##   text76     3      3         1       2024-10-02         3          0 1.0000000
##   text77     5      6         1       2024-10-02         0          0 0.7500000
##   text78    12     12         1       2024-10-02         1          0 1.0000000
##   text79    21     22         1       2024-10-01         0          1 1.0000000
##   text80     8      8         1       2024-10-01         2          0 0.8750000
##   text81     7      7         1       2024-10-01         9          0 1.0000000
##   text82     2      3         1       2024-09-30         4          0 1.0000000
##   text83    17     18         1       2024-09-30         2          0 0.9285714
##   text84    11     12         1       2024-09-30         2          0 0.9166667
##   text85    12     13         1       2024-09-29         0          0 1.0000000
##   text86     9      9         1       2024-09-29         1          0 1.0000000
##   text87     2      2         1       2024-09-28         1          0 1.0000000
##   text88    10     11         1       2024-09-28         2          0 0.9090909
##   text89    17     17         1       2024-09-28         3          1 1.0000000
##   text90     1      1         1       2024-09-28         1          0        NA
##   text91     8      8         1       2024-09-28         3          0 1.0000000
##   text92     2      3         1       2024-09-27         1          0        NA
##   text93     7      8         1       2024-09-27         1          0 1.0000000
##   text94    10     13         1       2024-09-26         1          0 0.8571429
##   text95    11     11         1       2024-09-26         0          2 1.0000000
##   text96    15     16         1       2024-09-25         0          0 0.9285714
##   text97     3      3         1       2024-09-24         1          0 1.0000000
##   text98     5      6         1       2024-09-24         0          0 0.8000000
##   text99     6      9         1       2024-09-24         1          4 1.0000000
##  text100     6      6         1       2024-09-24         1          0 1.0000000
##  text101     6      6         1       2024-10-17         0          0 1.0000000
##  text102     3      3         1       2024-10-13         0          0 1.0000000
##  text103     9      9         1       2024-10-14         0          0 1.0000000
##  text104     1      1         1       2024-10-09         0          0        NA
##  text105    11     17         1       2024-10-09         2          0 0.6470588
##  text106     5      7         1       2024-10-06         2          0 0.6666667
##  text107    27     33         1       2024-10-09         1          0 0.8800000
##  text108     1      1         1       2024-10-18         0          0 1.0000000
##  text109     1      1         1       2024-10-18         0          0 1.0000000
##  text110     7      7         1       2024-10-19         1          0 1.0000000
##  text111     1      1         1       2024-10-19         0          0 1.0000000
##  text112     3      3         1       2024-10-19         0          0 1.0000000
##  text113     4      4         1       2024-10-19         1          0 1.0000000
##  text114     2      3         1       2024-10-19         0          0 1.0000000
##  text115     1      1         1       2024-10-19         0          0 1.0000000
##  text116     2      2         1       2024-10-19         0          0 1.0000000
##  text117    21     22         1       2024-10-04         5          0 1.0000000
##  text118     6      6         2       2024-10-05         1          0 1.0000000
##  text119     3      3         1       2024-10-09         3          0 1.0000000
##  text120     3      3         1       2024-10-09         0          0 1.0000000
##  text121     8      8         1       2024-10-09         3          0 1.0000000
##  text122    13     13         1       2024-10-09         1          0 1.0000000
##  text123     3      3         1       2024-10-03         0          0 1.0000000
##  text124     6     10         1       2024-10-13         0          0 0.8333333
##  text125     4      4         1       2024-09-28         0          0 1.0000000
##  text126    12     17         2       2024-10-06         0          0 0.9090909

#To see what we got
#Next, I will export this table to Excel, multiply the TTR and Tokens columns, and find the comment with the highest value

The largest value came from comment 64, with a score of about 47.54 (52 tokens × TTR 0.9142857). This is also the longest comment, found in 2.a.

as.character(corp_yout)[64]

# to see a comment
summary(corp_yout, n = 126)[which.max(tokeninfo$Tokens), ]

##      Text Types Tokens Sentences Publication_date Likecount Replycount
## 64 text64    40     52         4       2024-10-06         2          0
##          TTR
## 64 0.9142857

# to see the characteristics

The comment contains 52 tokens, was published on 2024-10-06, has 2 likes and 0 replies, and has TTR = 0.9142857.
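The same score can be computed directly in R instead of Excel. This is a minimal sketch in base R using the Tokens and TTR values of four long comments copied from the summary table above. The data frame name lexdiv and the column name score are my own choices for illustration.

```r
# Token counts and TTR copied from the summary table for four long comments
lexdiv <- data.frame(
  Text   = c("text22", "text54", "text56", "text64"),
  Tokens = c(33, 37, 45, 52),
  TTR    = c(1.0000000, 0.8611111, 0.8888889, 0.9142857)
)

# Score = TTR multiplied by the number of tokens
lexdiv$score <- lexdiv$Tokens * lexdiv$TTR

# Row with the highest score
best <- lexdiv[which.max(lexdiv$score), ]
best
```

On the full corpus, the same two lines could be applied to the table returned by summary(), assuming it includes the Tokens and TTR columns for every comment.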

3. Build a word cloud for your corpus:

To create a word cloud, I will use the “quanteda.textplots” library and the textplot_wordcloud command, which I will apply to the dfmat_yout3 variable

set.seed(100) # setting the seed is important for functions
# that could give different results between runs because of random iterations
library("quanteda.textplots")

textplot_wordcloud(dfmat_yout3, min_count = 5,
                   random_order = FALSE, 
                   rotation = 0.25,
    color = RColorBrewer::brewer.pal(8, "Dark2"))

This is a word cloud. It visualizes the most common tokens in the comments: the larger a word is and the closer it is to the center, the more often it occurs. Most of these words were interpreted in section 2.c. The cloud also shows new words that were not included in the top 10.
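To list which words pass the min_count = 5 threshold without reading them off the picture, the matrix can be trimmed by total frequency. This is a sketch under the assumption that min_count filters on the total count of a feature, which dfm_trim() with min_termfreq approximates. The toy texts are invented.

```r
library(quanteda)

# Toy document-feature matrix (hypothetical texts)
toy_dfm <- dfm(tokens(c(
  text1 = "goal goal goal mbappe",
  text2 = "goal goal mbappe messi",
  text3 = "mbappe mbappe mbappe final"
)))

# Keep only features whose total count is at least 5
frequent <- dfm_trim(toy_dfm, min_termfreq = 5)
featnames(frequent)
```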

4. Build a semantic network and provide comments.

To build a semantic network, I will use the textplot_network command

tag_fcm <- fcm(dfmat_yout3)
feat <- names(topfeatures(dfmat_yout3, 30))
set.seed(100)
fcm_select(tag_fcm, pattern = feat) %>%
    textplot_network(min_freq = 0.9,edge_color = "violet",
                 edge_alpha = 0.8,
                 edge_size = 1,
                 vertex_color = "grey",
                 vertex_alpha = 0.5,
                 vertex_labelcolor = "black",
                 offset = NULL)

This is a semantic network of words. It visualizes which words co-occur in the comments. The thicker the line, the stronger the connection between the words. Take the word we analyzed earlier as an example: “mbappe” is most often found together with the words “world”, “goal”, “france” and “vs”. From this, we can assume that the commenters discussed this player's goals and also framed his performance as a “vs” confrontation with Argentina. The word “france” is mentioned because it refers to his team, and the word “world” because of the World Cup final. It is also interesting to look at the cluster of the words “final”, “match”, “greatest”, “heart”, “world”, “cup”, “ever”, “2022”. These words can be combined into a single sentence that reflects the commenters' delight at the 2022 final.
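The connections read from the plot can be checked numerically by looking at the co-occurrence matrix itself. Below is a minimal sketch on invented toy texts. Unlike the code above, it builds the fcm from tokens rather than from the dfm, and it sets tri = FALSE so that both halves of the matrix are filled.

```r
library(quanteda)

# Toy tokens (hypothetical texts, each word appears once per text)
toy_toks <- tokens(c(
  text1 = "mbappe goal",
  text2 = "mbappe goal france",
  text3 = "world cup"
))

# Document-level co-occurrence counts; tri = FALSE fills both triangles
toy_fcm <- fcm(toy_toks, context = "document", tri = FALSE)
cooc <- as.matrix(toy_fcm)

# Co-occurrence counts for two pairs of words
cooc["mbappe", "goal"]
cooc["mbappe", "world"]
```

A pair with a high count corresponds to a thick edge in the network, and a pair with a count of zero has no edge.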

5. Build a lexical dispersion plot and provide comments.

To do this, I will first use the kwic function to analyze the word “mbappe”, which we will then visualize using the textplot_xray() function

kwic(tokens(corp_yout), pattern = "mbappe") %>% textplot_xray()

This is a lexical dispersion plot. It shows where in each comment the word “mbappe” occurs. According to it, the word is more often mentioned in the first half of a comment. However, in comments 103 and 47, the word occurs almost at the end of the comment.
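The positions shown in the plot can also be expressed as numbers: the token index of each match divided by the length of the comment. This is a hedged sketch on invented toy texts; the name rel_pos is my own.

```r
library(quanteda)

# Toy tokens (hypothetical texts, not the real YouTube comments)
toy_toks <- tokens(c(
  text1 = "mbappe is the man",
  text2 = "nobody was trying as hard as mbappe"
))

# One row per match, with the token index in the "from" column
kw <- as.data.frame(kwic(toy_toks, pattern = "mbappe"))

# Relative position of each match: token index divided by comment length
rel_pos <- kw$from / ntoken(toy_toks)[as.character(kw$docname)]
rel_pos
```

Values close to 0 mean the word opens the comment, and values close to 1 mean it closes the comment.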