library(ggplot2)
library(tidyverse)
library(kableExtra)
library(stopwords)
#install.packages('stopwords')
Scraping data from content using #Feminism on Russian TikTok
Importance of data extraction from TikTok
Scraping data on TikTok that includes the hashtag #feminism is important for sociologists for several reasons. First, it allows us to understand the perspectives and experiences of individuals who engage with feminism on this popular social media platform. Second, analyzing the data can provide insights into how feminist ideas and activism are communicated and shared in a visually-driven and interactive format. Finally, examining the hashtag #feminism on TikTok can contribute to a broader understanding of the role of digital media in shaping feminist discourse and mobilization among younger generations.
Variables
We obtained six variables for each publication: the author, the publication date, the hashtags, the caption, the link to the video, and the number of views. It is important to note that TikTok does not treat hashtags as part of the caption. First, the hashtags of a single publication are returned separately as individual tokens, which makes it difficult to include the “hashtag” variable in the database.
Second, because hashtags are not considered part of the caption, a publication whose text contains only hashtags is treated as having a missing caption. This also makes it difficult to include the “content text” variable in the database.
Main Table
The main table contains 131 observations, i.e. we managed to scrape 131 publications containing the hashtag #Feminism.
df <- read.csv('/Users/PaulMUTAMBA/Documents/final_df1.csv', header = T)
kable(head(df, 10)) %>%
kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)

| author_publication | videos | nbr_views | date |
|---|---|---|---|
| 23augp | https://www.tiktok.com/@23augp/video/7064928718678002946 | 14400000 | 2022-02-15 |
| nurka_shul | https://www.tiktok.com/@nurka_shul/video/6974801181591817473 | 514300 | 2021-06-17 |
| indra_ootsutsuki_00 | https://www.tiktok.com/@indra_ootsutsuki_00/video/7010047587394784513 | 436500 | 2021-09-20 |
| mariarossellini | https://www.tiktok.com/@mariarossellini/video/6948124203413441793 | 714800 | 2021-04-06 |
| gogoforogo | https://www.tiktok.com/@gogoforogo/video/7041418187170221313 | 20200 | 2021-12-14 |
| ave.anne | https://www.tiktok.com/@ave.anne/video/6996675941657316610 | 1000000 | 2021-08-15 |
| lada_ilyina | https://www.tiktok.com/@lada_ilyina/video/7066909812973374721 | 5100000 | 2022-02-21 |
| psychodelfina | https://www.tiktok.com/@psychodelfina/video/6984413873323527426 | 85800 | 2021-07-13 |
| .diedits | https://www.tiktok.com/@.diedits/video/6987687801751948545 | 2200000 | 2021-07-22 |
| bobsiiking | https://www.tiktok.com/@bobsiiking/video/6983747743059545346 | 57000 | 2021-07-11 |
Analysing the authors of the publications could help us identify the main agents or personalities who use the hashtag #feminism on TikTok. A basic analysis shows who posts the most; a more in-depth analysis could tell us more about these authors, for example by visiting their profiles and, where possible, determining their socio-demographic characteristics.
The video link can be useful for a more in-depth analysis, if we want to know the most popular content using the hashtag.
The number of views helps us identify the most popular videos on the platform using the #feminism hashtag.
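As a minimal sketch of the basic analysis described above, the snippet below ranks publications by views and counts publications per author. It uses a small illustrative subset of the main table (four rows copied from the output above); the names `df_example`, `top_videos` and `posts_per_author` are hypothetical and not part of the original analysis.

```r
# Illustrative subset of the main table (values copied from the rows shown above)
df_example <- data.frame(
  author_publication = c("23augp", "nurka_shul", "lada_ilyina", ".diedits"),
  nbr_views = c(14400000, 514300, 5100000, 2200000)
)

# Most viewed publications first
top_videos <- df_example[order(df_example$nbr_views, decreasing = TRUE), ]

# Number of publications per author, most active authors first
posts_per_author <- sort(table(df_example$author_publication), decreasing = TRUE)

print(top_videos)
print(posts_per_author)
```

On the full table, the same two steps would be applied to `df` instead of `df_example`.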
Caption Table
The captions of the publications are shown below. Only 96 of the 131 publications contained a caption. This means that 35 publications did not contain a caption.
text_df <- read.csv('/Users/PaulMUTAMBA/Documents/texts_df.csv', header = T)
kable(head(text_df, 10)) %>%
kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)

| final_texts |
|---|
| главный феминист |
| Смогли бы выжить в мире без мужчин? |
| 😘 |
| Fight like a girl 💖 |
| скоро на уроках и такое появится… |
| 🥺🥺 |
| Ещё один повод задуматься |
| ….. |
| seriously I LOVE HER |
| А какие ошибки совершали вы? |
The captions may help us capture the sentiment of TikTok content, the topics of discussion related to #feminism, and the words most commonly used alongside the hashtag. Using machine learning (ML), one can extract valuable information from these captions.
Visualization
Distribution of publications using #Feminism per year (TikTok)
df$date <- as.Date(df$date, format = "%Y-%m-%d")
df$year <- format(df$date, "%Y")
frequency <- table(df$year)
df_frequency <- data.frame(year = names(frequency), frequency = as.numeric(frequency))
ggplot(df_frequency, aes(x = year, y = frequency)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(x = "Year", y = "Content using #feminism") +
theme_minimal()
Distribution of the views of videos using #Feminism on TikTok
ggplot(data = df, aes(x = nbr_views)) +
  geom_histogram(bins = 10, color = "black", fill = "#FFB273") +
  # Dashed vertical line marking the mean number of views
  geom_vline(aes(xintercept = mean(nbr_views, na.rm = TRUE), color = "mean"), linetype = "dashed", size = 1) +
  labs(x = "Number of views", caption = "The number of views is the number of times a video has been watched on the platform") +
  scale_color_manual(name = "Measurement", values = c(mean = "#824acd")) +
  # Apply the complete theme first so that the adjustments below are not overwritten
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), plot.caption = element_text(hjust = 0.25))
The 20 most frequent words in the captions of TikTok content using the hashtag #Feminism
texts <- as.character(text_df$final_texts)
cleaned_texts <- texts %>%
str_replace_all("[[:punct:]]", "") %>%
tolower() %>%
str_split("\\s+") %>%
unlist()
stop_words_ru <- stopwords::stopwords(language = "ru")
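# Drop Russian stop words and the empty tokens left over from punctuation-only captions
cleaned_texts <- cleaned_texts[!(cleaned_texts %in% stop_words_ru) & cleaned_texts != ""]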
word_frequency <- table(cleaned_texts)
sorted_words <- sort(word_frequency, decreasing = TRUE)
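# Keep the 20 most frequent words and their counts
top_20 <- head(sorted_words, 20)
bar_data <- data.frame(word = names(top_20), frequency = as.numeric(top_20))
#bar_data
# reorder() sorts the bars by frequency instead of alphabetically
ggplot(bar_data, aes(x = frequency, y = reorder(word, frequency))) +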
geom_bar(stat = "identity", fill = "steelblue") +
labs(x = "Frequency", y = "Word", caption = "The most common words in the captions of content using #feminism.") +
theme_minimal()
Average number of views per year of posts using #Feminism
kable(df %>% group_by(year) %>% summarise(mean_views_per_year = mean(nbr_views))) %>%
kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)

| year | mean_views_per_year |
|---|---|
| 2020 | 70560.3 |
| 2021 | 827634.8 |
| 2022 | 3446068.0 |
Data Scraping Stages
We used RSelenium to retrieve data from TikTok.
First, we had to start a new Selenium server and open the TikTok page. The code chunks below show the code we used to perform the various steps.
# rD1 <- rsDriver(browser = "firefox", chromever = NULL, verbose = F, port = free_port(), iedrver = NULL)
#
# remDr <- rD1$client
# remDr$open()
# remDr$navigate('https://www.tiktok.com/ru-RU/')
The code above opens the TikTok main page.
As you can see, we need to close the pop-up window inviting us to sign up for TikTok. The following code does this.
# close_be <- remDr$findElement(using = 'xpath', '//div[@aria-label= "Закрыть"]')
#
# close_be$clickElement()
After closing the pop-up, we had to click on the search bar and search for content containing the hashtag #feminism.
# search <- remDr$findElement(using = 'xpath', '//input[@type= "search"]')
# search$clickElement()
#
# search$sendKeysToElement(list('#feminism', key = 'enter'))
As the image above shows, the TikTok page is dynamic. We need to scroll down to capture as much data as possible. The code below scrolls the page 20 times.
# for(i in 1:20) {
#
# remDr$executeScript('window.scrollTo(0, document.body.scrollHeight);') #Scroll over the entire page.
# Sys.sleep(2)
# }
After 10 to 15 iterations, we reached the bottom of the page, suggesting that there were no more publications using #Feminism.
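Since the bottom of the page was reached before all 20 iterations were used, the loop could also stop as soon as the page height stops growing. The following is a minimal sketch of that idea, assuming `remDr` is the RSelenium client created earlier; the function name `scroll_to_bottom` and its parameters `max_scrolls` and `pause` are hypothetical and were not part of the original scraping code.

```r
# Scroll until the page height stops changing or max_scrolls is reached.
# Returns the number of scrolls actually performed.
scroll_to_bottom <- function(remDr, max_scrolls = 20, pause = 2) {
  # Helper: current height of the page, as reported by the browser
  get_height <- function() {
    unlist(remDr$executeScript("return document.body.scrollHeight;"))
  }

  last_height <- get_height()
  for (i in seq_len(max_scrolls)) {
    remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
    Sys.sleep(pause)  # give the page time to load new content
    new_height <- get_height()
    if (new_height == last_height) {
      return(i)  # nothing new was loaded: assume the bottom was reached
    }
    last_height <- new_height
  }
  max_scrolls
}

# Usage (requires a running browser session):
# n_scrolls <- scroll_to_bottom(remDr, max_scrolls = 20, pause = 2)
```

A fixed pause may be too short on a slow connection, in which case the function would stop early; a longer `pause` trades speed for robustness.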
The next step was to collect the data from the page. The code below shows how we retrieved the data.
#
# content_text <- remDr$findElements(using = 'xpath', '//span[@class = "tiktok-j2a19r-SpanText efbd9f0"]')
#
# texts <- lapply(content_text, function (x) {
# x$getElementText() %>% unlist()
#
# }) %>% flatten_chr()
#
# content_text2 <- remDr$findElements(using = "css", ".tiktok-j2a19r-SpanText")
#
# texts <- lapply(content_text2, function (x) {
# x$getElementText() %>% unlist()
#
# }) %>% flatten_chr()
#
# final_texts <- texts[texts != ""]
#
# hashtags_text <- remDr$findElements(using = 'xpath', '//strong[@class = "tiktok-1p6dp51-StrongText ejg0rhn2"]')
#
# hashtags <- lapply(hashtags_text, function (x) {
# x$getElementText() %>% unlist()
# }) %>% flatten_chr()
#
# authors <- remDr$findElements(using = 'xpath', '//p[@class = "tiktok-2zn17v-PUniqueId etrd4pu6"]')
#
# author_publication <- lapply(authors, function (x) {
# x$getElementText() %>% unlist()
# }) %>% flatten_chr()
#
# video_source <- remDr$findElements(using = 'xpath', '//a [@tabindex = "-1"]')
#
# videos <- lapply(video_source, function (x) {
# x$getElementAttribute('href') %>% unlist()
# }) %>% flatten_chr()
#
# views <- remDr$findElements(using = 'xpath', '//strong[@class = "tiktok-ws4x78-StrongVideoCount etrd4pu10"]')
#
# views_nmbr <- lapply(views, function (x) {
# x$getElementText() %>% unlist()
# }) %>% flatten_chr()
#
# date_publication <- remDr$findElements(using = 'xpath', '//div[@class = "tiktok-dennn6-DivTimeTag e19c29qe15"]')
#
# date <- lapply(date_publication, function (x) {
# x$getElementText() %>% unlist()
# }) %>% flatten_chr()
#
# date <- as.Date(date, format = "%Y-%m-%d")
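The view counts collected above are scraped as text, whereas `nbr_views` in the main table is numeric. The conversion step is not shown in the source. Assuming TikTok displays the counts in abbreviated form such as "14.4M" or "514.3K" (an assumption, since the raw strings are not shown), a hypothetical helper `parse_views` could convert them as sketched below.

```r
# Convert abbreviated view counts (e.g. "14.4M", "514.3K") to numbers.
# The K/M suffix format is an assumption about how the counts are displayed.
parse_views <- function(x) {
  x <- toupper(trimws(x))
  multiplier <- ifelse(grepl("M$", x), 1e6,
                ifelse(grepl("K$", x), 1e3, 1))
  # Strip the suffix, convert to numeric, and round away floating-point noise
  round(as.numeric(sub("[KM]$", "", x)) * multiplier)
}

example_views <- parse_views(c("14.4M", "514.3K", "85800"))
print(example_views)
```

Strings in any other format (for example with a "B" suffix or localized separators) would return `NA` with a warning, so the result should be checked before it is stored.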