library(ggplot2)
library(tidyverse)
library(kableExtra)
library(stopwords)

#install.packages('stopwords')

Scraping data from content using #Feminism on Russian TikTok

Importance of data extraction from TikTok

Scraping TikTok content that includes the hashtag #feminism is important for sociologists for several reasons. First, it allows us to understand the perspectives and experiences of individuals who engage with feminism on this popular social media platform. Second, analyzing the data can provide insights into how feminist ideas and activism are communicated and shared in a visually driven, interactive format. Finally, examining the hashtag #feminism on TikTok can contribute to a broader understanding of the role of digital media in shaping feminist discourse and mobilization among younger generations.

Variables

We obtained six variables for each publication: the author, the publication date, the hashtags, the caption, the link to the video, and the number of views. It is important to note that TikTok does not treat hashtags as part of the caption. This has two consequences. First, the hashtags in the caption of a single publication are extracted as separate tokens, which makes it difficult to include the “hashtag” variable in the database.

Second, content containing only hashtags is treated as having a missing caption. This also makes it difficult to include the “content text” variable in the database.

Main Table

The main table contains 131 observations, i.e. we managed to scrape 131 publications containing the hashtag #Feminism.

df <- read.csv('/Users/PaulMUTAMBA/Documents/final_df1.csv', header = TRUE)


kable(head(df, 10)) %>% 
  kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)
author_publication   videos                                                                 nbr_views  date
23augp               https://www.tiktok.com/@23augp/video/7064928718678002946               14400000   2022-02-15
nurka_shul           https://www.tiktok.com/@nurka_shul/video/6974801181591817473           514300     2021-06-17
indra_ootsutsuki_00  https://www.tiktok.com/@indra_ootsutsuki_00/video/7010047587394784513  436500     2021-09-20
mariarossellini      https://www.tiktok.com/@mariarossellini/video/6948124203413441793      714800     2021-04-06
gogoforogo           https://www.tiktok.com/@gogoforogo/video/7041418187170221313           20200      2021-12-14
ave.anne             https://www.tiktok.com/@ave.anne/video/6996675941657316610             1000000    2021-08-15
lada_ilyina          https://www.tiktok.com/@lada_ilyina/video/7066909812973374721          5100000    2022-02-21
psychodelfina        https://www.tiktok.com/@psychodelfina/video/6984413873323527426        85800      2021-07-13
.diedits             https://www.tiktok.com/@.diedits/video/6987687801751948545             2200000    2021-07-22
bobsiiking           https://www.tiktok.com/@bobsiiking/video/6983747743059545346           57000      2021-07-11

  • Analyzing the authors of the publications could help us identify the main actors who use the hashtag #feminism on TikTok. A basic analysis shows who posts the most; a more in-depth analysis could draw on the authors' profiles to learn more about them and, where possible, determine their socio-demographic characteristics.

  • The video link can be useful for a more in-depth analysis, if we want to know the most popular content using the hashtag.

  • The number of views helps us identify the most popular videos on the platform using the #feminism hashtag.
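
The first and third points above can be sketched in a few lines of base R. This is only an illustration: `posts` is a hypothetical stand-in that reuses the column names of `df`, and its values are invented for the example rather than taken from the scraped data.

```r
# Hypothetical stand-in for df, with the same column names (values are invented)
posts <- data.frame(
  author_publication = c("user_a", "user_b", "user_a", "user_c"),
  videos = c("link_1", "link_2", "link_3", "link_4"),
  nbr_views = c(1200, 530000, 48000, 9100),
  stringsAsFactors = FALSE
)

# Number of publications per author, most active first
author_counts <- sort(table(posts$author_publication), decreasing = TRUE)

# Publications ranked by number of views, most viewed first
most_viewed <- posts[order(posts$nbr_views, decreasing = TRUE), ]

head(author_counts, 3)
head(most_viewed, 3)
```

Replacing `posts` with `df` should give the corresponding rankings for the scraped publications, assuming `nbr_views` is stored as a number.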

Caption Table

The captions of the publications are shown below. Only 96 of the 131 publications contained a caption; the remaining 35 did not.

text_df <- read.csv('/Users/PaulMUTAMBA/Documents/texts_df.csv', header = TRUE)

kable(head(text_df, 10)) %>% 
  kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)
final_texts
главный феминист
Смогли бы выжить в мире без мужчин?
😘
Fight like a girl 💖
скоро на уроках и такое появится…
🥺🥺
Ещё один повод задуматься
…..
seriously I LOVE HER
А какие ошибки совершали вы?

The captions may help us capture the sentiment of TikTok content, the topics of discussion related to #feminism, and the words commonly used alongside the hashtag. Machine-learning methods could be used to extract this information from the captions.

Hashtags Table

There are a total of 270 hashtags across the 131 publications. Analyzing the hashtags can help us understand which values, emotions, and social movements are attached to the #feminism hashtag on TikTok. This could help us better understand how the feminist movement is represented on social media.

hashtags_df <- read.csv('/Users/PaulMUTAMBA/Documents/hashatags_df.csv', header = TRUE)

kable(head(hashtags_df,10)) %>% 
  kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)
f_hashtags
#рек
#asaprocky
#feminism
#fyp
#pole
#fem
#радфем
#feminism
#альт
#политика

Visualization

Distribution of publications using #Feminism per year (TikTok)

df$date <- as.Date(df$date, format = "%Y-%m-%d")
df$year <- format(df$date, "%Y")



frequency <- table(df$year)
df_frequency <- data.frame(year = names(frequency), frequency = as.numeric(frequency))

ggplot(df_frequency, aes(x = year, y = frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Year", y = "Content using #feminism") +
  theme_minimal()

Distribution of the views of videos using #Feminism on TikTok

ggplot(data = df, aes(x = nbr_views)) +
  geom_histogram(bins = 10, color = "black", fill = "#FFB273") +
  # Dashed vertical line at the mean number of views
  # (linewidth requires ggplot2 >= 3.4; use size in older versions)
  geom_vline(aes(xintercept = mean(nbr_views, na.rm = TRUE), color = "mean"), linetype = "dashed", linewidth = 1) +
  labs(x = "Number of views", caption = "The number of views is the number of times a video has been watched on the platform") +
  scale_color_manual(name = "Measurement", values = c(mean = "#824acd")) +
  # theme_bw() comes first: as a complete theme it would otherwise override theme()
  theme_bw() +
  # The caption size of 0.25 was dropped because it would make the text unreadable
  theme(plot.caption = element_text(hjust = 0.25))

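Word Cloud of the Top Hashtags used along with #Feminism

# Exclude #feminism itself, whatever its capitalization.
# The original condition repeated the same test twice; a case-insensitive match is assumed to be the intent.
word_frequency <- table(hashtags_df$f_hashtags[tolower(hashtags_df$f_hashtags) != "#feminism"])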

# Sort the words by frequency in decreasing order

sorted_words <- sort(word_frequency, decreasing = TRUE)

# Select the top 30 words
top_30_words <- names(sorted_words)[1:30]

# Create a word cloud
#install.packages("wordcloud")
library(wordcloud)
# wordcloud::wordcloud() does not define `mask` or `random.font`; extra arguments
# are passed on to text(), so both were removed from the call
wordcloud(top_30_words, freq = as.numeric(sorted_words[1:30]), random.order = TRUE,
          colors = brewer.pal(8, "Dark2"), scale = c(5, 1), max.words = 30,
          rot.per = 0.2, random.color = TRUE)

# font must be an integer (1 = plain, 2 = bold)
mtext("Top Hashtags used along with #feminism on TikTok", side = 3, line = -18, adj = 0.5, cex = 1, font = 1)

The 20 most frequent words in the captions of TikTok content using the hashtag #Feminism

texts <- as.character(text_df$final_texts)

# Strip punctuation, lowercase, and split the captions into word tokens
cleaned_texts <- texts %>%
  str_replace_all("[[:punct:]]", "") %>%
  tolower() %>%
  str_split("\\s+") %>%
  unlist()

# Drop empty tokens and Russian stop words
stop_words_ru <- stopwords::stopwords(language = "ru")
cleaned_texts <- cleaned_texts[cleaned_texts != "" & !(cleaned_texts %in% stop_words_ru)]

word_frequency <- table(cleaned_texts)

sorted_words <- sort(word_frequency, decreasing = TRUE)
top_20 <- head(sorted_words, 20)

# Store the counts as a plain numeric column so the plot can refer to `frequency` directly
bar_data <- data.frame(word = names(top_20), frequency = as.numeric(top_20))

#bar_data

# Order the bars by frequency rather than alphabetically
ggplot(bar_data, aes(x = frequency, y = reorder(word, frequency))) +
  geom_col(fill = "steelblue") +
  labs(x = "Frequency", y = "Word", caption = "The most common words in the captions of content using #feminism.") +
  theme_minimal()

Average number of views per year of posts using #Feminism

kable(df %>% group_by(year) %>% summarise(mean_views_per_year = mean(nbr_views))) %>% 
  kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)
year mean_views_per_year
2020 70560.3
2021 827634.8
2022 3446068.0

Data Scraping Stages

We used RSelenium to retrieve data from TikTok.

First, we had to start a new Selenium server and open the TikTok page. The code chunks below show the code we used to perform each step.

# rD1 <- rsDriver(browser = "firefox", chromever = NULL, verbose = F, port = free_port(), iedrver = NULL)
# 
# remDr <- rD1$client
# remDr$open()
# remDr$navigate('https://www.tiktok.com/ru-RU/')

The code launched the TikTok main page.

As you can see, we need to close the pop-up window offering to register for TikTok. The following code does this.

# close_be <- remDr$findElement(using = 'xpath', '//div[@aria-label= "Закрыть"]')
# 
# close_be$clickElement()

After closing the pop-up, we had to click on the search bar and search for content containing the hashtag #feminism.

# search <- remDr$findElement(using = 'xpath', '//input[@type= "search"]')
# search$clickElement()
# 
# search$sendKeysToElement(list('#feminism', key = 'enter'))

As the image above shows, the TikTok page is dynamic: new content is loaded as the page is scrolled. We need to scroll down to capture as much data as possible. The code below scrolls the page 20 times.

# for(i in 1:20) {
#   
#   remDr$executeScript('window.scrollTo(0, document.body.scrollHeight);') #Scroll over the entire page.
#   Sys.sleep(2)
# }

After 10 to 15 iterations, we reached the bottom of the page, suggesting that there were no more publications using #Feminism.
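
Since the bottom of the page was reached before the 20 iterations were used up, the fixed loop could be replaced by one that stops as soon as the page height no longer changes. The sketch below is one possible way to do this, not the code we ran: `scroll_until_stable()`, `get_height` and `scroll_down` are hypothetical names, and the example uses a simulated page so that it runs without a browser.

```r
# Scroll until the page height stops changing, or until max_iter is reached.
# get_height:  function returning the current page height
# scroll_down: function that scrolls to the bottom of the page
# Returns the number of scrolls performed.
scroll_until_stable <- function(get_height, scroll_down, max_iter = 20, pause = 2) {
  previous <- get_height()
  for (i in seq_len(max_iter)) {
    scroll_down()
    Sys.sleep(pause)            # give the page time to load new content
    current <- get_height()
    if (current == previous) {  # nothing new was loaded: bottom reached
      return(i)
    }
    previous <- current
  }
  max_iter
}

# With RSelenium, the two functions could be defined as follows (assumption, not run here):
# get_height  <- function() remDr$executeScript("return document.body.scrollHeight;")[[1]]
# scroll_down <- function() remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")

# Simulated page whose height stops growing after the third scroll
heights <- c(1000, 2000, 3000, 4000, 4000)
step <- 1
n_scrolls <- scroll_until_stable(
  get_height  = function() heights[step],
  scroll_down = function() step <<- min(step + 1, length(heights)),
  pause = 0
)
n_scrolls
```

A slow connection can make the height look stable before new content has loaded, so the `pause` value may need to be larger than the 2 seconds used above.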

The next step was to collect the data from the page. The code below shows how we retrieved the data.

# 
# content_text <- remDr$findElements(using = 'xpath', '//span[@class = "tiktok-j2a19r-SpanText efbd9f0"]')
# 
# texts <- lapply(content_text, function (x) {
#   x$getElementText() %>% unlist()
#   
# }) %>% flatten_chr()
# 
# content_text2 <- remDr$findElements(using = "css", ".tiktok-j2a19r-SpanText")
# 
# texts <- lapply(content_text2, function (x) {
#   x$getElementText() %>% unlist()
#   
# }) %>% flatten_chr()
# 
# final_texts <- texts[texts != ""]
# 
# hashtags_text <- remDr$findElements(using = 'xpath', '//strong[@class = "tiktok-1p6dp51-StrongText ejg0rhn2"]')
# 
# hashtags <- lapply(hashtags_text, function (x) {
#   x$getElementText() %>% unlist()
# }) %>% flatten_chr()
# 
# authors <- remDr$findElements(using = 'xpath', '//p[@class = "tiktok-2zn17v-PUniqueId etrd4pu6"]')
# 
# author_publication <- lapply(authors, function (x) {
#   x$getElementText() %>% unlist()
# }) %>% flatten_chr()
# 
# video_source <- remDr$findElements(using = 'xpath', '//a [@tabindex = "-1"]')
# 
# videos <- lapply(video_source, function (x) {
#   x$getElementAttribute('href') %>% unlist()
# }) %>% flatten_chr()
# 
# views <- remDr$findElements(using = 'xpath', '//strong[@class = "tiktok-ws4x78-StrongVideoCount etrd4pu10"]')
# 
# views_nmbr <- lapply(views, function (x) {
#   x$getElementText() %>% unlist()
# }) %>% flatten_chr()
# 
# date_publication <- remDr$findElements(using = 'xpath', '//div[@class = "tiktok-dennn6-DivTimeTag e19c29qe15"]')
# 
# date <- lapply(date_publication, function (x) {
#   x$getElementText() %>% unlist()
# }) %>% flatten_chr()
# 
# date <- as.Date(date, format = "%Y-%m-%d")
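
The view counts are retrieved as text, whereas the main table stores them as numbers. A minimal sketch of such a conversion is shown below, assuming the scraped strings use abbreviated forms such as "14.4M" or "514.3K"; this format and the helper name `parse_views()` are assumptions made for illustration, and the conversion we actually applied is not shown in this report.

```r
# Convert abbreviated view counts such as "14.4M" or "514.3K" to numbers.
# The "K"/"M"/"B" suffix format is an assumption about the scraped text.
parse_views <- function(x) {
  x <- toupper(trimws(x))
  multiplier <- c(K = 1e3, M = 1e6, B = 1e9)
  suffix <- substr(x, nchar(x), nchar(x))
  has_suffix <- suffix %in% names(multiplier)
  # Keep only the numeric part when a suffix is present
  number <- as.numeric(ifelse(has_suffix, substr(x, 1, nchar(x) - 1), x))
  scale_by <- ifelse(has_suffix, multiplier[suffix], 1)
  unname(number * scale_by)
}

parse_views(c("14.4M", "514.3K", "870"))
```

Strings that do not match this pattern are returned as `NA` with a warning, which makes unexpected formats easy to spot.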