In the digital age, communication professionals can no longer resort to simply sending messages. In order to raise awareness around products, services, or issues; communication professionals need to spark conversations for public interest to spread through online networks. The analysis of media networks can contribute to such media strategies. This document shares an approach to: 1) identify online communities; 2) understand how they think and talk about products, services, or issues; and 3) find potential collaboration partners to stimulate online conversations such as online magazines or micro-celebrities.
Below, we share a schematic overview as well as the code and specific descriptions of the steps that we propose to take to retrieve and analyse networks of websites (issue networks), networks of Twitter users (ego networks), and networks of related YouTube-videos (recommender network).
A specific key topic should always be the starting point of this approach, as we map online communities around this key topic. During the first step, we operationalize the scope of the topic the scope of the topic by generating lists of keywords that are used to search for relevant media content on Google.
Keywords should be combined into Google-compatible queries, looking like this:
x <- data.frame(Thema=c("Level_1","Level_2","Level_1","Level_2"),
Groep=c("Topic_1","Topic_1","Topic_2","Topic_2"),
query=c("\"key%20phrase%20one\"%20OR%20\"key%20phrase%20two\"%20OR%20\"etc\"",
"\"key%20phrase%20three\"%20OR%20\"key%20phrase%20four\"%20OR%20\"etc\"",
"\"key%20phrase%20five\"%20OR%20\"key%20phrase%20six\"%20OR%20\"etc\"",
"\"key%20phrase%20seven\"%20OR%20\"key%20phrase%20eight\"%20OR%20\"etc\""))
print(x)
## Thema Groep
## 1 Level_1 Topic_1
## 2 Level_2 Topic_1
## 3 Level_1 Topic_2
## 4 Level_2 Topic_2
## query
## 1 "key%20phrase%20one"%20OR%20"key%20phrase%20two"%20OR%20"etc"
## 2 "key%20phrase%20three"%20OR%20"key%20phrase%20four"%20OR%20"etc"
## 3 "key%20phrase%20five"%20OR%20"key%20phrase%20six"%20OR%20"etc"
## 4 "key%20phrase%20seven"%20OR%20"key%20phrase%20eight"%20OR%20"etc"
Note that the queries are associated with subtopic in the column Groep and specifity levels in the column Thema. This is optional though. If you do not want to distinguish between topics or levels, simply fill these columns with a single value.
The keywords generated in step 1.1 are used to search for content on Google (see 2). Based on the results returned by Google, we typically extract a short list with the most relevant keywords to search on Twitter and YouTube. However, you could choose to use your original set of queries too.
First, we load a few required packages, set a timestamp variable so we can keep track of things, and load a data frame with Google queries.
# Loading required packages
require(tidyverse)
require(stringr)
require(lubridate)
require(rvest)
require(RCurl)
require(gmailr)
require(filesstrings)
require(tm)
require(tidytext)
require(stopwords)
require(wordcloud2)
## Set timestamp
timestamp <- lubridate::today() %>% strptime("%Y-%m-%d") %>% as.POSIXct(tz="Europe/Amsterdam")
# Start by loading a data frame with Google queries, grouped by level and topic
## Below is an example DF
to_retrieve <- data.frame(Thema=c("Level_1","Level_2","Level_1","Level_2"),
Groep=c("Topic_1","Topic_1","Topic_2","Topic_2"),
query=c("\"key%20phrase%20one\"%20OR%20\"key%20phrase%20two\"%20OR%20\"etc\"",
"\"key%20phrase%20three\"%20OR%20\"key%20phrase%20four\"%20OR%20\"etc\"",
"\"key%20phrase%20five\"%20OR%20\"key%20phrase%20six\"%20OR%20\"etc\"",
"\"key%20phrase%20seven\"%20OR%20\"key%20phrase%20eight\"%20OR%20\"etc\""))
## Helper functions
return_clean_url <- function(string) {
x <- NULL
for(i in 1:length(string)) {
#print(string[i])
y <- string[i]
if(str_detect(y,"^http[s]?://")) {
y <- y %>%
str_replace('http[s]?://',"")
}
if(str_detect(y,"/$")) {
y <- y %>%
str_replace('/$',"")
}
if(str_detect(y,".facebook\\.com")) {
y <- y %>%
str_replace("[a-z-]+.facebook.com","www.facebook.com")
}
if(str_detect(y,"www\\.")) {
y <- y %>%
str_replace("www\\.","")
}
x <- c(x,y)
}
return(x)
} # Removes noise from urls generically
return_clean_url_serp <- function(string) {
x <- NULL
for(i in 1:length(string)) {
#print(string[i])
y <- string[i]
if(str_detect(y,"^http[s]?://")) {
y <- y %>%
str_replace('http[s]?://',"")
}
if(str_detect(y,"/$")) {
y <- y %>%
str_replace('/$',"")
}
if(str_detect(y,".facebook\\.com")) {
y <- y %>%
str_replace_all("[a-z-]+.facebook.com","www.facebook.com")
}
x <- c(x,y)
}
return(x)
} # Removes noise from urls on Google SERP
return_tld <- function(string) {
x <- NULL
for(i in 1:length(string)) {
#print(string[i])
y <- NULL
if(str_detect(string[i],"facebook.com/|instagram.com/|twitter.com/|youtube.com/|youtu.be/")) {
y <- string[i] %>% return_channel
} else {
if(str_detect(string[i],"^http[s]?://")) {
y <- string[i] %>%
str_extract('http[s]?://[a-z0-9]+.[a-z0-9]+.[a-z0-9]+.[a-z0-9]+.[a-z0-9]+/|http[s]?://[a-z0-9]+.[a-z0-9]+.[a-z0-9]+.[a-z0-9]+/|http[s]?://[a-z0-9]+.[a-z0-9]+.[a-z0-9]+/|http[s]?://[a-z0-9]+.[a-z0-9]+/') %>%
str_extract('http[s]?://[a-z0-9.-]+') %>%
str_replace('http[s]?://http[s]?://',"")
} else {
y <- string[i] %>%
str_extract('[a-z0-9]+.[a-z0-9]+.[a-z0-9]+.[a-z0-9]+.[a-z0-9]+/|http[s]?://[a-z0-9]+.[a-z0-9]+.[a-z0-9]+.[a-z0-9]+/|http[s]?://[a-z0-9]+.[a-z0-9]+.[a-z0-9]+/|http[s]?://[a-z0-9]+.[a-z0-9]+/') %>%
str_extract('[a-z0-9.-]+') %>%
str_replace('http[s]?://',"")
}
}
if(is.na(y)) {y <- string[i]}
x <- c(x,y)
}
return(x)
} # Returns tld from url string. Returns page/account/channel for social media urls.
ahrefs_session <- function(email,password) {
# Create a web session with the desired login address
pgsession <- html_session("https://ahrefs.com/user/login")
pgform <- html_form(pgsession)[[1]] #in this case the submit is the 1st form
filled_form <- set_values(pgform, email=email, password=password)
submit_form(pgsession, filled_form, submit=NULL)
return(pgsession)
} # logs into ahrefs.com
calculateColors <- function(values) {
if(length(unique(values))>1){
## Use n equally spaced breaks to assign each value to n-1 equal sized bins
ii <- cut(values, breaks = seq(min(values), max(values), len = 100),
include.lowest = TRUE)
## Use bin indices, ii, to select color from vector of n-1 equally spaced colors
colors <- colorRampPalette(c("lightgrey", "darkblue"))(99)[ii]
} else {
colors <- "#D3D3D3"
}
return(colors)
} # Determines color of words in clouds
plot_cloud <- function(df,by_com) {
df %>%
filter(Community==by_com) %>%
arrange(desc(count)) %>%
.[1:25,] %>%
filter(!is.na(word)) %>%
select(Community,word,count,tf_idf) %>%
mutate(tf_idf=ifelse(is.na(tf_idf),0,tf_idf)) %>%
mutate(color=calculateColors(tf_idf)) %>%
select(word=word,freq=count,color) %>%
as.data.frame %>%
wordcloud2(.,fontFamily = 'Submariner', fontWeight = 'Bold', color=.$color)
} # Plots clouds
pos_read_html <- purrr::possibly(read_html,NA) ## Same as included function, but returns NA in case of error
pos_tolower <- purrr::possibly(tolower,NA) ## Same as included function, but returns NA in case of error
pos_html_form <- purrr::possibly(html_form,NA) ## Same as included function, but returns NA in case of error
pos_delete_message <- purrr::possibly(delete_message,NA) ## Same as included function, but returns NA in case of error
Next, we loop through the list of queries and retrieve the first 40 hits on Google Search.
In our code, safe search and personalization of results are disabled. The resulting web addresses are stored (URLs, i.e. ‘www.address.com/article.html’); as well as top-level domains (TLDs, i.e. ‘www.address.com’), and the position on the Search Engine Result Page (SERP, i.e.: 1 - 40). These settings can be changed in the script below,.
The results are stored in the dataframe result.
## Change settings
n <- "40" ## number of results
pws <- "0" ## safe search off (0) or on (1)
encoding <- "UTF-8" ## encoding
location <- "NL" ## location
extension <- "nl" ## searching google.[extension]
## Loop through queries and store results in data frame result
result <- NULL
for(i in 1:nrow(to_retrieve)) {
## Setting variables for this round
string <- to_retrieve$query[i]
marker <- to_retrieve$Groep[i]
theme <- to_retrieve$Thema[i]
url <- paste0("https://www.google.",extension,"/search?q=",string,
"&num=",n,"&pws=",pws,"&ie=",encoding,"&oe=",encoding,
"&cr=country",location,sep="")
## Retrieve SERP and extract urls of pages
page <- read_html(url,encoding="UTF-8") ## Read SERP
Sys.sleep(5)
## Extract urls from page
x <- page %>% html_nodes("a") %>% html_attr("href") %>% as.vector
t <- tibble(url=x) %>%
filter(str_detect(url,"^\\/url\\?q\\=")) %>%
filter(!str_detect(url,"^\\/url\\?q\\=\\/search")) %>%
filter(!str_detect(url,"^\\/url\\?q\\=https\\:\\/\\/accounts.google.com")) %>%
filter(!str_detect(url,"^\\/url\\?q\\=https\\:\\/\\/www.google.co")) %>%
mutate(url=str_replace(url,"^\\/url\\?q\\=","")) %>%
mutate(url=str_extract(url,".*\\&sa=")) %>%
mutate(url=str_replace(url,"\\&sa=",""))
urls <- t$url %>% unique %>% .[!is.na(.)] %>% str_squish %>% curlUnescape ## Extract urls
if(length(urls)>=1) {
tld <- urls %>% return_clean_url_serp %>% return_tld %>% return_clean_url ## Proces TLD's from urls
## 1.4 Export and store results
result <- rbind(result,tibble::tibble(theme=theme,
marker=marker,
url=urls %>% return_clean_url_serp %>% return_clean_url,
position=1:length(urls),
tld=tld,
timestamp=timestamp))
}
}
Next, we can filter our data by manually checking whether the returned SERP’s match the scope of our study - a fast way to reduce the number of irrelevant URLs. At the same time, we manage to identify the most relevant queries - possibly by topic and specifity - that we can use to search for content on Twitter and YouTube.
The script below retrieves the outgoing links and incoming links for each of the remaining URLs in the filtered data set. The data can be requested using a paid subscription to Ahrefs.com – a platform with sophisticated crawlers and offering detailed link reports. For each URL, the script logs in to Ahrefs.com, browses to the link explorer page, checks whether there are any backlinks for the given URL, and requests a backlink report if this is the case. In case of no backlinks, it checks whether there are any backlinks pointing to sub-pages. The same process is followed to identify the web pages the URLs are linking to. In between each request, the script times out for a minute.
Note: Remember to plugin your Ahrefs credentials before running this script! Also, make sure to go to Ahrefs.com, login, and download your reports on completing this step.
## Create the list of urls that ultimately need to be retrieved
to_retrieve <- result %>%
select(url,tld) %>%
distinct %>%
filter(!str_detect(tld,"facebook.com/pages$"))
## Setting credentials and starting new Ahrefs session
email <- ""
password <- ""
pgsession <- ahrefs_session(email=email,password=password) ## starting session
## Loop through urls in df to_retrieve to request (back)link reports
log <- NULL
for(i in 1:nrow(to_retrieve)) {
## Builds url where to request data
url <- paste0("https://ahrefs.com/site-explorer/export/v2/csv/",
"prefix",
"/live?target=",
to_retrieve$url[i],
collapse="")
page <- jump_to(pgsession, url) ## Jumps to url using created session
pgform <- pos_html_form(page)[[2]] ## Selects form 2
## Reads number of links and backlinks and adds to log
backlinks_count <- pgform$fields$raw_real_count_backlinks$value %>% as.numeric
## Requests backlink report for url if there are any
if(backlinks_count > 0) {
print("backlinks found for prefix; retrieving...")
pgform <- set_values(pgform, raw_export_type="backlinks", raw_export_limit=as.character(backlinks_count)) ## report type
submit_form(page,pgform) ## submit
print(paste0("Requesting ",backlinks_count," backlinks for ",to_retrieve$url[i]))
Sys.sleep(80) ## timeout
} else {
print("checking exact")
url <- paste0("https://ahrefs.com/site-explorer/export/v2/csv/",
"exact",
"/live?target=",
to_retrieve$url[i],
collapse="")
page <- jump_to(pgsession, url) ## Jumps to url using created session
pgform <- html_form(page)[[2]] ## Selects form 4
## Reads number of links and backlinks and adds to log
backlinks_count <- pgform$fields$raw_real_count_backlinks$value %>% as.numeric
if(backlinks_count > 0) {
print("Backlinks found for exact. Retrieving...")
pgform <- set_values(pgform, raw_export_type="backlinks", raw_export_limit=as.character(backlinks_count)) ## report type
submit_form(page,pgform) ## submit
print(paste0("Requesting ",backlinks_count," backlinks for ",to_retrieve$url[i]))
Sys.sleep(80) ## timeout
} else {
print(paste0("No backlinks for ",to_retrieve$url[i])) ## otherwise skip request timeout
}}
links_count <- pgform$fields$raw_real_count_linked_root_domains$value %>% as.numeric
log <- rbind(log,data.frame(url=to_retrieve$url[i],
backlinks=backlinks_count,
links=links_count))
if(links_count > 0) {
pgform<-set_values(pgform, raw_export_type="linked_domains", raw_export_limit=as.character(links_count))
submit_form(page,pgform)
print(paste0("Requesting ",links_count," links for ",to_retrieve$url[i]))
Sys.sleep(80)
} else {
print(paste0("No links for",to_retrieve$url[i]))
}
}
Next, we process the link reports and create a network file.
The network includes the TLDs as nodes (e.g. the linked websites in the network) and the number of pages found for each TLD as one of the nodes’ properties. For example, if our reports would include newswebsite.com/article-1 and newswebsite.com/article-2, the ultimate network would include the TLD newswebsite.com, with two subpages as its weight property.
The so-called edges (links) between the nodes are computed by counting the number of links between pages of the same TLDs. For example, a link between newswebsite.com/artile-1 to blog.com/photo-1 shows as a line from newswebsite.com to blog.com in our network. The number of links between two TLDs is a property of the link.
Furthermore, we add nodes to isolate Facebook pages, YouTube channels, and Instagram- as well as Twitter accounts, and connect them to their respective platform. For example, social media postings (e.g.: facebook.com/user/media-posting or youtube.com/user/watch?=video) are considered to be pages belonging to a newly created parent node (e.g.: facebook.com/user or youtube.com/user), linking to their corresponding TLD (e.g.: facebook.com and youtube.com).
The ultimate network comprises a nodelist (nodelist_all), including the TLDs and their properties, and an edgelist (edgelist_all), including links between TLDs and their properties. These can be exported to csv-files and can be loaded into Gephi.
# First, navigate to the directory where the ahrefs reports are saved
#setwd("")
## First, we creates an index of the files in the current working directory using url as identifier
files <- data.frame(file=list.files())
requested <- files$file %>%
str_extract(.,"[0-9][0-9]-[A-Z][a-z][a-z]-2019_[0-9][0-9]-[0-9][0-9]-[0-9][0-9]") %>%
str_replace("Jan","01") %>% str_replace("Feb","02") %>% str_replace("Mar","03") %>%
str_replace("Apr","04") %>% str_replace("May","05") %>% str_replace("Jun","06") %>%
str_replace("Jul","07") %>% str_replace("Aug","08") %>% str_replace("Sep","09") %>%
str_replace("Oct","10") %>% str_replace("Nov","11") %>% str_replace("Dec","12") %>%
strptime(format="%d-%m-%Y_%H-%M-%S")
url <- files$file %>%
str_replace(.,"\\[raw\\]www\\.|\\[raw\\]","") %>%
str_replace(.,"-linked.*|-backlink.*","")
type <- ifelse(files$file %>% str_detect("-linked-"),"links","backlinks")
index <- cbind(files,url,type,requested)
## Checks the number of rows for each file
index <- index %>%
group_by(file,url,type,requested) %>%
do(data.frame(rows=read_csv(.$file %>% as.character) %>% nrow)) %>%
ungroup
## Next, we connect our results, log en index df's
log_ahrefs_requests <- log
to_retrieve <- result %>%
select(url_retrieve=url,tld) %>%
distinct %>%
filter(!str_detect(tld,"facebook.com/pages$"))
retrieved <- to_retrieve %>%
filter(url_retrieve %in% (filter(log_ahrefs_requests,links>0|backlinks>0) %>% .$url %>% str_replace(.,"www\\.",""))) %>%
mutate(tld=str_extract(url_retrieve,"^[a-z0-9\\-\\.]+")) %>%
left_join(.,filter(log_ahrefs_requests,links>0|backlinks>0) %>% mutate(n=1:nrow(filter(log_ahrefs_requests,links>0|backlinks>0)),url=str_replace(url,"www\\.","")),by=c("url_retrieve"="url")) %>%
arrange(n)
retrieved_links <- retrieved %>% filter(links>0)
retrieved_backlinks <- retrieved %>% filter(backlinks>0)
## Next, we connect the retrieved files with the urls from the results data frame
index_links_strict <- retrieved_links %>%
filter(tld %in% (filter(index,type=="links") %>% .$url %>% as.character)) %>%
inner_join(.,filter(index,type=="links"),by=c("tld"="url","links"="rows")) %>%
group_by(tld) %>%
mutate(x=dense_rank(url_retrieve),y=dense_rank(file)) %>%
filter(x==y) %>%
select(-x,-y) %>%
ungroup
index_links_approx <- retrieved_links %>%
filter(!url_retrieve %in% index_links_strict$url_retrieve) %>%
filter(tld %in% (filter(index,type=="links") %>% .$url %>% as.character)) %>%
inner_join(.,filter(index,type=="links",!file %in% index_links_strict$file),by=c("tld"="url")) %>%
mutate(diff=sqrt((links-rows)^2)) %>%
group_by(file) %>%
top_n(-1,diff) %>%
group_by(url_retrieve) %>%
mutate(x=dense_rank(file)) %>%
group_by(file) %>%
mutate(y=dense_rank(n)) %>%
filter(x==y) %>%
select(-x,-y,-diff,-rows) %>%
ungroup
index_links_full <- rbind(index_links_strict,index_links_approx)
index_backlinks_strict <- retrieved_backlinks %>%
filter(tld %in% (filter(index,type=="backlinks") %>% .$url %>% as.character)) %>%
inner_join(.,filter(index,type=="backlinks"),by=c("tld"="url","backlinks"="rows")) %>%
group_by(tld) %>%
mutate(x=dense_rank(url_retrieve),y=dense_rank(file)) %>%
filter(x==y) %>%
select(-x,-y) %>%
ungroup
index_backlinks_approx <- retrieved_backlinks %>%
filter(!url_retrieve %in% index_backlinks_strict$url_retrieve) %>%
filter(tld %in% (filter(index,type=="backlinks") %>% .$url %>% as.character)) %>%
inner_join(.,filter(index,type=="backlinks",!file %in% index_backlinks_strict$file),by=c("tld"="url")) %>%
mutate(diff=sqrt((backlinks-rows)^2)) %>%
group_by(file) %>%
top_n(-1,diff) %>%
group_by(url_retrieve) %>%
mutate(x=dense_rank(file)) %>%
group_by(file) %>%
mutate(y=dense_rank(n)) %>%
filter(x==y) %>%
select(-x,-y,-diff,-rows) %>%
ungroup
index_backlinks_full <- rbind(index_backlinks_strict,index_backlinks_approx)
index_full <- rbind(index_backlinks_full,index_links_full)
## Loops through the index, adding data to a backlinks and links df
backlinks <- NULL
links <- NULL
for(i in (1:nrow(index_full))) {
print(paste0("now doing ",index_full$file[i],sep=""))
print(paste0("Round ",i,sep=""))
if(index_full$file[i] %>% as.character %>% str_detect("backlink")) {
bl <- read_csv(index_full$file[i] %>% as.character) %>%
group_by(`Referring Page URL`,`First Seen`) %>%
do(tld=(.$`Referring Page URL` %>% return_tld %>% return_clean_url)) %>%
unnest
link_to_tld <- index_full$url_retrieve[i] %>% return_tld
backlinks <- rbind(backlinks,tibble::tibble(
link_to_tld=link_to_tld,
link_from_tld=bl$tld,
link_from=bl$`Referring Page URL` %>% return_clean_url,
link_to=index_full$url_retrieve[i] %>% return_clean_url,
first_seen=bl$`First Seen`
))
} ## reads backlinks
if(index_full$file[i] %>% as.character %>% str_detect("linked-domains")) {
if(file.info(index_full$file[i] %>% as.character)$size >= 75) { ## check size for crashes due to empty ones
l <- read_csv(index_full$file[i] %>% as.character)
link_from_tld <- index_full$url_retrieve[i] %>% return_clean_url %>% return_tld
link_to <- l$`Linked Domains` %>% return_clean_url
counts <- l$`Total Links Count`
links <- rbind(links,tibble::tibble(
link_from_tld=link_from_tld, ## return tld
link_from=index_full$url_retrieve[i] %>% return_clean_url, ## retrieve url from file name
link_to=link_to,
count=counts))
}
} ## reads links
}
## Next, we create the network files
## First, we start with a clean, untagged list of our results ( discriminating between topics and levels is a feature that can be built in later)
result_flt <- result %>%
select(-theme,-marker) %>%
group_by(url,tld,timestamp) %>%
summarize(position=mean(position)%>%ceiling) %>%
ungroup
## We also apply this to our other data frames
links_flt <- links %>%
filter(link_from %in% result_flt$url)
backlinks_flt <- backlinks %>%
filter(link_to %in% result_flt$url)
log_ahrefs_requests_flt <- log_ahrefs_requests %>%
filter(url %in% result_flt$url)
index_full_flt <- index_full %>%
filter(url_retrieve %in% result_flt$url)
# Creating nodelist, calculating:
## onSERP: if page is on SERP or not
## count_weight: weight (0-1) reflecting proportion of found pages within group (results vs links vs backlinks) by timestamp
#### Unique pages from tld found in network, regardless wether result or not
## pos_weight: weight (0-1) reflecting average SERP position (larger = better) by timestamp
#### Weighted SERP position. I.e. 4 on scale 1-4 is same weighted score as 20 on scale 1-20
## This vector serves as an index file
nodes <- tibble(Id=c(backlinks_flt$link_to_tld %>% as.character,
backlinks_flt$link_from_tld %>% as.character,
links_flt$link_to %>% as.character,
links_flt$link_from_tld %>% as.character,
result_flt$tld %>% as.character) %>% unique %>%
.[!is.na(.)])
## Initiates the nodelist from results, calculating...
nodelist_result <- result_flt %>%
group_by(Id=tld) %>%
summarize(count=n(),Position=min(position)) %>%
group_by(Id) %>%
mutate(Weight=cumsum(count),
Pos_cum=cumsum(Position),
Position=(Pos_cum/Weight)%>%ceiling,
row=1:n()) %>%
ungroup %>%
select(Id,Position,Weight,row) %>%
mutate(onSERP=TRUE)
nodes <- nodes[!nodes$Id %in% nodelist_result$Id,]
## Creating nodelist from backlinks
nodelist_backlinks <- backlinks_flt %>%
filter(link_from_tld %in% nodes$Id) %>%
group_by(Id=link_from_tld) %>%
summarize(count=n(),Position=NA) %>%
group_by(Id) %>%
mutate(Weight=cumsum(count),
row=1:n()) %>%
ungroup %>%
select(Id,Position,Weight,row) %>%
mutate(onSERP=FALSE)
nodes <- nodes[!nodes$Id %in% c(nodelist_result$Id,nodelist_backlinks$Id),]
nodelist_links <- links_flt %>%
filter(link_to %in% nodes$Id) %>%
group_by(Id=link_to) %>%
summarize(count=n(),Position=NA) %>%
group_by(Id) %>%
mutate(Weight=cumsum(count),
row=1:n()) %>%
ungroup %>%
select(Id,Position,Weight,row) %>%
mutate(onSERP=FALSE)
nodes <- nodes[!nodes$Id %in% c(nodelist_result$Id,nodelist_backlinks$Id,nodelist_links$Id),]
if(nrow(nodes)==0) {print("All ids succesfully processed")}
nodelist_all <- rbind(nodelist_result,nodelist_backlinks,nodelist_links) %>%
mutate(count_weight=Weight/max(Weight),
pos_weight=ifelse(onSERP==TRUE,1-(Position/max(Position,na.rm=T)),0)) %>%
arrange(Id)
## Build Edgelist
edges_backlinks <- backlinks_flt %>%
group_by(Source=link_from_tld,
Target=link_to_tld) %>%
summarize(count=n()) %>%
group_by(Source,Target) %>%
mutate(Weight=cumsum(count),
row=1:n()) %>%
ungroup %>%
select(-count)
edges_links <- links_flt %>%
group_by(Source=link_from_tld,
Target=link_to) %>%
summarize(count=n()) %>%
group_by(Source,Target) %>%
mutate(Weight=cumsum(count),
row=1:n()) %>%
ungroup %>%
select(-count)
## Adding edges for social media pages
missing_edges_facebook <- filter(nodelist_all,str_detect(Id,"facebook.com/[aA-zZ0-9]+"))$Id %>% unique
missing_edges_youtube <- filter(nodelist_all,str_detect(Id,"youtube.com/user|youtube.com/watch|youtube.com/channel|youtu.be/user|youtu.be/watch|youtu.be/channel"))$Id %>% unique
missing_edges_instagram <- filter(nodelist_all,str_detect(Id,"instagram.com/p/|instagram.com/stories/|instagram.com/[aA-zZ0-9\\.]+/"))$Id %>% unique
missing_edges_twitter <- filter(nodelist_all,str_detect(Id,"twitter.com/[aA-zZ0-9\\.]+"))$Id %>% unique
missing_edges_linkedin <- filter(nodelist_all,str_detect(Id,"linkedin.com/in/[aA-zZ0-9\\.]+|linkedin.com/post"))$Id %>% unique
if(length(missing_edges_facebook)>0){
missing_edges_facebook <- rbind(data.frame(Source=missing_edges_facebook,Target="facebook.com"),
data.frame(Source="facebook.com",Target=missing_edges_facebook))
}else{missing_edges_facebook<-NULL}
if(length(missing_edges_youtube)>0){
missing_edges_youtube <- rbind(data.frame(Source=missing_edges_youtube,Target="youtube.com"),
data.frame(Source="youtube.com",Target=missing_edges_facebook))
}else{missing_edges_youtube<-NULL}
if(length(missing_edges_instagram)>0){
missing_edges_instagram <- rbind(data.frame(Source=missing_edges_instagram,Target="instagram.com"),
data.frame(Source="instagram.com",Target=missing_edges_instagram))
}else{missing_edges_instagram<-NULL}
if(length(missing_edges_twitter)>0){
missing_edges_twitter <- rbind(data.frame(Source=missing_edges_twitter,Target="twitter.com"),
data.frame(Source="twitter.com",Target=missing_edges_twitter))
}else{missing_edges_twitter<-NULL}
if(length(missing_edges_linkedin)>0){
missing_edges_linkedin <- rbind(data.frame(Source=missing_edges_linkedin,Target="linkedin.com"),
data.frame(Source="linkedin.com",Target=missing_edges_linkedin))
}else{missing_edges_linkedin<-NULL}
missing_edges <- rbind(missing_edges_facebook,missing_edges_youtube,missing_edges_instagram,missing_edges_twitter,missing_edges_linkedin)
if(!is.null(missing_edges)){
missing_edges <- missing_edges %>%
group_by(Source,Target) %>%
mutate(Weight=1) %>%
select(Source,Target,Weight)
}
edgelist_all <- rbind(edges_backlinks,edges_links,missing_edges) %>% distinct
Network analysis is done in Gephi. An export of the nodelist made after detecting communities is used in the next step.
We import the nodelist exported from Gephi after performing community analysis (add the path to yours below). To understand the nature of the different communities, we analyze their corresponding URLs. For each URL, we replace dashes, slashes and other non-letter characters with spaces, remove Dutch and English stop words, and compute word occurrence and TF-IDF. Next, we compute word clouds, including the 25 most common words, where word size expresses occurrence and the shade of blue expresses the uniqueness of this word compared to other communities.
Using the plot_cloud function, providing it with the worclouds_for_each_com data frame and the community’s id, wordclouds can be plotted.
## nodelist_gephi <- read_csv("")
result_tm <- result %>%
mutate(text=str_replace(url,tld,"")) %>%
mutate(text=str_replace_all(text,"/|\\.nl|\\.com|\\.org|\\.info|\\.|-|_|aspx|wordpress|html|htm|pdf|images|author|\\?|\\=|title|id=|home|default|[0-9]+|upload|content|archive|archief"," ") %>% str_trim) %>%
mutate(text=str_replace_all(text," [aA-zZ][ |^]| nl[ |^]| com[ |^]| info[ |^]| php[ |^]| asp[ |^]| tag[ |^]"," "))
result_tm_coms <- result_tm %>%
left_join(.,nodelist_gephi %>% select(Id,Community=modularity_class),by=c("tld"="Id")) %>%
select(url,tld,text,Community) %>%
distinct
keepers <- nodelist_gephi %>%
group_by(modularity_class) %>%
summarize(count=n()) %>%
mutate(perc=count/sum(count)*100) %>%
arrange(desc(perc)) %>%
filter(perc>=1) %>%
.$modularity_class
corpus <- result_tm_coms %>%
filter(Community %in% keepers) %>%
unnest_tokens(.,word,text,token="words") %>%
anti_join(data.frame(word=stopwords::stopwords(language="nl",source="stopwords-iso")), by = c("word" = "word")) %>% ## remove stop words
anti_join(data.frame(word=stopwords::stopwords(language="en",source="stopwords-iso")),by=c("word"="word")) %>%
filter(!str_detect(word,"[0-9]")) %>% ## remove numbers
filter(!str_detect(word,"http[s]?|www|.in|.com|t.co|co.in")) %>% ## remove url noise
filter(!str_detect(word,"^[A-z]$|^[A-z][A-z]$")) ## remove one and two character words
tf_idf_coms <- corpus %>%
group_by(Community,word) %>%
summarize(count=n()) %>%
bind_tf_idf(.,word,Community,count) %>%
arrange(Community,desc(tf_idf))
wordclouds_for_each_com <- NULL
for(i in 1:length(keepers)) {
## Create wordcloud data by week
x <- tibble(Community=keepers[i]) %>%
full_join(.,tf_idf_coms,by=c("Community"="Community"))
wordclouds_for_each_com <- rbind(wordclouds_for_each_com,x) %>% distinct
}
wordclouds_for_each_com <- wordclouds_for_each_com %>% filter(!is.na(word))
# plot_cloud(wordclouds_for_each_com,39)
We use Jefferson Henrique’s GetOldTweets to retrieve tweets for a fixed time period. Later (step 3.1), we will use the id’s of these tweets to re-retrieve them using rtweet, and snowball from there to retrieve the tweets’ conversational context (replies, retweets, quotes, etc.) as well as the ego networks of the associated users.
Next, we load a few required packages, a few helper functions, and set a timestamp variable so we can keep track of things.
# Loading required packages
require(rtweet)
require(httpuv)
require(longurl)
require(igraph)
require(SnowballC)
require(webshot)
require(htmltools)
require(htmlwidgets)
require(imager)
require(kableExtra)
require(rmdformats)
require(googleLanguageR)
## Set timestamp
timestamp <- lubridate::today() %>% strptime("%Y-%m-%d") %>% as.POSIXct(tz="Europe/Amsterdam")
## Load token
twitter_token <- read_rds("") ## load your twitter token here; for more info, consult the rtweet documentation
## Helper functions
requests_left <- function(call="followers/ids") {
rate_limit(twitter_token) %>% filter(query==call) %>% .$remaining
} # Returns requests left for certain api call (standard: retrieve friends)
time_left <- function(call="followers/ids") {
rate_limit(twitter_token) %>% filter(query==call) %>% .$reset
} ## Returns time left until refresh of rate limit
limit_check <- function(requests,expected=FALSE,call="follower/ids") {
## Limit check
calls_to_go <- requests_left(call)
calls_needed <- requests
time_to_go <- time_left(call) %>% as.numeric
while(calls_to_go<calls_needed) {
if(expected==FALSE) {
print("## Rate limit exceeded unexpectedly!")
}
print(paste("#### We need",calls_needed,"calls, but have",calls_to_go ,"left..."))
print(paste("#### The current time is",Sys.time(),"Please wait."))
print(paste("#### Rate limit resetting in",time_to_go,"minutes."))
if(calls_to_go<calls_needed) {
Sys.sleep(60*(time_to_go+1))
}
calls_to_go <- requests_left(call)
time_to_go <- time_left(call) %>% as.integer
if(calls_to_go>=calls_needed) {
print("Rate Limit refreshed. Continuing...")
}
}
} ## Checks limit before retrieving data
### retrieves number of statuses by given user_id
retrieve_con <- function(status_id_input) {
tweets_complete %>%
filter(status_id==status_id_input) %>%
.$conversation
}
### retrieves number of statuses by given user_id
get_statuses <- function(input_user_id) {
tweets_complete[tweets_complete$user_id==input_user_id,] %>% nrow
}
### retrieves whether input user_id is associated with an account that actively tweeted
get_original <- function(input_user_id) {
ifelse((tweets_complete[tweets_complete$user_id==input_user_id & tweets_complete$original==TRUE,] %>% nrow)>0,TRUE,FALSE)
}
retrieve_user_id <- function(df,input) {
df %>%
filter(status_id %in% input) %>%
.$user_id %>%
unique %>%
as.character
}
pos_gl_translate_detect <- purrr::possibly(gl_translate_detect,tibble(confidence=NA,isReliable=FALSE,language=NA,text="error")) ## Returning NA instead of error
process_tweets <- function(df) {
## Tags tweets that were not retrieved as a result of the initial query as non-original
if(!"original" %in% colnames(df)) {
df <- df %>%
mutate(original=FALSE)
}
print("Processing tweets and mentions...")
## Processes raw tweets
tweets <- df %>%
mutate(hasLink=str_detect(text,"http"), ## whether a tweet contains links
isReply=ifelse(!is.na(.$reply_to_status_id),TRUE,FALSE)) %>% ## wether the tweet is a reply
group_by(status_id=(status_id %>% as.character)) %>%
do(
data.frame(
original=.$original, ## wether tweet is original
screen_name=(.$screen_name %>% as.character), ## user
user_id=(.$user_id %>% as.character),
text_original=(.$text %>% as.character),
text=(.$text %>%
str_replace(.,"^RT ","") %>%
str_replace(.,"http.* ","") %>%
str_replace(.,"^@.*@[a-zA-Z0-9_.-]+ ","") %>%
str_replace(.," @.*@[a-zA-Z0-9_.-]+","") %>%
str_replace(.,"^@[a-zA-Z0-9_.-]+: ","") %>%
str_replace(.,"^@[a-zA-Z0-9_.-]+ ","") %>%
str_replace(.," @[a-zA-Z0-9_.-]+","") %>%
str_replace(.,"<.*>","")),
created=(.$created_at %>%
strptime(format="%Y-%m-%d %H:%M:%S", tz=Sys.timezone(location=TRUE))),
is_retweet=.$is_retweet, ## whether the tweet is a retweet
retweet_status_id=(.$retweet_status_id %>% as.character),
is_quote=.$is_quote,
quote_status_id=(.$quoted_status_id %>% as.character),
is_reply=.$isReply,
reply_status_id=(.$reply_to_status_id %>% as.character),
reply_user_id=(.$reply_to_user_id %>% as.character),
retweet_count=(.$retweet_count %>% as.integer),
favorite_count=(.$favorite_count %>% as.integer),
platform=(.$source %>% as.character),
has_link=(ifelse(!is.na(.$urls_expanded_url),T,F)),
has_media=ifelse(!is.na(.$media_url),T,F),
has_symbols=(ifelse(!is.na(.$symbols),T,F)),
has_hashtags=(ifelse(!is.na(.$hashtags),T,F)),
country=(.$country_code %>% as.character)
)) %>%
as.data.frame() %>%
unique
## Processes user mentions (later used for edge weight)
mentioned_users <- df %>%
group_by(status_id,user_id) %>%
do(mentions_user_id=(.$mentions_user_id %>% unlist %>% as.character)) %>%
unnest %>%
filter(!is.na(mentions_user_id))
## Combining results in a list
exp <- list(tweets,mentioned_users)
names(exp) <- c("tweets","mentioned_users")
return(exp)
} ## Process raw tweets
retrieve_rts_quotes <- function(df) {
# Retweets
## Check if there are retweets to retrieve
rts <- df %>%
filter(!is.na(retweet_status_id))
## Initiating results
retweets_user_ids <- NULL
## Retrieve retweets, if any.
if(nrow(rts)>0) {
print("Retrieving originals for retweets...")
## Retrieve all retweets
rts <- rts %>%
.$retweet_status_id %>%
unique %>%
lookup_statuses %>%
process_tweets ## alle retweets ophalen
## Filter observations already in source file
rts[[1]] <- rts[[1]] %>%
dplyr::filter(!status_id %in% df$status_id)
rts[[2]] <- rts[[2]] %>%
filter(status_id %in% rts[[1]]$status_id)
print(paste0(nrow(rts[[1]])," retweeted originals retrieved"))
## Retrieves user_id's for rts in order to retrieve the original tweets
retweets_user_ids <- df %>%
filter(is_retweet==TRUE) %>%
group_by(retweet_status_id) %>%
do(data.frame(retweet_user_id=retrieve_user_id(rts[[1]],.$retweet_status_id)))
}
## Return NA when there are no retweeted originals to retrieve
if(is.null(retweets_user_ids)) {
print("No originals for retweets to retrieve")
tweets <- tweets %>% mutate(retweet_user_id=NA)
}
# Quotes
## Check if there are quotes to retrieve
quotes <- df %>%
filter(!is.na(quote_status_id))
## Initiating results
quotes_user_ids <- NULL
## Retrieve quotes, if any
if(nrow(quotes)>0) {
print("Retrieving originals for quotes...")
quotes <- quotes %>%
.$quote_status_id %>%
unique %>%
lookup_statuses %>%
process_tweets ## retrieve all quotes
## Filter observations already in source file
quotes[[1]] <- quotes[[1]] %>%
dplyr::filter(!status_id %in% df$status_id)
quotes[[2]] <- quotes[[2]] %>% filter(status_id %in% quotes[[1]]$status_id)
print(paste0(nrow(quotes[[1]])," quoted originals retrieved"))
## Check if there are user_ids associated with quotes to retrieve
quotes_user_ids <- df %>%
filter(is_quote==TRUE)
## Retrieve user ids for quotes
if(nrow(quotes_user_ids)>0) {
print("Retrieving user_ids of quoted original tweets...")
quotes_user_ids <- quotes_user_ids %>%
group_by(quote_status_id) %>%
do(data.frame(quote_user_id=retrieve_user_id(quotes[[1]],.$quote_status_id))) %>%
select(quote_status_id,quote_user_id)
}
}
## Return NA when there are no quoted originals to retrieve
if(is.null(quotes_user_ids)) {
print("No originals for quotes to retrieve")
tweets <- tweets %>% mutate(quote_user_id=NA)
}
# Combining data
## Rbind the retrieved tweets
tweets <- rbind(rts[[1]],quotes[[1]]) %>%
distinct %>%
mutate(conversation=NA)
## Add user_ids for retweeted and quoted originals
if(!is.null(retweets_user_ids)){
tweets <- tweets %>%
left_join(.,retweets_user_ids,by=c("retweet_status_id"="retweet_status_id"))
}
if(!is.null(quotes_user_ids)){
tweets <- tweets %>%
left_join(.,quotes_user_ids,by=c("quote_status_id"="quote_status_id"))
}
tweets <- tweets %>% unique
mentioned_users <- rbind(rts[[2]],quotes[[2]])
exp <- list(tweets,mentioned_users)
names(exp) <- c("tweets","mentioned_users")
return(exp)
} ## Retrieve retweeted and quoted originals
retrieve_conversations <- function(df) {
## Creates index df in which replies are numbered
tweets <- df %>%
unique %>%
mutate(conversation=dense_rank(.$reply_status_id) %>% as.factor) %>%
filter(!is.na(conversation))
if(nrow(tweets)>=1) {
## How many tweets to retrieve
## Is recalculated after each retrieval loop
to_retrieve <- tweets %>%
.$reply_status_id %>%
unique %>%
length
## Initiating index and results dfs
tweet_ids <- NULL
mentioned_users_ <- NULL
i <- 1
## Keeps retrieving until no more left
while(to_retrieve>0) {
## Determine which tweets to lookup
lookup <- tweets %>%
filter(!reply_status_id %in% tweet_ids) %>% ## excludes previously retrieved tweets
filter(!is.na(reply_status_id)) %>%
.$reply_status_id %>%
unique %>%
as.character
print("Retrieving tweets that were replied to...")
## Looks up tweets and checks if tweets were returned
replies_raw <- lookup_statuses(lookup)
check <- replies_raw %>% nrow %>% data.frame(x=.) %>% mutate(y=ifelse(x==0,FALSE,TRUE))
## Numbering results
if(check$y==TRUE) {
## Processes the raw results
replies <- process_tweets(replies_raw)
## Numbers the tweets that were just retrieved according to the associated tweet-and-reply-chain
replies[[1]] <- left_join(replies[[1]],(tweets %>% select(reply_status_id,conversation)),
by=c("status_id"="reply_status_id"))
}
if(check$y==FALSE) {
## Returns NA in case of error (when tweets is deleted)
replies <- list(NA,NA,NA,NA,NA,NA,NA)
}
## Add the new tweets to the index
tweet_ids <- c(tweet_ids,lookup) %>% unique
## Adding results to df
if(check$y==TRUE) {
## Combines existing data with retrieved data
tweets <- rbind(tweets,replies[[1]]) %>%
unique
mentioned_users_ <- rbind(mentioned_users_,replies[[2]])
}
## Determining which tweets to retrieve in next round
to_retrieve <- tweets %>%
filter(!reply_status_id %in% tweet_ids) %>%
filter(!reply_status_id %in% lookup) %>%
filter(!is.na(reply_status_id)) %>%
.$reply_status_id %>%
unique %>%
length
i <- i+1
print(paste0(to_retrieve," tweets to retrieve in next round"))
}
print(paste0(nrow(tweets)," tweets retrieved associated with ",max(tweets$conversation %>% as.numeric)," conversations"))
## Adding extra columns (otherwise rbind error)
tweets <- tweets %>%
mutate(retweet_user_id=NA,
quote_user_id=NA) %>%
unique
mentioned_users_ <- mentioned_users_ %>% filter(!status_id %in% df$status_id) %>% unique
exp <- list(tweets,mentioned_users_)
names(exp) <- c("tweets","mentioned_users")
return(exp)
}
} ## Unfold and retrieve reply-chains and mark conversation numbers
retrieve_user_stats <- function(user_list) {
print("Starting function retrieve_user_stats()")
## Creates a unique user list, counts them, and calculates required batches
user_list <- as.character(user_list) %>% unique
rows <- length(user_list)
batches <- (rows/90000)%>%ceiling
print(paste("## Retrieving a list of ",rows," users in",batches," batches"))
## Initiates data frame to write data to
wip_user_stats <- NULL
## Retrieves all the users in one batch in case of less than 18k users
if(rows<90000) {
limit_check(rows/90000,expected=TRUE,call="users/lookup")
print("### Starting first and last batch")
wip_user_stats <- lookup_users(user_list)
print("### Success")
}
## In case of more than 18k users, retrieves the users in batches of 18k
if(rows>90000) {
limit_check(900,expected=TRUE,call="users/lookup")
print("### Starting first of miltiple batches")
wip_user_stats <- rbind(wip_user_stats,
lookup_users(user_list[1:90000]))
print("### Success")
for(i in 1:(batches-1)) {
limit_check(900,expected=TRUE,call="users/lookup")
print(paste("### Starting batch",i+1))
batch <- user_list[(1+(i*90000)):(90000+(i*90000))]
batch <- batch[!is.na(batch)]
print(paste("#### Retrieving",length(batch),"users"))
tmp <- lookup_users(batch)
print("### Succes")
wip_user_stats <- rbind(wip_user_stats,tmp)
}
}
print("Retrieval succesfull, proceeding with processing")
## Processes the data
user_stats_t <- wip_user_stats %>%
group_by(user_id=(user_id %>% as.character)) %>%
do(
data.frame(screen_name=(.$screen_name %>% as.character),
real_name=(.$name %>% as.character),
description=(.$description %>% as.character),
profile_image=(.$profile_image_url %>% as.character),
background_image=(.$profile_background_url %>% as.character),
banner_image=(.$profile_banner_url %>% as.character),
location=(.$location %>% as.character),
language=(.$account_lang %>% as.character),
statuses=(.$statuses_count %>% as.integer),
followers=(.$followers_count %>% as.integer),
following=(.$friends_count %>% as.integer),
favorites=(.$favourites_count %>% as.integer),
protected=(.$protected %>% as.logical),
verified=(.$verified %>% as.logical),
age_weeks=(((Sys.time() - .$account_created_at) %>%
as.integer / 7) %>% round(0) %>% as.integer)
)
) %>% unique
return(user_stats_t)
} ## Function to retrieve user_stats
#### Returns ids of followers one by one. Note users in input df may not have more tha 75k followers
get_followers_in_batches <- function(user_id,requests) {
limit_check(requests,FALSE,"followers/ids")
followers <- get_followers(user_id %>% as.character,n=(requests*5000))$user_id %>%
as.data.frame %>%
ifelse(nrow(.)>0,.,NA) %>%
unlist %>%
as.character
return(followers)
}
## Alternative function that returns NA in case of error
pos_get_followers <- possibly(get_followers_in_batches,NA)
#### Returns ids of followers for user accounts with more than 75k followers
get_followers_for_large_account <- function(user_id,calls_needed) {
requests <- calls_needed
windows <- (requests/15) %>% ceiling
print(paste("######## Retrieving followers for user in",windows,"windows."))
print(paste("######## Estimated time left:",(windows*15)-15,"minutes"))
print("######## Starting window 1")
limit_check(15,TRUE,"followers/ids")
tmp <- get_followers(as.character(user_id %>% as.character),n=(calls_needed*5000))
page <- next_cursor(tmp)
followers <- tmp$user_id %>%
as.data.frame %>%
ifelse(nrow(.)>0,.,NA) %>%
unlist %>%
as.character
for(i in 1:(windows-1)) {
limit_check(15,TRUE,"followers/ids")
print(paste("######## Starting window",i+1))
tmp <- get_followers(as.character(user_id),n=(calls_needed*5000),page=page)
page <- next_cursor(tmp)
tmp <- tmp$user_id %>%
as.data.frame %>%
ifelse(nrow(.)>0,.,NA) %>%
unlist %>%
as.character
followers <- c(followers,tmp)
}
return(followers)
}
## Alternative function that returns NA in case of error
pos_get_followers_for_large_account <- possibly(get_followers_for_large_account,NA)
#### Returns ids of following one by one. Note users in input df may not have more tha 75k followers
get_following_in_batches <- function(user_id,requests) {
limit_check(requests,FALSE,"friends/ids")
following <- get_friends(user_id %>% as.character,n=(requests*5000))$user_id %>%
as.data.frame %>%
ifelse(nrow(.)>0,.,NA) %>%
unlist %>%
as.character
return(following)
}
## Alternative function that returns NA in case of error
pos_get_following <- possibly(get_following_in_batches,NA)
#### Returns ids of following for user accounts with more than 75k following
get_following_for_large_account <- function(user_id,calls_needed) {
windows <- (calls_needed/15) %>% ceiling
print(paste("######## Retrieving followers for user in",windows,"windows."))
print(paste("######## Estimated time left:",(windows*15)-15,"minutes"))
print(paste("######## Starting window 1 of ",windows))
limit_check(15,TRUE,"friends/ids")
tmp <- get_friends((user_id %>% as.character),n=(calls_needed*5000))
page <- next_cursor(tmp)
following <- tmp$user_id %>%
as.data.frame %>%
ifelse(nrow(.)>0,.,NA) %>%
unlist %>%
as.character
## hier calls needed gedeeld door 15. Moet werken nu.
for(i in 1:(windows-1)) {
limit_check(15,TRUE,"friends/ids")
print(paste("######## Starting windows",i+1))
tmp <- get_friends((user_id %>% as.character),n=(calls_needed**5000),page=page)
page <- next_cursor(tmp)
tmp <- tmp$user_id %>%
as.data.frame %>%
ifelse(nrow(.)>0,.,NA) %>%
unlist %>%
as.character
following <- c(following,tmp)
}
return(following)
}
## Alternative function that returns NA in case of error
pos_get_following_for_large_account <- possibly(get_following_for_large_account,NA)
We navigate to the folder where the output of the GetOldTweets step are stored, and load a data frame with Tweets. Note that - similar to how we handle the results of the Google step - queries and results can be distinguished by associated topic (type) and level (marker) that the script below derives from the file names. Follow this format in naming your CSV’s (level_topic.csv), or simply overwrite these columns manually with a single value.
## Loading tweets
### The code commented out below loads all output files of GetOldTweets in a given folder
#index <- tibble(file=list.files())
#tweets <- NULL
#for(i in 1:nrow(index)) {
# x <- read_delim(index$file[i],
# delim="\t",
# quote = "",
# col_types = cols(
# username = col_character(),
# date = col_datetime(format = ""),
# replies = col_integer(),
# retweets = col_integer(),
# favorites = col_integer(),
# text = col_character(),
# geo = col_character(),
# mentions = col_character(),
# hashtags = col_character(),
# id = col_character(),
# permalink = col_character()
# ))
# marker <- index$file[i] %>% substr(1,str_locate(.,"_") %>% .[1]-1)
# type <- index$file[i] %>% str_extract("_[aA-zZ]+") %>% str_replace_all("_","")
# tweets <- rbind(tweets, x %>% mutate(marker=marker,type=type,isOriginal=TRUE))
#}
We (re-)retrieve all the tweets using rtweet, offering us a little more information on each tweet. As many of the core set of tweets resulted from (multiple) interactions between users, we also retrieve these tweets’ conversational context spanning the tweets, retweets, quotes, and replies the tweets were responding to.
## Split data up into batches to retrieve tweets for
## This is to distinguish tweets belonging to different groups (markers)
tweet_index <- tweets %>% select(marker,type,id) %>% arrange(marker,type,id) %>%
mutate(batchprep=paste(marker,type,sep="-"),
batch=dense_rank(batchprep)) %>%
select(-batchprep)
## Retrieve all queries
tweets_raw <- NULL
for(i in 1:length(tweet_index$batch %>% unique)) {
to_retrieve <- tweet_index %>%
filter(batch==tweet_index$batch %>% unique %>% .[i]) %>%
mutate(id=str_replace_all(id,"\"",""))
tmp <- lookup_statuses(to_retrieve$id %>% unique,
parse=TRUE) %>%
mutate(original=TRUE,
Type=to_retrieve$type[1],
Marker=to_retrieve$marker[1])
tweets_raw <- rbind(tweets_raw,tmp)
}
## Create Index
tweets_by_marker <- tweets_raw %>% select(user_id,status_id,Type,Marker) %>%
mutate(timestamp=timestamp)
### Process the data in the desired format
tweets_processed <- process_tweets(tweets_raw)
### Retrieve the retweeted and quoted originals before retrieving all the user data
rts_quotes_processed <- NULL
if((tweets_processed[[1]] %>% filter(is_quote==TRUE | is_retweet==TRUE) %>% nrow)>=1) {
rts_quotes_processed <- retrieve_rts_quotes(tweets_processed[[1]])
rts_quotes_processed[[1]] <- rts_quotes_processed[[1]] %>%
mutate(original=ifelse(status_id %in% tweets_processed[[1]]$status_id,TRUE,FALSE)) %>%
unique
}
### Retrieve the statuses that replies are replying to, and continue to follow these chains until first tweet is reached
conversations_processed <- retrieve_conversations(tweets_processed[[1]])
if(!is.null(nrow(conversations_processed[[1]]))) {
conversations_processed[[1]] <- conversations_processed[[1]] %>%
mutate(original=ifelse(status_id %in% tweets_processed[[1]]$status_id,TRUE,FALSE)) %>%
unique
} else {
conversations_processed <- NULL
}
## Retrieving retweet and quote originals as well as conversations results in doubles. The code below addresses this.
### Mark rts or quotes that were part of a conversation as such
if(!is.null(nrow(conversations_processed[[1]])) && !is.null(nrow(rts_quotes_processed))) {
rts_quotes_processed[[1]] <- rts_quotes_processed[[1]] %>%
select(-conversation) %>%
left_join(.,conversations_processed[[1]] %>%
select(status_id,conversation),by=c("status_id"="status_id"))
### Remove rts and quotes that were part of a conversation from the conversation set
conversations_processed[[1]] <- conversations_processed[[1]] %>%
filter(!status_id %in% (rts_quotes_processed[[1]] %>%
select(status_id,conversation) %>%
filter(!is.na(conversation)) %>%
.$status_id))
}
### Removing tweets from the original set that are part of replies, rts or quotes
tweets_processed[[1]] <- tweets_processed[[1]] %>%
filter(!status_id %in% rts_quotes_processed[[1]]$status_id) %>%
filter(!status_id %in% conversations_processed[[1]]$status_id) %>%
mutate(conversation=NA,retweet_user_id=NA,quote_user_id=NA)
if(!is.null(nrow(rts_quotes_processed[[1]]))) {
rts_quotes_processed[[1]] <- rts_quotes_processed[[1]] %>% mutate(retweet_user_id=NA)
}
## Finishing up the data
tweets_complete <- rbind(tweets_processed[[1]],rts_quotes_processed[[1]],conversations_processed[[1]]) %>%
unique
### Checking for doubles; doubles are and have ambiguous characteristics, such as
### being part of two conversations or being a quote as well as a normal tweet
double <- tweets_complete %>%
group_by(status_id) %>%
dplyr::summarize(count=n()) %>%
filter(count>1) %>%
.$status_id
### Isolating the non-double tweets as basis for conversations table
if(length(double)>=1) {
without_double <- tweets_complete %>%
filter(!status_id %in% double) %>%
distinct %>%
select(status_id,conversation)
### Only the double cases, without conversation column (cleaning its characteristics)
double_cases_without_conv <- tweets_complete %>%
filter(status_id %in% double) %>%
select(-conversation) %>%
distinct
## Retrieves the conversation number for the double cases
clean_doubles <- double_cases_without_conv %>%
group_by(status_id) %>%
do(conversation=retrieve_con(.$status_id)) %>%
unnest
tweets <- tweets_complete %>%
select(-conversation) %>%
distinct %>%
filter(!status_id %in% double) %>%
mutate(timestamp) %>%
as_tibble
## Create conversation data frame (conversation is not stored as a tweet attribute, but in a seperate table)
conversations <- clean_doubles %>%
rbind(.,without_double) %>%
unique %>%
filter(!is.na(conversation))
} else {
tweets <- tweets_complete %>%
distinct %>%
mutate(timestamp) %>%
as_tibble
conversations <- conversations_processed[[1]] %>%
unique %>%
filter(!is.na(conversation))
}
## Creates mentions data frame (including the mentions in the rts, quotes, and tweets-and-reply-chains.)
mentions_users <- rbind(tweets_processed[[2]],rts_quotes_processed[[2]],conversations_processed[[2]]) %>%
unique %>%
mutate(timestamp=timestamp) %>%
as_tibble
mentions <- rbind(tweets_processed$mentioned_users,
rts_quotes_processed$mentioned_users,
conversations_processed$mentioned_users) %>%
as_tibble %>% mutate(timestamp=timestamp) ## mentions, te gebruiken voor netwerk zonder context
user_stats <- retrieve_user_stats(tweets$user_id %>% unique %>% as.character) %>%
as_tibble %>%
mutate(timestamp=timestamp)
Using Google’s Natural Language Processing API, we detect the language of each tweet, and filter our data set to only retain tweets by Dutch users, or tweets in the Dutch language.
## Detect language and only retain Tweets in Dutch
# First, authenticate with Google
### plugin your credentials
options("googleAuthR.client_id" = "")
options("googleAuthR.client_secret" = "")
options("googleAuthR.scopes.selected" = c("https://www.googleapis.com/auth/cloud-platform"))
googleAuthR::gar_auth()
## Batching tweets
tweets_tmp <- tweets
tweets_tmp$line <- c(1:nrow(tweets_tmp))
tweets_tmp <- tweets_tmp %>%
mutate(batch=line/250) %>% mutate(batch=ceiling(batch)) %>% select(-line)
## Running through for loop delaying each batch with 110 seconds
tweets_lang <- NULL
index <- tweets_tmp$batch %>% unique
for(i in 1:length(index)) {
print(paste("Retrieving round ",i," of ",max(index)))
x <- tweets_tmp %>%
filter(batch==index[i]) %>%
group_by(status_id,user_id) %>%
do(pos_gl_translate_detect(.$text)) %>% ## return language characteristics by id
unnest
tweets_lang <- rbind(tweets_lang,x)
Sys.sleep(110)
}
## Including messages that...
### Are in Dutch (dutch_messages - status_ids)
### Are Written by Dutch users (dutch_users - user_ids)
### Are associated with conversations involving Dutch users (ducth_conversations - conversations numbers)
### Are mentioning or mentioned by Dutch users (dutch_mentions - status_ids)
dutch_messages <- tweets_lang %>% filter(language=="nl") %>% .$status_id %>% unique
## Loading list of Dutch cities to identify users living in the Netherlands
dutch_cities <- read_csv("dutch_cities.csv",
col_types = cols(Province = col_skip(),
S.No = col_skip())) %>%
.$Cities %>% tolower
dutch_cities <- c(paste0(" ",dutch_cities," "),paste0(" ",dutch_cities,"$"),paste0("^",dutch_cities," "),paste0("^",dutch_cities,"$"))
dutch_cities <- c(dutch_cities,"netherlands","holland"," nl$", " nl ")
dutch_users <- user_stats %>%
mutate(location=location %>% tolower) %>%
group_by(user_id,location) %>%
do(dutch=ifelse((str_detect(.$location,dutch_cities) %>% sum)>=1,TRUE,FALSE)) %>%
unnest
dutch_users_2 <- tweets_lang %>% filter(language=="nl") %>% .$user_id %>% c(.,dutch_users %>% filter(dutch==TRUE) %>% .$user_id) %>% unique
messages_by_dutch_users <- tweets %>% filter(user_id %in% dutch_users_2 | status_id %in% dutch_messages) %>% .$status_id %>% unique
dutch_conversations <- conversations %>% filter(status_id %in% messages_by_dutch_users) %>% .$conversation %>% unique
dutch_mentions <- mentions %>% filter(user_id %in% dutch_users_2 | mentions_user_id %in% dutch_users_2) %>% .$status_id %>% unique
messages_to_include <- c(messages_by_dutch_users,
conversations,
dutch_mentions) %>% unique
tweets <- tweets %>% filter(status_id %in% messages_to_include)
user_stats <- user_stats %>% filter(user_id %in% tweets$user_id)
tweets_by_marker <- tweets_by_marker %>% filter(status_id %in% messages_to_include)
mentions <- mentions %>% filter(status_id %in% messages_to_include)
conversations <- conversations %>% filter(conversation %in% dutch_conversations)
This results in the following data frames: tweets, tweets_by_marker, mentions, conversations, and user_stats.
Next, we retrieve the ego network of the authors of these tweets, including all the users that the authors are following and are being followed by.
## Preprocess data
users_followers <- user_stats %>%
select(user_id,followers) %>%
filter(!user_id %in% x) %>%
filter(followers>0, followers < 500000) %>%
unique %>%
mutate(calls_needed=((followers/5000)%>%ceiling))
## Retrieve followers
if(nrow(users_followers>0)) {
## Smaller accounts to retrieve in one time frame
smaller_accounts <- users_followers %>%
filter(followers<=75000) %>%
arrange(followers)
## Big accounts to retrieve in more time frames
are_there_big_accounts <- ifelse(nrow(users_followers %>%filter(followers>75000))!=0,TRUE,FALSE)
## Initiating results df
followers <- NULL
## Loops through given data frame row by row
for(i in 1:nrow(smaller_accounts)) {
## Subset batch, announcing batch start
batch <- smaller_accounts[i,]
calls_needed <- batch$calls_needed
## Limit check
limit_check(batch$calls_needed,TRUE,"followers/ids")
print(paste("###### Retrieving user",i,"of",nrow(smaller_accounts)))
## Retrieve followers. Returning NA when no results given
tmp <- batch %>%
group_by(user_id) %>% ##=user_id) %>%
do(
data.frame(follower=(pos_get_followers(.$user_id,calls_needed)))) %>%
as.data.frame
tmp$follower <- tmp$follower %>% as.character
followers <- rbind(followers,tmp)
} ## einde retrieval small accounts
if(are_there_big_accounts==TRUE) {
## create set
big_accounts <- users_followers %>%
filter(followers>75000) %>%
arrange(calls_needed)
## Loops through given data frame row by row
for(i in 1:nrow(big_accounts)) {
## Subset batch, announcing batch start
batch <- big_accounts[i,]
## Retrieve followers. Returning NA when no results given
tmp <- batch %>%
group_by(user_id) %>%
do(
data.frame(follower=get_followers_for_large_account(.$user_id,.$calls_needed))) %>%
as.data.frame
tmp$follower <- tmp$follower %>% as.character
followers <- rbind(followers,tmp)
} ## for loop big_accounts
} ## einde retrieval large
followers$user_id <- followers$user_id %>% as.character
followers$follower <- followers$follower %>% as.character
} ## einde retrieval followers
## Retrieve following
## Preprocess data
users_following <- user_stats %>%
select(user_id,following) %>%
filter(user_id %in% followers$user_id) %>%
filter(following>0,following<500000) %>%
unique %>%
mutate(calls_needed=((following/5000)%>%ceiling))
if(nrow(users_following>0)) {
## Smaller accounts to retrieve in one time frame
smaller_accounts <- users_following %>%
filter(following<=75000) %>%
arrange(calls_needed)
## Big accounts to retrieve in more time frames
are_there_big_accounts <- ifelse(nrow(users_following %>%filter(following>75000))!=0,TRUE,FALSE)
## Initiating results df
following <- NULL
## Loops through given data frame row by row
for(i in 1:nrow(smaller_accounts)) {
## Subset batch, announcing batch start
batch <- smaller_accounts[i,]
calls_needed <- batch$calls_needed
## Limit check
limit_check(batch$calls_needed,TRUE,"friends/ids")
print(paste("###### Retrieving user",i,"of",nrow(smaller_accounts)))
## Retrieve followers. Returning NA when no results given
tmp <- batch %>%
group_by(user_id) %>% ##=user_id) %>%
do(
data.frame(following=(pos_get_following(.$user_id,calls_needed)))) %>%
as.data.frame
tmp$following <- tmp$following %>% as.character
following <- rbind(following,tmp)
} ## einde retrieval small accounts
if(are_there_big_accounts==TRUE) {
## create set
big_accounts <- users_following %>%
filter(following>75000)
## Loops through given data frame row by row
for(i in 1:nrow(big_accounts)) {
## Subset batch, announcing batch start
batch <- big_accounts[i,]
## Retrieve followers. Returning NA when no results given
tmp <- batch %>%
group_by(user_id) %>%
do(
data.frame(following=get_following_for_large_account(.$user_id,.$calls_needed))) %>%
as.data.frame
tmp$following <- tmp$following %>% as.character
following <- rbind(following,tmp)
} ## einde for loop big_accounts
} ## einde retrieval large
following$user_id <- following$user_id %>% as.character
following$following <- following$following %>% as.character
} ## einde retrieval following
## Cleaning colnames
colnames(followers) <- c("source","target")
colnames(following) <- c("target","source")
Collectively, these data reflect how are authors are connected, and what media preferences and audiences they have in common. To reduce noise, we can filter the network to only include Twitter users that are connected to at least 10% of the authors of our tweets. The code offers the option to filter our users that are not well-connected to authors in our data set. This cut-off point can be determined by looking at the distribution of non-author connections with authors.
This returns two data frames: nodelist_data_frame and edgelist_data_frame.
## Calculate author connections...
### ...for user_id_x_follower
followers_count <- followers %>%
.$target %>%
table %>%
as.data.frame
### ...for user_id_x_following
following_count <- following %>%
.$source %>%
table %>%
as.data.frame
### ...and combining both
connections_with_authors <- data.frame(user_id=c(followers_count$. %>% as.character,following_count$. %>% as.character),
count=c(followers_count$Freq,following_count$Freq)) %>%
group_by(user_id) %>%
summarize(count=sum(count))
## Dealing with factors
connections_with_authors$user_id <- connections_with_authors$user_id %>% as.character
## Inspecting the distribution to determine the cut-off point
### Summary
summ <- connections_with_authors$count %>% summary
### Distribution
dist <- connections_with_authors %>%
group_by(count) %>%
summarize(occurrence=n()) %>%
mutate(percentage=occurrence/sum(occurrence)*100) %>% arrange(desc(percentage))
print(summ)
plot(dist$occurrence)
#### Based on this distribution above, we vould decide to retain accounts with 10 authors or more
## Filtering the data
### Creating a vector of the author id's
author_ids <- tweets_complete$user_id %>% unique
### Filtering followers
followers_flt <- followers %>%
filter(target %in% c(author_ids,connections_with_authors %>% filter(count>=(length(author_ids) * .1)) %>% .$user_id %>% as.character)) %>%
select(source,target)
### Filtering following
following_flt <- following %>%
filter(source %in% c(author_ids,(connections_with_authors %>% filter(count>=(length(author_ids) * .1)) %>% .$user_id %>% as.character))) %>%
select(source,target)
## Creating edgelist
## Retrieve mentions for weight
mentions <- mentions %>%
group_by(user_id,mentions_user_id) %>%
summarize(weight=n()) %>%
ungroup %>%
mutate(weight=dense_rank(weight)+1)
## Create edgelist including weight column
edgelist_data_frame <- rbind(data.frame(source=followers_flt$source %>% as.character,
target=followers_flt$target %>% as.character),
data.frame(source=following_flt$source %>% as.character,
target=following_flt$target %>% as.character)) %>%
distinct %>%
left_join(.,mentions,by=c("source"="user_id","target"="mentions_user_id")) %>%
mutate(weight=ifelse(is.na(weight),1,weight))
edgelist_data_frame <- edgelist_data_frame %>%
select(Source=source,Target=target,Weight=weight)
## retrieve user_stats for all nodes in the network and create the initial nodelist
## Retrieve user_stats
user_data_to_retrieve <- c(edgelist_data_frame$Source,edgelist_data_frame$Target) %>% unique %>% as.character
user_stats <- retrieve_user_stats(user_data_to_retrieve)
# Create nodelist
nodelist_data_frame <- data.frame(Id=user_stats$user_id %>% as.character,
Label=user_stats$screen_name %>% as.character,
real_name=user_stats$real_name %>% as.character,
description=user_stats$description %>% as.character,
location=user_stats$location %>% as.character,
language=user_stats$language %>% as.character,
followers=user_stats$followers,
following=user_stats$following) %>%
as.data.frame %>%
group_by(Id) %>%
mutate(statuses=get_statuses(Id) %>% as.integer) %>% ## gets the ammount of statuses for every node
as.data.frame
## Workaround for R making factors from characters
nodelist_data_frame$Id <- nodelist_data_frame$Id %>% as.character
nodelist_data_frame$Label <- nodelist_data_frame$Label %>% as.character
nodelist_data_frame$real_name <- nodelist_data_frame$real_name %>% as.character
nodelist_data_frame$description <- nodelist_data_frame$description %>% as.character
nodelist_data_frame$location <- nodelist_data_frame$location %>% as.character
nodelist_data_frame$language <- nodelist_data_frame$language %>% as.character
user_ids <- c(edgelist_data_frame$source,edgelist_data_frame$target) %>% unique
## Creating NA's for user_ids in the edgelist that could not be retrieved (deleted?)
na <- data.frame(Id=c(user_ids[!user_ids %in% nodelist_data_frame$Id])) %>%
mutate(Label=NA,
real_name=NA,
description=NA,
location=NA,
language=NA,
followers=NA,
following=NA,
statuses=NA) %>%
as.data.frame
## Combining the data
nodelist_data_frame <- rbind(nodelist_data_frame,na)
original <- tweets_complete[tweets_complete$original==TRUE,]$user_id %>% unique
nodelist_data_frame <- nodelist_data_frame %>%
group_by(Id,Label,real_name,description,location,language,followers,following,statuses) %>%
do(original=ifelse(.$Id %in% original,TRUE,FALSE)) %>%
unnest
We use igraph’s implementation of the Louvain algorithm to identify communities of closely connected Twitter users. We use igraph’s betweenness algorithm to get a sense of the extent to which each user in the network is capable of connecting a large number of people.
Finally, we export our igraph network into a data frame nodelist_network_stats that can be exported and load it into Gephi for visualization.
## First, we make sure the same user_ids are in both the nodelist and edgelist
Ids_in_nodelist <- nodelist_data_frame %>% .$Id %>% unique ## All ids in nodelist
Ids_in_edgelist <- c(edgelist_data_frame$Source,edgelist_data_frame$Target) %>% unique ## all ids in edgelist
Edge_ids_not_in_nodelist <- Ids_in_edgelist[!Ids_in_edgelist %in% Ids_in_nodelist] ## Ids in edgelist not in nodelist
Node_ids_not_in_edgelist <- Ids_in_nodelist[!Ids_in_nodelist %in% Ids_in_edgelist] ## Ids in de nodelist not in de edgelist; typically users that didnt interact
edgelist <- edgelist_data_frame %>% filter(!Source %in% Edge_ids_not_in_nodelist, !Target %in% Edge_ids_not_in_nodelist)
nodelist <- nodelist_data_frame %>% filter(!Id %in% Node_ids_not_in_edgelist)
## create graph
graph <- graph_from_data_frame(edgelist,directed=TRUE,vertices=(nodelist))
## Cluster louvain
louvain <- cluster_louvain(as.undirected(graph))
## Calc modularity score
modularity <- modularity(as.undirected(graph), membership(louvain))
## create df for louvain
Ids <- membership(louvain) %>% names
Communities <- membership(louvain) %>% as.numeric
membership <- data.frame(Id=Ids,Community=Communities)
## Betweenness
betweenness <- estimate_betweenness(graph,directed=TRUE,cutoff=10)
Ids_btw <- betweenness %>% names
btw <- betweenness %>% as.numeric
betweenness <- data.frame(Id=Ids_btw,betweenness=btw)
## PageRank
pagerank <- page_rank(graph,directed=TRUE)
Ids_pr <- pagerank$vector %>% names
pr <- pagerank$vector %>% as.numeric
pagerank <- data.frame(Id=Ids_pr,pagerank=pr)
## update nodelist
nodelist_network_stats <- nodelist %>%
left_join(.,membership,by=c("Id"="Id")) %>%
left_join(.,pagerank,by=c("Id"="Id")) %>%
left_join(.,betweenness,by=c("Id"="Id"))
We analyze the profile descriptions – which are the short texts people use to describe themselves on Twitter – by community, distinguishing between authors and the wider media ecologies, in a similar manner as described in step 2.6.
Using the plot_cloud function, providing it with the worclouds_for_each_com data frame and the community’s id, wordclouds can be plotted.
keepers <- nodelist_network_stats %>%
group_by(Community) %>%
summarize(count=n()) %>%
mutate(perc=count/sum(count)*100) %>%
arrange(desc(perc)) %>%
filter(perc>=1) %>%
.$Community
## Cleaning corpus
corpus_description <- nodelist_network_stats %>%
unnest_tokens(.,word,description,token="words") %>%
filter(!word %in% stopwords("en",source="snowball")) %>% ## stopwords
filter(!word %in% stopwords("nl",source="snowball")) %>% ## stopwords
filter(!word %in% stopwords("de",source="snowball")) %>% ## stopwords
filter(!word %in% stopwords("fr",source="snowball")) %>% ## stopwords
filter(!str_detect(word,"[0-9]")) %>% ## getallen
filter(!str_detect(word,"http[s]?|www|co.in|.in|.com|t.co")) %>% ## url noise
filter(!str_detect(word,"^[A-z]$|^[A-z][A-z]$")) ## one and two character words
## Cleaning corpus
corpus_tweets <- tweets %>%
left_join(.,nodelist_network_stats %>% select(Id,Community),by=c("user_id"="Id")) %>%
filter(is_retweet==FALSE, is_quote==FALSE) %>%
unnest_tokens(.,word,text,token="words") %>%
filter(!word %in% stopwords("en",source="snowball")) %>% ## stopwords
filter(!word %in% stopwords("nl",source="snowball")) %>% ## stopwords
filter(!word %in% stopwords("de",source="snowball")) %>% ## stopwords
filter(!word %in% stopwords("fr",source="snowball")) %>% ## stopwords
filter(!str_detect(word,"[0-9]")) %>% ## getallen
filter(!str_detect(word,"http[s]?|www|co.in|.in|.com|t.co")) %>% ## url noise
filter(!str_detect(word,"^[A-z]$|^[A-z][A-z]$"))
tf_idf_desc <- corpus_description %>%
group_by(Community,word) %>%
summarize(count=n()) %>%
bind_tf_idf(.,word,Community,count) %>%
arrange(Community,desc(tf_idf))
tf_idf_tweets <- corpus_tweets %>%
group_by(Community,word) %>%
summarize(count=n()) %>%
bind_tf_idf(.,word,Community,count) %>%
arrange(Community,desc(tf_idf))
wordclouds_for_each_com_desc <- NULL
for(i in 1:length(keepers)) {
## Create wordcloud data by week
x <- tibble(Community=keepers[i]) %>%
full_join(.,tf_idf_desc,by=c("Community"="Community"))
wordclouds_for_each_com_desc <- rbind(wordclouds_for_each_com_desc,x) %>% distinct
}
wordclouds_for_each_com_tweets <- NULL
for(i in 1:length(keepers)) {
## Create wordcloud data by week
x <- tibble(Community=keepers[i]) %>%
full_join(.,tf_idf_tweets,by=c("Community"="Community"))
wordclouds_for_each_com_tweets <- rbind(wordclouds_for_each_com_tweets,x) %>% distinct
}
# plot_cloud(wordclouds_for_each_com,39)
For the recommender network, we use the Video Network option from the YouTube Data Tools.
We retrieve the video networks for each of our queries with a crawl depth of 1. The YouTube Data Tools return a Gephi file.
We used Gephi for visualization and analysis of the network.
First, we import the nodelist exported from Gephi after performing community analysis. To understand the nature of the different communities in the video network, we have analyzed the video titles by community by plotting wordclouds following the method described in step 2.6.
Using the plot_cloud function, providing it with the worclouds_for_each_com data frame and the community’s id, wordclouds can be plotted.
## related_videos_depth0_gephi_export <- read_csv("")
YT_graph <- related_videos_depth0_gephi_export
list_for_analysis <- YT_graph %>% select(Label,modularity_class)
# CLEANING THE CORPUS
## Wordcloud data by community
keepers <- list_for_analysis %>%
group_by(modularity_class) %>%
summarize(count=n()) %>%
ungroup %>%
mutate(perc=count/sum(count)) %>%
arrange(desc(perc)) %>% filter(perc>=0.1) %>%
.$modularity_class
## Cleaning corpus
corpus_videotitles <- list_for_analysis %>% as_tibble %>%
unnest_tokens(.,word,Label,token="words") %>%
filter(!word %in% stopwords("nl",source="snowball")) %>% ## stopwords
filter(!word %in% stopwords("en",source="snowball")) %>% ## stopwords
filter(!word %in% stopwords("de",source="snowball")) %>% ## stopwords
filter(!word %in% stopwords("ru",source="snowball")) %>% ## stopwords
filter(!str_detect(word,"[0-9]")) %>% ## getallen
filter(!str_detect(word,"http[s]?|www|co.in|.in|.com")) %>% ## url noise
filter(!str_detect(word,"^[A-z]$|^[A-z][A-z]$")) %>% ## one and two character words
filter(modularity_class %in% keepers)
## Word count by com (title)
titlecount_by_marker <- corpus_videotitles %>%
group_by(modularity_class,word) %>%
summarize(count=n()) %>%
arrange(modularity_class,desc(count))
## TF*IDF by com (title)
titletfidf_by_marker <- corpus_videotitles %>%
group_by(modularity_class,word) %>%
summarize(count=n()) %>%
ungroup %>%
bind_tf_idf(word,modularity_class,count) %>%
arrange(modularity_class,desc(count))
wordclouds_for_each_com_yt <- NULL
for(i in 1:length(keepers)) {
## Create wordcloud data by week
x <- tibble(Community=keepers[i]) %>%
full_join(.,titletfidf_by_marker,by=c("Community"="modularity_class"))
wordclouds_for_each_com_yt <- rbind(wordclouds_for_each_com_yt,x) %>% distinct
}
# plot_cloud(wordclouds_for_each_com_yt,9)