In the digital age, communication professionals can no longer resort to simply sending messages. In order to raise awareness around products, services, or issues; communication professionals need to spark conversations for public interest to spread through online networks. The analysis of media networks can contribute to such media strategies. This document shares an approach to: 1) identify online communities; 2) understand how they think and talk about products, services, or issues; and 3) find potential collaboration partners to stimulate online conversations such as online magazines or micro-celebrities.

Below, we share a schematic overview as well as the code and specific descriptions of the steps that we propose to take to retrieve and analyse networks of websites (issue networks), networks of Twitter users (ego networks), and networks of related YouTube-videos (recommender network).

1. Scope and key phrases

1.1 Define scope

A specific key topic should always be the starting point of this approach, as we map online communities around this key topic. During the first step, we operationalize the scope of the topic the scope of the topic by generating lists of keywords that are used to search for relevant media content on Google.

Keywords should be combined into Google-compatible queries, looking like this:

x <- data.frame(Thema=c("Level_1","Level_2","Level_1","Level_2"),
           Groep=c("Topic_1","Topic_1","Topic_2","Topic_2"),
           query=c("\"key%20phrase%20one\"%20OR%20\"key%20phrase%20two\"%20OR%20\"etc\"",
                   "\"key%20phrase%20three\"%20OR%20\"key%20phrase%20four\"%20OR%20\"etc\"",
                   "\"key%20phrase%20five\"%20OR%20\"key%20phrase%20six\"%20OR%20\"etc\"",
                   "\"key%20phrase%20seven\"%20OR%20\"key%20phrase%20eight\"%20OR%20\"etc\""))
print(x)
##     Thema   Groep
## 1 Level_1 Topic_1
## 2 Level_2 Topic_1
## 3 Level_1 Topic_2
## 4 Level_2 Topic_2
##                                                               query
## 1     "key%20phrase%20one"%20OR%20"key%20phrase%20two"%20OR%20"etc"
## 2  "key%20phrase%20three"%20OR%20"key%20phrase%20four"%20OR%20"etc"
## 3    "key%20phrase%20five"%20OR%20"key%20phrase%20six"%20OR%20"etc"
## 4 "key%20phrase%20seven"%20OR%20"key%20phrase%20eight"%20OR%20"etc"

Note that the queries are associated with subtopic in the column Groep and specifity levels in the column Thema. This is optional though. If you do not want to distinguish between topics or levels, simply fill these columns with a single value.

1.2 Create shortlist of search queries

The keywords generated in step 1.1 are used to search for content on Google (see 2). Based on the results returned by Google, we typically extract a short list with the most relevant keywords to search on Twitter and YouTube. However, you could choose to use your original set of queries too.

2. Issue networks

Preparatory step: load packages and queries

First, we load a few required packages, set a timestamp variable so we can keep track of things, and load a data frame with Google queries.

# Loading required packages
require(tidyverse)
require(stringr)
require(lubridate)
require(rvest)
require(RCurl)
require(gmailr)
require(filesstrings)
require(tm)
require(tidytext)
require(stopwords)
require(wordcloud2)

## Set timestamp
timestamp <- lubridate::today() %>% strptime("%Y-%m-%d") %>% as.POSIXct(tz="Europe/Amsterdam")

# Start by loading a data frame with Google queries, grouped by level and topic

## Below is an example DF
to_retrieve <- data.frame(Thema=c("Level_1","Level_2","Level_1","Level_2"),
           Groep=c("Topic_1","Topic_1","Topic_2","Topic_2"),
           query=c("\"key%20phrase%20one\"%20OR%20\"key%20phrase%20two\"%20OR%20\"etc\"",
                   "\"key%20phrase%20three\"%20OR%20\"key%20phrase%20four\"%20OR%20\"etc\"",
                   "\"key%20phrase%20five\"%20OR%20\"key%20phrase%20six\"%20OR%20\"etc\"",
                   "\"key%20phrase%20seven\"%20OR%20\"key%20phrase%20eight\"%20OR%20\"etc\""))

## Helper functions
return_clean_url <- function(string) {
  
  x <- NULL
  for(i in 1:length(string)) {
  #print(string[i])
  y <- string[i]
  if(str_detect(y,"^http[s]?://")) {
      y <- y %>%
        str_replace('http[s]?://',"")  
  }
  if(str_detect(y,"/$")) {
    y <- y %>%
      str_replace('/$',"")  
  }
  if(str_detect(y,".facebook\\.com")) {
    y <- y %>%
      str_replace("[a-z-]+.facebook.com","www.facebook.com")
  }
  if(str_detect(y,"www\\.")) {
    y <- y %>%
      str_replace("www\\.","")
  }
  x <- c(x,y)
  }
  
  return(x)
  
} # Removes noise from urls generically
return_clean_url_serp <- function(string) {
  
  x <- NULL
  for(i in 1:length(string)) {
    #print(string[i])
    y <- string[i]
    if(str_detect(y,"^http[s]?://")) {
      y <- y %>%
        str_replace('http[s]?://',"")  
    }
    if(str_detect(y,"/$")) {
      y <- y %>%
        str_replace('/$',"")  
    }
    if(str_detect(y,".facebook\\.com")) {
      y <- y %>%
        str_replace_all("[a-z-]+.facebook.com","www.facebook.com")
    }
    x <- c(x,y)
  }
  
  return(x)
  
} # Removes noise from urls on Google SERP
return_tld <- function(string) {
  x <- NULL
  for(i in 1:length(string)) {
    #print(string[i])
    y <- NULL
    if(str_detect(string[i],"facebook.com/|instagram.com/|twitter.com/|youtube.com/|youtu.be/")) {
      y <- string[i] %>% return_channel
    } else {
    
      if(str_detect(string[i],"^http[s]?://")) {
        y <- string[i] %>%
          str_extract('http[s]?://[a-z0-9]+.[a-z0-9]+.[a-z0-9]+.[a-z0-9]+.[a-z0-9]+/|http[s]?://[a-z0-9]+.[a-z0-9]+.[a-z0-9]+.[a-z0-9]+/|http[s]?://[a-z0-9]+.[a-z0-9]+.[a-z0-9]+/|http[s]?://[a-z0-9]+.[a-z0-9]+/') %>%
          str_extract('http[s]?://[a-z0-9.-]+') %>%
          str_replace('http[s]?://http[s]?://',"")  
      } else {
        y <- string[i] %>%
          str_extract('[a-z0-9]+.[a-z0-9]+.[a-z0-9]+.[a-z0-9]+.[a-z0-9]+/|http[s]?://[a-z0-9]+.[a-z0-9]+.[a-z0-9]+.[a-z0-9]+/|http[s]?://[a-z0-9]+.[a-z0-9]+.[a-z0-9]+/|http[s]?://[a-z0-9]+.[a-z0-9]+/') %>%
          str_extract('[a-z0-9.-]+') %>%
          str_replace('http[s]?://',"")  
      }
    } 
  if(is.na(y)) {y <- string[i]}
  x <- c(x,y)
  }
  return(x)
} # Returns tld from url string. Returns page/account/channel for social media urls.
ahrefs_session <- function(email,password) {
  
  # Create a web session with the desired login address
  pgsession <- html_session("https://ahrefs.com/user/login")
  pgform <- html_form(pgsession)[[1]]  #in this case the submit is the 1st form
  filled_form <- set_values(pgform, email=email, password=password)
  submit_form(pgsession, filled_form, submit=NULL)
  return(pgsession)
  
} # logs into ahrefs.com
calculateColors <- function(values) {
  
  if(length(unique(values))>1){
    ## Use n equally spaced breaks to assign each value to n-1 equal sized bins 
    ii <- cut(values, breaks = seq(min(values), max(values), len = 100), 
              include.lowest = TRUE)
    ## Use bin indices, ii, to select color from vector of n-1 equally spaced colors
    colors <- colorRampPalette(c("lightgrey", "darkblue"))(99)[ii]  
  } else {
    colors <- "#D3D3D3"
  }
  
  return(colors)
  
} # Determines color of words in clouds
plot_cloud <- function(df,by_com) {
  
  df %>%
    filter(Community==by_com) %>%
    arrange(desc(count)) %>%
    .[1:25,] %>%
    filter(!is.na(word)) %>%
    select(Community,word,count,tf_idf) %>%
    mutate(tf_idf=ifelse(is.na(tf_idf),0,tf_idf)) %>%
    mutate(color=calculateColors(tf_idf)) %>% 
    select(word=word,freq=count,color) %>%
    as.data.frame %>%
    wordcloud2(.,fontFamily = 'Submariner', fontWeight = 'Bold', color=.$color)
  
} # Plots clouds
pos_read_html <- purrr::possibly(read_html,NA) ## Same as included function, but returns NA in case of error
pos_tolower <- purrr::possibly(tolower,NA) ## Same as included function, but returns NA in case of error
pos_html_form <- purrr::possibly(html_form,NA) ## Same as included function, but returns NA in case of error
pos_delete_message <- purrr::possibly(delete_message,NA) ## Same as included function, but returns NA in case of error

2.2 Filter for relevance

Next, we can filter our data by manually checking whether the returned SERP’s match the scope of our study - a fast way to reduce the number of irrelevant URLs. At the same time, we manage to identify the most relevant queries - possibly by topic and specifity - that we can use to search for content on Twitter and YouTube.

2.4 Process network data

Next, we process the link reports and create a network file.

The network includes the TLDs as nodes (e.g. the linked websites in the network) and the number of pages found for each TLD as one of the nodes’ properties. For example, if our reports would include newswebsite.com/article-1 and newswebsite.com/article-2, the ultimate network would include the TLD newswebsite.com, with two subpages as its weight property.

The so-called edges (links) between the nodes are computed by counting the number of links between pages of the same TLDs. For example, a link between newswebsite.com/artile-1 to blog.com/photo-1 shows as a line from newswebsite.com to blog.com in our network. The number of links between two TLDs is a property of the link.

Furthermore, we add nodes to isolate Facebook pages, YouTube channels, and Instagram- as well as Twitter accounts, and connect them to their respective platform. For example, social media postings (e.g.: facebook.com/user/media-posting or youtube.com/user/watch?=video) are considered to be pages belonging to a newly created parent node (e.g.: facebook.com/user or youtube.com/user), linking to their corresponding TLD (e.g.: facebook.com and youtube.com).

The ultimate network comprises a nodelist (nodelist_all), including the TLDs and their properties, and an edgelist (edgelist_all), including links between TLDs and their properties. These can be exported to csv-files and can be loaded into Gephi.

# First, navigate to the directory where the ahrefs reports are saved
#setwd("")

## First, we creates an index of the files in the current working directory using url as identifier
files <- data.frame(file=list.files())
requested <- files$file %>%
  str_extract(.,"[0-9][0-9]-[A-Z][a-z][a-z]-2019_[0-9][0-9]-[0-9][0-9]-[0-9][0-9]") %>%
  str_replace("Jan","01") %>% str_replace("Feb","02") %>% str_replace("Mar","03") %>% 
  str_replace("Apr","04") %>% str_replace("May","05") %>% str_replace("Jun","06") %>% 
  str_replace("Jul","07") %>% str_replace("Aug","08") %>% str_replace("Sep","09") %>%
  str_replace("Oct","10") %>% str_replace("Nov","11") %>% str_replace("Dec","12") %>%
  strptime(format="%d-%m-%Y_%H-%M-%S")
url <- files$file %>% 
  str_replace(.,"\\[raw\\]www\\.|\\[raw\\]","") %>%
  str_replace(.,"-linked.*|-backlink.*","")
type <- ifelse(files$file %>% str_detect("-linked-"),"links","backlinks")
index <- cbind(files,url,type,requested)

## Checks the number of rows for each file
index <- index %>%
  group_by(file,url,type,requested) %>%
  do(data.frame(rows=read_csv(.$file %>% as.character) %>% nrow)) %>%
  ungroup

## Next, we connect our results, log en index df's
log_ahrefs_requests <- log
to_retrieve <- result %>% 
  select(url_retrieve=url,tld) %>% 
  distinct %>%
  filter(!str_detect(tld,"facebook.com/pages$"))

retrieved <- to_retrieve %>%
  filter(url_retrieve %in% (filter(log_ahrefs_requests,links>0|backlinks>0) %>% .$url %>% str_replace(.,"www\\.",""))) %>%
  mutate(tld=str_extract(url_retrieve,"^[a-z0-9\\-\\.]+")) %>%
  left_join(.,filter(log_ahrefs_requests,links>0|backlinks>0) %>% mutate(n=1:nrow(filter(log_ahrefs_requests,links>0|backlinks>0)),url=str_replace(url,"www\\.","")),by=c("url_retrieve"="url")) %>%
  arrange(n)

retrieved_links <- retrieved %>% filter(links>0)
retrieved_backlinks <- retrieved %>% filter(backlinks>0)

## Next, we connect the retrieved files with the urls from the results data frame
index_links_strict <- retrieved_links %>%
  filter(tld %in% (filter(index,type=="links") %>% .$url %>% as.character)) %>%
  inner_join(.,filter(index,type=="links"),by=c("tld"="url","links"="rows")) %>%
  group_by(tld) %>%
  mutate(x=dense_rank(url_retrieve),y=dense_rank(file)) %>%
  filter(x==y) %>%
  select(-x,-y) %>%
  ungroup

index_links_approx <- retrieved_links %>%
  filter(!url_retrieve %in% index_links_strict$url_retrieve) %>%
  filter(tld %in% (filter(index,type=="links") %>% .$url %>% as.character)) %>%
  inner_join(.,filter(index,type=="links",!file %in% index_links_strict$file),by=c("tld"="url")) %>%
  mutate(diff=sqrt((links-rows)^2)) %>%
  group_by(file) %>%
  top_n(-1,diff) %>%
  group_by(url_retrieve) %>%
  mutate(x=dense_rank(file)) %>%
  group_by(file) %>%
  mutate(y=dense_rank(n)) %>%
  filter(x==y) %>%
  select(-x,-y,-diff,-rows) %>%
  ungroup

index_links_full <- rbind(index_links_strict,index_links_approx)

index_backlinks_strict <- retrieved_backlinks %>%
  filter(tld %in% (filter(index,type=="backlinks") %>% .$url %>% as.character)) %>%
  inner_join(.,filter(index,type=="backlinks"),by=c("tld"="url","backlinks"="rows")) %>%
  group_by(tld) %>%
  mutate(x=dense_rank(url_retrieve),y=dense_rank(file)) %>%
  filter(x==y) %>%
  select(-x,-y) %>%
  ungroup

index_backlinks_approx <- retrieved_backlinks %>%
  filter(!url_retrieve %in% index_backlinks_strict$url_retrieve) %>%
  filter(tld %in% (filter(index,type=="backlinks") %>% .$url %>% as.character)) %>%
  inner_join(.,filter(index,type=="backlinks",!file %in% index_backlinks_strict$file),by=c("tld"="url")) %>%
  mutate(diff=sqrt((backlinks-rows)^2)) %>%
  group_by(file) %>%
  top_n(-1,diff) %>%
  group_by(url_retrieve) %>%
  mutate(x=dense_rank(file)) %>%
  group_by(file) %>%
  mutate(y=dense_rank(n)) %>%
  filter(x==y) %>%
  select(-x,-y,-diff,-rows) %>%
  ungroup

index_backlinks_full <- rbind(index_backlinks_strict,index_backlinks_approx)
index_full <- rbind(index_backlinks_full,index_links_full)

## Loops through the index, adding data to a backlinks and links df
backlinks <- NULL
links <- NULL
for(i in (1:nrow(index_full))) {
  
  print(paste0("now doing ",index_full$file[i],sep=""))
  print(paste0("Round ",i,sep=""))
  
  if(index_full$file[i] %>% as.character %>% str_detect("backlink")) {
    
    bl <- read_csv(index_full$file[i] %>% as.character) %>%
      group_by(`Referring Page URL`,`First Seen`) %>%
      do(tld=(.$`Referring Page URL` %>% return_tld %>% return_clean_url)) %>%
      unnest
    
    link_to_tld <- index_full$url_retrieve[i] %>% return_tld
    
    backlinks <- rbind(backlinks,tibble::tibble(
      link_to_tld=link_to_tld, 
      link_from_tld=bl$tld, 
      link_from=bl$`Referring Page URL` %>% return_clean_url,
      link_to=index_full$url_retrieve[i]  %>% return_clean_url,
      first_seen=bl$`First Seen`
    ))
    
  } ## reads backlinks
  
  if(index_full$file[i] %>% as.character %>% str_detect("linked-domains")) {
    
    if(file.info(index_full$file[i] %>% as.character)$size >= 75) { ## check size for crashes due to empty ones
      
      l <- read_csv(index_full$file[i] %>% as.character)
      
      link_from_tld <- index_full$url_retrieve[i] %>% return_clean_url %>%  return_tld
      link_to <- l$`Linked Domains` %>% return_clean_url
      counts <- l$`Total Links Count`
      
      links <- rbind(links,tibble::tibble(
        link_from_tld=link_from_tld, ## return tld
        link_from=index_full$url_retrieve[i] %>% return_clean_url, ## retrieve url from file name
        link_to=link_to,
        count=counts)) 
    }
    
  } ## reads links
  
}

## Next, we create the network files

## First, we start with a clean, untagged list of our results ( discriminating between topics and levels is a feature that can be built in later)
result_flt <- result %>% 
  select(-theme,-marker) %>%
  group_by(url,tld,timestamp) %>%
  summarize(position=mean(position)%>%ceiling) %>%
  ungroup

## We also apply this to our other data frames
links_flt <- links %>%
  filter(link_from %in% result_flt$url)
backlinks_flt <- backlinks %>%
  filter(link_to %in% result_flt$url)
log_ahrefs_requests_flt <- log_ahrefs_requests %>%
  filter(url %in% result_flt$url)
index_full_flt <- index_full %>%
  filter(url_retrieve %in% result_flt$url)

# Creating nodelist, calculating:
## onSERP: if page is on SERP or not
## count_weight: weight (0-1) reflecting proportion of found pages within group (results vs links vs backlinks) by timestamp
#### Unique pages from tld found in network, regardless wether result or not
## pos_weight: weight (0-1) reflecting average SERP position (larger = better) by timestamp
#### Weighted SERP position. I.e. 4 on scale 1-4 is same weighted score as 20 on scale 1-20

## This vector serves as an index file
nodes <- tibble(Id=c(backlinks_flt$link_to_tld %>% as.character,
           backlinks_flt$link_from_tld %>% as.character,
           links_flt$link_to %>% as.character,
           links_flt$link_from_tld %>% as.character,
           result_flt$tld %>% as.character) %>% unique %>%
  .[!is.na(.)])


## Initiates the nodelist from results, calculating...
nodelist_result <- result_flt %>%
  group_by(Id=tld) %>%
  summarize(count=n(),Position=min(position)) %>%
  group_by(Id) %>%
  mutate(Weight=cumsum(count),
         Pos_cum=cumsum(Position),
         Position=(Pos_cum/Weight)%>%ceiling,
         row=1:n()) %>%
  ungroup %>%
  select(Id,Position,Weight,row) %>%
  mutate(onSERP=TRUE)

nodes <- nodes[!nodes$Id %in% nodelist_result$Id,]

## Creating nodelist from backlinks
nodelist_backlinks <- backlinks_flt %>%
  filter(link_from_tld %in% nodes$Id) %>%
  group_by(Id=link_from_tld) %>%
  summarize(count=n(),Position=NA) %>%
  group_by(Id) %>%
  mutate(Weight=cumsum(count),
         row=1:n()) %>%
  ungroup %>%
  select(Id,Position,Weight,row) %>%
  mutate(onSERP=FALSE)

nodes <- nodes[!nodes$Id %in% c(nodelist_result$Id,nodelist_backlinks$Id),]

nodelist_links <- links_flt %>%
  filter(link_to %in% nodes$Id) %>%
  group_by(Id=link_to) %>%
  summarize(count=n(),Position=NA) %>%
  group_by(Id) %>%
  mutate(Weight=cumsum(count),
         row=1:n()) %>%
  ungroup %>%
  select(Id,Position,Weight,row) %>%
  mutate(onSERP=FALSE)

nodes <- nodes[!nodes$Id %in% c(nodelist_result$Id,nodelist_backlinks$Id,nodelist_links$Id),]
if(nrow(nodes)==0) {print("All ids succesfully processed")}

nodelist_all <- rbind(nodelist_result,nodelist_backlinks,nodelist_links) %>%
  mutate(count_weight=Weight/max(Weight),
         pos_weight=ifelse(onSERP==TRUE,1-(Position/max(Position,na.rm=T)),0)) %>%
  arrange(Id)

## Build Edgelist
edges_backlinks <- backlinks_flt %>%
  group_by(Source=link_from_tld,
           Target=link_to_tld) %>%
  summarize(count=n()) %>%
  group_by(Source,Target) %>%
  mutate(Weight=cumsum(count),
         row=1:n()) %>%
  ungroup %>%
  select(-count)

edges_links <- links_flt %>%
  group_by(Source=link_from_tld,
           Target=link_to) %>%
  summarize(count=n()) %>%
  group_by(Source,Target) %>%
  mutate(Weight=cumsum(count),
         row=1:n()) %>%
  ungroup %>%
  select(-count)

## Adding edges for social media pages
missing_edges_facebook <- filter(nodelist_all,str_detect(Id,"facebook.com/[aA-zZ0-9]+"))$Id %>% unique
missing_edges_youtube <- filter(nodelist_all,str_detect(Id,"youtube.com/user|youtube.com/watch|youtube.com/channel|youtu.be/user|youtu.be/watch|youtu.be/channel"))$Id %>% unique
missing_edges_instagram <- filter(nodelist_all,str_detect(Id,"instagram.com/p/|instagram.com/stories/|instagram.com/[aA-zZ0-9\\.]+/"))$Id %>% unique
missing_edges_twitter <- filter(nodelist_all,str_detect(Id,"twitter.com/[aA-zZ0-9\\.]+"))$Id %>% unique
missing_edges_linkedin <- filter(nodelist_all,str_detect(Id,"linkedin.com/in/[aA-zZ0-9\\.]+|linkedin.com/post"))$Id %>% unique

if(length(missing_edges_facebook)>0){
  missing_edges_facebook <- rbind(data.frame(Source=missing_edges_facebook,Target="facebook.com"),
                                  data.frame(Source="facebook.com",Target=missing_edges_facebook))
}else{missing_edges_facebook<-NULL}
if(length(missing_edges_youtube)>0){
  missing_edges_youtube <- rbind(data.frame(Source=missing_edges_youtube,Target="youtube.com"),
                                  data.frame(Source="youtube.com",Target=missing_edges_facebook))
}else{missing_edges_youtube<-NULL}
if(length(missing_edges_instagram)>0){
  missing_edges_instagram <- rbind(data.frame(Source=missing_edges_instagram,Target="instagram.com"),
                                 data.frame(Source="instagram.com",Target=missing_edges_instagram))
}else{missing_edges_instagram<-NULL}
if(length(missing_edges_twitter)>0){
  missing_edges_twitter <- rbind(data.frame(Source=missing_edges_twitter,Target="twitter.com"),
                                   data.frame(Source="twitter.com",Target=missing_edges_twitter))
}else{missing_edges_twitter<-NULL}
if(length(missing_edges_linkedin)>0){
  missing_edges_linkedin <- rbind(data.frame(Source=missing_edges_linkedin,Target="linkedin.com"),
                                 data.frame(Source="linkedin.com",Target=missing_edges_linkedin))
}else{missing_edges_linkedin<-NULL}

missing_edges <- rbind(missing_edges_facebook,missing_edges_youtube,missing_edges_instagram,missing_edges_twitter,missing_edges_linkedin)

if(!is.null(missing_edges)){
  missing_edges <- missing_edges %>% 
  group_by(Source,Target) %>% 
  mutate(Weight=1) %>%
  select(Source,Target,Weight)
}

edgelist_all <- rbind(edges_backlinks,edges_links,missing_edges) %>% distinct

2.5 Network analysis

Network analysis is done in Gephi. An export of the nodelist made after detecting communities is used in the next step.

2.6 Compute wordclouds

We import the nodelist exported from Gephi after performing community analysis (add the path to yours below). To understand the nature of the different communities, we analyze their corresponding URLs. For each URL, we replace dashes, slashes and other non-letter characters with spaces, remove Dutch and English stop words, and compute word occurrence and TF-IDF. Next, we compute word clouds, including the 25 most common words, where word size expresses occurrence and the shade of blue expresses the uniqueness of this word compared to other communities.

Using the plot_cloud function, providing it with the worclouds_for_each_com data frame and the community’s id, wordclouds can be plotted.

## nodelist_gephi <- read_csv("")

result_tm <- result %>%
  mutate(text=str_replace(url,tld,"")) %>%
  mutate(text=str_replace_all(text,"/|\\.nl|\\.com|\\.org|\\.info|\\.|-|_|aspx|wordpress|html|htm|pdf|images|author|\\?|\\=|title|id=|home|default|[0-9]+|upload|content|archive|archief"," ") %>% str_trim) %>%
  mutate(text=str_replace_all(text," [aA-zZ][ |^]| nl[ |^]| com[ |^]| info[ |^]| php[ |^]| asp[ |^]| tag[ |^]"," "))

result_tm_coms <- result_tm %>% 
  left_join(.,nodelist_gephi %>% select(Id,Community=modularity_class),by=c("tld"="Id")) %>%
  select(url,tld,text,Community) %>%
  distinct

keepers <- nodelist_gephi %>% 
  group_by(modularity_class) %>% 
  summarize(count=n()) %>% 
  mutate(perc=count/sum(count)*100) %>% 
  arrange(desc(perc)) %>%
  filter(perc>=1) %>%
  .$modularity_class

corpus <- result_tm_coms %>%
  filter(Community %in% keepers) %>%
  unnest_tokens(.,word,text,token="words") %>%
  anti_join(data.frame(word=stopwords::stopwords(language="nl",source="stopwords-iso")), by = c("word" = "word")) %>% ## remove stop words
  anti_join(data.frame(word=stopwords::stopwords(language="en",source="stopwords-iso")),by=c("word"="word")) %>%
  filter(!str_detect(word,"[0-9]")) %>% ## remove numbers
  filter(!str_detect(word,"http[s]?|www|.in|.com|t.co|co.in")) %>% ## remove url noise
  filter(!str_detect(word,"^[A-z]$|^[A-z][A-z]$")) ## remove one and two character words
  
tf_idf_coms <- corpus %>% 
  group_by(Community,word) %>% 
  summarize(count=n()) %>% 
  bind_tf_idf(.,word,Community,count) %>% 
  arrange(Community,desc(tf_idf))

wordclouds_for_each_com <- NULL
for(i in 1:length(keepers)) {
  
  ## Create wordcloud data by week
  x <- tibble(Community=keepers[i]) %>%
    full_join(.,tf_idf_coms,by=c("Community"="Community"))
  
  wordclouds_for_each_com <- rbind(wordclouds_for_each_com,x) %>% distinct
  
}

wordclouds_for_each_com <- wordclouds_for_each_com %>% filter(!is.na(word))

# plot_cloud(wordclouds_for_each_com,39)

3. Ego networks

We use Jefferson Henrique’s GetOldTweets to retrieve tweets for a fixed time period. Later (step 3.1), we will use the id’s of these tweets to re-retrieve them using rtweet, and snowball from there to retrieve the tweets’ conversational context (replies, retweets, quotes, etc.) as well as the ego networks of the associated users.

Perparatory steps: loadig packages and helper functions

Next, we load a few required packages, a few helper functions, and set a timestamp variable so we can keep track of things.

# Loading required packages
require(rtweet)
require(httpuv)
require(longurl)
require(igraph)
require(SnowballC)
require(webshot)
require(htmltools)
require(htmlwidgets)
require(imager)
require(kableExtra)
require(rmdformats)
require(googleLanguageR)

## Set timestamp
timestamp <- lubridate::today() %>% strptime("%Y-%m-%d") %>% as.POSIXct(tz="Europe/Amsterdam")

## Load token
twitter_token <- read_rds("") ## load your twitter token here; for more info, consult the rtweet documentation

## Helper functions
requests_left <- function(call="followers/ids") {
  rate_limit(twitter_token) %>% filter(query==call) %>% .$remaining
} # Returns requests left for certain api call (standard: retrieve friends)
time_left <- function(call="followers/ids") {
  rate_limit(twitter_token) %>% filter(query==call) %>% .$reset
} ## Returns time left until refresh of rate limit
limit_check <- function(requests,expected=FALSE,call="follower/ids") {
  
  ## Limit check
  calls_to_go <- requests_left(call)
  calls_needed <- requests
  time_to_go <- time_left(call) %>% as.numeric
  
  while(calls_to_go<calls_needed) {
    if(expected==FALSE) { 
      print("## Rate limit exceeded unexpectedly!")
    }
    print(paste("#### We need",calls_needed,"calls, but have",calls_to_go ,"left..."))
    print(paste("#### The current time is",Sys.time(),"Please wait."))
    print(paste("#### Rate limit resetting in",time_to_go,"minutes."))
    if(calls_to_go<calls_needed) {
      Sys.sleep(60*(time_to_go+1))  
    }
    calls_to_go <- requests_left(call)
    time_to_go <- time_left(call) %>% as.integer
    if(calls_to_go>=calls_needed) {
      print("Rate Limit refreshed. Continuing...")
    }
  }
  
} ## Checks limit before retrieving data
### retrieves number of statuses by given user_id
retrieve_con <- function(status_id_input) {
  tweets_complete %>%
    filter(status_id==status_id_input) %>%
    .$conversation
}

### retrieves number of statuses by given user_id
get_statuses <- function(input_user_id) {
  tweets_complete[tweets_complete$user_id==input_user_id,] %>% nrow
}

### retrieves whether input user_id is associated with an account that actively tweeted
get_original <- function(input_user_id) {
  ifelse((tweets_complete[tweets_complete$user_id==input_user_id & tweets_complete$original==TRUE,] %>% nrow)>0,TRUE,FALSE) 
}

retrieve_user_id <- function(df,input) {
  df %>% 
    filter(status_id %in% input) %>%
    .$user_id %>%
    unique %>%
    as.character
}
pos_gl_translate_detect <- purrr::possibly(gl_translate_detect,tibble(confidence=NA,isReliable=FALSE,language=NA,text="error")) ## Returning NA instead of error
process_tweets <- function(df) {
  
  ## Tags tweets that were not retrieved as a result of the initial query as non-original
  if(!"original" %in% colnames(df)) {
    df <- df %>%
      mutate(original=FALSE)
  }
  
  print("Processing tweets and mentions...")
  
  ## Processes raw tweets
  tweets <- df %>%
    mutate(hasLink=str_detect(text,"http"), ## whether a tweet contains links
           isReply=ifelse(!is.na(.$reply_to_status_id),TRUE,FALSE)) %>% ## wether the tweet is a reply
    group_by(status_id=(status_id %>% as.character)) %>%
    do(
      data.frame(
        original=.$original, ## wether tweet is original
        screen_name=(.$screen_name %>% as.character), ## user
        user_id=(.$user_id %>% as.character),
        text_original=(.$text %>% as.character),
        text=(.$text %>% 
                str_replace(.,"^RT ","") %>% 
                str_replace(.,"http.* ","") %>% 
                str_replace(.,"^@.*@[a-zA-Z0-9_.-]+ ","") %>%
                str_replace(.," @.*@[a-zA-Z0-9_.-]+","") %>%
                str_replace(.,"^@[a-zA-Z0-9_.-]+: ","") %>%
                str_replace(.,"^@[a-zA-Z0-9_.-]+ ","") %>%
                str_replace(.," @[a-zA-Z0-9_.-]+","") %>%
                str_replace(.,"<.*>","")),
        created=(.$created_at %>% 
                   strptime(format="%Y-%m-%d %H:%M:%S", tz=Sys.timezone(location=TRUE))),
        is_retweet=.$is_retweet, ## whether the tweet is a retweet
        retweet_status_id=(.$retweet_status_id %>% as.character),
        is_quote=.$is_quote,
        quote_status_id=(.$quoted_status_id %>% as.character),
        is_reply=.$isReply,
        reply_status_id=(.$reply_to_status_id %>% as.character),
        reply_user_id=(.$reply_to_user_id %>% as.character),
        retweet_count=(.$retweet_count %>% as.integer),
        favorite_count=(.$favorite_count %>% as.integer),
        platform=(.$source %>% as.character),
        has_link=(ifelse(!is.na(.$urls_expanded_url),T,F)),
        has_media=ifelse(!is.na(.$media_url),T,F),
        has_symbols=(ifelse(!is.na(.$symbols),T,F)),
        has_hashtags=(ifelse(!is.na(.$hashtags),T,F)),
        country=(.$country_code %>% as.character)
      )) %>% 
    as.data.frame() %>% 
    unique
  
  ## Processes user mentions (later used for edge weight)
  mentioned_users <- df %>%
    group_by(status_id,user_id) %>% 
    do(mentions_user_id=(.$mentions_user_id %>% unlist %>% as.character)) %>%
    unnest %>%
    filter(!is.na(mentions_user_id))  
  
  
  ## Combining results in a list
  exp <- list(tweets,mentioned_users)
  names(exp) <- c("tweets","mentioned_users")
  return(exp)
  
} ## Process raw tweets
retrieve_rts_quotes <- function(df) {
  
  # Retweets
  
  ## Check if there are retweets to retrieve
  rts <- df %>%
    filter(!is.na(retweet_status_id)) 
  
  ## Initiating results
  retweets_user_ids <- NULL
  
  ## Retrieve retweets, if any.
  if(nrow(rts)>0) {
    
    print("Retrieving originals for retweets...")
    
    ## Retrieve all retweets
    rts <- rts %>%
      .$retweet_status_id %>%
      unique %>%
      lookup_statuses %>% 
      process_tweets  ## alle retweets ophalen
    
    ## Filter observations already in source file
    rts[[1]] <- rts[[1]] %>% 
      dplyr::filter(!status_id %in% df$status_id)
    rts[[2]] <- rts[[2]] %>% 
      filter(status_id %in% rts[[1]]$status_id)
    
    print(paste0(nrow(rts[[1]])," retweeted originals retrieved"))
    
    ## Retrieves user_id's for rts in order to retrieve the original tweets
    retweets_user_ids <- df %>% 
      filter(is_retweet==TRUE) %>% 
      group_by(retweet_status_id) %>% 
      do(data.frame(retweet_user_id=retrieve_user_id(rts[[1]],.$retweet_status_id)))
  }
  
  ## Return NA when there are no retweeted originals to retrieve
  if(is.null(retweets_user_ids)) {
    
    print("No originals for retweets to retrieve")
    
    tweets <- tweets %>% mutate(retweet_user_id=NA)
  }
  
  # Quotes
  
  ## Check if there are quotes to retrieve
  quotes <- df %>% 
    filter(!is.na(quote_status_id))
  
  ## Initiating results
  quotes_user_ids <- NULL
  
  ## Retrieve quotes, if any
  if(nrow(quotes)>0) {
    
    print("Retrieving originals for quotes...")
    
    quotes <- quotes %>%
      .$quote_status_id %>%
      unique %>%
      lookup_statuses %>% 
      process_tweets ## retrieve all quotes  
    
    ## Filter observations already in source file
    quotes[[1]] <- quotes[[1]] %>% 
      dplyr::filter(!status_id %in% df$status_id)
    quotes[[2]] <- quotes[[2]] %>% filter(status_id %in% quotes[[1]]$status_id)
    
    print(paste0(nrow(quotes[[1]])," quoted originals retrieved"))
    
    ## Check if there are user_ids associated with quotes to retrieve
    quotes_user_ids <- df %>% 
      filter(is_quote==TRUE) 
    
    ## Retrieve user ids for quotes
    if(nrow(quotes_user_ids)>0) {
      
      print("Retrieving user_ids of quoted original tweets...")
      
      quotes_user_ids <- quotes_user_ids %>% 
        group_by(quote_status_id) %>% 
        do(data.frame(quote_user_id=retrieve_user_id(quotes[[1]],.$quote_status_id))) %>%
        select(quote_status_id,quote_user_id)
    }
  }
  
  ## Return NA when there are no quoted originals to retrieve
  if(is.null(quotes_user_ids)) {
    
    print("No originals for quotes to retrieve")
    
    tweets <- tweets %>% mutate(quote_user_id=NA)
    
  }
  
  # Combining data 
  
  ## Rbind the retrieved tweets
  tweets <- rbind(rts[[1]],quotes[[1]]) %>%
    distinct %>%
    mutate(conversation=NA)
  
  ## Add user_ids for retweeted and quoted originals
  if(!is.null(retweets_user_ids)){
    tweets <- tweets %>% 
      left_join(.,retweets_user_ids,by=c("retweet_status_id"="retweet_status_id"))
  }
  if(!is.null(quotes_user_ids)){
    tweets <- tweets %>% 
      left_join(.,quotes_user_ids,by=c("quote_status_id"="quote_status_id"))
  }
  
  tweets <- tweets %>% unique 
  mentioned_users <- rbind(rts[[2]],quotes[[2]])
  
  exp <- list(tweets,mentioned_users)
  names(exp) <- c("tweets","mentioned_users")
  return(exp)
  
} ## Retrieve retweeted and quoted originals
retrieve_conversations <- function(df) {
  
  ## Creates index df in which replies are numbered
  tweets <- df %>%
    unique %>%
    mutate(conversation=dense_rank(.$reply_status_id) %>% as.factor) %>%
    filter(!is.na(conversation))
  
  if(nrow(tweets)>=1) {
    
    ## How many tweets to retrieve
    ## Is recalculated after each retrieval loop
    to_retrieve <- tweets %>% 
      .$reply_status_id %>% 
      unique %>% 
      length
    
    ## Initiating index and results dfs
    tweet_ids <- NULL
    mentioned_users_ <- NULL
    
    i <- 1
    ## Keeps retrieving until no more left
    while(to_retrieve>0) {
      
      ## Determine which tweets to lookup
      lookup <- tweets %>% 
        filter(!reply_status_id %in% tweet_ids) %>% ## excludes previously retrieved tweets
        filter(!is.na(reply_status_id)) %>%
        .$reply_status_id %>%
        unique %>%
        as.character
      
      print("Retrieving tweets that were replied to...")
      
      ## Looks up tweets and checks if tweets were returned
      replies_raw <- lookup_statuses(lookup)
      
      check <- replies_raw %>% nrow %>% data.frame(x=.) %>% mutate(y=ifelse(x==0,FALSE,TRUE))
      
      ## Numbering results
      if(check$y==TRUE) {
        
        ## Processes the raw results
        replies <- process_tweets(replies_raw)
        
        ## Numbers the tweets that were just retrieved according to the associated tweet-and-reply-chain
        replies[[1]] <- left_join(replies[[1]],(tweets %>% select(reply_status_id,conversation)),
                                  by=c("status_id"="reply_status_id"))
      }
      if(check$y==FALSE) {
        
        ## Returns NA in case of error (when tweets is deleted)
        replies <- list(NA,NA,NA,NA,NA,NA,NA)
      }
      
      ## Add the new tweets to the index
      tweet_ids <- c(tweet_ids,lookup) %>% unique
      
      ## Adding results to df
      if(check$y==TRUE) {
        ## Combines existing data with retrieved data
        tweets <- rbind(tweets,replies[[1]]) %>%
          unique
        mentioned_users_ <- rbind(mentioned_users_,replies[[2]])
      }
      
      ## Determining which tweets to retrieve in next round
      to_retrieve <- tweets %>% 
        filter(!reply_status_id %in% tweet_ids) %>%
        filter(!reply_status_id %in% lookup) %>%
        filter(!is.na(reply_status_id)) %>%
        .$reply_status_id %>% 
        unique %>% 
        length
      
      i <- i+1
      
      print(paste0(to_retrieve," tweets to retrieve in next round"))
      
    }
    
    print(paste0(nrow(tweets)," tweets retrieved associated with ",max(tweets$conversation %>% as.numeric)," conversations"))
    
    ## Adding extra columns (otherwise rbind error)
    tweets <- tweets %>%
      mutate(retweet_user_id=NA,
             quote_user_id=NA) %>%
      unique
    
    mentioned_users_ <- mentioned_users_ %>% filter(!status_id %in% df$status_id) %>% unique
    
    exp <- list(tweets,mentioned_users_)
    names(exp) <- c("tweets","mentioned_users")
    return(exp)
    
    
  }
  
  
  
} ## Unfold and retrieve reply-chains and mark conversation numbers
retrieve_user_stats <- function(user_list) {
  
  print("Starting function retrieve_user_stats()")
  
  ## Creates a unique user list, counts them, and calculates required batches
  user_list <- as.character(user_list) %>% unique
  rows <- length(user_list)
  batches <- (rows/90000)%>%ceiling
  
  print(paste("## Retrieving a list of ",rows," users in",batches," batches"))
  
  ## Initiates data frame to write data to
  wip_user_stats <- NULL
  
  ## Retrieves all the users in one batch in case of less than 18k users
  if(rows<90000) {
    limit_check(rows/90000,expected=TRUE,call="users/lookup")
    
    print("### Starting first and last batch")
    wip_user_stats <- lookup_users(user_list)
    print("### Success")
  }
  
  ## In case of more than 18k users, retrieves the users in batches of 18k
  if(rows>90000) {
    
    limit_check(900,expected=TRUE,call="users/lookup")
    
    print("### Starting first of miltiple batches")
    wip_user_stats <- rbind(wip_user_stats,
                            lookup_users(user_list[1:90000]))  
    print("### Success")
    
    for(i in 1:(batches-1)) {
      
      limit_check(900,expected=TRUE,call="users/lookup")
      
      print(paste("### Starting batch",i+1))
      batch <- user_list[(1+(i*90000)):(90000+(i*90000))]
      batch <- batch[!is.na(batch)]
      print(paste("#### Retrieving",length(batch),"users"))
      
      tmp <- lookup_users(batch)
      print("### Succes")
      
      wip_user_stats <- rbind(wip_user_stats,tmp)
      
    }
    
  }
  
  print("Retrieval succesfull, proceeding with processing")
  
  ## Processes the data
  user_stats_t <- wip_user_stats %>%
    group_by(user_id=(user_id %>% as.character)) %>%
    do(
      data.frame(screen_name=(.$screen_name %>% as.character),
                 real_name=(.$name %>% as.character),
                 description=(.$description %>% as.character),
                 profile_image=(.$profile_image_url %>% as.character),
                 background_image=(.$profile_background_url %>% as.character),
                 banner_image=(.$profile_banner_url %>% as.character),
                 location=(.$location %>% as.character),
                 language=(.$account_lang %>% as.character),
                 statuses=(.$statuses_count %>% as.integer),
                 followers=(.$followers_count %>% as.integer),
                 following=(.$friends_count %>% as.integer),
                 favorites=(.$favourites_count %>% as.integer),
                 protected=(.$protected %>% as.logical),
                 verified=(.$verified %>% as.logical),
                 age_weeks=(((Sys.time() - .$account_created_at) %>% 
                               as.integer / 7) %>% round(0) %>% as.integer)
      )
    ) %>% unique
  
  return(user_stats_t)
  
} ## Function to retrieve user_stats


#### Returns ids of followers one by one. Note users in input df may not have more tha 75k followers
get_followers_in_batches <- function(user_id,requests) {
  
  limit_check(requests,FALSE,"followers/ids")
  
  followers <- get_followers(user_id %>% as.character,n=(requests*5000))$user_id %>% 
    as.data.frame %>% 
    ifelse(nrow(.)>0,.,NA) %>% 
    unlist %>% 
    as.character
  
  return(followers)
  
}
## Alternative function that returns NA in case of error
pos_get_followers <- possibly(get_followers_in_batches,NA)
#### Returns ids of followers for user accounts with more than 75k followers
get_followers_for_large_account <- function(user_id,calls_needed) {
  
  requests <- calls_needed
  windows <- (requests/15) %>% ceiling
  print(paste("######## Retrieving followers for user in",windows,"windows."))
  print(paste("######## Estimated time left:",(windows*15)-15,"minutes"))
  print("######## Starting window 1")
  
  limit_check(15,TRUE,"followers/ids")
  tmp <- get_followers(as.character(user_id %>% as.character),n=(calls_needed*5000))
  page <- next_cursor(tmp)
  
  followers <- tmp$user_id %>% 
    as.data.frame %>% 
    ifelse(nrow(.)>0,.,NA) %>% 
    unlist %>% 
    as.character
  
  for(i in 1:(windows-1)) {
    
    limit_check(15,TRUE,"followers/ids")
    
    print(paste("######## Starting window",i+1))
    
    tmp <- get_followers(as.character(user_id),n=(calls_needed*5000),page=page)
    page <- next_cursor(tmp)
    
    tmp <- tmp$user_id %>% 
      as.data.frame %>% 
      ifelse(nrow(.)>0,.,NA) %>% 
      unlist %>% 
      as.character
    
    followers <- c(followers,tmp)
    
  }
  
  return(followers)
  
}
## Alternative function that returns NA in case of error
pos_get_followers_for_large_account <- possibly(get_followers_for_large_account,NA)
#### Returns ids of following one by one. Note users in input df may not have more tha 75k followers
get_following_in_batches <- function(user_id,requests) {
  
  limit_check(requests,FALSE,"friends/ids")
  
  following <- get_friends(user_id %>% as.character,n=(requests*5000))$user_id %>% 
    as.data.frame %>% 
    ifelse(nrow(.)>0,.,NA) %>% 
    unlist %>% 
    as.character
  
  return(following)
  
}
## Alternative function that returns NA in case of error
pos_get_following <- possibly(get_following_in_batches,NA)
#### Returns ids of following for user accounts with more than 75k following
get_following_for_large_account <- function(user_id,calls_needed) {
  
  windows <- (calls_needed/15) %>% ceiling
  print(paste("######## Retrieving followers for user in",windows,"windows."))
  print(paste("######## Estimated time left:",(windows*15)-15,"minutes"))
  print(paste("######## Starting window 1 of ",windows))
  
  limit_check(15,TRUE,"friends/ids")
  tmp <- get_friends((user_id %>% as.character),n=(calls_needed*5000))
  page <- next_cursor(tmp)
  
  following <- tmp$user_id %>% 
    as.data.frame %>% 
    ifelse(nrow(.)>0,.,NA) %>% 
    unlist %>% 
    as.character
  
  ## hier calls needed gedeeld door 15. Moet werken nu.
  for(i in 1:(windows-1)) {
    
    limit_check(15,TRUE,"friends/ids")
    
    print(paste("######## Starting windows",i+1))
    
    tmp <- get_friends((user_id %>% as.character),n=(calls_needed**5000),page=page)
    page <- next_cursor(tmp)
    
    tmp <- tmp$user_id %>% 
      as.data.frame %>% 
      ifelse(nrow(.)>0,.,NA) %>% 
      unlist %>% 
      as.character
    
    following <- c(following,tmp)
    
  }
  
  return(following)
  
}
## Alternative function that returns NA in case of error
pos_get_following_for_large_account <- possibly(get_following_for_large_account,NA)

We navigate to the folder where the output of the GetOldTweets step are stored, and load a data frame with Tweets. Note that - similar to how we handle the results of the Google step - queries and results can be distinguished by associated topic (type) and level (marker) that the script below derives from the file names. Follow this format in naming your CSV’s (level_topic.csv), or simply overwrite these columns manually with a single value.

## Loading tweets
### The code commented out below loads all output files of GetOldTweets in a given folder

#index <- tibble(file=list.files())
#tweets <- NULL
#for(i in 1:nrow(index)) {
#  x <- read_delim(index$file[i],
#                  delim="\t",
#                  quote = "",
#                  col_types = cols(
#                    username = col_character(),
#                    date = col_datetime(format = ""),
#                    replies = col_integer(),
#                    retweets = col_integer(),
#                    favorites = col_integer(),
#                    text = col_character(),
#                    geo = col_character(),
#                    mentions = col_character(),
#                    hashtags = col_character(),
#                    id = col_character(),
#                    permalink = col_character()
#                  ))
#  marker <- index$file[i] %>% substr(1,str_locate(.,"_") %>% .[1]-1)
# type <- index$file[i] %>% str_extract("_[aA-zZ]+") %>% str_replace_all("_","")
#  tweets <- rbind(tweets, x %>% mutate(marker=marker,type=type,isOriginal=TRUE))
#}

3.1 (Re-)retrieve tweets

We (re-)retrieve all the tweets using rtweet, offering us a little more information on each tweet. As many of the core set of tweets resulted from (multiple) interactions between users, we also retrieve these tweets’ conversational context spanning the tweets, retweets, quotes, and replies the tweets were responding to.

## Split data up into batches to retrieve tweets for 
## This is to distinguish tweets belonging to different groups (markers)
tweet_index <- tweets %>% select(marker,type,id) %>% arrange(marker,type,id) %>% 
  mutate(batchprep=paste(marker,type,sep="-"),
         batch=dense_rank(batchprep)) %>%
  select(-batchprep)

## Retrieve all queries
tweets_raw <- NULL
for(i in 1:length(tweet_index$batch %>% unique)) {
  
  to_retrieve <- tweet_index %>%
    filter(batch==tweet_index$batch %>% unique %>% .[i]) %>%
    mutate(id=str_replace_all(id,"\"",""))
  
  tmp <- lookup_statuses(to_retrieve$id %>% unique,
                        parse=TRUE) %>%
    mutate(original=TRUE,
           Type=to_retrieve$type[1],
           Marker=to_retrieve$marker[1])
  
  tweets_raw <- rbind(tweets_raw,tmp)
}

## Create Index
tweets_by_marker <- tweets_raw %>% select(user_id,status_id,Type,Marker) %>%
  mutate(timestamp=timestamp)

### Process the data in the desired format
tweets_processed <- process_tweets(tweets_raw)

### Retrieve the retweeted and quoted originals before retrieving all the user data
rts_quotes_processed <- NULL
if((tweets_processed[[1]] %>% filter(is_quote==TRUE | is_retweet==TRUE) %>% nrow)>=1) {
  rts_quotes_processed <- retrieve_rts_quotes(tweets_processed[[1]])
  rts_quotes_processed[[1]] <- rts_quotes_processed[[1]] %>%
    mutate(original=ifelse(status_id %in% tweets_processed[[1]]$status_id,TRUE,FALSE)) %>%
    unique
}

### Retrieve the statuses that replies are replying to, and continue to follow these chains until first tweet is reached
conversations_processed <- retrieve_conversations(tweets_processed[[1]])
if(!is.null(nrow(conversations_processed[[1]]))) {
  conversations_processed[[1]] <- conversations_processed[[1]] %>%
    mutate(original=ifelse(status_id %in% tweets_processed[[1]]$status_id,TRUE,FALSE)) %>%
    unique
} else {
  conversations_processed <- NULL
}

## Retrieving retweet and quote originals as well as conversations results in doubles. The code below addresses this.
### Mark rts or quotes that were part of a conversation as such
if(!is.null(nrow(conversations_processed[[1]])) && !is.null(nrow(rts_quotes_processed))) {
  rts_quotes_processed[[1]] <- rts_quotes_processed[[1]] %>% 
    select(-conversation) %>%
    left_join(.,conversations_processed[[1]] %>% 
                select(status_id,conversation),by=c("status_id"="status_id"))
  
  ### Remove rts and quotes that were part of a conversation from the conversation set
  conversations_processed[[1]] <- conversations_processed[[1]] %>%
    filter(!status_id %in% (rts_quotes_processed[[1]] %>% 
                              select(status_id,conversation) %>% 
                              filter(!is.na(conversation)) %>% 
                              .$status_id))
}

### Removing tweets from the original set that are part of replies, rts or quotes
tweets_processed[[1]] <- tweets_processed[[1]] %>% 
  filter(!status_id %in% rts_quotes_processed[[1]]$status_id) %>%
  filter(!status_id %in% conversations_processed[[1]]$status_id) %>%
  mutate(conversation=NA,retweet_user_id=NA,quote_user_id=NA)

if(!is.null(nrow(rts_quotes_processed[[1]]))) {
  rts_quotes_processed[[1]] <- rts_quotes_processed[[1]] %>% mutate(retweet_user_id=NA)
}

## Finishing up the data
tweets_complete <- rbind(tweets_processed[[1]],rts_quotes_processed[[1]],conversations_processed[[1]]) %>% 
  unique

### Checking for doubles; doubles are and have ambiguous characteristics, such as 
### being part of two conversations or being a quote as well as a normal tweet
double <- tweets_complete %>% 
  group_by(status_id) %>% 
  dplyr::summarize(count=n()) %>% 
  filter(count>1) %>% 
  .$status_id

### Isolating the non-double tweets as basis for conversations table
if(length(double)>=1) {
  without_double <- tweets_complete %>%
    filter(!status_id %in% double) %>%
    distinct %>%
    select(status_id,conversation)
  
  ### Only the double cases, without conversation column (cleaning its characteristics)
  double_cases_without_conv <- tweets_complete %>%
    filter(status_id %in% double) %>%
    select(-conversation) %>%
    distinct
  
  ## Retrieves the conversation number for the double cases
  clean_doubles <- double_cases_without_conv %>% 
    group_by(status_id) %>%
    do(conversation=retrieve_con(.$status_id)) %>%
    unnest
  
  tweets <- tweets_complete %>% 
    select(-conversation) %>% 
    distinct %>% 
    filter(!status_id %in% double) %>%
    mutate(timestamp) %>%
    as_tibble
  
  ## Create conversation data frame (conversation is not stored as a tweet attribute, but in a seperate table)
  conversations <- clean_doubles %>% 
    rbind(.,without_double) %>%
    unique %>%
    filter(!is.na(conversation))
} else {
  tweets <- tweets_complete %>% 
    distinct %>% 
    mutate(timestamp) %>%
    as_tibble
  
  conversations <- conversations_processed[[1]] %>% 
    unique %>%
    filter(!is.na(conversation))
}

## Creates mentions data frame (including the mentions in the rts, quotes, and tweets-and-reply-chains.)
mentions_users <- rbind(tweets_processed[[2]],rts_quotes_processed[[2]],conversations_processed[[2]]) %>% 
  unique %>%
  mutate(timestamp=timestamp) %>%
  as_tibble

mentions <- rbind(tweets_processed$mentioned_users,
                  rts_quotes_processed$mentioned_users,
                  conversations_processed$mentioned_users) %>%
  as_tibble %>% mutate(timestamp=timestamp)  ## mentions, te gebruiken voor netwerk zonder context

user_stats <- retrieve_user_stats(tweets$user_id %>% unique %>% as.character) %>%
  as_tibble %>%
  mutate(timestamp=timestamp)

Using Google’s Natural Language Processing API, we detect the language of each tweet, and filter our data set to only retain tweets by Dutch users, or tweets in the Dutch language.

## Detect language and only retain Tweets in Dutch
# First, authenticate with Google

### plugin your credentials
options("googleAuthR.client_id" = "")
options("googleAuthR.client_secret" = "")
options("googleAuthR.scopes.selected" = c("https://www.googleapis.com/auth/cloud-platform"))
googleAuthR::gar_auth()

## Batching tweets
tweets_tmp <- tweets
tweets_tmp$line <- c(1:nrow(tweets_tmp))
tweets_tmp <- tweets_tmp %>%
  mutate(batch=line/250) %>% mutate(batch=ceiling(batch)) %>% select(-line)

## Running through for loop delaying each batch with 110 seconds
tweets_lang <- NULL
index <- tweets_tmp$batch %>% unique
for(i in 1:length(index)) {
  print(paste("Retrieving round ",i," of ",max(index)))
  x <- tweets_tmp %>%
    filter(batch==index[i]) %>%
    group_by(status_id,user_id) %>%
    do(pos_gl_translate_detect(.$text)) %>% ## return language characteristics by id
    unnest
  
  tweets_lang <- rbind(tweets_lang,x)
  
  Sys.sleep(110)
  
}

## Including messages that...
### Are in Dutch (dutch_messages - status_ids)
### Are Written by Dutch users (dutch_users - user_ids)
### Are associated with conversations involving Dutch users (ducth_conversations - conversations numbers)
### Are mentioning or mentioned by Dutch users (dutch_mentions - status_ids)

dutch_messages <- tweets_lang %>% filter(language=="nl") %>% .$status_id %>% unique

## Loading list of Dutch cities to identify users living in the Netherlands
dutch_cities <- read_csv("dutch_cities.csv", 
                         col_types = cols(Province = col_skip(), 
                                          S.No = col_skip())) %>%
  .$Cities %>% tolower

dutch_cities <- c(paste0(" ",dutch_cities," "),paste0(" ",dutch_cities,"$"),paste0("^",dutch_cities," "),paste0("^",dutch_cities,"$"))
dutch_cities <- c(dutch_cities,"netherlands","holland"," nl$", " nl ")  

dutch_users <- user_stats %>%
  mutate(location=location %>% tolower) %>%
  group_by(user_id,location) %>%
  do(dutch=ifelse((str_detect(.$location,dutch_cities) %>% sum)>=1,TRUE,FALSE)) %>%
  unnest
  
dutch_users_2 <- tweets_lang %>% filter(language=="nl") %>% .$user_id %>% c(.,dutch_users %>% filter(dutch==TRUE) %>% .$user_id) %>% unique

messages_by_dutch_users <- tweets %>% filter(user_id %in% dutch_users_2 | status_id %in% dutch_messages) %>% .$status_id %>% unique

dutch_conversations <- conversations %>% filter(status_id %in% messages_by_dutch_users) %>% .$conversation %>% unique

dutch_mentions <- mentions %>% filter(user_id %in% dutch_users_2 | mentions_user_id %in% dutch_users_2) %>% .$status_id %>% unique

messages_to_include <- c(messages_by_dutch_users,
                         conversations,
                         dutch_mentions) %>% unique

tweets <- tweets %>% filter(status_id %in% messages_to_include)
user_stats <- user_stats %>% filter(user_id %in% tweets$user_id)
tweets_by_marker <- tweets_by_marker %>% filter(status_id %in% messages_to_include)
mentions <- mentions %>% filter(status_id %in% messages_to_include)
conversations <- conversations %>% filter(conversation %in% dutch_conversations)

This results in the following data frames: tweets, tweets_by_marker, mentions, conversations, and user_stats.

3.2 Ego networks

Next, we retrieve the ego network of the authors of these tweets, including all the users that the authors are following and are being followed by.

## Preprocess data

users_followers <- user_stats %>%
  select(user_id,followers) %>%
  filter(!user_id %in% x) %>%
  filter(followers>0, followers < 500000) %>%
  unique %>%
  mutate(calls_needed=((followers/5000)%>%ceiling))

## Retrieve followers

if(nrow(users_followers>0)) {
  
  ## Smaller accounts to retrieve in one time frame
  
  smaller_accounts <- users_followers %>% 
    filter(followers<=75000) %>%
    arrange(followers)
  
  ## Big accounts to retrieve in more time frames
  are_there_big_accounts <- ifelse(nrow(users_followers %>%filter(followers>75000))!=0,TRUE,FALSE)
  
  ## Initiating results df
  followers <- NULL
  ## Loops through given data frame row by row
  for(i in 1:nrow(smaller_accounts)) {
    
    ## Subset batch, announcing batch start
    batch <- smaller_accounts[i,]
    calls_needed <- batch$calls_needed
    
    ## Limit check
    limit_check(batch$calls_needed,TRUE,"followers/ids")
    
    print(paste("###### Retrieving user",i,"of",nrow(smaller_accounts)))
    
    ## Retrieve followers. Returning NA when no results given
    tmp <- batch %>%
      group_by(user_id) %>%  ##=user_id) %>%
      do(
        data.frame(follower=(pos_get_followers(.$user_id,calls_needed)))) %>%
      as.data.frame
    
    tmp$follower <- tmp$follower %>% as.character
    
    followers <- rbind(followers,tmp)
    
  } ## einde retrieval small accounts
  
  if(are_there_big_accounts==TRUE) {
    
    ## create set
    big_accounts <- users_followers %>% 
      filter(followers>75000) %>%
      arrange(calls_needed)
    
    ## Loops through given data frame row by row
    for(i in 1:nrow(big_accounts)) {
      
      ## Subset batch, announcing batch start
      batch <- big_accounts[i,]
      
      ## Retrieve followers. Returning NA when no results given
      tmp <- batch %>%
        group_by(user_id) %>%
        do(
          data.frame(follower=get_followers_for_large_account(.$user_id,.$calls_needed))) %>%
        as.data.frame
      
      tmp$follower <- tmp$follower %>% as.character
      
      followers <- rbind(followers,tmp)
      
    } ## for loop big_accounts
    
  } ## einde retrieval large 
  
  followers$user_id <- followers$user_id %>% as.character
  followers$follower <- followers$follower %>% as.character
  
} ## einde retrieval followers

## Retrieve following

## Preprocess data
users_following <- user_stats %>%
  select(user_id,following) %>%
  filter(user_id %in% followers$user_id) %>%
  filter(following>0,following<500000) %>%
  unique %>%
  mutate(calls_needed=((following/5000)%>%ceiling))

if(nrow(users_following>0)) {
  
  ## Smaller accounts to retrieve in one time frame
  smaller_accounts <- users_following %>% 
    filter(following<=75000) %>%
    arrange(calls_needed)
  
  ## Big accounts to retrieve in more time frames
  are_there_big_accounts <- ifelse(nrow(users_following %>%filter(following>75000))!=0,TRUE,FALSE)
  
  ## Initiating results df
  following <- NULL
  ## Loops through given data frame row by row
  for(i in 1:nrow(smaller_accounts)) {
    
    ## Subset batch, announcing batch start
    batch <- smaller_accounts[i,]
    calls_needed <- batch$calls_needed
    
    ## Limit check
    limit_check(batch$calls_needed,TRUE,"friends/ids")
    
    print(paste("###### Retrieving user",i,"of",nrow(smaller_accounts)))
    
    ## Retrieve followers. Returning NA when no results given
    tmp <- batch %>%
      group_by(user_id) %>%  ##=user_id) %>%
      do(
        data.frame(following=(pos_get_following(.$user_id,calls_needed)))) %>%
      as.data.frame
    
    tmp$following <- tmp$following %>% as.character
    
    following <- rbind(following,tmp)
    
  } ## einde retrieval small accounts
  
  if(are_there_big_accounts==TRUE) {
    
    ## create set
    big_accounts <- users_following %>% 
      filter(following>75000)
    
    ## Loops through given data frame row by row
    for(i in 1:nrow(big_accounts)) {
      
      ## Subset batch, announcing batch start
      batch <- big_accounts[i,]
      
      ## Retrieve followers. Returning NA when no results given
      tmp <- batch %>%
        group_by(user_id) %>%
        do(
          data.frame(following=get_following_for_large_account(.$user_id,.$calls_needed))) %>%
        as.data.frame
      
      tmp$following <- tmp$following %>% as.character
      
      following <- rbind(following,tmp)
      
    } ## einde for loop big_accounts
    
  } ## einde retrieval large 
  
  following$user_id <- following$user_id %>% as.character
  following$following <- following$following %>% as.character
  
} ## einde retrieval following

## Cleaning colnames
colnames(followers) <- c("source","target")
colnames(following) <- c("target","source")

3.3 Process data

Collectively, these data reflect how are authors are connected, and what media preferences and audiences they have in common. To reduce noise, we can filter the network to only include Twitter users that are connected to at least 10% of the authors of our tweets. The code offers the option to filter our users that are not well-connected to authors in our data set. This cut-off point can be determined by looking at the distribution of non-author connections with authors.

This returns two data frames: nodelist_data_frame and edgelist_data_frame.

## Calculate author connections...

### ...for user_id_x_follower
followers_count <- followers %>%
  .$target %>%
  table %>%
  as.data.frame

### ...for user_id_x_following
following_count <- following %>%
  .$source %>%
  table %>%
  as.data.frame

### ...and combining both
connections_with_authors <- data.frame(user_id=c(followers_count$. %>% as.character,following_count$. %>% as.character),
                                       count=c(followers_count$Freq,following_count$Freq)) %>%
  group_by(user_id) %>%
  summarize(count=sum(count))

## Dealing with factors
connections_with_authors$user_id <- connections_with_authors$user_id %>% as.character

## Inspecting the distribution to determine the cut-off point

### Summary
summ <- connections_with_authors$count %>% summary

### Distribution
dist <- connections_with_authors %>%
  group_by(count) %>%
  summarize(occurrence=n()) %>%
  mutate(percentage=occurrence/sum(occurrence)*100) %>% arrange(desc(percentage))

print(summ)
plot(dist$occurrence)

#### Based on this distribution above, we vould decide to retain accounts with 10 authors or more

## Filtering the data

### Creating a vector of the author id's
author_ids <- tweets_complete$user_id %>% unique

### Filtering followers
followers_flt <- followers %>% 
  filter(target %in% c(author_ids,connections_with_authors %>% filter(count>=(length(author_ids) * .1)) %>% .$user_id %>% as.character)) %>%
  select(source,target)

### Filtering following
following_flt <- following %>% 
  filter(source %in% c(author_ids,(connections_with_authors %>% filter(count>=(length(author_ids) * .1)) %>% .$user_id %>% as.character))) %>%
  select(source,target)

## Creating edgelist

## Retrieve mentions for weight
mentions <- mentions %>% 
  group_by(user_id,mentions_user_id) %>% 
  summarize(weight=n()) %>% 
  ungroup %>% 
  mutate(weight=dense_rank(weight)+1)

## Create edgelist including weight column
edgelist_data_frame <- rbind(data.frame(source=followers_flt$source %>% as.character,
                                        target=followers_flt$target %>% as.character),
                             data.frame(source=following_flt$source %>% as.character,
                                        target=following_flt$target %>% as.character)) %>%
  distinct %>%
  left_join(.,mentions,by=c("source"="user_id","target"="mentions_user_id")) %>%
  mutate(weight=ifelse(is.na(weight),1,weight))

edgelist_data_frame <- edgelist_data_frame %>%
  select(Source=source,Target=target,Weight=weight)

## retrieve user_stats for all nodes in the network and create the initial nodelist

## Retrieve user_stats
user_data_to_retrieve <- c(edgelist_data_frame$Source,edgelist_data_frame$Target) %>% unique %>% as.character
user_stats <- retrieve_user_stats(user_data_to_retrieve)

# Create nodelist
nodelist_data_frame <- data.frame(Id=user_stats$user_id %>% as.character,
                                  Label=user_stats$screen_name %>% as.character,
                                  real_name=user_stats$real_name %>% as.character,
                                  description=user_stats$description %>% as.character,
                                  location=user_stats$location %>% as.character,
                                  language=user_stats$language %>% as.character,
                                  followers=user_stats$followers,
                                  following=user_stats$following) %>%
  as.data.frame %>%
  group_by(Id) %>%
  mutate(statuses=get_statuses(Id) %>% as.integer) %>% ## gets the ammount of statuses for every node
  as.data.frame

## Workaround for R making factors from characters
nodelist_data_frame$Id <- nodelist_data_frame$Id %>% as.character
nodelist_data_frame$Label <- nodelist_data_frame$Label %>% as.character 
nodelist_data_frame$real_name <- nodelist_data_frame$real_name %>% as.character 
nodelist_data_frame$description <- nodelist_data_frame$description %>% as.character
nodelist_data_frame$location <- nodelist_data_frame$location %>% as.character 
nodelist_data_frame$language <- nodelist_data_frame$language %>% as.character 

user_ids <- c(edgelist_data_frame$source,edgelist_data_frame$target) %>% unique

## Creating NA's for user_ids in the edgelist that could not be retrieved (deleted?)
na <- data.frame(Id=c(user_ids[!user_ids %in% nodelist_data_frame$Id])) %>%
  mutate(Label=NA,
         real_name=NA,
         description=NA,
         location=NA,
         language=NA,
         followers=NA,
         following=NA,
         statuses=NA) %>%
  as.data.frame

## Combining the data
nodelist_data_frame <- rbind(nodelist_data_frame,na)

original <- tweets_complete[tweets_complete$original==TRUE,]$user_id %>% unique

nodelist_data_frame <- nodelist_data_frame %>%
  group_by(Id,Label,real_name,description,location,language,followers,following,statuses) %>%
  do(original=ifelse(.$Id %in% original,TRUE,FALSE)) %>%
  unnest

3.4 Community detection, influence analysis and visualization

We use igraph’s implementation of the Louvain algorithm to identify communities of closely connected Twitter users. We use igraph’s betweenness algorithm to get a sense of the extent to which each user in the network is capable of connecting a large number of people.

Finally, we export our igraph network into a data frame nodelist_network_stats that can be exported and load it into Gephi for visualization.

## First, we make sure the same user_ids are in both the nodelist and edgelist
Ids_in_nodelist <- nodelist_data_frame %>% .$Id %>% unique  ## All ids in nodelist 
Ids_in_edgelist <- c(edgelist_data_frame$Source,edgelist_data_frame$Target) %>% unique ## all ids in edgelist
Edge_ids_not_in_nodelist <- Ids_in_edgelist[!Ids_in_edgelist %in% Ids_in_nodelist] ## Ids in edgelist not in nodelist
Node_ids_not_in_edgelist <- Ids_in_nodelist[!Ids_in_nodelist %in% Ids_in_edgelist] ## Ids in de nodelist not in de edgelist; typically users that didnt interact

edgelist <- edgelist_data_frame %>% filter(!Source %in% Edge_ids_not_in_nodelist, !Target %in% Edge_ids_not_in_nodelist)
nodelist <- nodelist_data_frame %>% filter(!Id %in% Node_ids_not_in_edgelist)

## create graph
graph <- graph_from_data_frame(edgelist,directed=TRUE,vertices=(nodelist))

## Cluster louvain
louvain <- cluster_louvain(as.undirected(graph))

## Calc modularity score
modularity <- modularity(as.undirected(graph), membership(louvain))

## create df for louvain
Ids <- membership(louvain) %>% names
Communities <- membership(louvain) %>% as.numeric
membership <- data.frame(Id=Ids,Community=Communities)

## Betweenness
betweenness <- estimate_betweenness(graph,directed=TRUE,cutoff=10)
Ids_btw <- betweenness %>% names
btw <- betweenness %>% as.numeric
betweenness <- data.frame(Id=Ids_btw,betweenness=btw)

## PageRank
pagerank <- page_rank(graph,directed=TRUE)
Ids_pr <- pagerank$vector %>% names
pr <- pagerank$vector %>% as.numeric
pagerank <- data.frame(Id=Ids_pr,pagerank=pr)

## update nodelist
nodelist_network_stats <- nodelist %>%
  left_join(.,membership,by=c("Id"="Id")) %>%
  left_join(.,pagerank,by=c("Id"="Id")) %>% 
  left_join(.,betweenness,by=c("Id"="Id"))

3.5 Compute wordclouds

We analyze the profile descriptions – which are the short texts people use to describe themselves on Twitter – by community, distinguishing between authors and the wider media ecologies, in a similar manner as described in step 2.6.

Using the plot_cloud function, providing it with the worclouds_for_each_com data frame and the community’s id, wordclouds can be plotted.

keepers <- nodelist_network_stats %>% 
  group_by(Community) %>% 
  summarize(count=n()) %>% 
  mutate(perc=count/sum(count)*100) %>% 
  arrange(desc(perc)) %>%
  filter(perc>=1) %>%
  .$Community

## Cleaning corpus
corpus_description <- nodelist_network_stats %>%
  unnest_tokens(.,word,description,token="words") %>%
  filter(!word %in% stopwords("en",source="snowball")) %>% ## stopwords
  filter(!word %in% stopwords("nl",source="snowball")) %>% ## stopwords
  filter(!word %in% stopwords("de",source="snowball")) %>% ## stopwords
  filter(!word %in% stopwords("fr",source="snowball")) %>% ## stopwords
  filter(!str_detect(word,"[0-9]")) %>% ## getallen
  filter(!str_detect(word,"http[s]?|www|co.in|.in|.com|t.co")) %>% ## url noise
  filter(!str_detect(word,"^[A-z]$|^[A-z][A-z]$")) ## one and two character words

## Cleaning corpus
corpus_tweets <- tweets %>% 
  left_join(.,nodelist_network_stats %>% select(Id,Community),by=c("user_id"="Id")) %>%
  filter(is_retweet==FALSE, is_quote==FALSE) %>%
  unnest_tokens(.,word,text,token="words") %>%
  filter(!word %in% stopwords("en",source="snowball")) %>% ## stopwords
  filter(!word %in% stopwords("nl",source="snowball")) %>% ## stopwords
  filter(!word %in% stopwords("de",source="snowball")) %>% ## stopwords
  filter(!word %in% stopwords("fr",source="snowball")) %>% ## stopwords
  filter(!str_detect(word,"[0-9]")) %>% ## getallen
  filter(!str_detect(word,"http[s]?|www|co.in|.in|.com|t.co")) %>% ## url noise
  filter(!str_detect(word,"^[A-z]$|^[A-z][A-z]$"))

tf_idf_desc <- corpus_description %>% 
  group_by(Community,word) %>% 
  summarize(count=n()) %>% 
  bind_tf_idf(.,word,Community,count) %>% 
  arrange(Community,desc(tf_idf))

tf_idf_tweets <- corpus_tweets %>% 
  group_by(Community,word) %>% 
  summarize(count=n()) %>% 
  bind_tf_idf(.,word,Community,count) %>% 
  arrange(Community,desc(tf_idf))

wordclouds_for_each_com_desc <- NULL
for(i in 1:length(keepers)) {
  
  ## Create wordcloud data by week
  x <- tibble(Community=keepers[i]) %>%
    full_join(.,tf_idf_desc,by=c("Community"="Community"))
  
  wordclouds_for_each_com_desc <- rbind(wordclouds_for_each_com_desc,x) %>% distinct
  
}

wordclouds_for_each_com_tweets <- NULL
for(i in 1:length(keepers)) {
  
  ## Create wordcloud data by week
  x <- tibble(Community=keepers[i]) %>%
    full_join(.,tf_idf_tweets,by=c("Community"="Community"))
  
  wordclouds_for_each_com_tweets <- rbind(wordclouds_for_each_com_tweets,x) %>% distinct
  
}

# plot_cloud(wordclouds_for_each_com,39)

4. Recommender network (YouTube)

For the recommender network, we use the Video Network option from the YouTube Data Tools.

4.1 Retrieve network

We retrieve the video networks for each of our queries with a crawl depth of 1. The YouTube Data Tools return a Gephi file.

4.2 Analyze network

We used Gephi for visualization and analysis of the network.

4.3 Compute wordclouds for network

First, we import the nodelist exported from Gephi after performing community analysis. To understand the nature of the different communities in the video network, we have analyzed the video titles by community by plotting wordclouds following the method described in step 2.6.

Using the plot_cloud function, providing it with the worclouds_for_each_com data frame and the community’s id, wordclouds can be plotted.

## related_videos_depth0_gephi_export <- read_csv("")

YT_graph <- related_videos_depth0_gephi_export

list_for_analysis <- YT_graph %>% select(Label,modularity_class)
  
# CLEANING THE CORPUS

## Wordcloud data by community
keepers <- list_for_analysis %>% 
  group_by(modularity_class) %>% 
  summarize(count=n()) %>% 
  ungroup %>% 
  mutate(perc=count/sum(count)) %>% 
  arrange(desc(perc)) %>% filter(perc>=0.1) %>%
  .$modularity_class

## Cleaning corpus
corpus_videotitles <- list_for_analysis %>% as_tibble %>%
  unnest_tokens(.,word,Label,token="words") %>%
  filter(!word %in% stopwords("nl",source="snowball")) %>% ## stopwords
  filter(!word %in% stopwords("en",source="snowball")) %>% ## stopwords
  filter(!word %in% stopwords("de",source="snowball")) %>% ## stopwords
  filter(!word %in% stopwords("ru",source="snowball")) %>% ## stopwords
  filter(!str_detect(word,"[0-9]")) %>% ## getallen
  filter(!str_detect(word,"http[s]?|www|co.in|.in|.com")) %>% ## url noise
  filter(!str_detect(word,"^[A-z]$|^[A-z][A-z]$")) %>% ## one and two character words
  filter(modularity_class %in% keepers)

## Word count by com (title)
titlecount_by_marker <- corpus_videotitles %>%
  group_by(modularity_class,word) %>% 
  summarize(count=n()) %>% 
  arrange(modularity_class,desc(count))

## TF*IDF by com (title)
titletfidf_by_marker <- corpus_videotitles %>%
  group_by(modularity_class,word) %>% 
  summarize(count=n()) %>% 
  ungroup %>%
  bind_tf_idf(word,modularity_class,count) %>%
  arrange(modularity_class,desc(count))

wordclouds_for_each_com_yt <- NULL
for(i in 1:length(keepers)) {
  
  ## Create wordcloud data by week
  x <- tibble(Community=keepers[i]) %>%
    full_join(.,titletfidf_by_marker,by=c("Community"="modularity_class"))
  
  wordclouds_for_each_com_yt <- rbind(wordclouds_for_each_com_yt,x) %>% distinct
  
}

# plot_cloud(wordclouds_for_each_com_yt,9)