Introduction
The following document contains notes and exercises from the DataCamp course “Analyzing Social Media Data in R”.
Twitter data and the rtweet package
Considering the volume and rate of tweets posted every second worldwide, Twitter data represent an enormous amount of information, both from the tweet text and its metadata, which creates enormous opportunities to derive social and marketing insights. The rtweet package is an R interface to the Twitter API and contains many functions for extracting Twitter data related to a specific topic or user. With the function stream_tweets() we can extract a random sample of roughly 1% of all tweets posted during a given time window and save it in a data frame.
library(rtweet)
# Extract live tweets for a 10-second window
tweets10s <- stream_tweets("", timeout = 10)
# View dimensions of the data frame with live tweets
dim(tweets10s)
## [1] 429 90
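stream_tweets() can also filter the live stream by keyword instead of sampling everything. A minimal sketch, assuming we want tweets mentioning "rstats" (the keyword and timeout are illustrative choices):
# Stream only tweets containing "rstats" for 30 seconds
tweets_rstats <- stream_tweets("rstats", timeout = 30)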
Search and extract tweets
The search_tweets() function
search_tweets() is a powerful function from rtweet used to extract tweets based on a search query. The function returns a maximum of 18,000 tweets per request, drawn from tweets posted in the last 6-9 days. In this exercise, search_tweets() is used to extract tweets on the Emmy Awards, by looking for tweets containing the Emmy Awards hashtag.
# Extract tweets on "#Emmyawards" and include retweets
twts_emmy <- search_tweets("#Emmyawards",
n = 2000,
include_rts = TRUE,
lang = "en")
# View the first 5 columns and 5 rows of the output
head(twts_emmy[,1:5], 5)
## # A tibble: 5 x 5
## user_id status_id created_at screen_name text
## <chr> <chr> <dttm> <chr> <chr>
## 1 107225075… 13807534734… 2021-04-10 05:24:15 LMcBee4Dall… "Legendary natural h…
## 2 4120556734 13806253931… 2021-04-09 20:55:19 NBCDFWCommu… "Legendary natural h…
## 3 235598077 13806251676… 2021-04-09 20:54:25 earthxorg "Legendary natural h…
## 4 822493154… 13806123892… 2021-04-09 20:03:38 EarthxFilm "Legendary natural h…
## 5 29852972 13805661470… 2021-04-09 16:59:53 RobertaRT "Flashback 11 years …
The get_timeline() function
get_timeline() is another function in the rtweet library that can be used to extract tweets, in this case the tweets a given user has posted to their timeline. The get_timeline() function can extract up to 3,200 tweets at a time.
In this exercise, tweets posted by Cristiano Ronaldo (Twitter handle @Cristiano) are extracted.
# Extract tweets posted by the user @Cristiano
get_cris <- get_timeline("@Cristiano", n = 20)
# View the first 10 columns and 5 rows of the output
head(get_cris[,1:10], 5)
## # A tibble: 5 x 10
## user_id status_id created_at screen_name text source
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 1556592… 1379903300… 2021-04-07 21:05:58 Cristiano "Grandi ragazzi,… Twitte…
## 2 1556592… 1379446925… 2021-04-06 14:52:30 Cristiano "🏳️🏴👏🏽 #juventus… Twitte…
## 3 1556592… 1377019810… 2021-03-30 22:08:01 Cristiano "Vitória importa… Twitte…
## 4 1556592… 1376596593… 2021-03-29 18:06:18 Cristiano "🇵🇹❤️ https://t.… Twitte…
## 5 1556592… 1375064846… 2021-03-25 12:39:41 Cristiano "Muito important… Twitte…
## # … with 4 more variables: display_text_width <dbl>, reply_to_status_id <lgl>,
## # reply_to_user_id <lgl>, reply_to_screen_name <lgl>
# Plot tweet text width as a function of time
library(ggplot2)
ggplot(get_cris, aes(created_at, display_text_width)) +
geom_col(fill = "coral") +
labs(x = "Date", y = "Text length")Tweets metadata
User interest and tweet counts
The metadata components of extracted Twitter data can be analyzed to derive insights. To identify Twitter users who are interested in a topic, you can look at users who tweet frequently about it. In this exercise, users who have tweeted often on the topic “Artificial Intelligence” are identified.
# Search tweets related to artificial intelligence
tweets_ai <- search_tweets("ArtificialIntelligence", n = 1000)
# Create a table of users and tweet counts for the topic
sc_name <- table(tweets_ai$screen_name)
# Sort the table in descending order of tweet counts
sc_name_sort <- sort(sc_name, decreasing = TRUE)
library(dplyr)
sc_name_df <- as.data.frame(sc_name_sort) %>%
top_n(n=10, Freq)
# Plot the 10 most active users
ggplot(sc_name_df, aes(x = Var1, y = Freq)) +
geom_col(fill = "lightblue") +
labs(x = " ",
y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Compare follower count
The follower count of a Twitter account indicates the popularity of a personality or business entity and is a measure of influence on social media. In this exercise, the follower counts for the Twitter accounts of three popular news sites are extracted and compared.
# Extract user data for the Twitter accounts of 3 news sites
users_id <- c("CNN", "fivethirtyeight", "espn")
users <- lookup_users(users_id, parse = TRUE, token = NULL)
# Create a data frame of screen names and follower counts
user_df <- users[,c("screen_name","followers_count")]
# Display and compare the follower counts for the 3 news sites
ggplot(user_df, aes(x = screen_name, y = followers_count, fill = screen_name)) +
labs(x = " ",
y = "Followers count") +
geom_col() +
theme(legend.position = "none")
Retweet counts
The number of times a tweet is retweeted indicates what is trending.
In this exercise, the tweets on “Artificial Intelligence” that have been retweeted the most are extracted.
# Create a data frame of tweet text and retweet count
rtwt <- tweets_ai[,c("text", "retweet_count")]
# Sort data frame based on descending order of retweet counts
rtwt_sort <- arrange(rtwt, desc(retweet_count))
# Exclude rows with duplicate text from the sorted data frame
rtwt_unique <- rtwt_sort[!duplicated(rtwt_sort$text), ]
# Print the top 3 unique posts retweeted the most times
rownames(rtwt_unique) <- NULL
head(rtwt_unique, 3)
## # A tibble: 3 x 2
## text retweet_count
## <chr> <int>
## 1 "We are thrilled to announce that @PacktPub is sponsoring @clou… 885
## 2 "Toyota Is Using Artificial Intelligence To Build A New City - … 674
## 3 "IAB AI Working Group to Establish Artificial Intelligence Stan… 669
Filtering tweets
Filtering for original tweets
An original tweet is an original posting by a Twitter user, not a retweet, quote, or reply. The “-filter” attribute can be combined with a search query to exclude retweets, quotes, and replies during tweet extraction. In this exercise, tweets on “Superbowl” that are original posts, and not retweets, quotes, or replies, are extracted.
# Extract 100 original tweets on "Superbowl"
tweets_org <- search_tweets("Superbowl -filter:retweets -filter:quote -filter:replies", n = 100)
# Check for the presence of replies
library(plyr)
count(tweets_org$reply_to_screen_name)
## x freq
## 1 NA 100
# Check for the presence of quotes
count(tweets_org$is_quote)
## x freq
## 1 FALSE 100
# Check for the presence of retweets
count(tweets_org$is_retweet)
## x freq
## 1 FALSE 100
Filtering on tweet language
The language filter can be used to extract tweets in a specific language, for example tweets posted in French on the topic “Apple iPhone”.
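A minimal sketch of such a query (the query string and tweet count are illustrative):
# Extract 100 tweets on "Apple iPhone" posted in French
tweets_french <- search_tweets("Apple iPhone", n = 100, lang = "fr")
# Check the language metadata of the extracted tweets
head(tweets_french$lang)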
Filter based on tweet popularity
Popular tweets are tweets that are retweeted and favorited many times. We can filter the extraction to tweets that have been retweeted and/or favorited at least a certain number of times. In this exercise, tweets on “Chelsea” that have been retweeted a minimum of 100 times and favorited by at least 100 users are extracted.
# Extract tweets with a minimum of 100 retweets and 100 favorites
tweets_pop <- search_tweets("Chelsea min_retweets:100 AND min_faves:100")
# Create a data frame to check retweet and favorite counts
counts <- tweets_pop[c("retweet_count", "favorite_count")]
ggplot(counts, aes(retweet_count, favorite_count)) +
geom_point(color = "coral", size = 4, alpha = 0.4) +
labs(x = "number of retweets",
y = "number of favourite")User information
Extract user information
User information contains data on the number of followers and friends of a Twitter user. It can be extracted with the users_data() function. The user information may contain multiple instances of the same user, as the user might have tweeted multiple times on a given subject. In this exercise, the numbers of friends and followers of users who tweet on #cosmetics are identified.
tweet_cos <- search_tweets("#cosmetics", n = 1000)
# Extract user information of people who have tweeted on the topic
user_cos <- users_data(tweet_cos)
# Aggregate screen name, follower and friend counts
library(tidyr)
counts_df <- user_cos %>%
group_by(screen_name) %>%
summarize(follower = mean(followers_count),
friend = mean(friends_count))
head(counts_df)
## follower friend
## 1 2850.622 1536.955
Explore users based on the golden ratio
The ratio of the number of followers to the number of friends of a user is called the golden ratio, a useful metric for marketers when planning promotions. Users with a high ratio can be leveraged to promote a product.
# Calculate and store the golden ratio
counts_df$ratio <- counts_df$follower/counts_df$friend
# Sort the data frame in decreasing order of follower count
counts_sort <- arrange(counts_df, desc(follower))
# Select rows where the follower count is greater than 50000
counts_sort[counts_sort$follower>50000,]
## [1] follower friend ratio
## <0 rows> (or 0-length row.names)
Subscribers to Twitter lists
A Twitter list is a curated group of Twitter accounts centered around a topic of interest. In this exercise, the lists that the Twitter account “NBA” subscribes to are extracted.
# Extract all the lists "NBA" subscribes to and view the first 3 columns
lst_NBA <- lists_users("NBA")
lst_NBA[,1:3]
## # A tibble: 4 x 3
## list_id name uri
## <chr> <chr> <chr>
## 1 18013707 NBA G League Teams /NBA/lists/nba-g-league-teams1
## 2 18013538 WNBA Teams /NBA/lists/wnba-teams
## 3 17852612 NBA Players /NBA/lists/nba-players
## 4 3738526 NBA Teams /NBA/lists/nba-teams
# Extract subscribers of the list "nbateams" and view the first 3 columns
list_NBA_sub <- lists_subscribers(slug = "nbateams", owner_user = "NBA")
list_NBA_sub[,1:3]
## NULL
## <0 rows> (or 0-length row.names)
# Create a list of 4 screen names from the subscribers list
users <- c("JWBaker_4","towstend", "iKaanic", "Dalton_Boyd")
# Extract user information for the list and view the first 3 columns
users_NBA_sub <- lookup_users(users)
users_NBA_sub[,1:3]
## # A tibble: 4 x 3
## user_id status_id created_at
## <chr> <chr> <dttm>
## 1 91031258 1380224790527545347 2021-04-08 18:23:28
## 2 1100210166727729152 1375079166849081345 2021-03-25 13:36:35
## 3 2150289313 1375150800176087048 2021-03-25 18:21:14
## 4 30284789 1373844979869741069 2021-03-22 03:52:22
Trends
Trends by country name
Location-specific trends identify popular topics trending in a specific country.
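Twitter reports trends only for selected locations. A minimal sketch using rtweet's trends_available() to check which location names are valid (assuming the returned data frame contains name and country columns):
# List the locations for which trend data are available
trends_avail <- trends_available()
head(trends_avail[, c("name", "country")])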
# Get topics trending in Canada
gt_country <- get_trends("Canada")
# View the first 6 rows and 4 columns
head(gt_country[,1:4])
## # A tibble: 6 x 4
## trend url promoted_content query
## <chr> <chr> <lgl> <chr>
## 1 Pulisic http://twitter.com/search?q=Pu… NA Pulisic
## 2 #NationalSibl… http://twitter.com/search?q=%2… NA %23NationalSi…
## 3 #CRYCHE http://twitter.com/search?q=%2… NA %23CRYCHE
## 4 Havertz http://twitter.com/search?q=Ha… NA Havertz
## 5 #Flames1stGoal http://twitter.com/search?q=%2… NA %23Flames1stG…
## 6 Chelsea http://twitter.com/search?q=Ch… NA Chelsea
Trends by city and most tweeted trends
Trending topics in a city provide a chance to promote region-specific events or products.
# Get topics trending in London
gt_city <- get_trends("London")
# View the first 6 rows and 3 columns
head(gt_city[,1:3])
## # A tibble: 6 x 3
## trend url promoted_content
## <chr> <chr> <lgl>
## 1 Bielsa http://twitter.com/search?q=Bielsa NA
## 2 Payne http://twitter.com/search?q=Payne NA
## 3 #CSKvsDC http://twitter.com/search?q=%23CSKvsDC NA
## 4 Wycombe http://twitter.com/search?q=Wycombe NA
## 5 Shinnie http://twitter.com/search?q=Shinnie NA
## 6 #RomeEPrix http://twitter.com/search?q=%23RomeEPrix NA
# Aggregate the trends and tweet volumes
trend_df <- gt_city %>%
group_by(trend) %>%
summarize(tweet_vol = mean(tweet_volume))
# Sort the data frame in descending order of tweet volume
trend_df_sort <- arrange(trend_df, desc(tweet_vol))
head(trend_df_sort, 5)
## tweet_vol
## 1 NA
Tweet frequency
Visualizing frequency of tweets
Visualizing the frequency of tweets over time helps in understanding the level of interest in a topic or product. In this exercise, tweets on “#walmart” are extracted and a time series plot is created to visualize interest levels.
# Extract tweets on #walmart and exclude retweets
walmart_twts <- search_tweets("#walmart", n = 1000, include_rts = FALSE)
# Create a time series plot
ts_plot(walmart_twts, by = "hours", color = "blue")
Create time series objects
A time series object contains the aggregated frequency of tweets over a specified time interval, which allows comparisons between topics or products.
In this exercise, time series objects for the sportswear brands Puma and Nike are created.
puma_st <- search_tweets("#puma", n = 1000)
nike_st <- search_tweets("#nike", n = 1000)
# Create a time series object for Puma at hourly intervals
puma_ts <- ts_data(puma_st, by ='hours')
# Rename the two columns in the time series object
names(puma_ts) <- c("time", "puma_n")
# Create a time series object for Nike at hourly intervals
nike_ts <- ts_data(nike_st, by ='hours')
# Rename the two columns in the time series object
names(nike_ts) <- c("time", "nike_n")
# Merge the two time series objects and retain "time" column
merged_df <- merge(puma_ts, nike_ts, by = "time", all = TRUE)
head(merged_df)
## time puma_n nike_n
## 1 2021-04-08 12:00:00 4 NA
## 2 2021-04-08 13:00:00 36 NA
## 3 2021-04-08 14:00:00 53 NA
## 4 2021-04-08 15:00:00 45 NA
## 5 2021-04-08 16:00:00 33 NA
## 6 2021-04-08 17:00:00 23 NA
# Stack the tweet frequency columns
library(reshape2)
melt_df <- melt(merged_df, na.rm = TRUE, id.vars = "time")
# View the output
head(melt_df)## time variable value
## 1 2021-04-08 12:00:00 puma_n 4
## 2 2021-04-08 13:00:00 puma_n 36
## 3 2021-04-08 14:00:00 puma_n 53
## 4 2021-04-08 15:00:00 puma_n 45
## 5 2021-04-08 16:00:00 puma_n 33
## 6 2021-04-08 17:00:00 puma_n 23
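As a side note, the same stacking can be achieved with tidyr (loaded earlier) instead of reshape2; a minimal sketch:
# Equivalent stacking with tidyr::pivot_longer()
melt_df2 <- pivot_longer(merged_df, cols = c("puma_n", "nike_n"),
                         names_to = "variable", values_to = "value",
                         values_drop_na = TRUE)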
# Plot frequency of tweets on Puma and Nike
library(ggplot2)
ggplot(data = melt_df, aes(x = time, y = value, col = variable))+
geom_line(lwd = 0.8)
Analyze Tweet texts
Transform into a corpus and clean the text
Tweet text posted by Twitter users is usually unstructured and often contains emoticons, URLs, and numbers. This redundant information has to be cleaned before analysis in order to yield meaningful results. To facilitate processing, we first transform the tweet texts into a corpus, which is a list of text documents. The corpus is then cleaned by converting all characters to lowercase and removing extra whitespace and redundant words (stop words).
library(qdapRegex)
twt_telmed <- search_tweets("#telemedicine", n = 500)
# Extract tweet text from the dataset
twt_txt <- twt_telmed$text
# Remove URLs from the tweet text and view the output
twt_txt_url <- rm_twitter_url(twt_txt)
# Replace special characters, punctuation, & numbers with spaces
twt_txt_chrs <- gsub("[^A-Za-z]"," " , twt_txt_url)
twt_gsub <- twt_txt_chrs
# View the text after replacing special characters, punctuation, & numbers
head(twt_gsub)
library(tm)
# Convert text in "twt_gsub" dataset to a text corpus and view output
twt_corpus <- twt_gsub %>%
VectorSource() %>%
Corpus()
# Convert the corpus to lowercase
twt_corpus_lwr <- tm_map(twt_corpus, tolower)
# Remove English stop words from the corpus and view the corpus
twt_corpus_stpwd <- tm_map(twt_corpus_lwr, removeWords, stopwords("english"))
# Remove additional spaces from the corpus
twt_corpus_final <- tm_map(twt_corpus_stpwd, stripWhitespace)
Removing custom stop words
It is important to remove custom stop words from the corpus before using the visualization tools, since high-frequency filler terms would otherwise dominate the output.
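One way to spot candidate stop words is to list the most frequent terms in the cleaned corpus first; a minimal sketch using tm's findFreqTerms() (the cutoff of 30 occurrences is an arbitrary choice):
# List terms appearing at least 30 times in the cleaned corpus
dtm_check <- DocumentTermMatrix(twt_corpus_final)
findFreqTerms(dtm_check, lowfreq = 30)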
# Create a vector of custom stop words
custom_stopwds <- c("telemedicine", " s", "amp", "can", "new", "medical",
"will", "via", "way", "today", "come", "t", "ways",
"say", "ai", "get", "now")
# Remove custom stop words and create a refined corpus
corp_refined <- tm_map(twt_corpus_final, removeWords, custom_stopwds)
Word clouds for visualization
A word cloud is an image made up of words in which the size of each word indicates its frequency. It is a common way to visualize word frequencies in a document.
library(wordcloud)
# Create a word cloud in red with min frequency of 20
wordcloud(corp_refined, min.freq = 20, colors = "red",
scale = c(3,0.5), random.order = FALSE)
# Create a word cloud with 6 colors and a maximum of 50 words
wordcloud(corp_refined, max.words = 50,
colors = brewer.pal(6, "Dark2"),
scale = c(3,0.5), random.order = FALSE)
Create a document term matrix
The document term matrix (DTM) is a matrix representation of a corpus, with one row per document and one column per term. Creating the DTM from the text corpus is the first step towards more complex analysis.
twt_climate <- search_tweets("#climatechange", n = 500)
climate_txt <- twt_climate$text
climate_txt_url <- rm_twitter_url(climate_txt)
climate_txt_chrs <- gsub("[^A-Za-z]"," " , climate_txt_url)
climate_gsub <- climate_txt_chrs
corpus_climate <- climate_gsub %>%
VectorSource() %>%
Corpus()
corpus_climate <- tm_map(corpus_climate, tolower)
corpus_climate <- tm_map(corpus_climate, removeWords, stopwords("english"))
corpus_climate <- tm_map(corpus_climate, stripWhitespace)
# Create a document term matrix (DTM) from the pre-loaded corpus
dtm_climate <- DocumentTermMatrix(corpus_climate)
# Find the sum of word counts in each document
rowTotals <- apply(dtm_climate, 1, sum)
# Select rows with a row total greater than zero (LDA cannot fit empty documents)
dtm_climate_new <- dtm_climate[rowTotals > 0, ]
Create a topic model
Topic modeling is the task of automatically discovering topics in a vast amount of text. You can create topic models from the tweet text to quickly summarize the large volume of information into distinct topics and gain insights.
library(topicmodels)
# Create a topic model with 5 topics
topicmodl_5 <- LDA(dtm_climate_new, k = 5)
# Select and view the top 5 terms in the topic model
top_5terms <- terms(topicmodl_5, 5)
top_5terms
## Topic 1 Topic 2 Topic 3 Topic 4
## [1,] "climatechange" "climatechange" "climatechange" "climatechange"
## [2,] "climatecrisis" "sustainability" "climate" "years"
## [3,] "amp" "innovation" "amp" "will"
## [4,] "human" "climate" "new" "seen"
## [5,] "can" "amp" "energy" "today"
## Topic 5
## [1,] "climatechange"
## [2,] "climate"
## [3,] "climatecrisis"
## [4,] "environment"
## [5,] "amp"
# Create a topic model with 4 topics
topicmodl_4 <- LDA(dtm_climate_new, k = 4)
# Select and view the top 6 terms in the topic model
top_6terms <- terms(topicmodl_4, 6)
top_6terms
## Topic 1 Topic 2 Topic 3 Topic 4
## [1,] "climatechange" "climatechange" "climatechange" "climatechange"
## [2,] "climatecrisis" "years" "climate" "climatecrisis"
## [3,] "climate" "seen" "amp" "climate"
## [4,] "sustainability" "will" "energy" "amp"
## [5,] "amp" "well" "climatecrisis" "seals"
## [6,] "fridaysforfuture" "die" "need" "change"
Extract sentiment scores
Sentiment analysis is useful in social media monitoring since it gives an overview of people’s sentiments around a certain topic or product.
library(syuzhet)
sa.value <- get_nrc_sentiment(twt_climate$text)
# Calculate sum of sentiment scores
score <- colSums(sa.value[,])
# Convert the sum of scores to a data frame
score_df <- data.frame(score)
# Convert row names into 'sentiment' column and combine with sentiment scores
score_df2 <- cbind(sentiment = row.names(score_df),
score_df, row.names = NULL)
# Plot the sentiment scores
ggplot(data = score_df2, aes(x = sentiment, y = score, fill = sentiment)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Network analysis
Preparing data for a retweet network
A retweet network is a network of Twitter users who retweet tweets posted by other users.
twts_trvl <- search_tweets("#travel", n = 100)
# Extract source vertex and target vertex from the tweet data frame
rtwt_df <- twts_trvl[, c("screen_name" , "retweet_screen_name" )]
# Remove rows with missing values
rtwt_df_new <- rtwt_df[complete.cases(rtwt_df), ]
# Create a matrix
rtwt_matrx <- as.matrix(rtwt_df_new)
Create a retweet network
A retweet network can be built using the igraph package. Understanding the position of potential customers in a retweet network allows a brand to identify key players who are likely to retweet posts and spread brand messaging.
library(igraph)
# Convert the matrix to a retweet network
nw_rtweet <- graph_from_edgelist(el = rtwt_matrx, directed = TRUE)
# View the retweet network
print.igraph(nw_rtweet)
## IGRAPH 5b00d36 DN-- 80 61 --
## + attr: name (v/c)
## + edges from 5b00d36 (vertex names):
## [1] ReparSandra ->505Nomad tdg_trekking ->Dariozogbi
## [3] Domainbot1 ->HWdomains CapelliLaVita1 ->VIParis
## [5] carissahadid ->SecretFlying KingOfPentacl ->org_scp
## [7] texastwins2004 ->birdwriter7 TriadTravelogs ->Dastylishfoodie
## [9] ExploreAmadeus ->DeniseSanger ExploreAmadeus ->HulloSafaris
## [11] ExploreAmadeus ->CraftingDir ExploreAmadeus ->logomaco
## [13] ExploreAmadeus ->TravelYesPlease ExploreAmadeus ->rv_our
## [15] travel_biz_news->CraftingDir RubyPerry11 ->birdwriter7
## + ... omitted several edges
Calculate out-degree scores
In a retweet network, the out-degree of a user indicates the number of times the user retweets posts, while in-degree of a user indicates the number of times the user’s posts are retweeted. Users with high out-degree scores are key players who can be used as a medium to retweet promotional posts. Users with high in-degrees are influential as their tweets are retweeted many times.
# Calculate out-degree scores from the retweet network
out_degree <- degree(nw_rtweet, mode = c("out"))
# Sort the out-degree scores in decreasing order
out_degree_sort <- sort(out_degree, decreasing = TRUE)
# View users with the top 5 out-degree scores
out_degree_sort[1:5]
## myfoodfantasy69 ExploreAmadeus sadytushar ReparSandra tdg_trekking
## 15 6 2 1 1
# Compute the in-degree scores from the retweet network
in_degree <- degree(nw_rtweet, mode = c("in"))
# Sort the in-degree scores in decreasing order
in_degree_sort <- sort(in_degree, decreasing = TRUE)
# View users with the top 5 in-degree scores
in_degree_sort[1:5]
## Charlesfrize VIParis birdwriter7 CraftingDir 2WheelersLife
## 15 2 2 2 2
Calculate the betweenness scores
Betweenness centrality measures the extent to which a node lies on the paths between other nodes. In a retweet network, a user with a high betweenness centrality score has more control over the network because more information passes through that user.
# Calculate the betweenness scores from the retweet network
betwn_nw <- betweenness(nw_rtweet, directed = TRUE)
# Sort betweenness scores in decreasing order and round the values
betwn_nw_sort <- betwn_nw %>%
sort(decreasing = TRUE) %>%
round()
# View users with the top 5 betweenness scores
betwn_nw_sort[1:5]
## ReparSandra 505Nomad tdg_trekking Dariozogbi Domainbot1
## 0 0 0 0 0
Create a network plot with attributes
Visualizing Twitter networks makes complex network structures easier to understand at a glance.
# Create a network plot with formatting attributes
set.seed(1234)
plot(nw_rtweet, asp = 9/12,
vertex.size = 10,
vertex.color = "green",
edge.arrow.size = 0.5,
edge.color = "black",
vertex.label.cex = 0.9,
vertex.label.color = "black")
Network plot based on centrality measure
To add more information to the plot, the vertex size can be set proportional to the number of times the user retweets.
# Create a variable for out-degree
deg_out <- degree(nw_rtweet, mode = c("out"))
# Amplify the out-degree values
vert_size <- (deg_out * 4) + 5
# Set vertex size to amplified out-degree values
set.seed(1234)
plot(nw_rtweet, asp = 10/11,
vertex.size = vert_size, vertex.color = "lightblue",
edge.arrow.size = 0.5,
edge.color = "grey",
vertex.label.cex = 0.8,
vertex.label.color = "black")
Follower count to enhance the network plot
The users who retweet most add even more value if they also have high follower counts, since their retweets reach a wider audience. We can add this information to the plot by setting the vertex color to indicate the follower count.
followers <- twts_trvl[, c("screen_name" , "followers_count" )]
# Create a column and categorize follower counts above and below 500
followers$follow <- ifelse(followers$followers_count > 500, "1", "0")
# Assign the new column as vertex attribute to the retweet network
V(nw_rtweet)$followers <- followers$follow
vertex_attr(nw_rtweet)
## $name
## [1] "ReparSandra" "505Nomad" "tdg_trekking" "Dariozogbi"
## [5] "Domainbot1" "HWdomains" "CapelliLaVita1" "VIParis"
## [9] "carissahadid" "SecretFlying" "KingOfPentacl" "org_scp"
## [13] "texastwins2004" "birdwriter7" "TriadTravelogs" "Dastylishfoodie"
## [17] "ExploreAmadeus" "DeniseSanger" "HulloSafaris" "CraftingDir"
## [21] "logomaco" "TravelYesPlease" "rv_our" "travel_biz_news"
## [25] "RubyPerry11" "_DesertX" "Snezny1" "TamurilMinyatur"
## [29] "KastKe" "t_dalmar" "Ghereandthere" "Fabriziobustama"
## [33] "TriptiCharan" "Soofuro" "Apple_505050" "PenguinSix"
## [37] "Kabirkhan547680" "GulzarMustafak1" "ACAroundTown" "aaroundtown"
## [41] "sadytushar" "2WheelersLife" "SolespireDerek" "TRAVOH_travel"
## [45] "lovelaughterlug" "RoadTripsCoffee" "YouggyS" "bcgoodsintl"
## [49] "NewsBizLizzy" "jlessuck" "GloballyKenyan" "PrinceTrails"
## [53] "katr_elena" "SunnyHolidays4u" "myfoodfantasy69" "Charlesfrize"
## [57] "mwangiedwin504" "185a29a356b6406" "CompassAcademic" "travelmail"
## [61] "CarmanK1" "ArteLeonida" "PIPIENPierre" "CelinePivoine"
## [65] "IslamabadScene" "UMountaineer" "loriiejaane" "AllThingTravel"
## [69] "C_Two_Eagle" "wallpaperable" "Maitedalmau56" "Havenlust"
## [73] "dipali_atul" "anuradhagoyal" "precruise" "T1Texas"
## [77] "Jeanyor" "JimByersTravel" "Y4794" "simplyart4794"
##
## $followers
## [1] "1" "1" "0" "0" "1" "0" "1" "1" "0" "1" "0" "1" "0" "1" "1" "1" "1" "1" "1"
## [20] "1" "0" "1" "1" "0" "1" "1" "1" "1" "1" "1" "0" "1" "1" "0" "0" "1" "1" "1"
## [39] "1" "1" "0" "1" "0" "1" "0" "1" "0" "0" "0" "0" "0" "1" "1" "0" "0" "1" "1"
## [58] "0" "1" "0" "0" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [77] "1" "1" "0" "1"
# Set the vertex colors based on follower count and create a plot
sub_color <- c("pink", "lightblue")
plot(nw_rtweet, asp = 9/12,
vertex.size = vert_size, edge.arrow.size = 0.5,
vertex.label.cex = 0.8,
vertex.color = sub_color[as.factor(vertex_attr(nw_rtweet, "followers"))],
vertex.label.color = "black", vertex.frame.color = "grey")
Tweets geolocation
It is possible to extract the geolocation of tweets to gain insight into the popularity of a topic or product across a region.
# Extract 1000 tweets on #vegan
vegan <- search_tweets("#vegan", n = 1000)
# Extract geo-coordinates data to append as new columns
vegan_coord <- lat_lng(vegan)
# Omit rows with missing geo-coordinates in the data frame
vegan_geo <- na.omit(vegan_coord[,c("lat", "lng")])
# Plot the longitude and latitude values of the tweets on an interactive map
library(leaflet)
leaflet() %>%
addProviderTiles("CartoDB") %>%
addCircleMarkers(vegan_geo$lng, vegan_geo$lat)