Playing with the streamR package and visualizing Twitter conversations

This morning I started fooling around with the streamR package to access the Twitter Streaming API and see what data it captures and what can be done with it. For 90 minutes I captured tweets that originated in the United States. Because I was filtering on a specific geographic area, I necessarily missed tweets from users who do not use geolocation.

This assumes that you have already gained authorization and credentials to use the Twitter API.

library(hadleyverse)
library(maps)
library(streamR)

# open stream and capture tweets from the United States
filterStream("tweetsUS90.json", locations = c(-125, 25, -66, 50), timeout = 5400, 
              oauth = my_oauth)
tweets.df <- parseTweets("tweetsUS90.json", verbose = FALSE)
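
Once the tweets are parsed, a quick sanity check is to count the rows and the distinct accounts. The snippet below is a minimal sketch on a made-up toy data frame standing in for tweets.df; only the screen_name column name is taken from the parseTweets output, and the values are invented for illustration.

```r
# Toy stand-in for tweets.df (values are invented for illustration).
toy <- data.frame(
  screen_name = c("ann", "bob", "ann", "cat", "bob", "ann"),
  text = c("t1", "t2", "t3", "t4", "t5", "t6"),
  stringsAsFactors = FALSE
)

n_tweets <- nrow(toy)                        # total tweets captured
n_users  <- length(unique(toy$screen_name))  # distinct accounts

# Tweets per account, most active first.
per_user <- sort(table(toy$screen_name), decreasing = TRUE)
```

Swapping toy for tweets.df should give the same kind of counts for the real capture, and the head of per_user is one way to spot unusually active accounts.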

After 90 minutes this morning I had gathered 172,738 tweets from 85,837 unique users. As an initial check on the data, we can map the locations of the users in the dataset. As expected, the majority of the users are found in metropolitan areas.

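# coerce the place coordinates to numeric (named tweet_pts so the points() function is not masked)
tweet_pts <- data.frame(x = as.numeric(tweets.df$place_lon), y = as.numeric(tweets.df$place_lat))
# keep points north of 25 degrees latitude; which() also drops rows with missing coordinates
tweet_pts <- tweet_pts[which(tweet_pts$y > 25), ]

## plot the tweet locations on a map
xlim <- c(-124.738281, -66.601563)
ylim <- c(24.039321, 50.856229)
map("world", col="#E8E8E8", fill=TRUE, bg="white", lwd=0.4, xlim=xlim, ylim=ylim, interior=TRUE)
points(tweet_pts, pch=16, cex=.10, col="red")
# overlay state borders on the existing map
map("state", fill=FALSE, bg="white", add = TRUE)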

Based on the plot, we can see that the location filter properly captured data within the specified longitude and latitude bounds. But this is not a density plot, so the apparent concentration in the northeastern corridor is a bit misleading. In fact, users in the South posted more than twice as many tweets as users in any other region over the timeframe in question. The (ugly) table below lists the top 20 states and the number of tweets captured over the 90-minute span.

State  Tweets    State  Tweets    State  Tweets    State  Tweets
CA      22030    PA       6060    VA       4255    WA       3286
TX      19855    IL       5442    NJ       4200    IN       3273
FL      10762    GA       5261    MD       3663    LA       3006
NY      10564    MI       5237    AZ       3506    TN       2975
OH       7593    NC       4930    MA       3504    MO       2498
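
Counts like these can be produced with a simple tabulation. The snippet below is a rough sketch on toy data, assuming each tweet already carries a state abbreviation in a hypothetical st_abb column (how the states were assigned is not shown here). The regional grouping uses the census regions in R's built-in state.abb and state.region datasets, which may not match the regions used in this post.

```r
# Toy stand-in: one row per tweet, with a state abbreviation already attached.
toy <- data.frame(
  st_abb = c("CA", "CA", "TX", "TX", "TX", "NY", "FL", "OH"),
  stringsAsFactors = FALSE
)

# Tweets per state, largest first; head() keeps the top n states.
by_state   <- sort(table(toy$st_abb), decreasing = TRUE)
top_states <- head(by_state, 20)

# Map each abbreviation to a census region using datasets shipped with R.
# Assumption: the regions in the post may be defined differently.
toy$region <- as.character(state.region[match(toy$st_abb, state.abb)])
by_region  <- sort(table(toy$region), decreasing = TRUE)
```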

The filterStream function also captures the device or platform from which each tweet was sent, which lets us investigate the most popular platforms and devices in each region. The plot below shows the five most popular platforms for each region; in every region the iPhone is clearly the most popular. The top five platforms are consistent across regions, and while three of the five are native to Twitter, two are not. First, Instagram, once blocked from direct sharing on Twitter, is now a source of heavy user interaction. Second, and more popular than Instagram, is TweetMyJOBS, an online recruiting company that posts jobs through social media. As an anecdotal point of evidence, it appears the economy is doing well.
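
A per-region platform ranking can be sketched along the following lines. This assumes the source field arrives wrapped in an HTML anchor tag, as it appeared to in this capture; the toy rows, the URLs inside them, the region column, and the top_platforms helper are all made up for illustration.

```r
# Toy stand-in for the parsed tweets (rows and URLs are invented).
toy <- data.frame(
  region = c("South", "South", "South", "West", "West", "West"),
  source = c(
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://instagram.com" rel="nofollow">Instagram</a>',
    '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
  ),
  stringsAsFactors = FALSE
)

# Strip the HTML tags so only the platform label remains.
toy$platform <- gsub("<[^>]+>", "", toy$source)

# Hypothetical helper: for each region, count tweets per platform
# and keep the n most common.
top_platforms <- function(df, n = 5) {
  lapply(split(df$platform, df$region), function(p) {
    head(sort(table(p), decreasing = TRUE), n)
  })
}

top <- top_platforms(toy, n = 5)
```

The result is a named list with one small frequency table per region, which is a convenient shape for plotting one panel per region.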

The streamR package also captures the activity and popularity of each user. A histogram of users’ follower counts shows a strong right skew. The mean number of followers in the sample is 1,835, but the median is only 436, and even users at the 90th percentile have only 1,734. The mean is inflated by the super users of Twitter; within this sample we find accounts for perezhilton.com, NASA, Slate, and Domino’s Pizza. Each of these accounts is far into the tail of the distribution, with the perezhilton account clocking in at over 6 million followers.

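
To see how a handful of huge accounts can drag the mean above even the 90th percentile, here is a small sketch on invented follower counts (not the captured data): one hundred modest accounts plus a single account with 6 million followers.

```r
# Invented follower counts: 100 modest accounts plus one very large one.
followers <- c(seq(10, 1990, by = 20), 6000000)

avg <- mean(followers)                      # pulled far upward by the outlier
med <- median(followers)                    # a more typical account
p90 <- unname(quantile(followers, 0.9))     # 90th percentile

# With a heavy right tail the mean can exceed the 90th percentile.
mean_exceeds_p90 <- avg > p90
```

On the real data the same three summaries could be computed from the followers_count column that parseTweets returns.
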
Plotting the number of followers for each account against account activity indicates that there is not a strong relationship between frequent updates and earning followers. The correlation between the two is just 0.07. An example comes from the most frequent poster over the duration of the data collection: in 90 minutes, a spam-bot account posted 125 times. This account was created almost three years ago and in that time has posted over 37,000 status updates. It has 6 followers.

Finally, we can use this data to visualize the conversations being had on Twitter. Are people shouting into the void on the platform? Are they using it to interact with people near or far? The streamR package lets us begin thinking about this because it captures whether a tweet was in reply to another user (this does not necessarily imply a two-way conversation, but the tweet was at least in reference to a previous one; it wasn’t just a random thought thrown into the ether). With this data, one can then cross-reference the recipient to see whether they had previously sent a geocoded tweet during the span of data collected. With only a 90-minute window, the number of positive cases was low (1,254 “conversations” were had), but it does provide some data to play with.
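
The cross-referencing step can be sketched as a join of replies against the accounts seen in the capture. This is a simplified illustration on toy data, not the code from the repository: it uses the screen_name, in_reply_to_screen_name, place_lon and place_lat columns from parseTweets, the users and conversations names are made up, and it ignores whether the recipient's tweet came before or after the reply.

```r
# Toy stand-in for tweets.df (accounts and coordinates are invented).
toy <- data.frame(
  screen_name = c("ann", "bob", "cat", "dan", "ann"),
  in_reply_to_screen_name = c(NA, "ann", "zed", NA, "bob"),
  place_lon = c(-122.4, -74.0, -87.6, -95.4, -122.4),
  place_lat = c(37.8, 40.7, 41.9, 29.8, 37.8),
  stringsAsFactors = FALSE
)

# One location per account seen in the capture (its first tweet).
users <- toy[!duplicated(toy$screen_name),
             c("screen_name", "place_lon", "place_lat")]
names(users) <- c("recipient", "to_lon", "to_lat")

# Keep only replies, then keep those whose recipient also appears in the capture.
replies <- toy[!is.na(toy$in_reply_to_screen_name), ]
conversations <- merge(replies, users,
                       by.x = "in_reply_to_screen_name", by.y = "recipient")
```

Each row of conversations then holds the sender's coordinates (place_lon, place_lat) and the recipient's (to_lon, to_lat), which is enough to draw a line between the two on a map.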

Full replication code can be found at https://github.com/taylorgrant/sandbox/tree/master/Twitter