This morning I started fooling around with the streamR package to access the Twitter Streaming API and see what data it captures and what can be done with it. For 90 minutes I captured all tweets that originated in the United States. Because I was filtering for tweets within a specific geographic area, I necessarily missed tweets from users who don't use geolocation.

The code below assumes that you have already obtained authorization and credentials for the Twitter API (stored here as my_oauth).

library(hadleyverse)
library(maps)
library(streamR)

# open stream and capture tweets from the United States
filterStream("tweetsUS90.json", locations = c(-125, 25, -66, 50), timeout = 5400, 
              oauth = my_oauth)
tweets.df <- parseTweets("tweetsUS90.json", verbose = FALSE)

After 90 minutes I had gathered 172,738 tweets from 85,837 unique users. As an initial check on the data, we can map the locations of the users in the dataset. As expected, the majority of the users are found in metropolitan areas.

points <- data.frame(x = as.numeric(tweets.df$place_lon), y = as.numeric(tweets.df$place_lat))
points <- points[points$y > 25, ]

## plot out the usage on a map
xlim <- c(-124.738281, -66.601563)
ylim <- c(24.039321, 50.856229)
map("world", col="#E8E8E8", fill=TRUE, bg="white", lwd=0.4, xlim=xlim, ylim=ylim, interior=TRUE)
points(points, pch=16, cex=.10, col="red")
map("state", fill=FALSE, bg="white", add = TRUE)

Based on the plot, we can see that the location filter properly captured data within the specified longitude and latitude. But this is not a density plot, so the apparent dominance of the northeastern corridor is a bit misleading. In fact, users in the South posted more than twice as many tweets as users in any other region over the timeframe in question. The (ugly) table below shows the top 20 states and the number of tweets captured over the 90-minute span.

st_abb  tweets    st_abb  tweets    st_abb  tweets    st_abb  tweets
CA       22030    PA        6060    VA        4255    WA        3286
TX       19855    IL        5442    NJ        4200    IN        3273
FL       10762    GA        5261    MD        3663    LA        3006
NY       10564    MI        5237    AZ        3506    TN        2975
OH        7593    NC        4930    MA        3504    MO        2498
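
A tally like the one above could be produced along these lines. This is only a sketch, not the code used for the table: it runs on a small made-up data frame, and it assumes that the full_name column returned by parseTweets holds city-level places written as "City, ST".

```r
# Toy stand-in for tweets.df (hypothetical rows); the real frame
# comes from parseTweets().
toy <- data.frame(
  full_name = c("Los Angeles, CA", "Austin, TX", "San Diego, CA",
                "Texas, USA", "Houston, TX"),
  stringsAsFactors = FALSE
)

# Assumption: city-level places look like "City, ST". Keep whatever
# follows the last ", " and drop anything that is not a state
# abbreviation (e.g. the "USA" in "Texas, USA").
st_abb <- sub(".*, ", "", toy$full_name)
st_abb <- st_abb[st_abb %in% state.abb]

# Count tweets per state, largest first
state_counts <- sort(table(st_abb), decreasing = TRUE)
print(state_counts)
```

State-level places such as "Texas, USA" are simply dropped here; a fuller version would need to decide how to count them.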

The filterStream function also captures the type of device or platform from which the tweet was generated. This allows us to investigate the most popular platforms and devices for each region. The plot below captures the top 5 most popular platform types for each region, and in each the iPhone is certainly the most popular. The top 5 platforms are consistent across regions, and while three of the top 5 in each region are native to Twitter, two are not. First, Instagram, once blocked from direct sharing on Twitter, is now a source of heavy user interaction. Second, more popular than Instagram is TweetMyJOBS, an online recruiting company that posts jobs through social media. As an anecdotal point of evidence, it appears the economy is doing well.
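
For anyone who wants to reproduce the platform ranking, here is a rough sketch on made-up data. It assumes the source column holds an HTML anchor whose link text is the platform name, and that each tweet has already been assigned a region; the region column here is hypothetical, since the regional grouping is not defined in this post.

```r
# Toy stand-in for tweets.df; "region" is a hypothetical column.
toy <- data.frame(
  region = c("South", "South", "South", "West", "West", "West"),
  source = c(
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://instagram.com" rel="nofollow">Instagram</a>',
    '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
  ),
  stringsAsFactors = FALSE
)

# Strip the HTML tags so only the platform name remains
toy$platform <- gsub("<[^>]+>", "", toy$source)

# For each region, count platforms and keep the five most common
top_platforms <- lapply(
  split(toy$platform, toy$region),
  function(p) head(sort(table(p), decreasing = TRUE), 5)
)
print(top_platforms)
```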

The streamR package also captures the activity and popularity of each of these users. A histogram of users’ follower counts shows a strong right skew. The mean number of followers for the sample is 1835, but the median is only 436, and even users at the 90th percentile have only 1734. The mean is being inflated by the super users of Twitter; within this sample, we find accounts for perezhilton.com, NASA, Slate, and Domino’s Pizza. Each of these accounts is far into the tail of the distribution, with the perezhilton account clocking in at over 6 million followers.
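
The gap between mean and median is easy to see on a toy example. The sketch below uses invented follower counts, not the collected data, and assumes one row per tweet with user_id_str and followers_count columns, so each user is counted once before summarizing.

```r
# Toy stand-in for tweets.df (invented values); one row per tweet,
# so user "a" appears twice.
toy <- data.frame(
  user_id_str     = c("a", "a", "b", "c", "d", "e"),
  followers_count = c(12, 12, 150, 436, 800, 6000000),
  stringsAsFactors = FALSE
)

# Keep one row per user so prolific posters are not over-weighted
per_user <- toy[!duplicated(toy$user_id_str), ]

# A single very large account pulls the mean far above the median
follower_mean   <- mean(per_user$followers_count)
follower_median <- median(per_user$followers_count)
follower_p90    <- quantile(per_user$followers_count, 0.9)

print(c(mean = follower_mean, median = follower_median))
```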

Visualizing the number of followers for each account as a function of account activity indicates that there is not a strong relationship between frequent updates and earning followers. The correlation between the two is just .07. An example of this comes from the most frequent poster over the duration of the data collected. In 90 minutes, a spam-bot account posted 125 times. This account was created almost three years ago, and in that time has posted over 37,000 status updates. It has 6 followers.

Finally, we can use this data to visualize the conversations that are being had on Twitter. Are people shouting into the void on the platform? Are they using it to interact with people near or far? The streamR package allows us to begin thinking about this because it captures whether or not a tweet was in reply to another user (this does not necessarily imply a two-way conversation, but the tweet was at least a response to a previous one; it wasn’t just a random thought thrown into the ether). With this data, one can then cross-reference the recipient to see whether they had previously sent a geocoded tweet during the span of data collected. With only a 90-minute window, the number of positive cases was low - 1254 “conversations” were had - but it does provide some data to play with.
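
The cross-referencing step might look something like the sketch below, again on made-up rows. It assumes the screen_name, in_reply_to_screen_name, and place_lat columns from parseTweets, and it simplifies matters by ignoring timing: a reply counts as a match if the recipient has any geocoded tweet in the sample, not only an earlier one.

```r
# Toy stand-in for tweets.df (hypothetical users)
toy <- data.frame(
  screen_name             = c("ann", "bob", "cat", "ann"),
  in_reply_to_screen_name = c(NA, "ann", "zed", "bob"),
  place_lat               = c(34.0, 40.7, 29.8, 34.0),
  stringsAsFactors = FALSE
)

# Tweets that were sent in reply to another user
replies <- toy[!is.na(toy$in_reply_to_screen_name), ]

# Users with at least one geocoded tweet in the sample
geocoded_users <- unique(toy$screen_name[!is.na(toy$place_lat)])

# Keep replies whose recipient also appears among the geocoded users
conversations <- replies[replies$in_reply_to_screen_name %in% geocoded_users, ]
print(conversations)
```

Here the reply to "zed" is dropped because "zed" never appears as a sender in the toy data.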

Full replication code can be found at https://github.com/taylorgrant/sandbox/tree/master/Twitter