The question “what’s the best city for data science?” came up on the September “Not So Standard Deviations” podcast. To inject some analysis into the discussion, I used the twitteR package to measure interest in R by computing the “flux” of tweets with the #rstats hashtag.
The top metro areas are New York, Boston, and the SF Bay area, with a tweet flux of about 50 #rstats tweets per million residents per day (“twipermipeds”). Other leading cities include Long Beach, Washington DC, Seattle, Raleigh NC, and Henderson NV.
Even Portland, Oregon (yay) weighs in within the top 15 on tweets. Results are sensitive to assumptions about metro size and show some short-term time dependence.
This was quick and dirty, so there's no telling how stable the results will be over a longer period.
In their September NSSD podcast, Hilary and Roger discussed “the best city for data science.” Let’s try measuring something, just to inject a little analysis into the discussion.
Twitter is a good source of data on “interest” in topics, since it is both timely and social.
Setting up the Twitter API is relatively quick.
library(twitteR)
## create URLs
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
## read file containing secret keys (obtained from apps.twitter.com)
keys <- read.table("/users/winstonsaunders/documents/city_politics/secret_t.key.txt", stringsAsFactors = FALSE, col.names = "secret" )
## extract the individual keys (already characters, since stringsAsFactors = FALSE)
consumerKey <- keys$secret[1]
consumerSecret <- keys$secret[2]
accessToken <- keys$secret[3]
accessTokenSecret <- keys$secret[4]
## set up authentication
setup_twitter_oauth(consumerKey, consumerSecret, accessToken, accessTokenSecret)
[1] "Using direct authentication"
Test functionality with George Takei’s latest tweet
## test functionality with a recent @GeorgeTakei tweet
userTimeline(getUser('georgetakei'), n=1, includeRts=FALSE, excludeReplies=FALSE)[[1]]
[1] "GeorgeTakei: Trump has reimbursed his own companies 8.2mil in campaign expenses. He once said he should run for president b/c he'd make a lot of money..."
It works!
We’ll also need to localize tweets to cities, and thus need the latitude and longitude of major US cities. It would also be nice to normalize the data to population. It turns out city coordinates and populations are available in the super-convenient {maps} package.
require(maps)
library(dplyr)
cities <- us.cities
## create city_name by removing the two-letter state designator
cities <- cities %>% mutate(city_name = gsub(' [A-Z]{2,}', '', name))
## drop incomplete rows and convert to a tbl_df
cities <- cities[complete.cases(cities), ] %>% as_data_frame
## sort by population, largest first
cities <- cities[order(cities$pop, decreasing = TRUE), ]
According to the package, the top US cities by population are:
name | country.etc | pop | lat | long | capital | city_name |
---|---|---|---|---|---|---|
New York NY | NY | 8124427 | 40.67 | -73.94 | 0 | New York |
Los Angeles CA | CA | 3911500 | 34.11 | -118.41 | 0 | Los Angeles |
Chicago IL | IL | 2830144 | 41.84 | -87.68 | 0 | Chicago |
Houston TX | TX | 2043005 | 29.77 | -95.39 | 0 | Houston |
Looks right…
To get the tweet data, use the twitteR::searchTwitter command. Data collection uses the following parameters.
## set up search terms
searchString.x <- "#rstats" # search term
n.x <- 900 # number of tweets
radius <- "10mi" # radius around selected geo-location
duration.days <- 14 # how many days
since.date <- (Sys.Date() - duration.days) %>% as.character # calculated starting date
Note the radius of 10mi, which is used to localize the tweets collected around specific geo-locations. Where major cities are in close proximity, this certainly picks up some redundant tweets. More work needed here…
n.cities <- 57
I pull data for the top 57 cities (by population) in the U.S. This includes cities from New York NY to Riverside CA.
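For concreteness, here is a minimal sketch of what the collection loop could look like, assuming the cities data frame built above and the parameters just defined; the loop body and the column names of collected_df are my assumptions, not verbatim code. Note that searchTwitter takes its geocode argument as a "latitude,longitude,radius" string.
## minimal sketch of the collection step (assumed): loop over the most
## populous cities, count #rstats tweets near each city center, and bind
## the per-city results into collected_df
top_cities <- cities[1:n.cities, ]
collected_df <- bind_rows(lapply(seq_len(n.cities), function(i) {
    city <- top_cities[i, ]
    ## geocode string in the "lat,long,radius" format searchTwitter expects
    geocode.x <- paste(city$lat, city$long, radius, sep = ",")
    tweets <- searchTwitter(searchString.x, n = n.x,
                            since = since.date, geocode = geocode.x)
    data_frame(name = city$name, lon = city$long, lat = city$lat,
               n.tweets = length(tweets), population = city$pop)
}))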
Once collected, the data are lightly analyzed. Specifically, the tweet.flux, representing the number of tweets per million people per day (“twipermipeds”), is computed.
analyzed_df <- collected_df %>%
    ## tweets per million residents per day
    mutate("tweet.flux" = 10^6 * n.tweets / population / duration.days) %>%
    select(name, lon, lat, tweet.flux, n.tweets, population)
Collected data are stored in collected_df. For this first-pass analysis, tweets are counted but not cached.
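If you did want to cache the raw tweets (say, to avoid re-hitting the rate-limited search API on re-runs), a minimal sketch using twitteR::twListToDF might look like this; the coordinates and file name are illustrative.
## possible caching step (skipped in the first pass above): flatten the
## raw status objects to a data frame and save them to disk
geocode.boston <- "42.34,-71.02,10mi" # approximate Boston coordinates
tweets <- searchTwitter(searchString.x, n = n.x, geocode = geocode.boston)
write.csv(twListToDF(tweets), "rstats_tweets_boston.csv", row.names = FALSE)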
Use the {ggmap} package to get a base Google map.
library(ggmap)
map <- get_googlemap(center = c(lon = -95.58, lat = 36.83),
                     zoom = 3, size = c(390, 250), scale = 2, source = "google",
                     maptype = "roadmap") #, key = my.secret.key)
map.plot <- ggmap(map)
After that, standard ggplot2 functions are used to plot the data. Note that several dimensions of data are shown: the latitude and longitude represent the geolocation of the town, the size of the point represents the number of tweets n.tweets, and the shading of the dot represents the tweet.flux in “twipermipeds.”
map.plot +
geom_point(aes(x = lon, y = lat, fill = tweet.flux, size = n.tweets), data=analyzed_df, pch=21, color = "#33333399") +
ggtitle(paste0(searchString.x, " tweets for ", duration.days," days since ", since.date, " within ", radius, " of metro center")) +
scale_fill_gradient(low = "#BBBBFF", high = "#EE3300", space = "Lab", na.value = "grey50", guide = "colourbar")
Here are the top few cities by tweet flux (in “twipermipeds”).
name | tweet.flux | n.tweets | population |
---|---|---|---|
Boston MA | 60.89 | 484 | 567759 |
Oakland CA | 14.15 | 78 | 393632 |
WASHINGTON DC | 13.55 | 104 | 548359 |
Seattle WA | 9.27 | 74 | 570430 |
Chicago IL | 8.10 | 321 | 2830144 |
Arlington TX | 4.57 | 24 | 374729 |
Portland OR | 3.68 | 28 | 542751 |
Jacksonville FL | 3.62 | 41 | 809874 |
Tampa FL | 3.48 | 16 | 328578 |
Las Vegas NV | 2.97 | 23 | 553807 |
Fort Worth TX | 2.70 | 24 | 633849 |
San Francisco CA | 2.37 | 24 | 723724 |
Atlanta GA | 2.36 | 14 | 424096 |
New York NY | 2.02 | 230 | 8124427 |
Denver CO | 1.93 | 15 | 556575 |
Here are the top few cities sorted by raw tweet count, again with major metro areas leading. Note that some cities, like Chicago, have a large number of tweets but a lower flux because of their higher population.
name | tweet.flux | n.tweets | population |
---|---|---|---|
Boston MA | 60.89 | 484 | 567759 |
Chicago IL | 8.10 | 321 | 2830144 |
New York NY | 2.02 | 230 | 8124427 |
WASHINGTON DC | 13.55 | 104 | 548359 |
Los Angeles CA | 1.79 | 98 | 3911500 |
Oakland CA | 14.15 | 78 | 393632 |
Seattle WA | 9.27 | 74 | 570430 |
Jacksonville FL | 3.62 | 41 | 809874 |
Portland OR | 3.68 | 28 | 542751 |
Houston TX | 0.84 | 24 | 2043005 |
San Francisco CA | 2.37 | 24 | 723724 |
Fort Worth TX | 2.70 | 24 | 633849 |
Arlington TX | 4.57 | 24 | 374729 |
Las Vegas NV | 2.97 | 23 | 553807 |
Philadelphia PA | 1.04 | 21 | 1439814 |
Using #rstats tweets, we find Boston leads in overall tweets, followed by Chicago and NYC. Tweet flux shows a different pattern, with Boston still leading but less populous cities moving up the ranks in social discussions about R. This says little, directly, about overall ‘data science’, but it does indicate that heavy usage of a powerful data science tool is localized to a handful of US cities.
Results show short-term instability: even within a period of hours, results can change. While problematic for this particular analysis, this does suggest the methodology could be used to address other questions about timely reactions.
Normalizing the data for the flux measurement is a key challenge. For instance, it’s likely many of the same tweets are captured for both Newark and Jersey City (since they are in close proximity), representing double-counting that would alter the tweet flux measurement. Accounting for things like metropolitan areas, “likely” users, numbers of startups and academic institutions, etc. could improve the methodology.
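As a first step toward quantifying that double-counting, here is a rough sketch (not part of the analysis above) that flags city pairs whose 10mi search circles overlap, i.e. whose centers are within 20 miles of each other, using a hand-rolled haversine distance:
## flag city pairs whose search circles overlap (centers < 2 * 10mi apart),
## since tweets in the overlap region are counted once for each city
haversine_mi <- function(lat1, lon1, lat2, lon2) {
    to_rad <- pi / 180
    dlat <- (lat2 - lat1) * to_rad
    dlon <- (lon2 - lon1) * to_rad
    a <- sin(dlat / 2)^2 +
        cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
    2 * 3959 * asin(sqrt(pmin(1, a))) # Earth radius ~3959 mi
}
pairs <- t(combn(seq_len(nrow(analyzed_df)), 2))
d.mi <- haversine_mi(analyzed_df$lat[pairs[, 1]], analyzed_df$lon[pairs[, 1]],
                     analyzed_df$lat[pairs[, 2]], analyzed_df$lon[pairs[, 2]])
overlap_df <- data_frame(city.1 = analyzed_df$name[pairs[, 1]],
                         city.2 = analyzed_df$name[pairs[, 2]],
                         dist.mi = d.mi) %>% filter(dist.mi < 20)
Overlapping pairs could then be merged into single metro areas (summing their populations) before recomputing the flux.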
> “twipermipeds” == definitely a thing.