Summary

The question “what’s the best city for data science?” was asked on the September “Not So Standard Deviations” podcast. To inject some analysis into the discussion, I used the twitteR package to measure interest in R by computing the “flux” of tweets with the #rstats hashtag.
The top metro areas are New York, Boston, and the SF Bay area, with a tweet flux of about 50 #rstats tweets per million residents per day (“twipermipeds”). Other leading cities include Long Beach, Washington DC, Seattle, Raleigh NC, and Henderson NV.
Even Portland, Oregon (yay) weighs in within the top 15 on tweets. Results are sensitive to assumptions about metro size and show some short-term time dependence.
This was quick and dirty, so there’s no telling how stable the results will be over longer periods.

Problem Statement

In their September NSSD Podcast, Hilary and Roger discussed “the best city for data science.” Let’s try measuring something just to inject a little analysis into the discussion.

Where do I get data?

Twitter is a good source of data on “interest” in topics since it is both timely and social.

Tweets, the Twitter API, and the twitteR package

Setting up the Twitter API is relatively quick.

library(twitteR)
## create URLs
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"

## read file containing secret keys (obtained from apps.twitter.com)
keys <- read.table("/users/winstonsaunders/documents/city_politics/secret_t.key.txt", stringsAsFactors = FALSE, col.names = "secret" )
## extract the individual keys (stringsAsFactors = FALSE keeps them as characters)
consumerKey       <- keys$secret[1] 
consumerSecret    <- keys$secret[2] 
accessToken       <- keys$secret[3] 
accessTokenSecret <- keys$secret[4] 

## set up authentication
setup_twitter_oauth(consumerKey, consumerSecret, accessToken, accessTokenSecret)
[1] "Using direct authentication"

Test functionality with George Takei’s latest tweet

## test functionality with georgetakei tweets
userTimeline(getUser('georgetakei'), n=1, includeRts=FALSE, excludeReplies=FALSE)[[1]]
[1] "GeorgeTakei: Trump has reimbursed his own companies 8.2mil in campaign expenses. He once said he should run for president b/c he'd make a lot of money..."

It works!

City populations and geo-locations

We’ll also need to localize tweets to cities, and thus need the lat and lon of major US cities. It’d also be nice to normalize the data to population. It turns out city coordinates and populations are available in the super-convenient {maps} package.

    require(maps)
    library(dplyr)    # needed for %>%, mutate, and as_data_frame
    cities <- us.cities

    ## create city_name by removing the state designator
    cities <- cities %>% mutate(city_name = gsub(' [A-Z]{2,}','', name))
    ## drop incomplete rows
    cities <- cities[complete.cases(cities),] %>% as_data_frame
    ## sort by population, largest first
    cities <- cities[order(cities$pop, decreasing = TRUE), ]

According to the package, the top US cities by population are:
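These are just the first rows of the sorted data frame:

    ## peek at the four most populous cities
    head(cities, 4)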

name            country.etc      pop    lat     long  capital  city_name
New York NY     NY           8124427  40.67   -73.94        0  New York
Los Angeles CA  CA           3911500  34.11  -118.41        0  Los Angeles
Chicago IL      IL           2830144  41.84   -87.68        0  Chicago
Houston TX      TX           2043005  29.77   -95.39        0  Houston

Looks right…

Getting the tweets

To get the tweet data, use the twitteR::searchTwitter command. Data collection uses the following parameters.

## set up search terms
searchString.x <- "#rstats"    # search term
n.x <- 900                     # number of tweets
radius <- "10mi"               # radius around selected geo-location
duration.days <- 14             # how many days
since.date <- (Sys.Date() - duration.days) %>% as.character # calculated starting date

Note the radius of 10mi, which is used to localize tweets collected around specific geo-locations. For cases where major cities are in close proximity, this certainly picks up some redundant tweets. More work needed here…

n.cities <- 57

I pull data for the top 57 cities (by population) in the U.S. This includes cities from New York NY to Riverside CA.
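The collection loop itself isn’t shown here; a minimal sketch, reusing the parameters defined above (collected_df is named in the text below, but the loop details are my assumption), might look like this:

## sketch of the collection loop over the top n.cities metros
collected_df <- data_frame()

for (i in 1:n.cities) {
    ## searchTwitter takes a "lat,lon,radius" geocode string
    geocode.x <- paste0(cities$lat[i], ",", cities$long[i], ",", radius)

    ## pull up to n.x matching tweets near this metro center
    tweets.x <- searchTwitter(searchString.x, n = n.x,
                              since = since.date, geocode = geocode.x)

    ## count tweets only; tweet content is not cached in this first pass
    collected_df <- bind_rows(collected_df,
                              data_frame(name       = cities$name[i],
                                         lon        = cities$long[i],
                                         lat        = cities$lat[i],
                                         n.tweets   = length(tweets.x),
                                         population = cities$pop[i]))
}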

Analysis

Once collected, the data are lightly analyzed. Specifically, the ‘tweet.flux’, representing the number of tweets per million people per day (“twipermipeds”), is computed.

analyzed_df <- collected_df %>% 
    mutate("tweet.flux" = 10^6 * n.tweets/population/duration.days ) %>% 
    select(name, lon, lat, tweet.flux, n.tweets, population)

Collected data are put into collected_df. For this first-pass analysis, tweets are counted but not cached. As a sanity check on the formula: Boston’s 484 tweets among 567,759 residents over 14 days give 10^6 × 484 / 567759 / 14 ≈ 60.9 twipermipeds, matching the table below.

So, what does the Tweet-Map look like?

Use the {ggmap} package to get a base Google map.

    library(ggmap)
    map <- get_googlemap(center = c(lon = -95.58, lat = 36.83), 
              zoom = 3, size = c(390, 250), scale = 2, source = "google",
              maptype = "roadmap") #, key = my.secret.key)
    map.plot <- ggmap(map)

After that, standard ggplot2 functions are used to plot the data. Note that several dimensions of data are shown. The latitude and longitude represent the geo-location of the city. The size of the point represents the number of tweets n.tweets, and the shading of the dot represents the tweet.flux in “twipermipeds.”

map.plot +
    geom_point(aes(x = lon, y = lat, fill = tweet.flux, size = n.tweets), data=analyzed_df, pch=21, color = "#33333399") +
    ggtitle(paste0(searchString.x, " tweets for ", duration.days," days since ", since.date, " within ", radius, " of metro center")) +
    scale_fill_gradient(low = "#BBBBFF", high = "#EE3300", space = "Lab", na.value = "grey50", guide = "colourbar")

What are the top cities in #rstats?

Top cities by twipermipeds

Here are the top few cities by tweet flux (in “twipermipeds”).
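This ranking is just a sort on the flux column; a minimal sketch, assuming analyzed_df as computed above:

analyzed_df %>%
    arrange(desc(tweet.flux)) %>%
    select(name, tweet.flux, n.tweets, population) %>%
    head(15)

The raw-tweet ranking further below uses the same approach with arrange(desc(n.tweets)).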

name              tweet.flux  n.tweets  population
Boston MA              60.89       484      567759
Oakland CA             14.15        78      393632
WASHINGTON DC          13.55       104      548359
Seattle WA              9.27        74      570430
Chicago IL              8.10       321     2830144
Arlington TX            4.57        24      374729
Portland OR             3.68        28      542751
Jacksonville FL         3.62        41      809874
Tampa FL                3.48        16      328578
Las Vegas NV            2.97        23      553807
Fort Worth TX           2.70        24      633849
San Francisco CA        2.37        24      723724
Atlanta GA              2.36        14      424096
New York NY             2.02       230     8124427
Denver CO               1.93        15      556575

Top cities by raw tweets

Here are the top few cities sorted by raw tweets, again with major metro areas leading. Note that some other cities, like Chicago, have a large number of tweets but a lower flux because of their higher population.

name              tweet.flux  n.tweets  population
Boston MA              60.89       484      567759
Chicago IL              8.10       321     2830144
New York NY             2.02       230     8124427
WASHINGTON DC          13.55       104      548359
Los Angeles CA          1.79        98     3911500
Oakland CA             14.15        78      393632
Seattle WA              9.27        74      570430
Jacksonville FL         3.62        41      809874
Portland OR             3.68        28      542751
Houston TX              0.84        24     2043005
San Francisco CA        2.37        24      723724
Fort Worth TX           2.70        24      633849
Arlington TX            4.57        24      374729
Las Vegas NV            2.97        23      553807
Philadelphia PA         1.04        21     1439814

Summary

Using #rstats tweets, we find Boston leads in overall tweets, followed by Chicago and NYC. Tweet flux shows a different behavior: Boston still leads, but less populous cities move up the ranks in social discussions about R. This says little, directly, about overall ‘data science’, but it does indicate that heavy usage of a powerful data science tool is localized to a handful of US cities.
Results show short-term instability: even within a period of hours, results can change. While problematic for this particular analysis, it does suggest the methodology could potentially be used to address other questions about timely reactions.
Normalizing the data for the flux measurement is a key challenge. For instance, it’s likely many of the same tweets are captured for both Newark and Jersey City (since they are in close proximity), representing a double-counting that would alter the tweet flux measurement. Including things like metropolitan areas, “likely” users, numbers of startups and academic institutions, etc. could possibly improve the methodology.
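One possible refinement, sketched here under the assumption that tweet IDs are cached during collection (twitteR::twListToDF converts search results to a data frame that includes an id column), is to drop duplicate IDs before counting:

## hypothetical de-duplication; assumes cached_tweets_df has columns
## id (the tweet ID) and name (the metro the tweet was collected under)
deduped_counts <- cached_tweets_df %>%
    distinct(id, .keep_all = TRUE) %>%    # keep each tweet ID exactly once
    group_by(name) %>%
    summarize(n.tweets = n())             # recount tweets per metro

Note this credits a shared tweet to whichever metro kept it, so a smarter assignment (e.g., to the nearest metro center) might be a further step.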
> “twipermipeds” == definitely a thing.