Intro

Here we look at how the sentiment of tweets varies across US cities. No large geographic trends are apparent, but there do appear to be some local patterns in the positivity and negativity of tweets. The information here represents a snapshot in time from 139 cities in the United States. We only sampled 30 tweets from each city due to restrictions of the Twitter application programming interface (API), so our sample sizes are relatively small. Also, since we only sampled at most 3 cities from each state, some geographic areas didn’t get as much coverage as others. A prime example of this is the east coast vs. the west coast. There are 14 states bordering the Atlantic Ocean but only 3 bordering the Pacific, so we sampled many more cities on the east coast than on the west coast. It would be interesting to increase the sample size for each city (to at least 100 tweets per city), the number of cities included, and the time window.



Process Overview

We chose 1 to 3 of the largest cities from each state (and the District of Columbia), excluding Alaska and Hawaii. We used the us.cities data set from the maps package to select our cities. For some states only 1 or 2 cities were chosen because there weren’t any other cities in that state large enough to make it into the data set. In total we looked at 139 US cities. In each of these cities we looked at the 30 most recently posted (as of our search time) tweets. Since the cities are spread across different timezones, we grabbed tweets at different global times depending on each city’s timezone so that all the tweets came from roughly similar local times. For example, if we compared east coast tweets posted at 12 pm EST to west coast tweets posted at 12 pm EST, we would have to take into account that the west coast tweeters were posting at 9 am (and maybe they just got into work and are a little stressed) while the east coast tweeters were posting at 12 pm (and perhaps happily eating lunch). Thus we categorized cities by their Greenwich Mean Time (GMT) offset using the GeoNames API and the geonames package. Check out http://www.geonames.org/ and https://cran.r-project.org/web/packages/geonames/geonames.pdf for more information.
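
As a minimal sketch of the timezone lookup (the username below is a placeholder; the geonames package requires registering a free GeoNames account and setting it with options()), the GMT offset for a single point can be retrieved like this:

library(geonames)

# Placeholder username -- register for free at geonames.org and use your own
options(geonamesUsername = 'your_geonames_username')

# Timezone information for a single location (Portland, ME as an illustration);
# the returned data frame includes the GMT offset for that point
GNtimezone(43.66, -70.26, radius = 1)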

We had 58 cities in the eastern time zone (our largest category). The basic Twitter API restricts each user to 450 GET requests for each 15 minute interval (see here for more info). For each timezone, we went through the list of cities and downloaded the 30 most recent tweets from each using the searchTwitter function from the twitteR package. When we reached the download limit (15 cities: \(15 \times 30 = 450\)), we paused for fifteen minutes before continuing. We did this repeatedly until we had 30 tweets for all the cities in the timezone. We used the store_tweets_db function (also from the twitteR package) to store tweets in an SQLite database called “Tweet Database.sqlite”, with a table of tweets for each city (139 tables). We restricted our search to English tweets within a five-mile radius of the latitude and longitude of the city as recorded in the us.cities data set. See the user-created function search.and.store for details. The tweets that we fetched all occurred between 12:30 and 1:30 pm (local city time) on November 18th, 2017.

After acquiring the tweets, we performed sentiment analysis as shown in David Robinson’s article and further explained in Prof. James Suleiman’s notes. The metrics we looked at were the proportion of positive words in the tweets, the proportion of negative words, and the ratio of the positive proportion to the negative proportion (the positive to negative ratio). Finally, we used the leaflet package to visualize the results.

Enjoy!




Packages

Click “Code” to see a list of all the packages used.

library(tidyverse)
library(twitteR)
library(maps)
library(geonames)
library(pander)
library(RSQLite)
library(tidytext)
library(leaflet)

Preparing Data Frame of City Locations

Our first step is to prepare a data set that contains the name, location, GMT offset, and geocode (which will be used in the searchTwitter call) for each city that we will be looking at. The first rows of this data set are shown below. Click “Code” to see how it was constructed.

locations <- us.cities %>% 
     group_by(country.etc) %>% 
     filter(min_rank(desc(pop)) <= 3) %>% 
     ungroup() %>% 
     filter(!(country.etc %in% c('AK', 'HI'))) %>% 
     rename(state = country.etc)

# Create a gcode column that contains the value that will be supplied to the
# geocode argument in searchTwitter()
locations <- locations %>% 
     mutate(gcode = paste(lat,long,'5mi', sep = ','),
            name = gsub(' ', '_', name)) %>% 
     select(-capital)



# The call below does NOT work. It just gives each location the GMT
# offset of the first location in the data frame (Albuquerque, NM)
locations <- locations %>% 
     mutate(GMT.offset = GNtimezone(lag(lat, 0), long, radius = 1)[,4])
# Since the above code doesn't work I'll use a for loop:
for(i in 1:nrow(locations)){
     locations$GMT.offset[i] <- GNtimezone(locations$lat[i],
                                           locations$long[i],
                                           radius = 1)[,4]
}
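
# An alternative sketch (commented out, not tested here) that avoids the
# explicit loop: purrr's map2_dbl() calls GNtimezone() once per lat/long pair
# and returns a numeric vector (still assuming the offset sits in column 4).
# locations <- locations %>%
#      mutate(GMT.offset = map2_dbl(lat, long,
#                                   ~ GNtimezone(.x, .y, radius = 1)[, 4]))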

pandoc.table(head(locations),
             split.cells = Inf,
             split.tables = Inf)
name                  state  pop     lat    long     gcode               GMT.offset
Albuquerque_NM        NM     494962  35.12  -106.6   35.12,-106.62,5mi   -7
Allentown_PA          PA     105224  40.6   -75.48   40.6,-75.48,5mi     -5
Atlanta_GA            GA     424096  33.76  -84.42   33.76,-84.42,5mi    -5
Augusta_ME            ME     18626   44.32  -69.77   44.32,-69.77,5mi    -5
Augusta-Richmond_GA   GA     192376  33.46  -81.99   33.46,-81.99,5mi    -5
Aurora_CO             CO     304641  39.71  -104.7   39.71,-104.73,5mi   -7


Getting Tweets

We have 4 different GMT offsets in our cities data set: -5, -6, -7, and -8.
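
To see how many cities fall into each offset, a quick check (a small sketch; the counts themselves aren’t reproduced here) is:

locations %>% 
     count(GMT.offset)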

The plan is to first pull tweets from cities in the -5 offset (east coast), then from cities in the -6 offset an hour later, then from cities in the -7 offset an hour after that, and so on. Tweets will be downloaded from one city at a time using the searchTwitter function. After being downloaded, the list of tweets for a city will be stored in an SQLite database as a table whose name corresponds to the city and the time of download. Click the “Code” button to see how this is done.
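
Note that searchTwitter() only works after the twitteR package has been authenticated with Twitter API credentials. That setup isn’t shown in the workflow below; a minimal sketch, with placeholder keys, looks like this:

# Authenticate with the Twitter API (placeholder credentials -- use your own
# keys from a Twitter developer app)
setup_twitter_oauth(consumer_key = 'YOUR_CONSUMER_KEY',
                    consumer_secret = 'YOUR_CONSUMER_SECRET',
                    access_token = 'YOUR_ACCESS_TOKEN',
                    access_secret = 'YOUR_ACCESS_SECRET')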

# Create SQLite database
tweet.db <- dbConnect(SQLite(), dbname = 'Tweet Database.sqlite')
dbDisconnect(tweet.db)

# Backend register database to store tweets in
register_sqlite_backend('Tweet Database.sqlite')

# Create a function for searching for tweets and storing them
# in properly named tables in connected database (which we 
# connected to via register_sqlite_backend() above)

search.and.store <- function(cities, n.tweets = 7){
     
     table.names <- c()
     
     req.count <- 0
     
     for(i in 1:nrow(cities)){
          name <- paste(cities$name[i], Sys.time(), sep = ' -- ')
          
          Tweets <- searchTwitter('',
                                  n = n.tweets,
                                  lang = 'en',
                                  geocode = cities$gcode[i])
          
          req.count <- req.count + n.tweets
          
          store_tweets_db(tweets = Tweets,
                          table_name = name)
          
          table.names <- c(table.names, name)
          
          if(req.count >= 449){
               print(paste(i, 
                           "Cities, Almost reached max requests, going to sleep for 15 minutes at:",
                           Sys.time()))
               print('Tables created:')
               print(table.names)
               Sys.sleep(900)
               req.count <- 0
          }
     }
     
     return(table.names)
}


locations.5 <- locations %>% 
     filter(GMT.offset == -5)
table.names.5 <- search.and.store(locations.5, n.tweets = 30)


locations.6 <- locations %>% 
     filter(GMT.offset == -6)
table.names.6 <- search.and.store(locations.6, n.tweets = 30)

locations.7 <- locations %>% 
     filter(GMT.offset == -7)
table.names.7 <- search.and.store(locations.7, n.tweets = 30)

locations.8 <- locations %>% 
     filter(GMT.offset == -8)
table.names.8 <- search.and.store(locations.8, n.tweets = 30)


City Sentiments

As mentioned in the process overview, many of the techniques and some of the code for handling the text data and performing sentiment analysis were taken from David Robinson’s article and Prof. James Suleiman’s notes. I’ve done my best to note explicitly when I am using lines of code directly from those sources.

In the last section we downloaded tweets and created an SQLite database containing those tweets. Click the “Code” button to see how we now import those tweets from the database.

register_sqlite_backend('Tweet Database.sqlite')

con <- dbConnect(SQLite(), 'Tweet Database.sqlite')
Table.names <- dbListTables(con)
dbDisconnect(con)

Tweet.dfs <- lapply(Table.names, load_tweets_db, as.data.frame = TRUE)

names(Tweet.dfs) <- Table.names

str(Tweet.dfs$`Albuquerque_NM -- 2017-11-18 14:25:11`)
## 'data.frame':    30 obs. of  16 variables:
##  $ text         : chr  "RT @LiberalResist: Guantanamo Under Trump Is a New Hell, Says 70-Year-Old Inmate Held For 13 Years - Newsweek  https://t.co/kEP"| __truncated__ "NYT<U+0092>s Campus Free Speech Coverage Focuses 7-to-1 on Plight of Right | FAIR https://t.co/33k0tz6yvT""| __truncated__ "When girls block me but still look at my story. Lmfaooooo \xed<U+00A0><U+00BD>\xed<U+00B8>\u0082""| __truncated__ "All of the <U+0093>this is a good tweet<U+0094>s have brought me to this moment :<U+0092>) https://t.co/RBJDaBKEEq""| __truncated__ ...
##  $ favorited    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ favoriteCount: num  0 0 0 1 0 0 0 0 0 1 ...
##  $ replyToSN    : chr  NA NA NA NA ...
##  $ created      : POSIXct, format: "2017-11-18 19:25:15" "2017-11-18 19:25:14" ...
##  $ truncated    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ replyToSID   : chr  NA NA NA NA ...
##  $ id           : chr  "931966552852848641" "931966545600958464" "931966539816980481" "931966517888958464" ...
##  $ replyToUID   : chr  NA NA NA NA ...
##  $ statusSource : chr  "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>" "<a href=\"http://www.crowdfireapp.com\" rel=\"nofollow\">Crowdfire - Go Big</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" ...
##  $ screenName   : chr  "BalelBauer" "MDBlanchfield" "sonrriissaaa" "Devo_Weavo_" ...
##  $ retweetCount : num  2 0 0 0 2 0 0 0 0 2 ...
##  $ isRetweet    : logi  TRUE FALSE FALSE FALSE TRUE FALSE ...
##  $ retweeted    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ longitude    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ latitude     : int  NA NA NA NA NA NA NA NA NA NA ...

Now we have a list of 139 data frames, one for each city. Each data frame has thirty observations (tweets) with 16 different variables. The variable that we are interested in is text which contains the actual text content of the tweet. For each city we are going to select just the text from all of the tweets and break it down into words (using unnest_tokens). Thus, for each city we will have a data frame with all the words (excluding common stop words using the stop_words data set from the tidytext package, see this wikipedia article for more information on stop words) from the thirty tweets collected for that city. Click the “Code” button to see how this is done.
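
As a quick toy illustration of this step (the tweet below is made up and not part of the data), here is how a single tweet would pass through the same cleaning, regex tokenization, and stop-word filtering used in the code that follows:

# Made-up tweet, just to illustrate the cleaning and tokenization steps
toy <- tibble(text = 'Loving the sunny weather in Portland today! https://t.co/abc123')

toy %>%
     mutate(text = str_replace_all(text, 'https://t.co/[A-Za-z\\d]+|&amp;', '')) %>% # remove links
     unnest_tokens(word, text, token = 'regex',
                   pattern = "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))") %>%             # split into words
     filter(!(word %in% stop_words$word),    # drop common stop words
            str_detect(word, '[a-zA-Z]'))    # drop tokens with no letters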

# Define regex pattern to identify words in text. This pattern was taken from
# David Robinson's article (see link in introduction)
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"

# characters to remove from all text strings (if present), also
# taken from David Robinson's article
txt.rmv <- 'https://t.co/[A-Za-z\\d]+|&amp;'

Tweet.words <- list()
for(i in 1:length(Tweet.dfs)){
     # Get name of city
     location <- strsplit(names(Tweet.dfs)[i], ' -- ')[[1]][1]
     
     Tweet.words[[i]] <- Tweet.dfs[[i]] %>%
          select(text) %>% # Select text column only
          mutate(text = str_replace_all(text, txt.rmv, '')) %>% # Remove characters
          unnest_tokens(word, text, token = 'regex', pattern = reg) %>% # Split text into words
          filter(!(word %in% stop_words$word),     # Remove common stop words and any
                 str_detect(word, '[a-zA-Z]')) %>% # "words" that don't have letters
          mutate(name = location) %>% # Add a column with the city name
          select(name, word) # Reorder columns
}

names(Tweet.words) <- Table.names

Now we want to tag each word with sentiments, sum up the total counts of sentiments for each city, and combine all the cities into one data frame where the rows are cities, the columns are sentiments, and the values within each sentiment column are the numbers of times a word of that sentiment was observed. We’ll use the NRC Word-Emotion Association lexicon, which is in the sentiments data set of the tidytext package. The first rows of our data frame are shown below. Click “Code” to see how it was constructed.

NRC.sent <- sentiments %>% 
     filter(lexicon == 'nrc') %>% 
     select(word, sentiment)

All.Words <- Tweet.words[[1]]

for(i in 2:length(Tweet.words)){
     All.Words <- rbind(All.Words, Tweet.words[[i]])
}

Total.Sent.Words <- All.Words %>%
     filter(word %in% NRC.sent$word, # Select only words that are in the 
            word != 'trump') %>%     # sentiment dictionary and exclude trump
     group_by(name) %>%              # since it likely refers to a name
     summarise(Total.Sentiment.Words = n())

City.Sents <- All.Words %>% 
     inner_join(NRC.sent, by = 'word') %>% 
     group_by(name, sentiment) %>% 
     summarise(Sentiment.Count = n()) %>% 
     ungroup() %>% 
     spread(sentiment, Sentiment.Count) %>% 
     inner_join(Total.Sent.Words, by = 'name') %>% 
     rename(total = Total.Sentiment.Words)

pandoc.table(head(City.Sents),
             split.tables = Inf,
             split.cells = Inf)
name                  anger  anticipation  disgust  fear  joy  negative  positive  sadness  surprise  trust  total
Albuquerque_NM        8      13            13       14    12   19        23        8        12        7      43
Allentown_PA          10     15            NA       10    11   14        15        13       3         26     55
Atlanta_GA            6      19            7        8     20   14        41        6        10        21     59
Augusta-Richmond_GA   3      3             4        2     5    7         12        1        1         6      20
Augusta_ME            2      11            3        6     13   12        28        4        4         12     46
Aurora_CO             7      7             2        5     7    13        17        4        2         9      32

Now that we have our data frame, we want to calculate the proportion of positive words and the proportion of negative words in each city’s tweets, as well as the ratio of these two proportions (proportion positive : proportion negative), which we will call the “Positive to Negative Ratio”. These are the measures that we will eventually be mapping. Click the “Code” button to see how this is done.
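
For example, from the table above Albuquerque had 23 positive and 19 negative sentiment words out of 43 total, giving a positive proportion of 23/43 ≈ 0.53, a negative proportion of 19/43 ≈ 0.44, and a positive to negative ratio of 0.53/0.44 ≈ 1.21.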

Sent.Props <- City.Sents %>% 
     mutate(positive.prop = positive/total,
            negative.prop = negative/total,
            pos.to.neg = positive.prop/negative.prop) %>% 
     inner_join(locations, by = 'name') %>% 
     select(name, positive.prop, negative.prop, pos.to.neg, lat, long) %>% 
     mutate(name = substr(name, 1, nchar(name) - 3),
            name = gsub('_', ' ', name))

pandoc.table(head(Sent.Props),
             split.tables = Inf,
             split.cells = Inf)
name               positive.prop  negative.prop  pos.to.neg  lat    long
Albuquerque        0.5349         0.4419         1.211       35.12  -106.6
Allentown          0.2727         0.2545         1.071       40.6   -75.48
Atlanta            0.6949         0.2373         2.929       33.76  -84.42
Augusta-Richmond   0.6            0.35           1.714       33.46  -81.99
Augusta            0.6087         0.2609         2.333       44.32  -69.77
Aurora             0.5312         0.4062         1.308       39.71  -104.7

Finally! We have our cities in a data frame with their coordinates and data on positive and negative tweet sentiment. Let’s move on to mapping!


Mapping our Data

The maps below show the positive and negative tweet sentiment for each city. There is one map for the positive to negative ratio, one for the proportion of positive words in tweets, and one for the proportion of negative words in tweets. The bubbles represent cities, and the color of each bubble represents sentiment, with green indicating more positive sentiment and red indicating more negative sentiment. Click the “Code” buttons to see how each map was constructed.


City Sentiment Maps

Positive to Negative Ratio
pal <- colorQuantile("RdYlGn", NULL, n = 10)

M.Pos.to.Neg <- leaflet() %>%
     addProviderTiles('OpenStreetMap.Mapnik') %>%
     addCircleMarkers(lng = Sent.Props$long, 
                      lat = Sent.Props$lat,
                      color = pal(Sent.Props$pos.to.neg),
                      weight = 1,
                      popup = paste(Sent.Props$name, 
                                    '<br>', 
                                    'Positive to Negative Ratio:',
                                    round(Sent.Props$pos.to.neg, 2)),
                      radius = 10,
                      fillOpacity = .50) %>% 
     addLegend("bottomleft", 
               colors = pal(seq(.1,1,.1)), 
               labels = c('0-0.86',
                          '0.86-1.06',
                          '1.06-1.26',
                          '1.26-1.45',
                          '1.45-1.53',
                          '1.53-1.80',
                          '1.80-2.05',
                          '2.05-2.42',
                          '2.42-3.34',
                          '3.34-18.0'),
               opacity = 1
               )

M.Pos.to.Neg
Proportion of Positive Words
pal <- colorQuantile("RdYlGn", NULL, n = 10)

M.Pos <- leaflet() %>%
     addProviderTiles('OpenStreetMap.Mapnik') %>%
     addCircleMarkers(lng = Sent.Props$long, 
                      lat = Sent.Props$lat,
                      color = pal(Sent.Props$positive.prop),
                      weight = 1,
                      popup = paste(Sent.Props$name, 
                                    '<br>', 
                                    'Proportion of Positive Words:',
                                    round(Sent.Props$positive.prop, 2)),
                      radius = 10,
                      fillOpacity = .50) %>% 
     addLegend("bottomleft", 
               colors = pal(seq(.1,1,.1)), 
               labels = c('0-0.36',
                          '0.36-0.42',
                          '0.42-0.45',
                          '0.45-0.48',
                          '0.48-0.50',
                          '0.50-0.53',
                          '0.53-0.56',
                          '0.56-0.60',
                          '0.60-0.64',
                          '0.64-0.86'),
               opacity = 1
               )

M.Pos
Proportion of Negative Words
pal <- colorQuantile("RdYlGn", NULL, n = 10, reverse = TRUE)

M.Neg <- leaflet() %>%
     addProviderTiles('OpenStreetMap.Mapnik') %>%
     addCircleMarkers(lng = Sent.Props$long, 
                      lat = Sent.Props$lat,
                      color = pal(Sent.Props$negative.prop),
                      weight = 1,
                      popup = paste(Sent.Props$name, 
                                    '<br>', 
                                    'Proportion of Negative Words:',
                                    round(Sent.Props$negative.prop, 2)),
                      radius = 10,
                      fillOpacity = .50) %>% 
     addLegend("bottomleft", 
               colors = rev(pal(seq(.1,1,.1))), 
               labels = c('0.45-1',
                          '0.41-0.45',
                          '0.37-0.41',
                          '0.33-0.37',
                          '0.31-0.33',
                          '0.29-0.31',
                          '0.27-0.29',
                          '0.24-0.27',
                          '0.18-0.24',
                          '0-0.18'),
               opacity = 1
               )

M.Neg



Thoughts

There isn’t any obvious large-scale geographic trend in tweet sentiment apparent in this data. That is to say, we don’t see any general trends like “the northern United States is much more positive than the southwest.” There does seem to be a little more negativity on the west coast than the east coast, but we didn’t sample many cities on the west coast (since the west coast is made up of fewer states and we sampled at most 3 cities in each state), so it is hard to tell. However, you can see some local trends. There seems to be a little half-moon of negativity in the Carolinas and a line of positivity running from central Connecticut into southwestern Massachusetts. If you look across the US you can find more small local trends like this.

We only looked at a snapshot in time here. It would be interesting to keep collecting data at the same time of day over a longer period (say a month) for each of these cities and see if and how sentiments changed over time. Observed changes could then be compared with events that occurred at a similar place and time to see if there were any relations. I had quite a bit of fun with this, and it would be interesting to figure out a way to automate the data fetching so that each day at a specific time I could download and store tweets from each city.
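
As a very rough sketch of that idea (assuming the search.and.store() function and locations data frame defined above, and ignoring the per-timezone staggering for brevity), a naive daily loop could look like the following; in practice a system scheduler calling an R script would be a better fit:

# Naive daily fetch: pull and store 30 tweets per city once every 24 hours
repeat {
     daily.tables <- search.and.store(locations, n.tweets = 30)
     Sys.sleep(24 * 60 * 60) # wait a day before the next pull
}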




Citations and References

I used a lot of resources here. In addition to searching around on Stack Overflow, R-bloggers, and the like, the two primary resources were Professor James Suleiman’s (of the University of Southern Maine) notes and David Robinson’s interesting article on Trump’s tweets. The idea for the tabbed maps was inspired by Katherine Bessey’s assignment 3 for MBA 676.

  1. David Robinson (2016). “Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half” http://varianceexplained.org/r/trump-tweets/
  2. James Suleiman (2017). “MBA 676: Data Management and Analytics” http://jamessuleiman.com/mba676/index.html
  3. R Core Team (2017). “R: A Language and Environment for Statistical Computing” https://www.R-project.org
  4. Hadley Wickham (2017). “Tidyverse” https://CRAN.R-project.org/package=tidyverse
  5. Jeff Gentry (2015). “twitteR: R Based Twitter Client” https://CRAN.R-project.org/package=twitteR
  6. Richard Becker et al (2017). “maps: Draw Geographical Maps” https://CRAN.R-project.org/package=maps
  7. Barry Rowlingson (2014). “geonames: Interface to www.geonames.org web service” https://CRAN.R-project.org/package=geonames
  8. Gergely Daróczi and Roman Tsegelskyi (2017). “pander: An R ‘Pandoc’ Writer” https://CRAN.R-project.org/package=pander
  9. Kirill Müller et al (2017). “RSQLite: ‘SQLite’ Interface for R” https://CRAN.R-project.org/package=RSQLite
  10. Julia Silge and David Robinson (2016). “tidytext: Text Mining and Analysis Using Tidy Data Principles in R” http://dx.doi.org/10.21105/joss.00037
  11. Joe Cheng et al (2017). “leaflet: Create Interactive Web Maps with the JavaScript ‘Leaflet’ Library” https://CRAN.R-project.org/package=leaflet