In this assignment, I’m going to download Tweets that contain the names of neighborhoods in Atlanta. I’ll then apply sentiment analysis to the Tweets and map/plot the sentiments associated with neighborhoods. Specifically, the workflow for this assignment involves the following steps:
Step 1. Download and read a shapefile that contains neighborhood boundaries and their names.
Step 2. Initiate a deep learning-based package for sentiment analysis called “sentiment.ai” (or, if I have problems with this package, use a different package).
Step 3. Loop through the names of neighborhoods in Atlanta to collect Tweets.
Step 4. Clean and filter the collected Tweets.
Step 5. Analyze the Tweets.
As always, let’s load our packages first.
library(rtweet)
library(tidyverse)
library(sf)
library(sentiment.ai)
library(SentimentAnalysis)
library(ggplot2)
library(here)
library(tmap)
library(tidydr)
library(data.table)
First, I’ll go to this webpage and download the shapefile from there. Once downloaded, I will read the data into our current R environment.
# TASK ////////////////////////////////////////////////////////////////////////
# Read neighborhood shapefile
nb_shp <- st_read("C:/Users/kwells65/Atlanta_Neighborhoods.shp")
## Reading layer `Atlanta_Neighborhoods' from data source
## `C:\Users\kwells65\Atlanta_Neighborhoods.shp' using driver `ESRI Shapefile'
## Simple feature collection with 248 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -84.55085 ymin: 33.64799 xmax: -84.28962 ymax: 33.88687
## Geodetic CRS: WGS 84
# //TASK //////////////////////////////////////////////////////////////////////
Next, I’ll set up sentiment.ai. If I have issues with this package, I’m going to use the other package introduced in class, SentimentAnalysis.
# TASK ////////////////////////////////////////////////////////////////////////
#First, I'm going to install sentiment.ai (I had to re-do this bit for some reason; if you've already installed it, then skip this part.)
install_sentiment.ai(envname = "r-sentiment-ai",
method = "conda",
python_version = "3.8.10")
#Initiate sentiment.ai
init_sentiment.ai(envname = "r-sentiment-ai", method = "conda") #feel free to change these arguments if you need to.
## <tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject object at 0x000001C9B0DF26D0>
#if you're having problems with this part, I found that running R studio as an admin helps some of the permissions errors you get.
#As always, let's run a quick check before moving on
check_sentiment.ai()
## NULL
#Just to be on the safe side, let's test it as well. If the below block of code shows us the sentiment scores for each string, then that means it works.
sentiment_score(c("This installation process is too complicated!",
"Only if it works in the end.",
"But does it?",
"It does work!"))
## This installation process is too complicated!
## -0.7417080
## Only if it works in the end.
## -0.4529068
## But does it?
## 0.1627681
## It does work!
## 0.6917385
# //TASK //////////////////////////////////////////////////////////////////////
In this step, I’m going to prepare to use the Twitter API by passing the credentials we created in another class exercise to the create_token() function.
# TASK ////////////////////////////////////////////////////////////////////////
#Here, I have to use the name that I assigned the app I created on the developer portal.
appname <- "UrbanAnalytics_tutorial"
#Now, let's create a token named "twitter_token"
#NOTE: the keys used should be replaced by your own keys obtained by creating the app
twitter_token <- create_token(
app = appname,
consumer_key = Sys.getenv("twitter_key"),
consumer_secret = Sys.getenv("twitter_key_secret"),
access_token = Sys.getenv("twitter_access_token"),
access_secret = Sys.getenv("twitter_access_token_secret"))
# //TASK //////////////////////////////////////////////////////////////////////
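Since Sys.getenv() returns an empty string for a variable that isn’t set, a quick check before calling create_token() can save a confusing authentication error later. This is only a sketch: the helper name missing_env_vars() is my own (hypothetical, not part of rtweet), and the variable names are the ones used above.

```r
# Hypothetical helper (not part of rtweet): report which environment
# variables are unset or empty.
missing_env_vars <- function(vars) {
  # Sys.getenv() is vectorised and returns "" for unset variables
  vars[Sys.getenv(vars) == ""]
}

# Variable names taken from the create_token() call above
twitter_vars <- c("twitter_key", "twitter_key_secret",
                  "twitter_access_token", "twitter_access_token_secret")

not_set <- missing_env_vars(twitter_vars)
if (length(not_set) > 0) {
  message("Missing credentials: ", paste(not_set, collapse = ", "))
}
```

If the message lists any names, they would need to be added (for example, to .Renviron) before the token is created.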
Next, let’s define a function that downloads Tweets, cleans them, and applies sentiment analysis to them.
# Extract neighborhood names from nb_shp's NAME column and store it in nb_names object.
nb_names <- nb_shp$NAME
# Define a search function
get_twt <- function(term){
# =========== NO MODIFICATION ZONE STARTS HERE ===============================
term_mod <- paste0("\"", term, "\"")
# =========== NO MODIFY ZONE ENDS HERE ========================================
# TASK ////////////////////////////////////////////////////////////////////////
# 1. Use search_tweets() function to get Tweets.
# Use term_mod as the search keyword to get Tweets.
# Set n to a number large enough to get all Tweets from the past 7 days
# Set geocode argument such that the search is made with 50 mile radius from 33.76, -84.41
# Be sure to exclude retweets.
# You may need to enable the function to automatically wait if rate limit is exceeded.
# I recommend using suppressWarnings() to suppress warnings.
# Make sure you assign the output from search_tweets() to an object named 'out'
out <- suppressWarnings(search_tweets(q = term_mod,
n = 100, #upper limit on Tweets returned per term; raise it if a neighborhood could have more than 100 Tweets in the past 7 days
lang = "en", #english
include_rts = FALSE,
geocode = "33.76,-84.41,50mi",
retryonratelimit = TRUE)) #automatically waits if rate limit is exceeded
# //TASK //////////////////////////////////////////////////////////////////////
# =========== NO MODIFICATION ZONE STARTS HERE ===============================
out <- out %>%
select(created_at, id, id_str, full_text, geo, coordinates, place, text)
# Basic cleaning
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"
out <- out %>%
mutate(text = str_replace_all(text, replace_reg, ""),
text = gsub("@", "", text),
text = gsub("\n\n", "", text))
# Sentiment analysis
# Also add a column for neighborhood names
if (nrow(out)>0){
out <- out %>%
mutate(sentiment_ai = sentiment_score(out$text),
sentiment_an = analyzeSentiment(text)$SentimentQDAP,
nb = term)
print(paste0("Search term:", term))
} else {
return(out)
}
return(out)
}
# =========== NO MODIFY ZONE ENDS HERE ========================================
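To make the query and cleaning steps inside get_twt() easier to follow, here is a minimal base-R sketch of what they do to a single made-up Tweet. It is an approximation, not the function’s exact code: it uses gsub(perl = TRUE) instead of str_replace_all(), keeps only the URL part of the pattern, and the sample text and the name clean_text() are my own inventions.

```r
# Hypothetical stand-in for the cleaning block in get_twt()
clean_text <- function(x) {
  url_reg <- "http[s]?://[A-Za-z\\d/\\.]+"  # URL part of replace_reg only
  x <- gsub(url_reg, "", x, perl = TRUE)    # strip links
  x <- gsub("@", "", x, fixed = TRUE)       # drop the @ of mentions
  x <- gsub("\n\n", "", x, fixed = TRUE)    # drop blank-line breaks
  x
}

# Wrapping the term in double quotes asks for the exact phrase
term <- "Inman Park"
term_mod <- paste0("\"", term, "\"")

# Made-up Tweet text, for illustration only
sample_twt <- "Brunch in Inman Park with @friend\n\nhttps://t.co/AbC123 so good"
cleaned <- clean_text(sample_twt)
print(cleaned)
# "Brunch in Inman Park with friend so good"
```

Note that only the @ symbol is removed, so the handle itself (“friend”) stays in the text that gets scored.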
Let’s apply the function to Tweets. Note that this code chunk may take more than 15 minutes if you’ve already spent some (or all) of your rate limit.
# =========== NO MODIFICATION ZONE STARTS HERE ===============================
# Apply the function to get Tweets
twt <- map(nb_names, ~get_twt(.x))
## [1] "Search term:Fairburn"
## [1] "Search term:Brandon"
## [1] "Search term:Poncey-Highland"
## [1] "Search term:Inman Park"
## [1] "Search term:Edgewood"
## [1] "Search term:Lakewood"
## [1] "Search term:Cabbagetown"
## [1] "Search term:Reynoldstown"
## [1] "Search term:Campbellton Road"
## [1] "Search term:Southwest"
## [1] "Search term:Adams Park"
## [1] "Search term:Ben Hill"
## [1] "Search term:Underwood Hills"
## [1] "Search term:Riverside"
## [1] "Search term:Bolton"
## [1] "Search term:Rockdale"
## [1] "Search term:Lenox"
## [1] "Search term:Kingswood"
## [1] "Search term:Margaret Mitchell"
## [1] "Search term:Cross Creek"
## [1] "Search term:Memorial Park"
## [1] "Search term:Pittsburgh"
## [1] "Search term:Peoplestown"
## [1] "Search term:Summerhill"
## [1] "Search term:Castleberry Hill"
## [1] "Search term:Sherwood Forest"
## [1] "Search term:Loring Heights"
## [1] "Search term:Mays"
## [1] "Search term:Grove Park"
## [1] "Search term:Adamsville"
## [1] "Search term:Cascade Heights"
## [1] "Search term:Westview"
## [1] "Search term:West End"
## [1] "Search term:Fort Valley"
## [1] "Search term:Greenbriar"
## [1] "Search term:South Atlanta"
## [1] "Search term:Downtown"
## [1] "Search term:Georgia Tech"
## [1] "Search term:Home Park"
## [1] "Search term:Midtown"
## [1] "Search term:Brookwood"
## [1] "Search term:Virginia Highland"
## [1] "Search term:Kirkwood"
## [1] "Search term:Pine Hills"
## [1] "Search term:Carroll Heights"
## [1] "Search term:Brookhaven"
## [1] "Search term:Druid Hills"
## [1] "Search term:Chattahoochee"
## [1] "Search term:Oakland"
## [1] "Search term:Atlantic Station"
## [1] "Search term:East Lake"
## [1] "Search term:East Atlanta"
## [1] "Search term:Emory"
# =========== NO MODIFY ZONE ENDS HERE ========================================
The downloaded Tweets need some cleaning and reorganizing, including:
1. Drop empty elements from the list twt. These are neighborhoods with no Tweets referring to them. Hint: you can create a logical vector that is FALSE where the corresponding element of twt has no Tweets and TRUE otherwise.
2. The coordinates column is currently a list-column. Unnest this column so that lat, long, and type (i.e., the column names inside coordinates) are separate columns. You can use the unnest() function.
3. Calculate the average sentiment score for each neighborhood. You can group_by() the nb column of the twt object and summarise() to calculate means. Also add a column n that contains the number of rows in each group, using the n() function.
4. Join the cleaned Tweet data back to the neighborhood shapefile. Use the neighborhood name as the join key. Make sure that the result of the join is assigned to an object called twt_poly so that the subsequent code runs smoothly.
#Step one: dropping the empty elements
twt_no_empty <- twt[sapply(twt, nrow) > 0]
#Step two: unnesting and separating lat long data
twt_df <- rbindlist(twt_no_empty, fill=FALSE, idcol=NULL)
twt_unnest <- unnest(twt_df, cols = c("coordinates"))
#Step three: calculating the average sentiment score for each neighborhood
twt_sent <- twt_unnest %>%
group_by(nb) %>%
summarise(
mean_sent_ai = mean(sentiment_ai),
mean_sent_an = mean(sentiment_an),
n = n())
#Alright, now that we've got our average scores calculated for each neighborhood, we're ready to put everything back with our shapefile.
#Step four: joining the cleaned data back into the neighborhood shapefile
names(twt_sent)[names(twt_sent) == 'nb'] <- 'NAME'
twt_poly <- merge(x=nb_shp,y=twt_sent,by="NAME",all.x=TRUE)
I’m sure there is a more elegant way to do all of these steps that doesn’t involve breaking each one down into a single process and creating multiple objects, but sometimes it’s easier to understand things more clearly when you compartmentalize things this way.
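For what it’s worth, one possible compact version is sketched below on toy data. Everything in it is an assumption for illustration: the toy list, the toy neighborhood table, and the object names ending in _toy are made up, the unnesting step is left out, and left_join() stands in for merge(). It isn’t meant as the “right” way, just one way the steps could be chained.

```r
library(dplyr)

# Toy stand-in for `twt`: per-neighborhood data frames, one of them empty
twt_toy <- list(
  data.frame(nb = "Midtown", sentiment_ai = c(0.5, -0.1)),
  data.frame(nb = character(0), sentiment_ai = numeric(0)),
  data.frame(nb = "Kirkwood", sentiment_ai = 0.4)
)

# Toy stand-in for the shapefile's attribute table
nb_toy <- data.frame(NAME = c("Kirkwood", "Midtown", "Emory"))

# Drop empty elements, stack the rest, and summarise by neighborhood
twt_sent_toy <- twt_toy[vapply(twt_toy, nrow, integer(1)) > 0] %>%
  bind_rows() %>%
  group_by(nb) %>%
  summarise(mean_sent_ai = mean(sentiment_ai), n = n())

# Keep every neighborhood; those without Tweets get NA
twt_poly_toy <- left_join(nb_toy, twt_sent_toy, by = c("NAME" = "nb"))
print(twt_poly_toy)
```

Joining with by = c("NAME" = "nb") avoids the separate renaming step, and the neighborhood with no Tweets (Emory in the toy data) keeps its row with NA values, just like all.x = TRUE does in merge().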
Either way, with this task accomplished, let’s move on to our analysis.
Now that we have collected Tweets, calculated sentiment score, and merged it back to the original shapefile, we can map them to see spatial distribution and draw plots to see inter-variable relationships.
First, let’s draw two interactive choropleth maps, one using sentiment score as the color and the other one using the number of Tweets as the color. Use tmap_arrange() function to display the two maps side-by-side.
tmap_mode("view")
## tmap mode set to interactive viewing
twt_avg <- tm_shape(twt_poly) +
tm_polygons(col = "mean_sent_ai", title = 'Mean Sentiment Score', style = "quantile", palette = "RdYlBu")
twt_num <- tm_shape(twt_poly) +
tm_polygons(col = "n", title = "Number of Tweets", style="quantile", palette = "-RdYlBu")
tmap_arrange(twt_avg, twt_num, sync = TRUE)
## Variable(s) "mean_sent_ai" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
## Variable(s) "mean_sent_ai" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
#I decided to go with the red to blue palette because it's more colorblind friendly than the default colors that are used.
Second, use the ggplot2 package to draw a scatterplot with the number of Tweets for each neighborhood on the X-axis and the sentiment score on the Y-axis. Also perform a correlation analysis between the number of Tweets for each neighborhood and the sentiment score, using either the cor.test() function or the ggpubr::stat_cor() function.
ggplot(data = twt_poly, mapping = aes(x=n, y=mean_sent_ai, color = mean_sent_ai)) +
geom_point() +
geom_smooth(method = "lm",se = FALSE) +
labs(
x = "Tweets (#)",
y = "Mean Sentiment Score",
title = "Twitter Sentiment Analysis of Atlanta Neighborhoods")+
scale_color_gradient(low="darkblue", high="red")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 195 rows containing non-finite values (stat_smooth).
## Warning: Removed 195 rows containing missing values (geom_point).
I decided to go with a gradient to show the values of the mean sentiment
score because I thought using color would help make what the graph is
communicating clearer to users.
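The task also asks for a correlation analysis, so here is a minimal sketch of how cor.test() could be used for it. The numbers below are made up purely to show the call; with the real data, the two arguments would presumably be twt_poly$n and twt_poly$mean_sent_ai. The name toy_nb is hypothetical.

```r
# Made-up values for illustration only (not the real Tweet data).
# The NA row mimics a neighborhood with no Tweets.
toy_nb <- data.frame(
  n            = c(1, 3, 5, 8, 12, NA),
  mean_sent_ai = c(0.6, 0.4, 0.3, 0.1, -0.2, NA)
)

# Pearson correlation test; incomplete pairs are dropped automatically
ct <- cor.test(toy_nb$n, toy_nb$mean_sent_ai, method = "pearson")

print(ct$estimate)  # correlation coefficient
print(ct$p.value)   # p-value for the null hypothesis of zero correlation
```

Because cor.test() only uses complete pairs, the neighborhoods without Tweets don’t need to be filtered out first.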
Analysis Write-Up
Disclaimer: My answers are predicated on the fact that before this exercise, I have never used Twitter before in my life, so my understanding of how it works/what things mean might not be perfect.
Using the map and plot I created above (as well as using my inspection of the data), let’s answer the following questions:
Q.1 What’s the proportion of neighborhoods with one or more Tweets? By my count, there are about 53-56 neighborhoods with at least one tweet (the scatterplot warning above drops 195 of the 248 rows, which leaves 53), so approximately 21-23% of all the neighborhoods had some kind of data. That’s over a fifth, which I guess isn’t bad.
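A quick way to double-check a count like this is to look at which rows have a non-missing n, since neighborhoods with no Tweets end up as NA after the join. The sketch below uses a made-up vector; with the real data, twt_poly$n would presumably take its place.

```r
# Made-up Tweet counts for eight neighborhoods; NA means no Tweets
toy_n <- c(4, NA, NA, 1, NA, 12, NA, NA)

has_twt <- !is.na(toy_n)

n_with_twt <- sum(has_twt)   # number of neighborhoods with Tweets
share_twt  <- mean(has_twt)  # proportion of neighborhoods with Tweets

print(n_with_twt)  # 3
print(share_twt)   # 0.375
```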
Q.2 Is there any pattern to neighborhoods with/without Tweets? Is there anything that can help us guess how likely a given neighborhood will have Tweets? It looks like places such as Edgewood, Atlantic Station, Emory, and Bolton have more tweets. I think it could be because a lot of these areas are either in transition or urban in character. Some aren’t quite as dense as more urban places and have a “small-scale urbanism” feel, while others are more traditionally urban/dense. Several of the areas that have a high number of tweets also have a lot of shops, events, attractions, or even institutions. Some parts are gentrifying and others might have higher crime rates. I think all of these things are bound to make people more likely to talk either positively or negatively about a place.
As for the areas with fewer tweets, the ones that I recognize tend to be “quieter” and more suburban in character, so unless there’s something big happening, they might not get a mention.
Q.3 (If you’ve observed a relationship between sentiment score and the number of Tweets) Why do you think there is a relationship between sentiment score and the number of Tweets? It could just be my bias, but I’m noticing a slight negative relationship between sentiment scores and the number of tweets. I think it’s because people like to complain–especially on social media.
Q.4 The neighborhood ‘Rockdale’ has many Tweets mentioning its name. Does high volume of Tweets make sense? Why do you think this occurred? I don’t think this is unusual. The parts I’m familiar with are urban and have a lot of good restaurants/nightlife spots as well as some nearby parks. The urban character and concentration of attractions could explain the influx of tweets. I also remember that while there are lots of neighborhoods with retired seniors, it is a fairly “young” area. Many seniors these days can use smartphones and have social media, although it has historically been associated with young people. Maybe there are especially active members in one or both of these groups. Local-level accounts designed to spread news to residents could also be fairly active. Moreover, the recent elections as well as any other event, accident, or noteworthy happening could lead to a spike in twitter activity.
Q.5 What do you think are the strengths and shortcomings of this method (i.e., using Twitter & neighborhood names to evaluate sentiments around each neighborhood)? One of the major strengths is that it is an easy, accessible method to evaluate neighborhood sentiment. Furthermore, Twitter–for better or for worse–lends itself to self-expression in the most unfiltered way possible, which means that people express themselves with a breadth and depth of emotion that you might not find at your usual community meetings and other formal settings where such discussions might take place. People will say things on social media that they will never say to someone’s face in “real life”. There’s definitely room for nuance, and because of the nature of social media and how removed it makes people feel from others, there is a higher likelihood that people will “say the quiet part out loud”. For example: if something like racial resentment/other forms of bigotry is driving neighborhood sentiment, then that matters.
As for shortcomings, I question how comprehensive this method is. There are people who, for a variety of reasons, do not have a Twitter account or do not post tweets regularly. Moreover, even in a so-called “developed” nation like the United States, the digital divide is still a very real issue–especially in a place like Atlanta, where there is a widening wealth gap. People who lack access to the internet, don’t like or are uninterested in using Twitter, or aren’t familiar with social media/certain technologies aren’t being included in this analysis. This could exclude entire swaths of a population while over-emphasizing the voices of those who are “chronically online”. This not only leads to biased data, but could tacitly send a message that some people’s voices matter more than others–a tendency which is especially worrying as many of the people excluded could already be marginalized in some way.
Q.6 Can you think of a better way to define neighborhoods and collect Tweets that can better represent the sentiment of neighborhoods? Defining neighborhoods can be difficult–especially when you account for the people living there, which I believe you should. Sometimes people conceptualize what neighborhood they belong to differently than their actual location. For example: one of my neighbors says we are in “the Decatur/Atlanta area” when according to DeKalb, we’re located in Druid Hills. Moreover, the official boundaries might not be an accurate reflection of what the residents perceive to be their neighborhood. I’ve seen public engagement exercises where residents are asked to draw a boundary around their neighborhood and all the maps looked slightly different. Neighborhoods can also feel “fragmented” to some locals–especially if they’re undergoing change. For instance, not all residents of East Atlanta, Edgewood, or Kirkwood might feel like they’re a part of the same collective due to gentrification. Both of these phenomena could impact sentiment and should be considered.
I’m not really sure of the best way to control for this tendency, since so much of a neighborhood’s boundaries, characteristics/culture, and sentiment is based on individual perception. That said, I think incorporating demographics would give us some context that might help. Esri has what’s known as a Tapestry Segmentation that characterizes and divides neighborhoods into segments based on a variety of demographic and socioeconomic variables that create a community profile. It also allows you to see the breakdown of tapestry segments in a given geography.
This might be too sophisticated to replicate, but similar profiles could give us an idea of the character of the area. Is a neighborhood historically an ethnic enclave or majority-minority community? Is it a suburban bedroom community? Newly-arrived young families? Is it densifying? Diversifying? Gentrifying? Is there an interesting mix or division somewhere in the population? What is the majority profile and how has that changed over time? I’d argue these are all things that tell us something about the neighborhood that could influence or explain individual sentiment–even if that individual conceptualizes things a bit differently than the people drawing the map.
As for the tweets, I think expanding our net to scrape Twitter for neighborhood features such as streets, landmarks, services, and local-level departments (fire, law enforcement, etc.) might help us capture more data, but we’d have to normalize our data in some way to control for biased responses. It might even be fun to supplement it with Yelp data for some of the aforementioned features, since the reviews can include sentiment. Also, I think just using a single week might introduce some bias; if one neighborhood had a music festival and another had a sewage pipe burst in the same week, then I would bet money that I don’t have that it’d influence what people are tweeting about in both those neighborhoods.