In this assignment, we will download Tweets that contain the names of neighborhoods in Atlanta. We will apply sentiment analysis to the Tweets and map/plot the sentiments associated with neighborhoods. Specifically, you will be performing the following steps:
Step 1. You will download and read a shapefile that contains neighborhood boundaries and their names. Step 2. Initiate a deep learning-based package for sentiment analysis called “sentiment.ai” (if you have problems with this package, you can use a different package). Step 3. Loop through the names of neighborhoods in Atlanta to collect Tweets. Step 4. Clean and filter the collected Tweets. Step 5. Analyze the Tweets.
As always, load packages first.
library(rtweet)
library(tidyverse)
library(sf)
library(sentiment.ai)
library(SentimentAnalysis)
library(ggplot2)
library(here)
library(tmap)
library(envnames)
Go to this webpage and download the shapefile from there. Once downloaded, read the data into your current R environment.
# TASK ////////////////////////////////////////////////////////////////////////
# Read neighborhood shapefile
nb_shp <- st_read("C:\\Users\\faiza\\OneDrive\\Documents\\GT\\CP-8883\\Atlanta_Neighborhoods.shp")
## Reading layer `Atlanta_Neighborhoods' from data source
## `C:\Users\faiza\OneDrive\Documents\GT\CP-8883\Atlanta_Neighborhoods.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 248 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -84.55085 ymin: 33.64799 xmax: -84.28962 ymax: 33.88687
## Geodetic CRS: WGS 84
# //TASK //////////////////////////////////////////////////////////////////////
If you have issues using this package, you can use the other package introduced in class, called SentimentAnalysis.
# TASK ////////////////////////////////////////////////////////////////////////
# Initiate sentiment.ai
init_sentiment.ai(envname = "r-sentiment-ai", method = "conda") # feel free to change these arguments if you need to.
## <tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject object at 0x000001CEBCAB8640>
# //TASK //////////////////////////////////////////////////////////////////////
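If sentiment.ai will not initialize on your machine, a minimal fallback is to rely on SentimentAnalysis (already loaded above), whose dictionary-based analyzeSentiment() returns several scores including SentimentQDAP, the one used later in this assignment. A short sketch (the two sample Tweets are hypothetical):
# Sketch: dictionary-based sentiment scores as a fallback to sentiment.ai
example_text <- c("I love this neighborhood", "Traffic here is terrible")  # made-up sample Tweets
analyzeSentiment(example_text)$SentimentQDAP  # one numeric score per text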
Prepare to use Twitter API by specifying arguments of create_token() function using your credentials.
# TASK ////////////////////////////////////////////////////////////////////////
# whatever name you assigned to your created app
appname <- "UA_faiza"
# create token named "twitter_token"
# the keys used should be replaced by your own keys obtained by creating the app
twitter_token <- create_token(
app = appname,
consumer_key = Sys.getenv("twitter_key"),
consumer_secret = Sys.getenv("twitter_key_secret"),
access_token =Sys.getenv("twitter_access_token"),
access_secret =Sys.getenv("twitter_access_token_secret"))
# //TASK //////////////////////////////////////////////////////////////////////
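The Sys.getenv() calls above assume the four credentials are stored as environment variables, e.g., in your .Renviron file. A hypothetical sketch of how they might be set and verified:
# In ~/.Renviron (one entry per line; values below are placeholders):
#   twitter_key=YOUR_CONSUMER_KEY
#   twitter_key_secret=YOUR_CONSUMER_SECRET
#   twitter_access_token=YOUR_ACCESS_TOKEN
#   twitter_access_token_secret=YOUR_ACCESS_TOKEN_SECRET
# After restarting R, confirm each value is visible (returns TRUE if set):
nzchar(Sys.getenv("twitter_key"))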
Next, let’s define a function that downloads Tweets, cleans them, and applies sentiment analysis to them.
# Extract neighborhood names from nb_shp's NAME column and store it in nb_names object.
nb_names <- nb_shp$NAME
# Define a search function
get_twt <- function(term){
# =========== NO MODIFICATION ZONE STARTS HERE ===============================
term_mod <- paste0("\"", term, "\"")
# =========== NO MODIFY ZONE ENDS HERE ========================================
# TASK ////////////////////////////////////////////////////////////////////////
# 1. Use search_tweets() function to get Tweets.
# Use term_mod as the search keyword to get Tweets.
# Set n to a number large enough to get all Tweets from the past 7 days
# Set geocode argument such that the search is made with 50 mile radius from 33.76, -84.41
# Be sure to exclude retweets.
# You may need to enable the function to automatically wait if rate limit is exceeded.
# I recommend using suppressWarnings() to suppress warnings.
# Make sure you assign the output from search_tweets() to an object named 'out'
out <- search_tweets(q = term_mod,
n = 1000,
lang = "en",
geocode = "33.76,-84.41,50mi",
include_rts = FALSE,
retryonratelimit = TRUE) # **YOUR CODE HERE..**
# //TASK //////////////////////////////////////////////////////////////////////
# =========== NO MODIFICATION ZONE STARTS HERE ===============================
out <- out %>%
select(created_at, id, id_str, full_text, geo, coordinates, place, text)
# Basic cleaning
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&|<|>"
out <- out %>%
mutate(text = str_replace_all(text, replace_reg, ""),
text = gsub("@", "", text),
text = gsub("\n\n", "", text))
# Sentiment analysis
# Also add a column for neighborhood names
if (nrow(out)>0){
out <- out %>%
mutate(sentiment_ai = sentiment_score(out$text),
sentiment_an = analyzeSentiment(text)$SentimentQDAP,
nb = term)
print(paste0("Search term:", term))
} else {
return(out)
}
return(out)
}
# =========== NO MODIFY ZONE ENDS HERE ========================================
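Before looping over all the neighborhood names, it can help to test the function on a single name; the sketch below uses 'Inman Park' purely as an example term and will consume a small part of your rate limit.
# Sketch: run the search/cleaning/scoring pipeline for one neighborhood
test <- get_twt("Inman Park")
nrow(test)  # number of Tweets returned (0 is possible)
if (nrow(test) > 0) head(test$sentiment_ai)  # peek at a few sentiment scores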
Let’s apply the function to the neighborhood names. Note that this code chunk may take more than 15 minutes if you’ve already spent some (or all) of your rate limit.
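If you want to see how much of the search quota remains before kicking off the loop, rtweet provides a rate_limit() helper; the sketch below assumes the pre-1.0 rtweet interface used elsewhere in this assignment.
# Sketch: check remaining requests for the standard search endpoint
rl <- rate_limit(twitter_token, query = "search/tweets")
rl$remaining  # requests left in the current 15-minute window
rl$reset      # time until the window resets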
# =========== NO MODIFICATION ZONE STARTS HERE ===============================
# Apply the function to get Tweets
twt <- map(nb_names, ~get_twt(.x))
## [1] "Search term:Fairburn"
## [1] "Search term:Brandon"
## [1] "Search term:Poncey-Highland"
## [1] "Search term:Inman Park"
## [1] "Search term:Edgewood"
## [1] "Search term:Lakewood"
## [1] "Search term:Cabbagetown"
## [1] "Search term:Reynoldstown"
## [1] "Search term:Campbellton Road"
## [1] "Search term:Southwest"
## [1] "Search term:Adams Park"
## [1] "Search term:Ben Hill"
## [1] "Search term:Underwood Hills"
## [1] "Search term:Riverside"
## [1] "Search term:Bolton"
## [1] "Search term:Rockdale"
## [1] "Search term:Lenox"
## [1] "Search term:Kingswood"
## [1] "Search term:Margaret Mitchell"
## [1] "Search term:Cross Creek"
## [1] "Search term:Memorial Park"
## [1] "Search term:Pittsburgh"
## [1] "Search term:Peoplestown"
## [1] "Search term:Summerhill"
## [1] "Search term:Castleberry Hill"
## [1] "Search term:Sherwood Forest"
## [1] "Search term:Loring Heights"
## [1] "Search term:Mays"
## [1] "Search term:Grove Park"
## [1] "Search term:Adamsville"
## [1] "Search term:Cascade Heights"
## [1] "Search term:Westview"
## [1] "Search term:West End"
# =========== NO MODIFY ZONE ENDS HERE ========================================
The downloaded Tweets need some cleaning / reorganizing, including the following steps.
First, drop the elements of twt that contain no Tweets. These are neighborhoods with no Tweets referring to them. Hint: you can create a logical vector that has FALSE if the corresponding element in twt has no Tweets and TRUE otherwise.
twt_rows <- twt[sapply(twt, nrow) > 0]
Second, the coordinates column is currently a list-column. Unnest this column so that lat, long, and type (i.e., the column names inside coordinates) become separate columns. You can use the unnest() function.
library(data.table)
##
## Attaching package: 'data.table'
## The following object is masked from 'package:envnames':
##
## address
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
twt_unnest <- twt_rows %>% rbindlist(use.names = TRUE, fill = FALSE) %>% unnest("coordinates")
Third, group the Tweets by the nb column and use summarise() to calculate the mean sentiment scores. Also add a column n that contains the number of rows in each group using the n() function.
avg <- group_by(twt_unnest, nb)
twt_avg <- avg %>% summarise(mean_sentiment_AI = mean(sentiment_ai), mean_sentiment_an = mean(sentiment_an), n = n())
Finally, rename the first column of twt_avg to NAME and join it back to nb_shp to create twt_poly, so that the subsequent code runs smoothly.
colnames(twt_avg)[1] <- "NAME"
twt_poly <- nb_shp %>% left_join(twt_avg, by = "NAME")
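As a quick sanity check (a sketch using the objects created above), you can count how many of the 248 polygons actually received a sentiment score after the join; neighborhoods with no Tweets have NA in the joined columns.
sum(!is.na(twt_poly$n))   # neighborhoods with at least one Tweet
mean(!is.na(twt_poly$n))  # proportion of neighborhoods covered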
Now that we have collected Tweets, calculated sentiment scores, and merged them back to the original shapefile, we can map them to see the spatial distribution and draw plots to see inter-variable relationships.
First, let’s draw two interactive choropleth maps, one using sentiment score as the color and the other one using the number of Tweets as the color. Use tmap_arrange() function to display the two maps side-by-side.
tmap_mode("view")
## tmap mode set to interactive viewing
twt_poly <- st_as_sf(twt_poly, crs = 4326)
score <- tm_shape(twt_poly) +
  tm_polygons("mean_sentiment_AI", style = "equal", palette = "PiYG")
n <- tm_shape(twt_poly) +
  tm_polygons("n", style = "equal", palette = "PiYG")
tmap_arrange(score, n, ncol=2)
## Variable(s) "mean_sentiment_AI" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
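If you would rather let the palette span the full range of observed scores instead of being centered at zero, tmap lets you pass midpoint = NA, as the message above suggests (a sketch modifying only the score map).
score_full <- tm_shape(twt_poly) +
  tm_polygons("mean_sentiment_AI", style = "equal", palette = "PiYG", midpoint = NA)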
Second, use the ggplot2 package to draw a scatterplot with the number of Tweets for each neighborhood on the X-axis and the sentiment score on the Y-axis. Also perform a correlation analysis between the number of Tweets per neighborhood and the sentiment score, using either the cor.test() function or the ggpubr::stat_cor() function.
ggplot(twt_avg, aes(x = n, y = mean_sentiment_AI)) +
  geom_point() +
  labs(title = "Number of Tweets vs. Sentiment Score", x = "Number of Tweets", y = "Sentiment Score") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
cor.test(twt_avg$n, twt_avg$mean_sentiment_AI)
##
## Pearson's product-moment correlation
##
## data: twt_avg$n and twt_avg$mean_sentiment_AI
## t = -1.3805, df = 31, p-value = 0.1773
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5394051 0.1118910
## sample estimates:
## cor
## -0.2406626
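Alternatively, the same Pearson correlation can be annotated directly on the scatterplot with ggpubr::stat_cor() (a sketch; it assumes the ggpubr package is installed).
library(ggpubr)
ggplot(twt_avg, aes(x = n, y = mean_sentiment_AI)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  stat_cor(method = "pearson") +  # prints R and p-value on the plot
  labs(x = "Number of Tweets", y = "Sentiment Score")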
Using the map and plot you created above (as well as using your inspection of the data), answer the following questions.
Q. What’s the proportion of neighborhoods with one or more Tweets?
The original shapefile has 248 neighborhoods, and after cleaning, 33 neighborhoods have at least one Tweet (matching the 33 search terms printed above and the 31 degrees of freedom in the correlation test), so the proportion is roughly 13%.
Q. Do you see any pattern to neighborhoods with/without Tweets? Is there anything that can help us guess how likely a given neighborhood will have Tweets?
There doesn’t seem to be an obvious pattern. I think census data could help answer this question: we could map the average age of residents and see whether age correlates with the number of Tweets. It is possible that some areas have no Tweets because their residents are older and do not use Twitter.
Q. (If you’ve observed a relationship between sentiment score and the number of Tweets) Why do you think this relationship exists?
The plot shows a negative correlation: neighborhoods with more Tweets tend to have lower sentiment scores. This could be because when an event occurs that causes a negative reaction, more people are likely to tweet about it; Tweets that arrive in large volumes tend to be about something that is upsetting people.
Q. The neighborhood ‘Rockdale’ has many Tweets mentioning its name. Does high volume of Tweets make sense? Why do you think this occurred?
This could be due to many reasons. Maybe Rockdale has a younger population and therefore more Tweets. It could also be due to the recent elections, or a surprising incident may have occurred that caused residents to tweet.
Q. What do you think are the strengths and shortcomings of this method (i.e., using Twitter & neighborhood names to evaluate sentiments around each neighborhood)?
One strength is that it collects real-time data. People usually tweet about an issue while it’s happening, so a Tweet captures the user’s immediate reaction. Additionally, there are no prompts or specific questions being answered; a Tweet is an unfiltered, unprompted thought, which I feel skews the data less. However, there are shortcomings: only 33 neighborhoods had any data attached, so it would be difficult to get information on most neighborhoods this way. Additionally, many of those neighborhoods had only one Tweet, which skews the averages.
Q. Can you think of a better way to define neighborhoods and collect Tweets that can better represent the sentiment of neighborhoods?
Not everyone tweets the name of their neighborhood when talking about it. Instead of collecting Tweets based on neighborhood names, it could be better to collect data based on specific prominent places within each neighborhood. Additionally, some Twitter users have location sharing turned on; those geotagged Tweets could be collected and then filtered with keyword lists to find Tweets about a neighborhood that do not mention it by name.