Introduction

In this assignment, we will download Tweets that contain the names of neighborhoods in Atlanta, apply sentiment analysis to the Tweets, and map/plot the sentiments associated with each neighborhood. Specifically, you will perform the following steps:

Step 1. Download and read a shapefile that contains neighborhood boundaries and their names. Step 2. Initiate a deep learning-based package for sentiment analysis called “sentiment.ai” (if you have problems with this package, you can use a different package). Step 3. Loop through the names of neighborhoods in Atlanta to collect Tweets. Step 4. Clean and filter the collected Tweets. Step 5. Analyze the Tweets.

As always, load packages first.

library(rtweet)
library(tidyverse)
library(sf)
library(sentiment.ai)
library(SentimentAnalysis)
library(ggplot2)
library(here)
library(tmap)
library(twitteR)
library(jsonlite)
library(dplyr)
library(data.table)

Step 1. Neighborhood Shapefile

Go to this webpage and download the shapefile. Once downloaded, read the data into your current R environment.

# TASK ////////////////////////////////////////////////////////////////////////

# Read neighborhood shapefile
nb_shp <- st_read("Atlanta_Neighborhoods.shp")
## Reading layer `Atlanta_Neighborhoods' from data source 
##   `C:\Users\CP8883\CP8883\Twitter\Atlanta_Neighborhoods.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 248 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.55085 ymin: 33.64799 xmax: -84.28962 ymax: 33.88687
## Geodetic CRS:  WGS 84
# //TASK //////////////////////////////////////////////////////////////////////
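As an optional sanity check, a quick plot of the geometry confirms the shapefile was read correctly:

# Draw just the neighborhood boundaries
plot(st_geometry(nb_shp))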

Step 2. Initiate Sentiment.ai

If you have issues using this package, you can use the other package introduced in class, SentimentAnalysis.
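For reference, the SentimentAnalysis fallback needs no separate initialization; analyzeSentiment() works directly on a character vector. A minimal sketch with placeholder text:

# Dictionary-based scores; SentimentQDAP is the column used later in this tutorial
analyzeSentiment(c("I love this neighborhood!",
                   "The traffic here is terrible."))$SentimentQDAP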

# TASK ////////////////////////////////////////////////////////////////////////

# Initiate sentiment.ai 

sentiment.ai::init_sentiment.ai(envname = "r-sentiment-ai", method = "conda")
## <tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject object at 0x0000024E18F08850>
# //TASK //////////////////////////////////////////////////////////////////////
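To confirm the model initialized correctly, you can score a couple of throwaway sentences (a quick sanity check; the example text is just a placeholder):

# Positive text should score near +1 and negative text near -1
sentiment.ai::sentiment_score(c("I love this neighborhood!",
                                "The traffic here is terrible."))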

Step 3. Looping through neighborhood names to get Tweets

Prepare to use the Twitter API by specifying the arguments of the create_token() function with your credentials.

# TASK ////////////////////////////////////////////////////////////////////////

# Whatever name you assigned to your created app
# appname <- "UrbanAnalytics_tutorial"

# Create a token named "twitter_token".
# The keys used should be replaced by your own keys obtained by creating the app.
# twitter_token <- create_token(
#   app = appname,
#   consumer_key = Sys.getenv("twitter_key"),
#   consumer_secret = Sys.getenv("twitter_key_secret"),
#   access_token = Sys.getenv("twitter_access_token"),
#   access_secret = Sys.getenv("twitter_access_token_secret"))

# //TASK //////////////////////////////////////////////////////////////////////
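If you have not stored your credentials yet, one option is to set them as environment variables so that the Sys.getenv() calls above can find them. This is only a sketch: the placeholder values must be replaced with your own keys, and for persistence across sessions you would put the same name=value pairs in your ~/.Renviron file instead.

# Set the credentials for the current session only (replace the placeholders)
# Sys.setenv(twitter_key = "YOUR_CONSUMER_KEY",
#            twitter_key_secret = "YOUR_CONSUMER_SECRET",
#            twitter_access_token = "YOUR_ACCESS_TOKEN",
#            twitter_access_token_secret = "YOUR_ACCESS_TOKEN_SECRET")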

Next, let’s define a function that downloads Tweets, cleans them, and applies sentiment analysis to them.

# Extract neighborhood names from nb_shp's NAME column and store it in nb_names object.
nb_names <- nb_shp$NAME

# Define a search function that downloads, cleans, and scores Tweets for one search term
# get_twt <- function(term){
#   # =========== NO MODIFICATION ZONE STARTS HERE ===============================
#   term_mod <- paste0("\"", term, "\"")
#   # =========== NO MODIFY ZONE ENDS HERE ========================================
#
#   # TASK ////////////////////////////////////////////////////////////////////////
#   # 1. Use the search_tweets() function to get Tweets.
#   #    Use term_mod as the search keyword.
#   #    Set n to a number large enough to get all Tweets from the past 7 days.
#   #    Set the geocode argument so that the search covers a 50-mile radius around 33.76, -84.41.
#   #    Be sure to exclude retweets.
#   #    You may need to let the function wait automatically if the rate limit is exceeded.
#   #    I recommend using suppressWarnings() to suppress warnings.
#   #    Make sure you assign the output of search_tweets() to an object named 'out'.
#
#   out <- search_tweets(q = term_mod,
#                        n = 180,
#                        lang = "en",
#                        geocode = "33.76,-84.41,50mi",
#                        include_rts = FALSE,
#                        retryonratelimit = TRUE) %>%
#     suppressWarnings()
#   # //TASK //////////////////////////////////////////////////////////////////////
#
#   # =========== NO MODIFICATION ZONE STARTS HERE ===============================
#   out <- out %>%
#     select(created_at, id, id_str, full_text, geo, coordinates, place, text)
#
#   # Basic cleaning: strip URLs, HTML entities, @ signs, and double line breaks
#   replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"
#
#   out <- out %>%
#     mutate(text = str_replace_all(text, replace_reg, ""),
#            text = gsub("@", "", text),
#            text = gsub("\n\n", "", text))
#
#   # Sentiment analysis; also add a column for the neighborhood name
#   if (nrow(out) > 0){
#     out <- out %>%
#       mutate(sentiment_ai = sentiment_score(out$text),
#              sentiment_an = analyzeSentiment(text)$SentimentQDAP,
#              nb = term)
#     print(paste0("Search term: ", term))
#   } else {
#     return(out)
#   }
#
#   return(out)
# }
# =========== NO MODIFY ZONE ENDS HERE ========================================
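Before looping over all of the neighborhood names, it can help to test the function on a single name. This is only a sketch: it is commented out here because, like the loop below, it calls the live API, and test_twt is just an illustrative name.

# test_twt <- get_twt(nb_names[1])
# dim(test_twt)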

Let’s apply the function to every neighborhood name. Note that this code chunk may take more than 15 minutes if you’ve already spent some (or all) of your rate limit.

# =========== NO MODIFICATION ZONE STARTS HERE ===============================
# Apply the function to get Tweets
#twt <- map(nb_names, ~get_twt(.x))
# =========== NO MODIFY ZONE ENDS HERE ========================================
# Save the downloaded Tweets so they can be reloaded without calling the API again
# twt %>% write_rds("C:/Users/CP8883/CP8883/Twitter/twt.rds")

Step 4. Clean and filter the collected Tweets.

The downloaded Tweets need some cleaning and reorganizing, namely:

  1. Drop empty elements from the list twt. These are neighborhoods with no Tweets referring to them. Hint: you can create a logical vector that is FALSE where the corresponding element of twt has no Tweets and TRUE otherwise.

  2. The coordinates column is currently a list-column. Unnest this column so that lat, long, and type (i.e., the column names inside coordinates) become separate columns. You can use the unnest() function.

  3. Calculate the average sentiment score for each neighborhood. You can group_by() the nb column in the twt object and use summarise() to calculate the means. Also add a column n that contains the number of rows in each group, using the n() function.

  4. Join the cleaned Tweet data back to the neighborhood shapefile. Use the neighborhood name as the join key. Make sure the result of the join is assigned to an object called twt_poly so that the subsequent code runs smoothly.

# Read the saved Tweets back in
twtrds <- read_rds(here("twt.rds"))
# 1. Drop list elements (neighborhoods) that returned no Tweets
twtclean <- twtrds[which(lapply(twtrds, nrow) != 0)]
# Stack the remaining elements into a single table
twtclean2 <- rbindlist(twtclean, fill = FALSE, idcol = NULL)

# 2. Unnest the coordinates list-column into long, lat, and type columns
twt_unnest <- unnest(twtclean2, cols = c("coordinates"))

head(twt_unnest)
## # A tibble: 6 × 13
##   created_at               id id_str full_…¹ geo    long   lat type  place text 
##   <dttm>                <dbl> <chr>  <chr>   <lis> <dbl> <dbl> <chr> <lis> <chr>
## 1 2022-11-17 22:56:24 1.59e18 15934… "We ch… <lgl>  NA    NA   <NA>  <df>  "We …
## 2 2022-11-24 13:54:18 1.60e18 15958… "Just … <df>  -84.6  33.6 Point <df>  "Jus…
## 3 2022-11-23 13:05:43 1.60e18 15954… "Close… <df>  -84.5  33.7 Point <df>  "Clo…
## 4 2022-11-23 13:03:51 1.60e18 15954… "Click… <df>  -84.6  33.6 Point <df>  "Cli…
## 5 2022-11-23 12:00:43 1.60e18 15954… "Close… <df>  -84.5  33.7 Point <df>  "Clo…
## 6 2022-11-23 03:37:38 1.60e18 15953… "Fuck.… <df>   NA    NA   <NA>  <df>  "Fuc…
## # … with 3 more variables: sentiment_ai <dbl>, sentiment_an <dbl>, nb <chr>,
## #   and abbreviated variable name ¹​full_text
# 3. Average sentiment scores and Tweet counts per neighborhood
twtclean2 <- twt_unnest %>% 
  group_by(nb) %>% 
  mutate(n = n()) %>% 
  mutate(pct_nb = round(n() / nrow(twt_unnest), digits = 3) * 100)

twt_summary <- twtclean2 %>% summarise(
  sentiment_ai = round(mean(sentiment_ai), digits = 3),
  sentiment_an = round(mean(sentiment_an), digits = 3),
  n = mean(n),
  pct_nb = mean(pct_nb))

# 4. Join the summarised Tweet data back to the neighborhood polygons
nb_shp <- nb_shp %>% 
  rename("nb" = "NAME")

twt_poly <- merge(nb_shp, twt_summary, by = "nb")
twt_poly_summary <- twt_poly
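Note that merge() with its default all = FALSE keeps only the neighborhoods that have at least one Tweet. If you also want to map the neighborhoods without Tweets (they will show as NA), a left-join sketch such as the following keeps all 248 polygons; twt_poly_all is just an illustrative name.

# Keep every neighborhood polygon; those without Tweets get NA scores
twt_poly_all <- merge(nb_shp, twt_summary, by = "nb", all.x = TRUE)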

Step 5. Analysis

Now that we have collected Tweets, calculated sentiment scores, and merged them back to the original shapefile, we can map them to see the spatial distribution and draw plots to see relationships between variables.

First, let’s draw two interactive choropleth maps, one colored by sentiment score and the other by the number of Tweets. Use the tmap_arrange() function to display the two maps side by side.

tmap_mode("view")
## tmap mode set to interactive viewing
# Two synced layers: average sentiment score and Tweet count per neighborhood
nsentimentp <- tm_shape(twt_poly_summary) +
  tm_polygons(col = c("sentiment_ai", "n"), midpoint = 0)

nsentimentp
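The code above passes two variables to a single tm_polygons() call, which tmap renders as synced facets. If you want to follow the prompt literally with tmap_arrange(), a sketch (map_sentiment and map_count are just placeholder names):

map_sentiment <- tm_shape(twt_poly_summary) + tm_polygons(col = "sentiment_ai", midpoint = 0)
map_count     <- tm_shape(twt_poly_summary) + tm_polygons(col = "n")
tmap_arrange(map_sentiment, map_count, sync = TRUE)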

Second, use the ggplot2 package to draw a scatterplot with the number of Tweets for each neighborhood on the X-axis and the sentiment score on the Y-axis. Also perform a correlation analysis between the number of Tweets for each neighborhood and the sentiment score, using either the cor.test() function or the ggpubr::stat_cor() function.

No code is provided as a template.

Feel free to write your own code to perform the tasks listed above.

library(ggplot2)

scat_plot <- ggplot(data = twt_poly_summary) +
  geom_point(mapping = aes(x=n, y=sentiment_ai, color = nb)) + 
  geom_smooth(mapping = aes(x=n, y=sentiment_ai), method = "lm")+
  labs(x = "Tweets by neighborhood",
       y = "Sentiment score",
       color = "Neighborhood",
       title = "Sentiment Analysis Score from Atlanta neighborhood Tweets") +
  theme_bw()

plotly::ggplotly(scat_plot)
## `geom_smooth()` using formula 'y ~ x'
library(ggpubr)

 
 twt_cor_plot<-  ggplot(data = twt_poly_summary, mapping = aes(x=n, y=sentiment_ai)) +
   geom_point(aes(color = nb)) +
   geom_smooth(mapping = aes(x=n, y=sentiment_ai), method = "lm")+
   ggpubr::stat_cor(method = "pearson")+
   theme_light()
 
twt_cor_plot
## `geom_smooth()` using formula 'y ~ x'
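If you prefer a standalone test to the on-plot annotation, cor.test() on the same two columns reports the Pearson correlation along with a confidence interval and p-value:

# Correlation between Tweet count and average sentiment score per neighborhood
cor.test(twt_poly_summary$n, twt_poly_summary$sentiment_ai, method = "pearson")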

Using the map and plot you created above (as well as your inspection of the data), answer the following questions.

Q. What’s the proportion of neighborhoods with one or more Tweets? A. About 23% of neighborhoods have one or more Tweets (56/248).

Q. Do you see any pattern to neighborhoods with/without Tweets? Is there anything that can help us guess how likely a given neighborhood is to have Tweets? A. Midtown, Downtown, and Rockdale seem to have the highest number of Tweets in this 7-day window. This might be due to population density or geotagging.

Q. (If you’ve observed a relationship between sentiment score and the number of Tweets) Why do you think there is a relationship between sentiment score and the number of Tweets? A. There is a negative association between the number of Tweets per neighborhood and the Tweet sentiment score, meaning more Tweets are associated with more negative sentiment.

Q. The neighborhood ‘Rockdale’ has many Tweets mentioning its name. Does the high volume of Tweets make sense? Why do you think this occurred? A. It is unclear to me why Rockdale stands out. It might be useful to create a word cloud of the Tweets and see what trends appear, as sketched below.
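One way to follow up on that idea is to tokenize the Rockdale Tweets and count the most frequent words. The sketch below assumes the tidytext package is installed (it is not in the library list above):

library(tidytext)
# Most frequent non-stopwords in Tweets matched to Rockdale
twt_unnest %>%
  filter(nb == "Rockdale") %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  head(20)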

Q. What do you think are the strengths and shortcomings of this method (i.e., using Twitter and neighborhood names to evaluate sentiments around each neighborhood)? A. The strength of this method is being able to pull neighborhood trends from what people are posting online in real time. Since we are dealing with words and sentiment analysis, I would have found it more useful to visualize the data as a word cloud or word network rather than a scatterplot.

Q. Can you think of a better way to define neighborhoods and collect Tweets that can better represent the sentiment of neighborhoods? A. As previously stated, I think doing a word analysis might prove more useful than mapping or showing the data in a scatterplot. Comparing neighborhoods using hashtags might prove more meaningful.