There will also be multiple 'NO MODIFICATION ZONES'. Do not modify code inside a No Modification Zone.
You will need to knit the document, publish it on RPubs, and submit the link. If there is any question about this template, do not hesitate to reach out to Bonwoo.
In this assignment, we will download Tweets that contain the names of neighborhoods in Atlanta, apply sentiment analysis to the Tweets, and map/plot the sentiments associated with the neighborhoods. Specifically, you will perform the following steps:
Step 1. Download and read a shapefile that contains neighborhood boundaries and their names. Step 2. Initiate a deep learning-based package for sentiment analysis called “sentiment.ai” (if you have problems with this package, you can use a different package). Step 3. Loop through the names of neighborhoods in Atlanta to collect Tweets. Step 4. Clean and filter the collected Tweets. Step 5. Analyze the Tweets.
As always, load packages first.
library(rtweet)
library(tidyverse)
library(sf)
# install.packages("sentiment.ai")
library(sentiment.ai)
# install.packages("SentimentAnalysis")
library(SentimentAnalysis)
library(ggplot2)
library(here)
library(tmap)
Go to this webpage and download the shapefile from there. Once downloaded, read the data into your current R environment.
# TASK ////////////////////////////////////////////////////////////////////////
# Read neighborhood shapefile
nb_shp <- st_read("D:/Georgia Tech/Spec topic_/major ass_5/Atlanta_Neighborhoods")
## Reading layer `Atlanta_Neighborhoods' from data source
## `D:\Georgia Tech\Spec topic_\major ass_5\Atlanta_Neighborhoods'
## using driver `ESRI Shapefile'
## Simple feature collection with 248 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -84.55085 ymin: 33.64799 xmax: -84.28962 ymax: 33.88687
## Geodetic CRS: WGS 84
# //TASK //////////////////////////////////////////////////////////////////////
If you have issues using the sentiment.ai package, you can use the other package introduced in class, SentimentAnalysis.
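For reference, here is a minimal sketch of the SentimentAnalysis fallback (assuming the package is installed; the example sentences are made up):
library(SentimentAnalysis)
# analyzeSentiment() returns several dictionary-based scores per input string;
# the SentimentQDAP column is the one used later in this assignment.
example_text <- c("I love this neighborhood", "The traffic here is terrible")
analyzeSentiment(example_text)$SentimentQDAP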
# TASK ////////////////////////////////////////////////////////////////////////
# require(sentiment.ai)
# require(SentimentAnalysis)
# require(sentimentr)
# Initiate sentiment.ai
init_sentiment.ai(envname = "r-sentiment-ai", method = "conda") # feel free to change these arguments if you need to.
## <tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject object at 0x000001DAAD8413A0>
# //TASK //////////////////////////////////////////////////////////////////////
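Once the initialization above succeeds, a quick sanity check can confirm the model is ready (a sketch; the sentences are made up):
# sentiment_score() from sentiment.ai returns a numeric score per input string,
# with positive values indicating positive sentiment.
sentiment_score(c("I love Piedmont Park", "The traffic on the connector is awful"))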
Prepare to use the Twitter API by specifying the arguments of the create_token() function with your credentials.
# TASK ////////////////////////////////////////////////////////////////////////
# whatever name you assigned to your created app
appname <- "UrbanAnalytics_tutorial"
# create token named "twitter_token"
# the keys used should be replaced by your own keys obtained by creating the app
twitter_token <- create_token(
app = appname,
consumer_key = Sys.getenv("twitter_key"),
consumer_secret = Sys.getenv("twitter_key_secret"),
access_token = Sys.getenv("twitter_access_token"),
access_secret = Sys.getenv("twitter_access_token_secret"))
# //TASK //////////////////////////////////////////////////////////////////////
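The Sys.getenv() calls above assume the keys are stored as environment variables. One way to set this up (a sketch; the values below are placeholders) is to add them to your ~/.Renviron file and restart R:
# Lines to add to ~/.Renviron (no quotes, no spaces around '='):
# twitter_key=YOUR_CONSUMER_KEY
# twitter_key_secret=YOUR_CONSUMER_SECRET
# twitter_access_token=YOUR_ACCESS_TOKEN
# twitter_access_token_secret=YOUR_ACCESS_TOKEN_SECRET

# After restarting R, this should return your key instead of an empty string:
Sys.getenv("twitter_key")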
Next, let’s define a function that downloads Tweets, cleans them, and applies sentiment analysis to them.
# Extract neighborhood names from nb_shp's NAME column and store it in nb_names object.
nb_names <- nb_shp$NAME
# Define a search function
get_twt <- function(term){
# =========== NO MODIFICATION ZONE STARTS HERE ===============================
term_mod <- paste0("\"", term, "\"")
# =========== NO MODIFY ZONE ENDS HERE ========================================
# TASK ////////////////////////////////////////////////////////////////////////
# 1. Use the search_tweets() function to get Tweets.
# Use term_mod as the search keyword.
# Set n to a number large enough to get all Tweets from the past 7 days.
# Set the geocode argument so that the search is made within a 50-mile radius of 33.76, -84.41.
# Be sure to exclude retweets.
# You may need to enable the function to automatically wait if the rate limit is exceeded.
# I recommend using suppressWarnings() to suppress warnings.
# Make sure you assign the output from search_tweets() to an object named 'out'.
out <- search_tweets(q = term_mod,
n = 1000,
lang = "en",
geocode = "33.76,-84.41,50mi",
retryonratelimit = TRUE,
include_rts = FALSE)
# //TASK //////////////////////////////////////////////////////////////////////
# =========== NO MODIFICATION ZONE STARTS HERE ===============================
out <- out %>%
select(created_at, id, id_str, full_text, geo, coordinates, place, text)
# Basic cleaning
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&|<|>"
out <- out %>%
mutate(text = str_replace_all(text, replace_reg, ""),
text = gsub("@", "", text),
text = gsub("\n\n", "", text))
# Sentiment analysis
# Also add a column for neighborhood names
if (nrow(out)>0){
out <- out %>%
mutate(sentiment_ai = sentiment_score(out$text),
sentiment_an = analyzeSentiment(text)$SentimentQDAP,
nb = term)
print(paste0("Search term:", term))
} else {
return(out)
}
return(out)
}
# =========== NO MODIFY ZONE ENDS HERE ========================================
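Before looping over all neighborhood names, it can help to test the function on a single name (a sketch; requires a valid token and remaining rate limit):
# Hypothetical quick check on one neighborhood name
test_twt <- get_twt("Midtown")
nrow(test_twt)   # 0 rows means no Tweets mentioned the term in the past 7 days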
Let’s apply the function to the neighborhood names to download their Tweets. Note that this code chunk may take more than 15 minutes if you’ve already spent some (or all) of your rate limit.
# =========== NO MODIFICATION ZONE STARTS HERE ===============================
# Apply the function to get Tweets
twt <- map(nb_names, ~get_twt(.x))
## [1] "Search term:Fairburn"
## [1] "Search term:Brandon"
## [1] "Search term:Poncey-Highland"
## [1] "Search term:Inman Park"
## [1] "Search term:Edgewood"
## [1] "Search term:Lakewood"
## [1] "Search term:Cabbagetown"
## [1] "Search term:Reynoldstown"
## [1] "Search term:Campbellton Road"
## [1] "Search term:Southwest"
## [1] "Search term:Adams Park"
## [1] "Search term:Ben Hill"
## [1] "Search term:Underwood Hills"
## [1] "Search term:Riverside"
## [1] "Search term:Bolton"
## [1] "Search term:Rockdale"
## [1] "Search term:Lenox"
## [1] "Search term:Kingswood"
## [1] "Search term:Margaret Mitchell"
## [1] "Search term:Cross Creek"
## [1] "Search term:Memorial Park"
## [1] "Search term:Pittsburgh"
## [1] "Search term:Peoplestown"
## [1] "Search term:Summerhill"
## [1] "Search term:Castleberry Hill"
## [1] "Search term:Sherwood Forest"
## [1] "Search term:Loring Heights"
## [1] "Search term:Mays"
## [1] "Search term:Grove Park"
## [1] "Search term:Adamsville"
## [1] "Search term:Cascade Heights"
## [1] "Search term:Westview"
## [1] "Search term:West End"
## [1] "Search term:Fort Valley"
## [1] "Search term:Greenbriar"
## [1] "Search term:South Atlanta"
## [1] "Search term:Downtown"
## [1] "Search term:Georgia Tech"
## [1] "Search term:Home Park"
## [1] "Search term:Midtown"
## [1] "Search term:Brookwood"
## [1] "Search term:Virginia Highland"
## [1] "Search term:Kirkwood"
## [1] "Search term:Pine Hills"
## [1] "Search term:Carroll Heights"
## [1] "Search term:Brookhaven"
## [1] "Search term:Druid Hills"
## [1] "Search term:Chattahoochee"
## [1] "Search term:Oakland"
## [1] "Search term:Atlantic Station"
## [1] "Search term:East Lake"
## [1] "Search term:East Atlanta"
## [1] "Search term:Emory"
# =========== NO MODIFY ZONE ENDS HERE ========================================
The downloaded Tweets need some cleaning and reorganizing, including the following steps:
- Drop empty elements from the list twt. These are neighborhoods with no Tweets referring to them. Hint: you can create a logical vector that is FALSE if the corresponding element of twt has no Tweets and TRUE otherwise.
- The coordinates column is currently a list-column. Unnest this column so that lat, long, and type (i.e., the column names inside coordinates) become separate columns. You can use the unnest() function.
- Calculate the average sentiment score for each neighborhood. You can group_by() the nb column in the twt objects and use summarise() to calculate the means. Also add a column n that contains the number of rows in each group using the n() function.
- Join the cleaned Tweet data back to the neighborhood shapefile. Use the neighborhood name as the join key. Make sure that the result of the join is assigned to an object called twt_poly so that the subsequent code runs smoothly.
# No code is provided as a template. Feel free to write your own code to perform the tasks listed above.
# MAKE SURE THAT THE LAST RESULT IS ASSIGNED TO AN OBJECT NAMED `twt_poly`.
library("data.table")
## Warning: package 'data.table' was built under R version 4.2.1
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(dplyr)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following object is masked from 'package:here':
##
## here
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
twts <- twt[which(lapply(twt, nrow)!=0)]
twts <- rbindlist(twts , fill = FALSE, idcol = NULL)
typeof(twts)
## [1] "list"
twts_unnest <- unnest(twts, cols= c("coordinates"))
twts_clean <- twts_unnest %>% group_by(nb) %>%
dplyr::summarise(sentiment_ai = mean(sentiment_ai),
sentiment_an = mean(sentiment_an),
n = n()
)
names(twts_clean)[names(twts_clean) == 'nb'] <- 'NAME'
twt_poly <- merge(x= nb_shp, y = twts_clean, by= 'NAME')
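For comparison, the same cleaning steps can be written with tidyverse verbs only. This is a sketch that assumes the column names produced by get_twt() above and mirrors the inner-join behavior of merge():
twt_poly_alt <- twt %>%
  purrr::discard(~ nrow(.x) == 0) %>%                 # drop neighborhoods with no Tweets
  dplyr::bind_rows() %>%                              # stack the remaining data frames
  tidyr::unnest(cols = coordinates) %>%               # unnest the coordinates list-column
  dplyr::group_by(nb) %>%                             # one row per neighborhood
  dplyr::summarise(sentiment_ai = mean(sentiment_ai),
                   sentiment_an = mean(sentiment_an),
                   n = dplyr::n()) %>%
  dplyr::inner_join(nb_shp, ., by = c("NAME" = "nb")) # keep only neighborhoods with Tweets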
Now that we have collected Tweets, calculated sentiment scores, and merged them back to the original shapefile, we can map them to see the spatial distribution and draw plots to see inter-variable relationships.
First, let’s draw two interactive choropleth maps, one colored by the sentiment score and the other by the number of Tweets. Use the tmap_arrange() function to display the two maps side by side.
tmap_mode("view")
## tmap mode set to interactive viewing
a <- tm_basemap("OpenStreetMap")+tm_shape(twt_poly) +
tm_polygons(col = "sentiment_ai", style = "quantile")
a
## Variable(s) "sentiment_ai" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
b <- tm_basemap("OpenStreetMap")+ tm_shape(twt_poly) +
tm_polygons(col = "n", style="quantile")
tmap_arrange(a,b, sync = TRUE)
## Variable(s) "sentiment_ai" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
## Variable(s) "sentiment_ai" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
# No code is provided as a template.
# Feel free to write your own code to perform the tasks listed above.
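The messages above note that tmap centers the diverging palette at 0 because sentiment_ai contains both positive and negative values. If you prefer the full color range instead, a minimal sketch is to set midpoint = NA:
tm_basemap("OpenStreetMap") +
  tm_shape(twt_poly) +
  tm_polygons(col = "sentiment_ai", style = "quantile", midpoint = NA)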
Second, use the ggplot2 package to draw a scatterplot with the number of Tweets for each neighborhood on the X-axis and the sentiment score on the Y-axis. Also perform a correlation analysis between the number of Tweets for each neighborhood and the sentiment score, using either the cor.test() function or the ggpubr::stat_cor() function.
ggplot(data = twt_poly, mapping = aes(x=n, y=sentiment_ai)) +
geom_point() +
geom_smooth(method = "lm",se = FALSE) +
labs(
x = "Count_Tweets",
y = "Avg_Sentiment_Score",
title = "Tweet patterns in different neighborhoods in Atlanta"
)
## `geom_smooth()` using formula 'y ~ x'
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.2.1
##
## Attaching package: 'ggpubr'
## The following object is masked from 'package:plyr':
##
## mutate
# Scatterplot with a regression line; the correlation method and label placement
# are stat_cor() options rather than ggscatter() arguments.
twt_scatter <- ggscatter(twt_poly, x = "n", y = "sentiment_ai", add = "reg.line",
                         add.params = list(color = "blue", fill = "lightgray"))
twt_scatter + stat_cor(method = "pearson", p.accuracy = 0.001, r.accuracy = 0.01)
## `geom_smooth()` using formula 'y ~ x'
# Correlation test between the number of Tweets and the average sentiment score
twt_cor <- cor.test(twt_poly$n, twt_poly$sentiment_ai)
# No code is provided as a template.
# Feel free to write your own code to perform the tasks listed above.
Using the map and plot you created above (as well as using your inspection of the data), answer the following questions.
Q. What’s the proportion of neighborhoods with one or more Tweets? - 28 of the 248 neighborhoods have at least one Tweet, which is roughly 11% of neighborhoods.
Q. Do you see any pattern to neighborhoods with/without Tweets? Is there anything that can help us guess how likely a given neighborhood is to have Tweets? - Neighborhoods in the northern part of the city have more Tweets than those in the central part, as can be seen from the n column in the data frame and the map colored by the number of Tweets.
Q. (If you’ve observed a relationship between sentiment score and the number of Tweets) Why do you think this relationship exists? - The correlation is not strong, as shown by the R and p values. This suggests that the number of Tweets by itself does not explain the sentiment score, and that Tweets do not represent the whole population of a neighborhood.
Q. The neighborhood ‘Rockdale’ has many Tweets mentioning its name. Does the high volume of Tweets make sense? Why do you think this occurred? - The high volume does not really make sense for such a small neighborhood. Most of these Tweets likely refer to Rockdale County, east of Atlanta, which falls inside the 50-mile search radius, so matching on the name alone picks up Tweets that are not about the neighborhood.
Q. What do you think are the strengths and shortcomings of this method (i.e., using Twitter & neighborhood names to evaluate sentiments around each neighborhood)? - The main shortcomings are that Twitter users do not represent the whole population, keyword matching on names can pick up unrelated Tweets, and we have no control over data quality. The strength is that Twitter provides a large, easily collected stream of opinions from many different people.
Q. Can you think of a better way to define neighborhoods and collect Tweets that can better represent the sentiment of neighborhoods? - We could make better use of location information, for example by keeping only Tweets whose coordinates fall inside each neighborhood’s boundary, and supplement the neighborhood names with additional local keywords while filtering out names shared with other places (such as ‘Rockdale’). Cleaning the data this way would make it more accurate.