This is part of the interactive tutorial for COMM 497DB on social data analytics. You can visit previous topics at https://curiositybits.shinyapps.io/R_social_data_analytics.
You will learn how to extract location data from tweets and from Twitter users' bio pages, as well as how to visualize the data using ggplot2 and leaflet.
There are two types of geographic information in the data returned by the Twitter API. Review the lecture slide.
Take a look at the data frame below. This is a typical data frame created by rtweet.
bbox_coords contains the coordinates in geo-tagged tweets.
location contains the user-provided location on the Twitter bio.
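To see the two columns side by side, you can peek at them directly (a quick sketch, assuming your rtweet data frame is named tweets, as in the examples that follow):

# bbox_coords holds bounding-box coordinates for geo-tagged tweets;
# location holds whatever the user typed into their bio
head(tweets[, c("location", "bbox_coords")])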
Geo-tagged tweets returned from the Twitter API already contain coordinates. We just need to parse the data in bbox_coords using a built-in function in rtweet, the library we use for collecting data. The function is called lat_lng().
Below, we extract coordinates from geo-tagged tweets in the data frame named tweets. This data frame contains close to 20k tweets sent to @realDonaldTrump.
The parsed geocodes will be in the lat and lng columns.
library(rtweet)
library(DT) # provides datatable() for the interactive table below

geocodes <- lat_lng(tweets)
geocodes <- geocodes[!is.na(geocodes$lat), ] # remove non-geo-tagged tweets
datatable(geocodes[, c("screen_name", "text", "lat", "lng")], options = list(pageLength = 5))

We will plot tweets on a world map. Recall that the data frame containing coordinates is called geocodes (created in the previous step).
library(maps) # provides the world map outline

par(mar = c(0, 0, 0, 0)) # drop the plot margins
maps::map("world", lwd = .25)
with(geocodes, points(lng, lat, pch = 20, cex = .75, col = rgb(0, .3, .7, .75)))
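Since this tutorial also uses ggplot2, here is a roughly equivalent static map drawn with ggplot2's map_data() helper (a sketch; it assumes the maps package is installed and reuses the geocodes data frame from above):

library(ggplot2)

# world polygons from the maps package, reshaped for ggplot2
world <- map_data("world")

ggplot() +
  geom_polygon(data = world, aes(x = long, y = lat, group = group),
               fill = "grey90", colour = "grey70") +
  geom_point(data = geocodes, aes(x = lng, y = lat),
             colour = rgb(0, .3, .7, .75), size = .75) +
  coord_quickmap() +
  theme_void()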
leaflet is a library for interactive maps. Now that we have a data frame with coordinates (called geocodes), we can plug it into the functions below. First, try a simple map.
library(leaflet)

# remove tweets with empty coordinates
geocodes <- geocodes[!is.na(geocodes$lat), ]

map1 <- leaflet(data = geocodes) %>%
  addProviderTiles("OpenStreetMap.Mapnik") %>%
  setView(lng = -98.35, lat = 39.50, zoom = 2) %>%
  addCircleMarkers(lng = ~lng, lat = ~lat, stroke = FALSE, fillOpacity = 0.5)
map1

It is quite dull, isn't it? Let's spice up the map by adding three elements: marker icons built from user profile images, popups that show the tweet text, and a watercolor base map.
library(leaflet)

# build marker icons from each user's profile image
usericon <- makeIcon(
  iconUrl = geocodes$profile_image_url,
  iconWidth = 15, iconHeight = 15
)

map2 <- leaflet(data = geocodes) %>%
  addProviderTiles("Stamen.Watercolor") %>%
  setView(lng = -98.35, lat = 39.50, zoom = 2) %>%
  addMarkers(lng = ~lng, lat = ~lat, popup = ~as.character(text), icon = usericon)
map2

You will notice that very few tweets are geo-tagged. However, many Twitter users reveal where they live on their Twitter bio, and we can use that location information to geo-map users. Be aware that the bio location is not always accurate; people may use the location field in a sarcastic manner (e.g., describing themselves as living on Mars).
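Before geocoding, it helps to eyeball what people actually typed into the location field. A quick way to do that (a sketch, assuming the tweets data frame from earlier):

# the most common self-reported locations; expect a mix of real places,
# jokes, and ambiguous strings
head(sort(table(tweets$location), decreasing = TRUE), 10)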
The lookup_coords() function in rtweet can generate latitudes and longitudes from string inputs. For the function to work, you need a working Google Maps API key (with the Geocoding API and Geolocation API enabled). See this slide on how to set up the API. Try the example below.
library(rtweet)
lookup_coords("Amherst, MA", apikey ="PASTE YOUR API KEY")After you have a working Google Map API for returning coordinates, we can run the following code to get the coordinate for each user that has some location info on bio.
Please go through my comments.
library(rtweet)
library(dplyr)

# Create a new data frame (tweets_withlocation) containing only tweets sent by
# users whose bio has location info. The bio location is in the location column.
tweets_withlocation <- tweets[tweets$location != "", ]

# Due to the rate limit of the API, randomly select 1000 tweets from
# tweets_withlocation for visualization, using the sample_n() function from dplyr.
tweets_tovis <- sample_n(tweets_withlocation, 1000)

# Create two columns in tweets_tovis for the latitudes and longitudes
# returned from the Google Geocoding API.
tweets_tovis$lat_google <- NA
tweets_tovis$lng_google <- NA

# Iterate over the sampled tweets and look up coordinates for each location.
for (rowid in seq_len(nrow(tweets_tovis))) {
  loc <- tweets_tovis$location[rowid]
  print(paste0("getting the coordinates for: ", loc, ", row id: ", rowid))
  Sys.sleep(1) # pause between requests to respect the API rate limit
  geodata <- NULL
  try(geodata <- lookup_coords(loc, apikey = "PASTE YOUR API KEY"))
  if (length(geodata$point) == 2) {
    print(c("the lat is:", geodata$point[1]))
    print(c("the lng is:", geodata$point[2]))
    tweets_tovis$lat_google[rowid] <- geodata$point[1]
    tweets_tovis$lng_google[rowid] <- geodata$point[2]
  } else {
    print("skipping")
  }
}

Take a look at the added coordinates from the Google Geocoding API. Showing only the first 100 tweets in tweets_tovis.
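For example, an interactive table of the first 100 rows can be rendered with the DT package (a sketch; the column names follow the code above):

library(DT)

# show the bio location next to the coordinates returned by the API
datatable(head(tweets_tovis[, c("screen_name", "location", "lat_google", "lng_google")], 100), options = list(pageLength = 5))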