Twitter data analysis

Saina Sheini

March 2020

Boston Marathon Bombing

In this analysis, I picked a subset of my Twitter data covering the period, a little over a week, that contains the tragedy of the Boston Marathon bombing. During the annual Boston Marathon on April 15, 2013, two homemade pressure cooker bombs detonated at 2:49 p.m. near the finish line of the race, killing 3 people and injuring several hundred others, including 16 who lost limbs. Identifying the suspects and the manhunt that followed took four days. During these awful events, the community showed strength and solidarity while experiencing fear and anger. With a lot of misinformation and rumors circulating, I expect the parts of the community with stronger social ties to be more resilient, experiencing fewer negative emotions and less exposure to false information. The Twitter data from before, during and after this event provides an opportunity to define a social capital metric and test whether this theory has grounds. To get there, I will perform sentiment analysis on three time periods: before, during and after the terror attack. In addition, I will use the Boston Neighborhood Survey to check the reliability of the social capital metric measured in the Twitter data analysis.

This data covers tweets from the 12th to the 22nd of April. As mentioned, the attack happened on the 15th, in the afternoon. The milestones are the 15th (the day of the attack) and the 19th (when the manhunt ended with the arrest). I intend to analyze the reactions of the citizens and whether, and how, the effects of the incident wear off after a few days.

Geographical information was added to the Twitter data using QGIS. The Twitter data needed intensive tidying, since social media data is usually quite messy. The first step is some initial data exploration; after that, sentiment analysis lets me look further into the emotions of fear and trust. Trust is one of the indicators of social capital in a community. I will also use the Boston Neighborhood Survey (BNS), which was conducted by the Injury Control Research Center at the Harvard T.H. Chan School of Public Health (HSPH) in three rounds, in 2008, 2010 and 2012. The data from the BNS are a rich resource for understanding the conditions and social dynamics of local communities. I will use the social capital index in the BNS data in my analysis.
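
One way to check a Twitter-based trust measure against the BNS is a rank correlation at the neighborhood level. The sketch below is only an illustration: the neighborhood values and the column name `social_capital` are hypothetical, not taken from either dataset.

```r
# Hypothetical neighborhood-level values; none of these numbers come from the data.
twitter_trust <- data.frame(
  ISD_Nbhd  = c("Back Bay", "Fenway", "West End", "Roxbury", "Downtown"),
  trust_avg = c(0.42, 0.39, 0.21, 0.30, 0.47)
)
bns <- data.frame(
  ISD_Nbhd       = c("Back Bay", "Fenway", "West End", "Roxbury", "Mattapan"),
  social_capital = c(3.6, 3.4, 2.9, 3.1, 3.0)   # assumed column name
)

# Keep only the neighborhoods that appear in both sources
both <- merge(twitter_trust, bns, by = "ISD_Nbhd")

# Spearman rank correlation, so the different scales of the two measures do not matter
rho <- cor(both$trust_avg, both$social_capital, method = "spearman")
rho
```

A rank correlation only asks whether neighborhoods are ordered similarly by the two measures, which seems a reasonable first check when one measure comes from tweets and the other from a survey.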

I discussed the potential of a network analysis of Twitter users as a way of understanding social ties in communities in times of crisis compared with normal conditions.

Network Analysis

#libraries
library(tidyr)
library(viridis)
library(hrbrthemes)
library(gganimate)
library(sf)
library(dplyr)
library(tm)
library(tidytext)
library(ggplot2)
library(stringr)
library(syuzhet)
library(wordcloud)
library(tigris)
library(leaflet)
library(plotly)
library(streamgraph)
  
## upload the data
#shapefiles
nhoods <- st_read("D:\\School\\Semester 7\\BARI\\Boston_Neighborhoods\\Boston_Neighborhoods.shp", quiet = TRUE)
twt <- read.csv("D:\\Google Drive\\Final presentation\\Tweets_Saina\\twt_marathon.csv") #111918
twt$time <- as.character(twt$time)
#this creates a cleaned timestamp as "YYYY-MM-DD HH:MM:SS", then a date field as "YYYY-MM-DD"
twt$created_at_date <- strftime(twt$time, format="%Y-%m-%d %H:%M:%S")
twt$date <- as.Date(twt$created_at_date, "%Y-%m-%d")
#create a new column for day of the week
twt$day_of_week <- weekdays(twt$date)
#create a new column for month
twt$the_month <- months(twt$date)
#create a new column for hour
twt$the_hour <- as.integer(substr(twt$time, start=12, stop=13)) # integer hour (0-23) so it can be compared numerically
#create a new column for day
twt$the_day <- lubridate::day(twt$created_at_date)

A <- twt %>%
  group_by(the_day) %>%
  tally()
  
A <- A[,c("the_day", "n")]
A <- A[!duplicated(A),]
c4 <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K")
A = cbind(A, c4)
A$the_day <- factor(A$the_day)

# the 15th (attack) and 19th (arrest) are in red, the 20th in a lighter red
ggplot(A, aes(y=n, x=the_day, fill = c4)) + 
    geom_bar(stat="identity") +
    scale_fill_manual(values=c("A" = "MistyRose1", "B" = "MistyRose1", "C" = "MistyRose1", "D" = "Red",
                               "E" = "MistyRose1", "F" = "MistyRose1", "G" = "MistyRose1", "H" = "Red",
                               "I" = "IndianRed2", "J" = "MistyRose1", "K" = "MistyRose1")) +
    theme_bw() +
    theme(legend.position = "none", plot.title = element_text(hjust=0.5),
          panel.border = element_blank(), panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"),
          axis.text.y = element_text(face = "bold", size = 10)) +
    xlab("The Day") + ylab("") +
    ggtitle("Number of Tweets per Day from the 12th to the 22nd of April")

There is a considerable drop in the number of tweets posted starting from the day after the attack, and a jump right after the arrests were made. The drop could have a couple of different explanations: people may have stayed away from social media to protect themselves from negative news, or avoided using it for non-essential interactions. They then came back to social media once they heard the good news. The jump lasts almost two days, and in the last two days of the dataset the volume returns to a level similar to what it was before the attack.
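
To put a number on the drop and the jump, each day's count can be compared with a pre-attack baseline. The counts below are made up for illustration and are not the totals in my data; using the 12th to the 14th as the baseline is also an assumption.

```r
# Hypothetical daily tweet counts for 12-22 April; not the real totals.
daily <- data.frame(
  the_day = 12:22,
  n       = c(10000, 10400, 9600, 12000, 8000, 7800, 8200, 13000, 12500, 10100, 9900)
)

# Baseline: mean daily count over the full days before the attack (12th-14th)
baseline <- mean(daily$n[daily$the_day %in% 12:14])

# Percent deviation of each day from that baseline
daily$pct_vs_baseline <- round(100 * (daily$n - baseline) / baseline, 1)
daily
```

Negative values mark days with fewer tweets than the baseline and positive values mark days with more, which makes the size of the drop and of the jump directly comparable.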

A simple bar chart can show how the number of tweets varies before, during (from the attack until the arrest) and after the event. An animation of the variation can demonstrate it more clearly.

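# count tweets per hour within each day (one row per day-hour pair)
A <- twt %>%
  count(the_day, the_hour, name = "freq_hour")

# bar length is the hourly tweet count; order() would only return row positions
p <- ggplot(A, aes(x=factor(the_hour), y = freq_hour)) + 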
  geom_bar(stat='identity', fill = "red3") + 
  # geom_text(aes(x=day_of_week, y = freq_weekday + 2, label = freq_weekday)) +
# + geom_text(aes(label = Number), position=position_dodge(width=0.9), vjust=1)  
  theme_bw() + theme(panel.border = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"), axis.text.y = element_text(face = "bold", size = 15), axis.text.x = element_blank()) +
  # , axis.ticks.x = element_blank()
  xlab("") +
  # theme(axis.text.y  = element_text(vjust=0.5, size=8)) +
  ylab("") +
  coord_flip() +
  # gganimate specific bits:
  transition_states(
    the_day,
    transition_length = 2,
    state_length = 1
  ) +
  ease_aes('sine-in-out') + labs(title = 'Number of hourly tweets by day: {closest_state}')

animate(p, duration = 5, fps = 40, width = 1000, height = 700)

In this animation, we can clearly see the drop after the attack and the jump after the arrests. Activity appears to return to normal starting from the 21st, almost two days after the arrests.

Sentiment analysis

Now that the initial exploration is done, I want to look at the content of the tweets and see whether it also shows a meaningful pattern.

I have created three subsets of the Twitter dataset, dividing the time period into before, during and after the disaster. Let’s see whether the emotions of community residents change across these scenarios. The first dataset covers the 12th, 13th and 14th, plus the 15th of the month up to 3 pm, when the attack happened. The second dataset includes the time between the attack and 8 pm on the 19th, when the manhunt was over. Finally, the rest of the data, up to the 22nd of the month, is in the third dataset. The emotions I am most interested in are trust and fear. Trust reflects the level of social capital. Fear is the direct effect of the terror attack and can be argued to be the opposite of what strong social ties bring to a community.
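
The three periods can also be expressed as a single label per tweet. This is a minimal sketch, assuming `the_day` is the day of the month in April and `the_hour` is an integer from 0 to 23, and assuming that the 3 pm hour on the 15th counts as before and the 8 pm hour on the 19th counts as during. `label_period` is a hypothetical helper, not part of the analysis code.

```r
# Hypothetical helper: assign each tweet to "before", "during" or "after".
label_period <- function(the_day, the_hour) {
  # Put every tweet on one hourly axis (day * 24 + hour)
  t <- the_day * 24 + the_hour
  attack_end <- 15 * 24 + 15   # last "before" hour: the 3 pm hour on the 15th
  arrest_end <- 19 * 24 + 20   # last "during" hour: the 8 pm hour on the 19th
  # cut() uses right-closed intervals, so each cutoff hour falls in the earlier period
  cut(t,
      breaks = c(-Inf, attack_end, arrest_end, Inf),
      labels = c("before", "during", "after"))
}

# Example: five tweets given as (day, hour) pairs
table(label_period(c(14, 15, 17, 19, 22), c(9, 16, 12, 21, 8)))
```

With a single label column, the same grouped summary can be computed for all three periods at once instead of filtering three times.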

Before analyzing each of the datasets separately, let’s look at the flow of the trust emotion. In theory, if a community benefits from high levels of social capital, the residents feel that their co-residents and authorities are reliable and trustworthy, and a disaster would cause them to rely on their social networks even more.
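
Lexicon-based emotion scoring counts, for each tweet, how many of its words appear in a word list for each emotion. The toy example below shows the idea with a made-up two-emotion lexicon. The word lists and the `score_tweet` helper are hypothetical; real scores would come from a full emotion lexicon rather than these few words.

```r
# Toy emotion lexicon; the word lists are invented for illustration only.
toy_lexicon <- list(
  trust = c("safe", "together", "police", "strong"),
  fear  = c("scared", "bomb", "afraid", "lockdown")
)

# Hypothetical helper: count lexicon hits per emotion for one tweet
score_tweet <- function(text, lexicon = toy_lexicon) {
  # lower-case the text and split it into words on anything that is not a letter
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  # number of words that fall in each emotion's word list
  vapply(lexicon, function(w) sum(words %in% w), numeric(1))
}

tweets <- c("Stay safe Boston, we are strong together",
            "So scared right now, another bomb?")

# One row per tweet, one column per emotion
scores <- t(sapply(tweets, score_tweet, USE.NAMES = FALSE))
scores
```

Averaging such per-tweet counts by neighborhood gives the kind of trust and fear levels compared in the rest of this analysis.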

# uploading data after sentiment analysis

twt <- read.csv("D:\\School\\Semester 8\\DI\\wtwsent.csv")
# neighborhood names need to be the same between shapefile and .csv
nhoods$Name <- str_replace(nhoods$Name, "^Allston", "Allston/Brighton")
nhoods$Name <- str_replace(nhoods$Name, "^Brighton", "Allston/Brighton")
twt$ISD_Nbhd <- str_replace(twt$ISD_Nbhd, "Financial District/Downtown", "Downtown")
nhoods$Name <- str_replace(nhoods$Name, "South Boston Waterfront", "South Boston")
twt$ISD_Nbhd <- str_replace(twt$ISD_Nbhd, "Fenway/Kenmore", "Fenway")
nhoods$Name <- str_replace(nhoods$Name, "Leather District", "Downtown")
nhoods$Name <- str_replace(nhoods$Name, "Longwood", "Fenway")

dt_1 <- twt %>% # before
  filter(the_day %in% c(12,13,14,15)) %>%
  filter(!(the_day == 15 & the_hour > 15)) %>%
  group_by(ISD_Nbhd) %>%
  mutate(anger_avg = mean(anger), anticipation_avg = mean(anticipation), disgust_avg = mean(disgust),
         fear_avg = mean(fear), joy_avg = mean(joy), sadness_avg = mean(sadness), surprise_avg = mean(surprise), trust_avg = mean(trust), negative_avg = mean(negative), positive_avg = mean(positive)) %>%
  dplyr::select(the_day, the_hour, CT_ID_10, ISD_Nbhd, anger_avg, anticipation_avg, disgust_avg, fear_avg, joy_avg, sadness_avg, surprise_avg, trust_avg, negative_avg, positive_avg)

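dt_2 <- twt %>% # during: from the attack on the 15th to the end of the manhunt on the 19th
  filter(the_day %in% c(15,16,17,18,19)) %>%
  filter(!(the_day == 15 & the_hour <= 15)) %>% # these hours are already in the "before" subset
  filter(!(the_day == 19 & the_hour > 20)) %>%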
  group_by(ISD_Nbhd) %>%
  mutate(anger_avg = mean(anger), anticipation_avg = mean(anticipation), disgust_avg = mean(disgust),
         fear_avg = mean(fear), joy_avg = mean(joy), sadness_avg = mean(sadness), surprise_avg = mean(surprise), trust_avg = mean(trust), negative_avg = mean(negative), positive_avg = mean(positive)) %>%
  dplyr::select(the_day, the_hour, CT_ID_10, ISD_Nbhd, anger_avg, anticipation_avg, disgust_avg, fear_avg, joy_avg, sadness_avg, surprise_avg, trust_avg, negative_avg, positive_avg)


dt_3 <- twt %>% # after
  filter(the_day %in% c(19,20,21,22)) %>%
  filter(!(the_day == 19 & the_hour <= 20)) %>%
  group_by(ISD_Nbhd) %>%
  mutate(anger_avg = mean(anger), anticipation_avg = mean(anticipation), disgust_avg = mean(disgust),
         fear_avg = mean(fear), joy_avg = mean(joy), sadness_avg = mean(sadness), surprise_avg = mean(surprise), trust_avg = mean(trust), negative_avg = mean(negative), positive_avg = mean(positive)) %>%
  dplyr::select(the_day, the_hour, CT_ID_10, ISD_Nbhd, anger_avg, anticipation_avg, disgust_avg, fear_avg, joy_avg, sadness_avg, surprise_avg, trust_avg, negative_avg, positive_avg)


twt$ISD_Nbhd <- as.character(twt$ISD_Nbhd)

# daily totals per neighborhood (the columns keep the _avg suffix but hold sums)
mat <- twt %>%
  group_by(the_day, ISD_Nbhd) %>%
  mutate(anger_avg = sum(anger), anticipation_avg = sum(anticipation), disgust_avg = sum(disgust),
         fear_avg = sum(fear), joy_avg = sum(joy), sadness_avg = sum(sadness), surprise_avg = sum(surprise), trust_avg = sum(trust), negative_avg = sum(negative), positive_avg = sum(positive)) %>%
  dplyr::select(date, the_day, ISD_Nbhd, anger_avg, anticipation_avg, disgust_avg, fear_avg, joy_avg, sadness_avg, surprise_avg, trust_avg, negative_avg, positive_avg)

m <- mat[,c("ISD_Nbhd", "trust_avg", "date", "the_day")]
m <- m[!duplicated(m),]

pp <- streamgraph(m, key="ISD_Nbhd", value="trust_avg", date="date", interactive = TRUE, height="300px", width="1000px") %>% sg_legend(show=TRUE, label="names: ") %>% sg_axis_x(1, "date", "%d") %>% sg_fill_brewer("Greens")
pp

As presented in this streamgraph, there are two considerable jumps in the trust emotion, right at the time of the attack and of the arrests, along with a pattern of less drastic increases in the days between. The return to normal is apparent at the end of the graph (the last two days of our time period). I can see the specific pattern of each neighborhood by using the dropdown menu at the bottom of the graph. The value of trust on a given date is also visible by hovering over the bands. Keep in mind that the wider a band is, the higher the value of trust in the respective neighborhood.

Let’s map the three scenarios and see whether the levels of trust and fear change through this time period.

# getting the indexes
#adding geo data
db <- geo_join(nhoods, dt_1, "Name", "ISD_Nbhd")
df <- db[!is.na(db$trust_avg),]

#the pop up variable
mypopup <- paste0(df$Name, " ", round(df$trust_avg,2))
#the palette
mypal <- colorNumeric(
  palette = "YlGn",
  domain = (df$trust_avg)
)

myLAT <- 42.349970
myLNG <- -71.078940
mycentername <- "Marathon Sports" # one fixed point on the map, shown as a marker with a popup

# mapping
mymap <- leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  setView(myLNG, myLAT, zoom = 12) %>%
  addMarkers(lat=myLAT, lng=myLNG, popup=mycentername) %>%
  addPolygons(data = df, highlight = highlightOptions(weight = 3,
                                           color = "red",
                                           bringToFront = TRUE) ,
              fillColor = ~mypal(df$trust_avg), 
              color = "#000000", 
              fillOpacity = 0.7, 
              weight = 1,
              smoothFactor = 0.2,
              popup = mypopup) %>%
  addLegend(pal = mypal,
            values = df$trust_avg,
            position = "bottomright",
            title = "Trust Tweets",
            opacity = 1
           )
mymap

# saveWidget(mymap, file="D:\\School\\Semester 8\\DI\\Trust.html")

This is the map of trust-related tweets in the days before the attack. As the map shows, Fenway, Back Bay and Beacon Hill, which are affluent neighborhoods of Boston, show high levels of trust, along with Allston/Brighton and South Boston. The darkest neighborhood on the map is Downtown, but the tweets posted there do not reflect the social dynamics of its residents, because many people travel there from all over the city. Downtown is the economic center of the city.

This is the map of the distribution of trust-related tweets from the attack up to the end of the manhunt. Considerable changes are visible compared to the days before the attack. Specifically, the neighborhoods around the scene of the attack show higher levels of trust, which is exactly what is expected from resilient communities. Exercising the social capital of the community gives the residents more strength to get through dark events. Even the West End neighborhood, which showed very low trust levels in the first map, now demonstrates high levels of trust, consistent with Daniel Aldrich’s theory that social capital can be enhanced after a community goes through traumatic events.

The map of the distribution of emotions after the arrests shows that the communities mostly go back to business as usual. The West End neighborhood, which was at its highest level of trust during the event, now shows a higher trust level than before the attack. This is also the case for a few more neighborhoods, although they all show lower levels of trust compared to their peak during the event.

The fear level shown on social media before the attack is low, especially around the scene of the attack, since those neighborhoods are among the wealthiest and safest places in Boston. But as shown in the following map, these patterns change once the attack happens.

The map lights up specifically around the site of the explosion. The residents experience more fear after the attack, which is not surprising, but the level is not especially high and stays at medium.

The level of fear drops after the arrests have taken place, but it is still higher than in the days before the attack.
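
The comparisons between periods described above can be summarized in one table of mean fear per neighborhood and period. The data frame below is hypothetical and only shows the shape of the calculation; the `period` column is an assumed label for before, during and after.

```r
# Hypothetical per-tweet fear scores; neighborhoods and values are illustrative.
toy <- data.frame(
  ISD_Nbhd = rep(c("Back Bay", "West End"), each = 6),
  period   = factor(rep(c("before", "during", "after"), times = 4),
                    levels = c("before", "during", "after")),
  fear     = c(0, 2, 1, 0, 2, 1,
               1, 1, 1, 1, 3, 1)
)

# Mean fear per neighborhood (rows) and period (columns)
fear_by_period <- tapply(toy$fear, list(toy$ISD_Nbhd, toy$period), mean)

# Change in each later period relative to the pre-attack period
fear_change <- fear_by_period[, c("during", "after")] - fear_by_period[, "before"]
fear_change
```

A positive value in the `after` column would correspond to fear that has not yet returned to its pre-attack level.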