Twitter Data Analysis

Saina Sheini

March 2020

Twitter data and social resilience

For my project, I have chosen to use Twitter data, which is a rich pool of valuable real-time information. Since Twitter data is very high in volume and computationally intensive for a personal laptop, the data I am using in this notebook covers only geo-coded tweets posted in Boston in 2018. In the next steps, I intend to use longer time periods and data not limited to geo-tagged tweets to fulfill my project’s goals.

My project’s goals are as follows:
a. Designing measurable metrics for social capital and social resilience based on Twitter data using network analysis and sentiment analysis
b. Comparing the role of social capital and the process of social resilience during and after the two types of disasters: natural and human-driven.
c. Modeling social resilience when natural disasters happen based on Twitter data
d. Modeling social resilience when human-driven disasters happen based on Twitter data

In this notebook, I will explore the data to get to know it, and start my analysis with some sentiment analysis. Before starting this analysis, I took some preprocessing steps, including work in QGIS, to add geographic attributes to my Twitter data. The data used here has therefore already been cleaned and processed.

Data Exploration

For the first step of exploring the data, I want to see how many tweets were posted by month, day of the week and hour of the day during 2018. It would be interesting to see if there is actually a pattern in people’s tweeting habits!
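Before counting, it helps to see how these three time fields can be read from a tweet timestamp. The sketch below is a minimal illustration on made-up timestamps. It assumes `created_at` follows the usual Twitter layout (`Mon Jan 01 03:15:00 +0000 2018`) and is recorded in UTC. Converting to `America/New_York` is my assumption for reading hours in Boston local time, and `toy` is a hypothetical data frame, not the project data.

```r
# Toy timestamps (made up), in the assumed Twitter layout
toy <- data.frame(
  created_at = c("Mon Jan 01 03:15:00 +0000 2018",
                 "Sat Jun 16 23:40:10 +0000 2018"),
  stringsAsFactors = FALSE
)

# Read English day and month names regardless of the machine's language settings
old_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")

# Parse the strings as UTC instants
toy$created_at_utc <- as.POSIXct(toy$created_at,
                                 format = "%a %b %d %H:%M:%S %z %Y",
                                 tz = "UTC")

# Express the same instants in (assumed) Boston local time
toy$local_hour  <- format(toy$created_at_utc, "%H", tz = "America/New_York")
toy$local_day   <- format(toy$created_at_utc, "%a", tz = "America/New_York")
toy$local_month <- format(toy$created_at_utc, "%b", tz = "America/New_York")

# Restore the original locale setting
Sys.setlocale("LC_TIME", old_locale)

print(toy[, c("created_at", "local_hour", "local_day", "local_month")])
```

If the timestamps are indeed in UTC, a tweet stamped shortly after midnight belongs to the previous evening in Boston, so the choice of time zone matters when reading an hourly pattern.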

# library
library(ggplot2)
library(viridis)
library(hrbrthemes)
library(dplyr)
library(plotly)
library(gganimate)

## upload the data
dt <- read.csv("D:\\School\\Semester 8\\DI\\twitter_dt_nhood_geo.csv")

#this creates a new date field in a cleaner format as "YYYY-MM-DD"
dt$created_at_date<-as.Date(dt$created_at, "%a %b %d %H:%M:%S %z %Y", tz = "GMT")
#create a new column for day of the week
dt$day_of_week<-substr(dt$created_at, start = 1, stop=3)
#create a new column for month
dt$the_month<-substr(dt$created_at, start = 5, stop =7)
#create a new column for hour
dt$the_hour<-substr(dt$created_at, start=12, stop=13)

# Now let's see what is the trend of tweeting during the day in different months
dt$the_month = factor(dt$the_month, levels = month.abb)

A <- dt %>%
  group_by(the_month, the_hour) %>%
  tally()

A$the_hour <- as.numeric(A$the_hour)

# Graph: tweets per hour, one panel per month
p <- ggplot(A, aes(x = the_hour, y = n)) +
    geom_col(fill = "red3") +
    ggtitle("Studying Months of the year - Number of Tweets per Hour") +
    facet_wrap(~the_month) +
    xlab("") +
    ylab("") +
    theme_bw() +
    theme(panel.border = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          axis.line = element_line(colour = "black"),
          axis.text.y = element_text(face = "bold", size = 10)) +
    # hours run from 0 to 23; label only the first and last tick
    scale_x_continuous(breaks = 0:23, labels = c("00", rep(" ", 22), "23"))

ggplotly(p)

# htmlwidgets::saveWidget(ggplotly(p), file =  "D:\\School\\Semester 8\\DI\\image.html")

In this graph, a clear pattern is visible: people tend to tweet more at night. The pattern shows a drastic fall in the number of tweets a couple of hours after midnight, which is reasonable. Let’s see what the total number of tweets across the months looks like.

# Create dataset
A <- dt %>%
  group_by(the_month) %>%
  tally()
 
ggplot(A, aes(the_month, n, fill = the_month)) +
  geom_col(position = "dodge", show.legend = FALSE) +
  scale_fill_viridis_d() +
  coord_polar() +
  ylim(-13000, 9463) +  # negative lower limit leaves a hole in the centre
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.length = unit(0.001, "mm")) +
  labs(x = NULL, y = NULL)

This circular bar graph shows a noticeable difference in the number of tweets between months, and even between seasons. This may have to do with the weather: in months like May and June, the weather is at its nicest in Boston, so outdoor activities are probably quite popular during those months. There is a considerable fall in the number of tweets posted in January. This can have various causes, such as vacations, New Year’s resolutions (maybe spending less time online?), less interest in geo-tagging tweets, or even causes unrelated to the community, such as technical problems in the scraping process. It therefore needs further investigation.
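One quick way to start that investigation is to check whether a low month comes from missing collection days rather than from fewer tweets per day. The following is a minimal sketch in base R on made-up dates, assuming a `created_at_date` column like the one built earlier. `toy` and `coverage` are hypothetical names, and the numbers are not real results.

```r
# Made-up tweet dates: January has tweets on only 2 distinct days
toy <- data.frame(
  created_at_date = as.Date(c("2018-01-03", "2018-01-03", "2018-01-20",
                              "2018-02-01", "2018-02-02", "2018-02-03",
                              "2018-02-03"))
)
toy$the_month <- format(toy$created_at_date, "%m")

# Tweets per month
n_tweets <- tapply(toy$created_at_date, toy$the_month, length)

# Days per month with at least one tweet
days_with_data <- tapply(toy$created_at_date, toy$the_month,
                         function(d) length(unique(d)))

coverage <- data.frame(
  the_month      = names(n_tweets),
  n_tweets       = as.vector(n_tweets),
  days_with_data = as.vector(days_with_data),
  tweets_per_day = as.vector(n_tweets / days_with_data)
)

print(coverage)
```

A month with few covered days but a normal number of tweets per covered day would point towards a collection gap, while a low number of tweets per day on full coverage would point towards a change in behaviour.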

Another interesting thing to look at is the distribution of tweets across neighborhoods. Residents of which neighborhoods tweet more? Is there a correlation between the number of residents on Twitter in each neighborhood, how much they tweet, and the demographics (such as race or income) of those neighborhoods? (I will not look at this in this document, but in further steps.)

# Data Prep

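# one row per neighborhood and month, so that each bar shows a single value
# (keeping one row per tweet would stack thousands of bars on top of each other)
A <- dt %>%
  filter(neighborhood != "Downtown") %>%
  count(neighborhood, the_month, name = "n")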

A$norm_n <- round((A$n - mean(A$n))/sd(A$n), 2)  # z-score of the tweet count n
A$norm_type <- ifelse(A$norm_n < 0, "below", "above")  # above / below avg flag
A <- A[order(A$norm_n), ]  # sort
A$neighborhood <- factor(A$neighborhood, levels =unique( A$neighborhood))  # convert to factor to retain sorted order in plot.

# Diverging Barcharts
p <- ggplot(A, aes(x=neighborhood, y=norm_n, label=norm_n)) + 
  geom_bar(stat='identity', aes(fill=norm_type), width=.5)  +
  scale_fill_manual(name="Tweet count", 
                    labels = c("Above Average", "Below Average"), 
                    values = c("above"="#00ba38", "below"="#f8766d")) + 
  ylab("") + xlab("") +
  coord_flip() + 
  # animate the chart month by month
  transition_states(
    the_month,
    transition_length = 2,
    state_length = 1
  ) +
  ease_aes('sine-in-out') +
  labs(title = 'Normalised number of tweets per neighborhood per month: {closest_state}')

animate(p, duration = 5, fps = 10, width = 1000, height = 700)
# Save as gif:
# anim_save("D:\\School\\Semester 8\\DI\\bar2.gif")

The number of tweets posted from Downtown is three times more than the maximum number in all other neighborhoods. Therefore, I have removed it from the comparison to maintain a proper scale. As the animated graph of the number of tweets in each neighborhood across the year demonstrates, the four neighborhoods of Beacon Hill, Fenway, South End and Roslindale (three of which are predominantly white neighborhoods) have the highest number of tweets for most of the year. The variation in tweet frequencies shows that tweets, their number and most certainly the content of their text depend highly on characteristics such as the neighborhood, events happening at that specific time, the environment and, most importantly, the status of citizens.
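To make the normalisation behind the chart concrete, here is a minimal sketch on made-up monthly counts. It assumes the intended measure is a z-score of the tweet count per neighborhood and month, computed after setting Downtown aside. The data frame `toy`, the neighborhoods "A" and "B" and all numbers are hypothetical and are not results from the Boston data.

```r
# Made-up monthly tweet counts (not real results)
toy <- data.frame(
  neighborhood = c("Downtown", "A", "B", "A", "B"),
  the_month    = c("Jan", "Jan", "Jan", "Feb", "Feb"),
  n            = c(900, 300, 100, 250, 150),
  stringsAsFactors = FALSE
)

# Set the outlier aside so it does not dominate the mean and the scale
rest <- toy[toy$neighborhood != "Downtown", ]

# z-score: distance from the mean count, in standard deviations
rest$norm_n <- round((rest$n - mean(rest$n)) / sd(rest$n), 2)

# Flag each value as above or below the average
rest$norm_type <- ifelse(rest$norm_n < 0, "below", "above")

print(rest)
```

Leaving Downtown in would raise the mean of these toy counts from 200 to 340 and push every other count below average, which is why removing it keeps the remaining neighborhoods comparable.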

What if we design a method to discover which communities will have the least amount of social resilience when a disaster hits?
