For the first step of exploring the data, I want to see how many tweets were posted by month, day of the week and hour of the day during 2018. It would be interesting to see whether there is actually a pattern in people’s tweeting habits!
# library
library(ggplot2)
library(viridis)
library(hrbrthemes)
library(dplyr)
library(plotly)
library(gganimate)
# load the data
dt <- read.csv("D:\\School\\Semester 8\\DI\\twitter_dt_nhood_geo.csv")
#this creates a new date field in a cleaner format as "YYYY-MM-DD"
dt$created_at_date<-as.Date(dt$created_at, "%a %b %d %H:%M:%S %z %Y", tz = "GMT")
#create a new column for day of the week
dt$day_of_week<-substr(dt$created_at, start = 1, stop=3)
#create a new column for month
dt$the_month<-substr(dt$created_at, start = 5, stop =7)
#create a new column for hour
dt$the_hour<-substr(dt$created_at, start=12, stop=13)
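The substr() calls above rely on the fixed layout of the timestamp and keep the hour exactly as it is written in the string. If the timestamps carry a `+0000` (UTC) offset, which is an assumption about this dataset, the hour can be shifted to Boston local time by parsing the full string first. A minimal sketch with a made-up timestamp; `example_ts` and the `America/New_York` zone are illustrative choices, not values taken from the data:

```r
# Hypothetical timestamp in the "created_at" layout (not from the dataset)
example_ts <- "Wed Oct 10 20:19:24 +0000 2018"

# Parse the full string; %z reads the "+0000" offset.
# Note: %a and %b expect English day/month names (C or English locale).
parsed <- as.POSIXct(example_ts, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")

# Hour as written in the string (what substr(created_at, 12, 13) returns)
hour_utc <- format(parsed, "%H", tz = "UTC")

# Hour in Boston local time (assumed zone: America/New_York)
hour_local <- format(parsed, "%H", tz = "America/New_York")

hour_utc
hour_local
```

If the two hours differ for the real data, the hourly pattern in the plots is shifted by that offset relative to local time.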
# Now let's see the trend of tweeting during the day in different months
dt$the_month <- factor(dt$the_month, levels = month.abb)
A <- dt %>%
  group_by(the_month, the_hour) %>%
  tally()
A$the_hour <- as.numeric(A$the_hour)
# Graph: tweets per hour, one panel per month
p <- ggplot(A, aes(x = the_hour, y = n)) +
  geom_col(fill = "red3") +
  ggtitle("Studying Months of the year - Number of Tweets per Hour") +
  facet_wrap(~the_month) +
  xlab("") +
  ylab("") +
  theme_bw() +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"),
        axis.text.y = element_text(face = "bold", size = 10), legend.position = "none") +
  # hours run from 0 to 23, so the breaks must too; only the first and last are labelled
  scale_x_continuous(breaks = 0:23, labels = c("00", rep(" ", 22), "23"))
ggplotly(p)
# htmlwidgets::saveWidget(ggplotly(p), file = "D:\\School\\Semester 8\\DI\\image.html")
In this graph, a clear pattern is visible: people tend to tweet more at night. The number of tweets drops sharply a couple of hours after midnight, which is reasonable. Let’s see what the total number of tweets across the months looks like.
# Create dataset: total number of tweets per month
A <- dt %>%
  group_by(the_month) %>%
  tally()
# Circular bar chart; the negative lower limit of ylim() leaves a hole in the centre
ggplot(A, aes(the_month, n, fill = the_month)) +
  geom_col(show.legend = FALSE) +
  scale_fill_viridis_d() +
  coord_polar() +
  ylim(-13000, 9463) +
  theme_bw() +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), axis.line = element_blank(),
        axis.text.y = element_blank(), axis.ticks.length = unit(0.001, "mm")) +
  labs(x = NULL, y = NULL)
This circular bar graph shows a noticeable difference in the number of tweets between months, and even between seasons. This may have to do with the weather: in months like May and June, the weather is at its nicest in Boston, so outdoor activities are probably quite popular during those months. There is a considerable drop in the number of tweets posted in January. It could have various causes, such as vacations, New Year’s resolutions (maybe spending less time online?), less interest in geo-tagging tweets, or even reasons unrelated to the community, such as technical problems in the scraping process. So it needs further investigation.
Another interesting thing to look at is the distribution of tweets across neighborhoods. Residents of which neighborhoods tweet more? Is there a correlation between the number of residents on Twitter in each neighborhood, how much they tweet, and the demographics (such as race or income) of those neighborhoods? (I will not look at this in this document, but in further steps.)
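# Data Prep: one row per neighborhood and month (Downtown is excluded, as explained below)
A <- dt %>%
  filter(neighborhood != "Downtown") %>%
  count(the_month, neighborhood) %>%   # count() collapses to one row per group; mutate(n = n()) kept one row per tweet
  group_by(the_month) %>%
  mutate(norm_n = round((n - mean(n)) / sd(n), 2),              # z-score of the count, taken within each month (assumption, following the plot title)
         norm_type = ifelse(norm_n < 0, "below", "above")) %>%  # above / below avg flag
  ungroup()
# Order neighborhoods by their total count so the bars stay sorted in the plot
A$neighborhood <- reorder(A$neighborhood, A$n, FUN = sum)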
# Diverging bar chart, animated over the months
p <- ggplot(A, aes(x = neighborhood, y = norm_n)) +
  geom_col(aes(fill = norm_type), width = .5) +
  scale_fill_manual(name = "Number of tweets",
                    labels = c("Above Average", "Below Average"),
                    values = c("above" = "#00ba38", "below" = "#f8766d")) +
  ylab("") + xlab("") +
  coord_flip() +
  transition_states(
    the_month,
    transition_length = 2,
    state_length = 1
  ) +
  ease_aes('sine-in-out') +
  labs(title = 'Normalised number of tweets per neighborhood per month: {closest_state}')
animate(p, duration = 5, fps = 10, width = 1000, height = 700)
# Save as gif:
# anim_save("D:\\School\\Semester 8\\DI\\bar2.gif")
The number of tweets posted from Downtown is three times higher than the maximum number in any other neighborhood. Therefore, I have removed it from the comparison to maintain a proper scale. As the animated graph of the number of tweets in each neighborhood across the year shows, the four neighborhoods of Beacon Hill, Fenway, South End and Roslindale (three of which are predominantly white) have the highest number of tweets for most of the year. The variation in tweet frequencies shows that tweets, their number and most certainly their text depend highly on characteristics such as the neighborhood, events happening at that specific time, the environment, and most importantly, the status of citizens.
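The effect of a single dominant neighborhood on the normalised values can be illustrated with a small sketch. The counts below are invented toy numbers (only the three-to-one ratio mirrors the text), the neighborhood names A to D are placeholders, and `z_score` is a helper name introduced here for illustration:

```r
# Toy counts, invented for illustration (not from the dataset)
counts <- c(Downtown = 9000, A = 3000, B = 2500, C = 1200, D = 800)

# z-score: distance from the mean in units of standard deviation
z_score <- function(x) round((x - mean(x)) / sd(x), 2)

# Outlier included: the mean is pulled up, so every other value falls below it
z_with <- z_score(counts)

# Outlier removed: the remaining values spread on both sides of the mean
z_without <- z_score(counts[names(counts) != "Downtown"])

z_with
z_without
```

In this toy case, neighborhood A sits below the average when Downtown is included and above it once Downtown is dropped, which is the kind of distortion that removing the outlier is meant to avoid.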