Chapter 6 Homework

For this homework assignment, you will use data from Twitter that include tweets (2011 to 2017) from Colorado senators, which can be downloaded from Canvas. Just FYI—some tweets were cut off before Twitter’s character limit; just work with the data you have. The original data are from FiveThirtyEight.

When a question asks you to make a plot, remember to set a theme, title, subtitle, labels, colors, etc. It is up to you how to personalize your plots, but put in some effort and think about making the plotting approach consistent throughout the document. For example, you could use the same theme for all plots. I also like to use the subtitle as a place for the main summary for the viewer.

Question 1: Hashtags

Within a pipeline using the Colorado-only tweet data, select the text variable and use stringr::str_extract_all() with a pattern of "#(\\d|\\w)+" to extract all of the hashtags from the tweets. This will return a list with one element. How many hashtags were used by Colorado senators?

## [1] 5436

There were 5436 hashtags used by Colorado senators between 2011 and 2017.
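As a check on how such a count is formed, here is a minimal sketch on three made-up tweets (the `toy_tweets` vector is hypothetical and not part of the Canvas data). When stringr::str_extract_all() is given a character vector, it returns one list element per input string, so the number of hashtags is the sum of the per-element lengths, while the length of the list itself is the number of tweets.

```r
library(stringr)

# hypothetical tweets, for illustration only
toy_tweets <- c("Proud of #Colorado and #COpolitics",
                "No hashtags here",
                "#wildfire relief for #BlackForest")

# one list element per tweet, each holding that tweet's hashtags
toy_hashtags <- str_extract_all(toy_tweets, pattern = "#(\\d|\\w)+")

n_tweets   <- length(toy_hashtags)        # counts tweets (list elements)
n_hashtags <- sum(lengths(toy_hashtags))  # counts hashtags across all tweets
```

In this toy case the two numbers differ (3 tweets, 4 hashtags), which is why it is worth confirming which one a reported total refers to.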

Question 2: Fires

Colorado is on fire right now and has experienced many wildfires over the years. Let’s examine senators’ tweet activity related to wildfires based on hashtags. Using the character vector of hashtags you extracted in Question 1, search for the hashtags that include “fire” or “wildfire”. How many hashtags included “fire”? How many included “wildfire”?

## [1] 16

## [1] 8

The number of hashtags that included the word “fire” was 16, and the number that included the word “wildfire” was 8. Because “wildfire” contains “fire”, those 8 hashtags are also counted among the 16.
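The sketch below, on a made-up vector of hashtags (`toy_tags` is hypothetical), shows how stringr::str_subset() behaves for these two searches. Matching is a case-sensitive substring match by default, so a hashtag such as "#Wildfire" is found by "fire" but not by "wildfire". The last line shows one possible case-insensitive variant using stringr::regex(); whether that is the right choice for the assignment is a judgment call.

```r
library(stringr)

# hypothetical hashtags, for illustration only
toy_tags <- c("#wildfire", "#firefighters", "#Wildfire", "#COpolitics")

# case-sensitive substring matches
fire_tags     <- str_subset(toy_tags, "fire")      # also matches "#wildfire"
wildfire_tags <- str_subset(toy_tags, "wildfire")  # misses "#Wildfire"

# case-insensitive variant
wildfire_any_case <- str_subset(toy_tags, regex("wildfire", ignore_case = TRUE))
```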

Question 3: Wildfires

Now, let’s look at general tweets concerning wildfires. First, subset the data to a dataframe that includes tweets containing the word “wildfire” and their corresponding timestamp and user. Specifically, (a) select text, date, and user and (b) filter to text strings that include the word “wildfire” using dplyr::filter() and stringr::str_detect().

## # A tibble: 33 × 3
##    text                                                         created_at user 
##    <chr>                                                        <chr>      <chr>
##  1 "Intro'd bill to help #wildfire recovery &amp; prevention e… 10/6/17 1… SenB…
##  2 "Tune in to watch @USDA @forestservice briefing on fixing #… 9/26/17 1… SenB…
##  3 "As #opioid addiction rips through Colorado &amp; the count… 7/5/17 19… SenB…
##  4 "Our thoughts and sympathies are with all of the families a… 7/12/16 2… SenB…
##  5 "Glad to see our wildfire mitigation provision, the Good Ne… 8/6/15 22… SenB…
##  6 "#TBT to speaking with the brave firefighters from @forests… 6/11/15 2… SenB…
##  7 "CO already experiencing climate change - wildfire, drought… 6/2/14 19… SenB…
##  8 "RT @WildernessNow: Great work by @SenBennetCO on linking d… 11/6/13 1… SenB…
##  9 "Time to work in Congress and with @forestservice to invest… 11/5/13 2… SenB…
## 10 "RT @SenateAg: Since 1980, wildfires have caused over $28 b… 11/5/13 2… SenB…
## # ℹ 23 more rows

Question 4: Senators

Which Colorado senator tweets more about wildfires?

## # A tibble: 2 × 2
## # Groups:   user [2]
##   user               n
##   <chr>          <int>
## 1 SenBennetCO       20
## 2 SenCoryGardner    13

Senator Bennet tweets more about wildfires: 20 tweets, compared with 13 for Senator Gardner.
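For reference, a per-user tally like this can also be produced in one step with dplyr::count(), which groups, counts, and returns an ungrouped result. The data frame below is made up for illustration (`toy_wildfire` and the user names are hypothetical).

```r
library(dplyr)

# hypothetical wildfire tweets, for illustration only
toy_wildfire <- tibble::tibble(
  user = c("SenatorA", "SenatorB", "SenatorA"),
  text = c("wildfire bill", "wildfire briefing", "wildfire funding")
)

# count tweets per user, most active user first
toy_counts <- toy_wildfire %>%
  count(user, sort = TRUE)
```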

Question 5: Timing

Using the same wildfires dataframe, create a summary table that shows the number of tweets containing the word “wildfire” by year (2011-2017). Which year has the most tweets about wildfires? Why might this be the case? (Hint: Think about what happened in the previous year.)

## # A tibble: 7 × 2
## # Groups:   year [7]
##    year     n
##   <dbl> <int>
## 1  2011     2
## 2  2012     3
## 3  2013    13
## 4  2014     1
## 5  2015     6
## 6  2016     5
## 7  2017     3

The senators tweeted the most about wildfires in 2013. This is most likely due to the Black Forest Fire, which at the time was the most destructive fire in Colorado history. The previous year’s fire season, which included the High Park and Waldo Canyon fires, likely also kept wildfire recovery and prevention on the agenda. There were also 3 other wildfires in 2013, making it the worst year for wildfires in Colorado state history (this record was broken in 2020, but the data only cover tweets from 2011 to 2017).
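The yearly table depends on parsing the timestamp strings into dates first. Here is a minimal sketch on made-up timestamps, assuming the "month/day/two-digit year hour:minute" layout that lubridate::mdy_hm() expects (`toy_wildfire` is hypothetical).

```r
library(dplyr)
library(lubridate)

# hypothetical timestamps, for illustration only
toy_wildfire <- tibble::tibble(
  created_at = c("6/12/13 14:05", "6/20/13 9:30", "7/5/17 19:45")
)

# parse the strings, pull out the year, and count tweets per year
toy_by_year <- toy_wildfire %>%
  mutate(year = year(mdy_hm(created_at))) %>%
  count(year)
```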

Question 6: Monthly tweets

Create a bar chart that answers the question: Are Colorado senators more active at a certain time of year? Hints: Convert month to a factor. Fill by user.

Overall, Senator Cory Gardner is more active on Twitter than Senator Bennet. It looks like Senator Gardner was more active from late summer to early fall. Neither senator was as active during November and December.
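One way to follow the hint about converting month to a factor is lubridate::month() with label = TRUE, which returns an ordered factor with all twelve months as levels. The sketch below uses made-up timestamps (`toy_dates` is hypothetical); note that the label text depends on the locale of the R session.

```r
library(lubridate)

# hypothetical timestamps, for illustration only
toy_dates <- mdy_hm(c("1/15/16 10:00", "8/3/16 16:20", "8/21/17 8:45"))

# ordered factor of month labels; all twelve months are kept as levels
toy_month <- month(toy_dates, label = TRUE, abbr = TRUE)

# tabulate tweets per month; months without tweets appear with a count of 0
month_counts <- table(toy_month)
```

Because the factor keeps every month as a level, a bar chart built on it can show empty months instead of silently dropping them.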

Question 7: Hourly tweets

Create a histogram of tweets by hour of day to visualize when our senators are tweeting.
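The binning behind such a histogram can be sketched on made-up timestamps (`toy_times` is hypothetical). One assumption to keep in mind: lubridate::mdy_hm() labels parsed times as UTC unless a tz argument is supplied, so the hours reflect whatever time zone the raw timestamps were recorded in.

```r
library(lubridate)

# hypothetical timestamps, for illustration only
toy_times <- mdy_hm(c("1/15/16 0:10", "8/3/16 16:20",
                      "8/21/17 16:45", "9/1/17 23:59"))

# hour of day, from 0 to 23
toy_hour <- hour(toy_times)

# counts per hour, keeping all 24 hours even when a count is 0
hour_counts <- table(factor(toy_hour, levels = 0:23))
```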

Appendix

# set global options for figures, code, warnings, and messages
knitr::opts_chunk$set(fig.width=6, fig.height=4, fig.path="../figs/",
                      echo=FALSE, warning=FALSE, message=FALSE)

# load in packages
library(tidyverse)
library(dplyr)
library(ggplot2)
library(stringr)
library(lubridate)

# load in data
senator_tweets <- readr::read_csv(file = "senators_co.csv")

# selecting hashtags within the text variable
hashtags <- stringr::str_extract_all(senator_tweets$text, pattern = "#(\\d|\\w)+")

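# total number of hashtags: str_extract_all() returns one list element per
# tweet, so sum the per-tweet counts (length(hashtags) alone counts tweets)
num_hashtags <- sum(lengths(hashtags))
print(num_hashtags)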
  
# hashtags that include "fire"
hashtags_fire <- stringr::str_subset(unlist(hashtags), "fire")
print(length(hashtags_fire))

# hashtags that include "wildfire"
hashtags_wildfire <- stringr::str_subset(unlist(hashtags), "wildfire")
print(length(hashtags_wildfire))

# filter to tweets concerning wildfires
wildfire <- senator_tweets %>%
  dplyr::select(text, created_at, user) %>%
  dplyr::filter(stringr::str_detect(text, "wildfire"))
print(wildfire)

# number of wildfire tweets by senator
senator <- wildfire %>%
  group_by(user) %>%
  count()
print(senator)
  
# number of wildfire tweets by year 
timing <- wildfire %>%
  mutate(date = mdy_hm(created_at),
         year = year(date)) %>%
  group_by(year) %>%
  count()
print(timing)

# create plot of tweets by month and user
monthly_tweets <- senator_tweets %>%
  mutate(date = mdy_hm(created_at),
         month = month(date)) %>%
  group_by(month, user)

ggplot(data = monthly_tweets,
       aes(x = month,
           fill = user)) +
  geom_bar(position = "dodge",
           color = "black") +
  scale_x_continuous(breaks = seq(from = 1,
                                  to = 12,
                                  by = 1),
                     labels = month.abb) +
  labs(x = "Month",
       y = "Tweet Count",
       title = "Tweet Count of Colorado Senators by Month") +
  scale_fill_manual(values = c("SenBennetCO" = "blue", "SenCoryGardner" = "red")) +
  theme_minimal()

# create stacked histogram of tweets by hour of day, filled by senator
hourly_tweets <- senator_tweets %>%
  mutate(date = mdy_hm(created_at),
         hour = hour(date)) %>%
  group_by(hour, user)

ggplot(data = hourly_tweets,
       aes(x = hour,
           fill = user)) +
  geom_histogram(bins = 24,
                 color = "black") +
  scale_x_continuous(breaks = seq(from = 0,
                                  to = 23,   # hour() returns values from 0 to 23
                                  by = 1)) +
  labs(x = "Hour of the Day",
       y = "Tweet Count",
       title = "Tweet Count of Colorado Senators by Hour of the Day") +
  scale_fill_manual(values = c("SenBennetCO" = "blue", "SenCoryGardner" = "red")) +
  theme_minimal()

MECH476: Engineering Data Analysis in R