Intro

Explain the three questions

Question 1) Comparing the proportion of McDonald’s tweets that mention breakfast to the proportion of Wendy’s tweets that mention breakfast.

Question 2) Comparing the mean number of likes and retweets for tweets about McDonald’s breakfast vs. Wendy’s breakfast.

Question 3) Modeling the length of breakfast tweets as a function of the restaurant they describe and the number of likes and retweets they received.

Discuss data I plan to collect

To compare the popularity of Wendy’s vs. McDonald’s breakfast, I used the rtweet package to obtain data from the Twitter API. Using the search_tweets function, I requested up to 18,000 tweets from the past 6 days for each of the search queries “mcdonalds breakfast” and “wendy breakfast”. I used the term “wendy” instead of “wendy’s” to account for the possible misspelling “wendys”. Separating the search terms with a space restricts the results to tweets that contain both terms.

Data Collection

# load packages
library(rtweet)
library(httpuv)
library(tidyverse)
library(broom)
library(knitr)

# Setup rtweet credentials
appname <- "XavierBAIS2E"
ckey <- "Kvt0WrU2Dv5DnWTzA486aFFPh"
secret <- "QFduroaBCuy7SDS2EypVBFmqI7bVBWaZkCaNTO6Udm1rtGiotA"

# build the Twitter token from the objects assigned above
twitter_token <- create_token(
  app = appname,
  consumer_key = ckey,
  consumer_secret = secret,
  set_renv = FALSE)
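Writing the consumer key and secret directly in the script means they travel with every copy of the report. A minimal sketch of one alternative, assuming the values are stored in environment variables; the variable names `TWITTER_CONSUMER_KEY` and `TWITTER_CONSUMER_SECRET` and the helper `read_credential` are hypothetical and not part of rtweet:

```r
# Hypothetical helper: read a credential from an environment variable
# and stop with a clear message if it has not been set.
read_credential <- function(var_name) {
  value <- Sys.getenv(var_name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", var_name, " is not set", call. = FALSE)
  }
  value
}

# Demonstration with placeholder values (assumed variable names).
Sys.setenv(TWITTER_CONSUMER_KEY = "example-key",
           TWITTER_CONSUMER_SECRET = "example-secret")
ckey   <- read_credential("TWITTER_CONSUMER_KEY")
secret <- read_credential("TWITTER_CONSUMER_SECRET")
```

In practice the variables would be set outside the script (for example in a user-level .Renviron file) rather than with Sys.setenv as in this demonstration.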

rtweet_mcds <- search_tweets("mcdonalds breakfast", token = twitter_token, n=18000)
rtweet_mcds_full <- search_tweets("mcdonalds", token = twitter_token, n=18000)
save(rtweet_mcds,file = "rtweet_mcds.Rdata")

load("rtweet_mcds.Rdata")

rtweet_wendys <- search_tweets("wendy breakfast", token = twitter_token, n=18000) 
rtweet_wendys_full <- search_tweets("wendy", token = twitter_token, n=18000) 
save(rtweet_wendys,file = "rtweet_wendys.Rdata")

load("rtweet_wendys.Rdata")
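Because search results change from one run to the next, the save/load pairs act as a cache of the collected tweets. A minimal sketch of that pattern, assuming a hypothetical helper `fetch_or_load` and a stand-in function `fake_search` in place of the real API call; it uses saveRDS/readRDS rather than save/load so that the cached object is returned as a value:

```r
# Hypothetical helper: run `fetch` only when no cached copy exists at `path`,
# otherwise read the cached copy back from disk.
fetch_or_load <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch()
  saveRDS(result, path)
  result
}

# Stand-in for an API call; counts how often it is actually invoked.
n_calls <- 0
fake_search <- function() {
  n_calls <<- n_calls + 1
  data.frame(text = c("breakfast one", "breakfast two"))
}

cache_path <- tempfile(fileext = ".rds")
first  <- fetch_or_load(cache_path, fake_search)  # calls fake_search
second <- fetch_or_load(cache_path, fake_search)  # reads the cached file
```

With this structure, re-knitting the report reuses the stored tweets instead of querying the API again.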

Aggregate the data for plotting and other analyses.

# aggregate data
rtweet_mcds <- rtweet_mcds %>%
  mutate(rest = "mcdonalds")
rtweet_wendys <- rtweet_wendys %>%
  mutate(rest = "wendys")
full_df <- rbind(rtweet_mcds,rtweet_wendys)
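The labelling and stacking step can be checked on a small example. The sketch below uses made-up toy data frames (`mcds_toy`, `wendys_toy`) in place of the real search results and only base R:

```r
# Toy stand-ins for the two search results (values are made up).
mcds_toy   <- data.frame(text = c("a", "b", "c"), favorite_count = c(0, 2, 5))
wendys_toy <- data.frame(text = c("d", "e"),      favorite_count = c(1, 0))

# Label each data frame with its restaurant, then stack them.
mcds_toy$rest   <- "mcdonalds"
wendys_toy$rest <- "wendys"
toy_df <- rbind(mcds_toy, wendys_toy)

# Rows per restaurant after stacking.
rest_counts <- table(toy_df$rest)
```

Note that rbind requires both data frames to have the same columns, which holds here because both come from the same search function.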

Exploratory Analysis

In the figure below, I plotted the number of breakfast tweets per hour for McDonald’s and for Wendy’s. The plot suggests that tweets about Wendy’s breakfast were more frequent than tweets about McDonald’s breakfast.
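The hourly counts behind a plot like this can be computed by truncating each timestamp to its hour and tabulating by restaurant. A minimal base R sketch on made-up timestamps, assuming the tweet time is stored in a column named `created_at`:

```r
# Toy timestamps (made up), in UTC.
toy_tweets <- data.frame(
  created_at = as.POSIXct(c("2020-04-28 08:05:00", "2020-04-28 08:40:00",
                            "2020-04-28 09:10:00", "2020-04-28 08:15:00",
                            "2020-04-28 09:30:00", "2020-04-28 09:45:00"),
                          tz = "UTC"),
  rest = c("mcdonalds", "mcdonalds", "mcdonalds", "wendys", "wendys", "wendys")
)

# Truncate each timestamp to the start of its hour, then count by restaurant.
toy_tweets$hour <- format(toy_tweets$created_at, "%Y-%m-%d %H:00", tz = "UTC")
hourly_counts <- table(toy_tweets$hour, toy_tweets$rest)
```

Each row of `hourly_counts` is one hour and each column one restaurant, which is the shape a per-hour line plot needs.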

Next, I plotted side-by-side histograms of the number of favorites (likes) per tweet for each restaurant. Most McDonald’s and Wendy’s breakfast tweets had 0 likes. However, a few tweets about McDonald’s breakfast received 200-400 likes, whereas no Wendy’s breakfast tweet received more than 200. This suggests that McDonald’s breakfast tweets attract more engagement.

Next, I plotted side-by-side histograms of the number of retweets per tweet for each restaurant. Most McDonald’s and Wendy’s breakfast tweets had 0 retweets. However, several tweets about McDonald’s breakfast received 1-100 retweets, whereas very few Wendy’s breakfast tweets fell in this range. This also suggests that McDonald’s breakfast tweets attract more engagement.
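When most counts are zero and a few are very large, numeric summaries can complement the histograms. The sketch below uses made-up like counts and base R; the log1p scale for the bins is an illustrative choice, not something used in the original plots:

```r
# Toy like counts (made up): mostly zeros with a few large values.
toy_likes <- data.frame(
  rest = rep(c("mcdonalds", "wendys"), each = 6),
  favorite_count = c(0, 0, 0, 1, 12, 350,  0, 0, 0, 0, 2, 90)
)

# Share of tweets with zero likes, and the largest like count, per restaurant.
share_zero <- tapply(toy_likes$favorite_count == 0, toy_likes$rest, mean)
max_likes  <- tapply(toy_likes$favorite_count, toy_likes$rest, max)

# Histogram bin counts on a log1p scale, computed without drawing a plot.
bins <- hist(log1p(toy_likes$favorite_count), breaks = 0:6, plot = FALSE)
```

The same calls would apply to a retweet count column by swapping the column name.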

In my last plot, I displayed histograms of the number of characters in each tweet for each group. The McDonald’s breakfast tweets had a bimodal character distribution, with one peak occurring around 50 characters and the second occurring around 140 characters. The Wendy’s distribution, on the other hand, was unimodal and more heavily right-skewed.

Statistical Analysis

Question 1

For my first question, I compare the proportion of tweets about McDonald’s that were also about breakfast to the proportion of tweets about Wendy’s that were also about breakfast. The data consist of the most recent 18,000 tweets as of April 29th around 5 pm EST for each of “mcdonalds” and “wendy”. Of these 18,000 tweets for each group, I found the number that also contain the word “breakfast” and used it to define the proportion of tweets in each group that were also about breakfast.

My null hypothesis is that these two proportions are equal. I test it with a two-sample z-test for proportions, using the prop.test function in R with a two-sided alternative and a significance level of 0.05.

n_mcds <- nrow(rtweet_mcds)
n_wendys <- nrow(rtweet_wendys)

prop.test(x = c(n_mcds,n_wendys),
          n = c(18000,18000))

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(n_mcds, n_wendys) out of c(18000, 18000)
## X-squared = 11.591, df = 1, p-value = 0.0006627
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.015954608 -0.004267614
## sample estimates:
##     prop 1     prop 2 
## 0.08083333 0.09094444

With a p-value below 0.05, there is sufficient evidence to reject the null hypothesis of equal proportions. The estimated proportion of McDonald’s tweets that were also about breakfast (0.081) is lower than the corresponding proportion for Wendy’s (0.091). These results suggest that Wendy’s breakfast may have been the more popular topic in this time frame.
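The X-squared value in the output can be reproduced by hand. The sketch below uses the counts implied by the reported sample proportions (0.08083333 and 0.09094444 of 18,000 tweets correspond to 1,455 and 1,637 tweets) and compares the hand calculation to prop.test; the variable names are illustrative:

```r
# Counts implied by the reported sample proportions of 18,000 tweets each.
x <- c(1455, 1637)
n <- c(18000, 18000)

# Pooled proportion under the null hypothesis of equal proportions.
p_pool <- sum(x) / sum(n)

# Pearson chi-squared statistic with Yates' continuity correction,
# computed from the 2 x 2 table of breakfast vs. non-breakfast counts.
observed <- rbind(x, n - x)
expected <- rbind(n * p_pool, n * (1 - p_pool))
x_squared <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value   <- pchisq(x_squared, df = 1, lower.tail = FALSE)

# Cross-check against prop.test, which applies the same correction by default.
check <- prop.test(x = x, n = n)
```

Without the continuity correction, this chi-squared statistic is the square of the usual two-sample z statistic, which is why prop.test can serve as the z-test for proportions.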

Question 2

In my second question, I compare the mean number of likes and retweets for tweets about McDonald’s breakfast vs. Wendy’s breakfast. To assess this, I use two-sample t-tests, with null hypotheses that the mean number of likes and the mean number of retweets are equal across the two groups.

t.test(x = rtweet_mcds$favorite_count,
       y = rtweet_wendys$favorite_count)

## 
##  Welch Two Sample t-test
## 
## data:  rtweet_mcds$favorite_count and rtweet_wendys$favorite_count
## t = 3.6951, df = 1547.2, p-value = 0.0002274
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.173117 3.827740
## sample estimates:
## mean of x mean of y 
##  3.610997  1.110568

t.test(x = rtweet_mcds$retweet_count,
       y = rtweet_wendys$retweet_count)

## 
##  Welch Two Sample t-test
## 
## data:  rtweet_mcds$retweet_count and rtweet_wendys$retweet_count
## t = 10.513, df = 3023.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.508055 10.950759
## sample estimates:
## mean of x mean of y 
## 11.316151  2.086744

For the test of the number of favorites in each group, I found a p-value below 0.05, indicating sufficient evidence that the mean number of likes for tweets about McDonald’s breakfast is not equal to the mean number of likes for tweets about Wendy’s breakfast. Specifically, the McDonald’s breakfast tweets had significantly more likes on average (3.61 vs. 1.11).

For the test of the number of retweets in each group, I also found a p-value below 0.05, indicating sufficient evidence that the mean number of retweets differs between the two groups. Specifically, tweets about Wendy’s breakfast had fewer retweets on average than tweets about McDonald’s breakfast (2.09 vs. 11.32).
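The fractional degrees of freedom in both outputs arise because t.test defaults to Welch's version of the test, which does not assume equal variances. A minimal sketch of the calculation on simulated counts (not the real tweet data; the variable names are illustrative):

```r
set.seed(1)
# Simulated count data standing in for like counts (not the real tweets).
likes_a <- rpois(200, lambda = 4)
likes_b <- rpois(150, lambda = 2)

# Estimated variance of each sample mean.
v_a <- var(likes_a) / length(likes_a)
v_b <- var(likes_b) / length(likes_b)

# Welch's t statistic: difference in means over the unpooled standard error.
t_stat <- (mean(likes_a) - mean(likes_b)) / sqrt(v_a + v_b)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (v_a + v_b)^2 /
  (v_a^2 / (length(likes_a) - 1) + v_b^2 / (length(likes_b) - 1))

# Cross-check against t.test, which uses the Welch test by default.
check <- t.test(x = likes_a, y = likes_b)
```

The degrees of freedom fall between the smaller sample size minus one and the pooled value, and are generally not a whole number.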

Question 3

For my third question, I model the length of breakfast tweets as a function of the restaurant they describe and the number of likes and retweets they received. To do this, I fit a multiple regression model using the lm function in R.

# Q3: Regression model of text width vs. rest + n_fav + n_retweets
q3 <- lm(display_text_width ~ rest + favorite_count + retweet_count,
         data = full_df)

term	estimate	std.error	statistic	p.value
(Intercept)	113.0868460	1.4412037	78.4669427	0.0000000
restwendys	-46.1184384	1.9103616	-24.1412083	0.0000000
favorite_count	0.0471571	0.0526628	0.8954545	0.3706140
retweet_count	0.1106514	0.0384944	2.8744842	0.0040746

In the table above, we can see that the coefficient for restaurant (Wendy’s vs. McDonald’s) is significantly different from zero (p-value < 0.05). Moreover, the coefficient for this term was -46.12, indicating that Wendy’s breakfast tweets were on average about 46 characters shorter than McDonald’s breakfast tweets, adjusting for likes and retweets.

Similarly, the coefficient for number of retweets was significantly different from zero (p-value < 0.05). With an estimated coefficient of 0.11, each additional retweet is associated with a tweet that is about 0.11 characters longer on average, adjusting for restaurant and number of likes.

Finally, I found insufficient evidence to conclude that the coefficient for number of likes differs from zero (p-value = 0.37). Overall, this model tells us that Wendy’s breakfast tweets were shorter and that tweets with more retweets tended to be slightly longer.
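To make the coefficients concrete, the fitted equation can be evaluated for example tweets. The sketch below copies the estimates from the regression table; the helper `predict_length` and the example inputs (0 likes, 10 retweets) are illustrative:

```r
# Coefficients copied from the regression table above.
coefs <- c(intercept = 113.0868460, restwendys = -46.1184384,
           favorite_count = 0.0471571, retweet_count = 0.1106514)

# Fitted tweet length for a given restaurant, like count, and retweet count.
# The restaurant enters as a 0/1 indicator that equals 1 for Wendy's.
predict_length <- function(rest, likes, retweets) {
  unname(coefs["intercept"] +
    coefs["restwendys"] * (rest == "wendys") +
    coefs["favorite_count"] * likes +
    coefs["retweet_count"] * retweets)
}

mcds_fit   <- predict_length("mcdonalds", likes = 0, retweets = 10)
wendys_fit <- predict_length("wendys",    likes = 0, retweets = 10)
```

For these inputs the fitted lengths are about 114.2 characters for McDonald’s and 68.1 characters for Wendy’s, and the gap between them equals the restaurant coefficient regardless of the like and retweet counts chosen.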

Data from API Assignment