(1) Obama Tweets: Retweets vs. Favorites
A .CSV file containing recent Tweets from former President Barack Obama can be downloaded HERE. The data is sorted by date, most recent at the top.
The variables (columns) are:
text: the body of the tweet
date: when the tweet was sent
date2: when the tweet was sent, different format
retweet_count: how many people retweeted this tweet
favorite_count: how many people favorited this tweet
is_retweet: whether or not this tweet is a retweet of someone else’s tweet
source: device used to send the tweet
is_quote: is the tweet a quote of someone else
is_reply: is the tweet a reply

There are two ways in which other Twitter users can indicate support for a tweet: favoriting and retweeting. For example, if a tweet has favorite_count = 5 and retweet_count = 10, then this suggests that 5 people favorited the tweet (saved it) and 10 people retweeted it (broadcast it to their followers).
recent. Note that the data is sorted in reverse time order. Get the header names of recent to confirm that the data imported correctly. Look at the first few rows of the data and the final few rows of the data. Also get the dimension of recent. What is the date range of the tweets? How many tweets does this dataset include?
recent <- read.csv("http://reuningscherer.net/S&DS230/data/ObamaTweets.csv", header = TRUE)
names(recent)
## [1] "text" "date" "date2" "source"
## [5] "is_quote" "is_retweet" "is_reply" "favorite_count"
## [9] "retweet_count"
head(recent)
## text
## 1 .@EkpeUdoh\x97I saw your tweet about A Promised Land and thought I\x92d respond. For me, the title of my book is about the work we need to do to realize the promise of becoming a more perfect union, even if we don\x92t get there in our lifetimes. Glad to see your book club is reading it! https://t.co/aVRMxwt5JH https://t.co/Uz3LMAWD2z
## 2 Here\x92s an example of the incredible risks and burdens that our essential workers have been facing. And even as we move toward vaccinating our population, we all need to remain vigilant until we\x92ve beaten this pandemic.\nhttps://t.co/SKhioxW0Mp
## 3 Hank Aaron was one of the best baseball players we\x92ve ever seen and one of the strongest people I\x92ve ever met. Michelle and I send our thoughts and prayers to the Aaron family and everyone who was inspired by this unassuming man and his towering example. https://t.co/2RZdc82Y18
## 4 This is a time for boldness and President Biden is already delivering.\n\nBy rejoining the Paris climate accords on day one, he declared loudly and clearly that the U.S. will once again lead the fight against climate change.\n\nAnd this is only the beginning. \nhttps://t.co/6cSS49lkyx
## 5 Today was a good day.\n\nAnd it was only possible because of you. Because you made calls. Because you marched. Because you wore your masks and voted like never before.\n\nFor four years, you defended our democracy with everything you had\x97and now, our country can enter a new day. https://t.co/JicHOQCIxt
## 6 On a day for the history books, @TheAmandaGorman delivered a poem that more than met the moment. Young people like her are proof that "there is always light, if only we're brave enough to see it; if only we're brave enough to be it." https://t.co/mbywtvjtEH
## date date2 source is_quote is_retweet is_reply
## 1 1/27/2021 14:17 1/27/2021 iPhone TRUE FALSE FALSE
## 2 1/26/2021 6:31 1/26/2021 Web App FALSE FALSE FALSE
## 3 1/22/2021 11:00 1/22/2021 iPhone FALSE FALSE FALSE
## 4 1/21/2021 14:53 1/21/2021 iPhone FALSE FALSE FALSE
## 5 1/20/2021 19:16 1/20/2021 iPhone TRUE FALSE FALSE
## 6 1/20/2021 12:50 1/20/2021 Media Studio FALSE FALSE FALSE
## favorite_count retweet_count
## 1 31409 3267
## 2 32256 4404
## 3 216268 18422
## 4 152452 12436
## 5 420452 39642
## 6 575775 87055
tail(recent)
## text
## 722 In the weekly address, President Obama discusses what #Obamacare has done to improve health care. https://t.co/VdQlyrSZhx
## 723 Let's keep working to keep our economy on a better, stronger course. https://t.co/bV2BVjyj7a
## 724 The landmark #ParisAgreement enters into force today\x97we must keep up the momentum to #ActOnClimate. https://t.co/Cyw5Udaoro
## 725 The economy added 161,000 jobs in October, and wages are up 2.8 percent over the past year. https://t.co/pJxjgLnjCt #JobsReport
## 726 There are a lot of plans out there. Check your options and lock in the one that's best for you: https://t.co/buFY9ozDz4 #GetCovered
## 727 The positive impact of #Obamacare is undeniable, but there's one big factor holding many states back: https://t.co/7XebIdRX34
## date date2 source is_quote is_retweet is_reply
## 722 11/5/2016 8:38 11/5/2016 Web Client FALSE FALSE FALSE
## 723 11/4/2016 15:16 11/4/2016 Web Client FALSE FALSE FALSE
## 724 11/4/2016 11:25 11/4/2016 Web Client FALSE FALSE FALSE
## 725 11/4/2016 9:12 11/4/2016 Web Client FALSE FALSE FALSE
## 726 11/3/2016 14:02 11/3/2016 Web Client FALSE FALSE FALSE
## 727 11/3/2016 12:08 11/3/2016 Web Client FALSE FALSE FALSE
## favorite_count retweet_count
## 722 26447 3971
## 723 42089 8255
## 724 23195 5438
## 725 16160 3457
## 726 8290 1348
## 727 10422 1672
dim(recent)
## [1] 727 9
The tweets in this data frame range from 11/3/2016 to 1/27/2021. There are 727 tweets in the data frame.
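The date range and the tweet count can also be computed instead of read off the printed rows. The following is a minimal sketch, assuming a small made-up data frame `demo` (hypothetical, not the real data) whose `date2` column uses the same month/day/year format:

```r
# Hypothetical stand-in for `recent`; only the date2 format matches the real data
demo <- data.frame(date2 = c("1/27/2021", "6/15/2018", "11/3/2016"))

# Parse month/day/year strings into Date objects
parsed <- as.Date(demo$date2, format = "%m/%d/%Y")

date_range <- range(parsed)  # earliest and latest dates
n_tweets   <- nrow(demo)     # one row per tweet

date_range
n_tweets
```

Working with parsed dates avoids relying on the file being sorted.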
table1. Show the results of table1. Write a line that calculates the percent of Tweets that were Retweets, rounds this value to two decimal places, multiplies the result by 100, and pastes on a “%” symbol. There should be no space between the number and the ‘%’ symbol.
table1 <- table(recent$is_retweet)
table1
##
## FALSE TRUE
## 663 64
# Proportion of retweets, rounded to two decimals, then converted to a percent
paste(round(table1["TRUE"] / sum(table1), 2) * 100, "% of Obama's tweets are retweets", sep = "")
## [1] "9% of Obama's tweets are retweets"
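If `is_retweet` was read in as a logical column (an assumption about how `read.csv` parsed it), the same percentage can be obtained without building a table, because the mean of a logical vector is the proportion of TRUE values. A small sketch with a hypothetical vector `flags`:

```r
# Hypothetical logical vector standing in for recent$is_retweet
flags <- c(TRUE, FALSE, FALSE, FALSE)

# mean() of a logical vector is the proportion of TRUE values
prop <- mean(flags)

# Round to two decimals, convert to a percent, attach "%" with no space
pct <- paste0(round(prop, 2) * 100, "%")
pct
```

`paste0()` is equivalent to `paste(..., sep = "")`.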
favorite_count and retweet_count. Make histograms for each of these two variables as well. Put a title on each histogram, label the horizontal axis, and make the bars orange. How would you describe the shape of these distributions (use words like ‘symmetric’ or ‘skewed’, or perhaps the name of some distribution that has a similar shape . . .)?
summary(recent$favorite_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 42738 120891 274676 280454 4284800
summary(recent$retweet_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 901 7514 19449 53427 44008 1541589
hist(recent$favorite_count,
col = "orange",
xlab = "Number of favorites",
main = "Histogram of Favorited Tweets by Obama")
hist(recent$retweet_count,
col = "orange",
xlab = "Number of retweets",
main = "Histogram of Retweets by Obama")
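Both histograms show a right-skewed, unimodal distribution. In each case the mean (274676 favorites, 53427 retweets) is more than twice the median (120891 and 19449), which is consistent with a long right tail.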
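One quick numerical check on the skew is to compare each mean with its median. The sketch below copies the values from the summary() output above; the variable names are made up for illustration:

```r
# Values copied from the summary() output for the two count variables
fav_mean <- 274676; fav_median <- 120891
re_mean  <- 53427;  re_median  <- 19449

# A mean well above the median is a common sign of right skew
fav_ratio <- fav_mean / fav_median
re_ratio  <- re_mean / re_median

fav_ratio
re_ratio
```

A ratio near 1 would suggest a roughly symmetric distribution; this is a rough heuristic, not a formal test.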
favorite_count JUST for the observations for which is_retweet is TRUE. What do you observe?
summary(subset(recent, is_retweet == TRUE, select = favorite_count))
## favorite_count
## Min. :0
## 1st Qu.:0
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
Every tweet that is a retweet has a favorite_count of 0. However, part (b) shows that only 64 of the 727 tweets are retweets, so this observation rests on a small number of tweets.
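The same observation can be checked directly with a logical test. This is a sketch on a made-up data frame `demo` (hypothetical values, same column names as the real data):

```r
# Hypothetical stand-in for `recent` with the two relevant columns
demo <- data.frame(
  is_retweet     = c(TRUE, TRUE, FALSE, FALSE),
  favorite_count = c(0, 0, 120, 45)
)

# TRUE only if every retweet row has a favorite count of exactly 0
all_zero <- all(demo$favorite_count[demo$is_retweet] == 0)

# Number of retweet rows the check is based on
n_retweets <- sum(demo$is_retweet)

all_zero
n_retweets
```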
recent_NoRe that contains all data from recent for which is_retweet is FALSE (essentially, we’re removing ReTweets since they all have favorite_count = 0). USE THIS NEW DATAFRAME for the remainder of this problem set. Get the dimension of this dataframe to make sure the remaining number of rows (and columns) are consistent with the results in part (b).
recent_noRe <- subset(recent, is_retweet == FALSE)
dim(recent_noRe)
## [1] 663 9
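The consistency check against part (b) can be written as code rather than done by eye. A minimal sketch on a made-up data frame `demo` (hypothetical; the real check would use `recent` and `table1`):

```r
# Hypothetical stand-in for `recent`
demo <- data.frame(
  is_retweet    = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  retweet_count = c(10, 250, 90, 5, 1300)
)

demo_noRe <- subset(demo, is_retweet == FALSE)
counts    <- table(demo$is_retweet)

# Rows kept should equal the FALSE count; the columns should be unchanged
rows_ok <- nrow(demo_noRe) == counts[["FALSE"]]
cols_ok <- ncol(demo_noRe) == ncol(demo)

rows_ok
cols_ok
```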
recent_NoRe which will be the log base 10 transformations of favorite_count and retweet_count. Call these variables log10favCnt and log10reCnt, respectively. The function you want to take log base 10 is called log10(). Note - you can add a variable to a data frame by simply creating a name using the $ operator and then assigning it the desired value: e.g. recent_NoRe$log10favCnt <- (whatever you want to assign this)
recent_noRe$log10favCnt <- log10(recent_noRe$favorite_count)
recent_noRe$log10reCnt <- log10(recent_noRe$retweet_count)
hist(recent_noRe$log10favCnt,
col = "orange",
xlab = "Number of favorites (log_10)",
main = "Histogram of Favorited Tweets by Obama (log scale)")
hist(recent_noRe$log10reCnt,
col = "orange",
xlab = "Number of retweets (log_10)",
main = "Histogram of Retweeted Tweets by Obama (log scale)")
These new histograms are still unimodal, but they are much closer to a symmetric, approximately normal shape.
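Removing the retweets before taking logs matters because the full data contained favorite_count values of 0, and log10(0) is -Inf. The sketch below uses a made-up vector `counts` to show the issue and a simple check for it. Whether any zeros remain in the filtered data is not shown in the output above, so treat this as a precaution rather than a known problem:

```r
# Hypothetical counts, including one zero
counts <- c(0, 100, 12000, 4284800)

logged <- log10(counts)
logged                       # the first element is -Inf

# Check for zeros before transforming
has_zero <- any(counts == 0)
has_zero

# Transform only the strictly positive counts
logged_pos <- log10(counts[counts > 0])
logged_pos
```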
favorite_count on the x-axis and retweet_count on the y-axis. Label your axes, put on a main title, and make the plot characters blue.
plot(recent_noRe$favorite_count, recent_noRe$retweet_count,
main = "Favorite vs retweet counts of tweets by Obama",
xlab = "number of times favorited",
ylab = "number of times retweeted",
col = "blue")
plot(recent_noRe$log10favCnt, recent_noRe$log10reCnt,
main = "Favorite vs retweet counts of tweets by Obama (log scale)",
xlab = "number of times favorited (log_10)",
ylab = "number of times retweeted (log_10)",
col = "red")
The log scale spreads the data away from the corner of the plot and toward its center, which makes the plot easier to read. I prefer the log-scale plot because it shows a much clearer linear trend between the number of times favorited and the number of times retweeted.
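The visual impression of a clearer linear trend can be backed up with correlations on both scales. This is only a sketch on made-up numbers (the vectors `fav` and `re` are hypothetical), not a result from the real data:

```r
# Hypothetical favorite and retweet counts with one very large tweet
fav <- c(1e3, 5e3, 2e4, 1e5, 2e6)
re  <- c(150, 600, 3000, 9000, 400000)

# Pearson correlation on the raw scale and on the log10 scale
r_raw <- cor(fav, re)
r_log <- cor(log10(fav), log10(re))

r_raw
r_log
```

On the raw scale a single extreme point can dominate the correlation, so the log-scale value is usually the more informative summary for counts like these.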
recent_NoRe dataframe called year and month that will contain respectively the year and month the tweet was created. You’ll need to look up how to use two functions: as.Date and substr(). You’ll also need to use the as.numeric() function to make sure that both new variables are numbers. Show the first 20 observations for each resulting variable.
recent_noRe$year <- as.numeric(substr(as.Date(recent_noRe$date2, format = "%m/%d/%Y"), 1, 4))
recent_noRe$year[c(1:20)]
## [1] 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021
## [16] 2021 2021 2021 2021 2021
recent_noRe$month <- as.numeric(substr(as.Date(recent_noRe$date2, format = "%m/%d/%Y"), 6, 7))
recent_noRe$month[c(1:20)]
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
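The substr() approach works because a Date prints as YYYY-MM-DD. An alternative that does not depend on character positions is format() with the "%Y" and "%m" codes. A short sketch on two made-up date strings in the same format as date2:

```r
# Hypothetical date strings in month/day/year format
dates  <- c("1/27/2021", "11/3/2016")
parsed <- as.Date(dates, format = "%m/%d/%Y")

# Pull out the year and month as text, then convert to numbers
year  <- as.numeric(format(parsed, "%Y"))
month <- as.numeric(format(parsed, "%m"))

year
month
```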
recent_3 that only has observations from the specified years. You might want to use the %in% operator on your newly created variable year. Use this new dataframe to make your plot. Use the graphics option pch = 19 to get solid round points, and use the graphics option col = as.factor(year) to make different colors for 2019 and 2020. The final line of code below will add a legend to the top left of the plot.
# year is numeric, so compare against numbers rather than strings
recent_3 <- subset(recent_noRe, year %in% c(2019, 2020))
plot(recent_3$log10favCnt, recent_3$log10reCnt,
main = "Favorite vs retweet counts of tweets by Obama in 2019 and 2020 (log scale)",
xlab = "number of times favorited (log_10)",
ylab = "number of times retweeted (log_10)",
pch = 19,
col = as.factor(recent_3$year))
legend("topleft", legend = c("2019","2020"), col = c(1,2), pch = 19)
The 2020 data are more tightly clustered than the 2019 data. Both years show a linear relationship, but it is stronger in 2019. The slopes of the two trends are approximately the same, with the 2019 slope possibly a bit steeper.
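The comparison of slopes could be made concrete by fitting a separate least-squares line for each year. The sketch below uses a made-up data frame `demo` (hypothetical values, same column names as recent_3), so its numbers say nothing about the real tweets:

```r
# Hypothetical stand-in for recent_3
demo <- data.frame(
  year        = c(2019, 2019, 2019, 2020, 2020, 2020),
  log10favCnt = c(4.0, 5.0, 6.0, 4.5, 5.0, 5.5),
  log10reCnt  = c(3.0, 4.1, 5.0, 3.6, 4.0, 4.5)
)

# Fit log10reCnt ~ log10favCnt within each year and keep the slope
slopes <- sapply(split(demo, demo$year), function(d) {
  coef(lm(log10reCnt ~ log10favCnt, data = d))[["log10favCnt"]]
})

slopes  # one slope per year, named by year
```

Running the same code on recent_3 would give the two slopes to compare against the visual impression from the plot.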