Homework 02 Data Handling, Graphics, More R!

(1) Obama Tweets: Retweets vs. Favorites

A .CSV file containing recent Tweets from former President Barack Obama can be downloaded HERE. The data is sorted by date, most recent at the top.

The variables (columns) are:

text: the body of the tweet
date: when the tweet was sent
date2: when the tweet was sent, different format
retweet_count: how many people retweeted this tweet
favorite_count: how many people favorited this tweet
is_retweet: whether or not this tweet is a retweet of someone else’s tweet
source: device used to send the tweet
is_quote: is the tweet a quote of someone else
is_reply: is the tweet a reply

There are two ways in which other Twitter users can indicate support for a tweet: favoriting and retweeting. For example, if a tweet has favorite_count = 5 and retweet_count = 10, then this suggests that 5 people favorited the tweet (saved it) and 10 people retweeted it (broadcasted it to their followers).

Insert an R code chunk right below this that imports the data into a dataframe called recent. Note that the data is sorted in reverse time order. Get the header names of recent to confirm that the data imported correctly. Look at the first few rows of the data and the final few rows of the data. Also get the dimension of recent. What is the date range of the tweets? How many tweets does this dataset include?

recent <- read.csv("http://reuningscherer.net/S&DS230/data/ObamaTweets.csv", header = TRUE)
names(recent)

## [1] "text"           "date"           "date2"          "source"        
## [5] "is_quote"       "is_retweet"     "is_reply"       "favorite_count"
## [9] "retweet_count"

head(recent)

##                                                                                                                                                                                                                                                                                                                                               text
## 1 .@EkpeUdoh\x97I saw your tweet about A Promised Land and thought I\x92d respond. For me, the title of my book is about the work we need to do to realize the promise of becoming a more perfect union, even if we don\x92t get there in our lifetimes. Glad to see your book club is reading it! https://t.co/aVRMxwt5JH https://t.co/Uz3LMAWD2z
## 2                                                                                        Here\x92s an example of the incredible risks and burdens that our essential workers have been facing. And even as we move toward vaccinating our population, we all need to remain vigilant until we\x92ve beaten this pandemic.\nhttps://t.co/SKhioxW0Mp
## 3                                                     Hank Aaron was one of the best baseball players we\x92ve ever seen and one of the strongest people I\x92ve ever met. Michelle and I send our thoughts and prayers to the Aaron family and everyone who was inspired by this unassuming man and his towering example. https://t.co/2RZdc82Y18
## 4                                                    This is a time for boldness and President Biden is already delivering.\n\nBy rejoining the Paris climate accords on day one, he declared loudly and clearly that the U.S. will once again lead the fight against climate change.\n\nAnd this is only the beginning. \nhttps://t.co/6cSS49lkyx
## 5                               Today was a good day.\n\nAnd it was only possible because of you. Because you made calls. Because you marched. Because you wore your masks and voted like never before.\n\nFor four years, you defended our democracy with everything you had\x97and now, our country can enter a new day. https://t.co/JicHOQCIxt
## 6                                                                                On a day for the history books, @TheAmandaGorman delivered a poem that more than met the moment. Young people like her are proof that "there is always light, if only we're brave enough to see it; if only we're brave enough to be it." https://t.co/mbywtvjtEH
##              date     date2       source is_quote is_retweet is_reply
## 1 1/27/2021 14:17 1/27/2021       iPhone     TRUE      FALSE    FALSE
## 2  1/26/2021 6:31 1/26/2021      Web App    FALSE      FALSE    FALSE
## 3 1/22/2021 11:00 1/22/2021       iPhone    FALSE      FALSE    FALSE
## 4 1/21/2021 14:53 1/21/2021       iPhone    FALSE      FALSE    FALSE
## 5 1/20/2021 19:16 1/20/2021       iPhone     TRUE      FALSE    FALSE
## 6 1/20/2021 12:50 1/20/2021 Media Studio    FALSE      FALSE    FALSE
##   favorite_count retweet_count
## 1          31409          3267
## 2          32256          4404
## 3         216268         18422
## 4         152452         12436
## 5         420452         39642
## 6         575775         87055

tail(recent)

##                                                                                                                                    text
## 722           In the weekly address, President Obama discusses what #Obamacare has done to improve health care. https://t.co/VdQlyrSZhx
## 723                                        Let's keep working to keep our economy on a better, stronger course. https://t.co/bV2BVjyj7a
## 724      The landmark #ParisAgreement enters into force today\x97we must keep up the momentum to #ActOnClimate. https://t.co/Cyw5Udaoro
## 725     The economy added 161,000 jobs in October, and wages are up 2.8 percent over the past year. https://t.co/pJxjgLnjCt #JobsReport
## 726 There are a lot of plans out there. Check your options and lock in the one that's best for you: https://t.co/buFY9ozDz4 #GetCovered
## 727       The positive impact of #Obamacare is undeniable, but there's one big factor holding many states back: https://t.co/7XebIdRX34
##                date     date2     source is_quote is_retweet is_reply
## 722  11/5/2016 8:38 11/5/2016 Web Client    FALSE      FALSE    FALSE
## 723 11/4/2016 15:16 11/4/2016 Web Client    FALSE      FALSE    FALSE
## 724 11/4/2016 11:25 11/4/2016 Web Client    FALSE      FALSE    FALSE
## 725  11/4/2016 9:12 11/4/2016 Web Client    FALSE      FALSE    FALSE
## 726 11/3/2016 14:02 11/3/2016 Web Client    FALSE      FALSE    FALSE
## 727 11/3/2016 12:08 11/3/2016 Web Client    FALSE      FALSE    FALSE
##     favorite_count retweet_count
## 722          26447          3971
## 723          42089          8255
## 724          23195          5438
## 725          16160          3457
## 726           8290          1348
## 727          10422          1672

dim(recent)

## [1] 727   9

The tweets in this data frame range from 11/3/2016 to 1/27/2021. There are 727 tweets in this data frame.
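The date range can also be computed instead of read off head() and tail(). This is a minimal sketch, assuming the date2 strings are in month/day/year form as in the output above; the vector d2 is a hypothetical stand-in for recent$date2.

```r
# Hypothetical stand-in for recent$date2 (values copied from the output above)
d2 <- c("1/27/2021", "1/26/2021", "11/4/2016", "11/3/2016")

# Parse the m/d/Y strings into Date objects, then take the earliest and latest
dates <- as.Date(d2, format = "%m/%d/%Y")
date_range <- range(dates)
date_range
```

Because range() works on Date objects, the result does not depend on the sort order of the file.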

Create a table that shows how many of the Tweets were Retweets, and call this object table1. Show the results of table1. Write a line that calculates the percent of Tweets that were Retweets, rounds this value to two decimal places, multiplies the result by 100, and pastes on a “%” symbol. There should be no space between the number and the ‘%’ symbol.

table1 <- table(recent$is_retweet)
table1

## 
## FALSE  TRUE 
##   663    64

paste(round((table1[2]/(table1[1]+table1[2]))*100), "% of Obama's tweets are retweets", sep = "")

## [1] "9% of Obama's tweets are retweets"
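The same percentage can be obtained without indexing into the table, since the mean of a logical vector is the proportion of TRUE values. This is a sketch on made-up data; is_rt is a hypothetical stand-in for recent$is_retweet, assumed to be read in as a logical column.

```r
# Hypothetical logical vector standing in for recent$is_retweet
is_rt <- c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)

# The mean of a logical vector is the proportion of TRUE values
prop_rt <- mean(is_rt)

# Round to two decimals, scale to a percentage, and append "%" with no space
pct_label <- paste0(round(prop_rt, 2) * 100, "%")
pct_label
```

paste0() is equivalent to paste(..., sep = ""), so no space appears before the "%" symbol.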

Get summary statistics for both favorite_count and retweet_count. Make histograms for each of these two variables as well. Put a title on each histogram, label the horizontal axis, and make the bars orange. How would you describe the shape of these distributions (use words like ‘symmetric’ or ‘skewed’, or perhaps the name of some distribution that has a similar shape . . .)?

summary(recent$favorite_count)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   42738  120891  274676  280454 4284800

summary(recent$retweet_count)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     901    7514   19449   53427   44008 1541589

hist(recent$favorite_count,
          col = "orange",
          xlab = "Number of favorites",
          main = "Histogram of Favorited Tweets by Obama")

hist(recent$retweet_count,
          col = "orange",
          xlab = "Number of retweets",
          main = "Histogram of Retweets by Obama")

Both histograms show a right-skewed, unimodal distribution. In both summaries the mean is well above the median, which is consistent with a long right tail.

Get summary statistics for favorite_count JUST for the observations for which is_retweet is TRUE. What do you observe?

summary(subset(recent, is_retweet == TRUE, select = favorite_count))

##  favorite_count
##  Min.   :0     
##  1st Qu.:0     
##  Median :0     
##  Mean   :0     
##  3rd Qu.:0     
##  Max.   :0

Tweets that are retweets all have a favorite count of 0. However, this conclusion rests on a small sample, since part (b) shows that only 64 of the tweets are retweets.

Create a new dataframe called recent_NoRe that contains all data from recent for which is_retweet is FALSE (essentially, we’re removing ReTweets since they all have favorite_count = 0). USE THIS NEW DATAFRAME for the remainder of this problem set. Get the dimension of this dataframe to make sure the remaining number of rows (and columns) are consistent with the results in part (b).

recent_noRe <- subset(recent, is_retweet == FALSE)
dim(recent_noRe)

## [1] 663   9
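The consistency check against part (b) can also be done in code instead of by eye. This is a minimal sketch on a made-up data frame; toy and toy_noRe are hypothetical stand-ins for recent and the filtered data frame.

```r
# Hypothetical toy data frame standing in for recent
toy <- data.frame(
  is_retweet     = c(FALSE, TRUE, FALSE, FALSE, TRUE),
  favorite_count = c(120, 0, 45, 300, 0)
)

# Keep only the original tweets (rows where is_retweet is FALSE)
toy_noRe <- subset(toy, is_retweet == FALSE)

# The number of rows kept should equal the number of FALSE values
n_false <- sum(!toy$is_retweet)
nrow(toy_noRe) == n_false

# Filtering rows should leave the number of columns unchanged
ncol(toy_noRe) == ncol(toy)
```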

Make two new variables as a part of recent_NoRe which will be the log base 10 transformations of favorite_count and retweet_count. Call these variables log10favCnt and log10reCnt, respectively. The function you want to take log base 10 is called log10(). Note - you can add a variable to a dataframe by simply creating a name using the $ operator and then assigning it the desired value: e.g. recent_NoRe$log10favCnt <- (whatever you want to assign this)

recent_noRe$log10favCnt <- log10(recent_noRe$favorite_count)
recent_noRe$log10reCnt <- log10(recent_noRe$retweet_count)

Make histograms of these two new log-scale variables. Put a title on each histogram, label the horizontal axis, and make the bars orange. How would you describe the shape of these transformed distributions (use words like ‘symmetric’ or ‘skewed’, or perhaps the name of some distribution that has a similar shape . . .)?

hist(recent_noRe$log10favCnt,
    col = "orange",
    xlab = "Number of favorites (log_10)",
    main = "Histogram of Favorited Tweets by Obama (log scale)")

hist(recent_noRe$log10reCnt,
    col = "orange",
    xlab = "Number of retweets (log_10)",
    main = "Histogram of Retweeted Tweets by Obama (log scale)")

These new histograms are still unimodal but are much closer to a symmetric, approximately normal distribution.
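The effect of the log transformation on skew can be illustrated with a few numbers. This is a sketch on made-up counts chosen to span several orders of magnitude, not the tweet data itself; the mean-to-median ratio is used here only as a rough, informal indicator of skew.

```r
# Hypothetical counts spanning several orders of magnitude
counts <- c(900, 7500, 19000, 44000, 1500000)

# On the raw scale the largest value pulls the mean far above the median
raw_ratio <- mean(counts) / median(counts)

# On the log10 scale the mean and median are nearly equal
log_counts <- log10(counts)
log_ratio <- mean(log_counts) / median(log_counts)

# log10(0) is -Inf, so a count of zero has no finite value on the log scale
log10(0)
```

The last line shows why zero counts matter here: a favorite_count of 0 would become -Inf after the transformation.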

Make a plot of the number of times that each tweet was favorited vs. the number of times a tweet was retweeted. Put favorite_count on the x-axis and retweet_count on the y-axis. Label your axes, put on a main title, and make the plot characters blue.

# favorite_count is on the x-axis and retweet_count on the y-axis
plot(recent_noRe$favorite_count, recent_noRe$retweet_count,
     main = "Favorite vs retweet counts of tweets by Obama",
     xlab = "number of times favorited",
     ylab = "number of times retweeted",
     col = "blue")

Repeat part (h) but use the log-transformed variables. Label your axes, put on a main title, and make the plot characters red. How does the scatterplot on the log-scale compare to the scatterplot on the raw scale? Which one do you prefer?

# log10 favorites on the x-axis, log10 retweets on the y-axis
plot(recent_noRe$log10favCnt, recent_noRe$log10reCnt,
     main = "Favorite vs retweet counts of tweets by Obama (log scale)",
     xlab = "number of times favorited (log_10)",
     ylab = "number of times retweeted (log_10)",
     col = "red")

The log-scale plot spreads the data across the plotting area instead of bunching it near the origin, which makes it easier to analyze. I prefer the log-scale plot because it shows a much clearer linear trend between the number of times favorited and the number of times retweeted.

Create two new variables on the recent_NoRe dataframe called year and month that will contain respectively the year and month the tweet was created. You’ll need to look up how to use two functions : as.Date and substr(). You’ll also need to use the as.numeric() function to make sure that both new variables are numbers. Show the first 20 observations for each resulting variable.

recent_noRe$year <- as.numeric(substr(as.Date(recent_noRe$date2, format = "%m/%d/%Y"), 1, 4))
recent_noRe$year[c(1:20)]

##  [1] 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021
## [16] 2021 2021 2021 2021 2021

recent_noRe$month <- as.numeric(substr(as.Date(recent_noRe$date2, format = "%m/%d/%Y"), 6, 7))
recent_noRe$month[c(1:20)]

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
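An alternative to substr() is to ask format() for the date component directly, which avoids counting character positions. This is a sketch only, since the assignment asks for substr(); d2 is a hypothetical stand-in for the date2 column, with values copied from the output above.

```r
# Hypothetical stand-in for the date2 column
d2 <- c("1/27/2021", "11/3/2016")
dates <- as.Date(d2, format = "%m/%d/%Y")

# format() returns the requested date component as text;
# as.numeric() then converts it to a number
year  <- as.numeric(format(dates, "%Y"))
month <- as.numeric(format(dates, "%m"))
year
month
```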

Repeat part (i) BUT only for 2019 and 2020. First, create a dataframe called recent_3 that only has observations from the specified years. You might want to use the %in% operator on your newly created variable year. Use this new dataframe to make your plot. Use the graphics option pch = 19 to get solid round points, and use the graphics option col = as.factor(year) to make different colors for 2019 and 2020. The final line of code below will add a legend to the top left of the plot.

# year is numeric, so compare against numeric values
recent_3 <- subset(recent_noRe, year %in% c(2019, 2020))
plot(recent_3$log10favCnt, recent_3$log10reCnt,
     main = "Favorite vs retweet counts of tweets by Obama in 2019 and 2020 (log scale)",
     xlab = "number of times favorited (log_10)",
     ylab = "number of times retweeted (log_10)",
     pch = 19,
     col = as.factor(recent_3$year))
legend("topleft", legend = c("2019","2020"), col = c(1,2), pch = 19)
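The legend colors c(1, 2) line up with the points because of how as.factor() orders its levels. This is a small sketch of that mapping and of %in%, using made-up year values; yr is a hypothetical stand-in for recent_3$year.

```r
# Hypothetical year values standing in for recent_3$year
yr <- c(2020, 2019, 2020, 2019, 2019)

# %in% tests membership element by element
keep <- c(2018, 2019, 2020, 2021) %in% c(2019, 2020)
keep

# as.factor() sorts the levels, so 2019 is level 1 and 2020 is level 2;
# when passed to col, those integer codes act as color numbers
yr_factor <- as.factor(yr)
levels(yr_factor)
as.integer(yr_factor)
```

Color numbers 1 and 2 refer to the first two entries of the current palette, so the legend and the points use the same pair of colors.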

Write no more than three sentences that describe what you see. Does the pattern appear any different between 2019 and 2020?

The 2020 data is more tightly clustered than the 2019 data. Both years show a linear relationship, but it is stronger in 2019. The slopes of the two trends are approximately the same, with 2019 possibly a bit steeper.

Due by 11:59pm, Friday, February 11, 2022

S&DS 230/S&DS 530/ENV 757/PLSC 530