## ── Attaching packages ─────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
glimpse(trump_tweets)
## Observations: 20,761
## Variables: 8
## $ source <chr> "Twitter Web Client", "Twitter Web Clien…
## $ id_str <chr> "6971079756", "6312794445", "6090839867"…
## $ text <chr> "From Donald Trump: Wishing everyone a w…
## $ created_at <dttm> 2009-12-23 12:38:18, 2009-12-03 14:39:0…
## $ retweet_count <int> 28, 33, 13, 5, 7, 4, 2, 4, 1, 22, 7, 5, …
## $ in_reply_to_user_id_str <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ favorite_count <int> 12, 6, 11, 3, 6, 5, 2, 10, 4, 30, 6, 3, …
## $ is_retweet <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
There are 20,761 observations and 8 variables describing tweets sent from Trump's account between May 4, 2009 and January 1, 2018.
What type of data is each variable?

| Variable | Type |
|---|---|
| source | character |
| id_str | character |
| text | character |
| created_at | date / time |
| retweet_count | integer |
| in_reply_to_user_id_str | character |
| favorite_count | integer |
| is_retweet | logical |

What is the total size of the data frame? 20,761 rows by 8 columns (166,088 values in total).
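The dimensions and column types can also be checked directly instead of being read off the glimpse() output. The following is a minimal sketch on a small made-up data frame; `toy_tweets` and its values are hypothetical stand-ins for trump_tweets, not part of the assignment data.

```r
# Hypothetical stand-in for trump_tweets (three made-up rows, three of the eight columns)
toy_tweets <- data.frame(
  text = c("a", "b", "c"),
  retweet_count = c(28L, 33L, 13L),
  is_retweet = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Number of rows and columns in one call
toy_dim <- dim(toy_tweets)

# Class of every column, returned as a named character vector
toy_types <- sapply(toy_tweets, class)

print(toy_dim)
print(toy_types)
```

On the real data, created_at carries two classes (POSIXct and POSIXt), so sapply() would return a list rather than a character vector there.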
What are the boundaries of each period of observation (e.g. what time period do the observations in these data frames span)?
summary(trump_tweets$created_at)
## Min. 1st Qu. Median
## "2009-05-04 13:54:25" "2013-02-13 14:15:54" "2014-05-03 07:07:38"
## Mean 3rd Qu. Max.
## "2014-08-02 16:29:32" "2016-02-13 20:17:51" "2018-01-01 08:37:52"
The observations span from 2009-05-04 13:54:25 (Min) to 2018-01-01 08:37:52 (Max).
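If only the two endpoints are needed, range() returns them without the quartiles. This is a rough sketch: `toy_created_at` is a hypothetical vector reusing three timestamps from the summary above, and the UTC time zone is an assumption, since the output does not show which zone the data use.

```r
# Hypothetical timestamps standing in for trump_tweets$created_at (time zone assumed)
toy_created_at <- as.POSIXct(
  c("2009-05-04 13:54:25", "2014-05-03 07:07:38", "2018-01-01 08:37:52"),
  tz = "UTC"
)

# Earliest and latest observation
span <- range(toy_created_at)

# Length of the observation period in days
elapsed <- difftime(span[2], span[1], units = "days")

print(span)
print(elapsed)
```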
Do any variables have missing values, and if so which ones have more NA values than others? Why might that be the case?
summary(trump_tweets)
## source id_str text
## Length:20761 Length:20761 Length:20761
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## created_at retweet_count in_reply_to_user_id_str
## Min. :2009-05-04 13:54:25 Min. : 0 Length:20761
## 1st Qu.:2013-02-13 14:15:54 1st Qu.: 34 Class :character
## Median :2014-05-03 07:07:38 Median : 264 Mode :character
## Mean :2014-08-02 16:29:32 Mean : 3854
## 3rd Qu.:2016-02-13 20:17:51 3rd Qu.: 3267
## Max. :2018-01-01 08:37:52 Max. :369530
## favorite_count is_retweet
## Min. : 0 Mode :logical
## 1st Qu.: 22 FALSE:20761
## Median : 173
## Mean : 13591
## 3rd Qu.: 8573
## Max. :633253
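Yes. in_reply_to_user_id_str has 18,319 missing values out of 20,761 (about 88%), counted with sum(is.na(trump_tweets$in_reply_to_user_id_str)). This is expected: the field only holds a user ID when the tweet is a reply, so every tweet that is not a reply has NA. The summary() output reports no NA's for created_at, retweet_count, favorite_count, or is_retweet. It does not report NA counts for character columns, so source, id_str, and text would need the same is.na() check.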
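Because summary() does not show NA counts for character columns, it can help to count missing values for every column at once. A minimal sketch, assuming a made-up data frame (`toy_tweets` is hypothetical, not the assignment data):

```r
# Hypothetical stand-in for trump_tweets with one mostly-missing character column
toy_tweets <- data.frame(
  text = c("a", "b", "c", "d"),
  in_reply_to_user_id_str = c(NA, "123", NA, NA),
  retweet_count = c(28L, 33L, 13L, 5L),
  stringsAsFactors = FALSE
)

# is.na() on a data frame gives a logical matrix; column sums are NA counts
na_counts <- colSums(is.na(toy_tweets))

# Column means of the same matrix give the share of missing values
na_share <- colMeans(is.na(toy_tweets))

print(na_counts)
print(na_share)
```

Running the same two calls on trump_tweets would cover all eight variables in one pass.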
These data describe tweets sent from Trump's account from May 2009 through the start of January 2018. For each tweet, the data record the text, the device or service used to post it, the time it was posted, whether it was a retweet or a reply, and how many times it had been retweeted and favorited when the dataset was created.
All tweets from Donald Trump’s Twitter account from 2009 to 2017. Source: The Trump Twitter Archive, http://www.trumptwitterarchive.com
Summary table:
| Variable | Description |
|---|---|
| source | Device or service used to compose tweet. |
| id_str | Tweet ID. |
| text | Text of the tweet. |
| created_at | Date and time the tweet was posted. |
| retweet_count | How many times tweet had been retweeted at time dataset was created. |
| in_reply_to_user_id_str | If a reply, the user id of person being replied to. |
| favorite_count | Number of times tweet had been favorited at time dataset was created. |
| is_retweet | A logical telling us if it is a retweet or not. |
# Scatterplot of favorites against retweets, one point per tweet
ggplot(data = trump_tweets, aes(x = retweet_count, y = favorite_count)) +
  geom_point()
The scatterplot shows retweet_count on the x-axis and favorite_count on the y-axis, that is, how many times each tweet had been retweeted and favorited when the dataset was created. Most points cluster tightly near 0 and become sparser as the number of retweets and favorites increases. The overall pattern is roughly linear, but the points are not spread evenly along it: they clump together near 0 and there are very few farther out to the right.
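One way to put a number on the "roughly linear" impression, and to deal with the clumping near 0, is to compare correlations on the raw and log-transformed counts. This is only a sketch on made-up values: `toy_retweets` and `toy_favorites` are hypothetical, and the choice of log1p() is an assumption rather than something the assignment asks for.

```r
# Hypothetical counts standing in for retweet_count and favorite_count
toy_retweets  <- c(0, 2, 10, 50, 300, 3000, 40000)
toy_favorites <- c(0, 5, 12, 170, 900, 8500, 120000)

# Pearson correlation on the raw counts (dominated by the few very large tweets)
r_raw <- cor(toy_retweets, toy_favorites)

# Spearman correlation uses ranks, so it ignores how far out the large values sit
r_rank <- cor(toy_retweets, toy_favorites, method = "spearman")

# log1p(x) = log(1 + x) keeps zero counts finite while spreading out small values
log_retweets  <- log1p(toy_retweets)
log_favorites <- log1p(toy_favorites)
r_log <- cor(log_retweets, log_favorites)

print(c(raw = r_raw, rank = r_rank, log = r_log))
```

On the real data the same transform could be applied inside the plot, for example aes(x = log1p(retweet_count), y = log1p(favorite_count)), so that the dense cluster near 0 is spread out.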
# Bin retweet_count into nine ordered categories, each 1,000 retweets wide
# (retweet_count has no NA values, so the TRUE fallback only catches 8,000 or more)
trump_tweets <- trump_tweets %>%
  mutate(count_type = case_when(
    retweet_count < 1000 ~ "1 Lowest",
    retweet_count < 2000 ~ "2 Lower",
    retweet_count < 3000 ~ "3 Low",
    retweet_count < 4000 ~ "4 Low Medium",
    retweet_count < 5000 ~ "5 Medium",
    retweet_count < 6000 ~ "6 High Medium",
    retweet_count < 7000 ~ "7 High",
    retweet_count < 8000 ~ "8 Higher",
    TRUE ~ "9 Highest"
  ))
# Bar chart of the number of tweets in each category
ggplot(data = trump_tweets) +
  geom_bar(aes(x = count_type, fill = count_type))
The bar chart shows how many tweets fall into each of nine retweet_count categories. Each category is 1,000 retweets wide (1 Lowest is fewer than 1,000, 2 Lower is 1,000 to 1,999, and so on), except 9 Highest, which is open-ended and holds every tweet with 8,000 or more retweets. I grouped the counts this way because I could not get a usable histogram from the raw values. The tallest bar is category 1, which agrees with the median of 264 retweets in the summary above: more than half of all tweets have fewer than 1,000 retweets. The second tallest bar is category 9, likely because it collects the whole upper tail up to the maximum of 369,530 retweets. The shortest bar is category 8 and the second shortest is category 7.
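The binning step above can also be sketched with base R's cut(), which takes the bin edges and labels as vectors instead of a chain of comparisons. The values in `toy_retweets` are made up for illustration; the edges and labels mirror the nine categories used in the plot.

```r
# Hypothetical retweet counts standing in for trump_tweets$retweet_count
toy_retweets <- c(0, 28, 999, 1000, 2500, 7999, 8000, 369530)

# Bin edges: 0, 1000, ..., 8000, then an open-ended top bin
edges <- c(seq(0, 8000, by = 1000), Inf)
bin_labels <- c("1 Lowest", "2 Lower", "3 Low", "4 Low Medium", "5 Medium",
                "6 High Medium", "7 High", "8 Higher", "9 Highest")

# right = FALSE makes each bin [lower, upper), matching the `<` comparisons
toy_bins <- cut(toy_retweets, breaks = edges, labels = bin_labels, right = FALSE)

# Number of tweets per category, including empty categories
bin_counts <- table(toy_bins)
print(bin_counts)
```

cut() returns a factor whose levels follow the order of the labels, so a bar chart built from it keeps the categories in order. One difference to be aware of: a value below the lowest edge would become NA here, which does not matter for counts that cannot be negative.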