## ── Attaching packages ─────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Data description

glimpse(trump_tweets)

## Observations: 20,761
## Variables: 8
## $ source                  <chr> "Twitter Web Client", "Twitter Web Clien…
## $ id_str                  <chr> "6971079756", "6312794445", "6090839867"…
## $ text                    <chr> "From Donald Trump: Wishing everyone a w…
## $ created_at              <dttm> 2009-12-23 12:38:18, 2009-12-03 14:39:0…
## $ retweet_count           <int> 28, 33, 13, 5, 7, 4, 2, 4, 1, 22, 7, 5, …
## $ in_reply_to_user_id_str <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ favorite_count          <int> 12, 6, 11, 3, 6, 5, 2, 10, 4, 30, 6, 3, …
## $ is_retweet              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…

There are 20,761 observations and 8 variables that describe different facets of the tweets Trump sent between May 2009 and January 1, 2018.

What type of data is each variable?

source - character
id_str - character
text - character
created_at - date / time
retweet_count - integer
in_reply_to_user_id_str - character
favorite_count - integer
is_retweet - logical

What is the total size of the data frame? 20,761 rows by 8 columns.
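The column types and dimensions listed above can also be read programmatically rather than from the glimpse() printout. The sketch below uses a small stand-in data frame, `toy_tweets`, whose name and values are hypothetical and only mimic some of the column types in trump_tweets.

```r
# Hypothetical stand-in for trump_tweets with a few of the same column types
toy_tweets <- data.frame(
  source = c("Twitter Web Client", "Twitter for Android"),
  id_str = c("1", "2"),
  created_at = as.POSIXct(c("2009-12-23 12:38:18", "2016-02-13 20:17:51"),
                          tz = "UTC"),
  retweet_count = c(28L, 3267L),
  is_retweet = c(FALSE, FALSE),
  stringsAsFactors = FALSE
)

# First class of each column (date-time columns carry two classes)
col_types <- sapply(toy_tweets, function(col) class(col)[1])

# Rows and columns, in that order
size <- dim(toy_tweets)

print(col_types)
print(size)
```

Run against the real data, the same two calls would be `sapply(trump_tweets, function(col) class(col)[1])` and `dim(trump_tweets)`.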

What are the boundaries of each period of observation (e.g. what time period do the observations in these data frames span)?

summary(trump_tweets$created_at)

##                  Min.               1st Qu.                Median 
## "2009-05-04 13:54:25" "2013-02-13 14:15:54" "2014-05-03 07:07:38" 
##                  Mean               3rd Qu.                  Max. 
## "2014-08-02 16:29:32" "2016-02-13 20:17:51" "2018-01-01 08:37:52"

The observations span from 2009-05-04 (Min) to 2018-01-01 (Max).
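If only the boundaries are needed, range() returns them directly without the quartiles. This is a minimal sketch on a hypothetical three-element vector built from the Min, Median, and Max timestamps in the summary output; the UTC time zone is an assumption.

```r
# Three timestamps copied from the summary output (time zone assumed to be UTC)
created_at <- as.POSIXct(
  c("2009-05-04 13:54:25", "2014-05-03 07:07:38", "2018-01-01 08:37:52"),
  tz = "UTC"
)

# Earliest and latest observation
span <- range(created_at)

# Print the dates in an unambiguous year-month-day format
print(format(span, "%Y-%m-%d"))
```

On the full dataset the equivalent call would be `range(trump_tweets$created_at)`.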

Do any variables have missing values, and if so which ones have more NA values than others? Why might that be the case?

summary(trump_tweets)

##     source             id_str              text          
##  Length:20761       Length:20761       Length:20761      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##    created_at                  retweet_count    in_reply_to_user_id_str
##  Min.   :2009-05-04 13:54:25   Min.   :     0   Length:20761           
##  1st Qu.:2013-02-13 14:15:54   1st Qu.:    34   Class :character       
##  Median :2014-05-03 07:07:38   Median :   264   Mode  :character       
##  Mean   :2014-08-02 16:29:32   Mean   :  3854                          
##  3rd Qu.:2016-02-13 20:17:51   3rd Qu.:  3267                          
##  Max.   :2018-01-01 08:37:52   Max.   :369530                          
##  favorite_count   is_retweet     
##  Min.   :     0   Mode :logical  
##  1st Qu.:    22   FALSE:20761    
##  Median :   173                  
##  Mean   : 13591                  
##  3rd Qu.:  8573                  
##  Max.   :633253

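There are missing values in in_reply_to_user_id_str. Because summary() does not report NA counts for character columns, the number of NAs was found with sum(is.na(trump_tweets$in_reply_to_user_id_str)), which returns 18319 missing values. This is expected: the field is only filled in when a tweet is a reply, so every tweet that is not a reply has NA here.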
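To compare missing values across all variables at once, instead of checking one column at a time, the NA count can be computed per column. The sketch below runs on a hypothetical four-row data frame, `toy_tweets`, not on the real data.

```r
# Hypothetical stand-in: only one of four tweets is a reply
toy_tweets <- data.frame(
  text = c("a", "b", "c", "d"),
  in_reply_to_user_id_str = c(NA, "123", NA, NA),
  retweet_count = c(1L, 2L, 3L, 4L),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_counts <- colSums(is.na(toy_tweets))

# Share of NA values in each column
na_share <- colMeans(is.na(toy_tweets))

print(na_counts)
print(na_share)
```

Applied to the real data, `colSums(is.na(trump_tweets))` would show which variables have missing values and how they compare.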

Document and Summarize your Dataset

A brief overview of the subject matter of the dataset. What is this dataset about?

The dataset describes tweets that Trump sent from 2009 through the start of 2018. It records the text of each tweet, the device or service used to post it, when it was posted, how many times it was retweeted, and how many times it was favorited.

A description of the data source. Read the package help files to identify as much information as possible about how the data were collected (when, where, by whom, etc.) and whether the data have already been pre-processed or not. If the data come from a published paper, include a citation to that paper.

All tweets from Donald Trump’s Twitter account from 2009 to 2017. Source: The Trump Twitter Archive: http://www.trumptwitterarchive.com

A summary table of each variable in the dataset. Each row should be a variable, and the columns should be the variable name and a short description of the variable (the “hw6.Rmd” template already has this table started for you).

Summary table template:

Variable	Description
source	Device or service used to compose tweet.
id_str	Tweet ID.
text	Tweet.
created_at	Date and time the tweet was posted.
retweet_count	How many times tweet had been retweeted at time dataset was created.
in_reply_to_user_id_str	If a reply, the user id of person being replied to.
favorite_count	Number of times tweet had been favorited at time dataset was created.
is_retweet	A logical telling us if it is a retweet or not.

Data visualizations

# First plot: scatterplot of favorite_count against retweet_count

ggplot(data = trump_tweets, aes(x = retweet_count, y = favorite_count)) + 
    geom_point()

The scatterplot plots retweet_count on the x-axis and favorite_count on the y-axis, that is, the number of times each tweet had been retweeted against the number of times it had been favorited when the dataset was created. The points cluster tightly near 0 and become sparser as the retweet and favorite counts increase. The pattern is roughly linear, but the points are not evenly distributed along the line: most clump near 0, and there are very few farther out to the right.
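One common way to spread out points that clump near 0 is to transform both counts with log1p(), which computes log(1 + x) and therefore keeps tweets with zero retweets or favorites. The sketch below is an optional follow-up, not part of the original analysis, and the six value pairs are hypothetical.

```r
# Hypothetical retweet/favorite pairs spanning several orders of magnitude
retweets  <- c(0, 5, 28, 264, 3267, 369530)
favorites <- c(0, 3, 12, 173, 8573, 633253)

# log(1 + x) compresses the large values and leaves 0 at 0
log_retweets  <- log1p(retweets)
log_favorites <- log1p(favorites)

# Pearson correlation before and after the transform
raw_cor <- cor(retweets, favorites)
log_cor <- cor(log_retweets, log_favorites)

print(round(log_retweets, 2))
print(c(raw = raw_cor, log = log_cor))
```

In the ggplot2 call, the same idea could be applied by mapping `aes(x = log1p(retweet_count), y = log1p(favorite_count))`.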

# Second plot: bin retweet_count into groups of 1,000 retweets

trump_tweets <- trump_tweets %>%
    mutate(count_type = case_when(
        retweet_count < 1000 ~ '1 Lowest',
        retweet_count < 2000 ~ '2 Lower',
        retweet_count < 3000 ~ '3 Low',
        retweet_count < 4000 ~ '4 Low Medium',
        retweet_count < 5000 ~ '5 Medium',
        retweet_count < 6000 ~ '6 High Medium',
        retweet_count < 7000 ~ '7 High',
        retweet_count < 8000 ~ '8 Higher',
        retweet_count >= 8000 ~ '9 Highest'
    ))

ggplot(data = trump_tweets) +
    geom_bar(aes(x = count_type, fill = count_type))

The bar chart shows how many tweets fall into each retweet-count group. I grouped retweet_count into bins of 1,000 retweets, with everything at 8,000 or above in a single top bin, because a plot of the raw counts would not generate with so many distinct values. The tallest bar is bin 1 (fewer than 1,000 retweets), and the second tallest is bin 9 (8,000 or more retweets). The shortest bar is bin 8 (7,000 to 7,999 retweets), and the second shortest is bin 7 (6,000 to 6,999 retweets).
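The same binning can be sketched in base R with cut(), and table() then gives the bar heights as numbers. The retweet counts below are hypothetical values chosen to sit on the bin edges; the labels match the ones used for the plot.

```r
# Hypothetical retweet counts, including values on the bin boundaries
retweet_count <- c(0, 999, 1000, 2500, 7999, 8000, 369530)

bin_labels <- c("1 Lowest", "2 Lower", "3 Low", "4 Low Medium", "5 Medium",
                "6 High Medium", "7 High", "8 Higher", "9 Highest")

# Breaks at 0, 1000, ..., 8000, Inf; right = FALSE makes each bin [lower, upper)
count_type <- cut(
  retweet_count,
  breaks = c(seq(0, 8000, by = 1000), Inf),
  labels = bin_labels,
  right = FALSE
)

# Number of tweets per bin (empty bins are kept with a count of 0)
bin_counts <- table(count_type)
print(bin_counts)
```

Using `right = FALSE` matters here: it puts a tweet with exactly 1,000 retweets in '2 Lower', which matches the strict `<` comparisons in the plotting code.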

HW6: Summary of the trump_tweets dataset from the dslabs package

Alexa Solomon

11/13/2019
