TidyVerse CREATE assignment

Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)

Later, you’ll be asked to extend an existing vignette. Using one of your classmates’ examples (as created above), you’ll then extend his or her example with additional annotated code. (15 points)

You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. You should also update the README.md file with your example.

After you’ve created your vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded.

You should complete your submission on the schedule stated in the course syllabus.

1. Load data

We will use the data behind the story “Is The Russia Investigation Really Another Watergate?” to show some features of the tidyverse packages.
The file is read directly from a GitHub repository with the base R function read.csv(). russia-investigation.csv lists every special investigation since the Watergate probe began in 1973 and the people who were charged in them. The data set contains 194 observations of 13 variables.
The dataset source: https://github.com/fivethirtyeight/data/tree/master/russia-investigation

library(tidyverse)  # loads dplyr, stringr, forcats and ggplot2, which are used below

data <- read.csv('https://raw.githubusercontent.com/ex-pr/DATA607/tidyverse_create/russia-investigation.csv', header=TRUE, sep=",", check.names=FALSE)

summary(data)

##  investigation      investigation-start investigation-end  investigation-days
##  Length:194         Length:194          Length:194         Min.   : 171      
##  Class :character   Class :character    Class :character   1st Qu.:1101      
##  Mode  :character   Mode  :character    Mode  :character   Median :1492      
##                                                            Mean   :1734      
##                                                            3rd Qu.:2419      
##                                                            Max.   :3387      
##                                                                              
##      name           indictment-days      type             cp-date         
##  Length:194         Min.   :-316.0   Length:194         Length:194        
##  Class :character   1st Qu.: 275.0   Class :character   Class :character  
##  Mode  :character   Median : 422.0   Mode  :character   Mode  :character  
##                     Mean   : 507.2                                        
##                     3rd Qu.: 670.0                                        
##                     Max.   :2006.0                                        
##                     NA's   :13                                            
##     cp-days       overturned       pardoned        american      
##  Min.   :-136.0   Mode :logical   Mode :logical   Mode :logical  
##  1st Qu.: 275.0   FALSE:185       FALSE:168       FALSE:27       
##  Median : 545.0   TRUE :9         TRUE :26        TRUE :167      
##  Mean   : 637.4                                                  
##  3rd Qu.: 990.0                                                  
##  Max.   :2183.0                                                  
##  NA's   :71                                                      
##   president        
##  Length:194        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

head(data, n=3)

##   investigation investigation-start investigation-end investigation-days
## 1     watergate          1973-05-19        1977-06-19               1492
## 2     watergate          1973-05-19        1977-06-19               1492
## 3     watergate          1973-05-19        1977-06-19               1492
##                name indictment-days       type    cp-date cp-days overturned
## 1   James W. McCord            -246 conviction 1973-01-30    -109      FALSE
## 2 Bernard L. Barker            -246 conviction 1973-01-15    -124      FALSE
## 3 Bernard L. Barker             292 conviction 1974-07-12     419       TRUE
##   pardoned american     president
## 1    FALSE     TRUE Richard Nixon
## 2    FALSE     TRUE Richard Nixon
## 3    FALSE     TRUE Richard Nixon
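
The tidyverse also has its own reader, read_csv() from the readr package. The following is a minimal sketch rather than part of the original analysis: it parses a small hand-typed excerpt (four columns, with values copied from the outputs in this vignette) instead of downloading the file, and the names csv_text and sample_data are made up for illustration.

```r
library(readr)

# Hand-typed excerpt for illustration only. I() marks the string as literal
# data rather than a file path (assumes readr 2.0 or later).
csv_text <- I("investigation,investigation-days,name,pardoned
watergate,1492,James W. McCord,FALSE
watergate,1492,Bernard L. Barker,FALSE
whitewater,2770,Susan McDougal,TRUE")

# read_csv() returns a tibble and guesses a type for each column.
sample_data <- read_csv(csv_text, show_col_types = FALSE)

# Like read.csv(check.names = FALSE), it keeps the hyphenated column name,
# so the column has to be quoted with backticks.
sample_data$`investigation-days`
```

Unlike read.csv(), read_csv() never converts strings to factors and reports the column types it guessed unless show_col_types = FALSE is set.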

2. dplyr package

dplyr is one of the most useful packages in the tidyverse. It is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. In this vignette we will consider the main functions of dplyr: rename(), filter(), select(), distinct(), mutate(), group_by(), arrange(), summarise(). dplyr also re-exports the pipe operator %>% from the magrittr package, so the result of one step is “piped” into the next: df %>% f(y) is equivalent to f(df, y).
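
As a small illustration of the pipe, the sketch below compares a piped call with the equivalent nested call. The tibble df is hypothetical toy data, not the investigation file.

```r
library(dplyr)

# Hypothetical toy data.
df <- tibble(x = c(3, 1, 2))

# Piped form: df becomes the first argument of arrange().
piped <- df %>% arrange(desc(x))

# Equivalent call without the pipe.
nested <- arrange(df, desc(x))

identical(piped, nested)
```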

2.1 rename()

rename() changes the names of individual variables using new_name = old_name syntax. We will rename some columns to make them more readable.

investigation <- data %>% 
  rename(id = investigation,
         start_date = `investigation-start`,
         end_date = `investigation-end`,
         convict_date = `cp-date`)

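When many columns need the same kind of change, rename_with() applies a function to the column names instead of listing each one. A minimal sketch, assuming a hypothetical one-row tibble toy that only mimics the hyphenated names of the real data:

```r
library(dplyr)
library(stringr)

# Hypothetical one-row frame that mimics the hyphenated column names.
toy <- tibble(`investigation-start` = "1973-05-19",
              `cp-date` = "1973-01-30",
              `cp-days` = -109)

# rename() changes the chosen column only: new_name = old_name.
renamed <- toy %>% rename(start_date = `investigation-start`)

# rename_with() applies a function to every column name;
# here each hyphen is replaced with an underscore.
snake <- toy %>% rename_with(~ str_replace_all(.x, "-", "_"))

names(renamed)
names(snake)
```
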
2.2 filter()

filter() selects a subset of rows in a data frame based on one or more conditions. The first argument is the tibble/data frame; the further arguments are logical expressions that refer to variables within that data frame, and conditions separated by commas are combined with AND.
We can list the people charged in investigations under Bill Clinton who were pardoned.

investigation %>% filter(president == "Bill Clinton", pardoned)

##            id start_date   end_date investigation-days
## 1  whitewater 1994-08-05 2002-03-06               2770
## 2  whitewater 1994-08-05 2002-03-06               2770
## 3  whitewater 1994-08-05 2002-03-06               2770
## 4  whitewater 1994-08-05 2002-03-06               2770
## 5        espy 1994-09-09 2001-01-30               2335
## 6        espy 1994-09-09 2001-01-30               2335
## 7        espy 1994-09-09 2001-01-30               2335
## 8        espy 1994-09-09 2001-01-30               2335
## 9        espy 1994-09-09 2001-01-30               2335
## 10       espy 1994-09-09 2001-01-30               2335
## 11       espy 1994-09-09 2001-01-30               2335
## 12   cisneros 1995-05-24 2004-08-31               3387
## 13   cisneros 1995-05-24 2004-08-31               3387
##                         name indictment-days        type convict_date cp-days
## 1         Webster L. Hubbell             123 guilty-plea   1994-12-06     123
## 2           Christopher Wade             228 guilty-plea   1995-03-21     228
## 3           Stephen A. Smith             307 guilty-plea   1995-06-08     307
## 4             Susan McDougal             377  conviction   1996-05-26     660
## 5                 James Lake             409 guilty-plea   1995-10-23     409
## 6  Brooke Keith Mitchell, Sr             622 guilty-plea   1996-11-13     796
## 7        Alverez Ferrouillet             670  conviction   1996-12-20     833
## 8            John Hemmingson             698  conviction   1996-12-19     832
## 9           Jack L. Williams             740  conviction   1998-06-26    1386
## 10           Richard Douglas             769 guilty-plea   1998-03-16    1284
## 11        Archibald Schaffer            1225  conviction   1998-06-26    1386
## 12      Linda (Medlar) Jones             843 guilty-plea   1998-01-15     967
## 13            Henry Cisneros             932 guilty-plea   1999-09-07    1567
##    overturned pardoned american    president
## 1       FALSE     TRUE     TRUE Bill Clinton
## 2       FALSE     TRUE     TRUE Bill Clinton
## 3       FALSE     TRUE     TRUE Bill Clinton
## 4       FALSE     TRUE     TRUE Bill Clinton
## 5       FALSE     TRUE     TRUE Bill Clinton
## 6       FALSE     TRUE     TRUE Bill Clinton
## 7       FALSE     TRUE     TRUE Bill Clinton
## 8       FALSE     TRUE     TRUE Bill Clinton
## 9       FALSE     TRUE     TRUE Bill Clinton
## 10      FALSE     TRUE     TRUE Bill Clinton
## 11      FALSE     TRUE     TRUE Bill Clinton
## 12      FALSE     TRUE     TRUE Bill Clinton
## 13      FALSE     TRUE     TRUE Bill Clinton
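
filter() conditions can also be negated, combined with %in%, or used to drop missing values. The sketch below uses a hypothetical four-row tibble toy with placeholder names, so the results are easy to check by eye.

```r
library(dplyr)

# Hypothetical toy data with placeholder names.
toy <- tibble(
  name      = c("A", "B", "C", "D"),
  president = c("Bill Clinton", "Bill Clinton", "Richard Nixon", "Ronald Reagan"),
  pardoned  = c(TRUE, FALSE, FALSE, TRUE),
  cp_days   = c(123, NA, 419, 660)
)

# Conditions separated by commas are combined with AND; ! negates a logical column.
clinton_not_pardoned <- toy %>% filter(president == "Bill Clinton", !pardoned)

# %in% keeps rows whose value matches any element of the vector.
nixon_or_reagan <- toy %>% filter(president %in% c("Richard Nixon", "Ronald Reagan"))

# is.na() helps to drop rows with a missing value.
has_cp_days <- toy %>% filter(!is.na(cp_days))
```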

2.3 select(), distinct()

select() controls which columns are kept, which is useful when we need only a few columns instead of the entire data frame. The first argument is the data frame/tibble; the further arguments are one or more unquoted expressions separated by commas.
distinct() keeps only the unique rows of a data frame. The first argument is a data frame/tibble; the further arguments are optional variables to use when determining uniqueness.
We will select the column name, which holds the names of all the people charged. There are only 178 unique names among the 194 rows, so some people were charged more than once.

names <- investigation %>% select(name)  %>% distinct()
summary(names)

##      name          
##  Length:178        
##  Class :character  
##  Mode  :character

head(names)

##                   name
## 1      James W. McCord
## 2    Bernard L. Barker
## 3  Eugenio R. Martinez
## 4     Frank A. Sturgis
## 5 Virgilio R. Gonzalez
## 6      G. Gordon Liddy
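
A few variations on select() and distinct() are sketched below on a hypothetical four-row tibble toy (its values are copied from rows shown in this vignette). The last step is one possible way to list the people who appear more than once.

```r
library(dplyr)

# Hypothetical excerpt; Bernard L. Barker appears twice.
toy <- tibble(
  id      = c("watergate", "watergate", "watergate", "espy"),
  name    = c("Bernard L. Barker", "Bernard L. Barker", "James W. McCord", "James Lake"),
  cp_days = c(-124, 419, -109, 409)
)

# Selection helpers pick columns by a pattern in their names.
cp_columns <- toy %>% select(starts_with("cp"))

# distinct() on one column returns just that column, de-duplicated.
unique_names <- toy %>% distinct(name)

# .keep_all = TRUE keeps the other columns of the first row for each name.
first_rows <- toy %>% distinct(name, .keep_all = TRUE)

# count() plus filter() shows which names are repeated.
repeated <- toy %>% count(name) %>% filter(n > 1)
```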

2.4 mutate(), stringr library

mutate() changes the values of existing columns and creates new ones. Its arguments are the data frame/tibble, name-value pairs (the name gives the name of the column in the output), .keep (controls which columns from .data are retained in the output), and .before/.after (optionally control where new columns should appear).
To apply the same transformation to several columns, wrap them in across() inside mutate(). across() supersedes the older mutate_at(), and the funs() helper that was used with it is deprecated.
In this example we also meet another tidyverse package, stringr, which provides a cohesive set of functions designed to make working with strings as easy as possible. Its function str_replace() replaces the first match of a pattern with new text. We will use mutate() to change the values in the columns overturned, pardoned, and american from TRUE/FALSE to Yes/No.

investigation <- investigation %>% 
  # the three columns are logical, so convert them to text before replacing
  mutate(across(c(overturned, pardoned, american), as.character)) %>% 
  mutate(across(c(overturned, pardoned, american), ~ str_replace(.x, "TRUE", "Yes"))) %>% 
  mutate(across(c(overturned, pardoned, american), ~ str_replace(.x, "FALSE", "No")))
head(investigation)

##          id start_date   end_date investigation-days                name
## 1 watergate 1973-05-19 1977-06-19               1492     James W. McCord
## 2 watergate 1973-05-19 1977-06-19               1492   Bernard L. Barker
## 3 watergate 1973-05-19 1977-06-19               1492   Bernard L. Barker
## 4 watergate 1973-05-19 1977-06-19               1492 Eugenio R. Martinez
## 5 watergate 1973-05-19 1977-06-19               1492 Eugenio R. Martinez
## 6 watergate 1973-05-19 1977-06-19               1492    Frank A. Sturgis
##   indictment-days        type convict_date cp-days overturned pardoned american
## 1            -246  conviction   1973-01-30    -109         No       No      Yes
## 2            -246  conviction   1973-01-15    -124         No       No      Yes
## 3             292  conviction   1974-07-12     419        Yes       No      Yes
## 4            -246 guilty-plea   1973-01-15    -124         No      Yes      Yes
## 5             292  conviction   1974-07-12     419        Yes       No      Yes
## 6            -246 guilty-plea   1973-01-15    -124         No       No      Yes
##       president
## 1 Richard Nixon
## 2 Richard Nixon
## 3 Richard Nixon
## 4 Richard Nixon
## 5 Richard Nixon
## 6 Richard Nixon
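
For logical columns, the same recoding can be done without string replacement. The sketch below is an alternative, not the approach used above: it assumes a hypothetical two-row tibble toy, recodes with if_else(), and adds a derived column (indictment_years is a made-up name) placed with .after.

```r
library(dplyr)

# Hypothetical toy data with two logical columns.
toy <- tibble(
  name            = c("A", "B"),
  indictment_days = c(-246, 292),
  overturned      = c(FALSE, TRUE),
  pardoned        = c(FALSE, FALSE)
)

recoded <- toy %>%
  # if_else() maps TRUE to "Yes" and FALSE to "No" in each selected column.
  mutate(across(c(overturned, pardoned), ~ if_else(.x, "Yes", "No"))) %>%
  # A new column computed from an existing one, placed right after it.
  mutate(indictment_years = round(indictment_days / 365.25, 2),
         .after = indictment_days)
```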

2.5 group_by(), arrange(), summarise()

group_by() takes an existing tbl and converts it into a grouped data frame/tibble where operations are performed “by group”. Its arguments are a data frame/tibble, the variables or computations to group by, .add (the default FALSE overrides existing groups; TRUE adds to them), and .drop (drop groups formed by factor levels that don’t appear in the data).
arrange() orders the rows by the values of selected columns. Its arguments are .data (data frame/tibble), variables or functions of variables (use desc() to sort a variable in descending order), and .by_group (TRUE sorts first by the grouping variables).
summarise() collapses each group into a summary: the result has one (or more) rows for each combination of grouping variables. Its arguments are .data, name-value pairs of summary functions (the name becomes the name of the variable in the result), and .groups (the grouping structure of the result). summarize() is a synonym.
To see which investigation produced the most charges, we first group by id, then count the rows for each id, and finally arrange the results in descending order. The largest number of charges came from the investigation called “watergate”.

investigation %>% 
  group_by(id) %>% 
  summarize(investigation_total=n()) %>% 
  arrange(desc(investigation_total))

## # A tibble: 24 x 2
##    id                    investigation_total
##    <chr>                               <int>
##  1 watergate                              72
##  2 russia                                 34
##  3 whitewater                             20
##  4 pierce                                 18
##  5 iran-contra                            14
##  6 espy                                   13
##  7 cisneros                                6
##  8 bruce-babbitt                           1
##  9 bush-clinton-passport                   1
## 10 deaver                                  1
## # ... with 14 more rows
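
summarise() can compute several statistics at once, and count() is a shortcut for the group-and-count pattern. A minimal sketch on a hypothetical five-row tibble toy (the column names charges and median_cp_days are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data with one missing value.
toy <- tibble(
  id      = c("watergate", "watergate", "espy", "espy", "espy"),
  cp_days = c(-109, 419, 409, 796, NA)
)

by_id <- toy %>%
  group_by(id) %>%
  summarise(charges = n(),                                  # rows per group
            median_cp_days = median(cp_days, na.rm = TRUE), # ignore the NA
            .groups = "drop") %>%                           # return an ungrouped tibble
  arrange(desc(charges))

# count() groups, counts and (with sort = TRUE) orders in one call.
shortcut <- toy %>% count(id, sort = TRUE, name = "charges")
```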

3. ggplot2, forcats

ggplot2 is a plotting package that provides helpful commands to create complex plots from data in a data frame. The graphics are built layer by layer by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.
We will plot a bar chart to find out which president had the largest number of rows in the data during his term (each row is one charge, so the bars count charges brought by special investigations rather than distinct investigations). It was Richard Nixon.
The arguments of ggplot() are the data frame/tibble and mapping (the default list of aesthetic mappings to use for the plot). We add the plot type, a bar chart, with geom_bar() and a blue fill (other geoms include geom_point(), geom_boxplot(), geom_line(), geom_col(), etc.). theme_light() gives the plot a white background instead of grey, theme() controls the appearance and position of text on the graph (here it centers the title and rotates the x-axis labels), and labs() sets the axis labels and the title of the plot.
Another tidyverse package is forcats, which provides a suite of tools that solve common problems with factors, including changing the order of levels or their values. Its function fct_infreq() reorders a factor by the frequency of its values, so the bars appear from most to least frequent.

ggplot(investigation, aes(x=fct_infreq(president))) + 
  geom_bar(fill='blue') + 
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 10, angle = 90))+
  labs(x = 'President', y = "# of investigations", title = "# of investigations for each president")
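
To see what fct_infreq() does to the level order, the sketch below compares it with the default alphabetical order on a hypothetical six-row data frame toy (shortened president names, made-up counts). It also shows one possible variation of the plot: horizontal bars, where fct_rev() puts the most frequent level at the top.

```r
library(ggplot2)
library(forcats)

# Hypothetical toy data: Nixon appears 3 times, Clinton 2, Reagan 1.
toy <- data.frame(
  president = c("Nixon", "Clinton", "Nixon", "Reagan", "Nixon", "Clinton")
)

# Default factor levels are alphabetical.
levels(factor(toy$president))

# fct_infreq() orders the levels from most to least frequent.
levels(fct_infreq(toy$president))

# Horizontal bars: reverse the order so the largest bar is drawn at the top.
p <- ggplot(toy, aes(x = fct_rev(fct_infreq(president)))) +
  geom_bar(fill = "blue") +
  coord_flip() +
  theme_light() +
  labs(x = "President", y = "Number of rows")
```

Horizontal bars avoid rotating the axis labels by 90 degrees, which can make long names easier to read.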

Using geom_point(), we can plot the length of each investigation to find out which one was the longest. It was cisneros, at more than 3,000 days.

ggplot(investigation, aes(x=id, y=`investigation-days`)) + 
  geom_point(color = "red") + 
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 90))+
  labs(x = 'Investigation name', y = "# of days", title = "Length, in days, of the investigation")
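
Because every row of an investigation repeats the same length, the points for one investigation are drawn on top of each other. One possible alternative, sketched here on a hypothetical six-row tibble toy (the lengths are copied from outputs in this vignette, and the column name investigation_days is made up), is to keep one row per investigation and draw sorted bars with geom_col() and fct_reorder().

```r
library(dplyr)
library(ggplot2)
library(forcats)

# Hypothetical excerpt: the length repeats on every row of an investigation.
toy <- tibble(
  id = c("watergate", "watergate", "whitewater", "espy", "cisneros", "cisneros"),
  investigation_days = c(1492, 1492, 2770, 2335, 3387, 3387)
)

# Keep one row per investigation.
durations <- toy %>% distinct(id, investigation_days)

# fct_reorder() sorts the investigations by length before plotting.
p <- ggplot(durations, aes(x = fct_reorder(id, investigation_days),
                           y = investigation_days)) +
  geom_col(fill = "red") +
  coord_flip() +
  theme_light() +
  labs(x = "Investigation name", y = "# of days",
       title = "Length, in days, of the investigation")

# The longest investigation can also be read off without a plot.
longest <- durations %>% slice_max(investigation_days, n = 1)
```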

Conclusion

The tidyverse is a valuable toolkit for transforming messy data sets into a format that is convenient for analysis. In this “vignette” we demonstrated how to use the dplyr, ggplot2, stringr, and forcats packages of the tidyverse. There is more to discover within each of these packages, and the tidyverse contains other packages that were not covered here, such as tidyr, tibble, and purrr.