Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)
Later, you’ll be asked to extend an existing vignette. Using one of your classmate’s examples (as created above), you’ll then extend his or her example with additional annotated code. (15 points)
You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. You should also update the README.md file with your example.
After you’ve created your vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded.
You should complete your submission on the schedule stated in the course syllabus.
We will use the data behind the story “Is The Russia Investigation Really Another Watergate?” to show the features of the tidyverse package.
The file is read from a GitHub repository into a data frame with the base R read.csv() function. russia-investigation.csv lists every special investigation since the Watergate probe began in 1973 and the people charged in each. The data set contains 194 observations of 13 variables.
The dataset source: https://github.com/fivethirtyeight/data/tree/master/russia-investigation
# Load the tidyverse (dplyr, stringr, forcats and ggplot2 are all used below)
library(tidyverse)
data <- read.csv('https://raw.githubusercontent.com/ex-pr/DATA607/tidyverse_create/russia-investigation.csv', header=TRUE, sep=",", check.names=FALSE)
summary(data)
## investigation investigation-start investigation-end investigation-days
## Length:194 Length:194 Length:194 Min. : 171
## Class :character Class :character Class :character 1st Qu.:1101
## Mode :character Mode :character Mode :character Median :1492
## Mean :1734
## 3rd Qu.:2419
## Max. :3387
##
## name indictment-days type cp-date
## Length:194 Min. :-316.0 Length:194 Length:194
## Class :character 1st Qu.: 275.0 Class :character Class :character
## Mode :character Median : 422.0 Mode :character Mode :character
## Mean : 507.2
## 3rd Qu.: 670.0
## Max. :2006.0
## NA's :13
## cp-days overturned pardoned american
## Min. :-136.0 Mode :logical Mode :logical Mode :logical
## 1st Qu.: 275.0 FALSE:185 FALSE:168 FALSE:27
## Median : 545.0 TRUE :9 TRUE :26 TRUE :167
## Mean : 637.4
## 3rd Qu.: 990.0
## Max. :2183.0
## NA's :71
## president
## Length:194
## Class :character
## Mode :character
##
##
##
##
head(data, n=3)
## investigation investigation-start investigation-end investigation-days
## 1 watergate 1973-05-19 1977-06-19 1492
## 2 watergate 1973-05-19 1977-06-19 1492
## 3 watergate 1973-05-19 1977-06-19 1492
## name indictment-days type cp-date cp-days overturned
## 1 James W. McCord -246 conviction 1973-01-30 -109 FALSE
## 2 Bernard L. Barker -246 conviction 1973-01-15 -124 FALSE
## 3 Bernard L. Barker 292 conviction 1974-07-12 419 TRUE
## pardoned american president
## 1 FALSE TRUE Richard Nixon
## 2 FALSE TRUE Richard Nixon
## 3 FALSE TRUE Richard Nixon
The dplyr package is one of the most useful parts of the tidyverse. dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. In this vignette we cover its main functions: rename(), filter(), select(), distinct(), mutate(), group_by(), arrange(), summarise(). dplyr also re-exports the pipe operator %>% from magrittr, which passes the result of one step as the first argument of the next, so df %>% f(y) is equivalent to f(df, y).
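To make the pipe concrete, here is a minimal sketch that writes the same two steps as a nested call and as a piped chain. The data frame df and its columns x and g are invented for illustration and are not part of the investigation data.

```r
library(dplyr)

# Toy data frame invented for this example
df <- data.frame(x = c(3, 1, 2), g = c("a", "b", "a"))

# Nested form: has to be read from the inside out
nested <- arrange(filter(df, g == "a"), x)

# Piped form: reads top to bottom; each result becomes
# the first argument of the next verb
piped <- df %>%
  filter(g == "a") %>%
  arrange(x)

# Both forms produce the same data frame
identical(nested, piped)
```

The piped form scales better as more verbs are chained, which is why the rest of the vignette uses it.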
rename() changes the names of individual variables using new_name = old_name syntax. We will rename some columns to make them more readable.
investigation <- data %>%
rename("id"="investigation",
"start_date" = "investigation-start", "end_date" = "investigation-end", "convict_date" = "cp-date")
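rename() also accepts unquoted names. A hyphenated name such as cp-date is not a valid R identifier, so it must be wrapped in backticks when unquoted. The sketch below uses a small hypothetical data frame (toy) that mimics two columns of the real data.

```r
library(dplyr)

# Toy data frame; check.names = FALSE keeps the hyphenated column name
toy <- data.frame(`cp-date` = c("1973-01-30", "1973-01-15"),
                  investigation = c("watergate", "watergate"),
                  check.names = FALSE)

# new_name = old_name; backticks protect the non-syntactic old name
renamed <- toy %>%
  rename(convict_date = `cp-date`, id = investigation)

names(renamed)
```

Column order is unchanged by rename(); only the labels differ.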
filter() selects the subset of rows in a data frame that satisfy one or more conditions. The first argument is the tibble/data frame; the further arguments are conditions that refer to variables within that data frame, and a row is kept only if all of them are TRUE.
We can list the people charged in investigations under Bill Clinton who were pardoned.
investigation %>% filter(president == "Bill Clinton", pardoned)
## id start_date end_date investigation-days
## 1 whitewater 1994-08-05 2002-03-06 2770
## 2 whitewater 1994-08-05 2002-03-06 2770
## 3 whitewater 1994-08-05 2002-03-06 2770
## 4 whitewater 1994-08-05 2002-03-06 2770
## 5 espy 1994-09-09 2001-01-30 2335
## 6 espy 1994-09-09 2001-01-30 2335
## 7 espy 1994-09-09 2001-01-30 2335
## 8 espy 1994-09-09 2001-01-30 2335
## 9 espy 1994-09-09 2001-01-30 2335
## 10 espy 1994-09-09 2001-01-30 2335
## 11 espy 1994-09-09 2001-01-30 2335
## 12 cisneros 1995-05-24 2004-08-31 3387
## 13 cisneros 1995-05-24 2004-08-31 3387
## name indictment-days type convict_date cp-days
## 1 Webster L. Hubbell 123 guilty-plea 1994-12-06 123
## 2 Christopher Wade 228 guilty-plea 1995-03-21 228
## 3 Stephen A. Smith 307 guilty-plea 1995-06-08 307
## 4 Susan McDougal 377 conviction 1996-05-26 660
## 5 James Lake 409 guilty-plea 1995-10-23 409
## 6 Brooke Keith Mitchell, Sr 622 guilty-plea 1996-11-13 796
## 7 Alverez Ferrouillet 670 conviction 1996-12-20 833
## 8 John Hemmingson 698 conviction 1996-12-19 832
## 9 Jack L. Williams 740 conviction 1998-06-26 1386
## 10 Richard Douglas 769 guilty-plea 1998-03-16 1284
## 11 Archibald Schaffer 1225 conviction 1998-06-26 1386
## 12 Linda (Medlar) Jones 843 guilty-plea 1998-01-15 967
## 13 Henry Cisneros 932 guilty-plea 1999-09-07 1567
## overturned pardoned american president
## 1 FALSE TRUE TRUE Bill Clinton
## 2 FALSE TRUE TRUE Bill Clinton
## 3 FALSE TRUE TRUE Bill Clinton
## 4 FALSE TRUE TRUE Bill Clinton
## 5 FALSE TRUE TRUE Bill Clinton
## 6 FALSE TRUE TRUE Bill Clinton
## 7 FALSE TRUE TRUE Bill Clinton
## 8 FALSE TRUE TRUE Bill Clinton
## 9 FALSE TRUE TRUE Bill Clinton
## 10 FALSE TRUE TRUE Bill Clinton
## 11 FALSE TRUE TRUE Bill Clinton
## 12 FALSE TRUE TRUE Bill Clinton
## 13 FALSE TRUE TRUE Bill Clinton
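The same pattern can be inverted to find people who were not pardoned. This is a minimal sketch on a hypothetical data frame (toy) with invented names. It assumes the pardoned column is still logical, as it is at this point in the vignette.

```r
library(dplyr)

# Toy data frame invented for illustration
toy <- data.frame(
  name = c("A", "B", "C", "D"),
  president = c("Bill Clinton", "Bill Clinton", "Richard Nixon", "Bill Clinton"),
  pardoned = c(TRUE, FALSE, TRUE, NA)
)

# Comma-separated conditions are combined with AND;
# a logical column can be used as a condition directly
pardoned_clinton <- toy %>% filter(president == "Bill Clinton", pardoned)

# Negate the logical column with ! to get the complement
not_pardoned_clinton <- toy %>% filter(president == "Bill Clinton", !pardoned)
```

Note that filter() drops rows where a condition evaluates to NA, so person D appears in neither result.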
select() keeps only the columns we need instead of the entire data frame. The first argument is the data frame/tibble; the further arguments are one or more unquoted expressions separated by commas.
distinct() keeps only the unique rows of a data frame. The first argument is a data frame/tibble; the further arguments are optional variables to use when determining uniqueness.
We will select the column name, which holds the names of all the persons charged. There are only 178 unique names among the 194 rows, because some people were charged several times.
names <- investigation %>% select(name) %>% distinct()
summary(names)
## name
## Length:178
## Class :character
## Mode :character
head(names)
## name
## 1 James W. McCord
## 2 Bernard L. Barker
## 3 Eugenio R. Martinez
## 4 Frank A. Sturgis
## 5 Virgilio R. Gonzalez
## 6 G. Gordon Liddy
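To see which people account for the repeated names, count() can be combined with filter(). The sketch below runs on a small hypothetical data frame (toy); on the real data the same two lines would be applied to investigation.

```r
library(dplyr)

# Toy data frame: one person appears twice
toy <- data.frame(
  name = c("Bernard L. Barker", "James W. McCord", "Bernard L. Barker"),
  type = c("conviction", "conviction", "conviction")
)

# distinct(name) is equivalent to select(name) %>% distinct()
unique_names <- toy %>% distinct(name)

# count() adds a column n with the number of rows per name
repeated <- toy %>%
  count(name) %>%
  filter(n > 1)
```

Here repeated holds a single row, for the one name that occurs twice.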
mutate() changes the values of existing columns and creates new columns. Its arguments are the data frame/tibble, name-value pairs (the name gives the name of the column in the output), .keep (controls which columns from .data are retained in the output), and .before/.after (optionally control where new columns should appear).
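A short sketch of the placement and retention arguments, assuming dplyr 1.0.0 or later. The data frame toy and the column days are illustrative. The two date pairs are the Watergate and Whitewater start and end dates shown in the output above.

```r
library(dplyr)

# Toy data frame with two start/end date pairs
toy <- data.frame(
  start = as.Date(c("1973-05-19", "1994-08-05")),
  end   = as.Date(c("1977-06-19", "2002-03-06"))
)

# .before places the new column ahead of an existing one
with_days <- toy %>%
  mutate(days = as.numeric(difftime(end, start, units = "days")),
         .before = start)

# .keep = "none" retains only the columns created by mutate()
only_days <- toy %>%
  mutate(days = as.numeric(difftime(end, start, units = "days")),
         .keep = "none")

names(with_days)
names(only_days)
```

The computed values match the investigation-days column for these two investigations (1492 and 2770).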
mutate_at() applies a function to the variables selected with a character vector or vars(). Since dplyr 1.0.0 it is superseded by across(), but it still works.
This example also uses stringr, another tidyverse package, which provides a cohesive set of functions designed to make working with strings as easy as possible. Its str_replace() function replaces the first match of a pattern with new text. We will use mutate_at() to change the values in the columns ‘overturned’, ‘pardoned’, ‘american’ from TRUE/FALSE to Yes/No.
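# funs() is deprecated since dplyr 0.8.0, so a ~ lambda is used instead;
# as.character() makes the logical-to-string conversion explicit
investigation <- investigation %>%
  mutate_at(c('overturned', 'pardoned', 'american'), ~ str_replace(as.character(.x), "TRUE", "Yes")) %>%
  mutate_at(c('overturned', 'pardoned', 'american'), ~ str_replace(.x, "FALSE", "No"))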
head(investigation)
## id start_date end_date investigation-days name
## 1 watergate 1973-05-19 1977-06-19 1492 James W. McCord
## 2 watergate 1973-05-19 1977-06-19 1492 Bernard L. Barker
## 3 watergate 1973-05-19 1977-06-19 1492 Bernard L. Barker
## 4 watergate 1973-05-19 1977-06-19 1492 Eugenio R. Martinez
## 5 watergate 1973-05-19 1977-06-19 1492 Eugenio R. Martinez
## 6 watergate 1973-05-19 1977-06-19 1492 Frank A. Sturgis
## indictment-days type convict_date cp-days overturned pardoned american
## 1 -246 conviction 1973-01-30 -109 No No Yes
## 2 -246 conviction 1973-01-15 -124 No No Yes
## 3 292 conviction 1974-07-12 419 Yes No Yes
## 4 -246 guilty-plea 1973-01-15 -124 No Yes Yes
## 5 292 conviction 1974-07-12 419 Yes No Yes
## 6 -246 guilty-plea 1973-01-15 -124 No No Yes
## president
## 1 Richard Nixon
## 2 Richard Nixon
## 3 Richard Nixon
## 4 Richard Nixon
## 5 Richard Nixon
## 6 Richard Nixon
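In dplyr 1.0.0 and later, across() inside mutate() covers the same ground as mutate_at(). The sketch below is an alternative to the approach above, not the method used in this vignette. It recodes logical columns with if_else() on a hypothetical data frame (toy).

```r
library(dplyr)

# Toy data frame with two logical columns
toy <- data.frame(
  name = c("A", "B"),
  overturned = c(FALSE, TRUE),
  pardoned = c(TRUE, FALSE)
)

# across() selects the columns; the ~ lambda is applied to each of them
recoded <- toy %>%
  mutate(across(c(overturned, pardoned), ~ if_else(.x, "Yes", "No")))

recoded
```

Working on the logical values directly avoids the round trip through the strings "TRUE" and "FALSE".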
group_by() takes an existing tbl and converts it into a grouped data frame/tibble where operations are performed “by group”. Its arguments are a data frame/tibble, the variables or computations to group by, .add (the default FALSE overrides existing groups, TRUE adds to them), and .drop (drop groups formed by factor levels that don’t appear in the data).
arrange() orders the rows by the values of selected columns. Its arguments are .data (data frame/tibble), variables or functions of variables (use desc() to sort a variable in descending order), and .by_group (TRUE sorts first by the grouping variables).
summarise() collapses each group into a summary: the result has one (or more) rows for each combination of grouping variables. Its arguments are .data, name-value pairs of summary functions (the name becomes the name of the variable in the result), and .groups (the grouping structure of the result).
The investigation with the most charges was “watergate”. To see this, we group by id, count the rows for each id with n(), and arrange the results in descending order.
investigation %>%
group_by(id) %>%
summarize(investigation_total=n()) %>%
arrange(desc(investigation_total))
## # A tibble: 24 x 2
## id investigation_total
## <chr> <int>
## 1 watergate 72
## 2 russia 34
## 3 whitewater 20
## 4 pierce 18
## 5 iran-contra 14
## 6 espy 13
## 7 cisneros 6
## 8 bruce-babbitt 1
## 9 bush-clinton-passport 1
## 10 deaver 1
## # ... with 14 more rows
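summarise() can compute several summaries per group at once. The following sketch uses a hypothetical data frame (toy) with invented names, and the column names charges, people and pardons are illustrative. It assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)

# Toy data frame: person B is charged twice in the same investigation
toy <- data.frame(
  id = c("watergate", "watergate", "watergate", "russia", "russia"),
  name = c("A", "B", "B", "C", "D"),
  pardoned = c(FALSE, FALSE, TRUE, FALSE, FALSE)
)

per_investigation <- toy %>%
  group_by(id) %>%
  summarise(
    charges = n(),              # rows per group
    people  = n_distinct(name), # unique names per group
    pardons = sum(pardoned),    # TRUE counts as 1
    .groups = "drop"            # return an ungrouped tibble
  ) %>%
  arrange(desc(charges))

# count() is a shorthand for the group_by() + summarise(n()) + arrange() pattern
by_count <- toy %>% count(id, sort = TRUE)
```

When only a row count is needed, count(id, sort = TRUE) is the shorter form of the same pipeline.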
ggplot2 is a plotting package that provides helpful commands to create complex plots from data in a data frame. The graphics are built layer by layer by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.
We will plot a bar chart to find out which president saw the most charges from special investigations. Each row of the data is one charge, so the bar heights count charges. It was Richard Nixon.
The arguments of ggplot() are the data frame/tibble and mapping (the default list of aesthetic mappings to use for the plot). geom_bar() sets the plot type to a bar chart, here filled in blue (other options include geom_point(), geom_boxplot(), geom_line(), geom_col(), etc.); theme_light() gives the plot a white background instead of grey; theme() controls the appearance and position of text on the graph; labs() defines the axis labels and title.
forcats, another tidyverse package, provides a suite of tools that solve common problems with factors, including changing the order of levels or the values. Its fct_infreq() function reorders a factor by the frequency of its values.
ggplot(investigation, aes(x=fct_infreq(president))) +
  geom_bar(fill='blue') +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 10, angle = 90)) +
  labs(x = 'President', y = "# of charges", title = "# of charges for each president")
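To see what fct_infreq() does outside a plot, the sketch below compares factor levels before and after reordering. The vector presidents is a small invented sample, not the real column.

```r
library(forcats)

# Small invented sample of president names
presidents <- factor(c("Bill Clinton", "Richard Nixon", "Richard Nixon",
                       "Donald Trump", "Richard Nixon", "Bill Clinton"))

# Default factor levels are sorted alphabetically
default_levels <- levels(presidents)

# fct_infreq() reorders levels from most to least frequent
freq_levels <- levels(fct_infreq(presidents))

default_levels
freq_levels
```

ggplot2 draws bars in level order, which is why wrapping the x variable in fct_infreq() sorts the bars from tallest to shortest.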
Using geom_point(), we can plot the length of each investigation to find the longest one. It was cisneros, at 3,387 days.
ggplot(investigation, aes(x=id, y=`investigation-days`)) +
  geom_point(color="red") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 90))+
labs(x = 'Investigation name', y = "# of days", title = "Length, in days, of the investigation")
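The plot's reading can be checked numerically. This is a minimal sketch, assuming dplyr 1.0.0 or later for slice_max(). The data frame toy holds three id/length pairs taken from the output above. On the real data the same pipeline would start from investigation.

```r
library(dplyr)

# Toy data frame; values copied from the output shown earlier
toy <- data.frame(
  id = c("watergate", "watergate", "whitewater", "cisneros"),
  `investigation-days` = c(1492, 1492, 2770, 3387),
  check.names = FALSE
)

longest <- toy %>%
  distinct(id, `investigation-days`) %>%   # one row per investigation
  slice_max(`investigation-days`, n = 1)   # keep the row with the largest value

longest
```

distinct() is applied first because each investigation repeats once per person charged.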
The tidyverse is an invaluable toolkit for transforming messy data sets into a format convenient for analysis. In this “vignette” we demonstrated how to use the dplyr, ggplot2, stringr and forcats packages of the tidyverse. There is more to discover within each of these packages, and the tidyverse contains other packages not covered here, such as tidyr, tibble and purrr.