Assignment

In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.

GitHub repository: https://github.com/acatlin/FALL2020TIDYVERSE

FiveThirtyEight datasets: https://data.fivethirtyeight.com/

Kaggle datasets: https://www.kaggle.com/datasets

Your task here is to create an example. Using one or more TidyVerse packages and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more capabilities of the selected TidyVerse package with your selected dataset. (25 points)

Loading of required libraries

library(tidyverse)
## -- Attaching packages -------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2          v purrr   0.3.4     
## v tibble  3.0.3          v dplyr   1.0.2     
## v tidyr   1.1.2          v stringr 1.4.0.9000
## v readr   1.3.1          v forcats 0.5.0
## -- Conflicts ----------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Dataset

https://www.kaggle.com/ronitf/heart-disease-uci

File on GitHub

https://raw.githubusercontent.com/petferns/607-week9/main/heart.csv

Capability 1 readr::read_csv

First, let's read the CSV file from GitHub using the read_csv function.

heart <- read_csv("https://raw.githubusercontent.com/petferns/607-week9/main/heart.csv")
## Parsed with column specification:
## cols(
##   age = col_double(),
##   sex = col_double(),
##   cp = col_double(),
##   trestbps = col_double(),
##   chol = col_double(),
##   fbs = col_double(),
##   restecg = col_double(),
##   thalach = col_double(),
##   exang = col_double(),
##   oldpeak = col_double(),
##   slope = col_double(),
##   ca = col_double(),
##   thal = col_double(),
##   target = col_double()
## )
head(heart)
## # A tibble: 6 x 14
##     age   sex    cp trestbps  chol   fbs restecg thalach exang oldpeak slope
##   <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1    63     1     3      145   233     1       0     150     0     2.3     0
## 2    37     1     2      130   250     0       1     187     0     3.5     0
## 3    41     0     1      130   204     0       0     172     0     1.4     2
## 4    56     1     1      120   236     0       1     178     0     0.8     2
## 5    57     0     0      120   354     0       1     163     1     0.6     2
## 6    57     1     0      140   192     0       1     148     0     0.4     1
## # ... with 3 more variables: ca <dbl>, thal <dbl>, target <dbl>
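
By default read_csv guesses each column type, which is why every column above was parsed as a double. As a minimal sketch, assuming we would rather store age, sex and cp as integers, the col_types argument lets us declare the types ourselves. The temporary file below holds only the first three rows and four of the columns shown above so that the example runs without a network connection; the names tmp and sample_rows are made up for illustration.

```r
library(readr)

# Write the first three rows of heart.csv (four columns only, copied from
# the head() output above) to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("age,sex,cp,chol",
    "63,1,3,233",
    "37,1,2,250",
    "41,0,1,204"),
  tmp
)

# Declare the column types instead of letting read_csv guess them
sample_rows <- read_csv(
  tmp,
  col_types = cols(
    age  = col_integer(),
    sex  = col_integer(),
    cp   = col_integer(),
    chol = col_double()
  )
)

sample_rows
```

Supplying col_types also suppresses the "Parsed with column specification" message, since nothing is left to guess.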

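The read_csv function from readr is generally faster than base R's read.csv, and the gap grows with file size. A single timing of each call illustrates this, although both calls here download the file, so the elapsed time also includes network latency.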

system.time(d <- read.csv("https://raw.githubusercontent.com/petferns/607-week9/main/heart.csv"))
##    user  system elapsed 
##    0.12    0.11    1.06
system.time(d <- read_csv("https://raw.githubusercontent.com/petferns/607-week9/main/heart.csv"))
##    user  system elapsed 
##    0.00    0.03    0.27
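
Because both calls above fetch the file over the network, the elapsed times mix download time with parsing time. A rough sketch of a fairer comparison is to time both readers on a local temporary file. The synthetic data frame below (its columns, values and the 200,000-row size) is an assumption made for illustration and is not the heart data. Timings vary by machine, so no expected output is shown.

```r
library(readr)

set.seed(1)
n <- 200000

# Synthetic data, made up for this benchmark only
synthetic <- data.frame(
  age  = sample(29:77, n, replace = TRUE),
  chol = round(rnorm(n, mean = 246, sd = 52))
)

# Write it to a local file so that no download is involved
tmp <- tempfile(fileext = ".csv")
write.csv(synthetic, tmp, row.names = FALSE)

# Time each reader on the same local file
base_time  <- system.time(d_base  <- read.csv(tmp))
readr_time <- system.time(d_readr <- read_csv(tmp, col_types = cols()))

# Compare the elapsed seconds of the two readers
rbind(read.csv = base_time, read_csv = readr_time)[, "elapsed"]

unlink(tmp)
```

Passing col_types = cols() only silences the column specification message; the types are still guessed. For a more careful measurement, each call could be repeated several times and the median taken.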

Capability 2 dplyr::filter

As the name suggests, the filter function keeps only the rows of a data frame that satisfy a condition. Here we keep the rows of the heart data frame where age is greater than 45.

filter(heart, age > 45)
## # A tibble: 239 x 14
##      age   sex    cp trestbps  chol   fbs restecg thalach exang oldpeak slope
##    <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1    63     1     3      145   233     1       0     150     0     2.3     0
##  2    56     1     1      120   236     0       1     178     0     0.8     2
##  3    57     0     0      120   354     0       1     163     1     0.6     2
##  4    57     1     0      140   192     0       1     148     0     0.4     1
##  5    56     0     1      140   294     0       0     153     0     1.3     1
##  6    52     1     2      172   199     1       1     162     0     0.5     2
##  7    57     1     2      150   168     0       1     174     0     1.6     2
##  8    54     1     0      140   239     0       1     160     0     1.2     2
##  9    48     0     2      130   275     0       1     139     0     0.2     2
## 10    49     1     1      130   266     0       1     171     0     0.6     2
## # ... with 229 more rows, and 3 more variables: ca <dbl>, thal <dbl>,
## #   target <dbl>
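
filter also accepts several conditions at once. As a small sketch, the tibble below copies four columns from the six rows printed by head(heart) earlier, so it runs without downloading anything; the names heart_head, older_women and either are made up for illustration.

```r
library(dplyr)

# Six rows and four columns copied from the head(heart) output above
heart_head <- tibble(
  age  = c(63, 37, 41, 56, 57, 57),
  sex  = c(1, 1, 0, 1, 0, 1),
  cp   = c(3, 2, 1, 1, 0, 0),
  chol = c(233, 250, 204, 236, 354, 192)
)

# Conditions separated by commas are combined with AND
older_women <- filter(heart_head, age > 45, sex == 0)

# Use | when either condition is enough
either <- filter(heart_head, age > 45 | chol > 240)

older_women
either
```

On these six rows, the first call keeps a single row (age 57, chol 354) and the second keeps five, dropping only the 41-year-old with a chol of 204.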

Capability 3 dplyr::select

Just as filter keeps a subset of rows, the select function keeps a subset of columns. Use it when you need only specific columns rather than the whole data frame.

select(heart, age, sex, chol)
## # A tibble: 303 x 3
##      age   sex  chol
##    <dbl> <dbl> <dbl>
##  1    63     1   233
##  2    37     1   250
##  3    41     0   204
##  4    56     1   236
##  5    57     0   354
##  6    57     1   192
##  7    56     0   294
##  8    44     1   263
##  9    52     1   199
## 10    57     1   168
## # ... with 293 more rows
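
Besides naming columns one by one, select understands ranges, negation and helper functions such as starts_with. The sketch below uses a small tibble with three rows and five columns copied from the head(heart) output earlier; the names heart_small, first_three, without_sex and c_columns are made up for illustration.

```r
library(dplyr)

# Three rows and five columns copied from the head(heart) output above
heart_small <- tibble(
  age      = c(63, 37, 41),
  sex      = c(1, 1, 0),
  cp       = c(3, 2, 1),
  trestbps = c(145, 130, 130),
  chol     = c(233, 250, 204)
)

# A range of adjacent columns, from age through cp
first_three <- select(heart_small, age:cp)

# Everything except one column
without_sex <- select(heart_small, -sex)

# Columns whose names start with "c"
c_columns <- select(heart_small, starts_with("c"))

names(first_three)
names(without_sex)
names(c_columns)
```

The three calls return the columns age, sex, cp; then age, cp, trestbps, chol; and finally cp, chol.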