In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.
GitHub repository: https://github.com/acatlin/FALL2020TIDYVERSE
https://data.fivethirtyeight.com/ datasets.
https://www.kaggle.com/datasets datasets.
Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)
library(tidyverse)
## -- Attaching packages -------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0.9000
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ----------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
First, let's read the CSV file from GitHub using the read_csv function from the readr package.
heart <- read_csv("https://raw.githubusercontent.com/petferns/607-week9/main/heart.csv")
## Parsed with column specification:
## cols(
## age = col_double(),
## sex = col_double(),
## cp = col_double(),
## trestbps = col_double(),
## chol = col_double(),
## fbs = col_double(),
## restecg = col_double(),
## thalach = col_double(),
## exang = col_double(),
## oldpeak = col_double(),
## slope = col_double(),
## ca = col_double(),
## thal = col_double(),
## target = col_double()
## )
head(heart)
## # A tibble: 6 x 14
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 63 1 3 145 233 1 0 150 0 2.3 0
## 2 37 1 2 130 250 0 1 187 0 3.5 0
## 3 41 0 1 130 204 0 0 172 0 1.4 2
## 4 56 1 1 120 236 0 1 178 0 0.8 2
## 5 57 0 0 120 354 0 1 163 1 0.6 2
## 6 57 1 0 140 192 0 1 148 0 0.4 1
## # ... with 3 more variables: ca <dbl>, thal <dbl>, target <dbl>
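The column specification message printed by read_csv can be avoided by declaring the column types up front. The following is a minimal sketch, assuming a small temporary file built from three of the rows and columns shown in the output above; the temporary file, the name `heart_small`, and the choice of integer types are illustrative assumptions, not part of the original vignette.

```r
library(readr)

# Write three rows (a subset of the columns printed above) to a temporary CSV.
# The temporary file stands in for the real heart.csv (assumption).
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,sex,chol", "63,1,233", "37,1,250", "41,0,204"), tmp)

# Declare the column types explicitly. read_csv then skips type guessing
# and does not print the column specification message.
heart_small <- read_csv(
  tmp,
  col_types = cols(
    age  = col_integer(),
    sex  = col_integer(),
    chol = col_double()
  )
)

heart_small
```

Declaring types this way also makes the script fail loudly if a column changes type in a later version of the file, rather than silently guessing a different type.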
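The read_csv function from readr (loaded with the tidyverse) is generally faster than base R's read.csv. We can check this with system.time, keeping in mind that timing a URL also measures the download, so a single run gives only a rough comparison: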
system.time(d<-read.csv("https://raw.githubusercontent.com/petferns/607-week9/main/heart.csv"))
## user system elapsed
## 0.12 0.11 1.06
system.time(d<-read_csv("https://raw.githubusercontent.com/petferns/607-week9/main/heart.csv"))
## user system elapsed
## 0.00 0.03 0.27
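Because the timings above include network latency, a fairer comparison reads the same local file with both functions. The sketch below assumes a synthetic data frame with hypothetical values (it is not the heart data) written to a temporary file; the names `synthetic`, `base_time`, and `readr_time` are illustrative. Actual timings depend on the machine and package versions, so no expected output is shown.

```r
library(readr)

# Build a synthetic data frame (hypothetical values, not the heart data)
set.seed(1)
n <- 50000
synthetic <- data.frame(
  age  = sample(29:77, n, replace = TRUE),
  sex  = sample(0:1, n, replace = TRUE),
  chol = round(runif(n, 120, 560))
)

# Save it locally so the timings exclude the download
tmp <- tempfile(fileext = ".csv")
write.csv(synthetic, tmp, row.names = FALSE)

# Time each reader on the same local file.
# col_types = cols() lets readr guess the types without printing the message.
base_time  <- system.time(d_base  <- read.csv(tmp))
readr_time <- system.time(d_readr <- read_csv(tmp, col_types = cols()))

base_time["elapsed"]
readr_time["elapsed"]
```

For a more reliable comparison, each call could be repeated several times and the elapsed times averaged.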
As the name suggests, the filter function from dplyr keeps only the rows of a data frame that satisfy a condition. Here we filter the heart data frame to the rows where age is greater than 45.
filter(heart, age > 45)
## # A tibble: 239 x 14
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 63 1 3 145 233 1 0 150 0 2.3 0
## 2 56 1 1 120 236 0 1 178 0 0.8 2
## 3 57 0 0 120 354 0 1 163 1 0.6 2
## 4 57 1 0 140 192 0 1 148 0 0.4 1
## 5 56 0 1 140 294 0 0 153 0 1.3 1
## 6 52 1 2 172 199 1 1 162 0 0.5 2
## 7 57 1 2 150 168 0 1 174 0 1.6 2
## 8 54 1 0 140 239 0 1 160 0 1.2 2
## 9 48 0 2 130 275 0 1 139 0 0.2 2
## 10 49 1 1 130 266 0 1 171 0 0.6 2
## # ... with 229 more rows, and 3 more variables: ca <dbl>, thal <dbl>,
## # target <dbl>
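filter also accepts several conditions at once. The sketch below rebuilds three columns of the six rows printed by head(heart) as a small tibble, so it runs without downloading the file; the object names `heart_head`, `older_high_chol`, and `young_or_sex0` are illustrative assumptions.

```r
library(dplyr)

# First six rows of heart as printed by head(heart), limited to three columns
heart_head <- tibble(
  age  = c(63, 37, 41, 56, 57, 57),
  sex  = c(1, 1, 0, 1, 0, 1),
  chol = c(233, 250, 204, 236, 354, 192)
)

# Comma-separated conditions are combined with AND:
# keep rows with age above 45 and chol above 200
older_high_chol <- filter(heart_head, age > 45, chol > 200)

# The | operator expresses OR:
# keep rows with age below 40 or sex equal to 0
young_or_sex0 <- filter(heart_head, age < 40 | sex == 0)

older_high_chol
young_or_sex0
```

Each call returns three of the six rows: the first keeps the rows with ages 63, 56, and 57 (chol 354), and the second keeps the rows with ages 37, 41, and 57 (chol 354).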
Just as filter picks rows, the select function from dplyr picks columns. When only a few columns are needed rather than the whole set, we can pass their names to select.
select(heart, age, sex, chol)
## # A tibble: 303 x 3
## age sex chol
## <dbl> <dbl> <dbl>
## 1 63 1 233
## 2 37 1 250
## 3 41 0 204
## 4 56 1 236
## 5 57 0 354
## 6 57 1 192
## 7 56 0 294
## 8 44 1 263
## 9 52 1 199
## 10 57 1 168
## # ... with 293 more rows
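Beyond listing names, select supports ranges, exclusions, and helper functions such as starts_with. The sketch below uses five columns of the six rows printed by head(heart), rebuilt as a small tibble so that it runs on its own; the object names `heart_head`, `by_range`, `by_drop`, and `by_prefix` are illustrative assumptions.

```r
library(dplyr)

# First six rows of heart as printed by head(heart), limited to five columns
heart_head <- tibble(
  age      = c(63, 37, 41, 56, 57, 57),
  sex      = c(1, 1, 0, 1, 0, 1),
  cp       = c(3, 2, 1, 1, 0, 0),
  trestbps = c(145, 130, 130, 120, 120, 140),
  chol     = c(233, 250, 204, 236, 354, 192)
)

# A range of adjacent columns: age, sex, cp
by_range <- select(heart_head, age:cp)

# Everything except one column: age, sex, cp, chol
by_drop <- select(heart_head, -trestbps)

# Columns whose names start with "c": cp, chol
by_prefix <- select(heart_head, starts_with("c"))

names(by_range)
names(by_drop)
names(by_prefix)
```

Helpers like starts_with are convenient on wide data frames, where typing every column name would be tedious and error-prone.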