Tydiverse Create Extend

The Data

I’ll be using the data from SARS 2003 Outbreak Complete Dataset from Kaggle. The raw github link is here

Loading Packages

library(tidyverse)

## -- Attaching packages ------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ---------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tidyr)
library(knitr)

Loading the data

df=read.csv('https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/sars_2003_complete_dataset_clean.csv', stringsAsFactors 
 = F)

head(df)

str(df)

## 'data.frame':    2538 obs. of  5 variables:
##  $ Date                        : chr  "2003-03-17" "2003-03-17" "2003-03-17" "2003-03-17" ...
##  $ Country                     : chr  "Germany" "Canada" "Singapore" "Hong Kong SAR, China" ...
##  $ Cumulative.number.of.case.s.: int  1 8 20 95 2 1 40 2 8 0 ...
##  $ Number.of.deaths            : int  0 2 0 1 0 0 1 0 2 0 ...
##  $ Number.recovered            : int  0 0 0 0 0 0 0 0 0 0 ...

as.tibble(df)

## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.

Using dplyr: rename to rename variables by name

Usage

rename(.data, …)

select(.data, …)

Examples

rename( ) can be used to rename columns, except all columns not mentioned are kept

df1 <- df %>% rename("Cumulative_number_of_cases" = "Cumulative.number.of.case.s.")
names(df1)

## [1] "Date"                       "Country"                   
## [3] "Cumulative_number_of_cases" "Number.of.deaths"          
## [5] "Number.recovered"

Select( ) can be used to rename columns, though all columns not mentioned are dropped

df %>% select("Cumulative_number_of_cases" = "Cumulative.number.of.case.s.") %>% head(5)

Need to make sure df1 is not tibble.

Using dplyr: top_frac to select top (or bottom) n rows (by value)

Usage

top_frac(x, n, wt)

top_n(x, n, wt)

Examples

Wanted to get the top 30% of the countries by day in terms of cumulative number of cases

# The funciton count will give us a count of records by day, denoted in n. 
# group_by (x) followed by %>% will give you the result grouping by x
df1 %>% group_by (Date) %>% count

.3 * n( ) will give you 30% of row count. Thus the following will give you the top 30% of the countries per date in terms of cumulative number of cases.

#  Assign the result to df2  
df2 <- df1 %>% group_by(Date ) %>% top_n(.3 * n(), Cumulative_number_of_cases)
df2

top_frac is basically a shorthand to eliminate the need to put in n() for the total row count in that group. Simply put in the fraction as a the parameter. See below.

# note that top_frac takes in .3 instead of .3 * n(). Assign the result to df2a
df2a <- df1 %>% group_by(Date ) %>% top_frac(.3, Cumulative_number_of_cases)
df2a

# checking if df2a is the same as df2
all.equal(df2, df2a)

## [1] TRUE

Use dplyr::tally or dplyr::count to do counts like n( )

Besides summarize/summarise, what’re some of the functions that one can use to tally counts?

Usage

tally(x, wt = NULL, sort = FALSE, name = “n”)

count(x, …, wt = NULL, sort = FALSE, name = “n”, .drop = group_by_drop_default(x))

tally the record/obs count by country

# an automatic variable n is created to hold the results of tally
df2 %>% group_by(Country) %>% tally

Count the number of record/obs by country

# It can be further simplified to using just count. count (x) means counting the number of records grouping by x. 
df2 %>% count (Country)

Using dplyr: mutate to add new variables that are functions of existing variables

Usage

mutate(.data, …)

Examples

mutate() adds new variables and preserves existing ones

#using Mutate to create new variable as `Year` having just the date component of the existing variable `Date` 
df.new <- df %>% 
  mutate(Year=format(as.POSIXct(df$Date,format="%Y-%m-%d"),"%Y"))
str(df.new)

## 'data.frame':    2538 obs. of  6 variables:
##  $ Date                        : chr  "2003-03-17" "2003-03-17" "2003-03-17" "2003-03-17" ...
##  $ Country                     : chr  "Germany" "Canada" "Singapore" "Hong Kong SAR, China" ...
##  $ Cumulative.number.of.case.s.: int  1 8 20 95 2 1 40 2 8 0 ...
##  $ Number.of.deaths            : int  0 2 0 1 0 0 1 0 2 0 ...
##  $ Number.recovered            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Year                        : chr  "2003" "2003" "2003" "2003" ...

kable(head(df.new,10))

Date	Country	Cumulative.number.of.case.s.	Number.of.deaths	Year
2003-03-17	Germany	1	0	2003
2003-03-17	Canada	8	2	2003
2003-03-17	Singapore	20	0	2003
2003-03-17	Hong Kong SAR, China	95	1	2003
2003-03-17	Switzerland	2	0	2003
2003-03-17	Thailand	1	0	2003
2003-03-17	Viet Nam	40	1	2003
2003-03-18	Germany	2	0	2003
2003-03-18	Canada	8	2	2003
2003-03-18	China	0	0	2003

Using dplyr: arrange to changes the ordering of the rows

Usage

arrange(.data, …)

Examples

arrange() changes the ordering of the rows based on one or more columns and in ascending or descending order

# ordering first by `Country` in descending and then by `Date` in defualt of ascending 
df.new %>% 
  arrange(desc(Country), Date)

Using dplyr: slice to Choose rows by their ordinal position in the tbl. Grouped tbls use the ordinal position within the group.

Usage

slice(.data, …, .preserve = FALSE)

Examples

slice() does not work with relational databases because they have no intrinsic notion of row order. If you want to perform the equivalent operation, use filter() and row_number().

# choosing the last row using n() along with slice()
kable(slice(df.new, n()))

Date	Country	Cumulative.number.of.case.s.	Number.of.deaths	Number.recovered	Year
2003-07-11	Viet Nam	63	5	58	2003

# choosing the first 7 row using n() along with slice()
kable(slice(df.new, 1:7))

Date	Country	Cumulative.number.of.case.s.	Number.of.deaths	Year
2003-03-17	Germany	1	0	2003
2003-03-17	Canada	8	2	2003
2003-03-17	Singapore	20	0	2003
2003-03-17	Hong Kong SAR, China	95	1	2003
2003-03-17	Switzerland	2	0	2003
2003-03-17	Thailand	1	0	2003
2003-03-17	Viet Nam	40	1	2003

Tydiverse Create Extend

Dennis Pong / Vinayak kamath for Extend

04/15/2020

The Data

Loading Packages

Loading the data

Using dplyr: rename to rename variables by name

Usage

Examples

rename( ) can be used to rename columns, except all columns not mentioned are kept

Select( ) can be used to rename columns, though all columns not mentioned are dropped

Need to make sure df1 is not tibble.

Using dplyr: top_frac to select top (or bottom) n rows (by value)

Usage

Examples

Wanted to get the top 30% of the countries by day in terms of cumulative number of cases

.3 * n( ) will give you 30% of row count. Thus the following will give you the top 30% of the countries per date in terms of cumulative number of cases.

top_frac is basically a shorthand to eliminate the need to put in n() for the total row count in that group. Simply put in the fraction as a the parameter. See below.

Use dplyr::tally or dplyr::count to do counts like n( )

Besides summarize/summarise, what’re some of the functions that one can use to tally counts?

Usage

tally the record/obs count by country

Count the number of record/obs by country

Using dplyr: mutate to add new variables that are functions of existing variables

Usage

Examples

mutate() adds new variables and preserves existing ones

Using dplyr: arrange to changes the ordering of the rows

Usage

Examples

arrange() changes the ordering of the rows based on one or more columns and in ascending or descending order

Using dplyr: slice to Choose rows by their ordinal position in the tbl. Grouped tbls use the ordinal position within the group.

Usage

Examples

slice() does not work with relational databases because they have no intrinsic notion of row order. If you want to perform the equivalent operation, use filter() and row_number().