The 4 fundamental processes to turn data into understanding, knowledge and insight consist of: data manipulation, data visualization, statistical analysis, and deployment of results. Having well-structured data, “makes data suitable for software processing whether that be mathematical functions, visualization, etc. and reveals information and insights.” Both packages are useful for cleaning, processing, and manipulating data.

Tidyr

gather()

This function reshapes a wide format to a long format. It will take multiple columns and collapse them into key-value pairs, duplicating all other coloums as needed. This is a complement to the spread ( ) function.

Function:

“gather(data, key, value, …, na.rm = FALSE, convert = FALSE or data %>% gather (key, value, …, na.rm = FALSE, convert = FALSE”

Arguments:

“key: column name representing new variable”

“value: column name representing variable values”

“…: names of columns to gather (or not gather)”

“na.rm: option to remove observations with missing values (represented by NAs)”

“convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate”

separate()

This function splits a single variable into two. Sometimes a single column variable may contain multiple variables or irrelevant pieces of variables. This is a complement to the unite() function

Function:

“separate(data, col, into, sep =” “, remove = TRUE, convert = FALSE) or data %>% separate(col, into, sep =” “, remove = TRUE, convert = FALSE)”

Arguments:

“col: column name representing current variable into: names of variables representing new variables”

“sep: how to separate current variable (char, num, or symbol)”

" remove: if TRUE, remove input column from output data frame"

" convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate"

unite()

This function helps unite two variables into one. This could be used when you want to combine the values of two variables. It combines two variables of a single observation into one variable. This is a complement to the separate() function.

Function:

“unite(data, col, …, sep =” “, remove = TRUE or data %>% unite(col, …, sep =” “, remove = TRUE”

Arguments

" col: column name of new “merged” column"

“…: names of columns to merge”

" sep: separator to use between merged values"

“remove: if TRUE, remove input column from output data frame”

spread ()

This function reshapes a long format into a wide format. It spreads a key-value pair across multiplt columns. This is a complement to the gather() function.

Function:

" spread(data, key, value, fill = NA, convert = FALSE) or data %>% spread(key, value, fill = NA, convert = FALSE)"

Arguments:

“key: column values to convert to multiple columns”

“value: single column values to convert to multiple columns’ values”

“fill: If there isn’t a value for every combination of the other variables and the key column, this value will be substituted”

“convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate”

Dplyr

select()

This function reduces the datafram size to only desired variables for current task. This would be useful if we want to assess specific variables when we have a large dataframe. This function allows you to select and/or rename varibales.

Function:

“select(data,…) or data %>% select(…)”

Special functions:

" starts_with(x, ignore.case = TRUE): names starts with x "

“ends_with(x, ignore.case = TRUE): names ends in x”

“contains(x, ignore.case = TRUE): selects all variables whose name contains x”

filter()

This function reduces rows/observations with matching conditions. This is useful for filtering data to indentify/select observations in which a particular variable matches a specific value/condition.

Function:

“filter(data,…)” “…: conditions to be met”

Logic rules can be applied to the filter function

group_by()

This function allows you to group data by categorical values. This allows for statistical analysis at the observation and group levels. This function allows us to create categorical groupings.

Function:

“group-by(data, …) or data %>% group_by(…) …: variables to group by”

Use ungroup(x) to remove groups

summarise()

This function is used to perform summart statistics or variables. It allows us to perform the majority of the initial summary statistics when performing exploratory data analysis.

Function:

“summarise(data, …) or data %>% summarise (…) …: name-value pairs of logic function(s) like min(), mean(), max(), etc.”

arrange()

This function is used to order variable values. It allows us to order data by variables in accending or descending order.

Function:

“arrange(data, …) or data %>% arrange(…) …: variable(s) to order”

Use desc(x) to sort variable in descending order

join()

This function joins two datasets together. It provides multiple ways to join dataframes, which may have common and differing variables for similar observations.

Function:

“inner_join(x, y, by = NULL”

Include only roqa in both x and y that have a matching value

“left_join(x, y, by = NULL”

Include all of x, and matching rows of y

“semi_join(x, y, by = NULL”

Include rows of x that match y but only keep the columns from x

“anti_join(x, y, by = NULL”

Opposite of semi_join

“x,y: data frames to join by: a character vector of variables to join by. If NULL, the default, join will do a natural join using all variables with common names across the two tables”

mutate()

This function creates new variables. This function would be used if we want to add new variables while preserving the existing variables.

Function:

“mutate(data, …) or data %>% mutate(…) …: Expression(s)”

Data Tidying

Tidy Data

Data is easier to work with if it follows these rules:

Each variable in the data set is placed in its own column
Each observation is placed in its own row
Each value is place in its own cell

Data Frames

A data frame is a list of vectors that R displays as a table. When your data is tidy, the values of each variable fall in their own column vector. Tidy data arranges values so that the relationships in the data parallel the structure of the data frame. In Tidy Data, each variable is assigned to its own column (its own vector in the data frame).

spread() and gather()

Key Value Pairs

A key value pair is a simple way to record information. A pair contains 2 parts: 1. A key that explains what the information describes 2. A value that contains the actual information

Data values form natural key value pairs, The value is the value of the pair and the variable that the value describes is the key. In tidy data, each cell will contain a value and each column will contain a key.

spread()

This turns a pair of key:value columns into a set of tidy columns.

This function returns a copy of your data set that has had the key and value columns removed. Spread() adds a new column for each unique value of the key column. It then distributes cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.

gather()

This collects a set of column names and places them into a single “key” column. It also collects the cells of those columns and places them into a single value column.

Gather() returns a copy of the data frame with the specified columns removed. It repeats each of the former column names to maintain each combination of values that appeared in the original data set. It uses the first string that you supplied as the name f the new “key” column, and it uses the second string as the name of the new value column.

How to split and combine values

separate()

This turns a single character column into multiple columns by splitting the values of the column wherever a separator character appears. By default, it will split values wherever a non-alphanumeric character appears.

unite()

This combines multiple columns into a single column. It returns a copy of the data frame that includes the new column, but not the columns used to build the new column. If you want to retain theses columns, add the argument remove = False.

Tutorials for the dplyr package in R

dplyr Functionality

5 basic verbs (explained earlier) 1. filter 2. select 3. arrange 4. mutate 5. summarise (and group_by)

For all dplyr verbs: 1. First argument is a data frame 2. Return value is a data frame 3. Nothing is modified in place (dplyr generally does not preserve row names)

filter: keep rows matching criteria

You can use comma or ampersand to represent and condition

“filter(flights, Month==1, DayofMonth==1)”

select: pick columns by name

Use colon to select multiple contiguous columns, and use ‘contains’ to match columns by name. ‘starts_with’, and ‘matches’ can also be used to match columns.

“select(flights, Year:DayofMonth, contains(”Taxi“), contains(”Delay“))”

“Chaining” or “Pipelining”

Chaining increases readability significantly when there are many commands.It can be used to replace nesting in R commands outside of dplyr “flights %>%

select(UniqueCarrier, DepDelay) %>%

filter(DepDelay > 60)"

arrange: reoder rows

“flights %>% select(UniqueCarrier, DepDelay) %>% arrange(DepDelay)” For descending order: arrange(desc(DepDelay))

mutate: add new variables

This creates new variables that are functions of existing variables. With this approach, it prints the new variable but does not store it. “flights %>% select(Distance, AirTime) %>% mutate(Speed = Distance/AirTime*60)" Store the new variable: “flights <- flights %>% mutate(Speed = Distance/AirTime*60)"

summarise: reduce variables to values

This is primarily useful with data that has been grouped by one or more variables. “flights %>% group_by(Dest) %>% summarise(avg_delay = mean(ArrDelay, na.rm=TRUE))”

group_by: creates groups that will be operated on

summarise: uses the provided aggregation function to summarise each group

summarise_each: allows you to apply the same summary function to multiple columns at once (mutate_each is also available)

Helper function n(): counds the number of rows in a group

n_distinct(vector): counts the number of unique items in that vector

Tally function: “flights %>% group_by(Month, DayofMonth) %>% tally(sort = TRUE)”

Window Functions

Aggregation function (like mean) takes n inputs and returns 1 value

Windows function: takes n inputs and returns n values

This includes ranking and ordering functions (like min_rank), offset functions (lead and lag) and cumulative aggregates (like cummean)

Other useful Convenience Functions

Randomly sample a fixed number of rows, without replacement

flights %>% sample_n(5)

Randomly sample a fraction of rows, with replacement

flights %>% sample_frac(0.25, replace=TRUE)

Base R approach to view the structure of an object

str(flights)

dplyr approach: better formatting, and adapts to your screen width

glimpse(flights)

Conneceting to Databases

Dplyr can connect to a database as if the data was loaded into a data frame

Tidyr and Dplyr

Maya Fraser-Philbin

September 28, 2015

Tidyr

gather()

This function reshapes a wide format to a long format. It will take multiple columns and collapse them into key-value pairs, duplicating all other coloums as needed. This is a complement to the spread ( ) function.

Function:

Arguments:

separate()

This function splits a single variable into two. Sometimes a single column variable may contain multiple variables or irrelevant pieces of variables. This is a complement to the unite() function

Function:

Arguments:

unite()

This function helps unite two variables into one. This could be used when you want to combine the values of two variables. It combines two variables of a single observation into one variable. This is a complement to the separate() function.

Function:

Arguments

spread ()

This function reshapes a long format into a wide format. It spreads a key-value pair across multiplt columns. This is a complement to the gather() function.

Function:

Arguments:

Dplyr

select()

This function reduces the datafram size to only desired variables for current task. This would be useful if we want to assess specific variables when we have a large dataframe. This function allows you to select and/or rename varibales.

Function:

Special functions:

filter()

This function reduces rows/observations with matching conditions. This is useful for filtering data to indentify/select observations in which a particular variable matches a specific value/condition.

Function:

Logic rules can be applied to the filter function

group_by()

This function allows you to group data by categorical values. This allows for statistical analysis at the observation and group levels. This function allows us to create categorical groupings.

Function:

Use ungroup(x) to remove groups

summarise()

This function is used to perform summart statistics or variables. It allows us to perform the majority of the initial summary statistics when performing exploratory data analysis.

Function:

arrange()

This function is used to order variable values. It allows us to order data by variables in accending or descending order.

Function:

Use desc(x) to sort variable in descending order

join()

This function joins two datasets together. It provides multiple ways to join dataframes, which may have common and differing variables for similar observations.

Function:

Include only roqa in both x and y that have a matching value

Include all of x, and matching rows of y

Include rows of x that match y but only keep the columns from x

Opposite of semi_join

mutate()

This function creates new variables. This function would be used if we want to add new variables while preserving the existing variables.

Function:

Data Tidying

Tidy Data

Data is easier to work with if it follows these rules:

Data Frames

spread() and gather()

Key Value Pairs

spread()

gather()

How to split and combine values

separate()

unite()

Tutorials for the dplyr package in R

dplyr Functionality

filter: keep rows matching criteria

select: pick columns by name

“Chaining” or “Pipelining”

arrange: reoder rows

mutate: add new variables

summarise: reduce variables to values

group_by: creates groups that will be operated on

summarise: uses the provided aggregation function to summarise each group

Helper function n(): counds the number of rows in a group

n_distinct(vector): counts the number of unique items in that vector

Window Functions

Aggregation function (like mean) takes n inputs and returns 1 value

Windows function: takes n inputs and returns n values

Other useful Convenience Functions

Randomly sample a fixed number of rows, without replacement

Randomly sample a fixed number of rows, without replacement

Randomly sample a fraction of rows, with replacement