The 4 fundamental processes to turn data into understanding, knowledge and insight consist of: data manipulation, data visualization, statistical analysis, and deployment of results. Having well-structured data, “makes data suitable for software processing whether that be mathematical functions, visualization, etc. and reveals information and insights.” Both packages are useful for cleaning, processing, and manipulating data.

Tidyr

gather()

This function reshapes a wide format to a long format. It will take multiple columns and collapse them into key-value pairs, duplicating all other coloums as needed. This is a complement to the spread ( ) function.

Function:

“gather(data, key, value, …, na.rm = FALSE, convert = FALSE or data %>% gather (key, value, …, na.rm = FALSE, convert = FALSE”

Arguments:

“key: column name representing new variable”

“value: column name representing variable values”

“…: names of columns to gather (or not gather)”

“na.rm: option to remove observations with missing values (represented by NAs)”

“convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate”

separate()

This function splits a single variable into two. Sometimes a single column variable may contain multiple variables or irrelevant pieces of variables. This is a complement to the unite() function

Function:

“separate(data, col, into, sep =” “, remove = TRUE, convert = FALSE) or data %>% separate(col, into, sep =” “, remove = TRUE, convert = FALSE)”

Arguments:

“col: column name representing current variable into: names of variables representing new variables”

“sep: how to separate current variable (char, num, or symbol)”

" remove: if TRUE, remove input column from output data frame"

" convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate"

unite()

This function helps unite two variables into one. This could be used when you want to combine the values of two variables. It combines two variables of a single observation into one variable. This is a complement to the separate() function.

Function:

“unite(data, col, …, sep =” “, remove = TRUE or data %>% unite(col, …, sep =” “, remove = TRUE”

Arguments

" col: column name of new “merged” column"

“…: names of columns to merge”

" sep: separator to use between merged values"

“remove: if TRUE, remove input column from output data frame”

spread ()

This function reshapes a long format into a wide format. It spreads a key-value pair across multiplt columns. This is a complement to the gather() function.

Function:

" spread(data, key, value, fill = NA, convert = FALSE) or data %>% spread(key, value, fill = NA, convert = FALSE)"

Arguments:

“key: column values to convert to multiple columns”

“value: single column values to convert to multiple columns’ values”

“fill: If there isn’t a value for every combination of the other variables and the key column, this value will be substituted”

“convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate”

Dplyr

select()

This function reduces the datafram size to only desired variables for current task. This would be useful if we want to assess specific variables when we have a large dataframe. This function allows you to select and/or rename varibales.

Function:

“select(data,…) or data %>% select(…)”

Special functions:

" starts_with(x, ignore.case = TRUE): names starts with x "

“ends_with(x, ignore.case = TRUE): names ends in x”

“contains(x, ignore.case = TRUE): selects all variables whose name contains x”

filter()

This function reduces rows/observations with matching conditions. This is useful for filtering data to indentify/select observations in which a particular variable matches a specific value/condition.

Function:

“filter(data,…)” “…: conditions to be met”

Logic rules can be applied to the filter function

group_by()

This function allows you to group data by categorical values. This allows for statistical analysis at the observation and group levels. This function allows us to create categorical groupings.

Function:

“group-by(data, …) or data %>% group_by(…) …: variables to group by”

Use ungroup(x) to remove groups

summarise()

This function is used to perform summart statistics or variables. It allows us to perform the majority of the initial summary statistics when performing exploratory data analysis.

Function:

“summarise(data, …) or data %>% summarise (…) …: name-value pairs of logic function(s) like min(), mean(), max(), etc.”

arrange()

This function is used to order variable values. It allows us to order data by variables in accending or descending order.

Function:

“arrange(data, …) or data %>% arrange(…) …: variable(s) to order”

Use desc(x) to sort variable in descending order

join()

This function joins two datasets together. It provides multiple ways to join dataframes, which may have common and differing variables for similar observations.

Function:

“inner_join(x, y, by = NULL”

Include only roqa in both x and y that have a matching value

“left_join(x, y, by = NULL”

Include all of x, and matching rows of y

“semi_join(x, y, by = NULL”

Include rows of x that match y but only keep the columns from x

“anti_join(x, y, by = NULL”

Opposite of semi_join

“x,y: data frames to join by: a character vector of variables to join by. If NULL, the default, join will do a natural join using all variables with common names across the two tables”

mutate()

This function creates new variables. This function would be used if we want to add new variables while preserving the existing variables.

Function:

“mutate(data, …) or data %>% mutate(…) …: Expression(s)”

Data Tidying

Tidy Data

Data is easier to work with if it follows these rules:

  1. Each variable in the data set is placed in its own column
  2. Each observation is placed in its own row
  3. Each value is place in its own cell

Data Frames

A data frame is a list of vectors that R displays as a table. When your data is tidy, the values of each variable fall in their own column vector. Tidy data arranges values so that the relationships in the data parallel the structure of the data frame. In Tidy Data, each variable is assigned to its own column (its own vector in the data frame).

spread() and gather()

Key Value Pairs

A key value pair is a simple way to record information. A pair contains 2 parts: 1. A key that explains what the information describes 2. A value that contains the actual information

Data values form natural key value pairs, The value is the value of the pair and the variable that the value describes is the key. In tidy data, each cell will contain a value and each column will contain a key.

spread()

This turns a pair of key:value columns into a set of tidy columns.

This function returns a copy of your data set that has had the key and value columns removed. Spread() adds a new column for each unique value of the key column. It then distributes cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.

gather()

This collects a set of column names and places them into a single “key” column. It also collects the cells of those columns and places them into a single value column.

Gather() returns a copy of the data frame with the specified columns removed. It repeats each of the former column names to maintain each combination of values that appeared in the original data set. It uses the first string that you supplied as the name f the new “key” column, and it uses the second string as the name of the new value column.

How to split and combine values

separate()

This turns a single character column into multiple columns by splitting the values of the column wherever a separator character appears. By default, it will split values wherever a non-alphanumeric character appears.

unite()

This combines multiple columns into a single column. It returns a copy of the data frame that includes the new column, but not the columns used to build the new column. If you want to retain theses columns, add the argument remove = False.

Tutorials for the dplyr package in R

dplyr Functionality

5 basic verbs (explained earlier) 1. filter 2. select 3. arrange 4. mutate 5. summarise (and group_by)

For all dplyr verbs: 1. First argument is a data frame 2. Return value is a data frame 3. Nothing is modified in place (dplyr generally does not preserve row names)

filter: keep rows matching criteria

You can use comma or ampersand to represent and condition

“filter(flights, Month==1, DayofMonth==1)”

select: pick columns by name

Use colon to select multiple contiguous columns, and use ‘contains’ to match columns by name. ‘starts_with’, and ‘matches’ can also be used to match columns.

“select(flights, Year:DayofMonth, contains(”Taxi“), contains(”Delay“))”

“Chaining” or “Pipelining”

Chaining increases readability significantly when there are many commands.It can be used to replace nesting in R commands outside of dplyr “flights %>%

select(UniqueCarrier, DepDelay) %>%

filter(DepDelay > 60)"

arrange: reoder rows

“flights %>% select(UniqueCarrier, DepDelay) %>% arrange(DepDelay)” For descending order: arrange(desc(DepDelay))

mutate: add new variables

This creates new variables that are functions of existing variables. With this approach, it prints the new variable but does not store it. “flights %>% select(Distance, AirTime) %>% mutate(Speed = Distance/AirTime*60)" Store the new variable: “flights <- flights %>% mutate(Speed = Distance/AirTime*60)"

summarise: reduce variables to values

This is primarily useful with data that has been grouped by one or more variables. “flights %>% group_by(Dest) %>% summarise(avg_delay = mean(ArrDelay, na.rm=TRUE))”

group_by: creates groups that will be operated on

summarise: uses the provided aggregation function to summarise each group

summarise_each: allows you to apply the same summary function to multiple columns at once (mutate_each is also available)

Helper function n(): counds the number of rows in a group

n_distinct(vector): counts the number of unique items in that vector

Tally function: “flights %>% group_by(Month, DayofMonth) %>% tally(sort = TRUE)”

Window Functions

Aggregation function (like mean) takes n inputs and returns 1 value

Windows function: takes n inputs and returns n values

This includes ranking and ordering functions (like min_rank), offset functions (lead and lag) and cumulative aggregates (like cummean)

Other useful Convenience Functions

Randomly sample a fixed number of rows, without replacement

Randomly sample a fixed number of rows, without replacement

flights %>% sample_n(5)

Randomly sample a fraction of rows, with replacement

flights %>% sample_frac(0.25, replace=TRUE)

Base R approach to view the structure of an object

str(flights)

dplyr approach: better formatting, and adapts to your screen width

glimpse(flights)

Conneceting to Databases

Dplyr can connect to a database as if the data was loaded into a data frame

More Tutorials for dplyr

Choosing columns:select, rename

Use select() to pick columns

flights %>% select(carrier, flight)

You can use the minus sign to hide columns

flights %>% select(-month, -day)

hide a range of columns

flights %>% select(-(dep_time:arr_delay))

hide any column with a matching name

flights %>% select(-contains(“time”))

pick columns using a character vector of column names

cols <- c(“carrier”, “flight”, “tailnum”) flights %>% select(one_of(cols))

select() can be used to rename columns, though all columns not mentioned are dropped

flights %>% select(tail = tailnum)

rename() does the same thing, except all columns not mentioned are kept

flights %>% rename(tail = tailnum)

Choosing rows: filter, between, slice, sample_n, top_n, distinct

filter() supports the use of multiple conditions

flights %>% filter(dep_time >= 600, dep_time <= 605)

between() is a concise alternative for determing if numeric values fall in a range

flights %>% filter(between(dep_time, 600, 605))

side note: is.na() can also be useful when filtering

flights %>% filter(!is.na(dep_time))

slice() filters rows by position

flights %>% slice(1000:1005)

keep the first three rows within each group

flights %>% group_by(month, day) %>% slice(1:3)

sample three rows from each group

flights %>% group_by(month, day) %>% sample_n(3)

keep three rows from each group with the top dep_delay

flights %>% group_by(month, day) %>% top_n(3, dep_delay)

also sort by dep_delay within each group

flights %>% group_by(month, day) %>% top_n(3, dep_delay) %>% arrange(desc(dep_delay))

unique rows can be identified using unique() from base R

flights %>% select(origin, dest) %>% unique()

dplyr provides an alternative that is more “efficient”

flights %>% select(origin, dest) %>% distinct()

side note: when chaining, you don’t have to include the parentheses if there are no arguments

flights %>% select(origin, dest) %>% distinct

Adding new variables: mutate, transmute, add_rownames

mutate() creates a new variable (and keeps all existing variables)

flights %>% mutate(speed = distance/air_time*60)

transmute() only keeps the new variables

flights %>% transmute(speed = distance/air_time*60)

example data frame with row names

mtcars %>% head()

add_rownames() turns row names into an explicit variable

mtcars %>% add_rownames(“model”) %>% head()

side note: dplyr no longer prints row names (ever) for local data frames

mtcars %>% tbl_df()

Grouping and counting: summarise, tally, count, group_size, n_groups, ungroup

summarise() can be used to count the number of rows in each group

flights %>% group_by(month) %>% summarise(cnt = n())

tally() and count() can do this more concisely

flights %>% group_by(month) %>% tally() flights %>% count(month)

you can sort by the count

flights %>% group_by(month) %>% summarise(cnt = n()) %>% arrange(desc(cnt))

tally() and count() have a sort parameter for this purpose

flights %>% group_by(month) %>% tally(sort=TRUE) flights %>% count(month, sort=TRUE)

you can sum over a specific variable instead of simply counting rows

flights %>% group_by(month) %>% summarise(dist = sum(distance))

tally() and count() have a wt parameter for this purpose

flights %>% group_by(month) %>% tally(wt = distance) flights %>% count(month, wt = distance)

group_size() returns the counts as a vector

flights %>% group_by(month) %>% group_size()

n_groups() simply reports the number of groups

flights %>% group_by(month) %>% n_groups()

group by two variables, summarise, arrange (output is possibly confusing)

flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% arrange(desc(cnt)) %>% print(n = 40)

ungroup() before arranging to arrange across all groups

flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% ungroup() %>% arrange(desc(cnt))

Creating data frames: data_frame

Using data_frame() is better than data.frame()

data_frame() benefits:

You can us previously defined columns to compute new columns, it never coerces column types, it never adds row names, it only recycles length 1 input and it returns a local data frame (a tbl_df)

Example:

data_frame(a = 1:6, b = a*2, c = ‘string’, ‘d+e’ = 1) %>% glimpse()

Joining (merging) tables: left_join, right_join, inner_join, full_join, semi_join, anti_join

create two simple data frames as an example

(a <- data_frame(color = c(“green”,“yellow”,“red”), num = 1:3)) (b <- data_frame(color = c(“green”,“yellow”,“pink”), size = c(“S”,“M”,“L”)))

only include observations found in both “a” and “b” (automatically joins on variables that appear in both tables)

inner_join(a, b)

include observations found in either “a” or “b”

full_join(a, b)

include all observations found in “a”

left_join(a, b)

include all observations found in “b”

right_join(a, b)

right_join(a, b) is identical to left_join(b, a) except for column ordering

left_join(b, a)

filter “a” to only show observations that match “b”

semi_join(a, b)

filter “a” to only show observations that don’t match “b”

anti_join(a, b)

sometimes matching variables don’t have identical names

b <- b %>% rename(col = color)

specify that the join should occur by matching “color” in “a” with “col” in “b”

inner_join(a, b, by=c(“color” = “col”))

Viewing more output: print, view

specify that you want to see more rows

flights %>% print(n = 15)

specify that you want to see ALL rows (don’t run this!)

flights %>% print(n = Inf)

specify that you want to see all columns

flights %>% print(width = Inf)

show up to 1000 rows and all columns

flights %>% View()

set option to see all columns and fewer rows

options(dplyr.width = Inf, dplyr.print_min = 6)

reset options (or just close R)

options(dplyr.width = NULL, dplyr.print_min = 10)