The 4 fundamental processes to turn data into understanding, knowledge and insight consist of: data manipulation, data visualization, statistical analysis, and deployment of results. Having well-structured data, “makes data suitable for software processing whether that be mathematical functions, visualization, etc. and reveals information and insights.” Both packages are useful for cleaning, processing, and manipulating data.
“gather(data, key, value, …, na.rm = FALSE, convert = FALSE or data %>% gather (key, value, …, na.rm = FALSE, convert = FALSE”
“key: column name representing new variable”
“value: column name representing variable values”
“…: names of columns to gather (or not gather)”
“na.rm: option to remove observations with missing values (represented by NAs)”
“convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate”
“separate(data, col, into, sep =” “, remove = TRUE, convert = FALSE) or data %>% separate(col, into, sep =” “, remove = TRUE, convert = FALSE)”
“col: column name representing current variable into: names of variables representing new variables”
“sep: how to separate current variable (char, num, or symbol)”
" remove: if TRUE, remove input column from output data frame"
" convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate"
“unite(data, col, …, sep =” “, remove = TRUE or data %>% unite(col, …, sep =” “, remove = TRUE”
" col: column name of new “merged” column"
“…: names of columns to merge”
" sep: separator to use between merged values"
“remove: if TRUE, remove input column from output data frame”
" spread(data, key, value, fill = NA, convert = FALSE) or data %>% spread(key, value, fill = NA, convert = FALSE)"
“key: column values to convert to multiple columns”
“value: single column values to convert to multiple columns’ values”
“fill: If there isn’t a value for every combination of the other variables and the key column, this value will be substituted”
“convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate”
“select(data,…) or data %>% select(…)”
" starts_with(x, ignore.case = TRUE): names starts with x "
“ends_with(x, ignore.case = TRUE): names ends in x”
“contains(x, ignore.case = TRUE): selects all variables whose name contains x”
“filter(data,…)” “…: conditions to be met”
“group-by(data, …) or data %>% group_by(…) …: variables to group by”
“summarise(data, …) or data %>% summarise (…) …: name-value pairs of logic function(s) like min(), mean(), max(), etc.”
“arrange(data, …) or data %>% arrange(…) …: variable(s) to order”
“inner_join(x, y, by = NULL”
“left_join(x, y, by = NULL”
“semi_join(x, y, by = NULL”
“anti_join(x, y, by = NULL”
“x,y: data frames to join by: a character vector of variables to join by. If NULL, the default, join will do a natural join using all variables with common names across the two tables”
“mutate(data, …) or data %>% mutate(…) …: Expression(s)”
A data frame is a list of vectors that R displays as a table. When your data is tidy, the values of each variable fall in their own column vector. Tidy data arranges values so that the relationships in the data parallel the structure of the data frame. In Tidy Data, each variable is assigned to its own column (its own vector in the data frame).
A key value pair is a simple way to record information. A pair contains 2 parts: 1. A key that explains what the information describes 2. A value that contains the actual information
Data values form natural key value pairs, The value is the value of the pair and the variable that the value describes is the key. In tidy data, each cell will contain a value and each column will contain a key.
This turns a pair of key:value columns into a set of tidy columns.
This function returns a copy of your data set that has had the key and value columns removed. Spread() adds a new column for each unique value of the key column. It then distributes cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.
This collects a set of column names and places them into a single “key” column. It also collects the cells of those columns and places them into a single value column.
Gather() returns a copy of the data frame with the specified columns removed. It repeats each of the former column names to maintain each combination of values that appeared in the original data set. It uses the first string that you supplied as the name f the new “key” column, and it uses the second string as the name of the new value column.
This turns a single character column into multiple columns by splitting the values of the column wherever a separator character appears. By default, it will split values wherever a non-alphanumeric character appears.
This combines multiple columns into a single column. It returns a copy of the data frame that includes the new column, but not the columns used to build the new column. If you want to retain theses columns, add the argument remove = False.
5 basic verbs (explained earlier) 1. filter 2. select 3. arrange 4. mutate 5. summarise (and group_by)
For all dplyr verbs: 1. First argument is a data frame 2. Return value is a data frame 3. Nothing is modified in place (dplyr generally does not preserve row names)
You can use comma or ampersand to represent and condition
“filter(flights, Month==1, DayofMonth==1)”
Use colon to select multiple contiguous columns, and use ‘contains’ to match columns by name. ‘starts_with’, and ‘matches’ can also be used to match columns.
“select(flights, Year:DayofMonth, contains(”Taxi“), contains(”Delay“))”
Chaining increases readability significantly when there are many commands.It can be used to replace nesting in R commands outside of dplyr “flights %>%
select(UniqueCarrier, DepDelay) %>%
filter(DepDelay > 60)"
“flights %>% select(UniqueCarrier, DepDelay) %>% arrange(DepDelay)” For descending order: arrange(desc(DepDelay))
This creates new variables that are functions of existing variables. With this approach, it prints the new variable but does not store it. “flights %>% select(Distance, AirTime) %>% mutate(Speed = Distance/AirTime*60)" Store the new variable: “flights <- flights %>% mutate(Speed = Distance/AirTime*60)"
This is primarily useful with data that has been grouped by one or more variables. “flights %>% group_by(Dest) %>% summarise(avg_delay = mean(ArrDelay, na.rm=TRUE))”
summarise_each: allows you to apply the same summary function to multiple columns at once (mutate_each is also available)
Tally function: “flights %>% group_by(Month, DayofMonth) %>% tally(sort = TRUE)”
This includes ranking and ordering functions (like min_rank), offset functions (lead and lag) and cumulative aggregates (like cummean)
flights %>% sample_n(5)
flights %>% sample_frac(0.25, replace=TRUE)
str(flights)
glimpse(flights)
Dplyr can connect to a database as if the data was loaded into a data frame
flights %>% select(carrier, flight)
flights %>% select(-month, -day)
flights %>% select(-(dep_time:arr_delay))
flights %>% select(-contains(“time”))
cols <- c(“carrier”, “flight”, “tailnum”) flights %>% select(one_of(cols))
flights %>% select(tail = tailnum)
flights %>% rename(tail = tailnum)
flights %>% filter(dep_time >= 600, dep_time <= 605)
flights %>% filter(between(dep_time, 600, 605))
flights %>% filter(!is.na(dep_time))
flights %>% slice(1000:1005)
flights %>% group_by(month, day) %>% slice(1:3)
flights %>% group_by(month, day) %>% sample_n(3)
flights %>% group_by(month, day) %>% top_n(3, dep_delay)
flights %>% group_by(month, day) %>% top_n(3, dep_delay) %>% arrange(desc(dep_delay))
flights %>% select(origin, dest) %>% unique()
flights %>% select(origin, dest) %>% distinct()
flights %>% select(origin, dest) %>% distinct
flights %>% mutate(speed = distance/air_time*60)
flights %>% transmute(speed = distance/air_time*60)
mtcars %>% head()
mtcars %>% add_rownames(“model”) %>% head()
mtcars %>% tbl_df()
flights %>% group_by(month) %>% summarise(cnt = n())
flights %>% group_by(month) %>% tally() flights %>% count(month)
flights %>% group_by(month) %>% summarise(cnt = n()) %>% arrange(desc(cnt))
flights %>% group_by(month) %>% tally(sort=TRUE) flights %>% count(month, sort=TRUE)
flights %>% group_by(month) %>% summarise(dist = sum(distance))
flights %>% group_by(month) %>% tally(wt = distance) flights %>% count(month, wt = distance)
flights %>% group_by(month) %>% group_size()
flights %>% group_by(month) %>% n_groups()
flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% arrange(desc(cnt)) %>% print(n = 40)
flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% ungroup() %>% arrange(desc(cnt))
Using data_frame() is better than data.frame()
You can us previously defined columns to compute new columns, it never coerces column types, it never adds row names, it only recycles length 1 input and it returns a local data frame (a tbl_df)
data_frame(a = 1:6, b = a*2, c = ‘string’, ‘d+e’ = 1) %>% glimpse()
(a <- data_frame(color = c(“green”,“yellow”,“red”), num = 1:3)) (b <- data_frame(color = c(“green”,“yellow”,“pink”), size = c(“S”,“M”,“L”)))
inner_join(a, b)
full_join(a, b)
left_join(a, b)
right_join(a, b)
left_join(b, a)
semi_join(a, b)
anti_join(a, b)
b <- b %>% rename(col = color)
inner_join(a, b, by=c(“color” = “col”))
flights %>% print(n = 15)
flights %>% print(n = Inf)
flights %>% print(width = Inf)
flights %>% View()
options(dplyr.width = Inf, dplyr.print_min = 6)
options(dplyr.width = NULL, dplyr.print_min = 10)