dplyr is a grammar of data manipulation that provides a consistent set of verbs for solving the most common data manipulation challenges. The main verbs are select() to pick columns, filter() to keep rows that match a condition, mutate() to create new columns, arrange() to reorder rows, and summarize(), which is used to summarize “one or more of its columns/variables with a summary statistic. Examples of summary statistics include the median and interquartile range” (Ismay and Kim, 2021, p.1). The group_by() function combines with the other verbs so that each operation is performed group by group. The functions in dplyr are easy to work with, share a consistent syntax, and operate on whole data frames rather than only on vectors. You can learn more about dplyr in the resources listed at the end.
In this session, learners will learn the five dplyr verbs that solve most data manipulation tasks. We will pick observations by their values, reorder rows, and work through simple to complex data manipulation challenges. By the end, learners will be able to collapse values into summary statistics.
Data wrangling is critical in data science, and with dplyr functions you can manipulate data as needed. dplyr is easier to work with and more consistent than the equivalent base R functions.
To enhance learners’ skills in data wrangling through the use of the dplyr verbs (select, filter, mutate, summarize, and arrange).
Data Manipulation in R Using the dplyr Package
Besides manipulating data frames in memory, dplyr also works with other computational backends. Packages such as dtplyr, dbplyr, and sparklyr let you write dplyr code that runs on a different backend. For instance, dtplyr translates dplyr code into high-performance data.table code.
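To make the backend idea concrete, here is a minimal sketch of how dtplyr might be used, assuming the dtplyr package is installed (it is not installed by install.packages("dplyr") alone). The object names sw, query, and short are made up for this illustration, and the fallback branch simply runs the same verbs in plain dplyr when dtplyr is unavailable.

```r
library(dplyr)

# Keep only atomic columns so the sketch does not depend on list-column handling
sw <- starwars %>%
  select(name, height, mass, species)

if (requireNamespace("dtplyr", quietly = TRUE)) {
  # lazy_dt() wraps the data so that dplyr verbs are recorded rather than run immediately
  query <- dtplyr::lazy_dt(sw) %>%
    filter(height < 96) %>%
    arrange(desc(height))
  # Printing the lazy table shows the data.table call that dtplyr generated
  print(query)
  # Converting to a tibble runs the translated code and collects the result
  short <- as_tibble(query)
} else {
  # Fallback (assumption for this sketch): the same verbs on the in-memory data frame
  short <- sw %>%
    filter(height < 96) %>%
    arrange(desc(height))
}
```

The verbs are written the same way in both branches; only the object they are applied to changes, which is the point of a backend.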
Installation
# You can install the whole tidyverse, which includes dplyr:
install.packages("tidyverse")
# Alternatively, install just dplyr:
install.packages("dplyr")
# Load dplyr so that the verbs, the %>% pipe, and the starwars data are available
library(dplyr)
Let's get a view of the data frame we are going to use.
# Preview the first rows of the starwars data frame
head(starwars)
Let's pipe starwars with %>% and select columns with select()
starwars %>%
  select(name, sex, gender, height, mass, species)
Filter rows with filter() to keep heights below 96
# Keep the male characters whose height is below 96
starwars %>%
  filter(height < 96,
         sex == "male")
Arrange rows in descending order with arrange()
# Arrange the height values in descending order
starwars %>%
  select(name,
         sex,
         gender,
         height,
         mass,
         species) %>%
  filter(height < 96) %>%
  arrange(desc(height))
Using mutate() to find body mass index
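Body mass index is mass in kilograms divided by the square of height in metres. Because starwars records height in centimetres, the height is divided by 100 first. As a quick check of that arithmetic, here is a minimal sketch on a made-up tibble; people, its names, and its values are hypothetical and chosen so the result is easy to verify by hand.

```r
library(dplyr)

# Hypothetical data: height in centimetres, mass in kilograms
people <- tibble(
  name   = c("A", "B"),
  height = c(200, 150),
  mass   = c(100, 45)
)

with_bmi <- people %>%
  # Convert cm to m, square it, and divide mass by the result
  mutate(body_mass_index = mass / (height / 100)^2)

with_bmi$body_mass_index
# A: 100 / 2^2   = 25
# B: 45  / 1.5^2 = 20
```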
# Create a new variable holding the body mass index
starwars %>%
  select(name,
         sex,
         gender,
         height,
         mass,
         species) %>%
  mutate(body_mass_index = mass /
           ((height / 100) ^ 2))
Summarize the data frame using summarize() to determine the mean of height
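Two details of grouped summaries are easy to miss: summarize() returns one row per group, and a single missing value turns mean() into NA unless na.rm = TRUE is set. The sketch here illustrates both on a made-up tibble; toy, its columns, and its values are hypothetical and are not taken from starwars.

```r
library(dplyr)

# Hypothetical data with one missing height and one missing gender
toy <- tibble(
  gender = c("feminine", "feminine", "masculine", "masculine", NA),
  height = c(150, 170, 180, NA, 96)
)

by_gender <- toy %>%
  group_by(gender) %>%
  summarize(n = n(),                                   # rows per group
            mean_naive  = mean(height),                # NA if any height is missing
            mean_height = mean(height, na.rm = TRUE))  # ignores missing heights

by_gender
# feminine:  n = 2, mean_naive = 160, mean_height = 160
# masculine: n = 2, mean_naive = NA,  mean_height = 180
# NA:        n = 1, mean_naive = 96,  mean_height = 96
```

The NA gender forms its own group, which is why a final filter(!is.na(gender)) is needed to drop it.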
# Count the characters and find the mean height by gender, then use filter() to remove the NA group
starwars %>%
  select(name,
         sex,
         gender,
         height,
         mass,
         species) %>%
  group_by(gender) %>%
  summarize(n = n(),
            mean_height = mean(height,
                               na.rm = TRUE)) %>%
  filter(!is.na(gender))
Let's combine several of the verbs
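# Look at count, mass, and height by birth year and find each group's share of the total
starwars %>%
  group_by(birth_year) %>%
  # na.rm = TRUE keeps missing values from turning the totals into NA
  summarize(count = n(),
            tot_mass = sum(mass, na.rm = TRUE),
            tot_height = sum(height, na.rm = TRUE)) %>%
  ungroup() %>%
  # percent() comes from the scales package, which is installed with the tidyverse
  mutate(tot_pro = count / sum(count),
         mass_pro = scales::percent(tot_mass / sum(tot_mass)),
         height_pro = scales::percent(tot_height / sum(tot_height))) %>%
  filter(birth_year < 31)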
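The group-then-share pattern can be checked by hand on a small example. This is a minimal sketch on a made-up tibble (toy, group, and the values are hypothetical): each group's proportion is its total divided by the sum of all group totals, so the proportions should add up to 1.

```r
library(dplyr)

# Hypothetical data: two groups, one missing mass
toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  mass  = c(10, 30, 20, NA, 40)
)

shares <- toy %>%
  group_by(group) %>%
  summarize(count = n(),
            tot_mass = sum(mass, na.rm = TRUE)) %>%  # a: 40, b: 60
  mutate(count_pro = count / sum(count),             # a: 2/5, b: 3/5
         mass_pro  = tot_mass / sum(tot_mass)) %>%   # a: 0.4, b: 0.6
  arrange(desc(mass_pro))                            # largest share first

shares
```

Because summarize() has already collapsed each group to one row, sum(count) and sum(tot_mass) inside mutate() run over all groups, which is what makes the result a share of the whole.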
Learn more about dplyr with the following:
Resource I dplyr
Resource II The 5 verbs of dplyr
Resource III Introduction to dplyr
This code-through references and cites the following sources:
Chester Ismay and Albert Y. Kim, with a foreword by Kelly S. McConville (August 9, 2021). Statistical Inference via Data Science
Hadley Wickham and Garrett Grolemund (January 2017). R for Data Science