Introduction

dplyr is a grammar of data manipulation: it provides a consistent set of verbs that address the most common data manipulation challenges. The main verbs are select() to pick columns, filter() to keep rows that meet a condition, mutate() to create new columns, arrange() to reorder rows, and summarize(), which is used to summarize “one or more of its columns/variables with a summary statistic. Examples of summary statistics include the median and interquartile range” (Chester and Foreword, 2021, p.1). The group_by() function combines with the other verbs so that any of these operations can be performed group by group. The functions in dplyr share a consistent syntax and operate on whole data frames rather than on individual vectors, which makes them easier to work with. You can learn more in the official dplyr documentation.
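As a quick illustration of how group_by() and summarize() work together, here is a minimal sketch on a made-up data frame. The scores table and its column names are hypothetical and exist only for this example; it assumes dplyr is already installed.

```r
library(dplyr)

# Hypothetical toy data: five test scores spread over two classes
scores <- tibble(
  class = c("A", "A", "B", "B", "B"),
  score = c(70, 90, 60, 80, 100)
)

# One row per class, with the median and interquartile range of its scores
class_summary <- scores %>%
  group_by(class) %>%
  summarize(median_score = median(score),
            iqr_score = IQR(score))

class_summary
```

Each class collapses to a single row, which is the behavior the quoted definition of summarize() describes.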

Content Overview

In this session, learners will cover the five dplyr functions that solve most data manipulation tasks. We will pick observations by their values, reorder rows, and work through challenges that range from simple to complex. By the end, learners will be able to collapse many values into a single summary statistic.

Why You Should Care

Data wrangling is a critical part of data science, and dplyr functions let you manipulate data as needed. dplyr is easier to work with and more consistent than the equivalent base R functions.
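To make the comparison with base R concrete, the sketch below performs the same task both ways, using the starwars data frame that ships with dplyr. The height threshold of 200 is an arbitrary choice for illustration.

```r
library(dplyr)

# Base R: subset rows and columns with bracket indexing,
# handling missing heights by hand
base_result <- starwars[!is.na(starwars$height) & starwars$height > 200,
                        c("name", "height")]

# dplyr: the same task expressed with verbs; filter() drops NA rows itself
dplyr_result <- starwars %>%
  filter(height > 200) %>%
  select(name, height)

# Check that both approaches picked the same characters
identical(base_result$name, dplyr_result$name)
```

The dplyr version reads top to bottom as a sequence of steps, and it does not need the explicit is.na() guard.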

Learning Objectives

To enhance learner’s skills in data wrangling through the use of dplyr verbs ( select, filter, mutate, summarize and arrange).

Body Title

Data Manipulation in R Using the dplyr Package

Further Exposition

Besides manipulating data frames held in memory, dplyr works with other computation backends. Packages such as dtplyr, dbplyr, and sparklyr let you write dplyr code and have it run on a different engine. For instance, dtplyr translates dplyr code into high-performance data.table code.
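The sketch below shows roughly what that translation looks like. It assumes the dtplyr and data.table packages are installed and is skipped otherwise; the choice of columns and the height threshold are illustrative only.

```r
library(dplyr)

# Assumption: dtplyr (and data.table) are installed; skip quietly otherwise
if (requireNamespace("dtplyr", quietly = TRUE)) {
  # Wrap the data so dplyr verbs are recorded instead of run immediately
  lazy_starwars <- dtplyr::lazy_dt(starwars %>% select(name, height, mass))

  tall_query <- lazy_starwars %>%
    filter(height > 200) %>%
    arrange(desc(height))

  # Print the data.table code that dtplyr generated for the pipeline
  show_query(tall_query)

  # Run the translated code and bring the result back as a tibble
  tall_characters <- as_tibble(tall_query)
}
```

Nothing is computed until the result is requested with as_tibble(), which gives dtplyr the chance to translate the whole pipeline at once.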


Installation

# You can install the whole tidyverse, which includes dplyr:

install.packages("tidyverse")

# Alternatively, install just dplyr:

install.packages("dplyr")


Let's take a look at the data frame we are going to use, the starwars dataset that ships with dplyr.

# Load dplyr, then view the first rows of the starwars data frame
library(dplyr)
head(starwars)


Let's pipe starwars with %>% and select columns with select()

starwars %>%
  select(name, sex, gender, height, mass, species)
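Listing columns by name is not the only option. As a small sketch, select() also accepts helper functions such as ends_with() and a minus sign to drop columns:

```r
library(dplyr)

# Keep name plus every column whose name ends in "color"
color_columns <- starwars %>%
  select(name, ends_with("color"))

# Drop the three list columns and keep everything else
no_list_columns <- starwars %>%
  select(-c(films, vehicles, starships))

names(color_columns)
names(no_list_columns)
```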


Filter rows with filter() to find characters with a height below 96

# Keep male characters whose height is less than 96

starwars %>%
  filter(height < 96,
         sex == "male")
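Conditions separated by commas are combined with AND. The sketch below shows a few other common patterns; the species names and the thresholds are chosen only for illustration.

```r
library(dplyr)

# Rows where the condition evaluates to NA are dropped,
# so characters with an unknown height never appear here
short_characters <- starwars %>%
  filter(height < 96)

# Match any of several values with %in%
droids_or_humans <- starwars %>%
  filter(species %in% c("Droid", "Human"))

# Keep rows that satisfy either condition with |
extremes <- starwars %>%
  filter(height < 96 | height > 220)
```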


Arrange rows in descending order with arrange()

# Arrange the filtered rows by height in descending order

starwars %>% 
  select(name, 
         sex, 
         gender, 
         height,
         mass, species ) %>% 
  filter(height < 96) %>% 
  arrange(desc(height))
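arrange() can also sort by more than one column, with later columns breaking ties in earlier ones. A minimal sketch, sorting by species and then by height from tallest to shortest within each species:

```r
library(dplyr)

# Sort alphabetically by species, then by descending height inside each species
sorted_characters <- starwars %>%
  select(name, species, height) %>%
  arrange(species, desc(height))

# Look at the humans only to see the within-group ordering
sorted_characters %>%
  filter(species == "Human")
```

Missing values are placed at the end of the sort, whether the order is ascending or descending.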


Use mutate() to compute the body mass index

# Create a new variable, body_mass_index, from mass (kg) and height (cm)
starwars %>% 
  select(name, 
         sex, 
         gender, 
         height,
         mass, species ) %>% 
  mutate(body_mass_index = mass /
           ((height / 100)  ^ 2) )
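Within a single mutate() call, a new column can be used by the columns defined after it. The sketch below splits the calculation into two steps; height_m is a helper column introduced only for this example.

```r
library(dplyr)

bmi_table <- starwars %>%
  select(name, height, mass) %>%
  mutate(height_m = height / 100,              # convert centimetres to metres
         body_mass_index = mass / height_m^2)  # reuse the column created above

head(bmi_table)
```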


Summarize the data frame with summarize() to determine the mean height

# Count the characters of each gender and compute their mean height,
# then use filter() to remove the NA group

starwars %>% 
  select(name, 
         sex, 
         gender, 
         height, 
         mass, 
         species) %>% 
  group_by(gender) %>% 
  summarize(n = n(), 
            mean_height = mean(height, 
                               na.rm = TRUE)) %>%  
  filter(!is.na(gender)) 
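The same pattern works with other summary statistics. As a sketch, the version below removes the missing genders before grouping and reports the median and interquartile range instead of the mean:

```r
library(dplyr)

height_by_gender <- starwars %>%
  filter(!is.na(gender)) %>%   # drop unknown genders before grouping
  group_by(gender) %>%
  summarize(n = n(),
            median_height = median(height, na.rm = TRUE),
            iqr_height = IQR(height, na.rm = TRUE))

height_by_gender
```

Filtering before group_by() gives the same groups as filtering afterwards, but it avoids computing a summary for a group that is then thrown away.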


Let's combine all five verbs

# Group the characters by birth year, then find each group's share of the
# total count, mass, and height

starwars %>% 
  select(birth_year, mass, height) %>% 
  group_by(birth_year) %>% 
  summarize(count = n(),
            tot_mass = sum(mass, na.rm = TRUE),      # ignore missing masses
            tot_height = sum(height, na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(tot_pro = count / sum(count),
         mass_pro = tot_mass / sum(tot_mass),
         height_pro = tot_height / sum(tot_height)) %>% 
  filter(birth_year < 31) %>% 
  arrange(birth_year)
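When the summary is just a count per group, count() is a shortcut for group_by() followed by summarize(n = n()). A minimal sketch that computes each species' share of all characters:

```r
library(dplyr)

species_share <- starwars %>%
  count(species, sort = TRUE) %>%   # one row per species, largest group first
  mutate(prop = n / sum(n))         # share of all characters

head(species_share)
```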



Further Resources

Learn more about dplyr and data wrangling with the following:




Works Cited

This code through references and cites the following sources: