Preamble

In this notes we shall focus on ‘dplyr’ package which is one of the member of Tidyverse family. The dplyr is an Important tool for data wrangling in R. Learning and using dplyr makes data preparation and management process quick. It is generally refered to as “Grammar of Data Manipulation”. This is because the package provides a set of verbs (functions) to describe and perform common data preparation tasks.

We will look at core dplyr functions such as

  1. filter()
  2. group_by()
  3. mutate()
  4. arrange()
  5. summarize()
  6. select()

Loading the packages needed

#dplyr for data wrangling
library(dplyr)

Dataset

We shall make use of ‘Wage’ dataset

library(ISLR)
data=Wage

This dataset provides wage and other data for a group of 3000 male workers in Mid-Atlantic region. Let us have a quick look at the variables in the data

Glimpse of Wage data
Variable.name Description Type
year Year that wage information was recorded Year
age Age of worker Numeric
maritl marital status Str
race indicating race of person Str
education Education level Str
region Indication mid-atlantic Str
jobclass type of job Str
health health level of worker Str
health_ins health insurance status Str
logwage Log of worker’s wage Numeric
wage worke’s raw wage Numeric

Data wrangling with Wage data

We shall perform data manipulation with our Wage dataset.

filter

Considering education, we have

## 5 levels which are
## [1] "1. < HS Grad"       "2. HS Grad"         "3. Some College"   
## [4] "4. College Grad"    "5. Advanced Degree"

Now we try to subset the data where individuals of education level ‘2.HS Grad’

data1=data %>% 
  filter(education=='2. HS Grad')
Let us have a look at first few rows of data1
year age maritl race education region jobclass health health_ins logwage wage
2005 50
  1. Divorced
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Information
  1. <=Good
  1. Yes
4.318063 75.04315
2004 52
  1. Married
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
4.857333 128.68049
2007 34
  1. Married
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Industrial
  1. >=Very Good
  1. No
4.397940 81.28325
2005 35
  1. Never Married
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
4.494155 89.49248
2009 54
  1. Married
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
4.903090 134.70538
2005 40
  1. Divorced
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. No
3.920123 50.40666

We can add more than one condition for filtering data, for this we shall consider jobclass ‘1. Industrial’ and education level ‘2.HS Grad’, we can perform this using following way

data2=data %>% 
  filter(education=='2. HS Grad',jobclass=='1. Industrial')
Let us look at first few rows of the data
year age maritl race education region jobclass health health_ins logwage wage
2007 34
  1. Married
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Industrial
  1. >=Very Good
  1. No
4.397940 81.28325
2006 38
  1. Never Married
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Industrial
  1. <=Good
  1. No
4.544068 94.07271
2006 43
  1. Married
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Industrial
  1. <=Good
  1. Yes
4.431364 84.04596
2004 34
  1. Married
  1. Other
  1. HS Grad
  1. Middle Atlantic
  1. Industrial
  1. <=Good
  1. Yes
4.176091 65.11085
2008 57
  1. Married
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Industrial
  1. >=Very Good
  1. No
4.755875 116.26532
2003 18
  1. Never Married
  1. Black
  1. HS Grad
  1. Middle Atlantic
  1. Industrial
  1. >=Very Good
  1. Yes
4.255273 70.47602
We can also add more than one category for single variable using ‘|’ operator for either or condition
data3=data %>% 
  filter(education=='2. HS Grad'|education=='3. Some College',jobclass=='2. Information')
year age maritl race education region jobclass health health_ins logwage wage
2005 50
  1. Divorced
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Information
  1. <=Good
  1. Yes
4.318063 75.04315
2008 30
  1. Never Married
  1. Asian
  1. Some College
  1. Middle Atlantic
  1. Information
  1. <=Good
  1. Yes
4.716003 111.72085
2006 41
  1. Never Married
  1. Black
  1. Some College
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
4.778151 118.88436
2004 52
  1. Married
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
4.857333 128.68049
2007 45
  1. Divorced
  1. White
  1. Some College
  1. Middle Atlantic
  1. Information
  1. <=Good
  1. Yes
4.763428 117.14682
2005 35
  1. Never Married
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
4.494155 89.49248

More than two categories can be considered for filtering using ‘%in%’ operator

data4=data %>% 
  filter(education%in%c('2. HS Grad','3. Some College','4. College Grad'),jobclass=='2. Information')
year age maritl race education region jobclass health health_ins logwage wage
2004 24
  1. Never Married
  1. White
  1. College Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. No
4.255273 70.47602
2003 43
  1. Married
  1. Asian
  1. College Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
5.041393 154.68529
2005 50
  1. Divorced
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Information
  1. <=Good
  1. Yes
4.318063 75.04315
2008 54
  1. Married
  1. White
  1. College Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
4.845098 127.11574
2008 30
  1. Never Married
  1. Asian
  1. Some College
  1. Middle Atlantic
  1. Information
  1. <=Good
  1. Yes
4.716003 111.72085
2006 41
  1. Never Married
  1. Black
  1. Some College
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
4.778151 118.88436

group_by and summarize

Grouping data by one variable and summarizing

summary=data %>% 
  group_by(maritl) %>% 
  summarize(mean = mean(wage, na.rm = TRUE),std_dev = sd(wage, na.rm = TRUE))
Let’s see the summary table
maritl mean std_dev
  1. Never Married
92.73465 32.92007
  1. Married
118.86026 43.12055
  1. Widowed
99.53866 23.74106
  1. Divorced
103.15926 33.80098
  1. Separated
101.21579 33.66338

It is safe to ungroup the data which we have grouped, this can be done as

data %>% group_by(maritl) %>% ungroup()

Grouping by more than one variable can be done using

summary2=data %>%
  group_by(maritl,education) %>%
  summarize(mean=mean(wage,na.rm=T),std_dev=sd(wage,na.rm=T),number_of_observations=n())
The resulting table is
maritl education mean std_dev number_of_observations
  1. Never Married
  1. < HS Grad
73.42270 15.705084 62
  1. Never Married
  1. HS Grad
80.27383 19.462133 219
  1. Never Married
  1. Some College
91.02989 28.917466 164
  1. Never Married
  1. College Grad
108.22921 39.335340 143
  1. Never Married
  1. Advanced Degree
125.90330 38.968472 60
  1. Married
  1. < HS Grad
88.10074 22.809326 174
  1. Married
  1. HS Grad
101.54363 29.797860 651
  1. Married
  1. Some College
114.92456 32.378203 421
  1. Married
  1. College Grad
129.88437 40.416962 487
  1. Married
  1. Advanced Degree
156.72968 55.268681 341
  1. Widowed
  1. < HS Grad
58.31423 5.255327 2
  1. Widowed
  1. HS Grad
101.93244 16.075910 8
  1. Widowed
  1. Some College
103.43270 21.851952 2
  1. Widowed
  1. College Grad
103.52201 24.742911 5
  1. Widowed
  1. Advanced Degree
117.33551 34.562904 2
  1. Divorced
  1. < HS Grad
91.04488 18.504915 16
  1. Divorced
  1. HS Grad
88.83916 24.196654 73
  1. Divorced
  1. Some College
105.19988 26.912748 52
  1. Divorced
  1. College Grad
118.35697 45.834830 41
  1. Divorced
  1. Advanced Degree
126.34012 34.392348 22
  1. Separated
  1. < HS Grad
77.49285 10.872880 14
  1. Separated
  1. HS Grad
101.00232 22.118534 20
  1. Separated
  1. Some College
95.61057 9.478846 11
  1. Separated
  1. College Grad
125.82291 23.996230 9
  1. Separated
  1. Advanced Degree
277.79948 NA 1
data %>% 
  group_by(maritl,education) %>%
  ungroup()

mutate

We can create a new variable using existing variables easily with the help of mutate. For this let us consider finding the birth year of persons since we have age and year when the responses were collected.

B_data=data %>%  mutate(Birth_year = year-age)
Let’s see first few rows of the data
year age maritl race education region jobclass health health_ins logwage wage Birth_year
2006 18
  1. Never Married
  1. White
  1. < HS Grad
  1. Middle Atlantic
  1. Industrial
  1. <=Good
  1. No
4.318063 75.04315 1988
2004 24
  1. Never Married
  1. White
  1. College Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. No
4.255273 70.47602 1980
2003 45
  1. Married
  1. White
  1. Some College
  1. Middle Atlantic
  1. Industrial
  1. <=Good
  1. Yes
4.875061 130.98218 1958
2003 43
  1. Married
  1. Asian
  1. College Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
5.041393 154.68529 1960
2005 50
  1. Divorced
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Information
  1. <=Good
  1. Yes
4.318063 75.04315 1955
2008 54
  1. Married
  1. White
  1. College Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
4.845098 127.11574 1954

We can create more than 1 variable using mutate as follows, let us find birth year and deviation of wage from its mean

N_data=data %>% 
  mutate(Birth_year=year-age,mean_dev=wage-mean(wage))
year age maritl race education region jobclass health health_ins logwage wage Birth_year mean_dev
2006 18
  1. Never Married
  1. White
  1. < HS Grad
  1. Middle Atlantic
  1. Industrial
  1. <=Good
  1. No
4.318063 75.04315 1988 -36.66045
2004 24
  1. Never Married
  1. White
  1. College Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. No
4.255273 70.47602 1980 -41.22759
2003 45
  1. Married
  1. White
  1. Some College
  1. Middle Atlantic
  1. Industrial
  1. <=Good
  1. Yes
4.875061 130.98218 1958 19.27857
2003 43
  1. Married
  1. Asian
  1. College Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
5.041393 154.68529 1960 42.98168
2005 50
  1. Divorced
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Information
  1. <=Good
  1. Yes
4.318063 75.04315 1955 -36.66045
2008 54
  1. Married
  1. White
  1. College Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
4.845098 127.11574 1954 15.41214

arrange

To arrange the data based on particular variable we use arrange. For example, we shall arrange the data based on year

arranged_dat=data %>% arrange(year)
Let’s look at the data
First few rows of data arranged based on ascending order of year
year age maritl race education region jobclass health health_ins logwage wage
2003 45
  1. Married
  1. White
  1. Some College
  1. Middle Atlantic
  1. Industrial
  1. <=Good
  1. Yes
4.875061 130.98218
2003 43
  1. Married
  1. Asian
  1. College Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
5.041393 154.68529
2003 39
  1. Married
  1. White
  1. College Grad
  1. Middle Atlantic
  1. Industrial
  1. >=Very Good
  1. Yes
4.903090 134.70538
2003 37
  1. Never Married
  1. Asian
  1. College Grad
  1. Middle Atlantic
  1. Industrial
  1. >=Very Good
  1. No
4.414973 82.67964
2003 37
  1. Never Married
  1. White
  1. Some College
  1. Middle Atlantic
  1. Industrial
  1. >=Very Good
  1. Yes
4.591065 98.59934
2003 38
  1. Married
  1. Asian
  1. College Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. No
5.301030 200.54326

By default it arranges in ascending order, we can obtain descending order as follows

arranged_datd=data %>% arrange(desc(age))
Let’s look at the data
First few rows of data arranged based on Descending order of age
year age maritl race education region jobclass health health_ins logwage wage
2003 80
  1. Married
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Information
  1. <=Good
  1. Yes
4.380211 79.85490
2003 80
  1. Married
  1. White
  1. Some College
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
4.544068 94.07271
2005 80
  1. Never Married
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. No
4.477121 87.98103
2003 80
  1. Married
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Industrial
  1. >=Very Good
  1. No
4.414973 82.67964
2005 77
  1. Married
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
4.602060 99.68946
2009 76
  1. Divorced
  1. White
  1. Advanced Degree
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
5.176091 176.98965

select

We can select interested columns from a dataframe using this function

selected=data %>%
  select(age,wage,education)
Let us look at the resulting dataset
age wage education
18 75.04315
  1. < HS Grad
24 70.47602
  1. College Grad
45 130.98218
  1. Some College
43 154.68529
  1. College Grad
50 75.04315
  1. HS Grad
54 127.11574
  1. College Grad

case_when

This is an important function which can be used for conditional execution, this can be used as an alternative for if_else, for example let us create a new variable ‘wage_category’ which indicate whether wage of person is less than or greater than median

data_n=data %>%
  mutate(wage_category=case_when(wage<=104.92~"Less than median",T~"Greater than median"))
Let us look at first few rows of data_n
year age maritl race education region jobclass health health_ins logwage wage wage_category
2006 18
  1. Never Married
  1. White
  1. < HS Grad
  1. Middle Atlantic
  1. Industrial
  1. <=Good
  1. No
4.318063 75.04315 Less than median
2004 24
  1. Never Married
  1. White
  1. College Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. No
4.255273 70.47602 Less than median
2003 45
  1. Married
  1. White
  1. Some College
  1. Middle Atlantic
  1. Industrial
  1. <=Good
  1. Yes
4.875061 130.98218 Greater than median
2003 43
  1. Married
  1. Asian
  1. College Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
5.041393 154.68529 Greater than median
2005 50
  1. Divorced
  1. White
  1. HS Grad
  1. Middle Atlantic
  1. Information
  1. <=Good
  1. Yes
4.318063 75.04315 Less than median
2008 54
  1. Married
  1. White
  1. College Grad
  1. Middle Atlantic
  1. Information
  1. >=Very Good
  1. Yes
4.845098 127.11574 Greater than median

Joining data frames together

Data may not always come from single source, when various data frames are obtained from various sources our primary aim will be to combine all the important data to form a single dataframe. For this let us subset the data in to two data frames

data_1=data[1:6,1:4];data_1
data_2=data[c(1:3,10),c(2,4,7,9)]
Let us look at the data
data_1
year age maritl race
2006 18
  1. Never Married
  1. White
2004 24
  1. Never Married
  1. White
2003 45
  1. Married
  1. White
2003 43
  1. Married
  1. Asian
2005 50
  1. Divorced
  1. White
2008 54
  1. Married
  1. White
data_2
age race jobclass health_ins
18
  1. White
  1. Industrial
  1. No
24
  1. White
  1. Information
  1. No
45
  1. White
  1. Industrial
  1. Yes
52
  1. White
  1. Information
  1. Yes

For joining the datasets we can use the following types of join which are

Inner Join

For two data frames data_1 and data_2, this join checks for similar values in given columns in condition and returns all rows of both the data frames where match occurs. For example this join creates a new table which will combine data_1 and data_2 based on the join-variable, “race”

dat_12i=data_1 %>% inner_join(data_2,by='race')
The resulting dataset is
year age.x maritl race age.y jobclass health_ins
2006 18
  1. Never Married
  1. White
18
  1. Industrial
  1. No
2006 18
  1. Never Married
  1. White
24
  1. Information
  1. No
2006 18
  1. Never Married
  1. White
45
  1. Industrial
  1. Yes
2006 18
  1. Never Married
  1. White
52
  1. Information
  1. Yes
2004 24
  1. Never Married
  1. White
18
  1. Industrial
  1. No
2004 24
  1. Never Married
  1. White
24
  1. Information
  1. No
2004 24
  1. Never Married
  1. White
45
  1. Industrial
  1. Yes
2004 24
  1. Never Married
  1. White
52
  1. Information
  1. Yes
2003 45
  1. Married
  1. White
18
  1. Industrial
  1. No
2003 45
  1. Married
  1. White
24
  1. Information
  1. No
2003 45
  1. Married
  1. White
45
  1. Industrial
  1. Yes
2003 45
  1. Married
  1. White
52
  1. Information
  1. Yes
2005 50
  1. Divorced
  1. White
18
  1. Industrial
  1. No
2005 50
  1. Divorced
  1. White
24
  1. Information
  1. No
2005 50
  1. Divorced
  1. White
45
  1. Industrial
  1. Yes
2005 50
  1. Divorced
  1. White
52
  1. Information
  1. Yes
2008 54
  1. Married
  1. White
18
  1. Industrial
  1. No
2008 54
  1. Married
  1. White
24
  1. Information
  1. No
2008 54
  1. Married
  1. White
45
  1. Industrial
  1. Yes
2008 54
  1. Married
  1. White
52
  1. Information
  1. Yes

Left Join

For two data frames data_1 and data_2, this join checks for similar values in condition and returns rows of dataset specified first(on Left) and all columns in both datasets. If there isn’t a match in the second table, then it will return NULL for the row

data_12l=data_1 %>% left_join(data_2,by='age')
Left join by age
year age maritl race.x race.y jobclass health_ins
2006 18
  1. Never Married
  1. White
  1. White
  1. Industrial
  1. No
2004 24
  1. Never Married
  1. White
  1. White
  1. Information
  1. No
2003 45
  1. Married
  1. White
  1. White
  1. Industrial
  1. Yes
2003 43
  1. Married
  1. Asian
NA NA NA
2005 50
  1. Divorced
  1. White
NA NA NA
2008 54
  1. Married
  1. White
NA NA NA

Right Join

For two data frames data_1 and data_2, this join checks for similar values in condition and returns rows of dataset specified second(on Right) and all columns in both datasets. If there isn’t a match in the second table, then it will return NULL for the row

data_12r=data_1 %>% right_join(data_2,by='age')
Right join by age
year age maritl race.x race.y jobclass health_ins
2006 18
  1. Never Married
  1. White
  1. White
  1. Industrial
  1. No
2004 24
  1. Never Married
  1. White
  1. White
  1. Information
  1. No
2003 45
  1. Married
  1. White
  1. White
  1. Industrial
  1. Yes
NA 52 NA NA
  1. White
  1. Information
  1. Yes

Full Join

For two data frames data_1 and data_2, this join checks for similar values in condition and returns rows of both dataset and all columns of both datasets. If there isn’t a match in the second table, then it will return NULL for the row

data_12f=data_1 %>% full_join(data_2,by='age')
Full join by age
year age maritl race.x race.y jobclass health_ins
2006 18
  1. Never Married
  1. White
  1. White
  1. Industrial
  1. No
2004 24
  1. Never Married
  1. White
  1. White
  1. Information
  1. No
2003 45
  1. Married
  1. White
  1. White
  1. Industrial
  1. Yes
2003 43
  1. Married
  1. Asian
NA NA NA
2005 50
  1. Divorced
  1. White
NA NA NA
2008 54
  1. Married
  1. White
NA NA NA
NA 52 NA NA
  1. White
  1. Information
  1. Yes

Semi Join

For two data frames data_1 and data_2, this join checks for similar values in condition and returns rows of both dataset and columns of first dataset(On left). If there isn’t a match in the second table, then it will return NULL for the row

data_12s=data_1 %>% semi_join(data_2,by='age')
Semi join by age
year age maritl race
2006 18
  1. Never Married
  1. White
2004 24
  1. Never Married
  1. White
2003 45
  1. Married
  1. White

Anti Join

For two data frames data_1 and data_2, this will return all of the rows from the first dataframe where there are not matching values from the second. Only first dataframe’s columns will be returned

data_12a=data_1 %>% anti_join(data_2,by='age')
Anti join by age
year age maritl race
2003 43
  1. Married
  1. Asian
2005 50
  1. Divorced
  1. White
2008 54
  1. Married
  1. White

Final note

In this notes we have gone through various verbs in dplyr that help us perform data wrangling such as filtering, grouping, arranging, creating new column, summarizing the dataset in representable table structure.
We have also looked at data joining verbs available in dplyr which will be greatly used while working with dataframes from multiple sources.