Preamble

In this notes we shall focus on ‘dplyr’ package which is one of the member of Tidyverse family. The dplyr is an Important tool for data wrangling in R. Learning and using dplyr makes data preparation and management process quick. It is generally refered to as “Grammar of Data Manipulation”. This is because the package provides a set of verbs (functions) to describe and perform common data preparation tasks.

We will look at core dplyr functions such as

filter()
group_by()
mutate()
arrange()
summarize()
select()

Loading the packages needed

#dplyr for data wrangling
library(dplyr)

Dataset

We shall make use of ‘Wage’ dataset

library(ISLR)
data=Wage

This dataset provides wage and other data for a group of 3000 male workers in Mid-Atlantic region. Let us have a quick look at the variables in the data

Glimpse of Wage data
Variable.name	Description	Type
year	Year that wage information was recorded	Year
age	Age of worker	Numeric
maritl	marital status	Str
race	indicating race of person	Str
education	Education level	Str
region	Indication mid-atlantic	Str
jobclass	type of job	Str
health	health level of worker	Str
health_ins	health insurance status	Str
logwage	Log of worker’s wage	Numeric
wage	worke’s raw wage	Numeric

Data wrangling with Wage data

We shall perform data manipulation with our Wage dataset.

filter

Considering education, we have

## 5 levels which are

## [1] "1. < HS Grad"       "2. HS Grad"         "3. Some College"   
## [4] "4. College Grad"    "5. Advanced Degree"

Now we try to subset the data where individuals of education level ‘2.HS Grad’

data1=data %>% 
  filter(education=='2. HS Grad')

Let us have a look at first few rows of data1

year	age	maritl	race	education	region	jobclass	health	health_ins	logwage	wage
2005	50	Divorced	White	HS Grad	Middle Atlantic	Information	<=Good	Yes	4.318063	75.04315
2004	52	Married	White	HS Grad	Middle Atlantic	Information	>=Very Good	Yes	4.857333	128.68049
2007	34	Married	White	HS Grad	Middle Atlantic	Industrial	>=Very Good	No	4.397940	81.28325
2005	35	Never Married	White	HS Grad	Middle Atlantic	Information	>=Very Good	Yes	4.494155	89.49248
2009	54	Married	White	HS Grad	Middle Atlantic	Information	>=Very Good	Yes	4.903090	134.70538
2005	40	Divorced	White	HS Grad	Middle Atlantic	Information	>=Very Good	No	3.920123	50.40666

We can add more than one condition for filtering data, for this we shall consider jobclass ‘1. Industrial’ and education level ‘2.HS Grad’, we can perform this using following way

data2=data %>% 
  filter(education=='2. HS Grad',jobclass=='1. Industrial')

Let us look at first few rows of the data

year	age	maritl	race	education	region	jobclass	health	health_ins	logwage	wage
2007	34	Married	White	HS Grad	Middle Atlantic	Industrial	>=Very Good	No	4.397940	81.28325
2006	38	Never Married	White	HS Grad	Middle Atlantic	Industrial	<=Good	No	4.544068	94.07271
2006	43	Married	White	HS Grad	Middle Atlantic	Industrial	<=Good	Yes	4.431364	84.04596
2004	34	Married	Other	HS Grad	Middle Atlantic	Industrial	<=Good	Yes	4.176091	65.11085
2008	57	Married	White	HS Grad	Middle Atlantic	Industrial	>=Very Good	No	4.755875	116.26532
2003	18	Never Married	Black	HS Grad	Middle Atlantic	Industrial	>=Very Good	Yes	4.255273	70.47602

We can also add more than one category for single variable using ‘|’ operator for either or condition

data3=data %>% 
  filter(education=='2. HS Grad'|education=='3. Some College',jobclass=='2. Information')

year	age	maritl	race	education	region	jobclass	health	health_ins	logwage	wage
2005	50	Divorced	White	HS Grad	Middle Atlantic	Information	<=Good	Yes	4.318063	75.04315
2008	30	Never Married	Asian	Some College	Middle Atlantic	Information	<=Good	Yes	4.716003	111.72085
2006	41	Never Married	Black	Some College	Middle Atlantic	Information	>=Very Good	Yes	4.778151	118.88436
2004	52	Married	White	HS Grad	Middle Atlantic	Information	>=Very Good	Yes	4.857333	128.68049
2007	45	Divorced	White	Some College	Middle Atlantic	Information	<=Good	Yes	4.763428	117.14682
2005	35	Never Married	White	HS Grad	Middle Atlantic	Information	>=Very Good	Yes	4.494155	89.49248

More than two categories can be considered for filtering using ‘%in%’ operator

data4=data %>% 
  filter(education%in%c('2. HS Grad','3. Some College','4. College Grad'),jobclass=='2. Information')

year	age	maritl	race	education	region	jobclass	health	health_ins	logwage	wage
2004	24	Never Married	White	College Grad	Middle Atlantic	Information	>=Very Good	No	4.255273	70.47602
2003	43	Married	Asian	College Grad	Middle Atlantic	Information	>=Very Good	Yes	5.041393	154.68529
2005	50	Divorced	White	HS Grad	Middle Atlantic	Information	<=Good	Yes	4.318063	75.04315
2008	54	Married	White	College Grad	Middle Atlantic	Information	>=Very Good	Yes	4.845098	127.11574
2008	30	Never Married	Asian	Some College	Middle Atlantic	Information	<=Good	Yes	4.716003	111.72085
2006	41	Never Married	Black	Some College	Middle Atlantic	Information	>=Very Good	Yes	4.778151	118.88436

group_by and summarize

Grouping data by one variable and summarizing

summary=data %>% 
  group_by(maritl) %>% 
  summarize(mean = mean(wage, na.rm = TRUE),std_dev = sd(wage, na.rm = TRUE))

Let’s see the summary table

maritl	mean	std_dev
Never Married	92.73465	32.92007
Married	118.86026	43.12055
Widowed	99.53866	23.74106
Divorced	103.15926	33.80098
Separated	101.21579	33.66338

It is safe to ungroup the data which we have grouped, this can be done as

data %>% group_by(maritl) %>% ungroup()

Grouping by more than one variable can be done using

summary2=data %>%
  group_by(maritl,education) %>%
  summarize(mean=mean(wage,na.rm=T),std_dev=sd(wage,na.rm=T),number_of_observations=n())

The resulting table is

maritl	education	mean	std_dev	number_of_observations
Never Married	< HS Grad	73.42270	15.705084	62
Never Married	HS Grad	80.27383	19.462133	219
Never Married	Some College	91.02989	28.917466	164
Never Married	College Grad	108.22921	39.335340	143
Never Married	Advanced Degree	125.90330	38.968472	60
Married	< HS Grad	88.10074	22.809326	174
Married	HS Grad	101.54363	29.797860	651
Married	Some College	114.92456	32.378203	421
Married	College Grad	129.88437	40.416962	487
Married	Advanced Degree	156.72968	55.268681	341
Widowed	< HS Grad	58.31423	5.255327	2
Widowed	HS Grad	101.93244	16.075910	8
Widowed	Some College	103.43270	21.851952	2
Widowed	College Grad	103.52201	24.742911	5
Widowed	Advanced Degree	117.33551	34.562904	2
Divorced	< HS Grad	91.04488	18.504915	16
Divorced	HS Grad	88.83916	24.196654	73
Divorced	Some College	105.19988	26.912748	52
Divorced	College Grad	118.35697	45.834830	41
Divorced	Advanced Degree	126.34012	34.392348	22
Separated	< HS Grad	77.49285	10.872880	14
Separated	HS Grad	101.00232	22.118534	20
Separated	Some College	95.61057	9.478846	11
Separated	College Grad	125.82291	23.996230	9
Separated	Advanced Degree	277.79948	NA	1

data %>% 
  group_by(maritl,education) %>%
  ungroup()

mutate

We can create a new variable using existing variables easily with the help of mutate. For this let us consider finding the birth year of persons since we have age and year when the responses were collected.

B_data=data %>%  mutate(Birth_year = year-age)

Let’s see first few rows of the data

year	age	maritl	race	education	region	jobclass	health	health_ins	logwage	wage	Birth_year
2006	18	Never Married	White	< HS Grad	Middle Atlantic	Industrial	<=Good	No	4.318063	75.04315	1988
2004	24	Never Married	White	College Grad	Middle Atlantic	Information	>=Very Good	No	4.255273	70.47602	1980
2003	45	Married	White	Some College	Middle Atlantic	Industrial	<=Good	Yes	4.875061	130.98218	1958
2003	43	Married	Asian	College Grad	Middle Atlantic	Information	>=Very Good	Yes	5.041393	154.68529	1960
2005	50	Divorced	White	HS Grad	Middle Atlantic	Information	<=Good	Yes	4.318063	75.04315	1955
2008	54	Married	White	College Grad	Middle Atlantic	Information	>=Very Good	Yes	4.845098	127.11574	1954

We can create more than 1 variable using mutate as follows, let us find birth year and deviation of wage from its mean

N_data=data %>% 
  mutate(Birth_year=year-age,mean_dev=wage-mean(wage))

year	age	maritl	race	education	region	jobclass	health	health_ins	logwage	wage	Birth_year	mean_dev
2006	18	Never Married	White	< HS Grad	Middle Atlantic	Industrial	<=Good	No	4.318063	75.04315	1988	-36.66045
2004	24	Never Married	White	College Grad	Middle Atlantic	Information	>=Very Good	No	4.255273	70.47602	1980	-41.22759
2003	45	Married	White	Some College	Middle Atlantic	Industrial	<=Good	Yes	4.875061	130.98218	1958	19.27857
2003	43	Married	Asian	College Grad	Middle Atlantic	Information	>=Very Good	Yes	5.041393	154.68529	1960	42.98168
2005	50	Divorced	White	HS Grad	Middle Atlantic	Information	<=Good	Yes	4.318063	75.04315	1955	-36.66045
2008	54	Married	White	College Grad	Middle Atlantic	Information	>=Very Good	Yes	4.845098	127.11574	1954	15.41214

arrange

To arrange the data based on particular variable we use arrange. For example, we shall arrange the data based on year

arranged_dat=data %>% arrange(year)

Let’s look at the data

First few rows of data arranged based on ascending order of year
year	age	maritl	race	education	region	jobclass	health	health_ins	logwage	wage
2003	45	Married	White	Some College	Middle Atlantic	Industrial	<=Good	Yes	4.875061	130.98218
2003	43	Married	Asian	College Grad	Middle Atlantic	Information	>=Very Good	Yes	5.041393	154.68529
2003	39	Married	White	College Grad	Middle Atlantic	Industrial	>=Very Good	Yes	4.903090	134.70538
2003	37	Never Married	Asian	College Grad	Middle Atlantic	Industrial	>=Very Good	No	4.414973	82.67964
2003	37	Never Married	White	Some College	Middle Atlantic	Industrial	>=Very Good	Yes	4.591065	98.59934
2003	38	Married	Asian	College Grad	Middle Atlantic	Information	>=Very Good	No	5.301030	200.54326

By default it arranges in ascending order, we can obtain descending order as follows

arranged_datd=data %>% arrange(desc(age))

Let’s look at the data

First few rows of data arranged based on Descending order of age
year	age	maritl	race	education	region	jobclass	health	health_ins	logwage	wage
2003	80	Married	White	HS Grad	Middle Atlantic	Information	<=Good	Yes	4.380211	79.85490
2003	80	Married	White	Some College	Middle Atlantic	Information	>=Very Good	Yes	4.544068	94.07271
2005	80	Never Married	White	HS Grad	Middle Atlantic	Information	>=Very Good	No	4.477121	87.98103
2003	80	Married	White	HS Grad	Middle Atlantic	Industrial	>=Very Good	No	4.414973	82.67964
2005	77	Married	White	HS Grad	Middle Atlantic	Information	>=Very Good	Yes	4.602060	99.68946
2009	76	Divorced	White	Advanced Degree	Middle Atlantic	Information	>=Very Good	Yes	5.176091	176.98965

select

We can select interested columns from a dataframe using this function

selected=data %>%
  select(age,wage,education)

Let us look at the resulting dataset

age	wage	education
18	75.04315	< HS Grad
24	70.47602	College Grad
45	130.98218	Some College
43	154.68529	College Grad
50	75.04315	HS Grad
54	127.11574	College Grad

case_when

This is an important function which can be used for conditional execution, this can be used as an alternative for if_else, for example let us create a new variable ‘wage_category’ which indicate whether wage of person is less than or greater than median

data_n=data %>%
  mutate(wage_category=case_when(wage<=104.92~"Less than median",T~"Greater than median"))

Let us look at first few rows of data_n

year	age	maritl	race	education	region	jobclass	health	health_ins	logwage	wage	wage_category
2006	18	Never Married	White	< HS Grad	Middle Atlantic	Industrial	<=Good	No	4.318063	75.04315	Less than median
2004	24	Never Married	White	College Grad	Middle Atlantic	Information	>=Very Good	No	4.255273	70.47602	Less than median
2003	45	Married	White	Some College	Middle Atlantic	Industrial	<=Good	Yes	4.875061	130.98218	Greater than median
2003	43	Married	Asian	College Grad	Middle Atlantic	Information	>=Very Good	Yes	5.041393	154.68529	Greater than median
2005	50	Divorced	White	HS Grad	Middle Atlantic	Information	<=Good	Yes	4.318063	75.04315	Less than median
2008	54	Married	White	College Grad	Middle Atlantic	Information	>=Very Good	Yes	4.845098	127.11574	Greater than median

Joining data frames together

Data may not always come from single source, when various data frames are obtained from various sources our primary aim will be to combine all the important data to form a single dataframe. For this let us subset the data in to two data frames

data_1=data[1:6,1:4];data_1

data_2=data[c(1:3,10),c(2,4,7,9)]

Let us look at the data

data_1
year	age	maritl	race
2006	18	Never Married	White
2004	24	Never Married	White
2003	45	Married	White
2003	43	Married	Asian
2005	50	Divorced	White
2008	54	Married	White

data_2
age	race	jobclass	health_ins
18	White	Industrial	No
24	White	Information	No
45	White	Industrial	Yes
52	White	Information	Yes

For joining the datasets we can use the following types of join which are

Inner Join

For two data frames data_1 and data_2, this join checks for similar values in given columns in condition and returns all rows of both the data frames where match occurs. For example this join creates a new table which will combine data_1 and data_2 based on the join-variable, “race”

dat_12i=data_1 %>% inner_join(data_2,by='race')

The resulting dataset is

year	age.x	maritl	race	age.y	jobclass	health_ins
2006	18	Never Married	White	18	Industrial	No
2006	18	Never Married	White	24	Information	No
2006	18	Never Married	White	45	Industrial	Yes
2006	18	Never Married	White	52	Information	Yes
2004	24	Never Married	White	18	Industrial	No
2004	24	Never Married	White	24	Information	No
2004	24	Never Married	White	45	Industrial	Yes
2004	24	Never Married	White	52	Information	Yes
2003	45	Married	White	18	Industrial	No
2003	45	Married	White	24	Information	No
2003	45	Married	White	45	Industrial	Yes
2003	45	Married	White	52	Information	Yes
2005	50	Divorced	White	18	Industrial	No
2005	50	Divorced	White	24	Information	No
2005	50	Divorced	White	45	Industrial	Yes
2005	50	Divorced	White	52	Information	Yes
2008	54	Married	White	18	Industrial	No
2008	54	Married	White	24	Information	No
2008	54	Married	White	45	Industrial	Yes
2008	54	Married	White	52	Information	Yes

Left Join

For two data frames data_1 and data_2, this join checks for similar values in condition and returns rows of dataset specified first(on Left) and all columns in both datasets. If there isn’t a match in the second table, then it will return NULL for the row

data_12l=data_1 %>% left_join(data_2,by='age')

Left join by age
year	age	maritl	race.x	race.y	jobclass	health_ins
2006	18	Never Married	White	White	Industrial	No
2004	24	Never Married	White	White	Information	No
2003	45	Married	White	White	Industrial	Yes
2003	43	Married	Asian	NA	NA	NA
2005	50	Divorced	White	NA	NA	NA
2008	54	Married	White	NA	NA	NA

Right Join

For two data frames data_1 and data_2, this join checks for similar values in condition and returns rows of dataset specified second(on Right) and all columns in both datasets. If there isn’t a match in the second table, then it will return NULL for the row

data_12r=data_1 %>% right_join(data_2,by='age')

Right join by age
year	age	maritl	race.x	race.y	jobclass	health_ins
2006	18	Never Married	White	White	Industrial	No
2004	24	Never Married	White	White	Information	No
2003	45	Married	White	White	Industrial	Yes
NA	52	NA	NA	White	Information	Yes

Full Join

For two data frames data_1 and data_2, this join checks for similar values in condition and returns rows of both dataset and all columns of both datasets. If there isn’t a match in the second table, then it will return NULL for the row

data_12f=data_1 %>% full_join(data_2,by='age')

Full join by age
year	age	maritl	race.x	race.y	jobclass	health_ins
2006	18	Never Married	White	White	Industrial	No
2004	24	Never Married	White	White	Information	No
2003	45	Married	White	White	Industrial	Yes
2003	43	Married	Asian	NA	NA	NA
2005	50	Divorced	White	NA	NA	NA
2008	54	Married	White	NA	NA	NA
NA	52	NA	NA	White	Information	Yes

Semi Join

For two data frames data_1 and data_2, this join checks for similar values in condition and returns rows of both dataset and columns of first dataset(On left). If there isn’t a match in the second table, then it will return NULL for the row

data_12s=data_1 %>% semi_join(data_2,by='age')

Semi join by age
year	age	maritl	race
2006	18	Never Married	White
2004	24	Never Married	White
2003	45	Married	White

Anti Join

For two data frames data_1 and data_2, this will return all of the rows from the first dataframe where there are not matching values from the second. Only first dataframe’s columns will be returned

data_12a=data_1 %>% anti_join(data_2,by='age')

Anti join by age
year	age	maritl	race
2003	43	Married	Asian
2005	50	Divorced	White
2008	54	Married	White

Final note

In this notes we have gone through various verbs in dplyr that help us perform data wrangling such as filtering, grouping, arranging, creating new column, summarizing the dataset in representable table structure.
We have also looked at data joining verbs available in dplyr which will be greatly used while working with dataframes from multiple sources.