Preamble
In this notes we shall focus on ‘dplyr’ package which is one of the member of Tidyverse family. The dplyr is an Important tool for data wrangling in R. Learning and using dplyr makes data preparation and management process quick. It is generally refered to as “Grammar of Data Manipulation”. This is because the package provides a set of verbs (functions) to describe and perform common data preparation tasks.
We will look at core dplyr functions such as
filter()
group_by()
mutate()
arrange()
summarize()
select()
Loading the packages needed
#dplyr for data wrangling
library(dplyr)Dataset
We shall make use of ‘Wage’ dataset
library(ISLR)
data=WageThis dataset provides wage and other data for a group of 3000 male workers in Mid-Atlantic region. Let us have a quick look at the variables in the data
| Variable.name | Description | Type |
|---|---|---|
| year | Year that wage information was recorded | Year |
| age | Age of worker | Numeric |
| maritl | marital status | Str |
| race | indicating race of person | Str |
| education | Education level | Str |
| region | Indication mid-atlantic | Str |
| jobclass | type of job | Str |
| health | health level of worker | Str |
| health_ins | health insurance status | Str |
| logwage | Log of worker’s wage | Numeric |
| wage | worke’s raw wage | Numeric |
Data wrangling with Wage data
We shall perform data manipulation with our Wage dataset.
filter
Considering education, we have
## 5 levels which are
## [1] "1. < HS Grad" "2. HS Grad" "3. Some College"
## [4] "4. College Grad" "5. Advanced Degree"
Now we try to subset the data where individuals of education level ‘2.HS Grad’
data1=data %>%
filter(education=='2. HS Grad')| year | age | maritl | race | education | region | jobclass | health | health_ins | logwage | wage |
|---|---|---|---|---|---|---|---|---|---|---|
| 2005 | 50 |
|
|
|
|
|
|
|
4.318063 | 75.04315 |
| 2004 | 52 |
|
|
|
|
|
|
|
4.857333 | 128.68049 |
| 2007 | 34 |
|
|
|
|
|
|
|
4.397940 | 81.28325 |
| 2005 | 35 |
|
|
|
|
|
|
|
4.494155 | 89.49248 |
| 2009 | 54 |
|
|
|
|
|
|
|
4.903090 | 134.70538 |
| 2005 | 40 |
|
|
|
|
|
|
|
3.920123 | 50.40666 |
We can add more than one condition for filtering data, for this we shall consider jobclass ‘1. Industrial’ and education level ‘2.HS Grad’, we can perform this using following way
data2=data %>%
filter(education=='2. HS Grad',jobclass=='1. Industrial')| year | age | maritl | race | education | region | jobclass | health | health_ins | logwage | wage |
|---|---|---|---|---|---|---|---|---|---|---|
| 2007 | 34 |
|
|
|
|
|
|
|
4.397940 | 81.28325 |
| 2006 | 38 |
|
|
|
|
|
|
|
4.544068 | 94.07271 |
| 2006 | 43 |
|
|
|
|
|
|
|
4.431364 | 84.04596 |
| 2004 | 34 |
|
|
|
|
|
|
|
4.176091 | 65.11085 |
| 2008 | 57 |
|
|
|
|
|
|
|
4.755875 | 116.26532 |
| 2003 | 18 |
|
|
|
|
|
|
|
4.255273 | 70.47602 |
data3=data %>%
filter(education=='2. HS Grad'|education=='3. Some College',jobclass=='2. Information')| year | age | maritl | race | education | region | jobclass | health | health_ins | logwage | wage |
|---|---|---|---|---|---|---|---|---|---|---|
| 2005 | 50 |
|
|
|
|
|
|
|
4.318063 | 75.04315 |
| 2008 | 30 |
|
|
|
|
|
|
|
4.716003 | 111.72085 |
| 2006 | 41 |
|
|
|
|
|
|
|
4.778151 | 118.88436 |
| 2004 | 52 |
|
|
|
|
|
|
|
4.857333 | 128.68049 |
| 2007 | 45 |
|
|
|
|
|
|
|
4.763428 | 117.14682 |
| 2005 | 35 |
|
|
|
|
|
|
|
4.494155 | 89.49248 |
More than two categories can be considered for filtering using ‘%in%’ operator
data4=data %>%
filter(education%in%c('2. HS Grad','3. Some College','4. College Grad'),jobclass=='2. Information')| year | age | maritl | race | education | region | jobclass | health | health_ins | logwage | wage |
|---|---|---|---|---|---|---|---|---|---|---|
| 2004 | 24 |
|
|
|
|
|
|
|
4.255273 | 70.47602 |
| 2003 | 43 |
|
|
|
|
|
|
|
5.041393 | 154.68529 |
| 2005 | 50 |
|
|
|
|
|
|
|
4.318063 | 75.04315 |
| 2008 | 54 |
|
|
|
|
|
|
|
4.845098 | 127.11574 |
| 2008 | 30 |
|
|
|
|
|
|
|
4.716003 | 111.72085 |
| 2006 | 41 |
|
|
|
|
|
|
|
4.778151 | 118.88436 |
group_by and summarize
Grouping data by one variable and summarizing
summary=data %>%
group_by(maritl) %>%
summarize(mean = mean(wage, na.rm = TRUE),std_dev = sd(wage, na.rm = TRUE))| maritl | mean | std_dev |
|---|---|---|
|
92.73465 | 32.92007 |
|
118.86026 | 43.12055 |
|
99.53866 | 23.74106 |
|
103.15926 | 33.80098 |
|
101.21579 | 33.66338 |
It is safe to ungroup the data which we have grouped, this can be done as
data %>% group_by(maritl) %>% ungroup()Grouping by more than one variable can be done using
summary2=data %>%
group_by(maritl,education) %>%
summarize(mean=mean(wage,na.rm=T),std_dev=sd(wage,na.rm=T),number_of_observations=n())| maritl | education | mean | std_dev | number_of_observations |
|---|---|---|---|---|
|
|
73.42270 | 15.705084 | 62 |
|
|
80.27383 | 19.462133 | 219 |
|
|
91.02989 | 28.917466 | 164 |
|
|
108.22921 | 39.335340 | 143 |
|
|
125.90330 | 38.968472 | 60 |
|
|
88.10074 | 22.809326 | 174 |
|
|
101.54363 | 29.797860 | 651 |
|
|
114.92456 | 32.378203 | 421 |
|
|
129.88437 | 40.416962 | 487 |
|
|
156.72968 | 55.268681 | 341 |
|
|
58.31423 | 5.255327 | 2 |
|
|
101.93244 | 16.075910 | 8 |
|
|
103.43270 | 21.851952 | 2 |
|
|
103.52201 | 24.742911 | 5 |
|
|
117.33551 | 34.562904 | 2 |
|
|
91.04488 | 18.504915 | 16 |
|
|
88.83916 | 24.196654 | 73 |
|
|
105.19988 | 26.912748 | 52 |
|
|
118.35697 | 45.834830 | 41 |
|
|
126.34012 | 34.392348 | 22 |
|
|
77.49285 | 10.872880 | 14 |
|
|
101.00232 | 22.118534 | 20 |
|
|
95.61057 | 9.478846 | 11 |
|
|
125.82291 | 23.996230 | 9 |
|
|
277.79948 | NA | 1 |
data %>%
group_by(maritl,education) %>%
ungroup()mutate
We can create a new variable using existing variables easily with the help of mutate. For this let us consider finding the birth year of persons since we have age and year when the responses were collected.
B_data=data %>% mutate(Birth_year = year-age)| year | age | maritl | race | education | region | jobclass | health | health_ins | logwage | wage | Birth_year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2006 | 18 |
|
|
|
|
|
|
|
4.318063 | 75.04315 | 1988 |
| 2004 | 24 |
|
|
|
|
|
|
|
4.255273 | 70.47602 | 1980 |
| 2003 | 45 |
|
|
|
|
|
|
|
4.875061 | 130.98218 | 1958 |
| 2003 | 43 |
|
|
|
|
|
|
|
5.041393 | 154.68529 | 1960 |
| 2005 | 50 |
|
|
|
|
|
|
|
4.318063 | 75.04315 | 1955 |
| 2008 | 54 |
|
|
|
|
|
|
|
4.845098 | 127.11574 | 1954 |
We can create more than 1 variable using mutate as follows, let us find birth year and deviation of wage from its mean
N_data=data %>%
mutate(Birth_year=year-age,mean_dev=wage-mean(wage))| year | age | maritl | race | education | region | jobclass | health | health_ins | logwage | wage | Birth_year | mean_dev |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2006 | 18 |
|
|
|
|
|
|
|
4.318063 | 75.04315 | 1988 | -36.66045 |
| 2004 | 24 |
|
|
|
|
|
|
|
4.255273 | 70.47602 | 1980 | -41.22759 |
| 2003 | 45 |
|
|
|
|
|
|
|
4.875061 | 130.98218 | 1958 | 19.27857 |
| 2003 | 43 |
|
|
|
|
|
|
|
5.041393 | 154.68529 | 1960 | 42.98168 |
| 2005 | 50 |
|
|
|
|
|
|
|
4.318063 | 75.04315 | 1955 | -36.66045 |
| 2008 | 54 |
|
|
|
|
|
|
|
4.845098 | 127.11574 | 1954 | 15.41214 |
arrange
To arrange the data based on particular variable we use arrange. For example, we shall arrange the data based on year
arranged_dat=data %>% arrange(year)| year | age | maritl | race | education | region | jobclass | health | health_ins | logwage | wage |
|---|---|---|---|---|---|---|---|---|---|---|
| 2003 | 45 |
|
|
|
|
|
|
|
4.875061 | 130.98218 |
| 2003 | 43 |
|
|
|
|
|
|
|
5.041393 | 154.68529 |
| 2003 | 39 |
|
|
|
|
|
|
|
4.903090 | 134.70538 |
| 2003 | 37 |
|
|
|
|
|
|
|
4.414973 | 82.67964 |
| 2003 | 37 |
|
|
|
|
|
|
|
4.591065 | 98.59934 |
| 2003 | 38 |
|
|
|
|
|
|
|
5.301030 | 200.54326 |
By default it arranges in ascending order, we can obtain descending order as follows
arranged_datd=data %>% arrange(desc(age))| year | age | maritl | race | education | region | jobclass | health | health_ins | logwage | wage |
|---|---|---|---|---|---|---|---|---|---|---|
| 2003 | 80 |
|
|
|
|
|
|
|
4.380211 | 79.85490 |
| 2003 | 80 |
|
|
|
|
|
|
|
4.544068 | 94.07271 |
| 2005 | 80 |
|
|
|
|
|
|
|
4.477121 | 87.98103 |
| 2003 | 80 |
|
|
|
|
|
|
|
4.414973 | 82.67964 |
| 2005 | 77 |
|
|
|
|
|
|
|
4.602060 | 99.68946 |
| 2009 | 76 |
|
|
|
|
|
|
|
5.176091 | 176.98965 |
select
We can select interested columns from a dataframe using this function
selected=data %>%
select(age,wage,education)| age | wage | education |
|---|---|---|
| 18 | 75.04315 |
|
| 24 | 70.47602 |
|
| 45 | 130.98218 |
|
| 43 | 154.68529 |
|
| 50 | 75.04315 |
|
| 54 | 127.11574 |
|
case_when
This is an important function which can be used for conditional execution, this can be used as an alternative for if_else, for example let us create a new variable ‘wage_category’ which indicate whether wage of person is less than or greater than median
data_n=data %>%
mutate(wage_category=case_when(wage<=104.92~"Less than median",T~"Greater than median"))| year | age | maritl | race | education | region | jobclass | health | health_ins | logwage | wage | wage_category |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2006 | 18 |
|
|
|
|
|
|
|
4.318063 | 75.04315 | Less than median |
| 2004 | 24 |
|
|
|
|
|
|
|
4.255273 | 70.47602 | Less than median |
| 2003 | 45 |
|
|
|
|
|
|
|
4.875061 | 130.98218 | Greater than median |
| 2003 | 43 |
|
|
|
|
|
|
|
5.041393 | 154.68529 | Greater than median |
| 2005 | 50 |
|
|
|
|
|
|
|
4.318063 | 75.04315 | Less than median |
| 2008 | 54 |
|
|
|
|
|
|
|
4.845098 | 127.11574 | Greater than median |
Joining data frames together
Data may not always come from single source, when various data frames are obtained from various sources our primary aim will be to combine all the important data to form a single dataframe. For this let us subset the data in to two data frames
data_1=data[1:6,1:4];data_1data_2=data[c(1:3,10),c(2,4,7,9)]| year | age | maritl | race |
|---|---|---|---|
| 2006 | 18 |
|
|
| 2004 | 24 |
|
|
| 2003 | 45 |
|
|
| 2003 | 43 |
|
|
| 2005 | 50 |
|
|
| 2008 | 54 |
|
|
| age | race | jobclass | health_ins |
|---|---|---|---|
| 18 |
|
|
|
| 24 |
|
|
|
| 45 |
|
|
|
| 52 |
|
|
|
For joining the datasets we can use the following types of join which are
Inner Join
For two data frames data_1 and data_2, this join checks for similar values in given columns in condition and returns all rows of both the data frames where match occurs. For example this join creates a new table which will combine data_1 and data_2 based on the join-variable, “race”
dat_12i=data_1 %>% inner_join(data_2,by='race')| year | age.x | maritl | race | age.y | jobclass | health_ins |
|---|---|---|---|---|---|---|
| 2006 | 18 |
|
|
18 |
|
|
| 2006 | 18 |
|
|
24 |
|
|
| 2006 | 18 |
|
|
45 |
|
|
| 2006 | 18 |
|
|
52 |
|
|
| 2004 | 24 |
|
|
18 |
|
|
| 2004 | 24 |
|
|
24 |
|
|
| 2004 | 24 |
|
|
45 |
|
|
| 2004 | 24 |
|
|
52 |
|
|
| 2003 | 45 |
|
|
18 |
|
|
| 2003 | 45 |
|
|
24 |
|
|
| 2003 | 45 |
|
|
45 |
|
|
| 2003 | 45 |
|
|
52 |
|
|
| 2005 | 50 |
|
|
18 |
|
|
| 2005 | 50 |
|
|
24 |
|
|
| 2005 | 50 |
|
|
45 |
|
|
| 2005 | 50 |
|
|
52 |
|
|
| 2008 | 54 |
|
|
18 |
|
|
| 2008 | 54 |
|
|
24 |
|
|
| 2008 | 54 |
|
|
45 |
|
|
| 2008 | 54 |
|
|
52 |
|
|
Left Join
For two data frames data_1 and data_2, this join checks for similar values in condition and returns rows of dataset specified first(on Left) and all columns in both datasets. If there isn’t a match in the second table, then it will return NULL for the row
data_12l=data_1 %>% left_join(data_2,by='age')| year | age | maritl | race.x | race.y | jobclass | health_ins |
|---|---|---|---|---|---|---|
| 2006 | 18 |
|
|
|
|
|
| 2004 | 24 |
|
|
|
|
|
| 2003 | 45 |
|
|
|
|
|
| 2003 | 43 |
|
|
NA | NA | NA |
| 2005 | 50 |
|
|
NA | NA | NA |
| 2008 | 54 |
|
|
NA | NA | NA |
Right Join
For two data frames data_1 and data_2, this join checks for similar values in condition and returns rows of dataset specified second(on Right) and all columns in both datasets. If there isn’t a match in the second table, then it will return NULL for the row
data_12r=data_1 %>% right_join(data_2,by='age')| year | age | maritl | race.x | race.y | jobclass | health_ins |
|---|---|---|---|---|---|---|
| 2006 | 18 |
|
|
|
|
|
| 2004 | 24 |
|
|
|
|
|
| 2003 | 45 |
|
|
|
|
|
| NA | 52 | NA | NA |
|
|
|
Full Join
For two data frames data_1 and data_2, this join checks for similar values in condition and returns rows of both dataset and all columns of both datasets. If there isn’t a match in the second table, then it will return NULL for the row
data_12f=data_1 %>% full_join(data_2,by='age')| year | age | maritl | race.x | race.y | jobclass | health_ins |
|---|---|---|---|---|---|---|
| 2006 | 18 |
|
|
|
|
|
| 2004 | 24 |
|
|
|
|
|
| 2003 | 45 |
|
|
|
|
|
| 2003 | 43 |
|
|
NA | NA | NA |
| 2005 | 50 |
|
|
NA | NA | NA |
| 2008 | 54 |
|
|
NA | NA | NA |
| NA | 52 | NA | NA |
|
|
|
Semi Join
For two data frames data_1 and data_2, this join checks for similar values in condition and returns rows of both dataset and columns of first dataset(On left). If there isn’t a match in the second table, then it will return NULL for the row
data_12s=data_1 %>% semi_join(data_2,by='age')| year | age | maritl | race |
|---|---|---|---|
| 2006 | 18 |
|
|
| 2004 | 24 |
|
|
| 2003 | 45 |
|
|
Anti Join
For two data frames data_1 and data_2, this will return all of the rows from the first dataframe where there are not matching values from the second. Only first dataframe’s columns will be returned
data_12a=data_1 %>% anti_join(data_2,by='age')| year | age | maritl | race |
|---|---|---|---|
| 2003 | 43 |
|
|
| 2005 | 50 |
|
|
| 2008 | 54 |
|
|
Final note
In this notes we have gone through various verbs in dplyr that help us perform data wrangling such as filtering, grouping, arranging, creating new column, summarizing the dataset in representable table structure.
We have also looked at data joining verbs available in dplyr which will be greatly used while working with dataframes from multiple sources.