Introduction

This notes provides some essential details for learning dplyr a package in R. Though it has many options to perform data analysis tasks more conveniently, we focus five major utilities of dplyr package

Codes may be obtained using the tab appearing in the top right of each output

General Points

  1. This package is mainly helpful to treat or process the variables for suitable data analytic requirements. Hence, knowledge of possible queries one may need to do in a data analysis is important

  2. Queries may include analysis for subsets; choosing some part of data set based on a condition

  3. In some cases it may be required to modify an existing variable and create new variable with that modification

  4. In certain cases, we may rewrite the modified values in the same variable

  5. We may have to include a new variable in to the existing data set

  6. For all these (and more) we must know the nature / type of a variable for proper processing

  7. We shall use an operator %>% called pipe operator to read a data set and perform operations sequentially

Let us recall the guideline about the type of variables

library(kableExtra)
D1=c("Character","Integer","Numeric","Binary Factor", "Poly Factor")
D2=c("An identifier","Count","Measurable","Two level factor","More than two level factor")
D_Nature=as.data.frame(cbind(D1,D2))
colnames(D_Nature)=c("Type","Description")

kable(rbind(D_Nature)) %>%  kable_styling()
Type Description
Character An identifier
Integer Count
Numeric Measurable
Binary Factor Two level factor
Poly Factor More than two level factor

Some variables may have alpha-numeric type which may be classified as Character or factor as per the definition / meaning of a variable. For example

  1. Pincode, Row numbers may be given as numbers but they are identifier of a row or case. Hence they must be treated as Character

  2. Gender, a variable may be originally have alpha-numeric: 1-Male and 0 - Female or M - Male and F- Female. Here, we have to treat them as Binary Factor

  3. Education, a variable may be originally have numeric codes: 1, 2, 3, and 4 for Up to School, UG, PG, and Research. Also the same may be given in character with out numbers. Here, we have to treat them as Poly Factor

  4. For any data analysis practices Know your data set, type of variables and more importantly the question about the data set

filter

  1. Create a subset from the given data set with single condition by a categorical variable

  2. Filter with more than one condition by a factor and metric variable

  3. Filter with more than one variable

  4. Filter to remove or exclude few cases

  5. Plots based on Filtering

  6. Numerical summaries based on Filtering

group_by

This function helps us to make subset based on a condition and perform subsequent summaries for the smaller subsets

This is better than filter to bring the summaries in a single attempt

This function will have another function *summarize to carry out the operation

  1. Group by one variable

  2. Grouping by more than one variable

mutate

  1. To create new variables by suitable conditions or operations

  2. It is possible to create another variable based on previously created in the same sequence of coding

  3. Summaries (numerical and visual) can be obained from these newly created variables

arrange

This is useful for sorting a data set based on the type of variable, usually ascending or descending order

  1. Numerical variable - higher number to lower or vice-versa

  2. Alpha-numeric - based on alphabetical order

join

Create subsets by merging one or more data sets. dplyr in R has different option of merging based on possible inclusion / exclusion of rows and columns.

This may be better understood from the logic of and, or, not that are applied to the rows and columns of the data sets considered for joining

  1. Inner Join - This join works based on and logic for rows and all columns

  2. Full Join - Underlying logic is or for rows. The full outer join returns all of the rows of two data sets, whether it matches on either the left or right data set. If the rows of the two data sets do not match, then it will return NULL in places (where a matching row does not exist).

  3. Semi Join - Underlying logic is and for rows but includes columns of the first data set only. This creates a new data set where it will return all rows common to both data sets, but contains columns from the first table only.

  4. Anti Join - Underlying logic is not in the second data set. This will return all of the rows from the first data set where there are not matching values from the second. The new data set will contain columns only from the first data set

  5. Left Join - Underlying logic is only for the first data set (usually read from left to right in the code). This join will take all of the values from the data we specify in the left and match them to records from the right data set. If there isn’t a match in the second data set, then it will return NULL for the row in question

  6. Right Join - Underlying logic is only for the second data set (usually read from right to left in the code). New data set takes all of its values from the data set specified second within the join statement.If there isn’t a match in the first data set, then it will return NULL for the row(s) that do not match

To understand about join functions, we shall make following assumption

  1. Two data sets X and Y

  2. Syntax for join functions is (X, Y, by = " ")

  3. Common variables - used in “by”

  4. Counting can be adjusted for common variables \((k)\)

  5. Number of rows and columns are referred as \(Xr\) and \(Yr\); \(Xc\); \(Yc\);

  6. Number of rows for common items is referred as \((X\) and \(Y)r\)

  7. Number of rows in other than common (\(X\) alone not in \(Y\)) items is referred as \((X\) not \(Y)r\)

Dimension of New Data Set due to Join

join_dplyr

Apart from the functions, dplyr provide more direct and simplified ways to handle data analysis requirements. We shall provide few more functions.

Select

This function is helpful for creating subsets by including / excluding few columns (variables / features) of a given data set. Additional functions select_if, select_all, select_at are also very useful in handling variables

Looping

In many of our data analysis, we may encounter repeated operations; for example, finding mean for all numeric variables. This may further depend on some conditions such as mean of all variables only for high income group. Let us illustrate such situations with Smarket, a data set from ISLR package in R. This data set has nine variables of which except two (Year, Direction"), all variables are numeric.

Following code finds mean of numeric variables in the data set. Intentionally we create meanfun instead of using R function mean(), just to indicate the choice of user defined functions

dat1=ISLR::Smarket
meanfun=function(x) mean(x)
me=0
for(j in 1:ncol(dat1))
{
  me[j]=ifelse(is.numeric(dat1[,j]),meanfun(dat1[,j])," ")
}
me=as.numeric(me[which(me!=" ")])

across

dplyr::across provides nicer tools to handle these requirements

dat2=dat1 %>% select_if(is.numeric) %>% 
  summarise(across(everything(),~meanfun(.x)))

dplyr::add_row can be used to add these summaries in the last row of the data frame as well as we can name the rows. In this data set, the two non-numeric variables that have no summary can be left blank in the newly appended row

dat3=dat1 %>% add_row(Year=" ",dat2,Direction=" ") 
rownames(dat3)=c(1:1250,"Avg")

Following out put shows these operations

library(dplyr)
dat1=ISLR::Smarket
dat1$Year=as.factor(dat1$Year)

meanfun=function(x) mean(x)
dat2=dat1 %>% select_if(is.numeric) %>% 
  summarise(across(everything(),~round(meanfun(.x),3)))

kable(dat2) %>%  kable_styling()
Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
0.004 0.004 0.002 0.002 0.006 1.478 0.003

Last 3 rows of the data frame, two columns (Year and Direction) have no summaries

dat3=dat1 %>% add_row(Year=" ",dat2,Direction=" ") 
rownames(dat3)=c(1:1250,"Avg")
kable(tail(dat3,3)) %>%  kable_styling()
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
1249 2005 0.130 -0.955 0.043 0.422 0.252 1.42236 -0.298 Down
1250 2005 -0.298 0.130 -0.955 0.043 0.422 1.38254 -0.489 Down
Avg 0.004 0.004 0.002 0.002 0.006 1.47800 0.003

case_when

Another data analysis requirement may be grouping a numeric variable based on some pre defined classes.

Consider Lag1 in the data set Smarket, we shall divide in to two groups positive and non-positive (negative or zero). Equally, we may be interested to have three groups negative, positive but less than 1, and greater than or equal to 1. In fact, ifelse can be used as

dat1 %>% 
  mutate(Lag1_2=ifelse(Lag1>0,"Positive","Non_Positive"),
         
         Lag1_3=ifelse(Lag1<0,"Negative",
                       ifelse(Lag1>=0 & Lag1<1),"Posi_Lt1","Gte1")

dplyr::case_when can be used in a more simplified way to carry out the above task

dat1 %>% 
  mutate(Lag1_2=case_when(Lag1>0 ~"Positive",
                          TRUE~"Non_Positive"),
         
         Lag1_3=case_when(Lag1<0~"Negative",
                          Lag1>=0 & Lag1<1~"Posi_Lt1",
                          TRUE~"Gte1"))

This option of using case_when can be much appreciated when we have more number of groups, which will avoid using nested if conditions.

Also note the use of TRUE in the last group which includes all the values not satisfying the rest of the conditions mentioned.

lag/lead

These two functions are helpful to create new variables by adding preceding or succeeding values. The number of positions to lag / lead by may also be specified (default = 1, immediate predecessor / successor of the current index)

For example this output provides top 5 rows of newly created variables using lag / lead functions

dat4=dat1 %>% 
  mutate(new1=case_when(Today>0 ~ Today+lag(Today),
                        Today<=0 ~ Today+lead(Today,n=2)))

kable(head(dat4 %>% select(Today,new1))) %>%  kable_styling()
Today new1
0.959 NA
1.032 1.991
-0.623 -0.410
0.614 -0.009
0.213 0.827
1.392 1.605
  • Second row of the variable new1 is 0.959 + 1.032 (lag by 1) from the column Today

  • Third row of variable new1 is -0.623 + 0.213 (lead by 2) from the column Today

  • Operation is not performed for the first row of Today; it is positive with out any preceding value while apply lag()