DPLYR Learning Tips

Introduction

This notes provides some essential details for learning dplyr a package in R. Though it has many options to perform data analysis tasks more conveniently, we focus five major utilities of dplyr package

Codes may be obtained using the tab appearing in the top right of each output

General Points

This package is mainly helpful to treat or process the variables for suitable data analytic requirements. Hence, knowledge of possible queries one may need to do in a data analysis is important
Queries may include analysis for subsets; choosing some part of data set based on a condition
In some cases it may be required to modify an existing variable and create new variable with that modification
In certain cases, we may rewrite the modified values in the same variable
We may have to include a new variable in to the existing data set
For all these (and more) we must know the nature / type of a variable for proper processing
We shall use an operator %>% called pipe operator to read a data set and perform operations sequentially

Let us recall the guideline about the type of variables

library(kableExtra)
D1=c("Character","Integer","Numeric","Binary Factor", "Poly Factor")
D2=c("An identifier","Count","Measurable","Two level factor","More than two level factor")
D_Nature=as.data.frame(cbind(D1,D2))
colnames(D_Nature)=c("Type","Description")

kable(rbind(D_Nature)) %>%  kable_styling()

Type	Description
Character	An identifier
Integer	Count
Numeric	Measurable
Binary Factor	Two level factor
Poly Factor	More than two level factor

Some variables may have alpha-numeric type which may be classified as Character or factor as per the definition / meaning of a variable. For example

Pincode, Row numbers may be given as numbers but they are identifier of a row or case. Hence they must be treated as Character
Gender, a variable may be originally have alpha-numeric: 1-Male and 0 - Female or M - Male and F- Female. Here, we have to treat them as Binary Factor
Education, a variable may be originally have numeric codes: 1, 2, 3, and 4 for Up to School, UG, PG, and Research. Also the same may be given in character with out numbers. Here, we have to treat them as Poly Factor
For any data analysis practices Know your data set, type of variables and more importantly the question about the data set

filter

Create a subset from the given data set with single condition by a categorical variable
Filter with more than one condition by a factor and metric variable
Filter with more than one variable
Filter to remove or exclude few cases
Plots based on Filtering
Numerical summaries based on Filtering

group_by

This function helps us to make subset based on a condition and perform subsequent summaries for the smaller subsets

This is better than filter to bring the summaries in a single attempt

This function will have another function *summarize to carry out the operation

Group by one variable
Grouping by more than one variable

mutate

To create new variables by suitable conditions or operations
It is possible to create another variable based on previously created in the same sequence of coding
Summaries (numerical and visual) can be obained from these newly created variables

arrange

This is useful for sorting a data set based on the type of variable, usually ascending or descending order

Numerical variable - higher number to lower or vice-versa
Alpha-numeric - based on alphabetical order

join

Create subsets by merging one or more data sets. dplyr in R has different option of merging based on possible inclusion / exclusion of rows and columns.

This may be better understood from the logic of and, or, not that are applied to the rows and columns of the data sets considered for joining

Inner Join - This join works based on and logic for rows and all columns
Full Join - Underlying logic is or for rows. The full outer join returns all of the rows of two data sets, whether it matches on either the left or right data set. If the rows of the two data sets do not match, then it will return NULL in places (where a matching row does not exist).
Semi Join - Underlying logic is and for rows but includes columns of the first data set only. This creates a new data set where it will return all rows common to both data sets, but contains columns from the first table only.
Anti Join - Underlying logic is not in the second data set. This will return all of the rows from the first data set where there are not matching values from the second. The new data set will contain columns only from the first data set
Left Join - Underlying logic is only for the first data set (usually read from left to right in the code). This join will take all of the values from the data we specify in the left and match them to records from the right data set. If there isn’t a match in the second data set, then it will return NULL for the row in question
Right Join - Underlying logic is only for the second data set (usually read from right to left in the code). New data set takes all of its values from the data set specified second within the join statement.If there isn’t a match in the first data set, then it will return NULL for the row(s) that do not match

To understand about join functions, we shall make following assumption

Two data sets X and Y
Syntax for join functions is (X, Y, by = " ")
Common variables - used in “by”
Counting can be adjusted for common variables \((k)\)
Number of rows and columns are referred as \(Xr\) and \(Yr\); \(Xc\); \(Yc\);
Number of rows for common items is referred as \((X\) and \(Y)r\)
Number of rows in other than common (\(X\) alone not in \(Y\)) items is referred as \((X\) not \(Y)r\)

Dimension of New Data Set due to Join

join_dplyr

Apart from the functions, dplyr provide more direct and simplified ways to handle data analysis requirements. We shall provide few more functions.

Select

This function is helpful for creating subsets by including / excluding few columns (variables / features) of a given data set. Additional functions select_if, select_all, select_at are also very useful in handling variables

Looping

In many of our data analysis, we may encounter repeated operations; for example, finding mean for all numeric variables. This may further depend on some conditions such as mean of all variables only for high income group. Let us illustrate such situations with Smarket, a data set from ISLR package in R. This data set has nine variables of which except two (Year, Direction"), all variables are numeric.

Following code finds mean of numeric variables in the data set. Intentionally we create meanfun instead of using R function mean(), just to indicate the choice of user defined functions

dat1=ISLR::Smarket
meanfun=function(x) mean(x)
me=0
for(j in 1:ncol(dat1))
{
  me[j]=ifelse(is.numeric(dat1[,j]),meanfun(dat1[,j])," ")
}
me=as.numeric(me[which(me!=" ")])

across

dplyr::across provides nicer tools to handle these requirements

dat2=dat1 %>% select_if(is.numeric) %>% 
  summarise(across(everything(),~meanfun(.x)))

dplyr::add_row can be used to add these summaries in the last row of the data frame as well as we can name the rows. In this data set, the two non-numeric variables that have no summary can be left blank in the newly appended row

dat3=dat1 %>% add_row(Year=" ",dat2,Direction=" ") 
rownames(dat3)=c(1:1250,"Avg")

Following out put shows these operations

library(dplyr)
dat1=ISLR::Smarket
dat1$Year=as.factor(dat1$Year)

meanfun=function(x) mean(x)
dat2=dat1 %>% select_if(is.numeric) %>% 
  summarise(across(everything(),~round(meanfun(.x),3)))

kable(dat2) %>%  kable_styling()

Lag1	Lag2	Lag3	Lag4	Lag5	Volume	Today
0.004	0.004	0.002	0.002	0.006	1.478	0.003

Last 3 rows of the data frame, two columns (Year and Direction) have no summaries

dat3=dat1 %>% add_row(Year=" ",dat2,Direction=" ") 
rownames(dat3)=c(1:1250,"Avg")
kable(tail(dat3,3)) %>%  kable_styling()

	Year	Lag1	Lag2	Lag3	Lag4	Lag5	Volume	Today	Direction
1249	2005	0.130	-0.955	0.043	0.422	0.252	1.42236	-0.298	Down
1250	2005	-0.298	0.130	-0.955	0.043	0.422	1.38254	-0.489	Down
Avg		0.004	0.004	0.002	0.002	0.006	1.47800	0.003

case_when

Another data analysis requirement may be grouping a numeric variable based on some pre defined classes.

Consider Lag1 in the data set Smarket, we shall divide in to two groups positive and non-positive (negative or zero). Equally, we may be interested to have three groups negative, positive but less than 1, and greater than or equal to 1. In fact, ifelse can be used as

dat1 %>% 
  mutate(Lag1_2=ifelse(Lag1>0,"Positive","Non_Positive"),
         
         Lag1_3=ifelse(Lag1<0,"Negative",
                       ifelse(Lag1>=0 & Lag1<1),"Posi_Lt1","Gte1")

dplyr::case_when can be used in a more simplified way to carry out the above task

dat1 %>% 
  mutate(Lag1_2=case_when(Lag1>0 ~"Positive",
                          TRUE~"Non_Positive"),
         
         Lag1_3=case_when(Lag1<0~"Negative",
                          Lag1>=0 & Lag1<1~"Posi_Lt1",
                          TRUE~"Gte1"))

This option of using case_when can be much appreciated when we have more number of groups, which will avoid using nested if conditions.

Also note the use of TRUE in the last group which includes all the values not satisfying the rest of the conditions mentioned.

lag/lead

These two functions are helpful to create new variables by adding preceding or succeeding values. The number of positions to lag / lead by may also be specified (default = 1, immediate predecessor / successor of the current index)

For example this output provides top 5 rows of newly created variables using lag / lead functions

dat4=dat1 %>% 
  mutate(new1=case_when(Today>0 ~ Today+lag(Today),
                        Today<=0 ~ Today+lead(Today,n=2)))

kable(head(dat4 %>% select(Today,new1))) %>%  kable_styling()

Today	new1
0.959	NA
1.032	1.991
-0.623	-0.410
0.614	-0.009
0.213	0.827
1.392	1.605

Second row of the variable new1 is 0.959 + 1.032 (lag by 1) from the column Today
Third row of variable new1 is -0.623 + 0.213 (lead by 2) from the column Today
Operation is not performed for the first row of Today; it is positive with out any preceding value while apply lag()