This tutorial is the first in the dplyr training series. Here is the YouTube video link:

https://youtu.be/-U4TB2rgCfE

Why dplyr

dplyr is a great tool to use in R. The commands may look long and overwhelming to someone who has not used dplyr, but they are not. Once you learn the basics, they are very intuitive.

Audience

If you are a beginner in R, or you have experience in R but have never used dplyr, or you want to learn something new about dplyr, then go ahead and watch this tutorial on YouTube.

DPLYR : Arrange

We will be covering all practical aspects of the dplyr::arrange command in this tutorial. This tutorial is part of a series of tutorials on all practical aspects of dplyr. All the YouTube videos are available in a single playlist:

https://www.youtube.com/playlist?list=PLkHcMTpvAaXVJzyRSytUn3nSK92TJphxR

Create sample dataset

Run the following commands to create our sample dataset. This is fictitious data about hospital patients and their clinical information, such as diagnostic codes and demographic information.

library(dplyr)  # we will be using this package
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#install.packages("dplyr")
# If you do not have this package, run this code to install it. Remove the # from the front before running it.

t1 <- sample(paste0("Hospital ", toupper(letters)), size = 100, replace=TRUE)
t2 <- sample(x = c("Male", "Female")   , size = 100, replace=TRUE)
t3 <- floor(runif(100, min = 0, max = 110))
t4 <- sample(x = c("Survived", "Died") , size = 100, replace=TRUE)
t5 <- sample(paste0("Facility ", toupper(letters)), size = 100, replace=TRUE)

d  <- data.frame(t1, t2, t3, t4, t5)  # build the data frame directly; cbind() would coerce every column to character
names(d) <- c('AdmittingHospital', 'Gender', 'AgeYears', 'Outcome', 'Dischargeto')

d$Gender   <- as.factor(d$Gender)
d$Outcome  <- as.factor(d$Outcome)
d$AgeYears <- as.integer(d$AgeYears)

d$AgeGroup <- cut(d$AgeYears, 
                  breaks = c(-Inf
                             ,5 ,10 ,15,20,25,30,35,40,45,50,55,60 ,65,70,75,80,85
                             , Inf), 
                  
                  labels = c("0-4 years"
                             ,"5-9 years","10-14 years","15-19 years","20-24 years"
                             ,"25-29 years","30-34 years","35-39 years","40-44 years"
                             ,"45-49 years","50-54 years","55-59 years","60-64 years"
                             ,"65-69 years","70-74 years","75-79 years","80-84 years"
                             ,"85+ years"),
                  right = FALSE)


d$Diag1 <- sample(x= c("A00.0","E00.0","F01.50","G00.0","H00.011"), size = 100, replace = TRUE)
d$Diag3 <- sample(x= c("Y70","Y71","Y72","Y73","Y74"), size = 100, replace = TRUE)
d$Diag4 <- sample(x= c("G00","G01","G02","G03","G04", "G05"), size = 100, replace = TRUE)
d$Diag2 <- sample(x= c("H00","H10","H15","H16","H28"), size = 100, replace = TRUE)
d$Diag5 <- sample(x= c("E00","E01","E02","E03","E04","E05"), size = 100, replace = TRUE)
d$Diag6 <- sample(x= c("E08","E09","E10","E11","E12", "E13"), size = 100, replace = TRUE)
d$Diag7 <- sample(x= c("E40","E41","E42","E43","E44"), size = 100, replace = TRUE)

Have a look at the sample dataset

d

Sort data using ARRANGE

First method

In this method you can define your data set within the arrange keyword.

d1 <- dplyr::arrange(d, AdmittingHospital)

Second method

In this method we define the data set on the first line and then pipe it into the arrange command.

Both methods are equally good but I prefer the second method shown below.

d1 <- d%>%
      dplyr::arrange(AdmittingHospital)
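
As a quick check that the two methods give the same result, here is a minimal sketch. It uses a small made-up data frame called toy (an assumption for illustration, not the tutorial's dataset d) so that it can be run on its own.

```r
library(dplyr)

# Hypothetical toy data, not the tutorial's dataset d
toy <- data.frame(
  AdmittingHospital = c("Hospital C", "Hospital A", "Hospital B"),
  AgeYears          = c(40L, 25L, 63L)
)

# First method: the data frame is the first argument of arrange
m1 <- dplyr::arrange(toy, AdmittingHospital)

# Second method: the data frame is piped into arrange
m2 <- toy %>%
  dplyr::arrange(AdmittingHospital)

# Both methods return the rows in the same order
identical(m1$AdmittingHospital, m2$AdmittingHospital)
m2
```

Which method to use is a matter of style. The piped form tends to read better once more steps are chained after arrange.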

Arranging the data on two or more columns

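By default, arrange sorts in ascending order: the smallest number first and the largest last. For text, ascending order is alphabetical (A to Z) and descending order is reverse alphabetical (Z to A). When two or more columns are given, the rows are sorted by the first column, and ties are broken by the next column.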

d2 <- d%>%
      dplyr::arrange(AdmittingHospital, AgeYears)

d2
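
To see how the second column breaks ties in the first, here is a small sketch on a made-up data frame (toy is an assumption for illustration, not the tutorial's dataset d).

```r
library(dplyr)

# Hypothetical toy data with repeated hospitals
toy <- data.frame(
  AdmittingHospital = c("Hospital B", "Hospital A", "Hospital B", "Hospital A"),
  AgeYears          = c(70L, 30L, 20L, 5L)
)

# Sort by hospital first; within the same hospital, sort by age (ascending)
sorted <- toy %>%
  dplyr::arrange(AdmittingHospital, AgeYears)

sorted
# Hospital A rows come first (ages 5, 30), then Hospital B rows (ages 20, 70)
```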

Arranging the data on two or more columns in descending order

d3 <- d%>%
      dplyr::arrange(AdmittingHospital, desc(AgeYears))

d3

# A minus sign also sorts in descending order, but it only works on numeric columns
d4 <- d%>%
      dplyr::arrange(AdmittingHospital, - AgeYears)

d4
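
The two ways of asking for descending order can be compared with a small sketch. The data frame toy is made up for illustration. The point is that desc() and the minus sign agree on a numeric column, while only desc() works on a text column.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  AdmittingHospital = c("Hospital A", "Hospital C", "Hospital B"),
  AgeYears          = c(30L, 5L, 70L)
)

# On a numeric column, desc() and the minus sign give the same order
by_desc  <- toy %>% dplyr::arrange(desc(AgeYears))
by_minus <- toy %>% dplyr::arrange(-AgeYears)
identical(by_desc$AgeYears, by_minus$AgeYears)

# desc() also works on a text column (Z to A)
z_to_a <- toy %>% dplyr::arrange(desc(AdmittingHospital))
z_to_a

# The minus sign fails on a text column, so we catch the error here
minus_fails <- tryCatch({
  toy %>% dplyr::arrange(-AdmittingHospital)
  FALSE
}, error = function(e) TRUE)
minus_fails
```

For this reason desc() is the safer habit if the column type may change later.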

The order of sorting is important

d5 <- d%>%
      dplyr::arrange(- AgeYears, AdmittingHospital)

d5
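
Why the order of the columns matters can be shown with a small sketch on made-up data (toy is an assumption for illustration). The same two sort keys give different results depending on which one comes first.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  AdmittingHospital = c("Hospital B", "Hospital A", "Hospital B", "Hospital A"),
  AgeYears          = c(70L, 30L, 20L, 5L)
)

# Hospital first: rows stay together by hospital, oldest first within each
hospital_first <- toy %>%
  dplyr::arrange(AdmittingHospital, desc(AgeYears))
hospital_first$AgeYears   # 30 5 70 20

# Age first: rows are ordered by age overall; hospital only breaks ties
age_first <- toy %>%
  dplyr::arrange(desc(AgeYears), AdmittingHospital)
age_first$AgeYears        # 70 30 20 5
```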

Another variation of the selection of columns

Using group by and arranging

What does group_by achieve? In the example below we want to see the oldest male and the oldest female who died, and also the oldest male and the oldest female who survived. So we use the group_by statement to group the data by Gender and Outcome, and then ask for the data to be arranged in descending order of AgeYears. By default arrange ignores the grouping, so we specify .by_group = TRUE to sort by the grouping columns first and then by AgeYears within each group. Finally, slice(1) keeps the first row of each group, which is the oldest patient in that group.

d6 <- d%>%
      dplyr::group_by(Gender,Outcome)%>%
      dplyr::arrange(desc(AgeYears), .by_group = TRUE)%>%
      slice(1)

d6
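
The effect of .by_group can be seen in a small sketch. The data frame toy is made up for illustration and is grouped by Gender only, to keep the output short.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  Gender   = c("Male", "Female", "Male", "Female"),
  AgeYears = c(40L, 90L, 75L, 10L)
)

grouped <- toy %>%
  dplyr::group_by(Gender)

# Without .by_group, arrange ignores the groups and sorts all rows by age
ignores_groups <- grouped %>%
  dplyr::arrange(desc(AgeYears))
ignores_groups$AgeYears   # 90 75 40 10

# With .by_group = TRUE, rows are sorted by Gender first, then by age
within_groups <- grouped %>%
  dplyr::arrange(desc(AgeYears), .by_group = TRUE)
within_groups$AgeYears    # 90 10 75 40

# slice(1) keeps the first row of each group: the oldest per gender
oldest <- within_groups %>%
  dplyr::slice(1)
oldest
```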

Using the selection of multiple columns

In this example we use the starts_with helper inside across to say that we want to arrange by all the columns whose names start with the text Diag. So the rows are sorted by all our columns like Diag1, Diag2, Diag3 etc. at once, with the columns taken in the order in which they appear in the data set.

d7 <- d%>%
      dplyr::arrange(across(starts_with("Diag")))
  
d7
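
A small sketch can show what across(starts_with("Diag")) is shorthand for. The data frame toy and its Id column are made up for illustration. Here the shorthand gives the same order as listing the Diag columns by hand.

```r
library(dplyr)

# Hypothetical toy data with two diagnosis columns
toy <- data.frame(
  Id    = 1:4,
  Diag1 = c("E00.0", "A00.0", "E00.0", "A00.0"),
  Diag2 = c("H10", "H28", "H00", "H15")
)

# Sort by every column whose name starts with "Diag"
with_across <- toy %>%
  dplyr::arrange(across(starts_with("Diag")))

# The same sort with the columns written out
explicit <- toy %>%
  dplyr::arrange(Diag1, Diag2)

identical(with_across$Id, explicit$Id)
with_across
```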

Using descending order

d8 <- d%>%
      dplyr::arrange(across(starts_with("Diag"), desc))

d8

# contains("Year") matches AgeYears, so this sorts by age, oldest first
d9 <- d%>%
      dplyr::arrange(across(contains("Year"), desc))
  
d9
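
As a sketch of what passing desc to across does, the example below uses a made-up data frame (toy and its Id column are assumptions for illustration). Applying desc through across gives the same order as wrapping each selected column in desc() by hand.

```r
library(dplyr)

# Hypothetical toy data with two diagnosis columns
toy <- data.frame(
  Id    = 1:4,
  Diag1 = c("E00.0", "A00.0", "E00.0", "A00.0"),
  Diag2 = c("H10", "H28", "H00", "H15")
)

# desc is applied to every column selected by starts_with("Diag")
with_across_desc <- toy %>%
  dplyr::arrange(across(starts_with("Diag"), desc))

# The same sort with desc() written out for each column
explicit_desc <- toy %>%
  dplyr::arrange(desc(Diag1), desc(Diag2))

identical(with_across_desc$Id, explicit_desc$Id)
with_across_desc
```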