This tutorial is the first in the dplyr training series. A video version is available on YouTube.
dplyr is a great tool to use in R. The commands may look long and overwhelming to someone who has not used dplyr before, but once you learn the basics they are very intuitive.
If you are a beginner in R, or if you have experience in R but have never used dplyr or want to learn something new about it, then go ahead and watch this tutorial on YouTube.
We will cover the practical aspects of the dplyr::arrange command in this tutorial. It is part of a series of tutorials on dplyr. All the YouTube videos are available in a single playlist:
https://www.youtube.com/playlist?list=PLkHcMTpvAaXVJzyRSytUn3nSK92TJphxR
Run the following commands to create our sample dataset. This is fictitious data about hospital patients and their clinical information, such as diagnostic codes and demographic details.
library(dplyr) # we will be using this package
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#install.packages("dplyr")
# If you do not have this package, run the line above to install it. Remove the # from the front before running it.
t1 <- sample(paste0("Hospital ", toupper(letters)), size = 100, replace=TRUE)
t2 <- sample(x = c("Male", "Female") , size = 100, replace=TRUE)
t3 <- floor(runif(100, min = 0, max = 110))
t4 <- sample(x = c("Survived", "Died") , size = 100, replace=TRUE)
t5 <- sample(paste0("Facility ", toupper(letters)), size = 100, replace=TRUE)
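# Build the data frame directly from the vectors: cbind() would first coerce everything to text,
# and on R versions before 4.0 the later as.integer() call would then return factor codes instead of ages.
d <- data.frame(t1, t2, t3, t4, t5, stringsAsFactors = FALSE)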
names(d) <- c('AdmittingHospital', 'Gender', 'AgeYears', 'Outcome', 'Dischargeto')
d$Gender <- as.factor(d$Gender)
d$Outcome <- as.factor(d$Outcome)
d$AgeYears <- as.integer(d$AgeYears)
d$AgeGroup <- cut(d$AgeYears,
                  breaks = c(-Inf,
                             5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85,
                             Inf),
                  labels = c("0-4 years",
                             "5-9 years", "10-14 years", "15-19 years", "20-24 years",
                             "25-29 years", "30-34 years", "35-39 years", "40-44 years",
                             "45-49 years", "50-54 years", "55-59 years", "60-64 years",
                             "65-69 years", "70-74 years", "75-79 years", "80-84 years",
                             "85+ years"),
                  right = FALSE)
d$Diag1 <- sample(x= c("A00.0","E00.0","F01.50","G00.0","H00.011"), size = 100, replace = TRUE)
d$Diag3 <- sample(x= c("Y70","Y71","Y72","Y73","Y74"), size = 100, replace = TRUE)
d$Diag4 <- sample(x= c("G00","G01","G02","G03","G04", "G05"), size = 100, replace = TRUE)
d$Diag2 <- sample(x= c("H00","H10","H15","H16","H28"), size = 100, replace = TRUE)
d$Diag5 <- sample(x= c("E00","E01","E02","E03","E04","E05"), size = 100, replace = TRUE)
d$Diag6 <- sample(x= c("E08","E09","E10","E11","E12", "E13"), size = 100, replace = TRUE)
d$Diag7 <- sample(x= c("E40","E41","E42","E43","E44"), size = 100, replace = TRUE)
d
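Because the sample data is built with sample() and runif(), every run produces different rows. If you want the same data each time, one option is to call set.seed() before generating it. The following is a minimal sketch; the seed value 2024 and the variable names are arbitrary choices for illustration and are not part of the original tutorial.

```r
# Hypothetical seed value: any fixed integer works.
set.seed(2024)
first_draw <- sample(x = c("Male", "Female"), size = 5, replace = TRUE)

# Resetting the same seed replays the same random draws.
set.seed(2024)
second_draw <- sample(x = c("Male", "Female"), size = 5, replace = TRUE)

identical(first_draw, second_draw)
```

Placing a single set.seed() call before the line that creates t1 would, under the same assumption, make the whole data set d reproducible.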
In the first method you pass the data set as the first argument of the arrange command.
d1 <- dplyr::arrange(d, AdmittingHospital)
In the second method we name the data set on the first line and then pipe it into the arrange command with %>%.
Both methods give the same result, but I prefer the second method, shown below.
d1 <- d%>%
dplyr::arrange(AdmittingHospital)
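To check that the two forms really are interchangeable, you can compare their results. This is a small sketch on a made-up three-row data frame (toy, direct and piped are illustrative names, not part of the tutorial data).

```r
library(dplyr)

# Small hypothetical data frame, used only for illustration.
toy <- data.frame(
  AdmittingHospital = c("Hospital C", "Hospital A", "Hospital B"),
  AgeYears = c(30L, 72L, 45L)
)

# Method 1: data set passed as the first argument.
direct <- dplyr::arrange(toy, AdmittingHospital)

# Method 2: data set piped into arrange().
piped <- toy %>%
  dplyr::arrange(AdmittingHospital)

# Both calls return the same sorted data frame.
identical(direct, piped)
```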
By default arrange sorts in ascending order: the smallest number comes first and the largest last. For text, ascending order is alphabetical (A to Z) and descending order is reverse alphabetical (Z to A).
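# Sort by hospital, then by age (youngest first) within each hospital.
d2 <- d %>%
  dplyr::arrange(AdmittingHospital, AgeYears)
d2
# Sort by hospital, then by age in descending order (oldest first) within each hospital.
d3 <- d %>%
  dplyr::arrange(AdmittingHospital, desc(AgeYears))
d3
# Same result as d3: a minus sign in front of a numeric column also sorts it in descending order.
d4 <- d %>%
  dplyr::arrange(AdmittingHospital, -AgeYears)
d4
# The order of the columns matters: here age (oldest first) is the main sort key
# and hospital only breaks ties between patients of the same age.
d5 <- d %>%
  dplyr::arrange(-AgeYears, AdmittingHospital)
d5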
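The effect of the column order, and the difference between desc() and the minus sign, are easier to see on a few rows. The sketch below uses a made-up four-row data frame (toy and its columns are illustrative only). Note that the minus sign is only defined for numeric columns, while desc() also works on text.

```r
library(dplyr)

# Small hypothetical data frame, used only for illustration.
toy <- data.frame(
  Hospital = c("B", "A", "A", "B"),
  Age      = c(40, 25, 70, 55)
)

# Hospital is the main key; within each hospital the oldest comes first.
hospital_then_age <- toy %>%
  dplyr::arrange(Hospital, desc(Age))
# rows: A 70, A 25, B 55, B 40

# Age is the main key; Hospital would only break ties between equal ages.
age_then_hospital <- toy %>%
  dplyr::arrange(desc(Age), Hospital)
# rows: A 70, B 55, B 40, A 25

# desc() on a text column sorts it from Z to A; -Hospital would raise an error.
text_desc <- toy %>%
  dplyr::arrange(desc(Hospital), Age)
# rows: B 40, B 55, A 25, A 70
```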
What does group_by achieve? In the example below we want to see the oldest male and the oldest female who died, as well as the oldest male and the oldest female who survived. So we group the data by Gender and Outcome with group_by and then arrange it in descending order of AgeYears. By default arrange ignores the grouping and sorts the whole data set; specifying .by_group = TRUE makes it sort by the grouping variables first, so the rows of each group stay together. slice(1) then keeps the first row of each group, which is the oldest patient in that group.
d6 <- d%>%
dplyr::group_by(Gender,Outcome)%>%
dplyr::arrange(desc(AgeYears), .by_group = TRUE)%>%
slice(1)
d6
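To see what .by_group changes, it helps to compare both settings on a handful of rows. The following sketch uses a made-up four-row data frame (toy, grouped, plain, by_group and oldest are illustrative names).

```r
library(dplyr)

# Small hypothetical data frame, used only for illustration.
toy <- data.frame(
  Gender = c("Male", "Female", "Male", "Female"),
  Age    = c(30, 80, 65, 20)
)

grouped <- toy %>%
  dplyr::group_by(Gender)

# Without .by_group: sorted by Age only, across all rows.
plain <- grouped %>%
  dplyr::arrange(desc(Age))
# Age: 80, 65, 30, 20

# With .by_group = TRUE: sorted by Gender first, then by Age within Gender.
by_group <- grouped %>%
  dplyr::arrange(desc(Age), .by_group = TRUE)
# Gender: Female, Female, Male, Male; Age: 80, 20, 65, 30

# slice(1) keeps the first row of each group, here the oldest per Gender.
oldest <- by_group %>%
  dplyr::slice(1)
# rows: Female 80, Male 65
```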
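In this example we use the starts_with helper inside across to say that we want to sort by every column whose name starts with the text Diag. The matching columns are used as sort keys in the order in which they appear in the data set (here Diag1, Diag3, Diag4, Diag2, Diag5, Diag6, Diag7), so we do not have to list them one by one.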
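# Sort by all Diag columns in ascending order.
d7 <- d %>%
  dplyr::arrange(across(starts_with("Diag")))
d7
# Passing desc as the second argument of across() sorts every selected column in descending order.
d8 <- d %>%
  dplyr::arrange(across(starts_with("Diag"), desc))
d8
# contains("Year") selects the columns with "Year" in their name (here AgeYears), so this sorts oldest first.
d9 <- d %>%
  dplyr::arrange(across(contains("Year"), desc))
d9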
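On a small example it is easy to confirm that across() inside arrange behaves like listing the selected columns by hand. This is a sketch on a made-up data frame (toy, Id, asc and dsc are illustrative names, and the Diag values are placeholders rather than real diagnostic codes).

```r
library(dplyr)

# Small hypothetical data frame, used only for illustration.
toy <- data.frame(
  Id    = 1:4,
  Diag1 = c("B", "A", "B", "A"),
  Diag2 = c("X", "Y", "W", "Z")
)

# Behaves like arrange(Diag1, Diag2).
asc <- toy %>%
  dplyr::arrange(across(starts_with("Diag")))
# Id: 2, 4, 3, 1

# Behaves like arrange(desc(Diag1), desc(Diag2)).
dsc <- toy %>%
  dplyr::arrange(across(starts_with("Diag"), desc))
# Id: 1, 3, 4, 2

# Compare with the hand-written version.
by_hand <- toy %>%
  dplyr::arrange(Diag1, Diag2)
identical(asc$Id, by_hand$Id)
```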