Data Management using dplyr Package in R

Ph.D. Course Work - Computer Application

Lecture Material Part 1

———————————————————————–

What is dplyr?
* package that contains many functions  to  manipulate unstructured data.
* Developed by Hadley Wickham
Is it mandatory to have dplyr package to manipulate the data in R ?
* Absolutely Not.
* Base R functionality can be used.
* dplyr functions are computationally more efficient and syntactically stable -- very useful while working with dataframes 
Important verbs/functions in dplyr:
* select() :  for selecting columns
* slice()  :  for selecting rows
* filter() :  for selecting subsets of rows
* mutute() :  for creating new columnsrows  
* arrange():  for arranging rows
* summarize   
  using  
  group_by : For summarizing the data
#Remove the environment variable
rm(list=ls())
#Load Packages
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Load Compensation Data File
library(readr)
compensation = read_csv("D:\\R Course\\DataAnalysis\\Data\\datasets-master\\compensation.csv")
## Rows: 40 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Grazing
## dbl (2): Root, Fruit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#View the data loaded
View(compensation)
#Dimension of the Data File 
dim(compensation)
## [1] 40  3
#Display first 6 rows of the data frame
head(compensation)
## # A tibble: 6 × 3
##    Root Fruit Grazing 
##   <dbl> <dbl> <chr>   
## 1  6.22  59.8 Ungrazed
## 2  6.49  61.0 Ungrazed
## 3  4.92  14.7 Ungrazed
## 4  5.13  19.3 Ungrazed
## 5  5.42  34.2 Ungrazed
## 6  5.36  35.5 Ungrazed
#Display first 6 rows of the data frame 
tail(compensation)
## # A tibble: 6 × 3
##    Root Fruit Grazing
##   <dbl> <dbl> <chr>  
## 1  9.35  98.5 Grazed 
## 2  7.07  40.2 Grazed 
## 3  8.16  52.3 Grazed 
## 4  7.38  46.6 Grazed 
## 5  8.52  71.0 Grazed 
## 6  8.53  83.0 Grazed
#Column names of the data file
names(compensation)
## [1] "Root"    "Fruit"   "Grazing"
select()
  • Displays the variables/columns of the dataset

  • Used for sub-setting data by selecting variables

  • Syntax of select()

    • select(.data,…)
      • .data
        • data frame, data frame extension (a tibble)
        • One or more unquoted expressions separated by commas.
        • Variable names can be used as if they were positions in the data frame,
        • Expressions like x:y can be used to select a range of variables.
  • Takes dataframe as input and produce datframe as output

  • Different options:

    1. Select variables by name
    2. Select variables by index position
    3. Deselect/Drop the variables
    4. Select variables based on some pattern
    5. Select all numeric variables
  • Using pipe operator (%>%) multiple operations could be done sequentially from left to right. The pipe operator is defined in magrittr package which is installed automatically while installing dplyr package.

#Example-1: Select variables by name

select(compensation,Fruit,Grazing) 
## # A tibble: 40 × 2
##    Fruit Grazing 
##    <dbl> <chr>   
##  1  59.8 Ungrazed
##  2  61.0 Ungrazed
##  3  14.7 Ungrazed
##  4  19.3 Ungrazed
##  5  34.2 Ungrazed
##  6  35.5 Ungrazed
##  7  87.7 Ungrazed
##  8  63.2 Ungrazed
##  9  24.2 Ungrazed
## 10  64.3 Ungrazed
## # ℹ 30 more rows
names(compensation)
## [1] "Root"    "Fruit"   "Grazing"
select(compensation,Fruit,Grazing)
## # A tibble: 40 × 2
##    Fruit Grazing 
##    <dbl> <chr>   
##  1  59.8 Ungrazed
##  2  61.0 Ungrazed
##  3  14.7 Ungrazed
##  4  19.3 Ungrazed
##  5  34.2 Ungrazed
##  6  35.5 Ungrazed
##  7  87.7 Ungrazed
##  8  63.2 Ungrazed
##  9  24.2 Ungrazed
## 10  64.3 Ungrazed
## # ℹ 30 more rows
#Using pipe operator

compensation%>% select(Fruit,Grazing)
## # A tibble: 40 × 2
##    Fruit Grazing 
##    <dbl> <chr>   
##  1  59.8 Ungrazed
##  2  61.0 Ungrazed
##  3  14.7 Ungrazed
##  4  19.3 Ungrazed
##  5  34.2 Ungrazed
##  6  35.5 Ungrazed
##  7  87.7 Ungrazed
##  8  63.2 Ungrazed
##  9  24.2 Ungrazed
## 10  64.3 Ungrazed
## # ℹ 30 more rows
#Example-2: Select variables by index position

select(compensation,c(1,3))
## # A tibble: 40 × 2
##     Root Grazing 
##    <dbl> <chr>   
##  1  6.22 Ungrazed
##  2  6.49 Ungrazed
##  3  4.92 Ungrazed
##  4  5.13 Ungrazed
##  5  5.42 Ungrazed
##  6  5.36 Ungrazed
##  7  7.61 Ungrazed
##  8  6.35 Ungrazed
##  9  4.97 Ungrazed
## 10  6.93 Ungrazed
## # ℹ 30 more rows
#Using pipe operator

compensation%>%select(1, 3)
## # A tibble: 40 × 2
##     Root Grazing 
##    <dbl> <chr>   
##  1  6.22 Ungrazed
##  2  6.49 Ungrazed
##  3  4.92 Ungrazed
##  4  5.13 Ungrazed
##  5  5.42 Ungrazed
##  6  5.36 Ungrazed
##  7  7.61 Ungrazed
##  8  6.35 Ungrazed
##  9  4.97 Ungrazed
## 10  6.93 Ungrazed
## # ℹ 30 more rows
#selecting by index range

compensation%>%select(1:3)
## # A tibble: 40 × 3
##     Root Fruit Grazing 
##    <dbl> <dbl> <chr>   
##  1  6.22  59.8 Ungrazed
##  2  6.49  61.0 Ungrazed
##  3  4.92  14.7 Ungrazed
##  4  5.13  19.3 Ungrazed
##  5  5.42  34.2 Ungrazed
##  6  5.36  35.5 Ungrazed
##  7  7.61  87.7 Ungrazed
##  8  6.35  63.2 Ungrazed
##  9  4.97  24.2 Ungrazed
## 10  6.93  64.3 Ungrazed
## # ℹ 30 more rows
#Example-3: Deselect/Drop the variables

compensation%>% select(!Fruit)
## # A tibble: 40 × 2
##     Root Grazing 
##    <dbl> <chr>   
##  1  6.22 Ungrazed
##  2  6.49 Ungrazed
##  3  4.92 Ungrazed
##  4  5.13 Ungrazed
##  5  5.42 Ungrazed
##  6  5.36 Ungrazed
##  7  7.61 Ungrazed
##  8  6.35 Ungrazed
##  9  4.97 Ungrazed
## 10  6.93 Ungrazed
## # ℹ 30 more rows
compensation%>% select(!3)
## # A tibble: 40 × 2
##     Root Fruit
##    <dbl> <dbl>
##  1  6.22  59.8
##  2  6.49  61.0
##  3  4.92  14.7
##  4  5.13  19.3
##  5  5.42  34.2
##  6  5.36  35.5
##  7  7.61  87.7
##  8  6.35  63.2
##  9  4.97  24.2
## 10  6.93  64.3
## # ℹ 30 more rows
#Example 4: Select  variables based on some pattern  

compensation%>% select(starts_with("Fr"))
## # A tibble: 40 × 1
##    Fruit
##    <dbl>
##  1  59.8
##  2  61.0
##  3  14.7
##  4  19.3
##  5  34.2
##  6  35.5
##  7  87.7
##  8  63.2
##  9  24.2
## 10  64.3
## # ℹ 30 more rows
compensation%>% select(ends_with("t"))
## # A tibble: 40 × 2
##     Root Fruit
##    <dbl> <dbl>
##  1  6.22  59.8
##  2  6.49  61.0
##  3  4.92  14.7
##  4  5.13  19.3
##  5  5.42  34.2
##  6  5.36  35.5
##  7  7.61  87.7
##  8  6.35  63.2
##  9  4.97  24.2
## 10  6.93  64.3
## # ℹ 30 more rows
compensation%>% select(contains("a"))
## # A tibble: 40 × 1
##    Grazing 
##    <chr>   
##  1 Ungrazed
##  2 Ungrazed
##  3 Ungrazed
##  4 Ungrazed
##  5 Ungrazed
##  6 Ungrazed
##  7 Ungrazed
##  8 Ungrazed
##  9 Ungrazed
## 10 Ungrazed
## # ℹ 30 more rows
compensation%>% select(starts_with("Fr") & ends_with("t"))
## # A tibble: 40 × 1
##    Fruit
##    <dbl>
##  1  59.8
##  2  61.0
##  3  14.7
##  4  19.3
##  5  34.2
##  6  35.5
##  7  87.7
##  8  63.2
##  9  24.2
## 10  64.3
## # ℹ 30 more rows
compensation%>% select(starts_with("Fr") | ends_with("t"))
## # A tibble: 40 × 2
##    Fruit  Root
##    <dbl> <dbl>
##  1  59.8  6.22
##  2  61.0  6.49
##  3  14.7  4.92
##  4  19.3  5.13
##  5  34.2  5.42
##  6  35.5  5.36
##  7  87.7  7.61
##  8  63.2  6.35
##  9  24.2  4.97
## 10  64.3  6.93
## # ℹ 30 more rows
#Example 5: Select all numeric variables
compensation%>%select_if(is.numeric)
## # A tibble: 40 × 2
##     Root Fruit
##    <dbl> <dbl>
##  1  6.22  59.8
##  2  6.49  61.0
##  3  4.92  14.7
##  4  5.13  19.3
##  5  5.42  34.2
##  6  5.36  35.5
##  7  7.61  87.7
##  8  6.35  63.2
##  9  4.97  24.2
## 10  6.93  64.3
## # ℹ 30 more rows
slice()
  • Displays cases/rows by their (integer) locations.
  • Number of common helpers for :
    • slice_head() : select the first rows
    • slice_tail() : select the last rows.
    • slice_sample(): select rows randomly with or without replacement.
    • slice_min() : select rows with the smallest value of a variable.
    • slice_max() : select rows with the largest values of a variable.
  • Different options:
    1. Select specific row by index position.
    2. Select sequence of rows by index positions.
    3. Select discontinuous set of rows
    4. Drop rows
#select the first rows 

compensation%>%slice_head()
## # A tibble: 1 × 3
##    Root Fruit Grazing 
##   <dbl> <dbl> <chr>   
## 1  6.22  59.8 Ungrazed
compensation%>%slice_head(n=5)
## # A tibble: 5 × 3
##    Root Fruit Grazing 
##   <dbl> <dbl> <chr>   
## 1  6.22  59.8 Ungrazed
## 2  6.49  61.0 Ungrazed
## 3  4.92  14.7 Ungrazed
## 4  5.13  19.3 Ungrazed
## 5  5.42  34.2 Ungrazed
#select the last rows

compensation%>%slice_tail()
## # A tibble: 1 × 3
##    Root Fruit Grazing
##   <dbl> <dbl> <chr>  
## 1  8.53  83.0 Grazed
compensation%>%slice_tail(n=5)
## # A tibble: 5 × 3
##    Root Fruit Grazing
##   <dbl> <dbl> <chr>  
## 1  7.07  40.2 Grazed 
## 2  8.16  52.3 Grazed 
## 3  7.38  46.6 Grazed 
## 4  8.52  71.0 Grazed 
## 5  8.53  83.0 Grazed
# selects rows randomly with or without replacement

compensation%>%slice_sample(n=5)
## # A tibble: 5 × 3
##    Root Fruit Grazing 
##   <dbl> <dbl> <chr>   
## 1  7.92  41.5 Grazed  
## 2  5.93  54.9 Ungrazed
## 3  5.84  46.7 Ungrazed
## 4  8.53  83.0 Grazed  
## 5  6.93  64.3 Ungrazed
compensation%>%slice_sample(n=5,replace = TRUE)
## # A tibble: 5 × 3
##    Root Fruit Grazing 
##   <dbl> <dbl> <chr>   
## 1  8.91  70.1 Grazed  
## 2  6.01  53.6 Ungrazed
## 3  7.69  70.7 Grazed  
## 4  5.13  19.3 Ungrazed
## 5  7.35  50.1 Grazed
# selects rows based on the minimum or maximum value of any variable

compensation%>%slice_min(Fruit,n=5)
## # A tibble: 5 × 3
##    Root Fruit Grazing 
##   <dbl> <dbl> <chr>   
## 1  4.92  14.7 Ungrazed
## 2  6.11  15.0 Grazed  
## 3  4.43  18.9 Ungrazed
## 4  5.13  19.3 Ungrazed
## 5  4.97  24.2 Ungrazed
compensation%>%slice_max(Fruit,n=5)
## # A tibble: 5 × 3
##    Root Fruit Grazing 
##   <dbl> <dbl> <chr>   
## 1 10.3  116.  Grazed  
## 2  9.84 105.  Grazed  
## 3  9.35  98.5 Grazed  
## 4  7.61  87.7 Ungrazed
## 5  9.04  84.4 Grazed
#Select a specific row, sequence of rows, discontinuous set of rows from the data frame

compensation%>%slice(2)
## # A tibble: 1 × 3
##    Root Fruit Grazing 
##   <dbl> <dbl> <chr>   
## 1  6.49  61.0 Ungrazed
compensation%>%slice(2:6)
## # A tibble: 5 × 3
##    Root Fruit Grazing 
##   <dbl> <dbl> <chr>   
## 1  6.49  61.0 Ungrazed
## 2  4.92  14.7 Ungrazed
## 3  5.13  19.3 Ungrazed
## 4  5.42  34.2 Ungrazed
## 5  5.36  35.5 Ungrazed
compensation%>%slice(2,6,9)
## # A tibble: 3 × 3
##    Root Fruit Grazing 
##   <dbl> <dbl> <chr>   
## 1  6.49  61.0 Ungrazed
## 2  5.36  35.5 Ungrazed
## 3  4.97  24.2 Ungrazed
#Dropping rows

compensation%>%slice(-(3:8))
## # A tibble: 34 × 3
##     Root Fruit Grazing 
##    <dbl> <dbl> <chr>   
##  1  6.22  59.8 Ungrazed
##  2  6.49  61.0 Ungrazed
##  3  4.97  24.2 Ungrazed
##  4  6.93  64.3 Ungrazed
##  5  6.25  52.9 Ungrazed
##  6  5.45  32.4 Ungrazed
##  7  6.01  53.6 Ungrazed
##  8  5.93  54.9 Ungrazed
##  9  6.26  64.8 Ungrazed
## 10  7.18  73.2 Ungrazed
## # ℹ 24 more rows
filter()
  • Subset a data frame, retaining all rows that satisfy the conditions.
  • Many logical/Boolean operators are useful when constructing the expressions used to filter the data:
Logical Operator Meaning Third Header
== Equals finds rows where condition is met.
!= Doesn’t Equal finds rows where condition is NOT met.
<,>,<=,>= Relational finds rows where relation is TRUE.
OR
& AND

`

#Filter based on logical operator

compensation%>%filter(Fruit>80)
## # A tibble: 9 × 3
##    Root Fruit Grazing 
##   <dbl> <dbl> <chr>   
## 1  7.61  87.7 Ungrazed
## 2  7.00  80.6 Ungrazed
## 3 10.3  116.  Grazed  
## 4  9.04  84.4 Grazed  
## 5  8.99  80.3 Grazed  
## 6  8.98  82.4 Grazed  
## 7  9.84 105.  Grazed  
## 8  9.35  98.5 Grazed  
## 9  8.53  83.0 Grazed
compensation%>%filter(Fruit>80 | Fruit<20)
## # A tibble: 13 × 3
##     Root Fruit Grazing 
##    <dbl> <dbl> <chr>   
##  1  4.92  14.7 Ungrazed
##  2  5.13  19.3 Ungrazed
##  3  7.61  87.7 Ungrazed
##  4  7.00  80.6 Ungrazed
##  5  4.43  18.9 Ungrazed
##  6 10.3  116.  Grazed  
##  7  9.04  84.4 Grazed  
##  8  6.11  15.0 Grazed  
##  9  8.99  80.3 Grazed  
## 10  8.98  82.4 Grazed  
## 11  9.84 105.  Grazed  
## 12  9.35  98.5 Grazed  
## 13  8.53  83.0 Grazed
compensation%>%filter(Fruit>mean(Fruit))
## # A tibble: 22 × 3
##     Root Fruit Grazing 
##    <dbl> <dbl> <chr>   
##  1  6.22  59.8 Ungrazed
##  2  6.49  61.0 Ungrazed
##  3  7.61  87.7 Ungrazed
##  4  6.35  63.2 Ungrazed
##  5  6.93  64.3 Ungrazed
##  6  6.26  64.8 Ungrazed
##  7  7.18  73.2 Ungrazed
##  8  7.00  80.6 Ungrazed
##  9  7.30  75.5 Ungrazed
## 10 10.3  116.  Grazed  
## # ℹ 12 more rows
compensation%>%filter(Fruit>80 & Grazing=="Ungrazed")
## # A tibble: 2 × 3
##    Root Fruit Grazing 
##   <dbl> <dbl> <chr>   
## 1  7.61  87.7 Ungrazed
## 2  7.00  80.6 Ungrazed
mutate():
  • Creates new column using some transformation.
  • Create a new data frame and assign it to the older one.
compensation%>%mutate(frt_rt=Fruit/Root)
## # A tibble: 40 × 4
##     Root Fruit Grazing  frt_rt
##    <dbl> <dbl> <chr>     <dbl>
##  1  6.22  59.8 Ungrazed   9.60
##  2  6.49  61.0 Ungrazed   9.40
##  3  4.92  14.7 Ungrazed   2.99
##  4  5.13  19.3 Ungrazed   3.76
##  5  5.42  34.2 Ungrazed   6.32
##  6  5.36  35.5 Ungrazed   6.63
##  7  7.61  87.7 Ungrazed  11.5 
##  8  6.35  63.2 Ungrazed   9.95
##  9  4.97  24.2 Ungrazed   4.87
## 10  6.93  64.3 Ungrazed   9.28
## # ℹ 30 more rows
compensation%>%mutate(logfrt=log(Fruit))
## # A tibble: 40 × 4
##     Root Fruit Grazing  logfrt
##    <dbl> <dbl> <chr>     <dbl>
##  1  6.22  59.8 Ungrazed   4.09
##  2  6.49  61.0 Ungrazed   4.11
##  3  4.92  14.7 Ungrazed   2.69
##  4  5.13  19.3 Ungrazed   2.96
##  5  5.42  34.2 Ungrazed   3.53
##  6  5.36  35.5 Ungrazed   3.57
##  7  7.61  87.7 Ungrazed   4.47
##  8  6.35  63.2 Ungrazed   4.15
##  9  4.97  24.2 Ungrazed   3.19
## 10  6.93  64.3 Ungrazed   4.16
## # ℹ 30 more rows
arrange():
  • Used for sorting the rows in a particular order
  • Sorting is possible for multiple variable also
#By default sorting is in ascending order
compensation%>%arrange(Fruit)
## # A tibble: 40 × 3
##     Root Fruit Grazing 
##    <dbl> <dbl> <chr>   
##  1  4.92  14.7 Ungrazed
##  2  6.11  15.0 Grazed  
##  3  4.43  18.9 Ungrazed
##  4  5.13  19.3 Ungrazed
##  5  4.97  24.2 Ungrazed
##  6  5.45  32.4 Ungrazed
##  7  5.42  34.2 Ungrazed
##  8  5.36  35.5 Ungrazed
##  9  6.96  38.9 Grazed  
## 10  7.07  40.2 Grazed  
## # ℹ 30 more rows
#The order can be changed
compensation%>%arrange(desc(Fruit))
## # A tibble: 40 × 3
##     Root Fruit Grazing 
##    <dbl> <dbl> <chr>   
##  1 10.3  116.  Grazed  
##  2  9.84 105.  Grazed  
##  3  9.35  98.5 Grazed  
##  4  7.61  87.7 Ungrazed
##  5  9.04  84.4 Grazed  
##  6  8.53  83.0 Grazed  
##  7  8.98  82.4 Grazed  
##  8  7.00  80.6 Ungrazed
##  9  8.99  80.3 Grazed  
## 10  8.64  78.3 Grazed  
## # ℹ 30 more rows
#Sorting based on multiple variables
compensation%>%arrange(Fruit,desc(Root))
## # A tibble: 40 × 3
##     Root Fruit Grazing 
##    <dbl> <dbl> <chr>   
##  1  4.92  14.7 Ungrazed
##  2  6.11  15.0 Grazed  
##  3  4.43  18.9 Ungrazed
##  4  5.13  19.3 Ungrazed
##  5  4.97  24.2 Ungrazed
##  6  5.45  32.4 Ungrazed
##  7  5.42  34.2 Ungrazed
##  8  5.36  35.5 Ungrazed
##  9  6.96  38.9 Grazed  
## 10  7.07  40.2 Grazed  
## # ℹ 30 more rows
summarize():

Summarization is accomplished in a series of steps. They are:

  1. Declare the data frame and what the grouping variable is.
  2. Provide some kind of maths function with which to summarize the data (e.g. mean() or sd()).
  3. Provide a nice name for the values returned.
  4. Make R use all of this information.
compensation%>%group_by(Grazing)%>%summarize(mean_fruit=mean(Fruit))
## # A tibble: 2 × 2
##   Grazing  mean_fruit
##   <chr>         <dbl>
## 1 Grazed         67.9
## 2 Ungrazed       50.9
compensation%>%group_by(Grazing)%>%summarize(mean_fruit=mean(Fruit),sd_fruit=sd(Fruit))
## # A tibble: 2 × 3
##   Grazing  mean_fruit sd_fruit
##   <chr>         <dbl>    <dbl>
## 1 Grazed         67.9     25.0
## 2 Ungrazed       50.9     21.8
Base R vs. dplyr
Operation dplyr Function Base R Command
select some rows comepensation>%>slice(2,6,8) compensation[c(2,6,8),]
select some columns comepensation>%>select(Root,Fruit) compensation[,c(“Root”,“Fruit”)]
subset comepensation>%>filter(Fruit>80) compensation[compensation$Fruit>80,]
Ordering comepensation>%>arrange(Fruit) compensation[order(compensation$Fruit),]
Add a column compensation>%>mutate(frt_rt=Fruit/Root) compensation$frt_rt=compensation$Fruit/compensation$Root
summarize data compensation>%>group_by(Grazing)>%>%summarize(mean_fruit=mean(Fruit)) tapply(compensation$Fruit,list(compensation$Grazing),mean)