If dplyr package is not installed then install it first
by running the command install.packages('dplyr').
Then import the package:
> library(dplyr)
dplyr has some cool and useful functions-
To know the structure of the data set, dplyr has the function
glimpse() which is faster than the base R’s function
str().
Let’s take the built in data set mtcars for
examples:
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
To filter out rows where cyl is equal to 4:
> filter(mtcars, cyl == 4)
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Using pipe operator the above code can be written in the form:
> mtcars %>% filter(cyl == 4)
To know no of observations where cyl = 4:
> length(filter(mtcars, cyl == 4))
[1] 11
Filtering with multiple conditions:
> filter(mtcars, cyl == 4, hp > 96)
mpg cyl disp hp drat wt qsec vs am gear carb
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Using pipe operator the above code can be written in the form:
> mtcars %>%
+ filter(cyl == 4, hp > 96)
mpg cyl disp hp drat wt qsec vs am gear carb
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
> mtcars %>% filter(hp>=90, hp<=100)
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
> mtcars %>% filter(between(hp, 90, 100)) # same as above
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
While filtering, sometime we may need to omit the rows with NA
values. In those cases is.na() can be useful:
> starwars %>%
+ filter(!is.na(hair_color)) %>%
+ head(4)
# A tibble: 4 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
2 Darth Va… 202 136 none white yellow 41.9 male mascu…
3 Leia Org… 150 49 brown light brown 19 fema… femin…
4 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Now in the column hair_color, there is no NA values. Note: In this process the data set is not replaced. We actually need to store it by assigning into a variable:
> starwars_noNA <- starwars %>%
+ filter(!is.na(hair_color)) %>%
+ head(4)
slice() filters rows by position:
> esoph %>%
+ slice(3:7) #Shows rows from 3rd to 7th position
agegp alcgp tobgp ncases ncontrols
1 25-34 0-39g/day 20-29 0 6
2 25-34 0-39g/day 30+ 0 5
3 25-34 40-79 0-9g/day 0 27
4 25-34 40-79 10-19 0 7
5 25-34 40-79 20-29 0 4
Draw random samples from the data:
> esoph2 <- esoph %>%
+ slice_sample(n = 10)
Let’s see the structure of the mtcars data set first:
> glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
To select columns mpg, cyl and hp:
> select(mtcars, mpg, cyl, hp) %>%
+ head(8) # to show first 8 observations
mpg cyl hp
Mazda RX4 21.0 6 110
Mazda RX4 Wag 21.0 6 110
Datsun 710 22.8 4 93
Hornet 4 Drive 21.4 6 110
Hornet Sportabout 18.7 8 175
Valiant 18.1 6 105
Duster 360 14.3 8 245
Merc 240D 24.4 4 62
Or,
> mtcars %>%
+ select(mpg, cyl, hp) %>%
+ head(8)
mpg cyl hp
Mazda RX4 21.0 6 110
Mazda RX4 Wag 21.0 6 110
Datsun 710 22.8 4 93
Hornet 4 Drive 21.4 6 110
Hornet Sportabout 18.7 8 175
Valiant 18.1 6 105
Duster 360 14.3 8 245
Merc 240D 24.4 4 62
Using filter and select:
> mtcars %>%
+ select(mpg, cyl, hp) %>%
+ filter(cyl==4)
mpg cyl hp
Datsun 710 22.8 4 93
Merc 240D 24.4 4 62
Merc 230 22.8 4 95
Fiat 128 32.4 4 66
Honda Civic 30.4 4 52
Toyota Corolla 33.9 4 65
Toyota Corona 21.5 4 97
Fiat X1-9 27.3 4 66
Porsche 914-2 26.0 4 91
Lotus Europa 30.4 4 113
Volvo 142E 21.4 4 109
We can delete any columns using the select function by putting minus (-) before the column name:
Let’s use the esoph data set for this:
> glimpse(esoph)
Rows: 88
Columns: 5
$ agegp <ord> 25-34, 25-34, 25-34, 25-34, 25-34, 25-34, 25-34, 25-34, 25-3…
$ alcgp <ord> 0-39g/day, 0-39g/day, 0-39g/day, 0-39g/day, 40-79, 40-79, 40…
$ tobgp <ord> 0-9g/day, 10-19, 20-29, 30+, 0-9g/day, 10-19, 20-29, 30+, 0-…
$ ncases <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, …
$ ncontrols <dbl> 40, 10, 6, 5, 27, 7, 4, 7, 2, 1, 2, 1, 0, 1, 2, 60, 13, 7, 8…
The following code will hide alcgp and ncontrols from the data set:
> select(esoph, -alcgp, -ncontrols) %>%
+ sample_n(5) # takes 5 random rows from the data set
agegp tobgp ncases
1 55-64 20-29 3
2 25-34 0-9g/day 0
3 65-74 0-9g/day 3
4 45-54 10-19 3
5 65-74 30+ 1
select() can be used to rename columns. But the problem
is all columns that are not mentioned are dropped:
> esoph %>%
+ select(age_group = agegp) %>%
+ head(5)
age_group
1 25-34
2 25-34
3 25-34
4 25-34
5 25-34
rename() does the same thing, but in this case all other
columns are not dropped:
> esoph %>%
+ rename(age_group = agegp) %>%
+ head(5)
age_group alcgp tobgp ncases ncontrols
1 25-34 0-39g/day 0-9g/day 0 40
2 25-34 0-39g/day 10-19 0 10
3 25-34 0-39g/day 20-29 0 6
4 25-34 0-39g/day 30+ 0 5
5 25-34 40-79 0-9g/day 0 27
Factor variable columns:
> select_if(esoph, is.factor) %>%
+ head()
agegp alcgp tobgp
1 25-34 0-39g/day 0-9g/day
2 25-34 0-39g/day 10-19
3 25-34 0-39g/day 20-29
4 25-34 0-39g/day 30+
5 25-34 40-79 0-9g/day
6 25-34 40-79 10-19
Numeric variable columns:
> esoph %>%
+ select_if(is.numeric) %>%
+ head()
ncases ncontrols
1 0 40
2 0 10
3 0 6
4 0 5
5 0 27
6 0 7
__Note:__ Although the select_if method works fine, the
base R’s Filter() (Notice: not filter() from dplyr, it’s Filter() with
capital F) is way much faster than this. Following is an example:
> library(microbenchmark)
> microbenchmark(
+ dplyr::select_if(esoph, is.factor),
+ Filter(is.factor, esoph)
+ )
Unit: microseconds
expr min lq mean median uq
dplyr::select_if(esoph, is.factor) 4027.4 4172.3 4627.727 4385.30 4801.60
Filter(is.factor, esoph) 24.3 28.9 39.782 39.75 42.85
max neval cld
6535.7 100 a
144.2 100 b
Let’s use starwars data set for now:
> head(starwars)
# A tibble: 6 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth Va… 202 136 none white yellow 41.9 male mascu…
5 Leia Org… 150 49 brown light brown 19 fema… femin…
6 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Here we will use the columns height and mass to calculate bmi and add this variable to the data set:
> starwars %>%
+ mutate(bmi= mass/((height/100)^2)) %>%
+ select(name, height, mass, bmi) %>%
+ filter(bmi<22, bmi>18)
# A tibble: 8 × 4
name height mass bmi
<chr> <int> <dbl> <dbl>
1 Leia Organa 150 49 21.8
2 Chewbacca 228 112 21.5
3 Ki-Adi-Mundi 198 82 20.9
4 Luminara Unduli 170 56.2 19.4
5 Barriss Offee 166 50 18.1
6 Dooku 193 80 21.5
7 Zam Wesell 168 55 19.5
8 Tion Medon 206 80 18.9
It only keeps the new variables and the given variables given inside the function unlike the mutate function:
> starwars %>%
+ transmute(name, bmi= mass/((height/100)^2))
# A tibble: 87 × 2
name bmi
<chr> <dbl>
1 Luke Skywalker 26.0
2 C-3PO 26.9
3 R2-D2 34.7
4 Darth Vader 33.3
5 Leia Organa 21.8
6 Owen Lars 37.9
7 Beru Whitesun Lars 27.5
8 R5-D4 34.0
9 Biggs Darklighter 25.1
10 Obi-Wan Kenobi 23.2
# ℹ 77 more rows
To replace something based on conditions in columns -
> starwars %>%
+ transmute(name,
+ bmi= mass/((height/100)^2),
+ out_of_range = replace(bmi, bmi > 30, NA))
# A tibble: 87 × 3
name bmi out_of_range
<chr> <dbl> <dbl>
1 Luke Skywalker 26.0 26.0
2 C-3PO 26.9 26.9
3 R2-D2 34.7 NA
4 Darth Vader 33.3 NA
5 Leia Organa 21.8 21.8
6 Owen Lars 37.9 NA
7 Beru Whitesun Lars 27.5 27.5
8 R5-D4 34.0 NA
9 Biggs Darklighter 25.1 25.1
10 Obi-Wan Kenobi 23.2 23.2
# ℹ 77 more rows
Ascending order of disp -
> mtcars %>%
+ arrange(disp)
mpg cyl disp hp drat wt qsec vs am gear carb
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Descending order of disp -
> mtcars %>%
+ arrange(desc(disp))
mpg cyl disp hp drat wt qsec vs am gear carb
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Ascending order of cyl, then descending order of disp -
> mtcars %>%
+ arrange(cyl, desc(disp))
mpg cyl disp hp drat wt qsec vs am gear carb
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
From the mtcars data set, if we want to know how many
groups of cyl are there in the data set and the no of cars in each group
along with their average mpg, then the code can be written as
follows:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise(Freq_cyl = n(),
+ Mean_mpg = mean(mpg))
# A tibble: 3 × 3
cyl Freq_cyl Mean_mpg
<dbl> <int> <dbl>
1 4 11 26.7
2 6 7 19.7
3 8 14 15.1
Here we have found that there is 3 groups in cyl and each group has different no. of cars in it.
Another example can be:
> starwars %>%
+ group_by(hair_color) %>%
+ summarise(Average_height = round(mean(height, na.rm=T),2)) %>%
+ arrange(desc(Average_height))
# A tibble: 12 × 2
hair_color Average_height
<chr> <dbl>
1 auburn, white 182
2 none 181.
3 auburn, grey 180
4 brown, grey 178
5 blond 177.
6 brown 177.
7 black 174.
8 grey 170
9 blonde 168
10 white 156
11 auburn 150
12 <NA> 142.
> starwars %>%
+ group_by(hair_color) %>%
+ summarise(Average_height = round(mean(height, na.rm=T),2),
+ Count = n())
# A tibble: 12 × 3
hair_color Average_height Count
<chr> <dbl> <int>
1 auburn 150 1
2 auburn, grey 180 1
3 auburn, white 182 1
4 black 174. 13
5 blond 177. 3
6 blonde 168 1
7 brown 177. 18
8 brown, grey 178 1
9 grey 170 1
10 none 181. 38
11 white 156 4
12 <NA> 142. 5
Some other functions used in summarize are: sum(), mean(), median(),sd(),IQR(),min(),max(),first(),last(),nth() etc.
Another example of using only group_by():
Let’s use the esoph data set for this:
> head(esoph)
agegp alcgp tobgp ncases ncontrols
1 25-34 0-39g/day 0-9g/day 0 40
2 25-34 0-39g/day 10-19 0 10
3 25-34 0-39g/day 20-29 0 6
4 25-34 0-39g/day 30+ 0 5
5 25-34 40-79 0-9g/day 0 27
6 25-34 40-79 10-19 0 7
Here is a code that creates two way table:
> esoph %>%
+ group_by(agegp) %>%
+ select(agegp, alcgp) %>%
+ table()
alcgp
agegp 0-39g/day 40-79 80-119 120+
25-34 4 4 3 4
35-44 4 4 4 3
45-54 4 4 4 4
55-64 4 4 4 4
65-74 4 3 4 4
75+ 3 4 2 2
Same summary statistics on multiple columns -
> mtcars %>%
+ group_by(gear) %>%
+ summarize(across(c(count_hp = hp, count_disp = disp),
+ ~ n()))
# A tibble: 3 × 3
gear count_hp count_disp
<dbl> <int> <int>
1 3 15 15
2 4 12 12
3 5 5 5
Functions can be passed in across as purrr-style lambda, e.g. ~ mean(.x, na.rm = TRUE):
> mtcars %>%
+ group_by(gear) %>%
+ summarize(across(c(hp, disp),
+ c(Freq = ~ n(), Mean = ~mean(., na.rm=TRUE))))
# A tibble: 3 × 5
gear hp_Freq hp_Mean disp_Freq disp_Mean
<dbl> <int> <dbl> <int> <dbl>
1 3 15 176. 15 326.
2 4 12 89.5 12 123.
3 5 5 196. 5 202.
Functions can be passed in across as a list of functions/lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x)):
> mtcars %>%
+ group_by(gear) %>%
+ summarize(across(c(hp, disp),
+ list(Mean = mean, SD = sd)))
# A tibble: 3 × 5
gear hp_Mean hp_SD disp_Mean disp_SD
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 176. 47.7 326. 94.9
2 4 89.5 25.9 123. 38.9
3 5 196. 103. 202. 115.
If list of functins is passed, the the functions inside it can be written as just the name of function or in purrr-style lambda:
> mtcars %>%
+ group_by(gear) %>%
+ summarize(across(c(hp, disp),
+ list(Mean = ~ mean(., na.rm = TRUE), SD = sd)))
# A tibble: 3 × 5
gear hp_Mean hp_SD disp_Mean disp_SD
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 176. 47.7 326. 94.9
2 4 89.5 25.9 123. 38.9
3 5 196. 103. 202. 115.
Column names can be specified too:
> mtcars %>%
+ group_by(gear) %>%
+ summarize(across(c(hp, disp),
+ list(Mean = ~ mean(.), SD = ~ sd(.)),
+ .names = "{.fn} of {.col}"))
# A tibble: 3 × 5
gear `Mean of hp` `SD of hp` `Mean of disp` `SD of disp`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 176. 47.7 326. 94.9
2 4 89.5 25.9 123. 38.9
3 5 196. 103. 202. 115.
More example:
> esoph %>%
+ group_by(agegp) %>%
+ summarize(across(where(is.numeric),
+ list(Mean = ~mean(.),
+ Variance = ~var(.)),
+ .names = "{.fn}_{.col}"))
# A tibble: 6 × 5
agegp Mean_ncases Variance_ncases Mean_ncontrols Variance_ncontrols
<ord> <dbl> <dbl> <dbl> <dbl>
1 25-34 0.0667 0.0667 7.67 124.
2 35-44 0.6 0.971 12.7 251.
3 45-54 2.88 4.38 10.4 160.
4 55-64 4.75 5.27 10.4 164.
5 65-74 3.67 16.8 7.07 121.
6 75+ 1.18 0.364 2.82 24.4
tips: matches("string") function
selects columns whose column name matches the given “string”.
The function tally() can be used to count instead of
n():
> mtcars %>%
+ group_by(cyl) %>%
+ tally(sort=T, name='Count')
# A tibble: 3 × 2
cyl Count
<dbl> <int>
1 8 14
2 4 11
3 6 7
> # sort = TRUE, will show the largest groups at the top
n_distinct(vector) counts the no. of unique items in the
vector:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise(no_of_cars_in_group=n(),
+ unique_hp_cars=n_distinct(hp))
# A tibble: 3 × 3
cyl no_of_cars_in_group unique_hp_cars
<dbl> <int> <int>
1 4 11 10
2 6 7 4
3 8 14 9
The following example keeps top 4 cars according to mpg from each group of cyl:
> mtcars %>%
+ add_rownames("CarName") %>%
+ group_by(cyl) %>%
+ top_n(2, mpg) %>%
+ arrange(desc(mpg)) %>%
+ select(CarName,gear,mpg)
# A tibble: 7 × 4
# Groups: cyl [3]
cyl CarName gear mpg
<dbl> <chr> <dbl> <dbl>
1 4 Toyota Corolla 4 33.9
2 4 Fiat 128 4 32.4
3 6 Hornet 4 Drive 3 21.4
4 6 Mazda RX4 4 21
5 6 Mazda RX4 Wag 4 21
6 8 Pontiac Firebird 3 19.2
7 8 Hornet Sportabout 3 18.7
Tips: add_rownames() actually turns the
row names into a column with given variable name. Also,
add_rownames() was deprecated in dplyr 1.0.0. Please use
tibble::rownames_to_column() instead.
> mtcars %>%
+ tibble::rownames_to_column("CarName") %>%
+ group_by(cyl) %>%
+ top_n(2, mpg) %>%
+ arrange(desc(mpg)) %>%
+ select(CarName,gear,mpg)
# A tibble: 7 × 4
# Groups: cyl [3]
cyl CarName gear mpg
<dbl> <chr> <dbl> <dbl>
1 4 Toyota Corolla 4 33.9
2 4 Fiat 128 4 32.4
3 6 Hornet 4 Drive 3 21.4
4 6 Mazda RX4 4 21
5 6 Mazda RX4 Wag 4 21
6 8 Pontiac Firebird 3 19.2
7 8 Hornet Sportabout 3 18.7
Notice: Although the cyl column was not
selected, since it was the grouping variable, dplyr is showing the
column. To prevent this from happening, ungroup() is
used.
> mtcars %>%
+ tibble::rownames_to_column("CarName") %>%
+ group_by(cyl) %>%
+ top_n(2, mpg) %>%
+ arrange(desc(mpg)) %>%
+ ungroup() %>%
+ select(CarName,gear,mpg)
# A tibble: 7 × 3
CarName gear mpg
<chr> <dbl> <dbl>
1 Toyota Corolla 4 33.9
2 Fiat 128 4 32.4
3 Hornet 4 Drive 3 21.4
4 Mazda RX4 4 21
5 Mazda RX4 Wag 4 21
6 Pontiac Firebird 3 19.2
7 Hornet Sportabout 3 18.7
> mtcars %>%
+ tibble::rownames_to_column("CarName") %>%
+ tidyr::separate(col = CarName,
+ into = c("Car Name","Model"), # separate into these variables
+ sep = " ", # separator
+ extra = "merge", # if more parts than the length of `into`
+ fill = "right" # fill to the right
+ )
Car Name Model mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 Valiant <NA> 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
11 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
12 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
13 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
14 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
15 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
16 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
17 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
18 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
19 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
20 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
21 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
22 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
23 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
24 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
25 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
26 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
27 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
28 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
29 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
32 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Let’s create two data frames:
> (a <- data.frame(color=c("Blue","Red","Yellow","Megenda","White"),
+ taka1=c(10,10,15,20,30)))
color taka1
1 Blue 10
2 Red 10
3 Yellow 15
4 Megenda 20
5 White 30
> (b <- data.frame(color=c("Black","Red","Yellow","Pink","Green"),
+ taka2=c(20,10,15,10,30)))
color taka2
1 Black 20
2 Red 10
3 Yellow 15
4 Pink 10
5 Green 30
Only include observations found in both “a” and “b” (A intersect B):
> inner_join(a,b)
color taka1 taka2
1 Red 10 10
2 Yellow 15 15
Include observations found in either “a” or “b” (A U B):
> full_join(a,b)
color taka1 taka2
1 Blue 10 NA
2 Red 10 10
3 Yellow 15 15
4 Megenda 20 NA
5 White 30 NA
6 Black NA 20
7 Pink NA 10
8 Green NA 30
Shows those values of “a” that matches the values of “b”:
> semi_join(a,b)
color taka1
1 Red 10
2 Yellow 15
Shows those values of “a” that does not match the values of “b”:
> anti_join(a,b)
color taka1
1 Blue 10
2 Megenda 20
3 White 30
> left_join(a,b) # include all observations found in a
color taka1 taka2
1 Blue 10 NA
2 Red 10 10
3 Yellow 15 15
4 Megenda 20 NA
5 White 30 NA
> right_join(a,b) # include all observations found in b
color taka1 taka2
1 Red 10 10
2 Yellow 15 15
3 Black NA 20
4 Pink NA 10
5 Green NA 30
Run the following code to learn more about dplyr.
> # vignette("programming", package = "dplyr")