loiyumba
Packages are collections of functions and data built for a particular purpose. Once loaded into an R session, they can be used in the same way as built-in R functions and data. Most useful R applications are distributed as packages.
# install.packages("dplyr")
library(dplyr)
We can check which packages are attached in the current session with the search() function.
search()
[1] ".GlobalEnv" "package:dplyr" "package:knitr"
[4] "package:stats" "package:graphics" "package:grDevices"
[7] "package:utils" "package:datasets" "package:methods"
[10] "Autoloads" "package:base"
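The search path can also be queried from code rather than read by eye. The sketch below uses only base R; the helper name is_attached is hypothetical and is not part of R or dplyr.

```r
# Hypothetical helper: TRUE if the package is attached in this session.
is_attached <- function(pkg) {
  paste0("package:", pkg) %in% search()
}

is_attached("base")   # base is always on the search path
is_attached("dplyr")  # TRUE only after library(dplyr) has been run

# requireNamespace() checks that a package is installed without attaching it.
requireNamespace("stats", quietly = TRUE)
```

Checking the search path this way is handy at the top of a script, before calling functions that live in a package.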
If we want to know more about a package and its functions, we can simply run
# library(help = 'dplyr')
This shows details about the package along with a list of its functions.
We can also list the datasets that ship with the package:
# data(package = 'dplyr')
Hadley Wickham of RStudio is the author of the dplyr package. dplyr functions do the same jobs as base R functions (extracting existing variables, extracting existing observations and deriving new variables), but with simpler syntax, and they are very fast.
Some of the main functions of dplyr, each demonstrated below, are select(), filter(), mutate(), summarise() and arrange().
In your console, run this command
mtcars
And then
tbl_df(mtcars)
What's the difference?
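One way to see the difference is to compare the classes of the two objects. The sketch below assumes dplyr is installed and uses as_tibble(), which recent dplyr releases recommend in place of the deprecated tbl_df(); the object name cars_tbl is illustrative.

```r
library(dplyr)

# A tibble is still a data frame, with extra classes that change how it prints.
cars_tbl <- as_tibble(mtcars)

class(mtcars)    # a plain data frame
class(cars_tbl)  # adds "tbl_df" and "tbl" in front of "data.frame"

# Printing a tibble shows only the first rows, plus the type of each column.
cars_tbl
```

Printing mtcars writes out every row, while the tibble prints a short preview with column types. The conversion also drops the car names stored as row names, which is why dplyr output later on shows row numbers instead.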
glimpse(mtcars)
Observations: 32
Variables: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
new_mtcars <- select(mtcars, mpg, cyl, disp, hp, gear)
head(new_mtcars)
mpg cyl disp hp gear
Mazda RX4 21.0 6 160 110 4
Mazda RX4 Wag 21.0 6 160 110 4
Datsun 710 22.8 4 108 93 4
Hornet 4 Drive 21.4 6 258 110 3
Hornet Sportabout 18.7 8 360 175 3
Valiant 18.1 6 225 105 3
How can we do this with base R?
base_mtcars <- mtcars[, c(1:4, 10)]
head(base_mtcars)
mpg cyl disp hp gear
Mazda RX4 21.0 6 160 110 4
Mazda RX4 Wag 21.0 6 160 110 4
Datsun 710 22.8 4 108 93 4
Hornet 4 Drive 21.4 6 258 110 3
Hornet Sportabout 18.7 8 360 175 3
Valiant 18.1 6 225 105 3
new_mtcars <- select(new_mtcars, -gear)
head(new_mtcars)
mpg cyl disp hp
Mazda RX4 21.0 6 160 110
Mazda RX4 Wag 21.0 6 160 110
Datsun 710 22.8 4 108 93
Hornet 4 Drive 21.4 6 258 110
Hornet Sportabout 18.7 8 360 175
Valiant 18.1 6 225 105
How can we do this in base R?
base_mtcars$gear <- NULL
head(base_mtcars)
mpg cyl disp hp
Mazda RX4 21.0 6 160 110
Mazda RX4 Wag 21.0 6 160 110
Datsun 710 22.8 4 108 93
Hornet 4 Drive 21.4 6 258 110
Hornet Sportabout 18.7 8 360 175
Valiant 18.1 6 225 105
new_mtcars <- select(new_mtcars, cyl:hp)
head(new_mtcars)
cyl disp hp
Mazda RX4 6 160 110
Mazda RX4 Wag 6 160 110
Datsun 710 4 108 93
Hornet 4 Drive 6 258 110
Hornet Sportabout 8 360 175
Valiant 6 225 105
Let's do the same with base R
base_mtcars <- base_mtcars[, c(2:4)]
head(base_mtcars)
cyl disp hp
Mazda RX4 6 160 110
Mazda RX4 Wag 6 160 110
Datsun 710 4 108 93
Hornet 4 Drive 6 258 110
Hornet Sportabout 8 360 175
Valiant 6 225 105
filter(new_mtcars, disp > 300)
cyl disp hp
1 8 360 175
2 8 360 245
3 8 472 205
4 8 460 215
5 8 440 230
6 8 318 150
7 8 304 150
8 8 350 245
9 8 400 175
10 8 351 264
11 8 301 335
Do the same with base R
base_mtcars[base_mtcars$disp > 300, ]
cyl disp hp
Hornet Sportabout 8 360 175
Duster 360 8 360 245
Cadillac Fleetwood 8 472 205
Lincoln Continental 8 460 215
Chrysler Imperial 8 440 230
Dodge Challenger 8 318 150
AMC Javelin 8 304 150
Camaro Z28 8 350 245
Pontiac Firebird 8 400 175
Ford Pantera L 8 351 264
Maserati Bora 8 301 335
filter(new_mtcars, disp < 200, cyl == 6)
cyl disp hp
1 6 160.0 110
2 6 160.0 110
3 6 167.6 123
4 6 167.6 123
5 6 145.0 175
In base R
base_mtcars[base_mtcars$disp < 200 & base_mtcars$cyl == 6, ]
cyl disp hp
Mazda RX4 6 160.0 110
Mazda RX4 Wag 6 160.0 110
Merc 280 6 167.6 123
Merc 280C 6 167.6 123
Ferrari Dino 6 145.0 175
head(mutate(new_mtcars, hp_cyl = hp/cyl), 3)
cyl disp hp hp_cyl
1 6 160 110 18.33333
2 6 160 110 18.33333
3 4 108 93 23.25000
head(mutate(new_mtcars, hp_cyl = hp/cyl, disp_cyl = disp/cyl), 4)
cyl disp hp hp_cyl disp_cyl
1 6 160 110 18.33333 26.66667
2 6 160 110 18.33333 26.66667
3 4 108 93 23.25000 27.00000
4 6 258 110 18.33333 43.00000
In base R, this is how we do it
base_mtcars$hp_cyl <- base_mtcars$hp/base_mtcars$cyl
head(base_mtcars, 3)
cyl disp hp hp_cyl
Mazda RX4 6 160 110 18.33333
Mazda RX4 Wag 6 160 110 18.33333
Datsun 710 4 108 93 23.25000
base_mtcars$disp_cyl <- base_mtcars$disp/base_mtcars$cyl
head(base_mtcars, 3)
cyl disp hp hp_cyl disp_cyl
Mazda RX4 6 160 110 18.33333 26.66667
Mazda RX4 Wag 6 160 110 18.33333 26.66667
Datsun 710 4 108 93 23.25000 27.00000
summarise(new_mtcars, median = median(hp), variance = var(disp), numbers = n())
median variance numbers
1 123 15360.8 32
summarise(new_mtcars, average = mean(hp), total = sum(disp), deviation = sd(hp))
average total deviation
1 146.6875 7383.1 68.56287
In base R
summary(base_mtcars$hp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
52.0 96.5 123.0 146.7 180.0 335.0
sum(base_mtcars$disp)
[1] 7383.1
nrow(base_mtcars); sd(base_mtcars$hp)
[1] 32
[1] 68.56287
head(arrange(new_mtcars, disp),4)
cyl disp hp
1 4 71.1 65
2 4 75.7 52
3 4 78.7 66
4 4 79.0 66
head(arrange(new_mtcars, desc(disp)),4)
cyl disp hp
1 8 472 205
2 8 460 215
3 8 440 230
4 8 400 175
head(arrange(new_mtcars, hp, disp), 5)
cyl disp hp
1 4 75.7 52
2 4 146.7 62
3 4 71.1 65
4 4 78.7 66
5 4 79.0 66
head(arrange(new_mtcars, hp, desc(disp)), 5)
cyl disp hp
1 4 75.7 52
2 4 146.7 62
3 4 71.1 65
4 4 79.0 66
5 4 78.7 66
In base R
head(base_mtcars[order(base_mtcars$disp), ], 4)
cyl disp hp hp_cyl disp_cyl
Toyota Corolla 4 71.1 65 16.25 17.775
Honda Civic 4 75.7 52 13.00 18.925
Fiat 128 4 78.7 66 16.50 19.675
Fiat X1-9 4 79.0 66 16.50 19.750
head(base_mtcars[order(base_mtcars$disp, decreasing = TRUE), ], 4)
cyl disp hp hp_cyl disp_cyl
Cadillac Fleetwood 8 472 205 25.625 59.0
Lincoln Continental 8 460 215 26.875 57.5
Chrysler Imperial 8 440 230 28.750 55.0
Pontiac Firebird 8 400 175 21.875 50.0
head(base_mtcars[order(base_mtcars$hp, base_mtcars$disp), ], 5)
cyl disp hp hp_cyl disp_cyl
Honda Civic 4 75.7 52 13.00 18.925
Merc 240D 4 146.7 62 15.50 36.675
Toyota Corolla 4 71.1 65 16.25 17.775
Fiat 128 4 78.7 66 16.50 19.675
Fiat X1-9 4 79.0 66 16.50 19.750
head(base_mtcars[order(base_mtcars$hp, -base_mtcars$disp), ], 5)
cyl disp hp hp_cyl disp_cyl
Honda Civic 4 75.7 52 13.00 18.925
Merc 240D 4 146.7 62 15.50 36.675
Toyota Corolla 4 71.1 65 16.25 17.775
Fiat X1-9 4 79.0 66 16.50 19.750
Fiat 128 4 78.7 66 16.50 19.675
new_mtcars2 <- mtcars %>%
select(mpg, cyl, disp, hp, gear)
head(new_mtcars2)
mpg cyl disp hp gear
Mazda RX4 21.0 6 160 110 4
Mazda RX4 Wag 21.0 6 160 110 4
Datsun 710 22.8 4 108 93 4
Hornet 4 Drive 21.4 6 258 110 3
Hornet Sportabout 18.7 8 360 175 3
Valiant 18.1 6 225 105 3
new_mtcars2 %>%
filter(disp > 300)
mpg cyl disp hp gear
1 18.7 8 360 175 3
2 14.3 8 360 245 3
3 10.4 8 472 205 3
4 10.4 8 460 215 3
5 14.7 8 440 230 3
6 15.5 8 318 150 3
7 15.2 8 304 150 3
8 13.3 8 350 245 3
9 19.2 8 400 175 3
10 15.8 8 351 264 5
11 15.0 8 301 335 5
new_mtcars2 %>%
filter(disp > 300) %>%
select(mpg, gear)
mpg gear
1 18.7 3
2 14.3 3
3 10.4 3
4 10.4 3
5 14.7 3
6 15.5 3
7 15.2 3
8 13.3 3
9 19.2 3
10 15.8 5
11 15.0 5
How can we do this in Base R?
new_mtcars2[new_mtcars2$disp > 300, c(1, 5)]
mpg gear
Hornet Sportabout 18.7 3
Duster 360 14.3 3
Cadillac Fleetwood 10.4 3
Lincoln Continental 10.4 3
Chrysler Imperial 14.7 3
Dodge Challenger 15.5 3
AMC Javelin 15.2 3
Camaro Z28 13.3 3
Pontiac Firebird 19.2 3
Ford Pantera L 15.8 5
Maserati Bora 15.0 5
Some more useful functions provided by dplyr
mtcars %>%
group_by(cyl) %>%
summarise(Ave = mean(mpg), Disp = sum(disp), Num = n())
# A tibble: 3 x 4
cyl Ave Disp Num
<dbl> <dbl> <dbl> <int>
1 4 26.66364 1156.5 11
2 6 19.74286 1283.2 7
3 8 15.10000 4943.4 14
How do we achieve this with Base R?
tapply(mtcars$mpg, mtcars$cyl, mean)
4 6 8
26.66364 19.74286 15.10000
tapply(mtcars$disp, mtcars$cyl, sum)
4 6 8
1156.5 1283.2 4943.4
tapply(mtcars$mpg, mtcars$cyl, length)
4 6 8
11 7 14
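The three tapply() calls above return three separate vectors. If a single table like the dplyr result is wanted, one possible base R approach is to split the data by group, summarise each piece and stack the results. This is a sketch, not the only way to do it; the names by_cyl and base_summary are illustrative.

```r
# Split mtcars by cyl, summarise each piece, then stack the pieces.
by_cyl <- lapply(split(mtcars, mtcars$cyl), function(d) {
  data.frame(cyl  = d$cyl[1],
             Ave  = mean(d$mpg),
             Disp = sum(d$disp),
             Num  = nrow(d))
})
base_summary <- do.call(rbind, by_cyl)
base_summary
```

The result has one row per cyl value, with the same Ave, Disp and Num columns as the group_by() and summarise() pipeline.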
With bind_cols(), we can bind the columns of two data frames together.
song <- c("To Be With You", "Wild World", "Just Take My Heart", "Goin' Where the Wind Blows", "Promise Her the Moon")
album <- c("Lean into it", "Bump Ahead", "Lean into it", "Hey Man", "Bump Ahead")
mr.big <- data.frame(song, album)
mr.big
song album
1 To Be With You Lean into it
2 Wild World Bump Ahead
3 Just Take My Heart Lean into it
4 Goin' Where the Wind Blows Hey Man
5 Promise Her the Moon Bump Ahead
year <- c(1991, 1993, 1991, 1996, 1993)
writer <- c("Eric Martin", "Cat Stevens", "Martin, Pessis, Call", "Martin, Pessis, Andre", "Martin, Pessis, Andre")
others <- data.frame(year, writer)
others
year writer
1 1991 Eric Martin
2 1993 Cat Stevens
3 1991 Martin, Pessis, Call
4 1996 Martin, Pessis, Andre
5 1993 Martin, Pessis, Andre
mr.big <- bind_cols(mr.big, others)
mr.big
song album year writer
1 To Be With You Lean into it 1991 Eric Martin
2 Wild World Bump Ahead 1993 Cat Stevens
3 Just Take My Heart Lean into it 1991 Martin, Pessis, Call
4 Goin' Where the Wind Blows Hey Man 1996 Martin, Pessis, Andre
5 Promise Her the Moon Bump Ahead 1993 Martin, Pessis, Andre
We can bind rows with the dplyr function bind_rows().
new_obs <- data.frame(song = "Nothing But Love",
album = "Bump Ahead",
year = 1993,
writer = "Paul Gilbert")
mr.big <- bind_rows(mr.big, new_obs)
mr.big
song album year writer
1 To Be With You Lean into it 1991 Eric Martin
2 Wild World Bump Ahead 1993 Cat Stevens
3 Just Take My Heart Lean into it 1991 Martin, Pessis, Call
4 Goin' Where the Wind Blows Hey Man 1996 Martin, Pessis, Andre
5 Promise Her the Moon Bump Ahead 1993 Martin, Pessis, Andre
6 Nothing But Love Bump Ahead 1993 Paul Gilbert
union()
table1 <- data.frame(X1 = I(c("A", "B", "C")),
X2 = c(1:3))
table1
X1 X2
1 A 1
2 B 2
3 C 3
table2 <- data.frame(X1 = I(c("B", "C", "D")),
X2 = c(2:4))
table2
X1 X2
1 B 2
2 C 3
3 D 4
dplyr::union(table1, table2)
X1 X2
1 A 1
2 B 2
3 C 3
4 D 4
dplyr::intersect(table1, table2)
X1 X2
1 B 2
2 C 3
dplyr::setdiff(table2, table1)
X1 X2
1 D 4
inner_join()
song <- I(c("Across the Universe", "Come Together", "Hello, Goodbye", "Peggy Sue"))
name <- I(c("John", "John", "Paul", "Buddy"))
songs <- data.frame(song, name)
songs
song name
1 Across the Universe John
2 Come Together John
3 Hello, Goodbye Paul
4 Peggy Sue Buddy
name <- I(c("George", "John", "Paul", "Ringo"))
plays <- I(c("Sitar", "Guitar", "Bass", "Drums"))
artists <- data.frame(name, plays)
artists
name plays
1 George Sitar
2 John Guitar
3 Paul Bass
4 Ringo Drums
inner_join(songs, artists)
song name plays
1 Across the Universe John Guitar
2 Come Together John Guitar
3 Hello, Goodbye Paul Bass
right_join(songs, artists)
song name plays
1 <NA> George Sitar
2 Across the Universe John Guitar
3 Come Together John Guitar
4 Hello, Goodbye Paul Bass
5 <NA> Ringo Drums
left_join(songs, artists)
song name plays
1 Across the Universe John Guitar
2 Come Together John Guitar
3 Hello, Goodbye Paul Bass
4 Peggy Sue Buddy <NA>
semi_join(songs, artists)
song name
1 Across the Universe John
2 Come Together John
3 Hello, Goodbye Paul
anti_join(songs, artists)
song name
1 Peggy Sue Buddy
anti_join(artists, songs)
name plays
1 George Sitar
2 Ringo Drums
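For comparison, most of these joins can be approximated in base R with merge(). The sketch below rebuilds the two data frames so that it runs on its own; the result names (inner, left, right, unmatched) are illustrative.

```r
# stringsAsFactors = FALSE keeps the text columns as character vectors.
songs <- data.frame(
  song = c("Across the Universe", "Come Together", "Hello, Goodbye", "Peggy Sue"),
  name = c("John", "John", "Paul", "Buddy"),
  stringsAsFactors = FALSE
)
artists <- data.frame(
  name  = c("George", "John", "Paul", "Ringo"),
  plays = c("Sitar", "Guitar", "Bass", "Drums"),
  stringsAsFactors = FALSE
)

inner <- merge(songs, artists, by = "name")                # like inner_join()
left  <- merge(songs, artists, by = "name", all.x = TRUE)  # like left_join()
right <- merge(songs, artists, by = "name", all.y = TRUE)  # like right_join()

# Rows of songs with no match in artists, like anti_join(songs, artists).
unmatched <- songs[!songs$name %in% artists$name, ]
```

Unlike the dplyr joins, merge() sorts the result by the key column and places that column first, so the row and column order will differ from the output shown above.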
rename(songs, title = song, band_member = name)
title band_member
1 Across the Universe John
2 Come Together John
3 Hello, Goodbye Paul
4 Peggy Sue Buddy
How do we rename variable names in base R?
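One common base R approach is to assign to names(). The sketch below rebuilds the songs data frame so that it runs on its own; the copy named renamed is illustrative.

```r
songs <- data.frame(
  song = c("Across the Universe", "Come Together", "Hello, Goodbye", "Peggy Sue"),
  name = c("John", "John", "Paul", "Buddy"),
  stringsAsFactors = FALSE
)

# Work on a copy so the original column names stay available.
renamed <- songs

# Match by old name rather than by position, so column order does not matter.
names(renamed)[names(renamed) == "song"] <- "title"
names(renamed)[names(renamed) == "name"] <- "band_member"

names(renamed)
```

Matching on the old name is a little longer than names(renamed)[1] <- "title", but it keeps working if the columns are reordered.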
sample_n(mtcars, 5, replace = TRUE)
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Fiat 128.1 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
We will use the readr package to load external data. The same package can also write data from R to disk.
First, let's see how to set the working directory in R so that we can load or save data in the folder we want.
We use this function to check the current working directory in R
getwd()
To change to our desired directory
setwd("your computer directory path")
Once the directory is set, we can load data from it like this -
objectname <- read_csv("sample_data.csv")
To load a file of plain text lines, we read it like this -
objectname <- read_lines("sample_data.txt")
For more information on reading different types of data, run
help(package = 'readr')
If we want to save data, we do
write_csv(objectname, "cleaned_data.txt")
This will save data into our current directory.
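To try the write and read steps without touching the working directory, a round trip through a temporary file can help. The sketch below assumes readr is installed; the data frame cleaned and its columns are made up for illustration.

```r
library(readr)

# A small illustrative data frame (not from the examples above).
cleaned <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# tempfile() gives a throwaway path, so the working directory is left alone.
path <- tempfile(fileext = ".csv")
write_csv(cleaned, path)

# Read the file back; read_csv() returns a tibble and guesses column types.
reloaded <- read_csv(path)
reloaded
```

Passing a full path like this works the same way as a bare file name, except that the file is not resolved against the directory reported by getwd().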
We can import data from databases and webpages as well.