Packages + EDA

loiyumba

Introduction

Packages are collections of functions and data built for a particular purpose. Once a package is loaded into an R session, its functions and data can be used in the same way as those built into R. Many of the most useful R tools are distributed as packages.

  • Install packages with install.packages() (once)
  • Load packages with library() (in every new session)

In your console

# install.packages("dplyr")
library(dplyr)
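
The install-once, load-every-session routine can be wrapped in a small helper. This is only a sketch: use_package() is a hypothetical name chosen for illustration, not a function from base R or dplyr. It checks whether a package is already installed and only calls install.packages() when it is missing.

```r
# Hypothetical helper (the name is an assumption, not part of any package):
# install a package only if it is missing, then attach it.
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # runs only when the package is absent
  }
  library(pkg, character.only = TRUE)   # pkg is a string, hence character.only
}

# "stats" ships with R, so nothing is downloaded in this example
use_package("stats")
"package:stats" %in% search()
```

Calling use_package("dplyr") would behave the same way, downloading dplyr only on the first run.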

We can check which packages are attached in the current session with the search() function.

search()
 [1] ".GlobalEnv"        "package:dplyr"     "package:knitr"    
 [4] "package:stats"     "package:graphics"  "package:grDevices"
 [7] "package:utils"     "package:datasets"  "package:methods"  
[10] "Autoloads"         "package:base"     

If we want to know more about a package and its functions, we can simply run

# library(help = 'dplyr')

This displays details of the package, including a list of its functions.

We may also want to see which datasets ship with a package. The following command lists the data associated with it.

# data(package = 'dplyr')

Data Manipulation using dplyr package

Hadley Wickham of RStudio is the author of the dplyr package. dplyr functions do the same jobs as base R when extracting existing variables, extracting existing observations, and deriving new variables, but with simpler syntax, and they are designed to be fast.

Some of the main functions of dplyr are -

  • select(): focus on a subset of variables
  • filter(): focus on a subset of rows
  • mutate(): add new columns
  • summarise(): reduce each group to a smaller number of summary statistics
  • arrange(): re-order the rows

In your console, run this command

mtcars

And then

as_tibble(mtcars)  # replaces tbl_df(), which is deprecated in current dplyr

What's the difference?

glimpse(mtcars)
Observations: 32
Variables: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
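
One way to answer the question above is to compare the classes of the two objects. The sketch below assumes a dplyr version that re-exports as_tibble(); the column name "model" is chosen here purely for illustration.

```r
library(dplyr)

cars_tbl <- as_tibble(mtcars)

class(mtcars)     # "data.frame"
class(cars_tbl)   # "tbl_df" "tbl" "data.frame"

# A tibble is still a data frame, so data-frame code keeps working
inherits(cars_tbl, "data.frame")

# Row names are dropped by default. Passing rownames = "model" keeps them
# as an ordinary column instead ("model" is an illustrative name).
cars_named <- as_tibble(mtcars, rownames = "model")
head(cars_named$model, 3)
```

Printing cars_tbl shows only the first rows and the columns that fit on screen, together with each column's type, whereas printing mtcars dumps the whole data frame.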

select()

new_mtcars <- select(mtcars, mpg, cyl, disp, hp, gear)
head(new_mtcars)
                   mpg cyl disp  hp gear
Mazda RX4         21.0   6  160 110    4
Mazda RX4 Wag     21.0   6  160 110    4
Datsun 710        22.8   4  108  93    4
Hornet 4 Drive    21.4   6  258 110    3
Hornet Sportabout 18.7   8  360 175    3
Valiant           18.1   6  225 105    3

How can we do this in base R?

base_mtcars <- mtcars[, c(1:4, 10)]
head(base_mtcars)
                   mpg cyl disp  hp gear
Mazda RX4         21.0   6  160 110    4
Mazda RX4 Wag     21.0   6  160 110    4
Datsun 710        22.8   4  108  93    4
Hornet 4 Drive    21.4   6  258 110    3
Hornet Sportabout 18.7   8  360 175    3
Valiant           18.1   6  225 105    3
new_mtcars <- select(new_mtcars, -gear)
head(new_mtcars)
                   mpg cyl disp  hp
Mazda RX4         21.0   6  160 110
Mazda RX4 Wag     21.0   6  160 110
Datsun 710        22.8   4  108  93
Hornet 4 Drive    21.4   6  258 110
Hornet Sportabout 18.7   8  360 175
Valiant           18.1   6  225 105

How can we do this in base R?

base_mtcars$gear <- NULL
head(base_mtcars)
                   mpg cyl disp  hp
Mazda RX4         21.0   6  160 110
Mazda RX4 Wag     21.0   6  160 110
Datsun 710        22.8   4  108  93
Hornet 4 Drive    21.4   6  258 110
Hornet Sportabout 18.7   8  360 175
Valiant           18.1   6  225 105
new_mtcars <- select(new_mtcars, cyl:hp)
head(new_mtcars)
                  cyl disp  hp
Mazda RX4           6  160 110
Mazda RX4 Wag       6  160 110
Datsun 710          4  108  93
Hornet 4 Drive      6  258 110
Hornet Sportabout   8  360 175
Valiant             6  225 105

Let's do the same in base R

base_mtcars <- base_mtcars[, c(2:4)]
head(base_mtcars)
                  cyl disp  hp
Mazda RX4           6  160 110
Mazda RX4 Wag       6  160 110
Datsun 710          4  108  93
Hornet 4 Drive      6  258 110
Hornet Sportabout   8  360 175
Valiant             6  225 105
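
The base R examples above pick columns by position, which silently selects the wrong columns if the column order ever changes. As a small sketch, base R can also select by name, which is closer to what select() does:

```r
library(dplyr)

by_position <- mtcars[, 2:4]                     # depends on column order
by_name     <- mtcars[, c("cyl", "disp", "hp")]  # robust to column order
by_select   <- select(mtcars, cyl:hp)

identical(by_position, by_name)                  # same data frame either way
identical(names(by_name), names(by_select))      # same columns as select()
```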

filter()

filter(new_mtcars, disp > 300)
   cyl disp  hp
1    8  360 175
2    8  360 245
3    8  472 205
4    8  460 215
5    8  440 230
6    8  318 150
7    8  304 150
8    8  350 245
9    8  400 175
10   8  351 264
11   8  301 335

Do the same in base R

base_mtcars[base_mtcars$disp > 300, ]
                    cyl disp  hp
Hornet Sportabout     8  360 175
Duster 360            8  360 245
Cadillac Fleetwood    8  472 205
Lincoln Continental   8  460 215
Chrysler Imperial     8  440 230
Dodge Challenger      8  318 150
AMC Javelin           8  304 150
Camaro Z28            8  350 245
Pontiac Firebird      8  400 175
Ford Pantera L        8  351 264
Maserati Bora         8  301 335
filter(new_mtcars, disp < 200, cyl == 6)
  cyl  disp  hp
1   6 160.0 110
2   6 160.0 110
3   6 167.6 123
4   6 167.6 123
5   6 145.0 175

In base R

base_mtcars[base_mtcars$disp < 200 & base_mtcars$cyl == 6, ]
              cyl  disp  hp
Mazda RX4       6 160.0 110
Mazda RX4 Wag   6 160.0 110
Merc 280        6 167.6 123
Merc 280C       6 167.6 123
Ferrari Dino    6 145.0 175
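
In filter(), conditions separated by commas are combined with a logical AND. The following sketch shows four equivalent ways to write the last filter, using the full mtcars data so that it runs on its own:

```r
library(dplyr)

with_comma  <- filter(mtcars, disp < 200, cyl == 6)
with_and    <- filter(mtcars, disp < 200 & cyl == 6)
with_base   <- mtcars[mtcars$disp < 200 & mtcars$cyl == 6, ]
with_subset <- subset(mtcars, disp < 200 & cyl == 6)

# All four keep the same five cars
c(nrow(with_comma), nrow(with_and), nrow(with_base), nrow(with_subset))
```

One difference worth knowing: when a condition evaluates to NA, filter() and subset() drop that row, while logical indexing with [ returns a row of NAs. mtcars has no missing values, so the results agree here.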

mutate()

head(mutate(new_mtcars, hp_cyl = hp/cyl), 3)
  cyl disp  hp   hp_cyl
1   6  160 110 18.33333
2   6  160 110 18.33333
3   4  108  93 23.25000
head(mutate(new_mtcars, hp_cyl = hp/cyl, disp_cyl = disp/cyl), 4)
  cyl disp  hp   hp_cyl disp_cyl
1   6  160 110 18.33333 26.66667
2   6  160 110 18.33333 26.66667
3   4  108  93 23.25000 27.00000
4   6  258 110 18.33333 43.00000

In base R, this is how we do it

base_mtcars$hp_cyl <- base_mtcars$hp/base_mtcars$cyl
head(base_mtcars, 3)
              cyl disp  hp   hp_cyl
Mazda RX4       6  160 110 18.33333
Mazda RX4 Wag   6  160 110 18.33333
Datsun 710      4  108  93 23.25000
base_mtcars$disp_cyl <- base_mtcars$disp/base_mtcars$cyl
head(base_mtcars, 3)
              cyl disp  hp   hp_cyl disp_cyl
Mazda RX4       6  160 110 18.33333 26.66667
Mazda RX4 Wag   6  160 110 18.33333 26.66667
Datsun 710      4  108  93 23.25000 27.00000
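
Base R also offers transform(), which adds several columns in one call, much like mutate(). A minimal sketch, rebuilding the three-column data frame so the example is self-contained (the object name cars3 is illustrative):

```r
library(dplyr)

cars3 <- mtcars[, c("cyl", "disp", "hp")]

with_mutate    <- mutate(cars3, hp_cyl = hp / cyl, disp_cyl = disp / cyl)
with_transform <- transform(cars3, hp_cyl = hp / cyl, disp_cyl = disp / cyl)

# Both approaches compute the same new columns
all.equal(with_mutate$hp_cyl, with_transform$hp_cyl)
all.equal(with_mutate$disp_cyl, with_transform$disp_cyl)
```

Unlike transform(), mutate() lets a new column refer to another column created earlier in the same call.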

summarise()

summarise(new_mtcars, median = median(hp), variance = var(disp), numbers = n())
  median variance numbers
1    123  15360.8      32
summarise(new_mtcars, average = mean(hp), total = sum(disp), deviation = sd(hp)) 
   average  total deviation
1 146.6875 7383.1  68.56287

In base R

summary(base_mtcars$hp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   52.0    96.5   123.0   146.7   180.0   335.0 
sum(base_mtcars$disp)
[1] 7383.1
nrow(base_mtcars); sd(base_mtcars$hp)
[1] 32
[1] 68.56287
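
To get a single-row table like the one summarise() returns, the individual base R results can be collected into a data frame. This is a sketch; the object names cars3 and stats_row are chosen for illustration.

```r
cars3 <- mtcars[, c("cyl", "disp", "hp")]

# One row, one column per statistic, mirroring the first summarise() call
stats_row <- data.frame(
  median   = median(cars3$hp),
  variance = var(cars3$disp),
  numbers  = nrow(cars3)
)
stats_row
```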

arrange()

head(arrange(new_mtcars, disp),4)
  cyl disp hp
1   4 71.1 65
2   4 75.7 52
3   4 78.7 66
4   4 79.0 66
head(arrange(new_mtcars, desc(disp)),4)
  cyl disp  hp
1   8  472 205
2   8  460 215
3   8  440 230
4   8  400 175
head(arrange(new_mtcars, hp, disp), 5)
  cyl  disp hp
1   4  75.7 52
2   4 146.7 62
3   4  71.1 65
4   4  78.7 66
5   4  79.0 66
head(arrange(new_mtcars, hp, desc(disp)), 5)
  cyl  disp hp
1   4  75.7 52
2   4 146.7 62
3   4  71.1 65
4   4  79.0 66
5   4  78.7 66

In base R

head(base_mtcars[order(base_mtcars$disp), ], 4)
               cyl disp hp hp_cyl disp_cyl
Toyota Corolla   4 71.1 65  16.25   17.775
Honda Civic      4 75.7 52  13.00   18.925
Fiat 128         4 78.7 66  16.50   19.675
Fiat X1-9        4 79.0 66  16.50   19.750
head(base_mtcars[order(base_mtcars$disp, decreasing = TRUE), ], 4)
                    cyl disp  hp hp_cyl disp_cyl
Cadillac Fleetwood    8  472 205 25.625     59.0
Lincoln Continental   8  460 215 26.875     57.5
Chrysler Imperial     8  440 230 28.750     55.0
Pontiac Firebird      8  400 175 21.875     50.0
head(base_mtcars[order(base_mtcars$hp, base_mtcars$disp), ], 5)
               cyl  disp hp hp_cyl disp_cyl
Honda Civic      4  75.7 52  13.00   18.925
Merc 240D        4 146.7 62  15.50   36.675
Toyota Corolla   4  71.1 65  16.25   17.775
Fiat 128         4  78.7 66  16.50   19.675
Fiat X1-9        4  79.0 66  16.50   19.750
head(base_mtcars[order(base_mtcars$hp, -base_mtcars$disp), ], 5) 
               cyl  disp hp hp_cyl disp_cyl
Honda Civic      4  75.7 52  13.00   18.925
Merc 240D        4 146.7 62  15.50   36.675
Toyota Corolla   4  71.1 65  16.25   17.775
Fiat X1-9        4  79.0 66  16.50   19.750
Fiat 128         4  78.7 66  16.50   19.675

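The %>% (pipe) Operator

The pipe passes the object on its left as the first argument of the function on its right, so a sequence of steps reads from top to bottom.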

new_mtcars2 <- mtcars %>%
  select(mpg, cyl, disp, hp, gear)
head(new_mtcars2)
                   mpg cyl disp  hp gear
Mazda RX4         21.0   6  160 110    4
Mazda RX4 Wag     21.0   6  160 110    4
Datsun 710        22.8   4  108  93    4
Hornet 4 Drive    21.4   6  258 110    3
Hornet Sportabout 18.7   8  360 175    3
Valiant           18.1   6  225 105    3
new_mtcars2 %>% 
  filter(disp > 300)
    mpg cyl disp  hp gear
1  18.7   8  360 175    3
2  14.3   8  360 245    3
3  10.4   8  472 205    3
4  10.4   8  460 215    3
5  14.7   8  440 230    3
6  15.5   8  318 150    3
7  15.2   8  304 150    3
8  13.3   8  350 245    3
9  19.2   8  400 175    3
10 15.8   8  351 264    5
11 15.0   8  301 335    5
new_mtcars2 %>% 
  filter(disp > 300) %>% 
  select(mpg, gear)
    mpg gear
1  18.7    3
2  14.3    3
3  10.4    3
4  10.4    3
5  14.7    3
6  15.5    3
7  15.2    3
8  13.3    3
9  19.2    3
10 15.8    5
11 15.0    5

How can we do this in base R?

new_mtcars2[new_mtcars2$disp > 300, c(1, 5)]
                     mpg gear
Hornet Sportabout   18.7    3
Duster 360          14.3    3
Cadillac Fleetwood  10.4    3
Lincoln Continental 10.4    3
Chrysler Imperial   14.7    3
Dodge Challenger    15.5    3
AMC Javelin         15.2    3
Camaro Z28          13.3    3
Pontiac Firebird    19.2    3
Ford Pantera L      15.8    5
Maserati Bora       15.0    5
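
A piped chain is just another way of writing nested function calls. The sketch below runs the same filter-then-select steps both ways on mtcars and confirms that the results match:

```r
library(dplyr)

# Piped: read top to bottom
piped <- mtcars %>%
  filter(disp > 300) %>%
  select(mpg, gear)

# Nested: read inside out
nested <- select(filter(mtcars, disp > 300), mpg, gear)

identical(piped, nested)
```

R 4.1.0 and later also provide a native pipe, |>, which works the same way for simple calls like these.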

group_by()

group_by() is another useful dplyr function: it splits the data into groups, so that summarise() returns one row per group.

mtcars %>%
  group_by(cyl) %>% 
  summarise(Ave = mean(mpg), Disp = sum(disp), Num = n())
# A tibble: 3 x 4
    cyl      Ave   Disp   Num
  <dbl>    <dbl>  <dbl> <int>
1     4 26.66364 1156.5    11
2     6 19.74286 1283.2     7
3     8 15.10000 4943.4    14

How do we achieve this with base R?

tapply(mtcars$mpg, mtcars$cyl, mean)
       4        6        8 
26.66364 19.74286 15.10000 
tapply(mtcars$disp, mtcars$cyl, sum)
     4      6      8 
1156.5 1283.2 4943.4 
tapply(mtcars$mpg, mtcars$cyl, length)
 4  6  8 
11  7 14 
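
The three tapply() calls return three separate vectors. To get one table like the grouped summarise() result, one possible base R sketch splits the data by cyl and computes each statistic per piece (the object names by_cyl and result are illustrative):

```r
# Split mtcars into a list of data frames, one per value of cyl
by_cyl <- split(mtcars, mtcars$cyl)

# Compute each summary per group and assemble one row per group
result <- data.frame(
  cyl  = as.numeric(names(by_cyl)),
  Ave  = sapply(by_cyl, function(d) mean(d$mpg)),
  Disp = sapply(by_cyl, function(d) sum(d$disp)),
  Num  = sapply(by_cyl, nrow),
  row.names = NULL
)
result
```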

Joining data frames with dplyr

With bind_cols(), we can bind the columns of two data frames together.

song <- c("To Be With You", "Wild World", "Just Take My Heart", "Goin' Where the Wind Blows", "Promise Her the Moon")
album <- c("Lean into it", "Bump Ahead", "Lean into it", "Hey Man", "Bump Ahead")
mr.big <- data.frame(song, album)
mr.big
                        song        album
1             To Be With You Lean into it
2                 Wild World   Bump Ahead
3         Just Take My Heart Lean into it
4 Goin' Where the Wind Blows      Hey Man
5       Promise Her the Moon   Bump Ahead
year <- c(1991, 1993, 1991, 1996, 1993)
writer <- c("Eric Martin", "Cat Stevens", "Martin, Pessis, Call", "Martin, Pessis, Andre", "Martin, Pessis, Andre")
others <- data.frame(year, writer)
others
  year                writer
1 1991           Eric Martin
2 1993           Cat Stevens
3 1991  Martin, Pessis, Call
4 1996 Martin, Pessis, Andre
5 1993 Martin, Pessis, Andre
mr.big <- bind_cols(mr.big, others)
mr.big
                        song        album year                writer
1             To Be With You Lean into it 1991           Eric Martin
2                 Wild World   Bump Ahead 1993           Cat Stevens
3         Just Take My Heart Lean into it 1991  Martin, Pessis, Call
4 Goin' Where the Wind Blows      Hey Man 1996 Martin, Pessis, Andre
5       Promise Her the Moon   Bump Ahead 1993 Martin, Pessis, Andre

We can bind rows with the dplyr function bind_rows().

new_obs <- data.frame(song = "Nothing But Love",
                      album = "Bump Ahead",
                      year = 1993,
                      writer = "Paul Gilbert")
mr.big <- bind_rows(mr.big, new_obs)
mr.big
                        song        album year                writer
1             To Be With You Lean into it 1991           Eric Martin
2                 Wild World   Bump Ahead 1993           Cat Stevens
3         Just Take My Heart Lean into it 1991  Martin, Pessis, Call
4 Goin' Where the Wind Blows      Hey Man 1996 Martin, Pessis, Andre
5       Promise Her the Moon   Bump Ahead 1993 Martin, Pessis, Andre
6           Nothing But Love   Bump Ahead 1993          Paul Gilbert
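
For comparison, base R provides cbind() and rbind() for the same two tasks. A shortened sketch with two songs; the object names left_part, right_part and combined are illustrative:

```r
left_part <- data.frame(song  = c("To Be With You", "Wild World"),
                        album = c("Lean into it", "Bump Ahead"),
                        stringsAsFactors = FALSE)
right_part <- data.frame(year = c(1991, 1993))

# Bind columns: rows are matched by position, so both must be in the same order
combined <- cbind(left_part, right_part)

# Bind rows: the new observation must have the same columns
new_obs <- data.frame(song = "Nothing But Love",
                      album = "Bump Ahead",
                      year = 1993,
                      stringsAsFactors = FALSE)
combined <- rbind(combined, new_obs)
combined
```

rbind() stops with an error when the column names differ, whereas bind_rows() fills the missing columns with NA.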

union(), intersect() and setdiff()

table1 <- data.frame(X1 = I(c("A", "B", "C")),
                     X2 = c(1:3))
table1
  X1 X2
1  A  1
2  B  2
3  C  3
table2 <- data.frame(X1 = I(c("B", "C", "D")),
                     X2 = c(2:4))
table2
  X1 X2
1  B  2
2  C  3
3  D  4
dplyr::union(table1, table2)
  X1 X2
1  A  1
2  B  2
3  C  3
4  D  4
dplyr::intersect(table1, table2)
  X1 X2
1  B  2
2  C  3
dplyr::setdiff(table2, table1)
  X1 X2
1  D  4
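
The dplyr:: prefix is used above because base R has functions with the same names. The base versions work on vectors, while the dplyr versions compare whole rows of data frames. A short sketch of the base versions on plain vectors:

```r
x <- c("A", "B", "C")
y <- c("B", "C", "D")

base::union(x, y)       # everything that appears in either: "A" "B" "C" "D"
base::intersect(x, y)   # only what appears in both: "B" "C"
base::setdiff(y, x)     # in y but not in x: "D"
base::setdiff(x, y)     # in x but not in y: "A"
```

Note that setdiff() is not symmetric: the order of the arguments decides which side's unique elements are kept.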

inner_join()

song <- I(c("Across the Universe", "Come Together", "Hello, Goodbye", "Peggy Sue"))
name <- I(c("John", "John", "Paul", "Buddy"))
songs <- data.frame(song, name)
songs
                 song  name
1 Across the Universe  John
2       Come Together  John
3      Hello, Goodbye  Paul
4           Peggy Sue Buddy
name <- I(c("George", "John", "Paul", "Ringo"))
plays <- I(c("Sitar", "Guitar", "Bass", "Drums"))
artists <- data.frame(name, plays)
artists
    name  plays
1 George  Sitar
2   John Guitar
3   Paul   Bass
4  Ringo  Drums
inner_join(songs, artists)
                 song name  plays
1 Across the Universe John Guitar
2       Come Together John Guitar
3      Hello, Goodbye Paul   Bass
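
When no key is given, inner_join() joins on all columns the two tables have in common, here name. Stating the key with by makes the intent explicit. The sketch below rebuilds the two tables as plain character columns (without I()) so that it runs on its own, and shows base R's merge() as a rough equivalent:

```r
library(dplyr)

songs <- data.frame(
  song = c("Across the Universe", "Come Together", "Hello, Goodbye", "Peggy Sue"),
  name = c("John", "John", "Paul", "Buddy"),
  stringsAsFactors = FALSE
)
artists <- data.frame(
  name  = c("George", "John", "Paul", "Ringo"),
  plays = c("Sitar", "Guitar", "Bass", "Drums"),
  stringsAsFactors = FALSE
)

joined <- inner_join(songs, artists, by = "name")  # explicit join key
merged <- merge(songs, artists, by = "name")       # base R inner join

joined
merged
```

merge() puts the key column first and sorts the result by it, so the column and row order can differ from the dplyr result even though the matched rows are the same.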

right_join()

right_join(songs, artists)
                 song   name  plays
1                <NA> George  Sitar
2 Across the Universe   John Guitar
3       Come Together   John Guitar
4      Hello, Goodbye   Paul   Bass
5                <NA>  Ringo  Drums

left_join()

left_join(songs, artists)
                 song  name  plays
1 Across the Universe  John Guitar
2       Come Together  John Guitar
3      Hello, Goodbye  Paul   Bass
4           Peggy Sue Buddy   <NA>

semi_join()

semi_join(songs, artists)
                 song name
1 Across the Universe John
2       Come Together John
3      Hello, Goodbye Paul

anti_join()

anti_join(songs, artists)
       song  name
1 Peggy Sue Buddy
anti_join(artists, songs)
    name plays
1 George Sitar
2  Ringo Drums
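
The other joins also have rough base R counterparts. The following sketch again uses plain character columns so that it is self-contained; merge() covers the left and right joins, and %in% covers the filtering joins:

```r
songs <- data.frame(
  song = c("Across the Universe", "Come Together", "Hello, Goodbye", "Peggy Sue"),
  name = c("John", "John", "Paul", "Buddy"),
  stringsAsFactors = FALSE
)
artists <- data.frame(
  name  = c("George", "John", "Paul", "Ringo"),
  plays = c("Sitar", "Guitar", "Bass", "Drums"),
  stringsAsFactors = FALSE
)

left  <- merge(songs, artists, by = "name", all.x = TRUE)  # like left_join()
right <- merge(songs, artists, by = "name", all.y = TRUE)  # like right_join()
semi  <- songs[songs$name %in% artists$name, ]             # like semi_join()
anti  <- songs[!(songs$name %in% artists$name), ]          # like anti_join()

c(left = nrow(left), right = nrow(right), semi = nrow(semi), anti = nrow(anti))
```

As before, merge() sorts by the key, so rows may appear in a different order than in the dplyr output.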

rename()

rename(songs, title = song, band_member = name)
                title band_member
1 Across the Universe        John
2       Come Together        John
3      Hello, Goodbye        Paul
4           Peggy Sue       Buddy

How do we rename variables in base R?
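
One common base R approach is to assign into names(), picking the column to rename by its current name. A small sketch with a two-row version of the songs table:

```r
songs <- data.frame(song = c("Across the Universe", "Peggy Sue"),
                    name = c("John", "Buddy"),
                    stringsAsFactors = FALSE)

# Replace each old name by matching on it, so column order does not matter
names(songs)[names(songs) == "song"] <- "title"
names(songs)[names(songs) == "name"] <- "band_member"

names(songs)
```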

sample_n()

sample_n(mtcars, 5, replace = TRUE)
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
AMC Javelin    15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Fiat 128.1     32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
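
Because the sample is random, the rows change on every run. A base R sketch of the same idea, with set.seed() added so the result can be reproduced (the seed value 42 is arbitrary):

```r
set.seed(42)
first <- mtcars[sample(nrow(mtcars), 5, replace = TRUE), ]

set.seed(42)
second <- mtcars[sample(nrow(mtcars), 5, replace = TRUE), ]

# Same seed, same sample
identical(first, second)
nrow(first)
```

With replace = TRUE the same car can be drawn more than once. R then adds a suffix to keep row names unique, which is why the output above contains a row called "Fiat 128.1".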

Reading and Writing Data in R

We will use the readr package to load external data. The same package can also write data from R to a file.

First, let's see how to set the working directory in R, so that we can load or save data in the folder we want.

We use this function to check the current working directory in R

getwd()

To change to our desired directory

setwd("your computer directory path")

Once the directory is set, we can load data from it like this -

objectname <- read_csv("sample_data.csv")

If we want to load a file as plain lines of text, we load it like this -

objectname <- read_lines("sample_data.txt")

For more information on reading different types of data, do

help(package = 'readr')

If we want to save data, we do

write_csv(objectname, "cleaned_data.txt")

This will save the data into our current working directory.
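
To try writing and reading without touching the working directory, a round trip through a temporary file is a convenient sketch. It assumes the readr package is installed; tempfile() is base R, and col_types = cols() asks readr to guess the column types without printing a message.

```r
library(readr)

# Write the first six rows of mtcars to a temporary CSV file
path <- tempfile(fileext = ".csv")
write_csv(head(mtcars), path)

# Read the file back in
round_trip <- read_csv(path, col_types = cols())

dim(round_trip)
names(round_trip)
```

write_csv() does not write row names, so the car names are lost in the round trip unless they are first stored in a regular column.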

We can import data from databases and webpages as well.