Package list

The packages used in this chapter are:

  • tidyverse

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Exploring the “mpg” dataset

The mpg dataset lists cars along with their performance and other relevant characteristics.

mpg # gives a cursory look at the dataset
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows
View(mpg) # opens a spreadsheet-style view of the dataset
names(mpg) # lists all the column names in the dataset
##  [1] "manufacturer" "model"        "displ"        "year"         "cyl"         
##  [6] "trans"        "drv"          "cty"          "hwy"          "fl"          
## [11] "class"
str(mpg) # str() compactly displays the internal structure of the object passed as its argument. Here the object is the mpg dataset, which is a tibble (a kind of data frame in R)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

The class() function tells us the exact type of a variable in a dataset. This is important when a dataset mixes categorical, numerical and other types of data.

However, calling class() on every single variable is tiresome, so we will make use of the sapply() function to greatly reduce our effort…

sapply(mpg,class)
## manufacturer        model        displ         year          cyl        trans 
##  "character"  "character"    "numeric"    "integer"    "integer"  "character" 
##          drv          cty          hwy           fl        class 
##  "character"    "integer"    "integer"  "character"  "character"
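As a small aside, the same per-column check can be sketched with vapply(), which also fixes the type of the result. The cars data frame below is a toy stand-in built from a few values in the first rows of mpg so that the snippet runs without any packages; the name cars and the choice of columns are illustrative assumptions.

```r
# Toy stand-in for three mpg columns (values taken from the first rows shown above).
cars <- data.frame(
  model = c("a4", "a4", "a4"),
  displ = c(1.8, 1.8, 2),
  cyl   = c(4L, 4L, 4L),
  stringsAsFactors = FALSE
)

# class() on a single column
class(cars$displ)

# vapply() applies class() to every column and insists on
# exactly one character string per column
col_classes <- vapply(cars, function(x) class(x)[1], character(1))
col_classes
```

Taking only the first element of class() keeps the result to one string per column even for objects that carry several classes.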

Let’s explore the data with some ggplots…

Plot No.1

ggplot(mpg) + geom_point(aes(x = displ, y = hwy))

Let’s look again at the data types of ‘hwy’ and ‘displ’…

sapply(mpg,class)
## manufacturer        model        displ         year          cyl        trans 
##  "character"  "character"    "numeric"    "integer"    "integer"  "character" 
##          drv          cty          hwy           fl        class 
##  "character"    "integer"    "integer"  "character"  "character"

Clearly, hwy (integer) and displ (numeric) are both numeric variables. Are there any factors? Let sapply help us out again…

sapply(mpg, is.factor)
## manufacturer        model        displ         year          cyl        trans 
##        FALSE        FALSE        FALSE        FALSE        FALSE        FALSE 
##          drv          cty          hwy           fl        class 
##        FALSE        FALSE        FALSE        FALSE        FALSE

Some variables that are encoded as character or string can turn out to be factors. Let’s see which of our variables are characters…

sapply(mpg, is.character)
## manufacturer        model        displ         year          cyl        trans 
##         TRUE         TRUE        FALSE        FALSE        FALSE         TRUE 
##          drv          cty          hwy           fl        class 
##         TRUE        FALSE        FALSE         TRUE         TRUE

The following turn out to be character (string) variables:

  • manufacturer

  • model

  • trans

  • drv

  • fl

  • class

6 of the 11 variables in this data frame are character variables.
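If we wanted to treat these character columns as factors, one possible approach is sketched below. It works on a plain data frame copy of mpg; the name mpg_df and the decision to convert every character column are assumptions made for illustration, not a step the rest of this chapter relies on.

```r
library(ggplot2)  # provides the mpg dataset

# Work on a plain data frame copy so that mpg itself is left untouched
mpg_df <- as.data.frame(mpg)

# Names of the character columns
chr_cols <- names(mpg_df)[sapply(mpg_df, is.character)]
chr_cols

# Convert each character column to a factor
mpg_df[chr_cols] <- lapply(mpg_df[chr_cols], factor)

# Check the result, and look at the levels of one converted column
sapply(mpg_df, is.factor)
levels(mpg_df$drv)
```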

Plot No.2 : Scatterplot of highway efficiency vs number of cylinders

ggplot(data = mpg,aes(x= cyl, y = hwy)) + geom_point()

Plot No. 3 : Adding color coding by vehicle class to plot 1, with one facet per class

ggplot(mpg, aes(x = displ, y = hwy, color = class)) + geom_point() + facet_wrap(~class)

Plot No.4 : Drawing the same graph in a different color, to show customization

ggplot(mpg, aes(x= displ, y= hwy)) + geom_point( color = "blue") + theme_classic()
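In the plot above the color is set outside aes(). The sketch below contrasts that with placing the same string inside aes(), where ggplot2 treats "blue" as a data value rather than as a color name. The object names p_set and p_map are illustrative only.

```r
library(ggplot2)

# Setting: every point is drawn in blue
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Mapping: "blue" is treated as a one-level categorical variable,
# so the default color scale picks the color and a legend appears
p_map <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = "blue"))

p_set
p_map
```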

  • The aesthetic mapping also accepts logical expressions. Say, for example, we wanted to change the scatter plot above to distinguish all points for which the x-axis value (displ) is less than 5. Such a condition can be passed directly in the aesthetic mapping…

Plot No.5 : Mapping the color aesthetic to a condition on the data (less than, greater than, etc.)

ggplot(mpg, aes(x= displ, y = hwy)) + geom_point(aes(color = displ < 5))
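To see what the condition evaluates to, the sketch below computes it outside the plot and then gives the legend a friendlier title with labs(). The variable name small_engine and the legend wording are assumptions made for illustration.

```r
library(ggplot2)

# The condition is an ordinary logical vector, one value per car
small_engine <- mpg$displ < 5
table(small_engine)

# Same plot as above, with a custom legend title
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = displ < 5)) +
  labs(color = "Displacement under 5 litres")
p
```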

Let’s make some tasty facet wraps and grids…

Plot No. 6 : Using facet_wrap to plot the relation between displacement and highway efficiency for each class of car in the mpg data

ggplot(mpg) + geom_point(aes(x = displ, y = hwy, color = (hwy >20 & displ>4)) )+ facet_wrap(~class, nrow = 3)

  • The plot above also highlights those cars (data points) whose highway efficiency is greater than 20 even though the engine displacement is greater than 4 litres. This color condition is the part of the code that is modified from plot no 5.

  • As an important reminder, the variable that facet_wrap works on needs to be discrete or categorical. You may need to bin variables if they are continuous and you want to facet on them.
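The binning mentioned in the last point can be sketched as follows, using base R's cut() on the cty column. The cut points (15 and 20), the labels and the column name cty_bin are arbitrary assumptions chosen only to illustrate the idea.

```r
library(ggplot2)

mpg_binned <- as.data.frame(mpg)

# Hypothetical cut points: (0, 15], (15, 20], (20, Inf]
mpg_binned$cty_bin <- cut(
  mpg_binned$cty,
  breaks = c(0, 15, 20, Inf),
  labels = c("low", "medium", "high")
)

# How many cars land in each bin?
table(mpg_binned$cty_bin)

# Facet on the binned variable instead of the raw one
p <- ggplot(mpg_binned) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_wrap(~cty_bin)
p
```

This gives three panels instead of one panel per distinct value of cty.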

Plot No. 7 : Using the facet_grid( ) option to facet the plot on a combination of 2 variables.

  • The example plot here will facet the scatter plot of highway efficiency vs displacement over 2 variables: drive type (drv) and number of cylinders (cyl).

How many unique values do drv and cyl have? We can use sapply again together with length(unique()) to answer this for all the columns in our data set…

sapply(mpg, function(x) length(unique(x)))
## manufacturer        model        displ         year          cyl        trans 
##           15           38           35            2            4           10 
##          drv          cty          hwy           fl        class 
##            3           21           27            5            7

We see that drv has 3 unique values and cyl has 4, so we expect a grid of 3 x 4 = 12 panels.

ggplot(mpg) + geom_point(aes(x = displ, y = hwy)) + facet_grid(drv~cyl)

Plot No. 8 : Adding color coding to plot 7

ggplot(mpg ) + geom_point( aes(x = displ, y = hwy, color = drv)) + facet_grid(drv~cyl)

  • 3 different colors correspond to the type of drivetrain of each car: four-wheel drive (4), rear-wheel drive (r) and front-wheel drive (f).

Plot No. 9 : Plot 7 colored by the number of cylinders in the engine

ggplot(mpg ) + geom_point( aes(x = displ, y = hwy, color = cyl)) + facet_grid(drv~cyl)

  • Since cyl is an integer variable, the color scale is a gradient of a single color. To make R treat it as a categorical variable like drv, we can use as.factor…

ggplot(mpg) + geom_point(aes(x = displ, y = hwy, color = as.factor(cyl))) + facet_grid(drv~cyl)

And now we see a distinct solid color for each of the 4, 5, 6 and 8 cylinder options available in the mpg data set.
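What as.factor() changes can be checked directly on the column, outside of any plot. This is a minimal sketch; cyl_factor is just an illustrative name.

```r
library(ggplot2)

# cyl is stored as an integer, so ggplot2 maps it to a continuous color scale
class(mpg$cyl)

# as.factor() turns it into a categorical variable with one level per distinct value
cyl_factor <- as.factor(mpg$cyl)
class(cyl_factor)
levels(cyl_factor)
```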

Plot No. 10 : Faceting on a raw continuous variable

Let’s use mpg$cty, a continuous variable, to facet on

ggplot(mpg ) + geom_point( aes(x = displ, y = hwy, color = as.factor(cyl))) + facet_wrap(~cty)

Here again we have treated the number of cylinders as a factor and color coded our scatterplot accordingly. We see that a displ vs hwy scatter plot is created for each unique value of the cty variable. The column has 21 unique values, as verified below…

length(unique(mpg$cty))
## [1] 21

Our plot 8 had some empty panels in the facet grid of drv ~ cyl. Plotting it again below…

ggplot(mpg ) + geom_point( aes(x = displ, y = hwy, color = drv)) + facet_grid(drv~cyl)

It seems the cells corresponding to 4 x r, 5 x r and 5 x 4 (cyl x drv) are empty. Let’s confirm this with the plot below…

ggplot(mpg )  + geom_jitter(aes(x = drv, y = cyl, color = drv))
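The same check can also be done numerically with a cross-tabulation, which counts the cars in every drv and cyl combination. This is a minimal sketch using base R's table(); the name combos is illustrative.

```r
library(ggplot2)

# Rows are drive types, columns are cylinder counts
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
combos

# Number of drv/cyl combinations that contain no cars at all
sum(combos == 0)
```

Each zero in the table corresponds to an empty panel in the facet grid.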

ggplot(mpg) + geom_point(aes(x = displ, y = hwy)) + facet_wrap(~class, nrow = 20)

ggplot(mpg, aes(color = drv)) + geom_smooth(aes(x= displ, y = hwy, linetype = drv)) + geom_point(aes( x= displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

3 different versions: just grouping and coloring them makes the difference…

  • First …
ggplot(mpg, aes(x = displ, y= hwy)) + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  • Second …
ggplot(mpg, aes (x= displ, y = hwy)) + geom_smooth(aes(group = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  • Third …
ggplot(mpg, aes (x= displ, y = hwy)) + geom_smooth(aes(color = drv, group = drv), show.legend = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes (x= displ, y = hwy, color = drv)) + geom_smooth(aes( group = drv), show.legend = FALSE) + geom_point() + facet_wrap(~drv) 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
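The message printed after each of these plots reports the smoothing method and formula that geom_smooth() chose by default. A minimal sketch of the last plot with both stated explicitly is given below; passing method and formula is enough to avoid the message, and the values used here are simply the ones the message reports.

```r
library(ggplot2)

# Same plot as above, with the smoother's method and formula spelled out
p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_smooth(
    aes(group = drv),
    method = "loess",
    formula = y ~ x,
    show.legend = FALSE
  ) +
  geom_point() +
  facet_wrap(~drv)
p
```

Swapping in method = "lm" would fit a straight line per group instead of a loess curve.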