The packages used in this chapter include:
* tidyverse
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
The mpg dataset lists cars along with their performance and other relevant characteristics.
mpg # gives a cursory look at the dataset
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # … with 224 more rows
View(mpg) # opens a spreadsheet-style view of the dataset
names(mpg) # lists all the column names in the dataset
## [1] "manufacturer" "model" "displ" "year" "cyl"
## [6] "trans" "drv" "cty" "hwy" "fl"
## [11] "class"
str(mpg) # str() compactly displays the internal structure of the object passed as its argument. Here the object is the mpg dataset, which is a tibble (a type of data frame in R)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
The class() function tells us the exact type of each variable in a dataset. This matters when a dataset mixes categorical, numerical and other types of data.
However, calling class() on every single variable is tiresome, so we will make use of the sapply() function to apply it to all columns at once…
sapply(mpg,class)
## manufacturer model displ year cyl trans
## "character" "character" "numeric" "integer" "integer" "character"
## drv cty hwy fl class
## "character" "integer" "integer" "character" "character"
ggplot(mpg) + geom_point(aes(x = displ, y= hwy))
Let's look again at the types of 'hwy' and 'displ'.
sapply(mpg,class)
## manufacturer model displ year cyl trans
## "character" "character" "numeric" "integer" "integer" "character"
## drv cty hwy fl class
## "character" "integer" "integer" "character" "character"
Clearly, hwy (integer) and displ (double, reported as "numeric") are both numeric variables. Are there any factors? Let sapply help us out again…
sapply(mpg, is.factor)
## manufacturer model displ year cyl trans
## FALSE FALSE FALSE FALSE FALSE FALSE
## drv cty hwy fl class
## FALSE FALSE FALSE FALSE FALSE
Some variables that are encoded as character or string can turn out to be factors. Let's see which of our variables are characters…
sapply(mpg, is.character)
## manufacturer model displ year cyl trans
## TRUE TRUE FALSE FALSE FALSE TRUE
## drv cty hwy fl class
## TRUE FALSE FALSE TRUE TRUE
The following turn out to be character/string variables:
manufacturer
model
trans
drv
fl
class
6 of the 11 variables in this data frame are character variables.
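If we decide that these character columns really are categorical, one way to convert them is sketched below. This is only an illustration: `mpg_fct` is a made-up name, and whether every character column (for example `model`, with 38 distinct values) should become a factor is a judgment call.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Convert every character column to a factor.
# mpg_fct is a hypothetical name used only for this sketch.
mpg_fct <- mpg %>%
  mutate(across(where(is.character), as.factor))

sapply(mpg_fct, is.factor)  # the former character columns now report TRUE
levels(mpg_fct$drv)         # the distinct drivetrain codes
```

The original mpg tibble is left untouched, so the plots in the rest of the chapter behave exactly as before.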
ggplot(data = mpg,aes(x= cyl, y = hwy)) + geom_point()
ggplot(mpg, aes(x = displ, y = hwy, color = class)) + geom_point() +facet_wrap(~class)
ggplot(mpg, aes(x= displ, y= hwy)) + geom_point( color = "blue") + theme_classic()
ggplot(mpg, aes(x= displ, y = hwy)) + geom_point(aes(color = displ < 5))
## Let's make some tasty facet wraps and grids
ggplot(mpg) + geom_point(aes(x = displ, y = hwy, color = (hwy >20 & displ>4)) )+ facet_wrap(~class, nrow = 3)
The plot above also highlights those cars (data points) whose highway efficiency is greater than 20 even though their engine displacement is higher than 4 litres. The color condition is a modified version of the one used in plot no 5.
As an important reminder, the variable that facet_wrap works on needs to be discrete or categorical. You may need to bin variables if they are continuous and you want to facet on them.
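As a minimal sketch of the binning idea, assuming we want to facet on engine displacement: ggplot2's cut_number() splits a numeric vector into bins holding roughly equal numbers of observations. The choice of 3 bins and the names `mpg_binned` and `displ_bin` are arbitrary and only for illustration.

```r
library(ggplot2)

# Copy mpg and add a binned version of the continuous displ column.
# mpg_binned and displ_bin are hypothetical names; 3 bins is an arbitrary choice.
mpg_binned <- mpg
mpg_binned$displ_bin <- cut_number(mpg$displ, n = 3)

# Facet on the binned (now categorical) variable
ggplot(mpg_binned) +
  geom_point(aes(x = cty, y = hwy)) +
  facet_wrap(~displ_bin)
```

cut_width() or cut_interval() could be used instead if bins of equal width are preferred over bins of equal size.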
How many unique values do drv and cyl have? We can use sapply again, together with length(unique()), to answer this for every column in our data set…
sapply(mpg, function(x) length(unique(x)))
## manufacturer model displ year cyl trans
## 15 38 35 2 4 10
## drv cty hwy fl class
## 3 21 27 5 7
We see that drv has 3 unique values and cyl has 4, so we expect facet_grid to lay out 3 x 4 = 12 panels, one for each combination.
ggplot(mpg) + geom_point(aes(x = displ, y = hwy)) + facet_grid(drv~cyl)
### Plot no 8: Adding color coding to plot 7
ggplot(mpg ) + geom_point( aes(x = displ, y = hwy, color = drv)) + facet_grid(drv~cyl)
* 3 different colors, one for each type of drivetrain: 4x4 (4), rear wheel drive (r) and front wheel drive (f).
ggplot(mpg ) + geom_point( aes(x = displ, y = hwy, color = cyl)) + facet_grid(drv~cyl)
ggplot(mpg ) + geom_point( aes(x = displ, y = hwy, color = as.factor(cyl))) + facet_grid(drv~cyl)
When cyl is mapped to color as an integer, ggplot2 uses a continuous color gradient. Converting it with as.factor() gives a solid, distinct color for each of the 4, 5, 6 and 8 cylinder options available in the mpg data set.
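A quick check, sketched here on the column alone, shows why the two plots differ: cyl is stored as an integer, and as.factor() turns it into a categorical variable with one level per cylinder count. The name `cyl_fct` is just for this illustration.

```r
library(ggplot2)  # provides the mpg dataset

class(mpg$cyl)                 # stored as an integer, so color gets a continuous scale

cyl_fct <- as.factor(mpg$cyl)  # cyl_fct is a hypothetical name
class(cyl_fct)                 # now a factor, so color gets a discrete scale
levels(cyl_fct)                # one level per distinct cylinder count
```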
Let's facet on mpg$cty, a numeric variable that we would normally treat as continuous
ggplot(mpg ) + geom_point( aes(x = displ, y = hwy, color = as.factor(cyl))) + facet_wrap(~cty)
Here again we have treated the number of cylinders as a factor variable and
color coded our scatterplot accordingly. We see that a scatter plot of displ
vs hwy is created for each unique value of the cty variable.
The column has 21 unique values, as verified below…
length(unique(mpg$cty))
## [1] 21
Plot 8 had some empty panels in the facet grid of drv ~ cyl. Plotting it again below…
ggplot(mpg ) + geom_point( aes(x = displ, y = hwy, color = drv)) + facet_grid(drv~cyl)
It seems the panels for cyl 4 with drv r, cyl 5 with drv r, and cyl 5 with drv 4 are empty. Let's
confirm this with the plot below…
ggplot(mpg ) + geom_jitter(aes(x = drv, y = cyl, color = drv))
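The same check can also be done numerically. The sketch below cross-tabulates drv against cyl with base R's table() and lists the combinations that never occur; the names `combos` and `empty_cells` are made up for this example.

```r
library(ggplot2)  # provides the mpg dataset

# Count the cars in every drv x cyl combination
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
combos

# Keep only the combinations with no cars at all
combo_df <- as.data.frame(combos)
empty_cells <- combo_df[combo_df$Freq == 0, ]
empty_cells
```

Each row of `empty_cells` corresponds to one empty panel in the facet grid.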
ggplot(mpg) + geom_point(aes(x = displ, y = hwy)) + facet_wrap(~class, nrow = 20)
ggplot(mpg, aes(color = drv)) + geom_smooth(aes(x= displ, y = hwy, linetype = drv)) + geom_point(aes( x= displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
3 different versions: just grouping and coloring them makes the difference…
ggplot(mpg, aes(x = displ, y= hwy)) + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes (x= displ, y = hwy)) + geom_smooth(aes(group = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes (x= displ, y = hwy)) + geom_smooth(aes(color = drv, group = drv), show.legend = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes (x= displ, y = hwy, color = drv)) + geom_smooth(aes( group = drv), show.legend = FALSE) + geom_point() + facet_wrap(~drv)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
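To see what the group aesthetic actually changes, we can inspect the data ggplot2 computes for the smooth layer. The sketch below uses layer_data() from ggplot2 to count how many separate curves are fitted; `p_single` and `p_grouped` are hypothetical names for two of the plots above.

```r
library(ggplot2)

# One smooth over all cars versus one smooth per drivetrain
p_single  <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth()
p_grouped <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth(aes(group = drv))

# layer_data() returns the computed data for a layer (the first layer by default);
# its group column identifies which fitted curve each row belongs to
n_single  <- length(unique(layer_data(p_single)$group))
n_grouped <- length(unique(layer_data(p_grouped)$group))

n_single   # number of curves without grouping
n_grouped  # number of curves with group = drv
```

Mapping color or linetype to drv creates the same grouping implicitly, which is why the colored versions also draw one curve per drivetrain.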