Source file ⇒ lec10.Rmd

Today

  1. New printr package for making nice tables
  2. More on Joins (Chapter 7)
  3. ggplot from DataCamp’s ggplot2 (1) course

1. printr

Here is a printout of mtcars rendered using knitr (it looks like the printout at the console):

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

To produce more publication-quality tables, use the new package printr. It isn’t available yet on the CRAN repository, so you have to install it by running the following chunk (you need to run the chunk itself, not just click Knit HTML).

install.packages(
  'printr',
  type = 'source',
  repos = c('http://yihui.name/xran', 'http://cran.rstudio.com')
)

Once you install it you will see printr in your packages list in RStudio. You only have to do this once.

Next you need to remember to put library(printr) in a chunk in your rmarkdown file.

library(printr)

Here is a print out of mtcars now:

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
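
For comparison, a similar table can be produced without printr by calling kable() from knitr directly. This is only a sketch: it assumes the knitr package is installed, the caption text is made up for illustration, and printr's own formatting may differ in detail.

```r
library(knitr)

# kable() turns a data frame into a formatted table when the document is knit
tbl <- kable(head(mtcars), caption = "First six rows of mtcars")
tbl
```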

2. Joins

Each of the data verbs described until now takes one data table as an input. Join is different. It’s well named: it brings together two data tables. Once you can join two data tables, you can join any number of data tables by repeated joins. Joining is at the heart of combining data from multiple sources.

It’s conventional to refer to the two tables as the “left” and “right” tables.

To illustrate the differences and similarities between the different kinds of join, suppose you have these two tables:

Left table (LL):

clinicName  postalCode
A           22120
B           35752
C           56718
D           35752
E           67756
F           69129
G           73455
H           73455
I           76292

Right table (RR):

over65  postalCode
0.46    35752
0.72    22120
0.93    22120
0.26    92332
0.46    84739
0.94    67756
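
The lecture never shows how LL and RR were built, so here is one way to recreate them in order to run the joins below yourself. The values are copied from the tables above; storing postalCode as a number is an assumption.

```r
library(dplyr)

# Left table: one row per clinic
LL <- data.frame(
  clinicName = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
  postalCode = c(22120, 35752, 56718, 35752, 67756, 69129, 73455, 73455, 76292)
)

# Right table: fraction of residents over 65, by postal code
RR <- data.frame(
  over65     = c(0.46, 0.72, 0.93, 0.26, 0.46, 0.94),
  postalCode = c(35752, 22120, 22120, 92332, 84739, 67756)
)

# Column names shared by both tables
shared <- intersect(names(LL), names(RR))
shared
```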

Variables that appear in both the left and right tables are called “overlap variables.” The overlap variables determine which cases will go into the output table. In this example, there is just one overlap variable: postalCode.

The diagram below shows the cases in the left and right tables. The lines show the matches between left and right. The cases connected by a match are the overlap cases; there are five of them in the diagram. Cases without a match also appear in both the left and right tables.

Note that there are three different kinds of cases here:

  1. The matching cases that are in both the left and right. These come as pairs: the cases connected by a line in the diagram.
  2. Non-matching ones in the left.
  3. Non-matching ones in the right.

There are different types of join. The type of join specifies which of these cases go into the output: only the matching pairs, or the matching pairs together with the non-matching cases from the left table, the right table, or both.
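
The three kinds of cases can be picked out directly with dplyr's semi_join() and anti_join(). These two verbs are not part of this lecture; the sketch below only uses them to count the cases, with LL and RR rebuilt from the tables shown earlier.

```r
library(dplyr)

LL <- data.frame(
  clinicName = LETTERS[1:9],
  postalCode = c(22120, 35752, 56718, 35752, 67756, 69129, 73455, 73455, 76292)
)
RR <- data.frame(
  over65     = c(0.46, 0.72, 0.93, 0.26, 0.46, 0.94),
  postalCode = c(35752, 22120, 22120, 92332, 84739, 67756)
)

# Left cases that have at least one match in the right table
matched_left <- semi_join(LL, RR, by = "postalCode")

# Non-matching cases in the left table
unmatched_left <- anti_join(LL, RR, by = "postalCode")

# Non-matching cases in the right table
unmatched_right <- anti_join(RR, LL, by = "postalCode")

matched_left$clinicName
nrow(unmatched_left)
nrow(unmatched_right)
```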

Inner Join

An inner join gives the matching pairs. Note that clinic A, which had two matches in the right table, appears twice, once for each matching pair in which clinic A is involved.

LL %>% inner_join(RR, by=c("postalCode"="postalCode"))
clinicName postalCode over65
A 22120 0.72
A 22120 0.93
B 35752 0.46
D 35752 0.46
E 67756 0.94
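
The by=c("postalCode"="postalCode") form is most useful when the key has different names in the two tables: the name on the left of the = belongs to the left table and the name on the right to the right table. As an illustration, the sketch below uses a hypothetical copy of RR (called RR_zip here) whose key column is named zip. Recent versions of dplyr may warn about a many-to-many relationship, because some postal codes repeat in both tables; the result is the same.

```r
library(dplyr)

LL <- data.frame(
  clinicName = LETTERS[1:9],
  postalCode = c(22120, 35752, 56718, 35752, 67756, 69129, 73455, 73455, 76292)
)

# Hypothetical variant of RR: same data, but the key column is called zip
RR_zip <- data.frame(
  over65 = c(0.46, 0.72, 0.93, 0.26, 0.46, 0.94),
  zip    = c(35752, 22120, 22120, 92332, 84739, 67756)
)

# Match postalCode in the left table against zip in the right table
joined <- LL %>% inner_join(RR_zip, by = c("postalCode" = "zip"))
joined
```

The output keeps the left table's name for the key column.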

Outer Join

An outer join can include cases where there is no match. You might want to include the unmatched cases from the left table, from the right table, or from both tables.

Unmatched cases from the left table

LL %>% left_join(RR)
clinicName postalCode over65
A 22120 0.72
A 22120 0.93
B 35752 0.46
C 56718 NA
D 35752 0.46
E 67756 0.94
F 69129 NA
G 73455 NA
H 73455 NA
I 76292 NA

Unmatched cases from the right table

LL %>% right_join(RR)
clinicName postalCode over65
B 35752 0.46
D 35752 0.46
A 22120 0.72
A 22120 0.93
NA 92332 0.26
NA 84739 0.46
E 67756 0.94

Unmatched cases from both tables

LL %>% full_join(RR)
clinicName postalCode over65
A 22120 0.72
A 22120 0.93
B 35752 0.46
C 56718 NA
D 35752 0.46
E 67756 0.94
F 69129 NA
G 73455 NA
H 73455 NA
I 76292 NA
NA 92332 0.26
NA 84739 0.46
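
A quick way to check your understanding of the four joins is to compare their row counts. The sketch below rebuilds LL and RR from the tables shown earlier and counts the rows of each join; the variable names n_inner, n_left, n_right and n_full are just for illustration. Recent versions of dplyr may warn about a many-to-many relationship here.

```r
library(dplyr)

LL <- data.frame(
  clinicName = LETTERS[1:9],
  postalCode = c(22120, 35752, 56718, 35752, 67756, 69129, 73455, 73455, 76292)
)
RR <- data.frame(
  over65     = c(0.46, 0.72, 0.93, 0.26, 0.46, 0.94),
  postalCode = c(35752, 22120, 22120, 92332, 84739, 67756)
)

n_inner <- nrow(inner_join(LL, RR, by = "postalCode"))  # matching pairs only
n_left  <- nrow(left_join(LL, RR, by = "postalCode"))   # pairs + unmatched left
n_right <- nrow(right_join(LL, RR, by = "postalCode"))  # pairs + unmatched right
n_full  <- nrow(full_join(LL, RR, by = "postalCode"))   # pairs + unmatched from both

c(inner = n_inner, left = n_left, right = n_right, full = n_full)
```

The counts agree with the tables above: the full join has as many rows as the left and right joins together, minus the matching pairs that would otherwise be counted twice.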

Task For You:

Here is a left and right data table:

Left <- CountryCentroids %>% select(name, iso_a3) %>% tail()
Left
    name                   iso_a3
236 Vietnam                VNM
237 W. Sahara              ESH
238 Wallis and Futuna Is.  WLF
239 Yemen                  YEM
240 Zambia                 ZMB
241 Zimbabwe               ZWE
Right <-  CountryData %>% select(country,life, infant) %>% tail()
Right
    country            life   infant
251 Wallis and Futuna  79.42    4.49
252 West Bank          75.69   13.49
253 Western Sahara     62.27   56.09
254 Yemen              64.83   50.41
255 Zambia             51.83   66.62
256 Zimbabwe           55.68   26.55

Make the following join:

name iso_a3 life infant
Yemen YEM 64.83 50.41
Zambia ZMB 51.83 66.62
Zimbabwe ZWE 55.68 26.55

3. Chapter 2 (Data) of the DataCamp course ggplot2 (1)

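We will work with the mtcars data set. In mtcars the variable am is coded 0 for automatic and 1 for manual transmission, so mtcars_m below (am == 0) holds the automatic cars and mtcars_a (am == 1) the manual ones.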

mtcars_m <- mtcars %>% 
  filter(am==0)

mtcars_a <- mtcars %>%
  filter(am==1) 
  
head(mtcars_a)
mpg cyl disp hp drat wt qsec vs am gear carb
21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1

ggplot2 versus base package

Before ggplot there was plotting with base R graphics. Many research papers still make their plots with base graphics, so you should be familiar with it.

example 1

In base graphics, a scatter plot of mpg versus wt for the cars in mtcars_m is made with plot():

plot( mtcars_m$wt,mtcars_m$mpg, col=as.factor(mtcars_m$cyl))

That is fine, but suppose we wish to add an additional layer of points for the cars in mtcars_a:

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
points( mtcars_a$wt,mtcars_a$mpg, col="blue")

Here we see some major limitations of base package drawing.

  1. The plot doesn’t get redrawn: the axes keep the range of the first layer, so added points that fall outside that range are cut off.
  2. The plot is drawn as an image: every layer is drawn on top of the image. In ggplot the plot is an object which we can change.
  3. We need to add a legend ourselves, and by the time you make the legend manually you may have forgotten what the different colors mean.

In ggplot we would have:

mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(aes(col=as.factor(cyl))) + facet_wrap(~am)
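
Point 2 in the list above says that a ggplot is an object that can be changed. A minimal sketch of what that means in practice: store the plot in a variable and add to it later. The names p and p_facet are chosen here for illustration only.

```r
library(ggplot2)

# Build the plot once and keep it as an object; nothing is drawn yet
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(col = as.factor(cyl)))

# Adding a facet returns a new plot object; p itself is unchanged
p_facet <- p + facet_wrap(~am)

class(p)
p_facet  # printing the object draws it, with axes and legend worked out for us
```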

task for you

The vector precip gives the yearly precipitation in different cities. Using the base package function hist, make a histogram of precip (hint: try hist(precip)). Next make the plot in ggplot using the geom_histogram() function. You will need to convert the vector precip to a data frame using as.data.frame(precip). This might be helpful: ggplot2.org

example 2

As another example, suppose we wish to make a linear model of how mpg varies with car weight.

# Use lm() to calculate a linear model and save it as carModel
carModel <- lm(mpg ~ wt, data = mtcars_m)
carModel
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars_m)
## 
## Coefficients:
## (Intercept)           wt  
##      31.416       -3.786

We see that the best fitting line through the scatter plot for the cars in mtcars_m is \[ \text{mpg} = -3.786 \cdot \text{wt} + 31.416 \]
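
Rather than copying the numbers from the printout, the intercept and slope can be pulled out of the model with coef() and used for a prediction with predict(). This is a small sketch that rebuilds mtcars_m with subset() so it runs on its own; the weight of 3 (that is, 3000 lbs) is an arbitrary example value.

```r
# Same subset as mtcars_m in the lecture (am == 0)
mtcars_m <- subset(mtcars, am == 0)
carModel <- lm(mpg ~ wt, data = mtcars_m)

# Named vector holding the intercept and the slope for wt
coefs <- coef(carModel)
coefs

# Predicted mpg for a car with wt = 3
pred <- predict(carModel, newdata = data.frame(wt = 3))
pred
```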

We can draw the regression line through our plot of the mtcars_m cars. The legend has to be made manually, which is awkward.

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
carModel <- lm(mpg ~ wt, data = mtcars_m)
abline(carModel, lty = 2)
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")

In fact you can draw the regression line for each cylinder type.

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==4)), lty = 2)
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==6)), lty = 2)
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==8)), lty = 2)
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")

or more efficiently using lapply(), which you will learn about soon in the DataCamp course Intermediate R.

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
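cyl_levels <- sort(unique(mtcars_m$cyl))
# Loop over the distinct cyl values (not every row) and use the level index
# as the colour so that each line matches the colour of its points
invisible(lapply(seq_along(cyl_levels), function(i) {
  abline(lm(mpg ~ wt, mtcars_m, subset = (cyl == cyl_levels[i])), col = i)
}))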
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")

In ggplot it is much easier. Note that we put the color aesthetic in the ggplot frame instead of geom_point since we want both the points and the regression lines to be categorized by color.

mtcars_m %>% ggplot(aes(x = wt, y = mpg, col = as.factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Data verbs gather(), separate(), and spread() to wrangle your data

Let’s examine the iris data table:

head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

example 1

Suppose you want to make the following plot:

The data table iris isn’t glyph-ready. We need the variables Species, Part, Measure and Value:

Species Part Measure Value
setosa Sepal Length 5.1
setosa Sepal Length 4.9
setosa Sepal Length 4.7
setosa Sepal Length 4.6
setosa Sepal Length 5.0
setosa Sepal Length 5.4

To achieve this first we create a key column with the gather() command.

iris %>%
  gather(key, Value, -Species) %>%
  head()
Species key Value
setosa Sepal.Length 5.1
setosa Sepal.Length 4.9
setosa Sepal.Length 4.7
setosa Sepal.Length 4.6
setosa Sepal.Length 5.0
setosa Sepal.Length 5.4

This key column contains values like Sepal.Width that we want to separate into Sepal and Width. The first piece will go into a Part variable and the second into a Measure variable.

Next we separate the key column into a Part and Measure column with separate()

iris.tidy <- iris %>%
  gather(key, Value, -Species) %>%
  separate(key, c("Part", "Measure"),  "\\.") 
iris.tidy %>% head()
Species Part Measure Value
setosa Sepal Length 5.1
setosa Sepal Length 4.9
setosa Sepal Length 4.7
setosa Sepal Length 4.6
setosa Sepal Length 5.0
setosa Sepal Length 5.4
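
In recent versions of tidyr, gather() still works but pivot_longer() is the recommended replacement, and it can do the gather and separate steps in one call. The sketch below is an alternative to the lecture's code, not the method used in the DataCamp course; it assumes tidyr 1.0.0 or later, and iris_long is a name chosen here for illustration. The row order differs from the gather() version.

```r
library(tidyr)

# names_to lists the two new columns; names_sep says where to split the old names
iris_long <- pivot_longer(
  datasets::iris,
  cols      = -Species,
  names_to  = c("Part", "Measure"),
  names_sep = "\\.",
  values_to = "Value"
)

head(iris_long)
```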

example 2

Suppose we modify iris so that it has a new variable Flower that gives each flower a unique id:

iris<- iris %>% mutate(Flower=1:nrow(iris))
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Flower
5.1 3.5 1.4 0.2 setosa 1
4.9 3.0 1.4 0.2 setosa 2
4.7 3.2 1.3 0.2 setosa 3
4.6 3.1 1.5 0.2 setosa 4
5.0 3.6 1.4 0.2 setosa 5
5.4 3.9 1.7 0.4 setosa 6

Suppose you want to make the following plot:

iris isn’t glyph ready. We need the table:

Species Flower Part Length Width
setosa 1 Petal 1.4 0.2
setosa 1 Sepal 5.1 3.5
setosa 2 Petal 1.4 0.2
setosa 2 Sepal 4.9 3.0
setosa 3 Petal 1.3 0.2
setosa 3 Sepal 4.7 3.2

To achieve this, we first create a key column with the gather() command.

iris %>%
  gather(key, value, -Species, -Flower) %>%
  head()
Species Flower key value
setosa 1 Sepal.Length 5.1
setosa 2 Sepal.Length 4.9
setosa 3 Sepal.Length 4.7
setosa 4 Sepal.Length 4.6
setosa 5 Sepal.Length 5.0
setosa 6 Sepal.Length 5.4

Next we separate the key column into a Part and Measure column with separate()

iris %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c("Part", "Measure"), "\\.") %>%
  head()
Species Flower Part Measure value
setosa 1 Sepal Length 5.1
setosa 2 Sepal Length 4.9
setosa 3 Sepal Length 4.7
setosa 4 Sepal Length 4.6
setosa 5 Sepal Length 5.0
setosa 6 Sepal Length 5.4

The last step is to use spread() to distribute the value column into two new columns, Length and Width, named after the values in the Measure column. Each pair of cases for a flower part (one Length, one Width) becomes a single case with two columns.

iris.wide <- iris %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c("Part", "Measure"), "\\.") %>%
  spread(Measure, value) 
iris.wide %>% head()
Species Flower Part Length Width
setosa 1 Petal 1.4 0.2
setosa 1 Sepal 5.1 3.5
setosa 2 Petal 1.4 0.2
setosa 2 Sepal 4.9 3.0
setosa 3 Petal 1.3 0.2
setosa 3 Sepal 4.7 3.2
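
The Flower id added at the start of this example is what makes spread() possible. The sketch below tries the same pipeline without the id and catches the resulting error, then repeats it with the id. It uses datasets::iris so it does not depend on the modified iris above; the names long_no_id, no_id_result and wide_with_id are illustrative.

```r
library(dplyr)
library(tidyr)

# Long form without an id column: the flowers are indistinguishable
long_no_id <- datasets::iris %>%
  gather(key, value, -Species) %>%
  separate(key, c("Part", "Measure"), "\\.")

# spread() cannot tell which Length goes with which Width, so it stops with an error
no_id_result <- tryCatch(
  spread(long_no_id, Measure, value),
  error = function(e) "spread() failed: rows are not uniquely identified"
)
no_id_result

# With a Flower id, every (Species, Flower, Part) combination is unique
wide_with_id <- datasets::iris %>%
  mutate(Flower = row_number()) %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c("Part", "Measure"), "\\.") %>%
  spread(Measure, value)

head(wide_with_id)
```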

Aesthetics versus fixed attributes

DataCamp uses the word aesthetic to mean a mapping from a visual property to a variable. This is actually a scale according to our book’s definition, but never mind that. In the real world people abuse language and use the word aesthetic when they really mean a scale.

An attribute is different from an aesthetic (i.e. a scale). An attribute sets color equal to a constant, for example “red” instead of a variable.

example:
x,y, col are aesthetics

mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(aes(col=as.factor(cyl))) 

x,y are aesthetics, col is an attribute

mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(col="red") 

Note: attributes don’t have a legend.
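
A common slip is to put the constant inside aes(). The sketch below builds the two plots from above plus this third variant so you can compare them; the names p_mapped, p_fixed and p_slip are chosen for illustration.

```r
library(ggplot2)

# Aesthetic: colour is mapped to a variable, so a legend appears
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(col = as.factor(cyl)))

# Attribute: colour is set to a constant outside aes(), so there is no legend
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(col = "red")

# Slip: inside aes() the string "red" is treated as a one-level variable,
# so the points get the default palette colour and a legend labelled "red"
p_slip <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(col = "red"))

p_slip
```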

iClicker question