Today

New printr package for making nice tables
More on Joins (Chapter 7)
ggplot from DataCamp’s ggplot2 (1) course

1. printr

Here is a printout of mtcars rendered using knitr (it looks like the printout at the console):

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

To produce more publication-quality tables, use the new package printr. It isn’t available yet on the CRAN repository, so you have to install it by running the following chunk (you need to run it, not just Knit HTML).

install.packages(
  'printr',
  type = 'source',
  repos = c('http://yihui.name/xran', 'http://cran.rstudio.com')
)

Once you install it you will see printr in your packages list in RStudio. You only have to do this once.

Next you need to remember to put library(printr) in a chunk in your R Markdown file.

library(printr)

Here is a print out of mtcars now:

head(mtcars)

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225	105	2.76	3.460	20.22	1	0	3	1

2. Joins

Each of the data verbs described until now takes one data table as an input. Join is different. It’s well named: it brings together two data tables. Once you can join two data tables, you can join any number of data tables by repeated joins. Joining is at the heart of combining data from multiple sources.

It’s conventional to refer to the two tables as the “left” and “right” tables.

To illustrate the differences and similarities between the different kinds of join, suppose you have these two tables:

Left: cases are medical clinics. The variables: clinicName, postalCode. Each clinic does multiple procedures.

clinicName	postalCode
A	22120
B	35752
C	56718
D	35752
E	67756
F	69129
G	73455
H	73455
I	76292

Right: cases are postal codes. Variables reflect the demographics of that postal code: postalCode, over65, etc.

over65	postalCode
0.46	35752
0.72	22120
0.93	22120
0.26	92332
0.46	84739
0.94	67756

Variables that appear in both the left and right tables are called “overlap variables.” The overlap variables determine which cases will go into the output table. In this example, there is just one overlap variable: postalCode.

The diagram below shows the cases in the left and right tables. The lines show the matches between left and right. The cases connected by a match are the overlap cases; there are five of them in the diagram. Cases without a match also appear in both the left and right tables.

Note that there are three different kinds of cases here:

The matching cases that are in both the left and right. These come as pairs: the cases connected by a line in the diagram.
Non-matching ones in the left.
Non-matching ones in the right.

There are different types of join. The type of join specifies what to include in the output: only the matching pairs, or the matching pairs together with the non-matching cases from the left table, the right table, or both.
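The three kinds of cases can be counted directly. The following is a minimal sketch in base R. It assumes the two example tables are stored as data frames named LL and RR (the names used in the join code in these notes) and rebuilds them by hand from the values shown above.

```r
# Rebuild the example tables by hand (values copied from the tables above)
LL <- data.frame(
  clinicName = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
  postalCode = c(22120, 35752, 56718, 35752, 67756, 69129, 73455, 73455, 76292)
)
RR <- data.frame(
  over65     = c(0.46, 0.72, 0.93, 0.26, 0.46, 0.94),
  postalCode = c(35752, 22120, 22120, 92332, 84739, 67756)
)

# Overlap variables: column names present in both tables
overlap <- intersect(names(LL), names(RR))

# Non-matching cases: rows whose postalCode never occurs in the other table
left_only  <- LL[!(LL$postalCode %in% RR$postalCode), ]
right_only <- RR[!(RR$postalCode %in% LL$postalCode), ]

# Matching pairs: for each left case, count the right cases with the same code
n_pairs <- sum(sapply(LL$postalCode, function(z) sum(RR$postalCode == z)))

overlap           # "postalCode"
nrow(left_only)   # 5 (clinics C, F, G, H, I)
nrow(right_only)  # 2 (postal codes 92332 and 84739)
n_pairs           # 5
```

Clinic A contributes two of the five pairs because postal code 22120 appears twice in the right table.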

Inner Join

An inner join gives the matching pairs. Note that clinic A, which had two matches in the right table, appears twice, once for each matching pair in which clinic A is involved.

LL %>% inner_join(RR, by=c("postalCode"="postalCode"))

clinicName	postalCode	over65
A	22120	0.72
A	22120	0.93
B	35752	0.46
D	35752	0.46
E	67756	0.94

Outer Join

An outer join can include cases where there is no match. You might want to include the unmatched cases from the left table, from the right table, or from both tables.

Unmatched cases from the left table

LL %>% left_join(RR)

clinicName	postalCode	over65
A	22120	0.72
A	22120	0.93
B	35752	0.46
C	56718	NA
D	35752	0.46
E	67756	0.94
F	69129	NA
G	73455	NA
H	73455	NA
I	76292	NA

Unmatched cases from the right table

LL %>% right_join(RR)

clinicName	postalCode	over65
B	35752	0.46
D	35752	0.46
A	22120	0.72
A	22120	0.93
NA	92332	0.26
NA	84739	0.46
E	67756	0.94

Unmatched cases from both tables

LL %>% full_join(RR)

clinicName	postalCode	over65
A	22120	0.72
A	22120	0.93
B	35752	0.46
C	56718	NA
D	35752	0.46
E	67756	0.94
F	69129	NA
G	73455	NA
H	73455	NA
I	76292	NA
NA	92332	0.26
NA	84739	0.46
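The four outputs above differ only in which unmatched cases are kept. As a cross-check on their row counts, here is a sketch that uses base R's merge() instead of the dplyr join functions. This is a swapped-in technique, not the method used in these notes: the all.x, all.y and all arguments of merge() play the role of choosing a left, right or full join. The tables are rebuilt by hand from the values shown earlier.

```r
# Rebuild the example tables by hand (values copied from the tables above)
LL <- data.frame(
  clinicName = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
  postalCode = c(22120, 35752, 56718, 35752, 67756, 69129, 73455, 73455, 76292)
)
RR <- data.frame(
  over65     = c(0.46, 0.72, 0.93, 0.26, 0.46, 0.94),
  postalCode = c(35752, 22120, 22120, 92332, 84739, 67756)
)

# Matching pairs only (like inner_join)
inner <- merge(LL, RR, by = "postalCode")
# Also keep unmatched left cases (like left_join)
left  <- merge(LL, RR, by = "postalCode", all.x = TRUE)
# Also keep unmatched right cases (like right_join)
right <- merge(LL, RR, by = "postalCode", all.y = TRUE)
# Keep unmatched cases from both tables (like full_join)
full  <- merge(LL, RR, by = "postalCode", all = TRUE)

# Row counts: inner 5, left 10, right 7, full 12
c(inner = nrow(inner), left = nrow(left), right = nrow(right), full = nrow(full))
```

The counts agree with the tables above. Note that merge() sorts by the key and puts postalCode first, so the row and column order differ from the dplyr output.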

Task For You:

Here is a left and right data table:

Left <- CountryCentroids %>% select(name, iso_a3) %>% tail()
Left

	name	iso_a3
236	Vietnam	VNM
237	W. Sahara	ESH
238	Wallis and Futuna Is.	WLF
239	Yemen	YEM
240	Zambia	ZMB
241	Zimbabwe	ZWE

Right <-  CountryData %>% select(country,life, infant) %>% tail()
Right

	country	life	infant
251	Wallis and Futuna	79.42	4.49
252	West Bank	75.69	13.49
253	Western Sahara	62.27	56.09
254	Yemen	64.83	50.41
255	Zambia	51.83	66.62
256	Zimbabwe	55.68	26.55

Make the following join:

name	iso_a3	life	infant
Yemen	YEM	64.83	50.41
Zambia	ZMB	51.83	66.62
Zimbabwe	ZWE	55.68	26.55

3. Chapter 2 Data in course ggplot2 (1) in Data Camp

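We will work with the mtcars data set. Note that the mtcars help page codes am as 0 = automatic and 1 = manual, so mtcars_m below (am == 0) actually holds the automatic cars and mtcars_a (am == 1) holds the manual cars.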

mtcars_m <- mtcars %>% 
  filter(am==0)

mtcars_a <- mtcars %>%
  filter(am==1) 
  
head(mtcars_a)

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
32.4	4	78.7	66	4.08	2.200	19.47	1	1	4	1
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
33.9	4	71.1	65	4.22	1.835	19.90	1	1	4	1

ggplot2 versus base package

Before ggplot there was plotting with the base R package. Many research papers still make their plots with the base package, so you should be familiar with it.

example 1

In the base package, you can make a scatter diagram of mpg versus wt for the cars in mtcars_m, colored by number of cylinders, with plot():

plot(mtcars_m$wt, mtcars_m$mpg, col = as.factor(mtcars_m$cyl))

That is fine, but suppose we wish to add an additional layer of points for the cars in mtcars_a:

plot(mtcars_m$wt, mtcars_m$mpg, col = as.factor(mtcars_m$cyl))
points(mtcars_a$wt, mtcars_a$mpg, col = "blue")

Here we see some major limitations of base package drawing:

The plot doesn’t get redrawn (the axes keep the range of the first layer, so new points can fall outside the plotting region).
The plot is drawn as an image: every layer is drawn on top of the image. In ggplot the plot is an object which we can change.
We need to add a legend ourselves, and by the time you make the legend manually you may have forgotten what the different colors mean.

In ggplot we would have:

mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(aes(col=as.factor(cyl))) + facet_wrap(~am)
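The point that a ggplot is an object which we can change can be made concrete by storing the plot in a variable and extending it later. This is a small sketch assuming the ggplot2 package is installed; the names p and p_faceted are made up for illustration.

```r
library(ggplot2)

# Build the plot once and keep it as an object; nothing is drawn yet
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(col = as.factor(cyl)))

# Extend the stored object later: add facets without retyping the plot
p_faceted <- p + facet_wrap(~am)

# Printing the object is what draws it, with axes and legend worked out
# from all the layers together
print(p_faceted)
```

Because p itself is unchanged, the same base plot can be reused with different facets or extra layers.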

task for you

The vector precip gives the yearly precipitation in different cities. Using the base package function hist, make a histogram of precip (hint: try hist(precip)). Next make the plot in ggplot using the geom_histogram() function. You will need to convert the vector precip to a data frame using as.data.frame(precip). This might be helpful: ggplot2.org

example 2

As another example, suppose we wish to make a linear model of how mpg varies with car weight.

# Use lm() to calculate a linear model and save it as carModel
carModel <- lm(mpg ~ wt, data = mtcars_m)
carModel

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars_m)
## 
## Coefficients:
## (Intercept)           wt  
##      31.416       -3.786

We see that the best-fitting line through the scatter plot for the cars in mtcars_m is \[ mpg = -3.786 \cdot wt + 31.416 \]
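The coefficients printed above can be pulled out of the model object with coef() and used to compute a fitted value by hand. The sketch below uses base R's subset() in place of the dplyr filter() call so that it runs without extra packages; the weight wt_new = 3 is a made-up value for illustration.

```r
# Same subset and model as in the notes (the am == 0 rows of mtcars)
mtcars_m <- subset(mtcars, am == 0)
carModel <- lm(mpg ~ wt, data = mtcars_m)

# Named vector holding the intercept and the slope on wt
b <- coef(carModel)

# Predicted mpg for a hypothetical car with wt = 3 (wt is in 1000 lbs), by hand ...
wt_new  <- 3
by_hand <- b[["(Intercept)"]] + b[["wt"]] * wt_new

# ... and with predict(), which should give the same number
by_predict <- predict(carModel, newdata = data.frame(wt = wt_new))

round(b, 3)
all.equal(by_hand, unname(by_predict))
```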

We can draw the regression line through our plot for the cars in mtcars_m. Base graphics will not build the legend for us, so we have to make it manually.

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
carModel <- lm(mpg ~ wt, data = mtcars_m)
abline(carModel, lty = 2)
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")

In fact you can draw the regression line for each cylinder type.

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==4)), lty = 2)
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==6)), lty = 2)
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==8)), lty = 2)
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")

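or more efficiently using lapply(), which you will learn about soon in the DataCamp course Intermediate R.

plot(mtcars_m$wt, mtcars_m$mpg, col = as.factor(mtcars_m$cyl))
# Loop over the distinct cylinder counts (not over every row) and draw one line each.
# match() gives the same color index (1, 2, 3) that as.factor() gives the points.
cyls <- sort(unique(mtcars_m$cyl))
invisible(lapply(cyls, function(x) {
  abline(lm(mpg ~ wt, mtcars_m, subset = (cyl == x)), col = match(x, cyls))
}))
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")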

In ggplot it is much easier. Note that we put the color aesthetic in the ggplot frame instead of geom_point since we want both the points and the regression lines to be categorized by color.

mtcars_m %>% ggplot(aes(x = wt, y = mpg, col = as.factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Data verbs `gather()`, `separate()`, and `spread()` to wrangle your data

Let’s examine the iris data table:

head(iris)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa

example 1

Suppose you want to make the following plot:

The data table iris isn’t glyph ready. We need the variables Species, Part, Measure and Value:

Species	Part	Measure	Value
setosa	Sepal	Length	5.1
setosa	Sepal	Length	4.9
setosa	Sepal	Length	4.7
setosa	Sepal	Length	4.6
setosa	Sepal	Length	5.0
setosa	Sepal	Length	5.4

To achieve this, we first create a key column with the gather() command.

iris %>%
  gather(key, Value, -Species) %>%
  head()

Species	key	Value
setosa	Sepal.Length	5.1
setosa	Sepal.Length	4.9
setosa	Sepal.Length	4.7
setosa	Sepal.Length	4.6
setosa	Sepal.Length	5.0
setosa	Sepal.Length	5.4

This key column contains values like Sepal.Width that we want to separate into Sepal and Width. The first pieces will go into a Part variable and the second pieces into a Measure variable.

Next we separate the key column into a Part and a Measure column with separate():

iris.tidy <- iris %>%
  gather(key, Value, -Species) %>%
  separate(key, c("Part", "Measure"),  "\\.") 
iris.tidy %>% head()

Species	Part	Measure	Value
setosa	Sepal	Length	5.1
setosa	Sepal	Length	4.9
setosa	Sepal	Length	4.7
setosa	Sepal	Length	4.6
setosa	Sepal	Length	5.0
setosa	Sepal	Length	5.4
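The separator passed to separate() above is "\\." rather than ".". separate() treats a character separator as a regular expression, in which an unescaped dot matches any character. A small base R sketch with strsplit(), which also takes a regular expression by default, shows the difference; strsplit() is used here only as a stand-in for illustration.

```r
key <- "Sepal.Width"

# Escaped dot: split on the literal "." only
good <- strsplit(key, "\\.")[[1]]

# Unescaped dot: every character counts as a separator, so only empty pieces remain
bad <- strsplit(key, ".")[[1]]

# fixed = TRUE is another way to ask for a literal dot
also_good <- strsplit(key, ".", fixed = TRUE)[[1]]

good             # "Sepal" "Width"
all(bad == "")   # TRUE
```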

example 2

Suppose we modify iris so that it has a new variable Flower that gives each flower a unique id:

iris <- iris %>% mutate(Flower = 1:nrow(iris))
head(iris)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species	Flower
5.1	3.5	1.4	0.2	setosa	1
4.9	3.0	1.4	0.2	setosa	2
4.7	3.2	1.3	0.2	setosa	3
4.6	3.1	1.5	0.2	setosa	4
5.0	3.6	1.4	0.2	setosa	5
5.4	3.9	1.7	0.4	setosa	6

Suppose you want to make the following plot:

iris isn’t glyph ready. We need the table:

Species	Flower	Part	Length	Width
setosa	1	Petal	1.4	0.2
setosa	1	Sepal	5.1	3.5
setosa	2	Petal	1.4	0.2
setosa	2	Sepal	4.9	3.0
setosa	3	Petal	1.3	0.2
setosa	3	Sepal	4.7	3.2

To achieve this, we first create a key column with the gather() command.

iris %>%
  gather(key, value, -Species, -Flower) %>%
  head()

Species	Flower	key	value
setosa	1	Sepal.Length	5.1
setosa	2	Sepal.Length	4.9
setosa	3	Sepal.Length	4.7
setosa	4	Sepal.Length	4.6
setosa	5	Sepal.Length	5.0
setosa	6	Sepal.Length	5.4

Next we separate the key column into a Part and Measure column with separate()

iris %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c("Part", "Measure"), "\\.") %>%
  head()

Species	Flower	Part	Measure	value
setosa	1	Sepal	Length	5.1
setosa	2	Sepal	Length	4.9
setosa	3	Sepal	Length	4.7
setosa	4	Sepal	Length	4.6
setosa	5	Sepal	Length	5.0
setosa	6	Sepal	Length	5.4

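The last step is to use spread() to distribute the entries of the value column into two new columns, Length and Width, named after the entries of the Measure column. Each flower part, which took up two rows (one per measure), now takes up a single row.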

iris.wide <- iris %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c("Part", "Measure"), "\\.") %>%
  spread(Measure, value) 
iris.wide %>% head()

Species	Flower	Part	Length	Width
setosa	1	Petal	1.4	0.2
setosa	1	Sepal	5.1	3.5
setosa	2	Petal	1.4	0.2
setosa	2	Sepal	4.9	3.0
setosa	3	Petal	1.3	0.2
setosa	3	Sepal	4.7	3.2

Aesthetics versus fixed attributes

DataCamp uses the word aesthetic to mean a mapping from a visual property to a variable. This is actually a scale according to our book’s definition, but never mind that. In the real world people abuse language and use the word aesthetic when they really mean a scale.

An attribute is different from an aesthetic (i.e. a scale). An attribute sets color equal to a constant, for example “red” instead of a variable.

example:
x, y and col are aesthetics

mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(aes(col=as.factor(cyl)))

x and y are aesthetics; col is an attribute

mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(col="red")

Note: attributes don’t have a legend.

Lec10

stat 133

February 10 2016

