Source file ⇒ lec10.Rmd
Here is a printout of mtcars rendered using knitr (it looks like the printout at the console):
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
To produce more publication-quality tables, use the new package printr. It isn’t available yet on the CRAN repository, so you have to install it by running the following chunk (you need to run the chunk itself, not just click Knit HTML).
install.packages(
  'printr',
  type = 'source',
  repos = c('http://yihui.name/xran', 'http://cran.rstudio.com')
)
Once you install it, you will see printr in your packages list in RStudio. You only have to do this once.
Next you need to remember to put library(printr) in a chunk in your rmarkdown file.
library(printr)
Here is a print out of mtcars now:
head(mtcars)
 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Each of the data verbs described until now takes one data table as an input. Join is different. It’s well named: it brings together two data tables. Once you can join two data tables, you can join any number of data tables by repeated joins. Joining is at the heart of combining data from multiple sources.
It’s conventional to refer to the two tables as the “left” and “right” tables.
To illustrate the differences and similarities between the different kinds of join, suppose you have these two tables:
Left table (LL): the variables are clinicName and postalCode. Each clinic does multiple procedures.
clinicName | postalCode |
---|---|
A | 22120 |
B | 35752 |
C | 56718 |
D | 35752 |
E | 67756 |
F | 69129 |
G | 73455 |
H | 73455 |
I | 76292 |
Right table (RR): the variables are postalCode, over65, etc.
over65 | postalCode |
---|---|
0.46 | 35752 |
0.72 | 22120 |
0.93 | 22120 |
0.26 | 92332 |
0.46 | 84739 |
0.94 | 67756 |
Variables that appear in both the left and right tables are called “overlap variables.” The overlap variables determine which cases will go into the output table. In this example, there is just one overlap variable: postalCode.
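The lecture does not show how LL and RR were created, so the following is only a sketch that retypes the two printed tables with data.frame() and then finds the overlap variables as the column names the tables share. The object name overlap is introduced here for illustration.

```r
# Left table: one case per clinic (values retyped from the printed table)
LL <- data.frame(
  clinicName = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
  postalCode = c(22120, 35752, 56718, 35752, 67756, 69129, 73455, 73455, 76292),
  stringsAsFactors = FALSE
)

# Right table: one case per over65 value (values retyped from the printed table)
RR <- data.frame(
  over65     = c(0.46, 0.72, 0.93, 0.26, 0.46, 0.94),
  postalCode = c(35752, 22120, 22120, 92332, 84739, 67756)
)

# Overlap variables are the column names that appear in both tables
overlap <- intersect(names(LL), names(RR))
overlap
```

When no by argument is given, the dplyr join functions use these shared column names as the join keys.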
The diagram below shows the cases in the left and right tables. The lines show the matches between left and right. The cases connected by a match are the overlap cases; there are five of them in the diagram. Cases without a match also appear in both the left and right tables.
Note that there are three different kinds of cases here:
There are different types of join. The type of join specifies whether you want to include in the output the matching cases, the matching pairs, or the non-matching cases.
An inner join gives the matching pairs. Note that clinic A, which had two matches in the right table, appears twice, once for each matching pair in which clinic A is involved.
LL %>% inner_join(RR, by=c("postalCode"="postalCode"))
clinicName | postalCode | over65 |
---|---|---|
A | 22120 | 0.72 |
A | 22120 | 0.93 |
B | 35752 | 0.46 |
D | 35752 | 0.46 |
E | 67756 | 0.94 |
An outer join can include cases where there is no match. You might want to include the unmatched cases from the left table, from the right table, or from both tables.
LL %>% left_join( RR)
clinicName | postalCode | over65 |
---|---|---|
A | 22120 | 0.72 |
A | 22120 | 0.93 |
B | 35752 | 0.46 |
C | 56718 | NA |
D | 35752 | 0.46 |
E | 67756 | 0.94 |
F | 69129 | NA |
G | 73455 | NA |
H | 73455 | NA |
I | 76292 | NA |
LL %>% right_join(RR)
clinicName | postalCode | over65 |
---|---|---|
B | 35752 | 0.46 |
D | 35752 | 0.46 |
A | 22120 | 0.72 |
A | 22120 | 0.93 |
NA | 92332 | 0.26 |
NA | 84739 | 0.46 |
E | 67756 | 0.94 |
LL %>% full_join(RR)
clinicName | postalCode | over65 |
---|---|---|
A | 22120 | 0.72 |
A | 22120 | 0.93 |
B | 35752 | 0.46 |
C | 56718 | NA |
D | 35752 | 0.46 |
E | 67756 | 0.94 |
F | 69129 | NA |
G | 73455 | NA |
H | 73455 | NA |
I | 76292 | NA |
NA | 92332 | 0.26 |
NA | 84739 | 0.46 |
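The “matching cases” and “non-matching cases” mentioned earlier correspond, in dplyr, to the filtering joins semi_join() and anti_join(). The lecture's own examples do not use them, so treat this as a supplementary sketch; LL and RR are retyped from the printed tables, and the names matched and unmatched are made up for the example.

```r
library(dplyr)

LL <- data.frame(
  clinicName = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
  postalCode = c(22120, 35752, 56718, 35752, 67756, 69129, 73455, 73455, 76292),
  stringsAsFactors = FALSE
)
RR <- data.frame(
  over65     = c(0.46, 0.72, 0.93, 0.26, 0.46, 0.94),
  postalCode = c(35752, 22120, 22120, 92332, 84739, 67756)
)

# Matching cases: left-table rows that have at least one match in RR.
# Each clinic appears once, and no columns from RR are added.
matched <- LL %>% semi_join(RR, by = "postalCode")

# Non-matching cases: left-table rows with no match in RR
unmatched <- LL %>% anti_join(RR, by = "postalCode")

matched
unmatched
```

Unlike inner_join(), semi_join() lists clinic A only once, because it filters the left table rather than pairing rows.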
Here is a left and right data table:
Left <- CountryCentroids %>% select(name, iso_a3) %>% tail()
Left
 | name | iso_a3 |
---|---|---|
236 | Vietnam | VNM |
237 | W. Sahara | ESH |
238 | Wallis and Futuna Is. | WLF |
239 | Yemen | YEM |
240 | Zambia | ZMB |
241 | Zimbabwe | ZWE |
Right <- CountryData %>% select(country,life, infant) %>% tail()
Right
 | country | life | infant |
---|---|---|---|
251 | Wallis and Futuna | 79.42 | 4.49 |
252 | West Bank | 75.69 | 13.49 |
253 | Western Sahara | 62.27 | 56.09 |
254 | Yemen | 64.83 | 50.41 |
255 | Zambia | 51.83 | 66.62 |
256 | Zimbabwe | 55.68 | 26.55 |
Make the following join:
name | iso_a3 | life | infant |
---|---|---|---|
Yemen | YEM | 64.83 | 50.41 |
Zambia | ZMB | 51.83 | 66.62 |
Zimbabwe | ZWE | 55.68 | 26.55 |
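One way to produce a table of this shape is an inner join in which the key columns are named differently in the two tables. The sketch below uses stand-in data frames retyped from the printouts (the real CountryCentroids and CountryData tables are not loaded here), and the name Joined is introduced for illustration.

```r
library(dplyr)

# Stand-in copies of the Left and Right tables printed above
Left <- data.frame(
  name   = c("Vietnam", "W. Sahara", "Wallis and Futuna Is.",
             "Yemen", "Zambia", "Zimbabwe"),
  iso_a3 = c("VNM", "ESH", "WLF", "YEM", "ZMB", "ZWE"),
  stringsAsFactors = FALSE
)
Right <- data.frame(
  country = c("Wallis and Futuna", "West Bank", "Western Sahara",
              "Yemen", "Zambia", "Zimbabwe"),
  life    = c(79.42, 75.69, 62.27, 64.83, 51.83, 55.68),
  infant  = c(4.49, 13.49, 56.09, 50.41, 66.62, 26.55),
  stringsAsFactors = FALSE
)

# The key is called name on the left and country on the right,
# so the by argument has to map one onto the other
Joined <- Left %>% inner_join(Right, by = c("name" = "country"))
Joined
```

Only exact string matches survive, which is why "W. Sahara" and "Western Sahara" do not pair up.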
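We will work with the mtcars data set, split on the transmission variable am. Note that ?mtcars codes am as 0 = automatic and 1 = manual; the code below puts the am == 0 rows in mtcars_m and the am == 1 rows in mtcars_a.
mtcars_m <- mtcars %>%
  filter(am == 0)
mtcars_a <- mtcars %>%
  filter(am == 1)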
head(mtcars_a)
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|
21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
Before ggplot there was plotting with the base R package. Many research papers still make their plots with the base package, so you should be familiar with it.
In the base package, to make a scatter diagram of mpg against wt for the cars in mtcars_m:
plot( mtcars_m$wt,mtcars_m$mpg, col=as.factor(mtcars_m$cyl))
That is fine, but suppose we wish to add an additional layer of points for the cars in mtcars_a:
plot(mtcars_m$wt,mtcars_m$mpg, col=as.factor(mtcars_m$cyl))
points( mtcars_a$wt,mtcars_a$mpg, col="blue")
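Here we see a major limitation of base package drawing: the axis limits are fixed by the first plot() call, so any points in mtcars_a that fall outside the range of mtcars_m are silently left off the plot.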
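In base graphics the usual workaround is to set the axis limits yourself before adding layers. The sketch below is one way to do that, not part of the original lecture code; x_range and y_range are names introduced for the example.

```r
library(dplyr)

mtcars_m <- mtcars %>% filter(am == 0)
mtcars_a <- mtcars %>% filter(am == 1)

# Axis limits wide enough for both subsets
x_range <- range(mtcars$wt)
y_range <- range(mtcars$mpg)

# The first call sets up the axes, so pass the limits here
plot(mtcars_m$wt, mtcars_m$mpg, col = as.factor(mtcars_m$cyl),
     xlim = x_range, ylim = y_range, xlab = "wt", ylab = "mpg")

# The second layer now fits inside the plotting region
points(mtcars_a$wt, mtcars_a$mpg, col = "blue")
```

ggplot does this bookkeeping automatically, because the scales are computed from all layers before anything is drawn.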
In ggplot we would have:
mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(aes(col=as.factor(cyl))) + facet_wrap(~am)
The vector precip gives the yearly precipitation in different cities. Using the base package function hist, make a histogram of precip (hint: try hist(precip)). Next make the plot in ggplot using the geom_histogram() function. You will need to convert the vector precip to a data frame using as.data.frame(precip). This might be helpful: ggplot2.org
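A possible starting point for this exercise is sketched below. The bin width of 5 is an arbitrary choice made for the example, and precip_df and p are hypothetical names.

```r
library(ggplot2)

# Base graphics: a single call on the vector
hist(precip)

# ggplot needs a data frame; as.data.frame() produces one column,
# named precip, with the city names as row names
precip_df <- as.data.frame(precip)

p <- ggplot(precip_df, aes(x = precip)) +
  geom_histogram(binwidth = 5)  # binwidth chosen arbitrarily
p
```

The two histograms may not look identical, since hist() and geom_histogram() choose their bins differently unless you set them explicitly.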
As another example, suppose we wish to make a linear model of how mpg varies with car weight.
# Use lm() to calculate a linear model and save it as carModel
carModel <- lm(mpg ~ wt, data = mtcars_m)
carModel
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars_m)
##
## Coefficients:
## (Intercept)           wt
##      31.416       -3.786
We see that the best-fitting line through the scatter plot for the cars in mtcars_m is \[ mpg = -3.786 \times wt + 31.416 \]
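Rather than retyping the coefficients by hand, they can be read off the fitted model with coef(). This is a small sketch that rebuilds mtcars_m and carModel as defined earlier; the object b and the example weight wt = 3 are made up for illustration.

```r
library(dplyr)

mtcars_m <- mtcars %>% filter(am == 0)
carModel <- lm(mpg ~ wt, data = mtcars_m)

# Named vector holding the intercept and the slope on wt
b <- coef(carModel)
round(b, 3)

# Use the fitted line to predict mpg for a hypothetical car with wt = 3
predict(carModel, newdata = data.frame(wt = 3))
```

Pulling the numbers from the model object keeps the written equation in step with the printed output if the data or the filter changes.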
We can draw the regression line through our plot for the cars in mtcars_m. The legend is hard to make manually.
plot(mtcars_m$wt,mtcars_m$mpg, col=as.factor(mtcars_m$cyl))
carModel <- lm(mpg ~ wt, data = mtcars_m)
abline(carModel, lty = 2)
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")
In fact you can draw the regression line for each cylinder type.
plot(mtcars_m$wt,mtcars_m$mpg, col=as.factor(mtcars_m$cyl))
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==4)), lty = 2)
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==6)), lty = 2)
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==8)), lty = 2)
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")
or more efficiently using lapply(), which you will learn about soon in the DataCamp course Intermediate R.
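plot(mtcars_m$wt, mtcars_m$mpg, col = as.factor(mtcars_m$cyl))
# Loop over the distinct cylinder counts (not every row); index i matches the factor colours 1:3
cyls <- sort(unique(mtcars_m$cyl))
invisible(lapply(seq_along(cyls), function(i) {
  abline(lm(mpg ~ wt, mtcars_m, subset = (cyl == cyls[i])), col = i)
}))
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")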
In ggplot it is much easier. Note that we put the color aesthetic in the ggplot frame instead of geom_point since we want both the points and the regression lines to be categorized by color.
mtcars_m %>% ggplot(aes(x = wt, y = mpg, col = as.factor(cyl))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Using gather(), separate(), and spread() to wrangle your data
Let’s examine the iris data table:
head(iris)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
Suppose you want to make the following plot:
The data table iris isn’t glyph ready. We need the variables Species, Part, Measure, and Value:
Species | Part | Measure | Value |
---|---|---|---|
setosa | Sepal | Length | 5.1 |
setosa | Sepal | Length | 4.9 |
setosa | Sepal | Length | 4.7 |
setosa | Sepal | Length | 4.6 |
setosa | Sepal | Length | 5.0 |
setosa | Sepal | Length | 5.4 |
To achieve this, we first create a key column with the gather() command.
iris %>%
gather(key, Value, -Species) %>%
head()
Species | key | Value |
---|---|---|
setosa | Sepal.Length | 5.1 |
setosa | Sepal.Length | 4.9 |
setosa | Sepal.Length | 4.7 |
setosa | Sepal.Length | 4.6 |
setosa | Sepal.Length | 5.0 |
setosa | Sepal.Length | 5.4 |
This key column contains values like Sepal.Width that we want to separate into Sepal and Width. These pieces will go into the Part and Measure variables.
Next we separate the key column into a Part and Measure column with separate()
iris.tidy <- iris %>%
gather(key, Value, -Species) %>%
separate(key, c("Part", "Measure"), "\\.")
iris.tidy %>% head()
Species | Part | Measure | Value |
---|---|---|---|
setosa | Sepal | Length | 5.1 |
setosa | Sepal | Length | 4.9 |
setosa | Sepal | Length | 4.7 |
setosa | Sepal | Length | 4.6 |
setosa | Sepal | Length | 5.0 |
setosa | Sepal | Length | 5.4 |
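In current versions of tidyr, gather() is superseded by pivot_longer(), which can also do the splitting that separate() does here. The sketch below is an alternative to the lecture's code rather than a replacement for it; it uses datasets::iris so that it starts from the unmodified table, and iris_long is a name introduced for the example.

```r
library(dplyr)
library(tidyr)

iris_long <- datasets::iris %>%
  pivot_longer(
    cols      = -Species,              # reshape every column except Species
    names_to  = c("Part", "Measure"),  # old column names are split into these two
    names_sep = "\\.",                 # split at the literal dot
    values_to = "Value"
  )

head(iris_long)
```

The rows come out in a different order than with gather(), because pivot_longer() works through the table one original row at a time, but the set of cases is the same.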
Suppose we modify iris so that it has a new variable, Flower, that gives each case a unique id:
iris<- iris %>% mutate(Flower=1:nrow(iris))
head(iris)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Flower |
---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 |
4.9 | 3.0 | 1.4 | 0.2 | setosa | 2 |
4.7 | 3.2 | 1.3 | 0.2 | setosa | 3 |
4.6 | 3.1 | 1.5 | 0.2 | setosa | 4 |
5.0 | 3.6 | 1.4 | 0.2 | setosa | 5 |
5.4 | 3.9 | 1.7 | 0.4 | setosa | 6 |
Suppose you want to make the following plot:
iris isn’t glyph ready. We need the table:
Species | Flower | Part | Length | Width |
---|---|---|---|---|
setosa | 1 | Petal | 1.4 | 0.2 |
setosa | 1 | Sepal | 5.1 | 3.5 |
setosa | 2 | Petal | 1.4 | 0.2 |
setosa | 2 | Sepal | 4.9 | 3.0 |
setosa | 3 | Petal | 1.3 | 0.2 |
setosa | 3 | Sepal | 4.7 | 3.2 |
To achieve this, we first create a key column with the gather() command.
iris %>%
gather(key, value, -Species, -Flower) %>%
head()
Species | Flower | key | value |
---|---|---|---|
setosa | 1 | Sepal.Length | 5.1 |
setosa | 2 | Sepal.Length | 4.9 |
setosa | 3 | Sepal.Length | 4.7 |
setosa | 4 | Sepal.Length | 4.6 |
setosa | 5 | Sepal.Length | 5.0 |
setosa | 6 | Sepal.Length | 5.4 |
Next we separate the key column into a Part and Measure column with separate()
iris %>%
gather(key, value, -Species, -Flower) %>%
separate(key, c("Part", "Measure"), "\\.") %>%
head()
Species | Flower | Part | Measure | value |
---|---|---|---|---|
setosa | 1 | Sepal | Length | 5.1 |
setosa | 2 | Sepal | Length | 4.9 |
setosa | 3 | Sepal | Length | 4.7 |
setosa | 4 | Sepal | Length | 4.6 |
setosa | 5 | Sepal | Length | 5.0 |
setosa | 6 | Sepal | Length | 5.4 |
The last step is to use spread() to distribute the value column across two new columns, named by the entries of the Measure column (Length and Width). In effect, the Length case and the Width case for each flower part are combined into a single case.
iris.wide <- iris %>%
gather(key, value, -Species, -Flower) %>%
separate(key, c("Part", "Measure"), "\\.") %>%
spread(Measure, value)
iris.wide %>% head()
Species | Flower | Part | Length | Width |
---|---|---|---|---|
setosa | 1 | Petal | 1.4 | 0.2 |
setosa | 1 | Sepal | 5.1 | 3.5 |
setosa | 2 | Petal | 1.4 | 0.2 |
setosa | 2 | Sepal | 4.9 | 3.0 |
setosa | 3 | Petal | 1.3 | 0.2 |
setosa | 3 | Sepal | 4.7 | 3.2 |
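For comparison, here is a sketch of the same reshaping with the newer tidyr verbs pivot_longer() and pivot_wider(), which supersede gather() and spread(). It is an alternative formulation, not the lecture's code; it starts from datasets::iris and adds the Flower id itself, and iris_wide2 is a hypothetical name.

```r
library(dplyr)
library(tidyr)

iris_wide2 <- datasets::iris %>%
  mutate(Flower = seq_len(n())) %>%     # unique id per flower
  pivot_longer(
    cols      = -c(Species, Flower),    # keep the id columns fixed
    names_to  = c("Part", "Measure"),
    names_sep = "\\.",
    values_to = "value"
  ) %>%
  # One column per Measure (Length, Width), filled from value
  pivot_wider(names_from = Measure, values_from = value)

head(iris_wide2)
```

The Flower id matters here: without it, the Length and Width values for different flowers of the same species and part could not be paired up unambiguously.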
DataCamp uses the word aesthetic to mean a mapping between a variable and a visual property. This is actually a scale according to our book’s definition, but never mind that. In the real world people abuse language and use the word aesthetic when they really mean a scale.
An attribute is different from an aesthetic (i.e. a scale). An attribute sets color equal to a constant, for example “red” instead of a variable.
example:
x,y, col are aesthetics
mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(aes(col=as.factor(cyl)))
x,y are aesthetics, col is an attribute
mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(col="red")
Note: attributes don’t have a legend.
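A common slip is to put the constant inside aes(). The sketch below contrasts the three cases; the object names p_mapped, p_set, and p_pitfall are made up for the example.

```r
library(ggplot2)

# Aesthetic: colour is mapped to a variable, so ggplot draws a legend
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(col = as.factor(cyl)))

# Attribute: colour is set to a constant outside aes(); no legend
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(col = "red")

# Pitfall: a constant inside aes() is treated as data, so the points get
# a default palette colour and a legend with the single entry "red"
p_pitfall <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(col = "red"))

p_mapped
p_set
p_pitfall
```

If a plot shows an unexpected one-entry legend, checking whether a constant ended up inside aes() is a good first step.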