Source file ⇒ 2017-lec8.Rmd
last compiled on Fri Feb 10 10:49:32 2017
Aesthetics are properties of the graph that we map to a variable (example: col=sex in the BabyNames data set).
Attributes are properties of the graph that we set equal to a fixed value (example: col="red").
mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(aes(col=as.factor(cyl)))
mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(col="red")
Note: attributes don't have a legend since they take only a fixed value.
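A common slip is to put a fixed value inside aes(). The sketch below is only an illustration (the object names p_mapped, p_set and p_slip are made up here); it uses layer_data() from ggplot2 to look at the colours that actually get drawn in each case.

```r
library(ggplot2)

# Aesthetic: colour is mapped to a variable, so ggplot2 builds a scale and a legend.
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(col = as.factor(cyl)))

# Attribute: colour is set to a fixed value outside aes(); no scale, no legend.
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(col = "red")

# Slip: a constant inside aes() is treated as a one-level variable,
# so the points get a default palette colour and a legend appears.
p_slip <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(col = "red"))

unique(layer_data(p_mapped)$colour)  # one colour per cylinder group
unique(layer_data(p_set)$colour)     # the fixed value "red"
unique(layer_data(p_slip)$colour)    # a palette colour, not "red"
```

If the points should all be one colour, keep the value outside aes() as in p_set.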
The theming system in ggplot2 enables a user to control non-data elements of a ggplot object. For example you can:
These influence the rendering of the graphic but are independent of the data being plotted. These are called theme elements, i.e., aspects of a ggplot object that are capable of modifying its appearance but are neither directly related to data nor aesthetics associated with data.
To illustrate, let's start with this plot:
p <- ggplot2::mpg %>% ggplot( aes(x = cty, y = hwy, color = factor(cyl))) +
geom_jitter() +
labs(
x = "City mileage/gallon",
y = "Highway mileage/gallon",
color = "Cylinders"
)
p
ggplot2 makes attractive plots, but sometimes you want to customize them to meet your needs.
The themes in ggplot2 are composed of the following:
theme elements, which refer to individual attributes of a graphic that are independent of the data, such as the appearance of axis text (ex: axis.text);
theme element functions, which enable you to modify the settings of certain theme elements (ex: axis.text = element_text(size = 14));
theme functions, which define the settings of a collection of theme elements for the purpose of creating a specific style of graphics production.
p + theme(axis.text = element_text(colour = "blue", size = 15, face = "italic"), axis.text.y = element_text(size = rel(0.7), angle = 90))
library(ggthemes) #in console: install.packages("ggthemes") to install
p + theme_igray()
Note that the colors of the points depend on the data itself, so they are not controlled by the theme. To change them, use one of the scale_colour_*() functions (see the ggplot2 help).
p + scale_colour_brewer(palette = "Dark2")
Most theme elements have several properties that can be modified through a corresponding element function.
element_text()
element_line()
element_rect()
element_blank()
The element_xx() functions modify theme elements with attributes (e.g., color, text size). Some theme elements are defined in terms of a unit of measurement, while others, such as legend.position, control the positioning of a theme element.
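For example, suppose you wish to increase the size of the axis text, give the legend a white background and move it inside the plot region, remove the minor grid lines, and change the panel background to navy.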
We can make those changes as follows:
p +
theme(
axis.text = element_text(size = 14),
legend.background = element_rect(fill = "white"),
legend.position = c(0.14, 0.70),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "navy")
)
Here is a detailed list of the theme element functions and their arguments:
element_text()
Purpose: To control the drawing of labels and headings.
The table below lists the arguments of element_text() and their corresponding default values.
Argument | Description | Default |
---|---|---|
family | font family | “” |
face | font face | plain |
colour | font color | black |
size | font size | 10 |
hjust | horizontal justification | 0.5 |
vjust | vertical justification | 0.5 |
angle | text angle | 0 |
lineheight | line height | 1.1 |
element_line()
Purpose: To draw lines and segments such as graphics region boundaries, axis tick marks and grid lines.
Argument | Description | Default |
---|---|---|
colour | line color | black |
size | line thickness | 0.5 |
linetype | type of line | 1 |
element_rect()
Purpose: To draw rectangles. It is mostly used for background elements and legend keys.
Argument | Description | Default |
---|---|---|
fill | fill color | none |
colour | border color | black |
size | thickness of border line | 0.5 |
linetype | type of border line | 1 |
element_blank()
Purpose: To draw nothing.
Arguments: none.
The element_blank() function can be applied to any theme element controlled by a theme element function.
Examine the plot below and see how we made changes to the theme elements:
# Use theme() to modify theme elements
p + labs(title = "Highway vs. city mileage per gallon") +
theme(
axis.text = element_text(size = 20),
plot.title = element_text(size = 20,color = "red"),
legend.key = element_rect(fill = "black"),
legend.background = element_rect(fill = "white"),
legend.position = "right",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "grey40")
)
Do the first two exercises:
http://gandalf.berkeley.edu:3838/alucas/Lecture-08-collection/
The first three elements in ggplot-themes are text, line and rect, which, not coincidentally, are the same as the names of the basic theme element functions element_text(), element_line(), element_rect(). Other theme elements inherit the values of these theme elements. For example, the theme elements axis.text, legend.text, strip.text and axis.title all inherit from text, while axis.text.x and axis.text.y further inherit from axis.text. This means the values of the components of the theme element text are passed on to axis.text as well as other elements that inherit from text or its children. You can override the default values of one or more theme elements by calling theme() and modifying the desired properties of theme elements therein.
Here is a useful figure: inheritance:
Here is an example:
set.seed(123)
df <- diamonds[sample(1:nrow(diamonds), size = 1000),]
df %>% ggplot(aes(carat, price)) +
geom_point() + labs(title="Diamonds") +
theme(
text =element_text(size=30,colour="red", face="bold.italic"),
axis.text = element_text(colour="purple"),
axis.title=element_text(size=20,colour="blue"),
axis.title.y=element_text(size=10, colour="green"))
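To check what a given element ends up with after inheritance, ggplot2 exports calc_element(). The sketch below rebuilds part of the theme from the example above on top of theme_grey() (the name th is just for illustration) and resolves three elements.

```r
library(ggplot2)

# Part of the theme from the example above, added to theme_grey().
th <- theme_grey() +
  theme(
    text = element_text(size = 30, colour = "red", face = "bold.italic"),
    axis.title = element_text(size = 20, colour = "blue"),
    axis.title.y = element_text(size = 10, colour = "green")
  )

# calc_element() resolves an element by walking up its inheritance chain.
calc_element("axis.title.x", th)$colour  # no colour of its own: taken from axis.title
calc_element("axis.title.y", th)$colour  # set directly, so it overrides axis.title
calc_element("plot.title", th)$face      # not set lower down: falls through to text
```

The most specific setting wins; anything left unset is filled in from the parent element.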
The purpose of a theme function is to either specify default settings for each theme element or modify the settings of an existing theme function to produce a new theme. For example, the foundational theme_grey function specifies default settings of each theme element, whereas theme_bw is a modification of theme_grey.
There are nice theme functions made by other users:
library(ggthemes) #need to install ggthemes
mpg %>% ggplot( aes(x = cty, y = hwy, color = factor(cyl))) +
geom_jitter() +
labs(
x = "City mileage/gallon",
y = "Highway mileage/gallon",
color = "Cylinders"
) + theme_igray()
Although the default themes in ggplot2 produce attractive graphics, they may not necessarily correspond with user requirements. If you find yourself modifying the same theme elements repeatedly with theme() or need to adapt a set of theme elements to conform to the requirements of a journal or other publication, then you should consider writing your own theme function.
Say my current default theme is theme_grey(), and I have written my own theme, called theme_pink, that I would like to use in all of my plots.
theme_pink <- theme(panel.background = element_blank(),
legend.key = element_blank(),
legend.background = element_blank(),
strip.background = element_blank(),
plot.background = element_rect(fill = "red", color = "black", size = 3),
panel.grid = element_blank(),
axis.line = element_line(color = "black"),
axis.ticks = element_line(color = "black"),
strip.text = element_text(size = 16, color = "red"),
axis.title.y = element_text(color = "red", hjust = 0, face = "italic"),
axis.title.x = element_text(color = "red", hjust = 0, face = "italic"),
axis.text = element_text(color = "black"),
legend.position = "none")
mpg %>% ggplot( aes(x = cty, y = hwy, color = factor(cyl))) +
geom_jitter() +
labs(
x = "City mileage/gallon",
y = "Highway mileage/gallon",
color = "Cylinders"
) + theme_pink
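Note that theme_pink above is a theme object rather than a function. A minimal sketch of wrapping similar settings in a function is shown below; the name theme_pink_fn and the choice to start from theme_grey() are assumptions made for illustration, not part of the original example.

```r
library(ggplot2)

# Hypothetical wrapper: starts from theme_grey() and overrides a few elements.
theme_pink_fn <- function(base_size = 11) {
  theme_grey(base_size = base_size) +
    theme(
      panel.background = element_blank(),
      panel.grid = element_blank(),
      axis.line = element_line(colour = "black"),
      axis.title = element_text(colour = "red", hjust = 0, face = "italic"),
      legend.position = "none"
    )
}

# Used like any built-in theme function, with an adjustable base font size.
p_fn <- ggplot(mpg, aes(x = cty, y = hwy, colour = factor(cyl))) +
  geom_jitter() +
  theme_pink_fn(base_size = 14)
p_fn
```

Writing the theme as a function lets you pass arguments such as base_size, which a fixed theme object cannot do.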
You can make this your default theme using theme_update(). The value returned by this function is your old theme, which you can save: old <- theme_update(...)
old <- theme_update(panel.background = element_blank(),
legend.key = element_blank(),
legend.background = element_blank(),
strip.background = element_blank(),
plot.background = element_rect(fill = "red", color = "black", size = 3),
panel.grid = element_blank(),
axis.line = element_line(color = "black"),
axis.ticks = element_line(color = "black"),
strip.text = element_text(size = 16, color = "red"),
axis.title.y = element_text(color = "red", hjust = 0, face = "italic"),
axis.title.x = element_text(color = "red", hjust = 0, face = "italic"),
axis.text = element_text(color = "black"),
legend.position = "none")
Now you don't need to add theme_pink to your ggplot command.
set.seed(123)
df %>% ggplot(aes(carat, price)) +
geom_point()
You can restore your old theme:
theme_set(old)
df %>% ggplot(aes(carat, price)) +
geom_point()
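The save-and-restore pattern can be checked without drawing anything. The sketch below changes a single setting; the names saved, during and after are illustrative only.

```r
library(ggplot2)

# theme_update() changes the active theme and returns the previous one.
saved <- theme_update(legend.position = "none")
during <- theme_get()$legend.position   # the updated setting is now the default

# theme_set() puts the saved theme back.
theme_set(saved)
after <- theme_get()$legend.position    # back to the previous value

c(during = during, after = after)
```

Saving the return value is what makes the change reversible; without it you would have to call theme_set() with an explicit theme such as theme_grey().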
Here is a list of colors
Colors can be specified as a hexadecimal RGB triplet, such as "#0066CC". The first two digits are the level of red, the next two green, and the last two blue. The value for each ranges from 00 to FF in hexadecimal (base-16) notation, which is equivalent to 0 to 255 in base-10. For example, in the table below, "#FFFFFF" is white and "#990000" is a deep red.
hexadecimal:
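Base R can translate between the hexadecimal and base-10 forms, which is a quick way to see what a given triplet means. The values below are the ones mentioned in the text.

```r
# Split a hex triplet into its red, green and blue levels (0 to 255).
col2rgb("#0066CC")
col2rgb(c("#FFFFFF", "#990000"))

# Go the other way: build the hex string from base-10 levels.
rgb(0, 102, 204, maxColorValue = 255)

# Convert a single pair of hex digits by hand.
strtoi("CC", base = 16L)
```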
Try some different colors in the code below (try both hexadecimal codes and color names).
Example:
set.seed(955)
# Make some noisily increasing data
dat <- data.frame(xvar = 1:20 + rnorm(20,sd=3),
yvar = 1:20 + rnorm(20,sd=3))
ggplot(dat, aes(x=xvar, y=yvar)) +
geom_point(shape=1) + # Use hollow circles
geom_smooth(method=lm, fill="#013A59") + # Add linear regression line
theme_igray() # theme from the ggthemes package
Do the last two exercises:
http://gandalf.berkeley.edu:3838/alucas/Lecture-08-collection/
Below we will discuss the most common parametric and nonparametric regression models (simple linear regression and loess). They are at the heart of statistical learning.
We have two continuous normal variables X and Y. For example in the mtcars data table, X=wt and Y=mpg. Intuitively the regression line is the best fitting line through your data.
mtcars %>% ggplot(aes(x=wt, y=mpg)) +geom_point() +geom_smooth(method="lm",se=FALSE)
Many scientists misuse the regression line so it is important to know more about it:
In a linear regression model you assume that the average value of y for a given value of x is given by the relationship \[M(x)=\beta_0 + \beta_1x.\] \(M(x)\) is the mean value of all the y in your scatter plot in a narrow strip around x. Only Tyche, the Greek goddess of fortune, knows what \(\beta_0\) and \(\beta_1\) are.
This is called a parametric model because the relationship between \(M(x)\) and x is given by an equation with two parameters \(\beta_0\) and \(\beta_1\).
The error of the regression line in estimating \(y_i\) from \(x_i\) is called the residual and is \[y_i-(\beta_0 +\beta_1x_i).\]
Here is a picture of all of the residuals in a scatter plot.
residuals:
Thinking of \[ \sum_{i=1}^{n}(y_i-\beta_0 -\beta_1x_i)^2 \] as a function of \(\beta_0\) and \(\beta_1\), we can use calculus to find the values \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) that minimize this sum of squares.
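The minimization can also be done numerically, which makes the idea concrete. The sketch below uses the mtcars example (X=wt, Y=mpg) and R's general-purpose optimizer optim(); the function name sse and the starting point c(0, 0) are arbitrary choices for illustration.

```r
# Sum of squared residuals as a function of (beta0, beta1) for the mtcars example.
sse <- function(beta) sum((mtcars$mpg - beta[1] - beta[2] * mtcars$wt)^2)

# Numerical minimization; the starting point c(0, 0) is an arbitrary choice.
opt <- optim(c(0, 0), sse, method = "BFGS")
opt$par    # approximate minimizers (intercept, slope)
opt$value  # the minimized sum of squares
```

The result is only approximate; the calculus solution gives the exact minimizers in closed form.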
The regression line based on my sample is given by \[\widehat{M}(x)=\widehat{\beta_0} + \widehat{\beta_1}x.\] \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) are random variables here since you will get a different value with every sample you take. Again, only Tyche knows what the true parameters \(\beta_0\) and \(\beta_1\) are.
It turns out that \[\widehat{\beta_1} = Cov(x_i,y_i)/Var(x_i) \] and \[ \widehat{\beta_0}=\overline{y} -\widehat{\beta_1}\overline{x} \] where \(\overline{x}\) and \(\overline{y}\) are your sample averages.
For example
lm(formula = mpg ~ wt, data = mtcars)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept) wt
## 37.285 -5.344
so \(\widehat{\beta_0}=37.3\) is your estimated y intercept
and \(\widehat{\beta_1}=-5.3\) is your estimated slope
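The same two numbers can be recovered directly from the covariance and variance formulas. The names b0 and b1 below are just illustrative.

```r
# Slope and intercept from the sample covariance, variance and means.
b1 <- cov(mtcars$wt, mtcars$mpg) / var(mtcars$wt)
b0 <- mean(mtcars$mpg) - b1 * mean(mtcars$wt)
c(intercept = b0, slope = b1)

# lm() should return the same two numbers.
coef(lm(mpg ~ wt, data = mtcars))
```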
\(\widehat{M}(x)\) is a random variable. This means that for every different sample of points from our data set we will get another function of x. Next we will discuss how confident we can be in this function's estimate at a given point of our scatter plot.
Let \(x_0\) be an arbitrary data point (for example \(x_0=3\) is a car with weight 3000 pounds in the mtcars dataset). \(\widehat{M}(x_0)\) is then an estimate of the height of the regression line at \(x_0\) (i.e., the expected mpg of a car with weight 3000 pounds).
We have, \[\widehat{M}(x_0)=\widehat{\beta_0} +\widehat{\beta_1}x_0\] \[\widehat{M}(x_0)=(\overline{y}-\widehat{\beta_1}\overline{x}) + \widehat{\beta_1}x_0\] \[\widehat{M}(x_0)=\overline{y} + \widehat{\beta_1}(x_0-\overline{x})\]
From here, using the property that Var(A+B)=Var(A)+Var(B) if A and B are independent random variables, and the amazing fact that \(\overline{y}\) and \(\widehat{\beta_1}\) are independent random variables, you can show that
\[Var(\widehat{M}(x_0))=\frac{\sigma^2}{n} + \frac{(x_0-\overline{x})^2\sigma^2}{\sum_{i=1}^{n}(x_i-\overline{x})^2}.\]
What we see from this is that the variance of the height of the regression line varies with \(x_0\) and that it gets larger the further away \(x_0\) is from \(\overline{x}\). This is why the confidence band gets larger the further you are away from the point of averages \((\overline{x},\overline{y})\).
For example:
mtcars %>% ggplot(aes(x=wt, y=mpg)) +geom_point() +geom_smooth(method="lm") + geom_point(aes(x=mean(wt),y=mean(mpg)),size=5)
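The variance formula can also be checked numerically. In the sketch below \(\sigma^2\) is unknown, so it is replaced by its usual estimate from the residuals (summary(fit)$sigma^2); the chosen weights x0 are arbitrary illustrations.

```r
fit <- lm(mpg ~ wt, data = mtcars)
x   <- mtcars$wt
n   <- length(x)
s2  <- summary(fit)$sigma^2   # estimate of sigma^2 from the residuals

# Plug-in version of the variance formula at a few weights x0.
x0 <- c(2, mean(x), 3, 5)
var_hat <- s2 / n + (x0 - mean(x))^2 * s2 / sum((x - mean(x))^2)

# predict() reports the standard error of the fitted mean; its square should match.
se_fit <- predict(fit, newdata = data.frame(wt = x0), se.fit = TRUE)$se.fit
cbind(x0, formula = var_hat, predict = se_fit^2)
```

The smallest variance is at the mean weight, and it grows as x0 moves away in either direction, which is the widening of the band seen in the plot.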
We have two continuous variables X and Y. For example in the mtcars data table, X=wt and Y=mpg.
mtcars %>% ggplot(aes(x=wt, y=mpg)) +geom_point() +geom_smooth(se=FALSE)
Algorithm for Loess
Let \(x_0\) be an observation. For example \(x_0=3.44\) corresponding to the 3440 pound Merc 280.
Find the fraction (span) of the \(x_i\) closest to \(x_0\). For example, if span=.4 of 32 cars, then we look for the 13 (actually 0.4*32=12.8) closest car weights to the Merc 280 (shown in blue below).
Each of these points gets a weight \(K_{i0}\) that depends on how far it is from \(x_0\). In this example cars nearest to the Merc 280 have a weight close to 1 and blue cars further away have smaller weights. All the red cars have zero weights.
Just as we did for simple linear regression find \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) that minimize \[ \sum_{i=1}^{n} K_{i0}(y_i-\beta_0 -\beta_1x_i)^2. \] The difference here is that we have weights \(K_{i0}\).
The fitted value of \(x_0\) is given by \[\widehat{M}(x_0)=\widehat{\beta_0} + \widehat{\beta_1}x_0\]
We do this for every observation \(x_0\) in our dataset and connect the points \(\widehat{M}(x_0)\). So we are getting a different \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) for every \(x_0\) even though our notation doesn't indicate this. How we connect the points is a little complicated and I won't go into it. What is important is to understand that if the span is close to zero, then each local regression line is fitted to, and accurate over, only a very small range. Hence at every observation there will be an adjustment in the direction of the line, resulting in a wiggly curve. If the span is close to 1, then each local line is fitted over a large range and the curve will be almost straight.
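The steps above can be sketched by hand for a single \(x_0\). This is a simplified illustration, not what loess() does internally: the tricube weight function is an assumption (the notes do not name the weight function), the neighbour count follows the rounding to 13 used above, and the names K, h, local_fit and M_hat_x0 are made up here.

```r
x  <- mtcars$wt
y  <- mtcars$mpg
x0 <- mtcars["Merc 280", "wt"]
span <- 0.4

# Number of neighbours: 0.4 * 32 = 12.8, rounded up to 13 as in the notes.
q <- ceiling(span * length(x))

# Distance of every car from x0, and the distance to the q-th closest car.
d <- abs(x - x0)
h <- sort(d)[q]

# Tricube weights (an assumption): near 1 close to x0, 0 at distance h and beyond.
K <- ifelse(d < h, (1 - (d / h)^3)^3, 0)

# Weighted least squares line, then its height at x0.
local_fit <- lm(y ~ x, weights = K)
M_hat_x0 <- unname(predict(local_fit, newdata = data.frame(x = x0)))
M_hat_x0
```

Repeating this for every observation and joining the fitted heights gives the kind of curve drawn by the loess smoother.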
mtcars %>% ggplot(aes(x=wt, y=mpg)) +geom_point() +stat_smooth(se=FALSE,method="loess", span=.4)
mtcars %>% ggplot(aes(x=wt, y=mpg)) +geom_point() +stat_smooth(se=FALSE,method="loess", span=.8)
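The effect of the span can also be seen in numbers by fitting loess() directly with the two spans used in the plots above. The object names fit_04 and fit_08 are illustrative.

```r
# Fit loess directly with the two spans used in the plots above.
fit_04 <- loess(mpg ~ wt, data = mtcars, span = 0.4)
fit_08 <- loess(mpg ~ wt, data = mtcars, span = 0.8)

# enp is the "equivalent number of parameters": larger means a more flexible curve.
c(span_0.4 = fit_04$enp, span_0.8 = fit_08$enp)
```

A smaller span uses fewer neighbours per local fit, so the curve has more freedom to bend and the equivalent number of parameters goes up.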
The Loess method is nonparametric, meaning that we are entirely relaxing the linearity assumption.