Source file ⇒ 2017-lec8.Rmd

last compiled on Fri Feb 10 10:49:32 2017

Announcements

  1. I put file ScoreCard.Rda on B-courses assignment/hw3

Today:

  1. Aesthetics versus fixed attributes
  2. ggplot2 themes (Data camp chapter 3 ggplot part 2)
  3. Linear Model (simple linear regression)
  4. Loess (Locally weighted linear Regression)

1. Aesthetics versus fixed attributes

Aesthetics are properties of the graph that we map to a variable
(for example, col=sex in the BabyNames data set).
Attributes are properties of the graph that we set equal to a fixed value
(for example, col=“red”).

Examples

mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(aes(col=as.factor(cyl))) 

mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(col="red") 

Note: an attribute doesn’t get a legend since it takes only a fixed value.
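A common slip is to put a fixed value inside aes(). The sketch below is not from the lecture; it uses ggplot2's layer_data() to inspect the colours that each version actually draws, and the object names p_set and p_mapped are made up for illustration.

```r
library(ggplot2)

# Attribute: colour is set outside aes(), so every point is literally "red".
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point(col = "red")

# Aesthetic by mistake: inside aes(), "red" is treated as a one-level variable,
# so ggplot2 picks a colour from its default scale and adds a legend.
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point(aes(col = "red"))

unique(layer_data(p_set)$colour)     # the fixed value that was set
unique(layer_data(p_mapped)$colour)  # a colour chosen by the scale
```

If a plot shows an unexpected legend with a single entry, a fixed value inside aes() is a likely cause.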

2. Themes in ggplot2

The theming system in ggplot2 enables a user to control non-data elements of a ggplot object, for example the font size of the axis labels, the grid lines, the position of the legend and the background colors. These influence the rendering of the graphic but are independent of the data being plotted. These are called theme elements, i.e., aspects of a ggplot object that are capable of modifying its appearance but are neither directly related to data nor aesthetics associated with data.

To illustrate, let’s start with this plot:

p <- ggplot2::mpg %>% ggplot( aes(x = cty, y = hwy, color = factor(cyl))) +
  geom_jitter() +
  labs(
    x = "City mileage/gallon",
    y = "Highway mileage/gallon",
    color = "Cylinders"
  )
p

ggplot2 makes attractive plots, but sometimes you want to customize them to meet your needs.

The theme system in ggplot2 is composed of theme elements, theme element functions, theme functions and the theme() function. For example:

p + theme(axis.text = element_text(colour = "blue", size = 15, face = "italic"), axis.text.y = element_text(size = rel(0.7), angle = 90))

library(ggthemes)  #in console: install.packages("ggthemes") to install
p  + theme_igray()

Note that the colors of the points depend on the data itself, so to change them we use one of the scale_colour_*() functions rather than a theme (see the ggplot2 help).

p + scale_colour_brewer(palette = "Dark2") 

Theme elements and theme element functions

Most theme elements have several properties that can be modified through a corresponding element function.

  • element_text()
  • element_line()
  • element_rect()
  • element_blank()

The element_xx() functions modify theme elements with attributes (e.g., color, text size). Some theme elements are defined in terms of a unit of measurement, while others, such as legend.position, control the positioning of a theme element.

For example suppose you wish to:

  • increase the font size of the axis labels
  • remove the minor grid lines on both axes
  • move the legend inside the graphics region
  • change the background color of the graphics region

We can make those changes as follows:

p  +
  theme(
    axis.text = element_text(size = 14),
    legend.background = element_rect(fill = "white"),
    legend.position = c(0.14, 0.70),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "navy")
  )

Here is a detailed list of the theme element functions and their arguments:

element_text()

Purpose: To control the drawing of labels and headings.

The table below lists the arguments of element_text() and their corresponding default values.

| Argument | Description | Default |
|---|---|---|
| family | font family | “” |
| face | font face | plain |
| colour | font color | black |
| size | font size | 10 |
| hjust | horizontal justification | 0.5 |
| vjust | vertical justification | 0.5 |
| angle | text angle | 0 |
| lineheight | line height | 1.1 |

element_line()

Purpose: To draw lines and segments such as graphics region boundaries, axis tick marks and grid lines.

| Argument | Description | Default |
|---|---|---|
| colour | line color | black |
| size | line thickness | 0.5 |
| linetype | type of line | 1 |

element_rect()

Purpose: To draw rectangles. It is mostly used for background elements and legend keys.

| Argument | Description | Default |
|---|---|---|
| fill | fill color | none |
| colour | border color | black |
| size | thickness of border line | 0.5 |
| linetype | type of border line | 1 |

element_blank()

Purpose: To draw nothing.

Arguments: none.

The element_blank() function can be applied to any theme element controlled by a theme element function.

Another example

Examine the plot below and see how we made changes to the theme elements:

# Use theme() to modify theme elements
p + labs(title = "Highway vs. city mileage per gallon") +
  theme(
    axis.text = element_text(size = 20),
    plot.title = element_text(size = 20,color = "red"),
    legend.key = element_rect(fill = "black"),
    legend.background = element_rect(fill = "white"),
    legend.position = "right",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "grey40")
  )

In class exercise

Do the first two exercises:

http://gandalf.berkeley.edu:3838/alucas/Lecture-08-collection/

Inheritance of Theme Elements

The first three elements in ggplot-themes are text, line and rect, which, not coincidentally, are the same as the names of the basic theme element functions element_text(), element_line() and element_rect(). Other theme elements inherit the values of these theme elements. For example, the theme elements axis.text, legend.text, strip.text and axis.title all inherit from text, while axis.text.x and axis.text.y further inherit from axis.text. This means the values of the components of the theme element text are passed on to axis.text as well as other elements that inherit from text or its children. You can override the default values of one or more theme elements by calling theme() and modifying the desired properties of theme elements therein.

Here is a useful figure: [Figure: inheritance of theme elements]

Here is an example:

set.seed(123)

df <- diamonds[sample(1:nrow(diamonds), size = 1000),]

df %>% ggplot(aes(carat, price)) + 
  geom_point() + labs(title="Diamonds") +
  theme(
        text =element_text(size=30,colour="red", face="bold.italic"),
        axis.text = element_text(colour="purple"),
        axis.title=element_text(size=20,colour="blue"),
        axis.title.y=element_text(size=10, colour="green"))
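Inheritance can also be checked without drawing anything. The sketch below is an addition to the lecture: it applies the same settings to theme_grey() and asks ggplot2's calc_element() which colour each element ends up with. The name thm is made up, and reading a property with $ assumes a ggplot2 version where that accessor works on theme elements.

```r
library(ggplot2)

# Same settings as the example above, added to the default theme_grey().
thm <- theme_grey() +
  theme(
    text = element_text(size = 30, colour = "red", face = "bold.italic"),
    axis.text = element_text(colour = "purple"),
    axis.title = element_text(size = 20, colour = "blue"),
    axis.title.y = element_text(size = 10, colour = "green")
  )

# calc_element() resolves an element by walking up its inheritance chain.
calc_element("plot.title", thm)$colour    # inherited from text
calc_element("axis.title.x", thm)$colour  # inherited from axis.title
calc_element("axis.title.y", thm)$colour  # set directly, overrides axis.title
calc_element("axis.text.x", thm)$colour   # inherited from axis.text
```

The most specific setting wins: axis.title.y is green even though its parent axis.title is blue and its grandparent text is red.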

Theme functions

The purpose of a theme function is to either specify default settings for each theme element or modify the settings of an existing theme function to produce a new theme. For example, the foundational theme_grey function specifies default settings of each theme element, whereas theme_bw is a modification of theme_grey.
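As a small illustration (not part of the original notes), we can compare one element across the two theme functions. The exact fill values printed depend on the installed ggplot2 version, so none are quoted here; the names grey and bw are made up.

```r
library(ggplot2)

# A theme function returns a complete theme: a named collection of theme elements.
grey <- theme_grey()
bw   <- theme_bw()

# The panel background is one of the elements that theme_bw() changes.
fill_grey <- calc_element("panel.background", grey)$fill
fill_bw   <- calc_element("panel.background", bw)$fill

c(theme_grey = fill_grey, theme_bw = fill_bw)
```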

There are nice theme functions made by other users:

library(ggthemes)  #need to install ggthemes
mpg %>% ggplot( aes(x = cty, y = hwy, color = factor(cyl))) +
  geom_jitter() +
  labs(
    x = "City mileage/gallon",
    y = "Highway mileage/gallon",
    color = "Cylinders"
  ) + theme_igray()

Saving themes

Although the default themes in ggplot2 produce attractive graphics, they may not necessarily correspond with user requirements. If you find yourself modifying the same theme elements repeatedly with theme() or need to adapt a set of theme elements to conform to the requirements of a journal or other publication, then you should consider writing your own theme function.

Say my old default theme is theme_grey().

I have written my own theme called theme_pink that I would like to use in all of my plots.

theme_pink <- theme(panel.background = element_blank(),
                    legend.key = element_blank(),
                    legend.background = element_blank(),
                    strip.background = element_blank(),
                    plot.background = element_rect(fill = "red", color = "black", size = 3),
                    panel.grid = element_blank(),
                    axis.line = element_line(color = "black"),
                    axis.ticks = element_line(color = "black"),
                    strip.text = element_text(size = 16, color = "red"),
                    axis.title.y = element_text(color = "red", hjust = 0, face = "italic"),
                    axis.title.x = element_text(color = "red", hjust = 0, face = "italic"),
                    axis.text = element_text(color = "black"),
                    legend.position = "none")


mpg %>% ggplot( aes(x = cty, y = hwy, color = factor(cyl))) +
  geom_jitter() +
  labs(
    x = "City mileage/gallon",
    y = "Highway mileage/gallon",
    color = "Cylinders"
  ) + theme_pink

You can make this your default theme using theme_update(). The value returned by this function is your old theme, which you can save: old <- theme_update(...)

old <- theme_update(panel.background = element_blank(),
                    legend.key = element_blank(),
                    legend.background = element_blank(),
                    strip.background = element_blank(),
                    plot.background = element_rect(fill = "red", color = "black", size = 3),
                    panel.grid = element_blank(),
                    axis.line = element_line(color = "black"),
                    axis.ticks = element_line(color = "black"),
                    strip.text = element_text(size = 16, color = "red"),
                    axis.title.y = element_text(color = "red", hjust = 0, face = "italic"),
                    axis.title.x = element_text(color = "red", hjust = 0, face = "italic"),
                    axis.text = element_text(color = "black"),
                    legend.position = "none")

Now you don’t need to add theme_pink to your ggplot command.

set.seed(123)

df %>% ggplot(aes(carat, price)) + 
  geom_point() 

You can restore your old theme

theme_set(old)

df %>% ggplot(aes(carat, price)) + 
  geom_point()
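The save-and-restore pattern can be checked in isolation. This is a minimal sketch, not from the lecture, that changes a single element so the effect is easy to see; the variable names before, during and old are made up.

```r
library(ggplot2)

before <- theme_get()[["legend.position"]]     # setting in the current default theme

old <- theme_update(legend.position = "none")  # change the default, keep the old theme
during <- theme_get()[["legend.position"]]     # the default theme now hides legends

theme_set(old)                                 # put the previous default back
identical(theme_get()[["legend.position"]], before)
```

Saving the return value of theme_update() is what makes the change reversible; without it you would have to remember which theme was active.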

Colors in R

Here is a list of colors

Hexadecimal color code

Colors can be specified as a hexadecimal RGB triplet, such as “#0066CC”. The first two digits are the level of red, the next two green, and the last two blue. The value for each ranges from 00 to FF in hexadecimal (base-16) notation, which is equivalent to 0 to 255 in base-10. For example, in the table below, “#FFFFFF” is white and “#990000” is a deep red.

[Figure: table of hexadecimal color codes]
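Base R can translate between the two notations, which is handy when experimenting with colors. The short sketch below is an addition to the notes and uses only col2rgb() and rgb() from the grDevices package that ships with R.

```r
# Decode a hexadecimal colour into its red, green and blue levels (0 to 255).
col2rgb("#0066CC")

# Go the other way: build the hexadecimal code from base-10 levels.
deep_red <- rgb(153, 0, 0, maxColorValue = 255)
deep_red

# A named colour can be converted to hexadecimal the same way.
navy_hex <- rgb(t(col2rgb("navy")), maxColorValue = 255)
navy_hex
```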

Try some different colors in the code below (try hexadecimal and non hexadecimal code)

Example:

set.seed(955)
# Make some noisily increasing data
dat <- data.frame(xvar = 1:20 + rnorm(20,sd=3),
                  yvar = 1:20 + rnorm(20,sd=3))

ggplot(dat, aes(x=xvar, y=yvar)) +
    geom_point(shape=1) +    # Use hollow circles
    geom_smooth(method=lm, fill="#013A59") + # Add linear regression line
    theme_igray()  # theme from ggthemes (needs library(ggthemes))

In Class exercises

Do the last two exercises:

http://gandalf.berkeley.edu:3838/alucas/Lecture-08-collection/

3. Linear Model (simple linear regression)

Below we will discuss the most common parametric and nonparametric regression models (simple linear regression and loess). They are at the heart of statistical learning.

We have two continuous normal variables X and Y. For example in the mtcars data table, X=wt and Y=mpg. Intuitively the regression line is the best fitting line through your data.

mtcars %>% ggplot(aes(x=wt, y=mpg)) +geom_point() +geom_smooth(method="lm",se=FALSE)

Many scientists misuse the regression line so it is important to know more about it:

In a linear regression model you assume that the average value of y for a given value of x is given by the relationship \[M(x)=\beta_0 + \beta_1x.\] M(x) is the mean value of all the y values in a narrow strip around x in your scatter plot. Only Tyche, the Greek goddess of fortune, knows what \(\beta_0\) and \(\beta_1\) are.

This is called a parametric model because the relationship between \(M(x)\) and x is given by an equation with two parameters \(\beta_0\) and \(\beta_1\).

The error of the regression line in estimating \(y_i\) from \(x_i\) is called the residual and is

\[y_i-(\beta_0 +\beta_1x_i).\]

Here is a picture of all of the residuals in a scatter plot.

[Figure: residuals]

Thinking of the sum of squared residuals \[ \sum_{i=1}^{n}(y_i-\beta_0 -\beta_1x_i)^2 \] as a function of \(\beta_0\) and \(\beta_1\), we can use calculus to find the values \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) that minimize it.
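As a sketch of the calculus step (standard least-squares algebra, not spelled out in the lecture), write \(S(\beta_0,\beta_1)\) for the sum of squares; the symbol \(S\) is introduced here only for convenience. Setting both partial derivatives to zero gives two linear equations in the two unknowns:

\[\frac{\partial S}{\partial \beta_0} = -2\sum_{i=1}^{n}(y_i-\beta_0 -\beta_1x_i) = 0, \qquad \frac{\partial S}{\partial \beta_1} = -2\sum_{i=1}^{n}x_i(y_i-\beta_0 -\beta_1x_i) = 0.\]

Solving this pair of equations for \(\beta_0\) and \(\beta_1\) gives the minimizing values.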

The regression line based on my sample is given by \[\widehat{M}(x)=\widehat{\beta_0} + \widehat{\beta_1}x.\] \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) are random variables here since you will get a different value with every sample you take. Again, only Tyche knows what the true parameters, \(\beta_0\) and \(\beta_1\) are.
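To see this sampling variability, here is a small simulation that is not part of the original notes. Since we cannot draw fresh samples of cars, it resamples the rows of mtcars with replacement as a stand-in for "taking another sample"; the seed and the number of repetitions are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

# Refit the line on five resampled versions of mtcars (rows drawn with replacement).
slopes <- replicate(5, {
  rows <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[rows, ]))[["wt"]]
})

slopes  # five different estimates of the same unknown slope
```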

It turns out that \[\widehat{\beta_1} = Cov(x_i,y_i)/Var(x_i) \] and \[ \widehat{\beta_0}=\overline{y} -\widehat{\beta_1}\overline{x} \] where \(\overline{x}\) and \(\overline{y}\) are your sample averages.

For example

lm(formula = mpg ~ wt, data = mtcars)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

so \(\widehat{\beta_0}=37.3\) is your estimated y intercept
and \(\widehat{\beta_1}=-5.3\) is your estimated slope
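The closed-form expressions for the slope and intercept can be checked against lm() directly. This is a quick sketch added for illustration; the variable names x, y, b0 and b1 are made up.

```r
x <- mtcars$wt
y <- mtcars$mpg

b1 <- cov(x, y) / var(x)        # slope: sample covariance over sample variance
b0 <- mean(y) - b1 * mean(x)    # intercept: the line passes through (x-bar, y-bar)

c(intercept = b0, slope = b1)
coef(lm(mpg ~ wt, data = mtcars))  # same two numbers as the line above
```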

\(\widehat{M}(x)\) is a random variable. This means that for every different sample of points from our data set we will get another function of x. Next we will discuss how confident we can be in this function’s estimate at a given point.

Let \(x_0\) be an arbitrary data point (for example \(x_0=3\) is a car with weight 3000 pounds in the mtcars dataset). \(\widehat{M}(x_0)\) is then an estimate of the height of the regression line at \(x_0\) (i.e., the expected mpg of a car with weight 3000 pounds).

We have, \[\widehat{M}(x_0)=\widehat{\beta_0} +\widehat{\beta_1}x_0\] \[\widehat{M}(x_0)=(\overline{y}-\widehat{\beta_1}\overline{x}) + \widehat{\beta_1}x_0\] \[\widehat{M}(x_0)=\overline{y} + \widehat{\beta_1}(x_0-\overline{x})\]

From here, using the property that Var(A+B)=Var(A)+Var(B) if A and B are independent random variables, and the amazing fact that \(\overline{y}\) and \(\widehat{\beta_1}\) are independent random variables, you can show that

\[Var(\widehat{M}(x_0))=\frac{\sigma^2}{n} + \frac{(x_0-\overline{x})^2\sigma^2}{\sum_{i=1}^{n}(x_i-\overline{x})^2}.\]

What we see from this is that the variance of the height of the regression line varies with \(x_0\) and that it gets larger the further away \(x_0\) is from \(\overline{x}\). This is why the confidence band gets wider the further you are from the point of averages \((\overline{x},\overline{y})\).

For example:

mtcars %>% ggplot(aes(x=wt, y=mpg)) +geom_point() +geom_smooth(method="lm") + geom_point(aes(x=mean(wt),y=mean(mpg)),size=5)
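The variance formula can be compared with the standard errors that predict() reports. This is a sketch under one stated assumption: the unknown \(\sigma^2\) is replaced by its usual estimate from the residuals, summary(fit)$sigma^2. The three weights in x0 are arbitrary illustration values.

```r
fit <- lm(mpg ~ wt, data = mtcars)
x   <- mtcars$wt
n   <- length(x)
s2  <- summary(fit)$sigma^2          # estimate of sigma^2 from the residuals

# Plug-in version of the variance formula, evaluated at a few weights.
x0 <- c(2, mean(x), 5)
se_formula <- sqrt(s2 / n + (x0 - mean(x))^2 * s2 / sum((x - mean(x))^2))

# Standard errors reported by predict() at the same weights.
se_predict <- predict(fit, newdata = data.frame(wt = x0), se.fit = TRUE)$se.fit

cbind(x0, se_formula, se_predict)    # smallest at the mean weight
```

The two columns agree, and the standard error is smallest at the mean weight, which is where the confidence band in the plot is narrowest.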

4. Loess (Locally weighted linear Regression)

We have two continuous variables X and Y. For example in the mtcars data table, X=wt and Y=mpg.

mtcars %>% ggplot(aes(x=wt, y=mpg)) +geom_point() +geom_smooth(se=FALSE)

Algorithm for Loess

Let \(x_0\) be an observation. For example \(x_0=3.44\) corresponding to the 3440 pound Merc 280.

  1. We gather a fraction (span) of the \(x_i\) closest to \(x_0\).
    For example, if span=.4 of 32 cars, then we look for the 13 (0.4*32=12.8, rounded up) closest car weights to the Merc 280 (shown in blue below).

  2. We assign a weight \(K_{i0}=K(x_i,x_0)\) to each point in this neighborhood, so that the point furthest from \(x_0\) has weight zero, and the closest has the highest weight.

In this example cars nearest to the Merc 280 have a weight close to 1 and blue cars further away have smaller weights. All the red cars have zero weights.

  3. Just as we did for simple linear regression, find \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) that minimize \[ \sum_{i=1}^{n} K_{i0}(y_i-\beta_0 -\beta_1x_i)^2. \] The difference here is that we have weights \(K_{i0}\).

  4. The fitted value at \(x_0\) is given by \[\widehat{M}(x_0)=\widehat{\beta_0} + \widehat{\beta_1}x_0\]
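The steps above can be sketched in a few lines of R. This is an illustration under stated assumptions, not the code behind geom_smooth(): the function name local_fit is made up, the tricube kernel is an assumed choice for \(K\) (the notes do not name one), and the neighborhood size is rounded up as in the example. R's own loess() differs in some details, so its fitted values need not match this sketch exactly.

```r
# Local linear fit at a single point x0 (a sketch of the steps above).
local_fit <- function(x0, x, y, span = 0.4) {
  k <- ceiling(span * length(x))            # neighborhood size, e.g. 13 of 32
  d <- abs(x - x0)                          # distance of every x_i from x0
  h <- sort(d)[k]                           # distance to the k-th closest point
  w <- ifelse(d < h, (1 - (d / h)^3)^3, 0)  # tricube weights K_i0 (assumed kernel)
  fit <- lm(y ~ x, weights = w)             # weighted least squares
  unname(coef(fit)[1] + coef(fit)[2] * x0)  # height of the local line at x0
}

x0 <- mtcars["Merc 280", "wt"]
m_hat <- local_fit(x0, mtcars$wt, mtcars$mpg, span = 0.4)
m_hat
```

Calling local_fit() once per observed weight, for instance with sapply(), gives the points that the loess curve connects.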

We do this for every observation \(x_0\) in our dataset and connect the points \(\widehat{M}(x_0)\). So we are getting a different \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) for every \(x_0\) even though our notation doesn’t indicate this. How we connect the points is a little complicated and I won’t go into it. What is important to understand is that if the span is close to zero, each local regression line is accurate only over a very small range. Hence at every observation there will be an adjustment in the direction of the line, resulting in a wiggly curve. If the span is close to 1, each local regression line holds over a large range and the curve will be almost straight.

mtcars %>% ggplot(aes(x=wt, y=mpg)) +geom_point() +stat_smooth(se=FALSE,method="loess", span=.4)

mtcars %>% ggplot(aes(x=wt, y=mpg)) +geom_point() +stat_smooth(se=FALSE,method="loess", span=.8)

The Loess method is nonparametric, meaning that we are entirely relaxing the linearity assumption.

To Do: HW 3 including DataCamp. Next time: first three chapters of Intermediate R in data camp (conditionals/loops/functions)