Tour the Cheatsheet

Harold Nelson

6/12/2017

Load Libraries

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(RColorBrewer)
library(ggplot2movies)

Create Base Objects

a <- ggplot(mpg, aes(hwy))
b <- ggplot(mpg, aes(fl))
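# Define map here: it is otherwise not created until later in this section,
# and ggplot() would be handed purrr's map() function instead of a data frame
map <- map_data("state")
c <- ggplot(map, aes(long, lat))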
d <- ggplot(economics,aes(date,unemploy))
e <- ggplot(seals,aes(x=long,y=lat))
f <- ggplot(mpg,aes(cty,hwy))
g <- ggplot(mpg,aes(class,hwy))
h <- ggplot(diamonds,aes(cut,color))
i <- ggplot(movies,aes(year,rating))
j <- ggplot(economics,aes(date,unemploy))

df <- data.frame(grp = c("A", "B"), fit = 4:5, se = 1:2)
k <- ggplot(df, aes(grp, fit, ymin = fit-se, ymax = fit+se))

data <- data.frame(murder = USArrests$Murder, state = tolower(rownames(USArrests)))
map <- map_data("state")
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
l <- ggplot(data, aes(fill = murder))

seals$z <- with(seals, sqrt(delta_long^2 + delta_lat^2))
m <- ggplot(seals, aes(long, lat))


r <- b + geom_bar()
s <- ggplot(mpg, aes(fl, fill = drv))
t <- ggplot(mpg, aes(cty, hwy)) + geom_point()

Describing One Continuous Variable

Use geom_density() with the default settings.

a + geom_density()

Now try changing the parameter “adjust,” which multiplies the default bandwidth; values below 1 show more detail.

a + geom_density(adjust = .5) + ggtitle("adjust = .5")

a + geom_density(adjust = .25 ) + ggtitle("adjust = .25")
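The effect of adjust can also be checked numerically. This is a minimal sketch, assuming geom_density() starts from the same default bandwidth rule as base R's density(); the names bw_default and bw_half are illustrative and not part of the original tour.

```r
library(ggplot2)  # provides the mpg data used throughout

# Default bandwidth chosen by density() for highway mileage
bw_default <- density(mpg$hwy)$bw

# adjust = .5 halves that bandwidth, which is why the curve shows more detail
bw_half <- density(mpg$hwy, adjust = .5)$bw

c(default = bw_default, half = bw_half, ratio = bw_half / bw_default)
```

A smaller bandwidth means each observation influences a narrower stretch of the curve, so local bumps in the data become visible.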

Draw a basic histogram.

a + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Try to vary the number of bins from the default 30.

a + geom_histogram(bins=10) + ggtitle("bins = 10")

a + geom_histogram(bins=45) + ggtitle("bins = 45")
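The message printed by geom_histogram() points to binwidth as an alternative to bins. The sketch below fixes the width of each bar instead of the number of bars; the width of 2 mpg is chosen purely for illustration, and layer_data() is used only to inspect the bars ggplot2 computed.

```r
library(ggplot2)

a <- ggplot(mpg, aes(hwy))

# Fix the width of each bar (in mpg) instead of the number of bars
p <- a + geom_histogram(binwidth = 2) + ggtitle("binwidth = 2")
p

# Inspect the computed bins: every bar spans 2 units of hwy
bars <- layer_data(p)
unique(round(bars$xmax - bars$xmin, 6))
```

Setting binwidth is often easier to explain to a reader than a bin count, because it is expressed in the units of the variable.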

Try a violin plot

a + geom_violin(aes(y=hwy))

Try playing with adjust for more detail.

a + geom_violin(aes(y=hwy),adjust = .5) + ggtitle("adjust = .5")

a + geom_violin(aes(y=hwy),adjust = .25) + ggtitle("adjust = .25")

Exercise

Use the variable Sepal.Width in the iris dataframe to experiment with two of the methods of examining the distribution of a single continuous variable. Vary the level of detail at three different levels.

A Need to Rescale

Sometimes the nature of the data renders our usual graphical techniques for examining a single continuous variable useless. The usual culprit is outlying data: a few extreme values stretch the axis so far that the rest of the distribution becomes unreadable. The solution is to rescale, or transform, the data using logarithms.

We will use data in the countyComplete dataset from the openintro package.

library(openintro)
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following object is masked from 'package:datasets':
## 
##     cars
summary(countyComplete$pop2010)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      82   11100   25860   98230   66700 9819000

Try to view the distribution of this variable.

ccpoph <- ggplot(countyComplete,aes(x=pop2010)) + geom_histogram()
ccpoph
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ccpopd <- ggplot(countyComplete,aes(x=pop2010)) + geom_density()
ccpopd

The problem is that by including the outlying data, most of the data is cramped into a small space and the detail is lost.

This can be avoided by using a log rescaling.

ccpoph + scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ccpopd + scale_x_log10()
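To see why the log scale helps, the figures printed by summary() earlier can be compared on both scales. This sketch simply retypes those printed values into a vector (the name pop and its element names are illustrative), so it runs without the openintro package.

```r
# Five figures copied from the summary() output for pop2010
pop <- c(min = 82, q1 = 11100, median = 25860, q3 = 66700, max = 9819000)

# On the raw scale the largest county is hundreds of times the median
unname(pop["max"] / pop["median"])

# On the log10 scale the same figures span only about five units
round(log10(pop), 2)
```

Because the quartiles land between roughly 4 and 5 on the log10 scale, the bulk of the counties get a visible share of the axis instead of being squeezed against the left edge.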

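Note that we can do almost the same thing by creating a new variable in the dataframe. The difference is that the axis then shows the logarithms themselves rather than the original population values.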

countyComplete$lpop <- log10(countyComplete$pop2010)
ggplot(countyComplete,aes(x=lpop)) +
  geom_histogram() +
  scale_x_continuous("Logarithms of the 2010 Population")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise

Explore the distribution of density in countyComplete. Use both the raw data and a log rescaling. Try two different geoms.

Discrete x and Continuous y

g + geom_boxplot()

g + geom_violin(aes(color=class))

g + geom_violin(aes(fill=class))

Exercise

Look at the documentation for RColorBrewer and pick out a new color scheme. Change the colors in the violin plots.

Discrete x and Discrete y

h + geom_jitter(alpha = .1,shape=1) +
  ggtitle("alpha = .1,shape=1")

h + geom_jitter(alpha = .25)

Here’s a different way to do this using dplyr. Graphical work almost always involves some restructuring of the data.

diamonds %>%
  group_by(color,cut) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  ggplot(aes(x=cut,y=color,
             color=count,size=count)) +
  geom_point()
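The same restructuring can be written more compactly. This is a sketch of two alternatives rather than a replacement for the pipeline above: dplyr's count() and ggplot2's geom_count(). The name cell_counts is illustrative.

```r
library(dplyr)
library(ggplot2)

# count() collapses group_by() + summarize(n()) + ungroup() into one step;
# the result column is called n rather than count
cell_counts <- diamonds %>% count(color, cut)

ggplot(cell_counts, aes(x = cut, y = color, color = n, size = n)) +
  geom_point()

# geom_count() does the tallying inside ggplot2 and maps it to size
ggplot(diamonds, aes(cut, color)) + geom_count()
```

Doing the tally in dplyr keeps the counts available for other uses, such as mapping them to color as well as size; geom_count() is shorter when only the plot is needed.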

Two Continuous Variables

With small datasets a rug can be a useful complement to a scatterplot.

f + geom_rug(sides = "bl")

f + geom_rug(sides = "bl") + geom_point()

Regression smoothing uses geom_smooth() with method = “lm”.

f + geom_smooth(method = "lm")

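Loess smoothing uses method = “loess”. When no method is given, geom_smooth() picks loess for fewer than 1,000 observations and gam for larger datasets, so loess is the default for mpg.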

f + geom_smooth(method="loess")
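The line drawn by the lm smoother is an ordinary least-squares fit, which can be confirmed by fitting the model directly. A minimal sketch, assuming the smoother uses the formula y ~ x (passed explicitly here); fit, p, and smooth_line are illustrative names.

```r
library(ggplot2)

# Fit the same straight line that geom_smooth(method = "lm") draws
fit <- lm(hwy ~ cty, data = mpg)
coef(fit)

# Pull the fitted values out of the plot layer and compare with predict()
p <- ggplot(mpg, aes(cty, hwy)) + geom_smooth(method = "lm", formula = y ~ x)
smooth_line <- layer_data(p)
max(abs(smooth_line$y - predict(fit, data.frame(cty = smooth_line$x))))
```

The last expression should be zero up to rounding error, which shows the smoother adds nothing beyond the fitted line and its confidence band.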

Both the lm and loess methods produce estimates of the mean value of the predicted variable for each value of the predictor. Quantile regression produces estimates of specified percentiles of the distribution of the predicted variable given the value of the predictor.

# The default is to supply the 25th, 50th and 75th percentiles.
f + geom_quantile()
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
## Smoothing formula not specified. Using: y ~ x

f + geom_quantile(quantiles = c(0.05,0.95))
## Smoothing formula not specified. Using: y ~ x
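The lines drawn by geom_quantile() can be examined as fitted models. This sketch assumes geom_quantile() relies on the quantreg package, as the loading message above suggests, and calls its rq() function directly; qfit is an illustrative name.

```r
library(ggplot2)   # mpg data
library(quantreg)  # rq() fits linear quantile regressions

# One line per requested quantile: the 5th and 95th percentiles of hwy given cty
qfit <- rq(hwy ~ cty, tau = c(0.05, 0.95), data = mpg)
coef(qfit)

# Compare with the conditional-mean line from ordinary least squares
coef(lm(hwy ~ cty, data = mpg))
```

Each column of coef(qfit) holds the intercept and slope for one quantile, so the gap between the two lines indicates how spread out hwy is at a given value of cty.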

Exercise

Examine the relationship between carat and price in the diamonds dataset. In one graph, place an lm smoother colored red, a loess smoother colored green and a quantile regression of the 50th percentile colored blue. Add the points with a shape of 1 and experiment with values of alpha to get what looks good to you.

Answer

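ggplot(diamonds,aes(x=carat,y=price)) +
  geom_smooth(method = "lm",color="red") +
  # quantiles = .5 gives the 50th percentile asked for in the exercise
  geom_quantile(quantiles=.5,color="blue") +
  # With more than 1,000 rows the default smoother is gam, not loess;
  # method = "loess" would match the exercise but is slow on data this large
  geom_smooth(color="green") +
  geom_point(shape=1,alpha=.008)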
## Smoothing formula not specified. Using: y ~ x
## `geom_smooth()` using method = 'gam'