Tour the Cheatsheet

Harold Nelson

6/12/2017

Load Libraries

library(tidyverse)

## ── Attaching packages ───────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(RColorBrewer)
library(ggplot2movies)

Create Base Objects

a <- ggplot(mpg, aes(hwy))
b <- ggplot(mpg, aes(fl))
#c <- ggplot(map, aes(long, lat))
#d <- ggplot(economics,aes(date,unemploy))
#e <- ggplot(seals,aes(x=long,y=lat))
f <- ggplot(mpg,aes(cty,hwy))
g <- ggplot(mpg,aes(class,hwy))
h <- ggplot(diamonds,aes(cut,color))
#i <- ggplot(movies,aes(year,rating))
#j <- ggplot(economics,aes(date,unemploy))

#df <- data.frame(grp = c("A", "B"), fit = 4:5, se = 1:2)
#k <- ggplot(df, aes(grp, fit, ymin = fit-se, ymax = fit+se))

dat1 <- data.frame(murder = USArrests$Murder, state = tolower(rownames(USArrests)))
#map <- map_data("state")
l <- ggplot(dat1, aes(fill = murder))

seals$z <- with(seals, sqrt(delta_long^2 + delta_lat^2))
m <- ggplot(seals, aes(long, lat))


r <- b + geom_bar()
s <- ggplot(mpg, aes(fl, fill = drv))
t <- ggplot(mpg, aes(cty, hwy)) + geom_point()

Describing One Continuous Variable

Exercise

Use “a” above to start. Use geom_density() with the default settings.

Answer

a + geom_density()

Exercise

Now try changing the parameter “adjust.”

Answer

a + geom_density(adjust = .5) + ggtitle("adjust = .5")

a + geom_density(adjust = .25 ) + ggtitle("adjust = .25")
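
To see what adjust changes numerically, here is a minimal sketch using stats::density(), which, as far as I know, is the estimator geom_density() relies on. The object names d_default and d_half are illustrative only.

```r
library(ggplot2)  # loaded only for the mpg data

# Kernel density estimates of hwy at two settings of adjust
d_default <- density(mpg$hwy)                # adjust = 1
d_half    <- density(mpg$hwy, adjust = 0.5)

# adjust multiplies the bandwidth, so this ratio is 0.5
d_half$bw / d_default$bw
```

A smaller bandwidth means less smoothing, which is why the curves get bumpier as adjust shrinks.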

Exercise

Do a basic histogram.

Answer

a + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise

Try to vary the number of bins from the default 30.

Answer

a + geom_histogram(bins=10) + ggtitle("bins = 10")

a + geom_histogram(bins=45) + ggtitle("bins = 45")
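
The message printed by geom_histogram() suggests binwidth as an alternative to bins. The following is a small sketch of that option; the width of 2 miles per gallon is an arbitrary choice for illustration.

```r
library(ggplot2)

# binwidth is given in the units of hwy (miles per gallon)
p <- ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 2) +
  ggtitle("binwidth = 2")
p

# layer_data() returns the bins ggplot2 computed for the plot
bins <- layer_data(p)
sum(bins$count) == nrow(mpg)  # every row lands in some bin
```

Setting binwidth ties the bins to a meaningful unit, while bins only fixes how many there are.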

Exercise

Try a violin plot.

Answer

a + geom_violin(aes(y=hwy))

Exercise

Try playing with adjust for more detail.

Answer

a + geom_violin(aes(y=hwy),adjust = .5) + ggtitle("adjust = .5")

a + geom_violin(aes(y=hwy),adjust = .25) + ggtitle("adjust = .25")

Exercise

Use the variable Sepal.Width in the iris dataframe to experiment with two of the methods of examining the distribution of a single continuous variable. Vary the level of detail at three different levels.

Answer

sw <- ggplot(iris,aes(x=Sepal.Width))

sw + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

sw + geom_histogram(bins = 10)

sw + geom_density()

sw + geom_density() + geom_rug()

sw + geom_density(adjust=.5)

A Need to Rescale

Sometimes outlying values make our usual graphical techniques for examining a single continuous variable useless. One solution is to rescale or transform the data using logarithms.

We will use data in the countyComplete dataset from the openintro package.

library(openintro)

## Please visit openintro.org for free statistics materials

## 
## Attaching package: 'openintro'

## The following object is masked from 'package:ggplot2':
## 
##     diamonds

## The following objects are masked from 'package:datasets':
## 
##     cars, trees

Exercise

Do a summary of the variable pop2010 in this dataframe.

Answer

summary(countyComplete$pop2010)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      82   11104   25857   98233   66699 9818605

Exercise

Try to view the distribution of this variable.

Answer

ccpoph <- ggplot(countyComplete,aes(x=pop2010)) + geom_histogram()
ccpoph

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ccpopd <- ggplot(countyComplete,aes(x=pop2010)) + geom_density()
ccpopd

The problem is that the outlying counties stretch the axis, so most of the data is crammed into a small part of it and the detail is lost.

This can be avoided by using a log rescaling.
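
A quick way to see why the log scale helps is to check where the median falls between the minimum and the maximum on each scale. This sketch uses only the three values printed by summary() above; the vector name pop is illustrative.

```r
# Minimum, median and maximum of pop2010, copied from the summary above
pop <- c(min = 82, median = 25857, max = 9818605)

# Raw scale: position of the median as a fraction of the full range
raw_pos <- unname((pop["median"] - pop["min"]) / (pop["max"] - pop["min"]))
raw_pos   # well under 1% of the way along the axis

# log10 scale: the same calculation after taking logs
lp <- log10(pop)
log_pos <- unname((lp["median"] - lp["min"]) / (lp["max"] - lp["min"]))
log_pos   # close to the middle of the axis
```

On the raw scale half of the counties sit in a sliver at the left edge of the plot; on the log scale they spread across about half of it.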

Exercise

Add scale_x_log10() to your graph.

Answer

ccpoph + scale_x_log10()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ccpopd + scale_x_log10()

Note that we can do almost the same thing by creating a new variable in the dataframe.

countyComplete$lpop = log10(countyComplete$pop2010)
ggplot(countyComplete,aes(x=lpop)) + geom_histogram() + scale_x_continuous("Logarithms of the 2010 Population")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
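
The two approaches differ mainly in how the axis is labelled. Here is a sketch on made-up data (the data frame d and its columns are hypothetical stand-ins for countyComplete) that builds both versions and inspects the bins ggplot2 computes.

```r
library(ggplot2)

# Hypothetical skewed data standing in for pop2010
set.seed(1)
d <- data.frame(pop = round(10^runif(500, min = 2, max = 7)))
d$lpop <- log10(d$pop)

p_scale <- ggplot(d, aes(x = pop)) + geom_histogram(bins = 30) + scale_x_log10()
p_var   <- ggplot(d, aes(x = lpop)) + geom_histogram(bins = 30)

# In both plots the binning happens on log10(pop)
bins_scale <- layer_data(p_scale)
bins_var   <- layer_data(p_var)
range(bins_scale$x)
range(bins_var$x)
```

The bars should look alike in both plots. With scale_x_log10() the axis is labelled in the original units, while the new variable is labelled in powers of ten.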

Exercise

Explore the distribution of density in countyComplete. Use both the raw data and a log rescaling. Try two different geoms.

Discrete x and Continuous y

g + geom_boxplot()

g + geom_violin(aes(color=class))

g + geom_violin(aes(fill=class))

g + geom_violin(aes(fill=class)) +
  scale_fill_brewer(palette = "Accent")

Discrete x and Discrete y

h + geom_jitter(alpha = .1,shape=1) +
  ggtitle("alpha = .1,shape=1")

h + geom_jitter(alpha = .25)

Here’s a different way to do this using dplyr. Graphical work almost always involves some restructuring of the data.

diamonds %>%
  group_by(color,cut) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  ggplot(aes(x=cut,y=color,
             color=count,size=count)) +
  geom_point()
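
As a variation on the same restructuring, dplyr's count() does the grouping and tallying in one step. This is only a sketch: the geom_tile() display is my choice rather than part of the original exercise, and ggplot2::diamonds is spelled out because openintro masks diamonds once it is loaded.

```r
library(dplyr)
library(ggplot2)

# One row per color/cut combination; the tally is stored in column n
counts <- ggplot2::diamonds %>%
  count(color, cut)

# Show the counts as a heat map instead of sized points
ggplot(counts, aes(x = cut, y = color, fill = n)) +
  geom_tile()
```

The summarized table has one row per cell of the plot, which is much easier to draw than the tens of thousands of jittered points.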

Two Continuous Variables

With small datasets a rug can be a useful complement to a scatterplot.

f + geom_rug(sides = "bl")

f + geom_rug(sides = "bl") + geom_point()

Regression smoothing uses geom_smooth() with method = “lm”.

f + geom_smooth(method = "lm")

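Loess smoothing uses method = “loess”. This is what geom_smooth() picks by default for data with fewer than 1,000 observations; for larger data it switches to “gam”.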

f + geom_smooth(method="loess")
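
When no method is supplied, geom_smooth() chooses one from the size of the data and reports its choice in a message. A short sketch of that behaviour, assuming the documented threshold of 1,000 observations:

```r
library(ggplot2)

n_mpg      <- nrow(mpg)                # small: loess is chosen
n_diamonds <- nrow(ggplot2::diamonds)  # large: gam is chosen
c(mpg = n_mpg, diamonds = n_diamonds)

# No method given: the message names the method ggplot2 picked
ggplot(mpg, aes(cty, hwy)) + geom_smooth()
```

Passing method explicitly, as in the call above it, avoids any surprise when the same code is reused on a larger dataset.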

Both the lm and loess methods produce estimates of the mean value of the predicted variable for each value of the predictor. Quantile regression produces estimates of specified percentiles of the distribution of the predicted variable given the value of the predictor.

library(quantreg)

## Loading required package: SparseM

## 
## Attaching package: 'SparseM'

## The following object is masked from 'package:base':
## 
##     backsolve

# The default is to supply the 25th, 50th and 75th percentiles.
f + geom_quantile()

## Smoothing formula not specified. Using: y ~ x

f + geom_quantile(quantiles = c(0.05,0.95))

## Smoothing formula not specified. Using: y ~ x
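
The lines drawn above can also be obtained as numbers. The sketch below fits the mean with lm() and three percentiles with quantreg::rq(), which I believe is the function geom_quantile() calls; mean_fit and q_fit are illustrative names.

```r
library(ggplot2)   # loaded only for the mpg data
library(quantreg)

# Conditional mean of hwy given cty
mean_fit <- lm(hwy ~ cty, data = mpg)

# Conditional 25th, 50th and 75th percentiles of hwy given cty
q_fit <- rq(hwy ~ cty, tau = c(0.25, 0.5, 0.75), data = mpg)

coef(mean_fit)  # one intercept and one slope
coef(q_fit)     # one intercept and one slope per percentile
```

Comparing the columns of coef(q_fit) shows how the slope and intercept shift between the lower and upper parts of the distribution.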

Exercise

Examine the relationship between carat and price in the diamonds dataset. In one graph, place an lm smoother colored red, a loess smoother colored green and a quantile regression of the 50th percentile colored blue. Add the points with a shape of 1 and experiment with values of alpha to get what looks good to you.

Answer

ggplot(diamonds,aes(x=carat,y=price)) +
  geom_smooth(method = "lm",color="red") +
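  geom_quantile(quantiles=.5,color="blue") +   # 50th percentile, as the exercise asks
  # no method given: for data this large the default smoother is gam rather than loess
  geom_smooth(color="green") +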
  geom_point(shape=1,alpha=.008)

## Smoothing formula not specified. Using: y ~ x

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'