Harold Nelson
6/12/2017
## ── Attaching packages ───────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
a <- ggplot(mpg, aes(hwy))
b <- ggplot(mpg, aes(fl))
#c <- ggplot(map, aes(long, lat))
#d <- ggplot(economics,aes(date,unemploy))
#e <- ggplot(seals,aes(x=long,y=lat))
f <- ggplot(mpg,aes(cty,hwy))
g <- ggplot(mpg,aes(class,hwy))
h <- ggplot(diamonds,aes(cut,color))
#i <- ggplot(movies,aes(year,rating))
#j <- ggplot(economics,aes(date,unemploy))
#df <- data.frame(grp = c("A", "B"), fit = 4:5, se = 1:2)
#k <- ggplot(df, aes(grp, fit, ymin = fit-se, ymax = fit+se))
dat1 <- data.frame(murder = USArrests$Murder, state = tolower(rownames(USArrests)))
#map <- map_data("state")
l <- ggplot(dat1, aes(fill = murder))
seals$z <- with(seals, sqrt(delta_long^2 + delta_lat^2))
m <- ggplot(seals, aes(long, lat))
r <- b + geom_bar()
s <- ggplot(mpg, aes(fl, fill = drv))
t <- ggplot(mpg, aes(cty, hwy)) + geom_point()
Use “a” above to start. Use geom_density() with the default settings.
Now try changing the parameter “adjust.”
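One way to approach these two prompts is sketched below. The base plot is rebuilt so the block runs on its own, and the `adjust` values 0.25 and 2 are arbitrary choices for illustration, not values given in the exercise.

```r
library(ggplot2)

# Same base plot as `a` in the setup code
a <- ggplot(mpg, aes(hwy))

# Default settings (adjust = 1)
p_default <- a + geom_density()

# adjust multiplies the default bandwidth:
# values below 1 follow the data more closely, values above 1 smooth more
p_rough  <- a + geom_density(adjust = 0.25)
p_smooth <- a + geom_density(adjust = 2)

p_rough
```

Comparing `p_rough` and `p_smooth` side by side shows how much of the apparent shape depends on the bandwidth.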
Do a basic histogram.
Try to vary the number of bins from the default 30.
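A minimal sketch of the histogram prompts, again rebuilding `a` so the block is self-contained. The bin counts and the bin width are illustrative choices only.

```r
library(ggplot2)

a <- ggplot(mpg, aes(hwy))

# Fewer bins give a coarser picture, more bins a finer one
p_coarse <- a + geom_histogram(bins = 10)
p_fine   <- a + geom_histogram(bins = 50)

# binwidth sets the width in data units (here 2 mpg) instead of the bin count
p_width  <- a + geom_histogram(binwidth = 2)

p_fine
```

Setting `bins` or `binwidth` explicitly also silences the `stat_bin()` message about the default of 30 bins.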
Try a violin plot.
Try playing with adjust for more detail.
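A violin plot needs a discrete x and a continuous y, so the sketch below assumes the base plot `g` (class against hwy) from the setup code is the intended starting point. The `adjust` value of 0.5 is an arbitrary illustration.

```r
library(ggplot2)

# Same base plot as `g` in the setup code (assumed starting point)
g <- ggplot(mpg, aes(class, hwy))

# Default violin
p_violin <- g + geom_violin()

# A smaller adjust narrows the kernel bandwidth and shows more detail
p_detail <- g + geom_violin(adjust = 0.5)

p_detail
```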
Use the variable Sepal.Width in the iris dataframe to experiment with two of the methods of examining the distribution of a single continuous variable. Vary the level of detail at three different levels.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Sometimes the nature of the data renders our usual graphical techniques for examining single continuous variables useless. The problem is outlying data. The solution to this is re-scaling or transforming the data using logs.
We will use data in the countyComplete dataset from the openintro package.
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following objects are masked from 'package:datasets':
##
## cars, trees
Do a summary of the variable pop2010 in this dataframe.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 82 11104 25857 98233 66699 9818605
Try to view the distribution of this variable.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The problem is that by including the outlying data, most of the data is cramped into a small space and the detail is lost.
This can be avoided by using a log rescaling.
Add scale_x_log10() to your graph.
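The effect can be sketched with ggplot2's built-in `midwest` data (county populations for five states), used here as a stand-in so the example runs without the openintro package. The choice of `midwest` and its `poptotal` column is an assumption for illustration; it is not the dataset this exercise asks for.

```r
library(ggplot2)

# Stand-in data: midwest$poptotal is county population, strongly right-skewed
raw <- ggplot(midwest, aes(x = poptotal)) +
  geom_histogram(bins = 30)

# Same histogram on a log10 x axis; the bins are computed on the log scale
logged <- raw + scale_x_log10()

logged
```

In `raw`, a handful of very large counties push nearly everything into the first few bars. In `logged`, the bulk of the distribution is spread across the axis.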
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Note that we can do almost the same thing by creating a new variable in the dataframe.
countyComplete$lpop = log10(countyComplete$pop2010)
ggplot(countyComplete,aes(x=lpop)) + geom_histogram() + scale_x_continuous("Logarithms of the 2010 Population")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
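To see why the two approaches are only "almost" the same, the sketch below builds both versions on the stand-in `midwest` data from ggplot2 (an assumption made so the block runs without openintro). The bars are expected to match, apart from possible floating-point edge cases. What differs is the axis: the scale version labels it in original units, the new-variable version in log10 units.

```r
library(ggplot2)

# Approach 1: transform the axis, keep the original variable
p_scale <- ggplot(midwest, aes(x = poptotal)) +
  geom_histogram(bins = 30) +
  scale_x_log10()

# Approach 2: transform the variable, keep a linear axis
mw <- transform(midwest, lpop = log10(poptotal))
p_var <- ggplot(mw, aes(x = lpop)) +
  geom_histogram(bins = 30) +
  scale_x_continuous("Logarithms of total population")

# Every county lands in some bin in both versions
sum(layer_data(p_scale)$count)
sum(layer_data(p_var)$count)
```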
Explore the distribution of density in countyComplete. Use both the raw data and a log rescaling. Try two different geoms.
Here’s a different way to do this using dplyr. Graphical work almost always involves some restructuring of the data.
diamonds %>%
  group_by(color,cut) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  ggplot(aes(x=cut,y=color,
             color=count,size=count)) + geom_point()
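As a possible shortcut for the restructuring step, dplyr's `count()` combines the grouping, tallying, and ungrouping. The sketch below compares the two forms. It names `ggplot2::diamonds` explicitly because the attach messages earlier show openintro masking `diamonds`. The object names `counts_long` and `counts_short` are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Long form, as in the pipeline above
counts_long <- ggplot2::diamonds %>%
  group_by(color, cut) %>%
  summarize(count = n()) %>%
  ungroup()

# count() groups, tallies, and ungroups in one step
counts_short <- ggplot2::diamonds %>%
  count(color, cut, name = "count")

p_counts <- ggplot(counts_short,
                   aes(x = cut, y = color, color = count, size = count)) +
  geom_point()

p_counts
```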
With small datasets a rug can be a useful complement to a scatterplot.
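A minimal sketch, rebuilding the scatterplot from the setup code under the hypothetical name `scatter`. The `alpha` value is an illustrative choice that keeps overlapping rug ticks readable.

```r
library(ggplot2)

# Scatterplot of city vs highway mileage (same as `t` in the setup code)
scatter <- ggplot(mpg, aes(cty, hwy)) + geom_point()

# geom_rug() draws a tick on each axis for every observation,
# showing the marginal distributions alongside the points
p_rug <- scatter + geom_rug(alpha = 0.3)

p_rug
```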
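Regression smoothing uses geom_smooth() with method = “lm”.
Loess smoothing uses method = “loess”, which geom_smooth() picks by default only when there are fewer than 1,000 observations; larger datasets default to “gam”.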
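Both smoothers can be sketched on the `mpg` scatterplot. The colors are arbitrary, and `formula = y ~ x` is written out only to avoid the message ggplot2 otherwise prints.

```r
library(ggplot2)

scatter <- ggplot(mpg, aes(cty, hwy)) + geom_point()

# Straight-line fit with its confidence band
p_lm <- scatter +
  geom_smooth(method = "lm", formula = y ~ x, color = "red")

# Local regression; named explicitly rather than relying on the default
p_loess <- scatter +
  geom_smooth(method = "loess", formula = y ~ x, color = "darkgreen")

p_lm
```

Naming the method explicitly keeps the plot from silently changing smoother if the dataset later grows.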
Both the lm and loess methods produce estimates of the mean value of the predicted variable for each value of the predictor. Quantile regression produces estimates of specified percentiles of the distribution of the predicted variable given the value of the predictor.
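A sketch of quantile regression lines on the `mpg` scatterplot. The three quantiles are illustrative choices. `geom_quantile()` relies on the quantreg package being installed, so the plot is only drawn when that package is available.

```r
library(ggplot2)

# Lines for the 10th, 50th, and 90th percentiles of hwy given cty
p_quant <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(alpha = 0.4) +
  geom_quantile(quantiles = c(0.1, 0.5, 0.9), formula = y ~ x)

# Drawing the plot needs quantreg; skip quietly if it is missing
if (requireNamespace("quantreg", quietly = TRUE)) {
  print(p_quant)
}
```

The gap between the outer lines gives a rough picture of how spread out hwy is at each value of cty, which a mean-based smoother does not show.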
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
Examine the relationship between carat and price in the diamonds dataset. In one graph, place an lm smoother colored red, a loess smoother colored green and a quantile regression of the 50th percentile colored blue. Add the points with a shape of 1 and experiment with values of alpha to get what looks good to you.
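ggplot(diamonds,aes(x=carat,y=price)) +
  geom_smooth(method = "lm",color="red") +
  # 50th percentile (median), as the exercise asks
  geom_quantile(quantiles=.5,color="blue") +
  # no method given: on a dataset this large geom_smooth() falls back to gam, not loess
  geom_smooth(color="green") +
  geom_point(shape=1,alpha=.008)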
## Smoothing formula not specified. Using: y ~ x
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'