To make the most of ggplot2 it is important to wrap your mind around “The Grammar of Graphics”. Briefly, the act of building a graph can be broken down into three steps:
Define what data set we are using.
What is the major relationship we wish to examine?
In what way should we present that relationship?
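The three steps above map directly onto a ggplot2 call. The following is a minimal sketch, assuming the built-in `trees` data frame and a scatterplot as the presentation; the variable name `p` is only illustrative.

```r
library(ggplot2)

# Step 1: the data set (the built-in `trees` data frame).
# Step 2: the relationship to examine, expressed as aesthetic mappings.
p <- ggplot(data = trees, mapping = aes(x = Height, y = Volume))

# Step 3: how to present that relationship, here as points.
p <- p + geom_point()
p
```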
# Problem 1
## Create a graph using ggplot2 with Height on the x-axis, Volume on the y-axis, and Girth as either the size or the color of the data point. Which do you think is the more intuitive representation?
## Add appropriate labels for the main title and the x and y axes.
## The R-squared value for a regression through these points is 0.36 and the p-value for the statistical significance of height is 0.00038. Add text labels “R-squared = 0.36” and “p-value = 0.0004” somewhere on the graph.
### ggplot(data=z, aes(x=Column_X, y=Column_Y)) +
### geom_XXX()
library(ggplot2)
# Girth is mapped to point size; the annotations use the label text requested in the problem.
Trees.Plot <- ggplot(trees, aes(x=Height, y=Volume)) +
  geom_point(aes(size=Girth)) +
  labs(title="Tree Height vs. Volume", x="Height (ft)", y="Volume (ft^3)", size="Diameter (in)", caption="A scatterplot comparing the height and girth of cherry trees to the volume of lumber they produced") +
  annotate('label', x=65, y=67, size=2.5, label="R-squared = 0.36") +
  annotate('label', x=65, y=63, size=2.5, label="p-value = 0.0004")
Trees.Plot
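The R-squared and p-value quoted in the problem can be checked by fitting the regression directly. This is a sketch that assumes the values come from a simple linear model of Volume on Height; the object names `fit`, `r_squared`, and `p_value` are illustrative.

```r
# Fit Volume as a linear function of Height using the built-in trees data.
fit <- lm(Volume ~ Height, data = trees)
fit_summary <- summary(fit)

# Proportion of variance in Volume explained by Height.
r_squared <- fit_summary$r.squared

# p-value for the Height coefficient.
p_value <- fit_summary$coefficients["Height", "Pr(>|t|)"]

round(r_squared, 2)
signif(p_value, 2)
```

Computing the numbers this way would also let the annotation text be built with `paste0()` instead of typed by hand.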
# Problem 2
## Consider the following small dataset that represents the number of times per day my wife played “Ring around the Rosy” with my daughter relative to the number of days since she has learned this game. The column yhat represents the best fitting line through the data, and lwr and upr represent a 95% confidence interval for the predicted value on that day. Because these questions ask you to produce several graphs and evaluate which is better and why, please include each graph and response with each sub-question.
Rosy <- data.frame(
times = c(15, 11, 9, 12, 5, 2, 3),
day = 1:7,
yhat = c(14.36, 12.29, 10.21, 8.14, 6.07, 4.00, 1.93),
lwr = c(9.54, 8.5, 7.22, 5.47, 3.08, 0.22, -2.89),
upr = c(19.18, 16.07, 13.2, 10.82, 9.06, 7.78, 6.75)
)
### a. Using ggplot() and geom_point(), create a scatterplot with day along the x-axis and times along the y-axis.
Rosy.Plot <- ggplot(Rosy, aes(x=day, y=times)) +
geom_point(shape=1) +
labs(title="Ring Around the Rosy Scatterplot", x="Days After Learning", y="Number of Times Played")
Rosy.Plot
### b. Add a line to the graph where the x-values are the day values but now the y-values are the predicted values which we’ve called yhat. Notice that you have to set the aesthetic y=times for the points and y=yhat for the line. Because each geom_ will accept an aes() command, you can specify the y attribute to be different for different layers of the graph.
Rosy.Plot <- ggplot(Rosy, aes(x=day, y=times)) +
  geom_point(aes(y=times), shape=1) +
  geom_line(aes(y=yhat)) +
  labs(title="Ring Around the Rosy Scatterplot", x="Days After Learning", y="Number of Times Played")
Rosy.Plot
### c. Add a ribbon that represents the confidence region of the regression line. The geom_ribbon() function requires an x, ymin, and ymax columns to be defined. For examples of using geom_ribbon() see the online documentation: http://docs.ggplot2.org/current/geom_ribbon.html.
Rosy.Plot <- ggplot(Rosy, aes(x=day, y=times)) +
  geom_point(aes(y=times), shape=1) +
  geom_line(aes(y=yhat)) +
  geom_ribbon(aes(ymin=lwr, ymax=upr), fill='salmon') +
  labs(title="Ring Around the Rosy Scatterplot", x="Days After Learning", y="Number of Times Played")
Rosy.Plot
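### d. What happened when you added the ribbon? Did some points get hidden? If so, why?
Yes. ggplot2 draws layers in the order they are added, so the ribbon, which was added last and is fully opaque, was painted over the points and the regression line.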
### e. Reorder the statements that created the graph so that the ribbon is on the bottom and the data points are on top and the regression line is visible.
Rosy.Plot <- ggplot(Rosy, aes(x=day, y=times)) +
  geom_ribbon(aes(ymin=lwr, ymax=upr), fill='salmon') +
  geom_point(aes(y=times), shape=1) +
  geom_line(aes(y=yhat)) +
  labs(title="Ring Around the Rosy Scatterplot", x="Days After Learning", y="Number of Times Played")
Rosy.Plot
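Reordering the layers is one way to keep the points visible. Another option, not asked for in the problem, is to make the ribbon semi-transparent with the `alpha` argument. The sketch below keeps the ribbon as the last layer on purpose; the value `alpha=0.4` and the name `Rosy.Alpha` are assumptions chosen for illustration.

```r
library(ggplot2)

# Same data as in the problem statement.
Rosy <- data.frame(
  times = c(15, 11, 9, 12, 5, 2, 3),
  day   = 1:7,
  yhat  = c(14.36, 12.29, 10.21, 8.14, 6.07, 4.00, 1.93),
  lwr   = c(9.54, 8.5, 7.22, 5.47, 3.08, 0.22, -2.89),
  upr   = c(19.18, 16.07, 13.2, 10.82, 9.06, 7.78, 6.75)
)

# The ribbon is drawn last, but alpha < 1 lets the layers beneath show through.
Rosy.Alpha <- ggplot(Rosy, aes(x=day, y=times)) +
  geom_point(aes(y=times), shape=1) +
  geom_line(aes(y=yhat)) +
  geom_ribbon(aes(ymin=lwr, ymax=upr), fill='salmon', alpha=0.4) +
  labs(title="Ring Around the Rosy Scatterplot", x="Days After Learning", y="Number of Times Played")
Rosy.Alpha
```

Transparency changes the apparent color of anything under the ribbon, so placing the ribbon first, as in part e, still gives the cleanest points and line.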
### f. The color of the ribbon fill is ugly. Use Google to find a list of named colors available to ggplot2. For example, I googled “ggplot2 named colors” and found the following link: http://sape.inf.usi.ch/quick-reference/ggplot2/colour. Choose a color for the fill that is pleasing to you.
Rosy.Plot <- ggplot(Rosy, aes(x=day, y=times)) +
  geom_ribbon(aes(ymin=lwr, ymax=upr), fill='rosybrown1') +
  geom_point(aes(y=times), shape=1) +
  geom_line(aes(y=yhat)) +
  labs(title="Ring Around the Rosy Scatterplot", x="Days After Learning", y="Number of Times Played")
Rosy.Plot
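The named colors that ggplot2 accepts are R's built-in color names, so they can also be listed from the console without a web search. A small sketch using base R's `colors()`; the variable names are illustrative.

```r
# All color names known to R; any of these can be passed to fill= or color=.
all_colors <- colors()
length(all_colors)

# Search the list for names containing "rosy".
rosy_colors <- grep("rosy", all_colors, value = TRUE)
rosy_colors

# Check that the names used in this assignment are valid.
c("salmon", "rosybrown1") %in% all_colors
```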
### g. Add labels for the x-axis and y-axis that are appropriate along with a main title.
The labs() call in the plots above already sets the main title and both axis labels, so no further change is needed.
# Problem 3
## We’ll next make some density plots that relate several factors to the birth weight of a child. Because these questions ask you to produce several graphs and evaluate which is better and why, please include each graph and response with each sub-question.
### a. The MASS package contains a dataset called birthwt which contains information about 189 babies and their mothers. In particular there are columns for the mother’s race and smoking status during the pregnancy. Load the birthwt dataset by either using the data() command or loading the MASS library.
### b. Read the help file for the dataset using ?MASS::birthwt. The covariates race and smoke are not stored in a user-friendly manner. For example, smoking status is labeled using a 0 or a 1. Because it is not obvious which should represent that the mother smoked, we’ll add better labels to the race and smoke variables. For more information about dealing with factors and their levels, see the Factors chapter in these notes.
library(tidyverse)
data('birthwt', package='MASS')
birthwt <- birthwt %>% mutate(
race = factor(race, labels=c('White','Black','Other')),
smoke = factor(smoke, labels=c('No Smoke', 'Smoke')))
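To see how `factor()` attaches labels to numeric codes, here is a small sketch on a toy vector. The vector `smoke_codes` is made up for illustration; passing `levels` explicitly is an addition that makes the code-to-label pairing visible rather than relying on the sorted order of the values.

```r
# Hypothetical 0/1 codes, in the same style as birthwt$smoke.
smoke_codes <- c(0, 1, 1, 0)

# levels gives the codes in order; labels gives the name for each code.
smoke_labeled <- factor(smoke_codes,
                        levels = c(0, 1),
                        labels = c('No Smoke', 'Smoke'))

# Cross-tabulate the codes against the labels to confirm the pairing.
table(smoke_codes, smoke_labeled)
```

When `levels` is omitted, as in the code above, the labels are assigned to the sorted unique values, which gives the same pairing for codes 0 and 1.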
ggplot(birthwt, aes(x=bwt)) +
  geom_histogram(bins=30) +  # bins set explicitly, so no stat_bin() message
  labs(title="Birthweight Histogram Plot 1", x="Birthweight (g)")
ggplot(birthwt, aes(x=bwt)) +
  geom_histogram(bins=30) +
  facet_grid(smoke~.) +  # one row of panels per smoking status
  labs(title="Birthweight Histogram Plot 2", x="Birthweight (g)")
ggplot(birthwt, aes(x=bwt)) +
  geom_histogram(bins=30) +
  facet_grid(smoke~race) +  # rows by smoking status, columns by race
  labs(title="Birthweight Histogram Plot 3", x="Birthweight (g)")
ggplot(birthwt, aes(x=bwt, y=after_stat(density))) +  # after_stat() replaces the deprecated ..density..
  geom_histogram(bins=30) +
  facet_grid(smoke~.) +
  labs(title="Birthweight Histogram Plot 4", x="Birthweight (g)", y="Density")
ggplot(birthwt, aes(x=bwt, y=after_stat(density))) +  # density mapped so the y-axis matches its label
  geom_histogram(bins=30, fill='azure2', color='gray81') +
  facet_grid(smoke~.) +
  labs(title="Birthweight Histogram Plot 5", x="Birthweight (g)", y="Density")
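By default geom_histogram() splits the data into 30 bins. A bin width stated in the units of the data is often easier to interpret. The sketch below assumes 250 g bins, a value chosen only for illustration, and reloads the data so that it runs on its own; the name `bwt_hist` is likewise illustrative.

```r
library(ggplot2)

# birthwt ships with the MASS package; bwt is the birth weight in grams.
data('birthwt', package = 'MASS')

# binwidth is given in data units, so each bar covers 250 g.
bwt_hist <- ggplot(birthwt, aes(x = bwt)) +
  geom_histogram(binwidth = 250) +
  labs(title = "Birthweight Histogram with 250 g Bins", x = "Birthweight (g)")
bwt_hist
```

Trying a few widths (for example 100, 250, and 500) shows how much the apparent shape of the distribution depends on this choice.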
# Problem 4