In this lab we will learn some basics of ggplot2.
First, let’s load the diamonds dataset. This dataset ships with ggplot2, which is loaded as part of the tidyverse, so make sure that library is loaded first.
library(tidyverse)
data("diamonds") # load a dataset that lives in a package
The str function allows you to learn about the structure
of a dataset.
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
How many rows are in diamonds? How many columns?
nrow(diamonds)
## [1] 53940
ncol(diamonds)
## [1] 10
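As a quick cross-check, dim() returns both counts in one call. This is a minimal sketch that assumes only that ggplot2 is installed, since diamonds is loaded from that package.

```r
library(ggplot2)  # diamonds ships with ggplot2

# dim() returns c(number of rows, number of columns)
dim(diamonds)
## [1] 53940    10
```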
Which variables in diamonds are categorical? Which variables are continuous? (Hint: look at the output for str())
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
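The same answer can be read off programmatically instead of by eye. The sketch below treats factor columns as categorical and numeric columns (num or int) as continuous; the names categorical and continuous are only illustrative.

```r
library(ggplot2)

# Factor columns (here all ordered factors) are the categorical variables
categorical <- names(diamonds)[sapply(diamonds, is.factor)]

# Numeric columns (double or integer) are the continuous variables
continuous <- names(diamonds)[sapply(diamonds, is.numeric)]

categorical
continuous
```

Note that price is stored as an integer, but it behaves like a continuous variable when plotted.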
What does the table variable describe? Read the help for ?diamonds to find out.
Here is a simple scatterplot of price vs carat:
ggplot(data=diamonds, aes(x=carat, y=price))+
geom_point()
What do you observe?
- Price increases as carat increases, but the relationship is not exactly linear.
Run ggplot(data=diamonds). What do you see?
ggplot(data=diamonds)
I see an empty plot because we haven’t told ggplot how we want the diamonds data to be displayed. We need an aesthetic mapping and a geom in order to draw anything.
Make a scatterplot of price vs depth.
ggplot(data=diamonds, aes(x=depth, y=price))+
geom_point()
What happens if you make a scatterplot of cut vs clarity? Why is the plot not useful?
ggplot(data=diamonds, aes(x=clarity, y=cut))+
geom_point()
- cut and clarity are both ordered categorical variables, so we just get one point for every combination (many points are plotted on top of each other). This graphic shows that every combination is represented, and that’s about it.
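One way to make the overlap visible is to count the observations in each combination. The sketch below is an illustration rather than part of the original lab: it tabulates the counts with table() and then uses geom_count(), which sizes each point by the number of observations at that position. The names combo_counts and p_count are illustrative.

```r
library(ggplot2)

# How many diamonds fall into each cut/clarity combination?
combo_counts <- table(diamonds$cut, diamonds$clarity)
combo_counts

# geom_count() draws one point per combination, sized by its count
p_count <- ggplot(data=diamonds, aes(x=clarity, y=cut))+
  geom_count()
p_count
```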
Aesthetic mappings translate
# If using a categorical variable each category will have a color
## INSERT YOUR CODE HERE ##
# if not ordered..
## INSERT YOUR CODE HERE ##
# clarity is an ordered factor, so ggplot uses its color palette for ordinal data
ggplot(data=diamonds, aes(x=carat, y=price, color=clarity))+
geom_point()
# as.character() drops the ordering: the levels are sorted alphabetically and the colors are different
ggplot(data=diamonds, aes(x=carat, y=price, color=as.character(clarity)))+
geom_point()
# If using a numeric variable there will be a color gradient
## INSERT YOUR CODE HERE ##
ggplot(data=diamonds, aes(x=carat, y=price, color=depth))+ # this is the numeric color palette
geom_point()
You can also apply a single color to all the data points by
specifying the color outside of the aesthetic mapping.
## INSERT YOUR CODE HERE ##
ggplot(data=diamonds, aes(x=carat, y=price))+
# Why set color inside the geom? We are not coloring by a variable. aes() maps a variable
# to an aesthetic; a value that does not change with a variable is written outside aes().
geom_point(color="blue")
ggplot(data=diamonds, aes(x=carat, y=price))+
geom_point(color="orchid")
# oops: this maps the constant string "blue" as if it were a variable, so everything is labeled "blue" but drawn in the default coral color
ggplot(data=diamonds, aes(x=carat, y=price, color="blue"))+
geom_point()
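To see the difference between setting and mapping a color, it can help to inspect the data that each layer actually draws. The sketch below uses layer_data() from ggplot2 for that purpose; the names p_fixed and p_mapped are illustrative, and the exact palette color returned for the mapped version may depend on the ggplot2 version and theme.

```r
library(ggplot2)

# Color set outside aes(): a literal color applied to every point
p_fixed <- ggplot(data=diamonds, aes(x=carat, y=price))+
  geom_point(color="blue")

# Color "set" inside aes(): the string "blue" is treated as a one-level variable
p_mapped <- ggplot(data=diamonds, aes(x=carat, y=price, color="blue"))+
  geom_point()

# layer_data() returns the computed data for a layer, including the colour column
unique(layer_data(p_fixed)$colour)   # the literal colour given to the geom
unique(layer_data(p_mapped)$colour)  # a colour picked by the colour scale
```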
## INSERT YOUR CODE HERE ##
ggplot(data=diamonds, aes(x=carat, y=price))+
geom_point(alpha=.05) # set every point to the same transparency (alpha ranges from 0 to 1)
# mapping transparency to clarity: avoid mapping transparency to a variable like this, it is hard to read
ggplot(data=diamonds, aes(x=carat, y=price, alpha=clarity))+
geom_point()
## INSERT YOUR CODE HERE ##
# the shape palette has only 6 shapes, and you shouldn’t use more than 3 (4 is pushing it): shapes are hard to remember and read
ggplot(data=diamonds, aes(x=carat, y=price, shape=clarity))+
geom_point()
## Warning: Using shapes for an ordinal variable is not advised
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 8. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 5445 rows containing missing values (geom_point).
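The warnings above suggest specifying shapes manually if all eight levels must be shown. The sketch below first checks that the 5445 removed rows match the two clarity levels beyond the six default shapes (VVS1 and IF), then assigns eight shapes with scale_shape_manual(). The shape codes 0:7 are an arbitrary choice for illustration, and p_shapes is an illustrative name.

```r
library(ggplot2)

# Rows in the last two clarity levels, which the default palette cannot draw
sum(diamonds$clarity %in% c("VVS1", "IF"))
## [1] 5445

# Arbitrary choice for illustration: shape codes 0 to 7, one per clarity level
p_shapes <- ggplot(data=diamonds, aes(x=carat, y=price, shape=clarity))+
  geom_point()+
  scale_shape_manual(values=0:7)
p_shapes
```

Eight shapes are still hard to tell apart, so this shows how to avoid dropped rows rather than how to get a readable plot.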
## INSERT YOUR CODE HERE ##
# this is ugly and hard to read for clarity; size can be useful for numeric variables, for example to show population
ggplot(data=diamonds, aes(x=carat, y=price, size=clarity))+
geom_point()
ggplot(diamonds, aes(carat, price, color="blue"))+ #shown above
geom_point()
Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
What happens if you map the same variable to multiple aesthetics?
What happens if you map an aesthetic to something other than a
variable name, like aes(colour = carat < 3)? Note,
you’ll also need to specify x and y.
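As a starting point for the last two questions, here is a hedged sketch rather than a full answer. The first plot maps colour to the logical expression carat < 3, which ggplot evaluates for every row; the second maps the same variable (carat) to both colour and size. The names p_logical and p_double are illustrative.

```r
library(ggplot2)

# carat < 3 is evaluated row by row, so colour is mapped to TRUE/FALSE
p_logical <- ggplot(data=diamonds, aes(x=carat, y=price, colour=carat < 3))+
  geom_point()
p_logical

# The same variable mapped to two aesthetics: the encoding is redundant
p_double <- ggplot(data=diamonds, aes(x=carat, y=price, colour=carat, size=carat))+
  geom_point()
p_double
```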
Sometimes it’s useful to look at subgroups within our data. We can do this with facets.
facet_wrap()
You can specify a single discrete variable to facet by, and R organizes the plots to fill the space.
## INSERT YOUR CODE HERE ##
ggplot(data=diamonds, aes(x=carat, y=price))+
geom_point() +
facet_wrap(.~cut) # only one faceting variable, so the dot is a placeholder on the other side of the formula
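The layout that facet_wrap() chooses can also be controlled. In this small sketch, ~cut is used in place of .~cut (the two are equivalent in facet_wrap()), and the nrow argument fixes the number of rows; the value 2 and the name p_wrap are illustrative choices.

```r
library(ggplot2)

# nrow=2 asks for the five cut panels to be arranged in two rows
p_wrap <- ggplot(data=diamonds, aes(x=carat, y=price))+
  geom_point()+
  facet_wrap(~cut, nrow=2)
p_wrap
```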
facet_grid()
You can also create a grid of graphs. In the formula passed to the function, the variable before the ~ specifies the rows and the variable after it specifies the columns.
## Grid
## INSERT YOUR CODE HERE ##
ggplot(data=diamonds, aes(x=carat, y=price))+
geom_point() +
facet_grid(color~cut) # rows ~ columns; a grid of about 3x2 is probably enough for one visual
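It can be worth checking how many panels a grid will produce before drawing it. The sketch below multiplies the number of levels of the row and column variables, then redraws the grid with labeller=label_both so each strip shows the variable name as well as the level. The name p_grid is illustrative.

```r
library(ggplot2)

# Panels in the grid = levels of the row variable x levels of the column variable
nlevels(diamonds$color) * nlevels(diamonds$cut)
## [1] 35

# label_both prints "variable: value" in each facet strip
p_grid <- ggplot(data=diamonds, aes(x=carat, y=price))+
  geom_point()+
  facet_grid(color~cut, labeller=label_both)
p_grid
```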
If you prefer to not facet in the rows or columns dimension, use a .
instead of a variable name,
e.g. + facet_grid(. ~ color).
What happens if you facet on a continuous variable?
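As a hint for this question: faceting creates one panel per distinct value, so it helps to count the distinct values first. The sketch below does that for depth and then shows one possible workaround, binning the variable with cut_number() from ggplot2 before faceting. The choice of four bins and the names d, depth_bin, and p_binned are assumptions for illustration.

```r
library(ggplot2)

# Faceting on depth directly would create this many panels
length(unique(diamonds$depth))

# Assumption for illustration: bin depth into 4 groups of roughly equal size
d <- diamonds
d$depth_bin <- cut_number(d$depth, 4)

p_binned <- ggplot(data=d, aes(x=carat, y=price))+
  geom_point()+
  facet_wrap(~depth_bin)
p_binned
```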
What plots does the following code make? What does .
do?
ggplot(diamonds, aes(carat, price))+
geom_point()+
facet_grid(color~.)
ggplot(diamonds, aes(carat, price))+
geom_point()+
facet_grid(.~cut)
# with only one faceting variable, the dot is a placeholder for the row or column dimension
When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?