In this lab we will learn some of the basics of ggplot2.

Part I: Learning about your data

Step 1: Load in the data

First let’s load the diamonds dataset. It ships with ggplot2, which is loaded as part of the tidyverse, so make sure that library is attached first.

library(tidyverse)
data("diamonds") # load a dataset that lives in a package

Step 2: Learn a little bit about this data

The str function allows you to learn about the structure of a dataset.

str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
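As a complement to str(), here is a minimal sketch of a few other base R ways to inspect the same data. It assumes only that ggplot2 is installed; the name diamonds_dim is made up for illustration.

```r
library(ggplot2)

# dim() returns the number of rows and columns in one call
diamonds_dim <- dim(diamonds)
diamonds_dim

# head() prints the first six rows
head(diamonds)

# summary() gives a per-column summary: quantiles for numeric columns,
# counts per level for factors
summary(diamonds)
```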

It’s Your Turn! Learning by doing

  1. How many rows are in diamonds? How many columns?
nrow(diamonds)
## [1] 53940
ncol(diamonds)
## [1] 10
  2. Which variables in diamonds are categorical? Which variables are continuous? (Hint: look at the output for str())
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
  3. What does the table variable describe? Read the help for ?diamonds to find out.
  • width of top of diamond relative to widest point
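One way to check the answer to the question about categorical and continuous variables is to ask R which columns are factors. This is only a sketch: it treats every factor column as categorical and everything else (including the integer price column) as continuous, and the names is_categorical, categorical_vars and continuous_vars are made up for illustration.

```r
library(ggplot2)

# TRUE for each column stored as a factor (ordered factors included)
is_categorical <- vapply(diamonds, is.factor, logical(1))

categorical_vars <- names(diamonds)[is_categorical]
continuous_vars  <- names(diamonds)[!is_categorical]

categorical_vars
continuous_vars
```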

Part II: Start with a basic scatterplot

Here is a simple scatterplot of price vs carat:

ggplot(data=diamonds, aes(x=carat, y=price))+
  geom_point()

What do you observe? - Price increases as carat increases, but the relationship is not exactly linear.

It’s Your Turn! Learning by doing

  1. Run ggplot(data=diamonds). What do you see?
ggplot(data=diamonds)

I see an empty gray panel because we haven’t told ggplot how we want the diamonds data to be displayed. We need an aesthetic mapping and a geom before anything is drawn.

  2. Make a scatterplot of price vs depth.
ggplot(data=diamonds, aes(x=depth, y=price))+
  geom_point()

  3. What happens if you make a scatterplot of cut vs clarity? Why is the plot not useful?
ggplot(data=diamonds, aes(x=clarity, y=cut))+
  geom_point()

- cut and clarity are both ordered categorical variables, so we just get one visible point for every combination (many dots plotted on top of each other). This graphic shows that every combination is represented in the data, and that’s about it.
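If the goal is to see how many diamonds fall in each cut and clarity combination, the overplotted scatterplot can be swapped for a count. The sketch below is one possible approach, not part of the lab itself: it uses base R's table() and ggplot2's geom_count(), and the name combo_counts is made up for illustration.

```r
library(ggplot2)

# Count how many diamonds fall in each cut/clarity combination
combo_counts <- table(cut = diamonds$cut, clarity = diamonds$clarity)
combo_counts

# geom_count() draws one point per combination, sized by the number of rows
ggplot(data = diamonds, aes(x = clarity, y = cut)) +
  geom_count()
```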

Part III: Aesthetic Mappings

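Aesthetic mappings translate variables in the data into visual properties of the plot, such as color, transparency, shape, and size.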

A) Color

Color Gradient for Ordinal Data

# An ordered factor (ordinal variable) gets a sequential color palette
## INSERT YOUR CODE HERE ##
ggplot(data=diamonds, aes(x=carat, y=price, color=clarity))+ # clarity is an ordered factor, so ggplot uses its color palette for ordinal data
  geom_point()

Unique colors for Nominal Data

# If the categories are not ordered, each category gets its own distinct color
## INSERT YOUR CODE HERE ##
ggplot(data=diamonds, aes(x=carat, y=price, color=as.character(clarity)))+ # as.character() drops the ordering: the categories are sorted alphabetically and the colors are different
  geom_point()

Saturation Gradient for Numeric

# If using a numeric variable there will be a color gradient 
## INSERT YOUR CODE HERE ##
ggplot(data=diamonds, aes(x=carat, y=price, color=depth))+ # this is the numeric color palette
  geom_point()

You can also apply a single color to all the data points by specifying the color outside of the aesthetic mapping.

## INSERT YOUR CODE HERE ##
ggplot(data=diamonds, aes(x=carat, y=price))+
  geom_point(color="blue") # why set color in the geom now? We are not coloring by a variable. Inside aes() we map a variable to an aesthetic; if the value does not change as a function of a variable, it is written outside of aes().

ggplot(data=diamonds, aes(x=carat, y=price))+
  geom_point(color="orchid")

#oops
ggplot(data=diamonds, aes(x=carat, y=price, color="blue"))+ # this creates a one-level variable called "blue": everything is labeled blue in the legend but drawn in coral
  geom_point()
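To see the difference between setting and mapping a color, one option is to look at the data ggplot actually draws. The sketch below uses layer_data() from ggplot2 to pull out the colour column of each plot; the object names are made up for illustration.

```r
library(ggplot2)

# Color set as a constant in the geom
p_fixed <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue")

# Color "mapped" to the string "blue" inside aes()
p_mapped <- ggplot(data = diamonds, aes(x = carat, y = price, color = "blue")) +
  geom_point()

# layer_data() returns the data a layer draws, including the final colours
fixed_colours  <- unique(layer_data(p_fixed)$colour)
mapped_colours <- unique(layer_data(p_mapped)$colour)

fixed_colours   # the literal colour that was set
mapped_colours  # a single colour picked by the default discrete scale
```

In the mapped version the string is treated like a categorical variable with one level, so the default discrete color scale chooses the color, not the word "blue".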

B) Transparency

## INSERT YOUR CODE HERE ##
ggplot(data=diamonds, aes(x=carat, y=price))+
  geom_point(alpha=.05) # set every point to the same transparency (alpha ranges from 0 to 1)

ggplot(data=diamonds, aes(x=carat, y=price, alpha=clarity))+ # mapping transparency to clarity: don't do this, it's hard to read. Avoid mapping transparency to variables
  geom_point()

C) Shape

## INSERT YOUR CODE HERE ##
ggplot(data=diamonds, aes(x=carat, y=price, shape=clarity))+ # the default palette has only 6 shapes; you shouldn't use more than 3 (4 is pushing it), because shapes are hard to remember and read
  geom_point()
## Warning: Using shapes for an ordinal variable is not advised
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 8. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 5445 rows containing missing values (geom_point).
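The 5445 removed rows in the last warning are the diamonds in the 7th and 8th clarity levels (VVS1 and IF), which get no shape from the 6-shape default palette. The sketch below checks that count and then follows the warning's suggestion of specifying shapes manually with scale_shape_manual(). The shape codes 0:7 are an arbitrary choice for illustration, and the name dropped_rows is made up.

```r
library(ggplot2)

# VVS1 and IF are the 7th and 8th clarity levels, beyond the 6 default shapes
dropped_rows <- sum(diamonds$clarity %in% c("VVS1", "IF"))
dropped_rows

# Supplying one shape code per level keeps every row in the plot
ggplot(data = diamonds, aes(x = carat, y = price, shape = clarity)) +
  geom_point() +
  scale_shape_manual(values = 0:7)
```

Eight shapes are still hard to tell apart, so this only removes the missing-value problem and does not make the plot easier to read.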

D) Size

## INSERT YOUR CODE HERE ##
ggplot(data=diamonds, aes(x=carat, y=price, size=clarity))+ # this is ugly and hard to read; size can be useful for numeric variables, for example to show population
  geom_point()

It’s Your Turn! Learning by doing

  1. What’s gone wrong with this code? Why are the points not blue?
ggplot(diamonds, aes(carat, price, color="blue"))+ #shown above 
  geom_point()

  2. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

  3. What happens if you map the same variable to multiple aesthetics?

  4. What happens if you map an aesthetic to something other than a variable name, like aes(colour = carat < 3)? Note, you’ll also need to specify x and y.
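As a starting point for the question about aes(colour = carat < 3), here is a minimal sketch. It assumes the same carat and price axes used throughout this lab; the names p_logical and n_colours are made up for illustration.

```r
library(ggplot2)

# The expression carat < 3 is evaluated row by row, giving TRUE or FALSE
p_logical <- ggplot(data = diamonds, aes(x = carat, y = price, colour = carat < 3)) +
  geom_point()

# Two groups (TRUE and FALSE) means two colours in the drawn data
n_colours <- length(unique(layer_data(p_logical)$colour))
n_colours

p_logical
```

The result of the expression behaves like a temporary two-level categorical variable, so the legend shows TRUE and FALSE.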

Part IV: Facets

Sometimes it’s useful to look at subgroups within our data. We can do this with facets.

facet_wrap()

You can specify a single discrete variable to facet by, and R organizes the plots to fill the space.

## INSERT YOUR CODE HERE ##
ggplot(data=diamonds, aes(x=carat, y=price))+
  geom_point() +
  facet_wrap(.~cut) # there is only one faceting variable, which wraps around on itself, so the dot is just a placeholder on the row side of the formula
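The layout that facet_wrap() picks can also be controlled. The sketch below writes the formula without the placeholder dot and uses the nrow argument of facet_wrap() to force all panels onto one row; the single-row layout is just an example choice, and the names p_wrap and n_panels are made up.

```r
library(ggplot2)

p_wrap <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point() +
  facet_wrap(~cut, nrow = 1)  # same faceting variable as .~cut, laid out on one row

# One panel per level of cut
n_panels <- length(unique(layer_data(p_wrap)$PANEL))
n_panels

p_wrap
```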

facet_grid()

You can also create a grid of graphs. In the formula, the variable to the left of ~ specifies the rows and the variable to the right specifies the columns.

## Grid
## INSERT YOUR CODE HERE ##
ggplot(data=diamonds, aes(x=carat, y=price))+
  geom_point() +
  facet_grid(color~cut) # specify rows ~ columns; a grid of about 3x2 panels is probably enough for one visual

If you prefer to not facet in the rows or columns dimension, use a . instead of a variable name, e.g. + facet_grid(. ~ color).
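facet_grid() also accepts its rows and columns as named arguments wrapped in vars(), which some readers find easier to read than the formula. The sketch below builds the same grid both ways and counts the panels; the object names are made up for illustration.

```r
library(ggplot2)

base_plot <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point()

# Formula interface: rows ~ columns
p_formula <- base_plot + facet_grid(color ~ cut)

# Equivalent interface with explicit rows and cols arguments
p_vars <- base_plot + facet_grid(rows = vars(color), cols = vars(cut))

# 7 colors x 5 cuts gives 35 panels either way
panels_formula <- length(unique(layer_data(p_formula)$PANEL))
panels_vars    <- length(unique(layer_data(p_vars)$PANEL))

panels_formula
panels_vars
```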

It’s Your Turn! Learning by doing

  1. What happens if you facet on a continuous variable?

  2. What plots does the following code make? What does . do?

ggplot(diamonds, aes(carat, price))+
  geom_point()+
  facet_grid(color~.)

ggplot(diamonds, aes(carat, price))+
  geom_point()+
  facet_grid(.~cut)

# there is only one faceting variable, so the dot is a placeholder for the unused dimension: color~. facets into rows only, .~cut facets into columns only
  3. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?
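Related to the first question in this set: a continuous variable usually needs to be binned before it is useful for faceting. The sketch below is one possible approach, not part of the lab itself. It uses cut_number() from ggplot2 to split carat into 4 groups with roughly equal numbers of diamonds; the choice of 4 bins, the depth vs price axes, and the names diamonds_binned and carat_bin are all assumptions made for illustration.

```r
library(ggplot2)

# Copy the data and add a binned version of carat
diamonds_binned <- diamonds
diamonds_binned$carat_bin <- cut_number(diamonds_binned$carat, n = 4)

# Each level is an interval of carat values
levels(diamonds_binned$carat_bin)

# Facet on the binned variable instead of the raw continuous one
ggplot(data = diamonds_binned, aes(x = depth, y = price)) +
  geom_point() +
  facet_wrap(~carat_bin)
```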