This assignment requires you to read some sections of the text “R for Data Science” and answer the questions at the end of each section. Use this document to record your answers. Make every plot that the questions ask for using ggplot2.

When completed, name your final output .html file YourName_ANLY512-0-2018.html and upload it to the “Visualization Coding Exercise #2” assignment on Moodle. This assignment is worth 30 points. Questions 1 through 10 are worth 1 point each. Questions 11 through 15 are worth 4 points each.

To get a first feel for ggplot2, let’s run some basic ggplot2 commands. Together, they build a plot of the mtcars dataset, which contains information about 32 cars from the 1974 Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables. Questions 1-7 are based on the mtcars data frame.

  1. Load the ggplot2 package using the library() command.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
  2. Use str() to explore the structure of the mtcars dataset.
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
  3. Execute the following code in the code chunk below. Describe what ggplot is doing with the data.
library(ggplot2)
ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point()

The plot from #3 isn’t really satisfying. Although cyl (the number of cylinders) is categorical, it is classified as numeric in mtcars. You’ll have to explicitly tell ggplot2 that cyl is a categorical variable.

  4. Change the ggplot() code from #3 by wrapping factor() around cyl.
library(ggplot2)
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point()
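
The effect of factor() can be checked directly on the data. The following is a minimal sketch, assuming only base R and the built-in mtcars data frame; the name cyl_factor is introduced here for illustration and is not part of the assignment.

```r
# cyl is stored as a numeric column in mtcars
class(mtcars$cyl)

# Wrapping it in factor() turns the distinct values into discrete levels
cyl_factor <- factor(mtcars$cyl)  # cyl_factor is an illustrative name
class(cyl_factor)
levels(cyl_factor)
nlevels(cyl_factor)
```

With a factor on the x-axis, ggplot2 uses a discrete scale, so only the observed cylinder counts appear as axis positions.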

We will use several datasets throughout the class to showcase the concepts discussed in the weekly lectures. In the previous exercises, you already got to know mtcars. Let’s dive a little deeper to explore the three main layers in the grammar of graphics: data, aesthetics, and geom layers.

The mtcars dataset contains information about 32 cars from the 1974 Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables.

Think about how the examples and concepts we discuss throughout the grammar of graphics lectures can be applied to your own datasets!

  5. ggplot2 has already been loaded for you in the code chunk below. Take a look at the first command. It plots mpg (miles per gallon) against wt (weight, in thousands of pounds). You don’t have to change anything about this command. Run the ggplot code to examine the graph that is produced.
#first ggplot call
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

In the second call of ggplot() change the color argument in aes(). The color should be dependent on the displacement of the car engine, found in disp.

library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg, color = disp)) +
  geom_point()

In the third call of ggplot() change the size argument in aes(). The size should be dependent on the displacement of the car engine, found in disp.

#third ggplot call
# size is mapped to the disp column
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg, size = disp)) +
  geom_point()

  6. After running the code in the second and third calls to ggplot() in #5, were the legends for the color and size scales automatically generated? State Yes or No.

Yes. ggplot2 automatically generated the legends for both the color scale and the size scale.

  7. In the previous exercise you saw that disp can be mapped onto a color gradient or onto a continuous size scale. Another argument of aes() is the shape of the points. There is a finite number of shapes that ggplot() can automatically assign to the points. However, if you try to map disp to shape, as in the code chunk below, you will receive an error. Run the code and examine the error that is produced.
library(ggplot2)
# mapping the continuous variable disp to shape triggers the error
ggplot(mtcars, aes(x = wt, y = mpg, shape = disp)) +
  geom_point()

The code in the code chunk above gives an error. What does it mean?

  a. shape is not a defined argument

  b. shape only makes sense with categorical data and disp is continuous

  c. shape only makes sense with continuous data and disp is categorical

  d. shape is not a variable in your data frame

Type one and only one letter as your answer to #7.

Answer is b
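
The contrast between the two cases can be sketched as follows. This is an illustrative example, not part of the assignment: the names p_bad, p_ok, and build_failed are hypothetical, and ggplot_build() is used only to trigger the scale lookup without drawing anything.

```r
library(ggplot2)

# Continuous variable mapped to shape: building the plot is expected to fail
p_bad <- ggplot(mtcars, aes(x = wt, y = mpg, shape = disp)) +
  geom_point()
build_failed <- tryCatch({
  ggplot_build(p_bad)
  FALSE
}, error = function(e) TRUE)
build_failed

# Categorical variable mapped to shape: one shape per level of cyl
p_ok <- ggplot(mtcars, aes(x = wt, y = mpg, shape = factor(cyl))) +
  geom_point()
p_ok
```

Because cyl has only three distinct values, wrapping it in factor() gives ggplot2 a small, discrete set of categories to assign shapes to.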

Questions 8-15 use the diamonds_sample data frame.

The diamonds data frame contains information on the prices and various attributes of almost 54,000 diamonds (53,940 rows). This data frame ships with the ggplot2 package. Among the variables included are carat (a measurement of the weight of the diamond) and price.

You will be working with a subset of this data frame, named diamonds_sample. Run the following code chunk to create the diamonds_sample data frame. You will use the diamonds_sample data frame to answer the remaining questions in this assignment. Do not use the full diamonds data frame.

diamonds_sample <- diamonds[sample(nrow(diamonds), 1000, replace = FALSE), ]
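
Because sample() draws rows at random, the chunk above yields a different subset each time the document is knit. If reproducible output is wanted, one option is to fix the random seed first. The sketch below is an assumption, not part of the assignment: the seed value 512 and the names sample_a and sample_b are hypothetical.

```r
library(ggplot2)

set.seed(512)  # hypothetical seed; any fixed integer works
sample_a <- diamonds[sample(nrow(diamonds), 1000, replace = FALSE), ]

set.seed(512)  # resetting the same seed reproduces the same row indices
sample_b <- diamonds[sample(nrow(diamonds), 1000, replace = FALSE), ]

nrow(sample_a)
identical(sample_a, sample_b)
```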

Here you will use two common geom layer functions: geom_point() and geom_smooth(). We already discussed in class how these layers are added using the + operator.

  8. Use str() to explore the structure of the diamonds_sample data frame.
str(diamonds_sample)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  10 variables:
##  $ carat  : num  0.67 0.62 1.2 1.59 0.7 0.31 0.57 1.14 0.34 1.51 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 2 3 5 5 2 5 3 4 3 5 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 7 5 6 2 2 5 3 3 6 4 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 5 4 2 2 3 4 2 3 6 3 ...
##  $ depth  : num  64.2 61.8 60.5 62.3 63.6 61.5 63.5 62.5 61.3 61.6 ...
##  $ table  : num  55.6 57 58 55 62 55.2 56 59 57 56 ...
##  $ price  : int  1581 1734 5050 11251 2386 513 1397 5228 596 8283 ...
##  $ x      : num  5.54 5.47 6.92 7.52 5.57 4.36 5.28 6.67 4.49 7.34 ...
##  $ y      : num  5.57 5.5 6.83 7.48 5.6 4.39 5.25 6.65 4.52 7.25 ...
##  $ z      : num  3.57 3.39 4.16 4.67 3.55 2.7 3.34 4.16 2.76 4.5 ...
  9. Use the + operator to add geom_point() to the ggplot() command. This will tell ggplot2 to draw points on the plot.
library(ggplot2)
ggplot(diamonds_sample, aes(x = carat, y = price)) +
  geom_point()

  10. This problem is a continuation of #9. Use the + operator to add geom_point() and geom_smooth() to the ggplot() command. These just stack on each other! geom_smooth() will draw a smoothed line over the points.
library(ggplot2)
ggplot(diamonds_sample, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'gam'

  11. In #10, you built a scatter plot of the diamonds_sample dataset, with carat on the x-axis and price on the y-axis. geom_smooth() is used to add a smooth line. Copy and paste the code that created the scatter plot in #10, but show only the smooth line, no points.
library(ggplot2)
ggplot(diamonds_sample, aes(x = carat, y = price)) +
  geom_smooth()
## `geom_smooth()` using method = 'gam'

  12. This problem is a continuation of #11. Show only the smooth line, but color according to clarity by placing the argument color = clarity in the aes() function of your ggplot() call.
library(ggplot2)
ggplot(diamonds_sample, aes(x = carat, y = price, color = clarity)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess'
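
The messages printed in #10 through #12 show geom_smooth() picking a smoothing method on its own. According to the ggplot2 documentation, the default uses loess when the largest group has fewer than 1,000 observations and a GAM otherwise, which is why the method changes once the data are grouped by clarity. The sketch below names the method explicitly instead. It is an illustration under assumptions: the seed and the names p_loess and p_lm are hypothetical, and the sample is recreated only so the block runs on its own.

```r
library(ggplot2)

set.seed(512)  # hypothetical seed so the sketch is repeatable
diamonds_sample <- diamonds[sample(nrow(diamonds), 1000, replace = FALSE), ]

# Naming the method and formula avoids relying on the automatic choice
p_loess <- ggplot(diamonds_sample, aes(x = carat, y = price)) +
  geom_smooth(method = "loess", formula = y ~ x)

# A straight-line fit for comparison
p_lm <- ggplot(diamonds_sample, aes(x = carat, y = price)) +
  geom_smooth(method = "lm", formula = y ~ x)

p_loess
p_lm
```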

  13. This problem is a continuation of #12. You are going to construct a graph with translucent colored points.

Copy the ggplot() command from #12 (with clarity mapped to color). Remove the smooth layer. Add the points layer back in. Set alpha = 0.4 inside geom_point(); this will make the points 40% opaque (that is, 60% transparent).

library(ggplot2)
ggplot(diamonds_sample, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.4)
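
In #13, alpha is set to a fixed value outside aes(), while color is mapped to a variable inside aes(). The difference can be sketched as follows. The seed, the names p_set and p_map, and the choice of depth as the mapped variable are assumptions made for illustration; layer_data() is used only to inspect the values ggplot2 computes for the layer.

```r
library(ggplot2)

set.seed(512)  # hypothetical seed so the sketch is repeatable
diamonds_sample <- diamonds[sample(nrow(diamonds), 1000, replace = FALSE), ]

# Setting: alpha is a constant outside aes(), so every point gets alpha 0.4
p_set <- ggplot(diamonds_sample, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.4)

# Mapping: alpha inside aes() varies with the data and gets its own legend
p_map <- ggplot(diamonds_sample, aes(x = carat, y = price, alpha = depth)) +
  geom_point()

unique(layer_data(p_set)$alpha)
range(layer_data(p_map)$alpha)
```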

In #14 you are going to explore some of the different grammatical elements of ggplot2. You will start by creating a ggplot object from the diamonds_sample dataset. Next, you will add layers onto this object to build informative graphics.

  14. This problem can be broken into three parts.

    a. Define the data (diamonds_sample) and aesthetics layers. Map carat to the x-axis and price to the y-axis. Assign the result to an object named dia_plot.

    b. Using +, add a geom_point() layer (with no arguments) to the dia_plot object. This can be on a single line or across multiple lines.

    c. You can also call aes() within the geom_point() function. Map clarity to the color argument in this way.

library(ggplot2)

#part a
# data and aesthetics layers only; this draws empty axes
dia_plot <- ggplot(diamonds_sample, aes(x = carat, y = price))
dia_plot

#part b
# add a points layer to the existing object
dia_plot + geom_point()

#part c
# map clarity to color inside the geom layer
dia_plot + geom_point(aes(color = clarity))
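
Where aes() is called affects which layers see the mapping. A mapping in ggplot() is inherited by every layer, while a mapping inside a geom applies to that layer only. The following sketch illustrates this under assumptions: the seed and the names base_plot, p_layer, and p_global are hypothetical, and the loess method is named explicitly only to keep the output free of messages.

```r
library(ggplot2)

set.seed(512)  # hypothetical seed so the sketch is repeatable
diamonds_sample <- diamonds[sample(nrow(diamonds), 1000, replace = FALSE), ]

base_plot <- ggplot(diamonds_sample, aes(x = carat, y = price))

# Layer-level mapping: only the points are colored by clarity,
# so geom_smooth() fits a single line through all the data
p_layer <- base_plot +
  geom_point(aes(color = clarity)) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

# Plot-level mapping: every layer inherits color = clarity,
# so geom_smooth() fits one line per clarity level
p_global <- ggplot(diamonds_sample,
                   aes(x = carat, y = price, color = clarity)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

# Number of separate smooth lines in each plot (layer 2 is the smooth)
length(unique(layer_data(p_layer, 2)$group))
length(unique(layer_data(p_global, 2)$group))
```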

  15. This problem is a continuation of #14. You have created an object entitled dia_plot. This problem can be broken into three parts.

    a. Update dia_plot so that it contains all the functions to make a scatter plot by using geom_point() for the geom layer. Set alpha = 0.2.

    b. Using +, plot the dia_plot object with a geom_smooth() layer on top. You do not want any error shading, which can be achieved by setting se = FALSE in geom_smooth().

    c. Modify the geom_smooth() function from part b so that it contains aes() and map clarity to the col argument.

#part a
dia_plot <- ggplot(diamonds_sample, aes(x = carat, y = price))

dia_plot <- dia_plot + geom_point(alpha=0.2)

#part b
# FALSE is spelled out because F is an ordinary variable that can be reassigned
dia_plot + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam'

#part c
# one smooth line per clarity level, without error shading
dia_plot + geom_smooth(aes(col = clarity), se = FALSE)
## `geom_smooth()` using method = 'loess'