Instructions

Exercises: 1-5 (Pgs. 6-7); 1-2, 5 (Pg. 12); 1-5 (Pgs. 20-21); Open Response

Submission: Submit an electronic document on Sakai. It must be submitted as an HTML file generated in RStudio. All assigned problems are taken from the textbook R for Data Science. You do not need R code to answer every question. If you answer without using R code, delete the code chunk. If the question requires R code, make sure you display the R code. If the question requires a figure, make sure you display the figure. Many of the questions can be answered with a written response, but R code and/or figures may be needed to understand and explain the answer.

Chapter 1 (Pgs. 6-7)

Exercise 1

ggplot(data=mpg)

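I see a blank gray panel. ggplot(data = mpg) creates the plot and attaches the data, but nothing is drawn because no geom layer has been added.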
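One way to check why the panel is empty is to inspect the plot object itself. The following is a minimal sketch, assuming ggplot2 is installed; the object names p and p_points are only illustrative. The first count should be zero, since no geom has been added yet.

```r
library(ggplot2)

# ggplot() alone creates the plot object and attaches the data,
# but it adds no geom layers, so nothing is drawn on the panel.
p <- ggplot(data = mpg)
length(p$layers)           # number of layers in the bare plot

# Adding a geom gives the plot something to draw.
p_points <- p + geom_point(mapping = aes(x = displ, y = hwy))
length(p_points$layers)    # number of layers after adding geom_point()
```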

Exercise 2

dim(mpg)
## [1] 234  11
nrow(mpg)
## [1] 234
ncol(mpg)
## [1] 11

There are 234 rows and 11 columns in the dataset mpg.

Exercise 3

?mpg
unique(mpg$drv)
## [1] "f" "4" "r"

The variable drv is a categorical variable (stored as a character vector) that takes the following values:

  • “f” = front-wheel drive
  • “r” = rear-wheel drive
  • “4” = 4-wheel drive

Exercise 4

ggplot(data=mpg,aes(x=hwy,y=cyl)) +
  geom_point() + 
  xlab("Highway Miles Per Gallon") +
  ylab("Number of Cylinders")

Exercise 5

ggplot(data=mpg,aes(x=class,y=drv)) + 
  geom_point() + 
  xlab("Type of Car") +
  ylab("Type of Drive")

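Scatter plots are not meant to visualize the relationship between two categorical/qualitative variables. Every car with the same class and drv lands on exactly the same point, so the plot only shows which combinations occur, not how many cars fall into each one.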
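A possible alternative for two categorical variables is to count the observations per combination. The sketch below assumes ggplot2 is installed and uses base R's table() together with geom_count(); the name combo_counts is only illustrative, and this is one option among several rather than the textbook's answer.

```r
library(ggplot2)

# Count how many cars fall into each class/drv combination.
combo_counts <- table(mpg$class, mpg$drv)
combo_counts

# geom_count() sizes each point by the number of observations at that
# location, so the overlap hidden by geom_point() becomes visible.
ggplot(data = mpg, aes(x = class, y = drv)) +
  geom_count() +
  xlab("Type of Car") +
  ylab("Type of Drive")
```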

Chapter 1 (Pg. 12)

Exercise 1

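The code below has “color = "blue"” inside of aes(), so ggplot2 treats "blue" as a data value rather than as a color. The code runs, but the points are drawn in the default palette color and a legend appears.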

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color ="blue"))

The code below has “color = "blue"” outside of aes(), so the points are now drawn in blue.

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

To set an aesthetic manually, you have to set it by name as an argument of the geom function, OUTSIDE of aes(). The given code has “color = "blue"” INSIDE of aes(), so "blue" is mapped like a variable with a single level instead of being used as the point color.
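The difference can also be checked numerically by looking at the colour ggplot2 assigns to each point. This is a sketch, assuming ggplot2 is installed; layer_data() is the ggplot2 helper that returns the computed data for one layer, and the object names are illustrative.

```r
library(ggplot2)

p_inside <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
p_outside <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# layer_data() exposes the colour actually assigned to each point.
colour_inside  <- unique(layer_data(p_inside, 1)$colour)
colour_outside <- unique(layer_data(p_outside, 1)$colour)

colour_inside    # a default palette colour chosen by the colour scale
colour_outside   # the literal value passed to geom_point()
```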

Exercise 2

str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4" "a4" ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr  "f" "f" "f" "f" ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...

The categorical variables in mpg are manufacturer, model, type of transmission (trans), type of drive (drv), fuel type (fl), and “type” of car (class). The continuous variables in mpg are engine displacement (displ), year of manufacture (year), number of cylinders (cyl), city miles per gallon (cty), and highway miles per gallon (hwy).
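The same split can be derived from the column types instead of reading the str() output by eye. This is a minimal sketch, assuming ggplot2 is installed; the names categorical_vars and numeric_vars are illustrative.

```r
library(ggplot2)

# Split the column names by storage type: character columns are
# categorical, numeric (double or integer) columns are quantitative.
categorical_vars <- names(mpg)[sapply(mpg, is.character)]
numeric_vars     <- names(mpg)[sapply(mpg, is.numeric)]

categorical_vars
numeric_vars
```

Storage type is only a first approximation: integer columns such as year and cyl take just a few distinct values, so they could arguably be treated as discrete depending on the analysis.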

Exercise 5

The stroke aesthetic modifies the width of the border. It works for shapes that have a border, so shapes 21-25 of R’s built-in shapes. Below is code where the stroke value, or width of the border, is 2.

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), shape = 22, fill = "green", size = 6, stroke = 2, color = "white")

Below is code where I increased the stroke value to 4. You can see how the border of each point became thicker.

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), shape = 22, fill = "green", size = 6, stroke = 4, color = "white")

Chapter 1 (Pgs. 20-21)

Exercise 1

To draw a line chart, you would use geom_line(). To draw a boxplot, you would use geom_boxplot(). To draw a histogram, you would use geom_histogram(). To draw an area chart, you would use geom_area(). Below is code for a line chart.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_line(mapping = aes(color = drv))

Below is code for a boxplot.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_boxplot(mapping = aes(color = drv))

Below is code for a histogram.

ggplot(data = mpg, mapping = aes(x = displ)) + geom_histogram(mapping = aes(color = drv))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Below is code for an area chart.

ggplot(data = mpg, aes(x = displ, fill = drv)) + geom_area(stat = "bin")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 2

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + geom_point() + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

I predicted the graph would look like the one on page 20, except that the legend would say “drv” instead of “class”.

Exercise 3

ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, color = drv), show.legend = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

“show.legend = FALSE” suppresses the legend that ggplot2 would otherwise draw for that layer. When we remove the argument, the legend is included in the plot, because a legend is added automatically whenever a variable is mapped to an aesthetic, to explain the mapping between levels and values. I think it was used earlier in the chapter simply to exclude the legend, probably so that the plot would stay the same size as the plots next to it that have no legend.

Exercise 4

The se argument controls whether the confidence interval is drawn around the smooth line. It is TRUE by default, and se = FALSE removes the band. Below is the code with se = FALSE.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Below is the code with the default, se = TRUE, which draws the confidence band.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Exercise 5

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

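I do not see a difference - the graphs are the same. The first version sets the data and mapping once in ggplot(), and both geoms inherit them. The second version leaves ggplot() empty and repeats the same data and mapping inside each geom, so both layers end up with identical inputs.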
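The visual comparison can be backed up by comparing the data ggplot2 computes for each layer. This is a sketch, assuming ggplot2 is installed; layer_data() returns the computed data frame for a given layer, and the object names are illustrative. Both comparisons are expected to come out TRUE, though I have only reasoned this through rather than treating it as guaranteed for every ggplot2 version.

```r
library(ggplot2)

# Data and mapping set once, globally.
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# Data and mapping repeated in each layer.
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare the computed data layer by layer (1 = points, 2 = smooth).
points_match <- isTRUE(all.equal(layer_data(p_global, 1), layer_data(p_local, 1)))
smooth_match <- isTRUE(all.equal(
  suppressMessages(layer_data(p_global, 2)),
  suppressMessages(layer_data(p_local, 2))
))
c(points = points_match, smooth = smooth_match)
```

The global form is usually preferable when every layer shares the same mapping, because there is only one place to edit.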

Open Response

For this exercise, use the diamonds dataset in the tidyverse. Use ?diamonds to get more information about the dataset.

Step 1: Select 1 numeric variable and 2 categorical variables. Create a graphic using geom_boxplot() and facet_wrap to illustrate the empirical distributions of the sample.

Below is the code for the graphic.

ggplot(data = diamonds, mapping = aes(x = cut, y = carat)) + geom_boxplot(mapping = aes(color = cut)) + facet_wrap(~color)

Step 2: Choose 2 numeric variables and 2 categorical variables and creatively illustrate the relationship between all the variables.

Below is the code for the illustration.

ggplot(data = diamonds, mapping = aes(x = price, y = carat)) + geom_point(mapping = aes(x = price, y = carat, shape = clarity, color = color))
## Warning: Using shapes for an ordinal variable is not advised
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 8. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 5445 rows containing missing values (geom_point).
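The warnings above indicate that the default shape palette covers only six levels, so points in the remaining clarity levels are dropped from the plot. One possible way to keep every observation is to facet by clarity instead of mapping it to shape. The sketch below assumes ggplot2 is installed; the alpha value and the names p_facets and dropped are my own illustrative choices, not part of the assignment.

```r
library(ggplot2)

# Hypothetical alternative: facet by clarity instead of mapping it to shape.
# Every clarity level gets its own panel, so no rows are dropped.
p_facets <- ggplot(data = diamonds, mapping = aes(x = price, y = carat)) +
  geom_point(mapping = aes(color = color), alpha = 0.3) +
  facet_wrap(~clarity)
p_facets

# Count the rows in the last two clarity levels, which a six-value
# shape palette cannot cover.
dropped <- sum(diamonds$clarity %in% tail(levels(diamonds$clarity), 2))
dropped
```

The count of rows in the last two clarity levels appears to account for the 5445 rows reported as removed in the warning.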