Instructions

Exercises: 1-5 (Pgs. 6-7); 1-2, 5 (Pg. 12); 1-5 (Pgs. 20-21); Open Response

Submission: Submit via an electronic document on Sakai. Must be submitted as an HTML file generated in RStudio. All assigned problems are chosen according to the textbook R for Data Science. You do not need R code to answer every question. If you answer without using R code, delete the code chunk. If the question requires R code, make sure you display R code. If the question requires a figure, make sure you display a figure. Many of the questions can be answered in a written response, but require R code and/or figures for understanding and explaining.

Chapter 1 (Pgs. 6-7)

Exercise 1

ggplot(data=mpg)

The output is an empty plot: a blank gray panel with no axes, points, or labels. ggplot(data = mpg) only creates the plot object; nothing is drawn until a geom layer with aesthetic mappings is added.
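As a rough check of why the plot is blank, the sketch below counts the layers in the plot object. It assumes ggplot2 is loaded and that the layers can be read from the plot object with $layers; the object names empty_plot and point_plot are my own.

```r
library(ggplot2)

# A plot object with data but no geom layers draws only an empty panel.
empty_plot <- ggplot(data = mpg)
length(empty_plot$layers)

# Adding a geom (with x and y mappings) gives ggplot2 something to draw.
point_plot <- empty_plot + geom_point(mapping = aes(x = displ, y = hwy))
length(point_plot$layers)
```

The first count should be zero and the second should be one, which matches the blank output from the bare ggplot() call.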

Exercise 2

dim(mpg)
## [1] 234  11
nrow(mpg)
## [1] 234
ncol(mpg)
## [1] 11

There are 234 rows and 11 columns in the dataset mpg.

Exercise 3

?mpg
unique(mpg$drv)
## [1] "f" "4" "r"

The variable drv is a categorical variable (stored as character) that describes the type of drive train. It takes the following values:

  • “f” = front-wheel drive
  • “r” = rear-wheel drive
  • “4” = 4-wheel drive

Exercise 4

ggplot(data=mpg,aes(x=hwy,y=cyl)) +
  geom_point() + 
  xlab("Highway Miles Per Gallon") +
  ylab("Number of Cylinders")

Exercise 5

ggplot(data=mpg,aes(x=class,y=drv)) + 
  geom_point() + 
  xlab("Type of Car") +
  ylab("Type of Drive")

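The plot is not useful. Both class and drv are categorical, so the points can only fall on a small grid of class/drv combinations, and many cars are drawn on top of each other. The plot shows which combinations occur but not how many cars fall in each. Scatter plots are not meant to visualize the relationship between two categorical/qualitative variables.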
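One possible way to see the overlap, sketched here as an assumption of mine rather than something the exercise asks for: count the cars in each class/drv combination with dplyr::count(), and replace geom_point() with geom_count(), which sizes each point by the number of observations at that position. The names combos and count_plot are my own.

```r
library(ggplot2)
library(dplyr)

# Count how many cars share each class/drv combination.
combos <- count(mpg, class, drv)
combos

# geom_count() sizes each point by the number of observations at that position.
count_plot <- ggplot(data = mpg, aes(x = class, y = drv)) +
  geom_count() +
  xlab("Type of Car") +
  ylab("Type of Drive")
count_plot
```

There are far fewer combinations than cars, which is why the plain scatter plot hides most of the data.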

Chapter 1 (Pg. 12)

Exercise 1

ggplot(data = mpg) +
 geom_point(
 mapping = aes(x = displ, y = hwy, color = "blue")
 )

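The points are not blue because color = "blue" is inside aes(). Inside aes(), ggplot2 treats "blue" as data: a categorical variable with the single value "blue". It maps that value to the first color of the default palette (a reddish color) and adds a legend labeled "blue". To make the points blue, the color argument needs to be outside aes(), as an argument of geom_point(), because it is a fixed value and not mapped to the data.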
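A minimal sketch of the corrected call, with color moved outside aes(). The object name blue_plot is my own.

```r
library(ggplot2)

# color is set outside aes(), so it is a fixed property of the layer
# and is not treated as a variable in the data.
blue_plot <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    color = "blue"
  )
blue_plot
```

With this version the points are drawn in blue and no legend is created.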

Exercise 2

str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

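There are 6 categorical variables in the dataset mpg (manufacturer, model, trans, drv, fl, class), all stored as character. There are 5 numeric variables (displ, year, cyl, cty, hwy), which I count as continuous, although year and cyl take only a few distinct values. The variable types are shown directly by str() and, less directly, by summary(), as shown above in the code and the output.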
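The counts can also be checked in code instead of being read off the output. This is a small sketch using base R helpers; the names col_classes, categorical_vars, and numeric_vars are my own.

```r
library(ggplot2)

# Storage class of each column in mpg
col_classes <- sapply(mpg, class)
table(col_classes)

# Names of the character (categorical) and numeric columns
categorical_vars <- names(mpg)[sapply(mpg, is.character)]
numeric_vars <- names(mpg)[sapply(mpg, is.numeric)]
categorical_vars
numeric_vars
```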

Exercise 5

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(shape = 2, colour = "black", fill = "white", size = 2, stroke = 5)

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(shape = 2, colour = "black", fill = "white", size = 2, stroke = 1)

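The stroke aesthetic modifies the width of the border of a point. As seen in the code and the output above, the first graph has the stroke set to 5 and the second has it set to 1, so the triangles (shape 2) in the second graph have much thinner outlines.

It works for shapes that have a border. Here it changes the outline of a hollow shape (shape 2, where fill has no effect). The documentation in ?geom_point describes it for the filled shapes 21-25, where stroke sets the border width separately from the fill.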
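As an additional sketch (not part of the original answer), the same comparison can be made with shape 21, a circle that has both a border and a fill, so the effect of stroke on the border can be seen separately from the interior. The object names are my own.

```r
library(ggplot2)

# Shape 21 is a circle with a separate border (colour) and interior (fill).
# size sets the filled part; stroke sets the border width.
thick_border <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(shape = 21, colour = "black", fill = "white", size = 2, stroke = 5)
thick_border

thin_border <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(shape = 21, colour = "black", fill = "white", size = 2, stroke = 1)
thin_border
```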

Chapter 1 (Pgs. 20-21)

Exercise 1

ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_line()

ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_boxplot()
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?

ggplot(mpg, aes(x = displ)) +
    geom_histogram(binwidth = 1)

ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_area()

  • For a line chart: geom_line()
  • For a box plot: geom_boxplot()
  • For a histogram: geom_histogram()
  • For an area chart: geom_area()

Exercise 2

ggplot(
 data = mpg,
 mapping = aes(x = displ, y = hwy, color = drv)
) +
 geom_point() +
 geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The graph comes out as I predicted. I predicted a scatter plot of hwy against displ, colored by drv, with one smooth line per drv group. I also predicted that it would be a clean graph because se is set to FALSE, which means that the shaded area (confidence band) is not drawn around the smooth lines.

Exercise 3

ggplot(data = mpg) +
 geom_smooth(
 mapping = aes(x = displ, y = hwy, color = drv),
 show.legend = FALSE
 )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data = mpg) +
 geom_smooth(
 mapping = aes(x = displ, y = hwy, color = drv),
 )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

show.legend = FALSE hides the legend (i.e. the key) for that specific geom. The problem is that the viewer won’t be able to tell which color corresponds to which value of drv. If you remove it, the legend appears again, as shown in the 2nd graph. I think it was used earlier in the chapter because the plots at that point were simple, so it was okay to leave out the legend: the grouping was self-explanatory.

Exercise 4

ggplot(
 data = mpg,
 mapping = aes(x = displ, y = hwy, color = drv)
) +
 geom_point() +
 geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(
 data = mpg,
 mapping = aes(x = displ, y = hwy, color = drv)
) +
 geom_point() +
 geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The se argument controls whether the confidence band (the gray shaded area around the smooth line) is drawn. The default, se = TRUE, draws a 95% confidence band. With se = FALSE, there is no gray shaded area on the graph, which can make the plot cleaner and easier to read when several smooth lines overlap.
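To look at what se changes underneath the plot, the sketch below compares the data that ggplot2 computes for the smooth layer in each case, using layer_data(). I assume here that the band limits appear as extra columns only when se = TRUE; the object names are my own.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()
with_band <- base_plot + geom_smooth(se = TRUE)
without_band <- base_plot + geom_smooth(se = FALSE)

# Columns computed for the smooth layer (layer 2) only when se = TRUE
setdiff(
  names(layer_data(with_band, 2)),
  names(layer_data(without_band, 2))
)
```

The columns listed are the ones the band is drawn from, so with se = FALSE there is nothing to shade.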

Exercise 5

I don’t know if they will look different. Let me check.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

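They do not look different. Both graphs use the same data and the same mappings for both geoms. In the first, the data and mappings are given once in ggplot() and inherited by geom_point() and geom_smooth(). In the second, the same data and mappings are repeated inside each geom.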
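The two versions can also be compared in code rather than by eye. This sketch stores both plots and compares the data computed for each layer with layer_data() and all.equal(); the names global_plot and local_plot are my own.

```r
library(ggplot2)

# Data and mappings given once and inherited by both geoms
global_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# The same data and mappings repeated in each geom
local_plot <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare the computed data for the point layer (1) and the smooth layer (2)
all.equal(layer_data(global_plot, 1), layer_data(local_plot, 1))
all.equal(layer_data(global_plot, 2), layer_data(local_plot, 2))
```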

Open Response

For this exercise, use the diamonds dataset in the tidyverse. Use ?diamonds to get more information about the dataset.

Step 1: Select 1 numeric variable and 2 categorical variables. Create a graphic using geom_boxplot() and facet_wrap() to illustrate the empirical distributions of the sample.

ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot(outlier.size = 0.5, alpha = 0.8) +
  facet_wrap(~ clarity) +
  labs(
    title = "Boxplot of Diamond Prices by Cut Faceted by Clarity",
    x = "Cut",
    y = "Price"
  )

Step 2: Choose 2 numeric variables and 2 categorical variables and creatively illustrate the relationship between all the variables.

ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  facet_wrap(~ clarity) + 
  labs(
    title = "Relationship Between Carat, Price, Cut, and Clarity",
    x = "Carat",
    y = "Price",
    color = "Cut"
  )