Instructions

Exercises: 1,3 (Pgs. 90-91); 1 (Pg. 93); 2,4 (Pg. 99); 1,2 (Pg. 101); 2,3,5 (Pg. 104)

Assigned: Friday, February 8, 2019

Due: Friday, February 15, 2019 by 5:00 PM

Submission: Submit an electronic document on Sakai. It must be submitted as an HTML file generated in RStudio. All assigned problems are taken from the textbook R for Data Science. You do not need R code to answer every question. If you answer without using R code, delete the code chunk. If the question requires R code, make sure you display the R code. If the question requires a figure, make sure you display the figure. Many of the questions can be answered in a written response, but they require R code and/or figures to understand and explain the answer.

Chapter 5 (Pgs. 90-91)

Exercise 1

ggplot(data = diamonds) + geom_histogram(mapping = aes(x = x), binwidth = 0.5)

ggplot(data = diamonds) + geom_histogram(mapping = aes(x = y), binwidth = 0.5)

ggplot(data = diamonds) + geom_histogram(mapping = aes(x = z), binwidth = 0.5)

All three distributions are skewed right, and each has two peaks. In each plot, most of the values fall between 0 and 10, although a few outliers stretch the horizontal axis for y and z. Comparing the three distributions, the x and y histograms look very similar, so I would guess that x and y are length and width and z is depth. I do not know much about the dimensions of a diamond, but because diamonds in jewelry are usually fairly symmetrical, I would expect the length and width values to be similar.
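One way to check these impressions numerically is to look at the range of each dimension. This is a minimal sketch that assumes only that ggplot2 is installed (it provides the diamonds data); the name dim_ranges is illustrative and not part of the assignment.

```r
library(ggplot2)  # provides the diamonds dataset

# Minimum and maximum of each dimension column (measured in mm, per ?diamonds)
dim_ranges <- sapply(diamonds[c("x", "y", "z")], range)
rownames(dim_ranges) <- c("min", "max")
dim_ranges
```

A maximum far above the bulk of the data for y or z points to outliers that are hard to see in a histogram with the default axis limits.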

Exercise 3

diamonds99 <- filter(diamonds, carat == 0.99)
count(diamonds99)
## # A tibble: 1 x 1
##       n
##   <int>
## 1    23
diamonds1 <- filter(diamonds, carat == 1)
count(diamonds1)
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1558

I think there are far more 1 carat diamonds than 0.99 carat diamonds because a 1 carat diamond carries more value than a 0.99 carat diamond. It is more appealing to say “my diamond is 1 carat” than “my diamond is 0.99 carats”, so people are probably more inclined to purchase the 1 carat diamond. It therefore makes sense that more of the sample is 1 carat rather than 0.99 carats.
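To see the two counts in context, it may help to tabulate every carat value close to 1. This is a sketch, assuming dplyr and ggplot2 are available; the 0.9 to 1.1 window and the name near_one are arbitrary choices for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Count diamonds at each carat value in a small window around 1 carat
near_one <- diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%
  count(carat)
near_one
```

If the count jumps sharply at exactly 1 carat compared with its neighbors, that is consistent with weights being rounded up or diamonds being cut to reach the 1 carat mark.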

Chapter 5 (Pg. 93)

Exercise 1

In a histogram, missing values are removed and ggplot2 warns that they have been removed. In a bar chart, missing values are put into a created category called NA.

missingvalues <- diamonds %>% mutate(y = ifelse(y < 3 | y > 20, NA, y))
ggplot(data = missingvalues) + geom_histogram(mapping = aes(x = y), binwidth = 0.5)
## Warning: Removed 9 rows containing non-finite values (stat_bin).

# replace() keeps cut as a factor with its labels; ifelse() would return the integer codes
missingvalues1 <- diamonds %>% mutate(cut = replace(cut, y < 3 | y > 20, NA))
ggplot(data = missingvalues1) + geom_bar(mapping = aes(x = cut))

Chapter 5 (Pg. 99)

Exercise 2

ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) + geom_freqpoly(mapping = aes(color = cut), binwidth = 500)

ggplot(data = diamonds) + geom_point(aes(x = depth, y = price), color = "blue", alpha = 0.5)

ggplot(data = diamonds) + geom_point(aes(x = carat, y = price), color = "red", alpha = 0.5)

ggplot(data = diamonds) + geom_point(aes(x = table, y = price), color = "green", alpha = 0.5)

?diamonds

I plotted three different variables against price, and carat was the only one with an evident relationship: as carat increased, price increased. In the frequency polygon, the lower quality cuts have relatively more of their density at higher prices. Taken together, this suggests that larger diamonds tend to have lower quality cuts, which would explain why lower quality diamonds can be more expensive. Since people are shallow and like to buy big, flashy rings, it is possible that larger diamonds get lower quality cuts because the large diamond would still appeal to consumers and would still sell for more.
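The link between carat and cut is not plotted directly above, so one way to check it is a boxplot of carat for each cut, along with the median carat per cut. This is a sketch, assuming ggplot2 is loaded; the names carat_by_cut and median_carat are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Distribution of carat within each cut quality
carat_by_cut <- ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = cut, y = carat))
carat_by_cut

# Median carat for each cut, as a quick numeric summary of the same comparison
median_carat <- tapply(diamonds$carat, diamonds$cut, median)
median_carat
```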

Exercise 4

ggplot(data = diamonds) + geom_boxplot(mapping = aes(x = cut, y = price))

library(lvplot)
ggplot(data = diamonds, aes(x = cut, y = price)) + geom_lv()

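The letter-value plot extends the boxplot by showing many more quantiles than just the quartiles. Each successively narrower box marks a quantile further out in the tails, so the plot gives more detail about the tails of a large dataset like diamonds and flags far fewer points as outliers than the boxplot does.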

Chapter 5 (Pg. 101)

Exercise 1

diamonds %>% count(color, cut) %>% ggplot(mapping = aes(x = color, y = cut)) + geom_tile(mapping = aes(fill = n))

You can rescale by recomputing the counts so that the fill shows a proportion instead of n: either the proportion of each cut within a color or the proportion of each color within a cut.
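The rescaling described above could look like the following sketch, which computes the proportion of each cut within a color. It assumes dplyr and ggplot2 are loaded; the column name prop and the object name cut_within_color are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Proportion of each cut within each color (proportions sum to 1 per color)
cut_within_color <- diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

ggplot(data = cut_within_color, mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop))
```

Grouping by cut instead of color in the group_by() step would give the proportion of each color within a cut.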

Exercise 2

library(nycflights13)
# na.rm = TRUE belongs inside mean(); outside it, summarize() creates a column named na.rm
flights %>% group_by(dest, month) %>% summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) %>% ggplot(aes(x = month, y = dest, fill = dep_delay)) + geom_tile()

Yikes. Several things make the plot hard to read: the color scale and the destination labels make everything too busy to understand clearly. We could remove missing values, since they add nothing to understanding how flight delays vary by destination and month. We could also plot a subset of destinations, so we don’t have an absurd number of labels.
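One possible cleanup along these lines is sketched below. It drops missing delays, keeps only destinations that have flights in all 12 months, and sorts destinations by average delay. The choice of scale_fill_viridis_c() and the name delays are assumptions made for illustration, not requirements of the exercise.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)  # provides the flights dataset

delays <- flights %>%
  filter(!is.na(dep_delay)) %>%             # drop flights with no recorded delay
  group_by(dest, month) %>%
  summarize(dep_delay = mean(dep_delay)) %>%
  group_by(dest) %>%
  filter(n() == 12) %>%                     # keep destinations served in every month
  ungroup() %>%
  mutate(dest = reorder(dest, dep_delay))   # order destinations by average delay

ggplot(data = delays, mapping = aes(x = factor(month), y = dest, fill = dep_delay)) +
  geom_tile() +
  scale_fill_viridis_c() +
  labs(x = "Month", y = "Destination", fill = "Mean departure delay")
```

This still leaves many destination labels, so filtering further to a handful of destinations of interest may be needed for a readable figure.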

Chapter 5 (Pg. 104)

Exercise 2

Both cut_width() and cut_number() divide a continuous variable into bins. cut_width(x, width) takes a bin width and creates bins that are each width units wide, so the bins can contain very different numbers of observations. cut_number(x, n) takes a number of bins and creates n bins with approximately equal numbers of observations, so the widths of the bins vary.

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) + geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) + geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
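The difference between the two functions can also be checked by counting the observations in each bin. This is a sketch, assuming ggplot2 is loaded; the bin settings (width 0.5, 5 bins) and the names width_bins and number_bins are arbitrary choices for illustration.

```r
library(ggplot2)  # provides diamonds, cut_width(), and cut_number()

# Fixed-width bins: every bin spans 0.5 carat, counts per bin differ a lot
width_bins <- cut_width(diamonds$carat, 0.5)
table(width_bins)

# Equal-count bins: 5 bins with roughly the same number of diamonds, widths differ
number_bins <- cut_number(diamonds$carat, 5)
table(number_bins)
```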

Exercise 3

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) + geom_boxplot(mapping = aes(group = cut_number(carat, 20)))

?diamonds

There is more variability in price among larger diamonds than among smaller diamonds. I expected this because I assume other factors come into play in how the diamonds are priced. Smaller diamonds have fewer options for varying in cut, color, depth, etc., so their prices stay in a smaller range. Prices of larger diamonds are influenced more by the other variables, so they vary more.

Exercise 5

ggplot(data = diamonds) + geom_point(mapping = aes(x = x, y = y)) + coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

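I think a scatterplot makes outliers and relationships among variables easier to see and interpret. Here the unusual points are unusual combinations of x and y rather than extreme values of either variable alone, and a binned plot would not show where those individual points are.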