Problem Set 3


What to Do First?

Notes:

# Install only if missing. Non-interactive sessions need an explicit CRAN mirror.
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2", repos = "https://cloud.r-project.org")
library(ggplot2)
if (!requireNamespace("gridExtra", quietly = TRUE)) install.packages("gridExtra", repos = "https://cloud.r-project.org")
library(gridExtra)
## Loading required package: grid

data(diamonds)
dd = diamonds

Price Box Plots

# Investigate the price of diamonds using box plots,
# numerical summaries, and one of the following categorical
# variables: cut, clarity, or color.

# There won’t be a solution video for this
# exercise so go to the discussion thread for either
# BOXPLOTS BY CLARITY, BOXPLOT BY COLOR, or BOXPLOTS BY CUT
# to share your thoughts and to
# see what other people found.

# You can save images by using the ggsave() command.
# ggsave() will save the last plot created.
# For example…
#   qplot(x = price, data = diamonds)
#   ggsave('priceHistogram.png')

# ggsave currently recognises the extensions eps/ps, tex (pictex),
# pdf, jpeg, tiff, png, bmp, svg and wmf (windows only).

# Copy and paste all of the code that you used for
# your investigation, and submit it when you are ready.
# =================================================================
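Relying on "the last plot created" can bite when several plots are built in a row. A small sketch, assuming the plot is kept in a variable and handed to ggsave() explicitly; the names `p` and `out_file` and the file name are made up for illustration, and PDF is used only because that device is available almost everywhere:

```r
library(ggplot2)
data(diamonds)

# Keep the plot in a variable instead of relying on "last plot created".
p <- ggplot(diamonds, aes(x = clarity, y = price)) + geom_boxplot()

# Hypothetical output location; the extension selects the graphics device.
out_file <- file.path(tempdir(), "priceByClarity.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)  # width/height are in inches by default
```

Swapping the extension for one of the others listed above should change the output format without other changes.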

Compare price by clarity.

ggplot(dd, aes(y = price)) + geom_boxplot(notch = TRUE, aes(x = clarity))

Hmmm. IF is the finest (of the grades in the data), I1 the least fine. Wikipedia (http://en.wikipedia.org/wiki/Diamond_clarity) confirms. Yet median prices seem to vary in the opposite way: IF is lowest, I1 nearly the highest.
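The exercise also asks for numerical summaries, and reading medians off a box plot is imprecise. A minimal sketch of one way to tabulate them, assuming only base R plus the diamonds data from ggplot2 (the names `median_price` and `summary_table` are mine):

```r
library(ggplot2)  # provides the diamonds data set
data(diamonds)

# Median price for each clarity grade, in the factor's own order (I1 ... IF).
median_price <- tapply(diamonds$price, diamonds$clarity, median)

# Put medians next to group sizes so small groups are easy to spot.
summary_table <- data.frame(
  clarity      = names(median_price),
  median_price = as.vector(median_price),
  n            = as.vector(table(diamonds$clarity))
)
print(summary_table)
```

For the full five-number summary per grade, `by(diamonds$price, diamonds$clarity, summary)` gives the same breakdown with more detail.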

Hah! Normalize by weight: look at price per carat!

ggplot(dd, aes(y = price/carat)) + geom_boxplot(notch = TRUE, aes(x = clarity)) + 
    ylab("price / carat")

Something sensible coming into focus. At least the high outlier values make sense: highest prices for finest clarity. But the median values don't seem to vary much, and, if anything, there's still an anti-trend in the medians for SI2 to VVS1. Try zooming in.

ggplot(dd, aes(y = price/carat)) + geom_boxplot(notch = TRUE, aes(x = clarity)) + 
    ylab("price / carat") + coord_cartesian(ylim = c(2000, 6000))

Yep, still a mess. What's going on here? Maybe people don't really care much about clarity except for very good and very bad grades? (The bottom of the notch for IF is above the top of the notch for I1, so the IF median is “significantly” higher.)
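The notch comparison can also be checked numerically instead of by eye. A rough sketch, assuming the notch formula given in the geom_boxplot() documentation (median ± 1.58 × IQR / sqrt(n)); the helper `notch_bounds` is my own, and its quartiles may differ very slightly from what the plot draws:

```r
library(ggplot2)
data(diamonds)

ppc <- diamonds$price / diamonds$carat  # price per carat

# Hypothetical helper: approximate notch limits around the median.
notch_bounds <- function(x) {
  half_width <- 1.58 * IQR(x) / sqrt(length(x))
  c(lower = median(x) - half_width,
    median = median(x),
    upper = median(x) + half_width)
}

# One row per clarity grade, columns lower / median / upper.
notches <- t(sapply(split(ppc, diamonds$clarity), notch_bounds))
print(round(notches))

# TRUE when the IF notch sits entirely above the I1 notch.
notches["IF", "lower"] > notches["I1", "upper"]
```

Non-overlapping notches are only a rough guide that two medians differ, not a formal test.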

Hmmm, do any of the factors map nicely to value?

ggplot(dd, aes(y = price/carat)) + geom_boxplot(notch = TRUE, aes(x = color)) + 
    ylab("price / carat") + coord_cartesian(ylim = c(2000, 6000))

Color does not seem to explain price.

ggplot(dd, aes(y = price/carat)) + geom_boxplot(notch = TRUE, aes(x = cut)) + 
    ylab("price / carat") + coord_cartesian(ylim = c(2000, 6000))

Nor does cut.

Does weight?

ggplot(dd, aes(y = price, x = carat)) + geom_point(alpha = 0.5)

Not obviously by itself. For a given weight (e.g., 1.5 carat), there's a very wide range of prices.
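To put a number on "very wide range", here is a quick sketch that pulls out the stones listed at exactly 1.5 carat and summarizes their prices. The variable name `at_1_5` is mine, and the exact-equality filter assumes the weights are recorded to two decimals, as they appear to be in this data set:

```r
library(ggplot2)
data(diamonds)

# Stones recorded at exactly 1.5 carat.
at_1_5 <- subset(diamonds, carat == 1.5)

nrow(at_1_5)           # how many stones share this weight
summary(at_1_5$price)  # min, quartiles and max of their prices
```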

But, look what we see by overlaying one of the factors with the weight!

ggplot(dd, aes(y = price, x = carat, color = cut)) + geom_point(alpha = 0.5)

Now we see something. For a given weight (again 1.5 carat), we now see the quality of the cut does drive the price higher.

Aha! Confirm this with a severely filtered box plot…

ggplot(subset(dd, carat > 1.4 & carat < 1.7), aes(y = price/carat)) +
    geom_boxplot(notch = TRUE, aes(x = cut)) + ylab("price / carat")

And finally, we see the expected trend toward higher prices with increasing quality of cut.

ggplot(subset(dd, carat > 1.4 & carat < 1.7), aes(y = price/carat)) +
    geom_boxplot(notch = TRUE, aes(x = color)) + ylab("price / carat")

Interestingly, color in the range D-G doesn't matter much, but price falls off after that.

ggplot(subset(dd, carat > 1.4 & carat < 1.7), aes(y = price/carat)) +
    geom_boxplot(notch = TRUE, aes(x = clarity)) + ylab("price / carat") +
    ggtitle("Price vs Clarity, weight between 1.4 and 1.7 carats")

And clarity seems to make the most difference among color, cut and clarity. Q.E.D.
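One crude way to back up that comparison with numbers: within the same 1.4 to 1.7 carat slice, take the median price per carat for each level of a factor and look at how far apart the lowest and highest medians are. This is only a sketch under my own assumptions (the names `mid`, `ppc` and `median_spread` are invented, and the range of medians is sensitive to sparsely populated levels):

```r
library(ggplot2)
data(diamonds)

# Same weight window as the filtered box plots.
mid <- subset(diamonds, carat > 1.4 & carat < 1.7)
mid$ppc <- mid$price / mid$carat  # price per carat

# For each factor: gap between the largest and smallest per-level median.
median_spread <- sapply(c("cut", "color", "clarity"), function(f) {
  med <- tapply(mid$ppc, mid[[f]], median)
  diff(range(med, na.rm = TRUE))
})
print(round(median_spread))
```

A larger gap suggests the factor moves price per carat more within this weight range, which can be compared against the impression from the plots.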