Notes:
# Name a CRAN mirror explicitly; without one, a non-interactive session stops with
# "Error: trying to use CRAN without setting a mirror".
install.packages("ggplot2", repos = "https://cloud.r-project.org")
library(ggplot2)
# Same fix as for any install.packages() call: pass the mirror via repos.
install.packages("gridExtra", repos = "https://cloud.r-project.org")
library(gridExtra)
## Loading required package: grid
data(diamonds)
dd = diamonds
# Investigate the price of diamonds using box plots,
# numerical summaries, and one of the following categorical
# variables: cut, clarity, or color.

# There won’t be a solution video for this
# exercise so go to the discussion thread for either
# BOXPLOTS BY CLARITY, BOXPLOT BY COLOR, or BOXPLOTS BY CUT
# to share your thoughts and to
# see what other people found.

# You can save images by using the ggsave() command.
# ggsave() will save the last plot created.
# For example…
#   qplot(x = price, data = diamonds)
#   ggsave('priceHistogram.png')

# ggsave currently recognises the extensions eps/ps, tex (pictex),
# pdf, jpeg, tiff, png, bmp, svg and wmf (windows only).

# Copy and paste all of the code that you used for
# your investigation, and submit it when you are ready.
# =================================================================
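Since ggsave() defaults to the last plot created, it can be safer to pass the plot explicitly. A minimal sketch, assuming a histogram built with ggplot() rather than qplot(); the object name `p`, the output location under tempdir(), and the 6 x 4 inch size are illustrative choices, not part of the exercise:

```r
library(ggplot2)  # provides ggplot(), ggsave() and the diamonds data
data(diamonds)

# Keep the plot in a variable so the saved image does not depend on
# whichever plot happened to be drawn last.
p <- ggplot(diamonds, aes(x = price)) + geom_histogram(bins = 30)

# Hypothetical output path; the file type is inferred from the extension.
out_file <- file.path(tempdir(), "priceHistogram.png")
ggsave(out_file, plot = p, width = 6, height = 4)
```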
Compare price by clarity.
ggplot(dd, aes(y = price)) + geom_boxplot(notch = TRUE, aes(x = clarity))
Hmmm. IF is the finest (of the grades in the data), I1 the least fine. [Wikipedia](http://en.wikipedia.org/wiki/Diamond_clarity) confirms. Yet median prices seem to vary in the opposite way: IF is lowest, I1 nearly the highest.
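The box plot can be cross-checked with the numerical summaries the exercise asks for. A minimal sketch, assuming the same `dd` copy of diamonds; the names `price_by_clarity` and `median_price` are introduced here for illustration:

```r
library(ggplot2)  # provides the diamonds data set
data(diamonds)
dd <- diamonds

# Five-number summary plus mean of price for each clarity grade
price_by_clarity <- tapply(dd$price, dd$clarity, summary)
print(price_by_clarity)

# Medians alone, for a direct comparison with the box plot centre lines
median_price <- tapply(dd$price, dd$clarity, median)
print(median_price)
```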
Hah! Normalize by price per carat!
ggplot(dd, aes(y = price/carat)) + geom_boxplot(notch = TRUE, aes(x = clarity)) +
ylab("price / carat")
Something sensible coming into focus. At least the high outlier values make sense: highest prices for finest clarity. But the median values don't seem to vary much, and, if anything, there's still an anti-trend in the medians for SI2 to VVS1. Try zooming in.
ggplot(dd, aes(y = price/carat)) + geom_boxplot(notch = TRUE, aes(x = clarity)) +
ylab("price / carat") + coord_cartesian(ylim = c(2000, 6000))
Yep, still a mess. What's going on here? Maybe people don't really care much about clarity except for very good and very bad grades? (The bottom of the notch for IF is above the top of the notch for I1, so the IF median is “significantly” higher.)
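The notch comparison can also be done numerically rather than by eye. A rough sketch, assuming the notch half-width of 1.58 * IQR / sqrt(n) described in the geom_boxplot() documentation; `notch_limits` is a helper written for this note, not a ggplot2 function:

```r
library(ggplot2)  # provides the diamonds data set
data(diamonds)
dd <- diamonds
ppc <- dd$price / dd$carat  # price per carat

# Approximate notch limits: median +/- 1.58 * IQR / sqrt(n)
notch_limits <- function(x) {
  half_width <- 1.58 * IQR(x) / sqrt(length(x))
  c(lower = median(x) - half_width,
    median = median(x),
    upper = median(x) + half_width)
}

# One row per clarity grade, columns lower / median / upper
limits <- t(sapply(split(ppc, dd$clarity), notch_limits))
print(round(limits))

# TRUE if the IF notch lies entirely above the I1 notch
print(limits["IF", "lower"] > limits["I1", "upper"])
```

Non-overlapping notches are only an informal indication that two medians differ, which is why “significantly” stays in quotes.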
Hmmm, do any of the factors map nicely to value?
ggplot(dd, aes(y = price/carat)) + geom_boxplot(notch = TRUE, aes(x = color)) +
ylab("price / carat") + coord_cartesian(ylim = c(2000, 6000))
Color does not seem to explain price.
ggplot(dd, aes(y = price/carat)) + geom_boxplot(notch = TRUE, aes(x = cut)) +
ylab("price / carat") + coord_cartesian(ylim = c(2000, 6000))
Nor does cut.
Does weight?
ggplot(dd, aes(y = price, x = carat)) + geom_point(alpha = 0.5)
Not obviously by itself. For a given weight (e.g., 1.5 carat), there's a very wide range of prices.
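To put numbers on that spread, the prices in a narrow band around 1.5 carat can be summarised directly. A small sketch; the 1.45 to 1.55 carat window and the name `mid_weight` are assumptions chosen for illustration:

```r
library(ggplot2)  # provides the diamonds data set
data(diamonds)
dd <- diamonds

# Diamonds within an arbitrary narrow window around 1.5 carat
mid_weight <- subset(dd, carat >= 1.45 & carat <= 1.55)

# Quartiles and extremes of price at (nearly) fixed weight
print(summary(mid_weight$price))
print(range(mid_weight$price))
```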
But, look what we see by overlaying one of the factors with the weight!
ggplot(dd, aes(y = price, x = carat, color = cut)) + geom_point(alpha = 0.5)
Now we see something. For a given weight (again 1.5 carat), we now see the quality of the cut does drive the price higher.
Aha! Confirm this with a severely filtered box plot…
ggplot(subset(dd, dd$carat > 1.4 & dd$carat < 1.7), aes(y = price/carat)) +
geom_boxplot(notch = TRUE, aes(x = cut)) + ylab("price / carat")
And finally, we see the expected trend toward higher prices with increasing quality of cut.
ggplot(subset(dd, dd$carat > 1.4 & dd$carat < 1.7), aes(y = price/carat)) +
geom_boxplot(notch = TRUE, aes(x = color)) + ylab("price / carat")
Interestingly, color in the range D-G doesn't matter much, but price falls off after that.
ggplot(subset(dd, dd$carat > 1.4 & dd$carat < 1.7), aes(y = price/carat)) +
geom_boxplot(notch = TRUE, aes(x = clarity)) + ylab("price / carat") + ggtitle("Price vs Clarity, weight between 1.4 and 1.7 carats")
And clarity seems to make the most difference among color, cut and clarity. Q.E.D.
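One way to check that last impression numerically is to compare how far the group medians spread for each factor within the same weight band. A rough sketch, assuming the spread of medians (largest minus smallest) is an acceptable measure of "difference"; `heavy`, `ppc` and `median_spread` are names introduced here:

```r
library(ggplot2)  # provides the diamonds data set
data(diamonds)
dd <- diamonds

# Same weight band as the filtered box plots
heavy <- subset(dd, carat > 1.4 & carat < 1.7)
heavy$ppc <- heavy$price / heavy$carat  # price per carat

# Largest minus smallest group median of price per carat for one factor
median_spread <- function(factor_name) {
  medians <- tapply(heavy$ppc, heavy[[factor_name]], median)
  diff(range(medians, na.rm = TRUE))  # na.rm guards against empty levels
}

spread <- sapply(c("cut", "color", "clarity"), median_spread)
print(sort(spread, decreasing = TRUE))
```

A factor with more levels or very small groups can show a wide spread by chance, so this is a sanity check on the plots rather than a proof.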