data(diamonds)
## Warning in data(diamonds): data set 'diamonds' not found
library('ggplot2')
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point() +
scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99))) +
scale_y_continuous(limits = c(0, quantile(diamonds$price, 0.99)))
## Warning: Removed 926 rows containing missing values (geom_point).
Response: Generally, as carat increases, so does price, but it is not exactly linear. There is also great variance of price for each carat.
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point(alpha = 1/4) +
geom_smooth(method = 'lm') +
scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99))) +
scale_y_continuous(limits = c(0, quantile(diamonds$price, 0.99)))
## Warning: Removed 926 rows containing non-finite values (stat_smooth).
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).
Notes: The linear model doesn’t properly fit the graph. If using the model, we might miss some correct values, specifically on the lower and upper end.
Notes: The person who helped the control of continuous diamond sales by insisting that anyone who was thinking of getting married should have one. Her famous slogan is,
Notes: DeBeers company said the diamonds hold the value and emotion of love. They handed out diamonds to famous people to advertise to the lower class. Everyone had to have one, and they had to be large to be worth anything.
Notes:
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
##
## percent
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp,
lower = list(continuous = wrap("points", shape = I('.'))),
upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What are some things you notice in the ggpairs output? Response: The biggest motivator of price is carat.
library(gridExtra)
p1 <- qplot(data = diamonds, x = price, binwidth = 100) +
ggtitle('Price')
p2 <- qplot(data = diamonds, x = price, binwidth = 0.01) +
ggtitle('Price (log10)') +
scale_x_log10()
grid.arrange(p1, p2, ncol = 2)
Notes: I dont understand why the counts dont match up between graphs.
qplot(carat, price, data = diamonds) +
scale_y_log10() +
ggtitle('Price (log10) by Carat')
cuberoot_trans = function() trans_new('cuberoot',
transform = function(x) x^(1/3),
inverse = function(x) x^3)
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing missing values (geom_point).
head(sort(table(diamonds$carat), decreasing = T))
##
## 0.3 0.31 1.01 0.7 0.32 1
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing = T))
##
## 605 802 625 828 776 698
## 132 127 126 125 124 121
ggplot(aes(carat, price), data = diamonds) +
geom_jitter(alpha = 1/2, size = 3/4) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).
Alter the code below.
# install and load the RColorBrewer package
install.packages('RColorBrewer', repos= 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/RColorBrewer_1.1-2.tgz')
## Warning: unable to access index for repository https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/RColorBrewer_1.1-2.tgz/src/contrib:
## cannot download all files
## Warning: package 'RColorBrewer' is not available (for R version 3.3.1)
## Warning: unable to access index for repository https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/RColorBrewer_1.1-2.tgz/bin/macosx/mavericks/contrib/3.3:
## cannot download all files
library(RColorBrewer)
ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1693 rows containing missing values (geom_point).
Response: There are apparent stripes or levels in clarity across the graph. This would suggest that clarity affets price as well.
Alter the code below.
ggplot(aes(x = carat, y = price, color = cut), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Cut', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Cut')
## Warning: Removed 1696 rows containing missing values (geom_point).
Response: Cut does not have much affect on price as there are ideal cut diamonds high and low on the graph. There are no apparent levels in the graph.
Alter the code below.
ggplot(aes(x = carat, y = price, color = color), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Color',
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1688 rows containing missing values (geom_point).
Response: The plot with color looks similar to the one with clarity. Color affects price.
Response: lm(log(price)~carat^(1/3))
Notes:
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color,
## data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = diamonds)
##
## =========================================================================
## m1 m2 m3 m4 m5
## -------------------------------------------------------------------------
## (Intercept) 2.821*** 1.039*** 0.874*** 0.932*** 0.415***
## (0.006) (0.019) (0.019) (0.017) (0.010)
## I(carat^(1/3)) 5.558*** 8.568*** 8.703*** 8.438*** 9.144***
## (0.007) (0.032) (0.031) (0.028) (0.016)
## carat -1.137*** -1.163*** -0.992*** -1.093***
## (0.012) (0.011) (0.010) (0.006)
## cut: .L 0.224*** 0.224*** 0.120***
## (0.004) (0.004) (0.002)
## cut: .Q -0.062*** -0.062*** -0.031***
## (0.004) (0.003) (0.002)
## cut: .C 0.051*** 0.052*** 0.014***
## (0.003) (0.003) (0.002)
## cut: ^4 0.018*** 0.018*** -0.002
## (0.003) (0.002) (0.001)
## color: .L -0.373*** -0.441***
## (0.003) (0.002)
## color: .Q -0.129*** -0.093***
## (0.003) (0.002)
## color: .C 0.001 -0.013***
## (0.003) (0.002)
## color: ^4 0.029*** 0.012***
## (0.003) (0.002)
## color: ^5 -0.016*** -0.003*
## (0.003) (0.001)
## color: ^6 -0.023*** 0.001
## (0.002) (0.001)
## clarity: .L 0.907***
## (0.003)
## clarity: .Q -0.240***
## (0.003)
## clarity: .C 0.131***
## (0.003)
## clarity: ^4 -0.063***
## (0.002)
## clarity: ^5 0.026***
## (0.002)
## clarity: ^6 -0.002
## (0.002)
## clarity: ^7 0.032***
## (0.001)
## -------------------------------------------------------------------------
## R-squared 0.9 0.9 0.9 1.0 1.0
## adj. R-squared 0.9 0.9 0.9 1.0 1.0
## sigma 0.3 0.3 0.3 0.2 0.1
## F 652012.1 387489.4 138654.5 87959.5 173791.1
## p 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -7962.5 -3631.3 -1837.4 4235.2 34091.3
## Deviance 4242.8 3613.4 3380.8 2699.2 892.2
## AIC 15931.0 7270.6 3690.8 -8442.5 -68140.5
## BIC 15957.7 7306.2 3762.0 -8317.9 -67953.7
## N 53940 53940 53940 53940 53940
## =========================================================================
Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.
Response: Prices change over time