There is a dataframe with more than 50,000 diamonds available in the ggplot2 package. We wnat to analyze this data to explain the factors determining the price of a diamond.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
Examine the diamonds dataset.
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
summary(diamonds)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
There are a few bad values in the variables, x, y and z. Eliminate these and make a cleaned copy of diamonds as d.
d = diamonds[diamonds$x > 0 &
diamonds$y > 0 &
diamonds$z > 0,]
str(d)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53920 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
We lost only 20 observations in cleaning the data.
Intuitively, all other things being the same, a heavier diamond is probably worth more. Let’s look a t a scatterplot to see how carat explains price.
g1 = ggplot(data=d,aes(x=carat,y=price))
g2 = g1 + geom_point()
g2
We do see the expected positive relationship, but we notice other things as well. There is vertical striping in the data, and the overplotting obscures the detail. The gap below 2 carats is clear. We can see this more clearly by focusing on a narrow band around 2.
g1 = ggplot(data=d[d$carat >1.89 & d$carat < 2.11,],aes(x=carat,y=price))
g2 = g1 + geom_point()
g2
For whatever reason, we can see that these are not the naturally occurring weights of diamonds.
Now, let’s consider ways to see around the overplotting. We can make the data more transparent by setting the alpha parameter.
g1 = ggplot(data=d,aes(x=carat,y=price))
g2 = g1 + geom_point(alpha = .2) + ggtitle("alpha = .2")
g2
g3 = g1 + geom_point(alpha = .1) + ggtitle("alpha = .1")
g3
g4 = g1 + geom_point(alpha = .01) + ggtitle("alpha = .01")
g4
Another tactic is to create a loess curve instead of plotting points.
g1 = ggplot(data=d,aes(x=carat,y=price))
g2 = g1 + geom_smooth()
g2
## `geom_smooth()` using method = 'gam'
The relationship is very strong up to about 2.5 carats. What fraction of diamonds weighs more than 2.5 carats?
mean(d$carat > 2.5)
## [1] 0.002318249
Less than .2%. For diamonds in this range the other characteristics must be considered.
To focus on the other characteristics, we can try to explain price per carat rather than price itself. Let’s create this variable and look at its distribution.
d$ppc = d$price/d$carat
summary(d$ppc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1051 2478 3495 4008 4949 17830
g1 = ggplot(d,aes(x=ppc))
g2 = g1 + geom_histogram()
g2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
g3 = g1 + geom_freqpoly()
g3
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can try different binwidths
g2 = g1 + geom_histogram(binwidth=100) + ggtitle("Binwidth 100")
g2
g3 = g1 + geom_freqpoly(binwidth=100) + ggtitle("Binwidth 100")
g3
g2 = g1 + geom_histogram(binwidth=250) + ggtitle("Binwidth 250")
g2
g3 = g1 + geom_freqpoly(binwidth=250) + ggtitle("Binwidth 250")
g3
The three characteristics we have left to explore are the categorical variables cut, color and clarity. We can create side-by-side boxplots to see the distribution of price per carat broken down by each of these.
ggplot(data=d,aes(x=cut,y=ppc)) + geom_boxplot() + ggtitle("Distribution of ppc by cut")
ggplot(data=d,aes(x=clarity,y=ppc)) + geom_boxplot() + ggtitle("Distribution of ppc by clarity")
ggplot(data=d,aes(x=color,y=ppc)) + geom_boxplot() + ggtitle("Distribution of ppc by color")
We can also look at numerical displays of these breakdowns using tapply() and summary().
First, we’ll do cut.
tapply(d$ppc,d$cut,summary)
## $Fair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1168 2743 3449 3766 4513 10910
##
## $Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1081 2394 3614 3860 4787 15930
##
## $`Very Good`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1139 2332 3606 4014 5016 17830
##
## $Premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1051 2592 3763 4221 5322 17080
##
## $Ideal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1109 2456 3307 3919 4766 17080
Now color.
tapply(d$ppc,d$color,summary)
## $D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1128 2455 3411 3951 4749 17830
##
## $E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1078 2430 3254 3805 4508 14610
##
## $F
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1168 2587 3494 4135 4949 13860
##
## $G
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1139 2538 3490 4163 5497 12460
##
## $H
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1051 2395 3818 4006 5123 10190
##
## $I
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1152 2344 3780 3996 5195 9398
##
## $J
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1081 2563 3780 3826 4928 8647
Now clarity.
tapply(d$ppc,d$clarity,summary)
## $I1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1051 2112 2887 2796 3354 6353
##
## $SI2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1081 2999 3951 4010 4738 9912
##
## $SI1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1130 2362 3669 3849 4928 9693
##
## $VS2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1152 2438 3429 4080 5484 12460
##
## $VS1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1215 2411 3450 4156 5484 12400
##
## $VVS2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1339 2455 3169 4204 4939 13440
##
## $VVS1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1400 2545 2982 3849 4054 14500
##
## $IF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1588 2865 3156 4260 4284 17830