Notes Feb 27

There is a dataframe with more than 50,000 diamonds available in the ggplot2 package. We wnat to analyze this data to explain the factors determining the price of a diamond.

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

Examine the diamonds dataset.

str(diamonds)

## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

summary(diamonds)

##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
##

There are a few bad values in the variables, x, y and z. Eliminate these and make a cleaned copy of diamonds as d.

d = diamonds[diamonds$x > 0 &
             diamonds$y > 0 &
             diamonds$z > 0,]
str(d)

## Classes 'tbl_df', 'tbl' and 'data.frame':    53920 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

We lost only 20 observations in cleaning the data.

Intuitively, all other things being the same, a heavier diamond is probably worth more. Let’s look a t a scatterplot to see how carat explains price.

g1 = ggplot(data=d,aes(x=carat,y=price))
g2 = g1 + geom_point()
g2

We do see the expected positive relationship, but we notice other things as well. There is vertical striping in the data, and the overplotting obscures the detail. The gap below 2 carats is clear. We can see this more clearly by focusing on a narrow band around 2.

g1 = ggplot(data=d[d$carat >1.89 & d$carat < 2.11,],aes(x=carat,y=price))
g2 = g1 + geom_point()
g2

For whatever reason, we can see that these are not the naturally occurring weights of diamonds.

Now, let’s consider ways to see around the overplotting. We can make the data more transparent by setting the alpha parameter.

g1 = ggplot(data=d,aes(x=carat,y=price))
g2 = g1 + geom_point(alpha = .2) + ggtitle("alpha = .2")
g2

g3 = g1 + geom_point(alpha = .1) + ggtitle("alpha = .1")
g3

g4 = g1 + geom_point(alpha = .01) + ggtitle("alpha = .01")
g4

Another tactic is to create a loess curve instead of plotting points.

g1 = ggplot(data=d,aes(x=carat,y=price))
g2 = g1 + geom_smooth()
g2

## `geom_smooth()` using method = 'gam'

The relationship is very strong up to about 2.5 carats. What fraction of diamonds weighs more than 2.5 carats?

mean(d$carat > 2.5)

## [1] 0.002318249

Less than .2%. For diamonds in this range the other characteristics must be considered.

To focus on the other characteristics, we can try to explain price per carat rather than price itself. Let’s create this variable and look at its distribution.

d$ppc = d$price/d$carat
summary(d$ppc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1051    2478    3495    4008    4949   17830

g1 = ggplot(d,aes(x=ppc))
g2 = g1 + geom_histogram()
g2

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

g3 = g1 + geom_freqpoly()
g3

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can try different binwidths

g2 = g1 + geom_histogram(binwidth=100) + ggtitle("Binwidth 100")
g2

g3 = g1 + geom_freqpoly(binwidth=100) + ggtitle("Binwidth 100")
g3

g2 = g1 + geom_histogram(binwidth=250) + ggtitle("Binwidth 250")
g2

g3 = g1 + geom_freqpoly(binwidth=250) + ggtitle("Binwidth 250")
g3

The three characteristics we have left to explore are the categorical variables cut, color and clarity. We can create side-by-side boxplots to see the distribution of price per carat broken down by each of these.

ggplot(data=d,aes(x=cut,y=ppc)) + geom_boxplot() + ggtitle("Distribution of ppc by cut")

ggplot(data=d,aes(x=clarity,y=ppc)) + geom_boxplot() + ggtitle("Distribution of ppc by clarity")

ggplot(data=d,aes(x=color,y=ppc)) + geom_boxplot() + ggtitle("Distribution of ppc by color")

We can also look at numerical displays of these breakdowns using tapply() and summary().

First, we’ll do cut.

tapply(d$ppc,d$cut,summary)

## $Fair
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1168    2743    3449    3766    4513   10910 
## 
## $Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1081    2394    3614    3860    4787   15930 
## 
## $`Very Good`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1139    2332    3606    4014    5016   17830 
## 
## $Premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1051    2592    3763    4221    5322   17080 
## 
## $Ideal
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1109    2456    3307    3919    4766   17080

Now color.

tapply(d$ppc,d$color,summary)

## $D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1128    2455    3411    3951    4749   17830 
## 
## $E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1078    2430    3254    3805    4508   14610 
## 
## $F
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1168    2587    3494    4135    4949   13860 
## 
## $G
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1139    2538    3490    4163    5497   12460 
## 
## $H
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1051    2395    3818    4006    5123   10190 
## 
## $I
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1152    2344    3780    3996    5195    9398 
## 
## $J
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1081    2563    3780    3826    4928    8647

Now clarity.

tapply(d$ppc,d$clarity,summary)

## $I1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1051    2112    2887    2796    3354    6353 
## 
## $SI2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1081    2999    3951    4010    4738    9912 
## 
## $SI1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1130    2362    3669    3849    4928    9693 
## 
## $VS2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1152    2438    3429    4080    5484   12460 
## 
## $VS1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1215    2411    3450    4156    5484   12400 
## 
## $VVS2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1339    2455    3169    4204    4939   13440 
## 
## $VVS1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1400    2545    2982    3849    4054   14500 
## 
## $IF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1588    2865    3156    4260    4284   17830