Diamonds_Analysis_Revised_AndrewMarchant

Exploratory Data Analysis

Clean Data

Read In Data

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Description of Variables in Diamonds Data Set

Remove unusual values

diamonds2 <- diamonds %>% 
  mutate(x = ifelse(x == 0, NA, x)) 

diamonds2 <- diamonds2 %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
  
diamonds2 <- diamonds2 %>% 
  mutate(z = ifelse(z == 0 | z > 30, NA, z))

Question to consider?

  • Given a low carat and high carat diamond, what is the most important characteristic in determining price?

Show Carat has the greatest impact on price of the diamond

Carat Vs. Price

[1] 0.9215913

What characteristic of the diamond has the most impact on it’s price?

lm <- lm(price ~ carat + cut + color + clarity, diamonds2)
summary(lm)

Call:
lm(formula = price ~ carat + cut + color + clarity, data = diamonds2)

Residuals:
     Min       1Q   Median       3Q      Max 
-16813.5   -680.4   -197.6    466.4  10394.9 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -3710.603     13.980 -265.414  < 2e-16 ***
carat        8886.129     12.034  738.437  < 2e-16 ***
cut.L         698.907     20.335   34.369  < 2e-16 ***
cut.Q        -327.686     17.911  -18.295  < 2e-16 ***
cut.C         180.565     15.557   11.607  < 2e-16 ***
cut^4          -1.207     12.458   -0.097    0.923    
color.L     -1910.288     17.712 -107.853  < 2e-16 ***
color.Q      -627.954     16.121  -38.952  < 2e-16 ***
color.C      -171.960     15.070  -11.410  < 2e-16 ***
color^4        21.678     13.840    1.566    0.117    
color^5       -85.943     13.076   -6.572 5.00e-11 ***
color^6       -49.986     11.889   -4.205 2.62e-05 ***
clarity.L    4217.535     30.831  136.794  < 2e-16 ***
clarity.Q   -1832.406     28.827  -63.565  < 2e-16 ***
clarity.C     923.273     24.679   37.411  < 2e-16 ***
clarity^4    -361.995     19.739  -18.339  < 2e-16 ***
clarity^5     216.616     16.109   13.447  < 2e-16 ***
clarity^6       2.105     14.037    0.150    0.881    
clarity^7     110.340     12.383    8.910  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1157 on 53921 degrees of freedom
Multiple R-squared:  0.9159,    Adjusted R-squared:  0.9159 
F-statistic: 3.264e+04 on 18 and 53921 DF,  p-value: < 2.2e-16
lm2 <- lm(price ~ cut + color + clarity, diamonds2)
summary(lm2)

Call:
lm(formula = price ~ cut + color + clarity, data = diamonds2)

Residuals:
   Min     1Q Median     3Q    Max 
 -6373  -2517  -1254   1159  17013 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3864.910     31.661 122.070  < 2e-16 ***
cut.L         -56.512     67.702  -0.835 0.403883    
cut.Q        -249.924     59.706  -4.186 2.84e-05 ***
cut.C        -525.553     51.760 -10.154  < 2e-16 ***
cut^4        -169.343     41.523  -4.078 4.54e-05 ***
color.L      2093.398     56.209  37.243  < 2e-16 ***
color.Q        22.783     53.660   0.425 0.671144    
color.C      -230.440     50.237  -4.587 4.51e-06 ***
color^4        68.794     46.136   1.491 0.135940    
color^5      -272.289     43.582  -6.248 4.20e-10 ***
color^6        -3.699     39.631  -0.093 0.925641    
clarity.L   -1502.243     99.481 -15.101  < 2e-16 ***
clarity.Q    -568.685     95.928  -5.928 3.08e-09 ***
clarity.C     684.040     82.263   8.315  < 2e-16 ***
clarity^4    -240.521     65.798  -3.655 0.000257 ***
clarity^5     808.666     53.635  15.077  < 2e-16 ***
clarity^6    -212.184     46.783  -4.536 5.76e-06 ***
clarity^7     197.454     41.279   4.783 1.73e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3856 on 53922 degrees of freedom
Multiple R-squared:  0.06587,   Adjusted R-squared:  0.06557 
F-statistic: 223.7 on 17 and 53922 DF,  p-value: < 2.2e-16

Remove strong correlation of carat size vs. price

mod <- lm(log(price) ~ log(carat), data = diamonds2)

diamonds3 <- diamonds2 %>% 
  add_residuals(mod) %>% 
  mutate(resid = exp(resid))

ggplot(data = diamonds3) + 
  geom_point(mapping = aes(x = carat, y = resid))

ggplot(diamonds3, aes(x = resid)) +
  geom_histogram(binwidth = .05) +
  labs(title = "Histogram of resididuals", x = "Residual", y = "Count") +
  theme_minimal()

Residuals are normally distributed which will allow us to run an anova test

After removing strong relationship between carat and price, view characteristics of diamonds relation to price: relative to their size

Cut vs. Price

Color vs. Price

Clarity vs. Price

Interpretation: The lower end of each diamond characteristic has more clustered density, which means that member has a higher probability of taking on a lower price. The higher end characteristics have a more scattered density indicating more variability in price.

What is the most important characteristic of determining high / low price diamonds relative to size?

Subset high and low weight diamonds

hc_diamonds <- subset(diamonds3, carat >= 1.5)
lc_diamonds <- subset(diamonds3, carat <= 1)

Switched from price to weight

Characteristics of high carat diamonds

Anova Test

model_hc <- aov(resid ~ cut + color + clarity, data = hc_diamonds)
summary(model_hc)
              Df Sum Sq Mean Sq F value Pr(>F)    
cut            4  34.68   8.671   474.1 <2e-16 ***
color          6 126.11  21.018  1149.1 <2e-16 ***
clarity        7 208.89  29.842  1631.5 <2e-16 ***
Residuals   6217 113.71   0.018                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Characteristics of low carat diamonds

Anova Test

model_lc <- aov(resid ~ cut + color + clarity, data = lc_diamonds)
summary(model_lc)
               Df Sum Sq Mean Sq F value Pr(>F)    
cut             4  114.9   28.73    1142 <2e-16 ***
color           6  329.9   54.98    2186 <2e-16 ***
clarity         7  938.4  134.06    5330 <2e-16 ***
Residuals   36420  916.0    0.03                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusions

  • Carat of the diamond is most important is determining the price of the diamond

  • As a diamond re-seller looking for good deals on diamonds

    • given a low weight diamond clarity is going to drive up the price the most

    • given a higher weight diamond clarity is the most important characteristic to look at, but color is going to have a decent significance on the price