This R Markdown document is an analysis output based on dataset diamonds from ggplot

A data frame with 53940 rows and 10 variables:

Column Desciption
price price in US dollars ($326–$18,823)
carat weight of the diamond (0.2–5.01)
cut quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color diamond colour, from J (worst) to D (best)
clarity a measurement of how clear the diamond is (I1(worst),SI2,SI1,VS2,VS1,VVS2,VVS1,IF(best))
x length in mm (0–10.74)
y width in mm (0–58.9)
z depth in mm (0–31.8)
depth total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
table width of top of diamond relative to widest point (43–95)

Load required packages

library(QuantPsyc)
library(MASS)
library(ggplot2)
library(dplyr)
library(data.table)
library(tidyr)
library(GGally)

data(diamonds)
diamonds= data.table(diamonds)
diamonds
##        carat       cut color clarity depth table price    x    y    z
##     1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
##     2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
##     3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
##     4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
##     5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
##    ---                                                               
## 53936:  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
## 53937:  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
## 53938:  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
## 53939:  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
## 53940:  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64

Data Exploration

First, we can use summary(diamonds) to view some descriptive facts of the dataset:

##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 

We can then utilise ggpairs from GGally to visualise the correlation of each variables

ggpairs(diamonds)

From the pairwise correlation plot, we could observe that:
1- There is a relationship (trendable and diverging) in between log Price and carat.
2- Price exhibits very limited or no relationship with table and depths.
3- carat is highly correlated to x,y,z. Further understanding from the dictionary, we then understand that the weight (carat) of a diamond would definitely be affected by its length, width and width (x,y,z respectively).


Relationship of Price and Cut

Question 1: Take the diamonds data set from the base R on R-Studio and explore the relationship between cut and price.

Answer: Through a box plot, we observe that the highest price diamonds are premium cut, lowest price diamonds are ideal cut.

diamonds %>%
  group_by(cut) %>%
  summarise(n=n(), 
            mean= mean(price), 
            median=median(price), 
            Q1= quantile(price,0.25),
            Q3= quantile(price,0.75))
## # A tibble: 5 x 6
##         cut     n     mean median      Q1      Q3
##       <ord> <int>    <dbl>  <dbl>   <dbl>   <dbl>
## 1      Fair  1610 4358.758 3282.0 2050.25 5205.50
## 2      Good  4906 3928.864 3050.5 1145.00 5028.00
## 3 Very Good 12082 3981.760 2648.0  912.00 5372.75
## 4   Premium 13791 4584.258 3185.0 1046.00 6296.00
## 5     Ideal 21551 3457.542 1810.0  878.00 4678.50
diamonds %>%
  ggplot(aes(x=cut,y=price, color=cut)) +
  geom_boxplot()


Distribution of each cut

Question 2: Show the distribution of each cuts

diamonds %>%
  ggplot(aes(x=(price))) +
  geom_histogram(stat="bin",binwidth= 500) +
  facet_wrap(~cut, scales = "free")


Histogram or density plot

Question 3: Which is the better representation for comparative continuous data(price ~ cut) : histogram or density plot? Explain with reasons.

Answer: Histogram, reason is that there is one cut group (Fair) exposing noise due to small sample size. By adjusting bins adequately noise could be reduced and hence derive to insights more easily without too much noise.


Predicting Diamond Price

Question 4: What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

To predict diamond price, we would first try to fit the data with a linear regression model. Few assumptions are required before carrying out the analysis:
1) Normality
2) Independecy
3) Homogeneity

1- Normality

QQ plot and histogram both show that variable price shows a better normality after log transformation

par(mfrow=c(1,2))
qqnorm((diamonds$price),main="Normal Q-Q Plot of Price");qqline((diamonds$price))
qqnorm(log(diamonds$price),main="Normal Q-Q Plot of log Price");qqline(log(diamonds$price))

par(mfrow=c(1,2))
hist(diamonds$price,main="Price")
hist(log(diamonds$price),main="log Price")

2- Independency

Recall the ‘Data Exploration’ section which we observed that carat, x, y, z all are highly correlated (dependent) to each other, there is no need to carry variables with similar information, we could hence shortlist carat as one of the features and exclude x,y,z from our model.

cor((diamonds)[,.(carat,x,y,z)])
##           carat         x         y         z
## carat 1.0000000 0.9750942 0.9517222 0.9533874
## x     0.9750942 1.0000000 0.9747015 0.9707718
## y     0.9517222 0.9747015 1.0000000 0.9520057
## z     0.9533874 0.9707718 0.9520057 1.0000000

3- Homogeneity

This check can be performed through inspecting the relationship of Price with other continuous variable.

cor(diamonds[,.(price, carat)])
##           price     carat
## price 1.0000000 0.9215913
## carat 0.9215913 1.0000000
cor(diamonds[,.(log(price), log(carat))])
##           V1        V2
## V1 1.0000000 0.9659137
## V2 0.9659137 1.0000000

Price vs Categorical Variables: Clarity, Cut, Color

By plotting the categorical variables as color, we can see that these three variables all show a relationship with Price. From here we could assume that these three variables would be important in predicting price, hence including them as features in our following prediction model.

par(mfrow=c(1,3))
diamonds %>%
  ggplot(aes(log(price),log(carat), col= clarity))+
  geom_point()

diamonds %>%
  ggplot(aes(log(price),log(carat), col= cut))+
  geom_point()

diamonds %>%
  ggplot(aes(log(price),log(carat), col= color))+
  geom_point()

Model Building: Linear Regression

We use the lm function to build the regression model, with the features that we have selected in earlier sessions. To recall, we have excluded x,y,z due to their dependency to carat, additionally depth and table are excluded due to their very low correlation with price.

lm1= lm(log(price)~ log(carat) + cut+color+clarity, data= diamonds)
summary(lm1)
## 
## Call:
## lm(formula = log(price) ~ log(carat) + cut + color + clarity, 
##     data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.01107 -0.08636 -0.00023  0.08341  1.94778 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  8.457030   0.001168 7242.225  < 2e-16 ***
## log(carat)   1.883718   0.001129 1668.750  < 2e-16 ***
## cut.L        0.120714   0.002354   51.284  < 2e-16 ***
## cut.Q       -0.035115   0.002072  -16.950  < 2e-16 ***
## cut.C        0.013479   0.001799    7.494 6.77e-14 ***
## cut^4       -0.001562   0.001441   -1.084    0.278    
## color.L     -0.439576   0.002027 -216.828  < 2e-16 ***
## color.Q     -0.095623   0.001863  -51.335  < 2e-16 ***
## color.C     -0.014783   0.001743   -8.481  < 2e-16 ***
## color^4      0.011852   0.001601    7.403 1.35e-13 ***
## color^5     -0.002201   0.001513   -1.455    0.146    
## color^6      0.002391   0.001375    1.739    0.082 .  
## clarity.L    0.916832   0.003578  256.274  < 2e-16 ***
## clarity.Q   -0.243038   0.003330  -72.982  < 2e-16 ***
## clarity.C    0.132400   0.002854   46.387  < 2e-16 ***
## clarity^4   -0.066104   0.002283  -28.955  < 2e-16 ***
## clarity^5    0.027418   0.001864   14.711  < 2e-16 ***
## clarity^6   -0.001819   0.001623   -1.120    0.263    
## clarity^7    0.033531   0.001432   23.412  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1338 on 53921 degrees of freedom
## Multiple R-squared:  0.9826, Adjusted R-squared:  0.9826 
## F-statistic: 1.693e+05 on 18 and 53921 DF,  p-value: < 2.2e-16

We can also compare with a model before log transformation.

lm2= lm(price~carat+cut+color+clarity, data= diamonds)
summary(lm2)
## 
## Call:
## lm(formula = price ~ carat + cut + color + clarity, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16813.5   -680.4   -197.6    466.4  10394.9 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -3710.603     13.980 -265.414  < 2e-16 ***
## carat        8886.129     12.034  738.437  < 2e-16 ***
## cut.L         698.907     20.335   34.369  < 2e-16 ***
## cut.Q        -327.686     17.911  -18.295  < 2e-16 ***
## cut.C         180.565     15.557   11.607  < 2e-16 ***
## cut^4          -1.207     12.458   -0.097    0.923    
## color.L     -1910.288     17.712 -107.853  < 2e-16 ***
## color.Q      -627.954     16.121  -38.952  < 2e-16 ***
## color.C      -171.960     15.070  -11.410  < 2e-16 ***
## color^4        21.678     13.840    1.566    0.117    
## color^5       -85.943     13.076   -6.572 5.00e-11 ***
## color^6       -49.986     11.889   -4.205 2.62e-05 ***
## clarity.L    4217.535     30.831  136.794  < 2e-16 ***
## clarity.Q   -1832.406     28.827  -63.565  < 2e-16 ***
## clarity.C     923.273     24.679   37.411  < 2e-16 ***
## clarity^4    -361.995     19.739  -18.339  < 2e-16 ***
## clarity^5     216.616     16.109   13.447  < 2e-16 ***
## clarity^6       2.105     14.037    0.150    0.881    
## clarity^7     110.340     12.383    8.910  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1157 on 53921 degrees of freedom
## Multiple R-squared:  0.9159, Adjusted R-squared:  0.9159 
## F-statistic: 3.264e+04 on 18 and 53921 DF,  p-value: < 2.2e-16

Results show that lm1 is better than lm2, though lm2 is good enough, there is a risk to yield inaccurate estimates from lm2 due to its violation against regression’s assumptions.
To understand which feature has the most impact in price prediction, standardized beta coefficients can be referred as the coefficients are standardized to a same scale:

lm.beta(lm1)
##    log(carat)         cut.L         cut.Q         cut.C         cut^4 
##  1.0857443928  0.1328432935 -0.0588713794  0.0218813985 -0.0009000604 
##       color.L       color.Q       color.C       color^4       color^5 
## -0.4837439068 -0.1603169316 -0.0239982978  0.0068311064 -0.0024223173 
##       color^6     clarity.L     clarity.Q     clarity.C     clarity^4 
##  0.0040089295  1.4883442619 -0.1400833855  0.1457032931 -0.1108268197 
##     clarity^5     clarity^6     clarity^7 
##  0.0445095197 -0.0010482888  0.0369001511

Question 4-1: What variable in the diamonds dataset is most important for predicting the price of a diamond?
Answer: The result shows that carat carries is the most important feature.

Question 4-2: How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
Answer: In general, carat and cut hold a negative correlation (see box-plot below): fair cuts are usually higher carat, ideal cuts are usually lower carat. This could be due to larger diamonds are rare also the technical challenge of cutting a large stone into an ideal cutting. This is also the reason why lower quality diamonds, if it is high carats, would be more expensive.

diamonds %>%
  ggplot(aes(x=cut,y=carat, color=cut)) +
  geom_boxplot()


Cut vs Color

Question: Explain the relationship between cut and colour (hint: both are categorical)
Since these two categories are categorical, we can cross-tabulate the count of these two variables, and further express in a relative frequency manner.

new= 
  diamonds %>%
  group_by(color, cut) %>%
  summarise(n=n()) %>%
  group_by(cut) %>%
  mutate(sum.n= sum(n)) %>%
  ungroup() %>%
  mutate(n2= n/sum.n) %>%
  select(color, cut, n2)

new %>% spread(cut,n2)
## # A tibble: 7 x 6
##   color       Fair       Good `Very Good`    Premium      Ideal
## * <ord>      <dbl>      <dbl>       <dbl>      <dbl>      <dbl>
## 1     D 0.10124224 0.13493681  0.12522761 0.11623523 0.13150202
## 2     E 0.13913043 0.19017530  0.19864261 0.16945834 0.18110529
## 3     F 0.19378882 0.18528333  0.17910942 0.16902328 0.17753237
## 4     G 0.19503106 0.17753771  0.19028307 0.21202233 0.22662521
## 5     H 0.18819876 0.14309009  0.15096838 0.17112610 0.14454086
## 6     I 0.10869565 0.10640033  0.09965238 0.10354579 0.09711846
## 7     J 0.07391304 0.06257644  0.05611654 0.05858893 0.04157580

We can also use a heatmap to visualise this information. Darker blue indicates higher count and white indicates low count.

new %>%
  ggplot(aes(color, cut)) +
  geom_tile(aes(fill=n2*100), colour = "white") +
  scale_fill_gradient(low="white",high="blue") +
  labs(fill = "Density")

Answer:
From the heatmap, we can conclude that:
Most ideal and premium cuts are from colour G.
Most very good and good cut diamonds are from colour E.
Fair cut diamonds are usually from colour F, G, H.
Overall, all cut group diamonds are rare in colour J.