This script was created to investigate the relationship between diamonds’ prices and their physical properties as described in the diamonds dataset from ggplot2.
Data and summary statistics: the diamonds data has 10 variables that include price, carat (unit of mass equal to 200 mg), cut, color, clarity, and measures of x, y and z axis.
# Load ggplot2 and data
library(ggplot2)
data(diamonds)
# See variables names
names(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
# Get summary stats
summary(diamonds)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
Plotting price data in a histogram evidences that the prices have a long tailed distribution (left plot); the price of the most expensive diamond is over 50 times larger than the least expensive, which means the price data distribution is easier to see on a logarithmic scale (right plot).
Most diamonds cost less than 1000 dollars, but there is an interesting multipeak distribution above 1000 dollars.
library(gridExtra)
plot1 <- qplot(price, data = diamonds, binwidth = 100) +
ggtitle('Price')
plot2 <- qplot(price, data = diamonds, binwidth = 0.01) +
ggtitle('Price (log10)') +
scale_x_log10()
grid.arrange(plot1, plot2, ncol = 2)
A few measures can be investigated as potential price determinants: carat (weight), table (width of top of diamond relative to widest point), depth, width (y) and length (x).
Analysing the plots below, it is clear that carat has the closest to a linear relationship to price, so it will be the focus for the price modelling.
plot3 <- ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point() +
ggtitle('Carat')
plot4 <- ggplot(aes(x = table, y = price), data = diamonds) +
geom_point() +
ggtitle('Table')
plot5 <- ggplot(aes(x = depth, y = price), data = diamonds) +
geom_point() +
ggtitle('Depth percentage')
plot6 <- ggplot(aes(x = y, y = price), data = diamonds) +
geom_point() +
ggtitle('Width (mm)')
plot7 <- ggplot(aes(x = x, y = price), data = diamonds) +
geom_point() +
ggtitle('Length (mm)')
grid.arrange(plot3, plot4, plot5, plot6, plot7, ncol = 3)
Price and carat relationship: when plotting all data, it is possible to see the highest priced diamonds do not follow the overall distribution. Trying to plot data again excluding the heaviest / most expensive 1%, the distribution is much better (see blue lines in both figures below). However, some fitting is still needed so it becomes a linear distribution.
# Scatter plot of carat versus price
plot8 <- ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point() +
geom_smooth() +
ggtitle('All data')
# Scatter plot of carat versus price excluding outliers
plot9 <- ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point() +
coord_cartesian(xlim = c(0, quantile(diamonds$carat, 0.99)),
ylim = c(0, quantile(diamonds$price, 0.99))) +
geom_smooth() +
ggtitle('Excluding highest 1%')
grid.arrange(plot8, plot9, ncol = 2)
## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'
As noticed before, the price distribution is better represented when in logarithmic scale.
# Plot the price in log scale
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point() +
scale_y_continuous(trans = "log10", limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
geom_smooth() +
ggtitle('Price (log10) versus carat')
## `geom_smooth()` using method = 'gam'
As the carat distribution presented resembles the x cube function distribution, a transformation can be attempted so data is plotted as the cube root of carat as well. A function can be defined to get the cube root, and then be applied to the x axis. Points are also made smaller for clarity.
# Load the ggplot graphics package
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
##
## percent
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
# Function to extract the cube root
trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
# Plot the cube root of carat
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(alpha = 0.5, size = 0.75, position = "jitter") +
scale_x_continuous(trans = trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = "log10", limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
geom_smooth() +
ggtitle('Price (log10) versus carat (cube root)')
## `geom_smooth()` using method = 'gam'
Within this distribution, the qualitative characteristics can be analysed to see if they impact on price.
# Load color brewer
library(RColorBrewer)
# Same plot as above with color by other diamond properties
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(aes(color = clarity), alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = "log10", limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by cube root of carat and clarity')
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(aes(color = color), alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Color',
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = "log10", limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by cube root of carat and color')
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(aes(color = cut), alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Cut', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = "log10", limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by cube root of carat and Cut')
Cut seems to have the least impact on price; clarity and color do impact price, having the best color and clarity associated with highest prices within the same carat.
To model the price variation the following parameters are used: cube root of carat, carat, color and clarity.
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + color)
m4 <- update(m3, ~ . + clarity)
mtable(m1, m2, m3, m4)
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + color,
## data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + color +
## clarity, data = diamonds)
##
## ==============================================================================
## m1 m2 m3 m4
## ------------------------------------------------------------------------------
## (Intercept) 2.821*** 1.039*** 1.096*** 0.485***
## (0.006) (0.019) (0.017) (0.010)
## I(carat^(1/3)) 5.558*** 8.568*** 8.303*** 9.088***
## (0.007) (0.032) (0.029) (0.017)
## carat -1.137*** -0.966*** -1.078***
## (0.012) (0.011) (0.006)
## color: .L -0.373*** -0.442***
## (0.004) (0.002)
## color: .Q -0.132*** -0.094***
## (0.003) (0.002)
## color: .C -0.002 -0.016***
## (0.003) (0.002)
## color: ^4 0.029*** 0.012***
## (0.003) (0.002)
## color: ^5 -0.018*** -0.003*
## (0.003) (0.002)
## color: ^6 -0.027*** 0.000
## (0.002) (0.001)
## clarity: .L 0.941***
## (0.003)
## clarity: .Q -0.255***
## (0.003)
## clarity: .C 0.144***
## (0.003)
## clarity: ^4 -0.069***
## (0.002)
## clarity: ^5 0.030***
## (0.002)
## clarity: ^6 -0.005**
## (0.002)
## clarity: ^7 0.034***
## (0.001)
## ------------------------------------------------------------------------------
## R-squared 0.924 0.935 0.947 0.983
## adj. R-squared 0.924 0.935 0.947 0.983
## sigma 0.280 0.259 0.233 0.133
## F 652012.063 387489.366 120932.024 205751.644
## p 0.000 0.000 0.000 0.000
## Log-likelihood -7962.499 -3631.319 2002.977 32297.142
## Deviance 4242.831 3613.360 2932.128 953.586
## AIC 15930.999 7270.637 -3985.953 -64560.284
## BIC 15957.685 7306.220 -3896.997 -64409.058
## N 53940 53940 53940 53940
## ==============================================================================
Now let’s test this model creating a hypothetic diamond with the following characteristics: carat of 1, color J, clarity IF and charted price of 5000 dollars (from https://www.creditdonkey.com/diamond-prices.html).
hypDiamond = data.frame(carat = 1.00, color = "J", clarity = "IF")
modelEstimate = predict(m4, newdata = hypDiamond,
interval = "prediction", level = .95)
print(exp(modelEstimate))
## fit lwr upr
## 1 5469.189 4213.778 7098.626
The calculated price is 5469 dollars (95% confidence). The model predicts fairly well the chartered price a diamond with the above characteristics should have!