2023-10-15

Introduction to the Diamonds Dataset

Using the built-in data set, diamonds, we will explore the various factors that affect a diamond’s price.

## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
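The preview above can be reproduced with a couple of lines. This is a minimal sketch that assumes the ggplot2 package is installed, since that is where the diamonds data set lives; the call to `str()` is an extra that was not part of the original output.

```r
# Load ggplot2, which ships the diamonds data set
library(ggplot2)

# Show the first six rows, matching the preview above
head(diamonds)

# Extra (not in the original report): column types and dimensions at a glance
str(diamonds)
```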

We will use plots to visualize the data. We will also analyze the carat and depth factors using simple linear regression.

Visualize the Data

The following plot shows the Price vs Depth of the clearest and Ideal Cut diamonds along with color and size.
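A plot along these lines could be drawn as follows. This is only a sketch: it assumes that "clearest" means the IF clarity grade (the top level in this data set) and that "size" refers to carat, and the names `best` and `p_best` are made up for illustration.

```r
library(ggplot2)

# Assumption: "clearest" = clarity grade IF; keep only Ideal cut diamonds
best <- subset(diamonds, cut == "Ideal" & clarity == "IF")

# Price against depth, with color grade as hue and carat as point size
# (assumption: "size" in the description refers to carat)
p_best <- ggplot(best, aes(x = depth, y = price, color = color, size = carat)) +
  geom_point(alpha = 0.6) +
  labs(x = "Depth (%)", y = "Price (USD)")
p_best
```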

Observations from the Data

From the previous plot, we can see that higher-carat diamonds seem to be worth more, even if their color is not the best. It is possible that depth is not strongly correlated with price, since there are diamonds of varying depths at every price. This is something we will explore more in our linear regression.

First, let’s explore how color and cut affect the price of diamonds.

Average Price of a Diamond by Color

## # A tibble: 7 × 2
##   color average_by_color
##   <ord>            <dbl>
## 1 D                3170.
## 2 E                3077.
## 3 F                3725.
## 4 G                3999.
## 5 H                4487.
## 6 I                5092.
## 7 J                5324.
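A table like the one above can be built with dplyr. This sketch assumes the dplyr package is installed; the column name `average_by_color` comes from the output above, while the object name `average_color` is just a placeholder.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# Mean price within each color grade (D is the best grade, J the worst)
average_color <- diamonds %>%
  group_by(color) %>%
  summarise(average_by_color = mean(price))
average_color
```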

Wow! It seems strange, but the average price of J color diamonds (the worst color grade) is higher than that of D color diamonds (the best). Let’s take a deeper look into the data to see more.

Boxplots of Color vs Price

Well, it seems to be true! People are paying more on average for worse color diamonds. I wonder if people just like the colors or if they’re getting ripped off? Buyer beware!
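Here is a sketch of how the boxplots could be drawn, followed by one check that might be worth running before blaming the buyers: since larger diamonds looked more expensive in the first plot, it is possible that the lower color grades simply tend to be bigger stones. The names `p_color` and `carat_by_color` are placeholders, and the second half is an extra that was not part of the original analysis.

```r
library(ggplot2)
library(dplyr)

# Boxplot of price for each color grade
p_color <- ggplot(diamonds, aes(x = color, y = price)) +
  geom_boxplot()
p_color

# Extra check (not in the original report): typical size per color grade,
# to see whether carat could be driving the price differences
carat_by_color <- diamonds %>%
  group_by(color) %>%
  summarise(mean_carat = mean(carat), median_price = median(price))
carat_by_color
```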

Average Price based on Cut

Here’s the average price of diamonds based on cut.

cut        average_by_cut
Fair             4358.758
Good             3928.864
Very Good        3981.760
Premium          4584.258
Ideal            3457.542

Again, the best (Ideal) diamonds are not the most expensive on average! In fact, Ideal has the lowest average price of the five cuts, while Premium has the highest.
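The averages by cut can be computed the same way as the averages by color. In this sketch the column name `average_by_cut` matches the table above, while `average_cut` is a placeholder name and the sorting step is an addition to make the ranking easier to read.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# Mean price within each cut, sorted from most to least expensive
average_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(average_by_cut = mean(price)) %>%
  arrange(desc(average_by_cut))
average_cut
```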

Linear Regression

Simple linear regression models the relationship between two variables, a response and a single predictor, with a straight line. We will create two different simple linear regression models. The first will examine the relationship between price and carat (the weight of the diamond), and the second will look at price vs depth (the total depth percentage) of the diamonds.

Our two linear regression lines will take the form of:

\[\begin{aligned} E(\text{price in USD}) &= \beta_{0} + \beta_{1} (\text{carat}) \\ E(\text{price in USD}) &= \beta_{0} + \beta_{1} (\text{depth}) \end{aligned}\]

Price vs Carat

Based on the scatter plot, it seems that price does not depend on carat alone. Past 1 carat, diamonds of the same weight sell for a wide range of prices. In this situation, I might try a different model with multiple regression, but that’s a topic for later.
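The scatter plot for price vs carat can be sketched in the same way as the depth plot in the next section. The name `pcarat` and the `alpha` setting are assumptions made here for illustration, not taken from the original code.

```r
library(ggplot2)

# Scatter plot of price against carat with a fitted least-squares line;
# alpha makes the ~54,000 overlapping points easier to read
pcarat <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm")
pcarat
```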

Price vs Depth

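library(ggplot2)  # provides ggplot() and the diamonds data set
# Scatter plot of price against depth with a fitted least-squares line
pdepth <- ggplot(diamonds, aes(x = depth, y = price)) +
  geom_point() +
  geom_smooth(method = "lm")
pdepth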

The fitted line is nearly flat, so there does not seem to be a meaningful linear relationship between price and depth.
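One way to put a number on this impression is the Pearson correlation coefficient, which is close to 0 when there is little linear relationship and close to 1 or -1 when there is a strong one. This is a quick sketch, not part of the original analysis, and the names `r_depth` and `r_carat` are placeholders.

```r
library(ggplot2)  # provides the diamonds data set

# Pearson correlation of each predictor with price
r_depth <- cor(diamonds$depth, diamonds$price)
r_carat <- cor(diamonds$carat, diamonds$price)

# Side-by-side comparison of the two predictors
c(depth = r_depth, carat = r_carat)
```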

Linear Regression Equations

The two simple linear regression models behind these plots, while not necessarily the best models, have the following fitted equations.

\[\begin{aligned} E(\text{price in USD}) &= -2256 + 7756 (\text{carat}) \\ E(\text{price in USD}) &= 5763.67 - 29.65 (\text{depth}) \end{aligned}\]
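Coefficients like these can be obtained by fitting the two models with `lm()`. The sketch below assumes the models were fit on the full diamonds data set; the object names `fit_carat` and `fit_depth` and the 1-carat example are made up for illustration.

```r
library(ggplot2)  # provides the diamonds data set

# Fit the two simple linear regression models
fit_carat <- lm(price ~ carat, data = diamonds)
fit_depth <- lm(price ~ depth, data = diamonds)

# Intercept (beta_0) and slope (beta_1) of each model
coef(fit_carat)
coef(fit_depth)

# Predicted price in USD for a hypothetical 1-carat diamond
predict(fit_carat, newdata = data.frame(carat = 1))

# Share of price variance explained by each model
summary(fit_carat)$r.squared
summary(fit_depth)$r.squared
```

Reading the first equation by hand gives the same kind of answer: a 1-carat diamond has an expected price of about -2256 + 7756 = 5500 USD.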