Part 1: Data Description and Descriptive Statistic

diamonds <- read.csv("Diamonds Prices2022.csv")

Taking a random sample

For this experiment, we’ll be taking a sample of n = 1000

set.seed(333)
sample_data <- diamonds %>% sample_n(1000)

Describing variables

str(sample_data)
## 'data.frame':    1000 obs. of  11 variables:
##  $ X      : int  5545 37442 47420 20102 20012 31079 43583 8518 4686 17542 ...
##  $ carat  : num  1.02 0.33 0.57 1.22 0.32 0.33 0.5 1.04 1.01 1.01 ...
##  $ cut    : chr  "Premium" "Ideal" "Ideal" "Premium" ...
##  $ color  : chr  "I" "G" "E" "F" ...
##  $ clarity: chr  "SI2" "IF" "VS2" "VS2" ...
##  $ depth  : num  61.2 61.9 61.6 62 63.4 61.5 62.2 62.1 62.3 59.1 ...
##  $ table  : num  59 54 56 58 55 56 60 56 55 55 ...
##  $ price  : int  3856 984 1851 8576 421 752 1431 4426 3674 7059 ...
##  $ x      : num  6.51 4.42 5.33 6.86 4.32 4.45 5.15 6.54 6.35 6.6 ...
##  $ y      : num  6.46 4.46 5.38 6.89 4.35 4.49 5.05 6.47 6.29 6.53 ...
##  $ z      : num  3.97 2.75 3.3 4.26 2.75 2.75 3.17 4.04 3.94 3.88 ...
summary(sample_data)
##        X             carat            cut               color          
##  Min.   :   95   Min.   :0.2300   Length:1000        Length:1000       
##  1st Qu.:13016   1st Qu.:0.3975   Class :character   Class :character  
##  Median :26409   Median :0.7000   Mode  :character   Mode  :character  
##  Mean   :26372   Mean   :0.7898                                        
##  3rd Qu.:39420   3rd Qu.:1.0500                                        
##  Max.   :53807   Max.   :2.8000                                        
##    clarity              depth           table           price      
##  Length:1000        Min.   :56.30   Min.   :50.10   Min.   :  373  
##  Class :character   1st Qu.:61.10   1st Qu.:56.00   1st Qu.:  917  
##  Mode  :character   Median :61.80   Median :57.00   Median : 2338  
##                     Mean   :61.77   Mean   :57.41   Mean   : 3907  
##                     3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5162  
##                     Max.   :69.70   Max.   :67.00   Max.   :18730  
##        x              y               z         
##  Min.   :3.88   Min.   :3.920   Min.   : 2.350  
##  1st Qu.:4.69   1st Qu.:4.710   1st Qu.: 2.900  
##  Median :5.68   Median :5.700   Median : 3.510  
##  Mean   :5.71   Mean   :5.713   Mean   : 3.557  
##  3rd Qu.:6.53   3rd Qu.:6.540   3rd Qu.: 4.040  
##  Max.   :9.03   Max.   :8.980   Max.   :31.800
ggplot(diamonds, aes(price)) +
  geom_histogram(bins = 30, fill = "lightgreen", color = "black") +
  theme_minimal()

ggplot(diamonds, aes(depth)) +
  geom_histogram(bins = 30, fill = "lightpink", color = "black") +
  theme_minimal()

ggplot(diamonds, aes(table)) +
  geom_histogram(bins = 30, fill = "lightgray", color = "black") +
  theme_minimal()

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  theme_minimal()

ggplot(diamonds, aes(x)) +
  geom_histogram(bins = 30, fill = "orange", color = "black") +
  theme_minimal()

ggplot(diamonds, aes(y)) +
  geom_histogram(bins = 30, fill = "magenta", color = "black") +
  theme_minimal()

ggplot(diamonds, aes(z)) +
  geom_histogram(bins = 30, fill = "yellow", color = "black") +
  theme_minimal()

ggplot(diamonds, aes(cut)) +
  geom_bar(fill = "steelblue") +
  theme_minimal()

ggplot(diamonds, aes(clarity)) +
  geom_bar(fill = "purple") +
  theme_minimal()

ggplot(diamonds, aes(color)) +
  geom_bar(fill = "orange") +
  theme_minimal()

Examining the continuous variables, we have “carat”, “depth”, “table”, “price”, and the dimensions x, y, and z.

Our Categorical variables include “cut”, “color”, and “clarity”.

Some notes we have from the obtained histograms.

  • Price is right skewed as this indicates that most diamonds are small.

  • Depth is normally distributed at roughly 61% to 62%.

  • Table is roughly uniform with a emphasis on the range 56-60.

  • Carat is right-skewed as most diamonds are on the lower end with a few being very expensive. This is expected as we know diamonds overall are quite rare.

  • The dimension variables x, y, and z represent the length, width, and depth in mm respectively. All show a right-skewed distribution which reflects on the principle of how most diamonds fall within a typical size range with larger diamonds being very uncommon.

Some notes we have from the obtained barplots.

  • For cut, we can see that ideal and premium are the most popular cuts with very good trailing in third.

  • For clarity, SI1 and VS2 are the most abundant in terms of observation count.

  • For color, the most common grades tend to fall around the mid-range , being G. This alligns with market availability and preference.

Overall, the distribution shapes we’ve obtained gives us insight into what makes diamonds so expensive as well as variables that factor into what makes diamonds vary to such degrees. Progressing forward, we will be analyzing the multitude of variables involved and finding correlations with one another.

Quantitative and categorical variables

For our analysis, we will be utilizing the quantitative variables carat, depth, and table. Categorical variables will be cut and clarity.

  • Carat is highly correlated with price and almost universal when it comes to assessing diamonds.

  • Depth would be a great variable to observe it’s influence on cut quality.

  • Table allows us to balance the proportion of the variables involved.

  • Cut is a huge factor when it comes to quality grades of diamonds. Has a very strong association with price.

  • Clarity allows for good variety in categories.

Multiple linear regression model

model <- lm (price ~ carat + depth + table + cut + clarity, data = sample_data)
summary(model)
## 
## Call:
## lm(formula = price ~ carat + depth + table + cut + clarity, data = sample_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4705.3  -712.3   -90.0   486.4  6667.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2893.73    3331.14  -0.869 0.385228    
## carat         8563.86      95.21  89.944  < 2e-16 ***
## depth          -18.04      36.38  -0.496 0.620189    
## table          -34.32      25.80  -1.330 0.183788    
## cutGood        476.68     294.75   1.617 0.106145    
## cutIdeal       713.62     292.24   2.442 0.014784 *  
## cutPremium     530.90     283.57   1.872 0.061474 .  
## cutVery Good   500.79     281.59   1.778 0.075635 .  
## clarityIF     3380.02     581.70   5.811 8.40e-09 ***
## claritySI1    2082.17     541.15   3.848 0.000127 ***
## claritySI2    1157.24     543.26   2.130 0.033404 *  
## clarityVS1    3056.89     545.30   5.606 2.69e-08 ***
## clarityVS2    2854.70     543.15   5.256 1.81e-07 ***
## clarityVVS1   3598.80     557.40   6.456 1.68e-10 ***
## clarityVVS2   3788.32     550.83   6.877 1.08e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1301 on 985 degrees of freedom
## Multiple R-squared:  0.8965, Adjusted R-squared:  0.895 
## F-statistic: 609.2 on 14 and 985 DF,  p-value: < 2.2e-16

Here, we ran the multiple linear regression model using price as the primary response variable and carat, depth, table, cut, and clarity as our predictors. Shown above is the summary output and here, we can examine the estimates for each coefficient as well as very useful summaries such as SE, R-squared, adjusted R-squared, RSE, etc.

Comments:

  • Overall, the multiple linear regression model we produced performed extraordinarily well given the output.

  • Achieved an R-Squared value of 0.8965 which means that 89.65% of the variability of Y can be explained by the predictors. This means that carat, depth, table, cut, and clarity are strong predictors of price in diamond.

  • We can see that carat heavily dominates the model with a positive estimate of 8563. This makes sense as carat is one of the largest factor when it comes to pricing in diamonds. The predictor’s depth and table shows very little contribution to our model. This was quite surprising as this suggests that the presence of carat and quality lowers the impact of depth and table when it comes to affecting price.

  • Our categorical variables gave us varying conclusions. The cut level did not differ substantially from the baseline which wasn’t expected of. Normally, we would attribute cut quality as a huge factor in pricing.

  • Our clarity categories showed large positive effects with higher clarity diamonds. This includes IF, VS1, and VS2 while the others aren’t far behind. Higher-clarity diamonds are associated with higher prices, aligning with what we would see in a real world setting.

Part 2: Simple Linear Regression

simple_model <- lm(price ~ carat, data = sample_data)
summary(simple_model)
## 
## Call:
## lm(formula = price ~ carat, data = sample_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4843.1  -910.8   -39.3   549.0  8048.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2312.30      96.68  -23.92   <2e-16 ***
## carat        7874.33     105.24   74.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1562 on 998 degrees of freedom
## Multiple R-squared:  0.8487, Adjusted R-squared:  0.8486 
## F-statistic:  5599 on 1 and 998 DF,  p-value: < 2.2e-16
ggplot(sample_data, aes(x = carat, y = price)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Slop and intercept

We got an intercept of -2312.30 and a slope of 7874.33 This means that for every one-carat increase in diamond weight, the predicted price increases by approximately $7874.33 on average. A strong correlation between carat and price.

Hypothesis Test

H0: b1 = 0 (carat has no effect on price)

Ha: b1 NE 0 (carat affects price)

Conducting our hypothesis test, our p-value is < 2e-16 which indicates that it’s very significant, so we reject the null hypothesis. We can conclude that carat has a statistically significant effect on price, indicating a strong linear relationship between the size and price. Adding on to that, our obtained estimate for carat is positively large.

R^2 and adjusted R^2

We obtained an R^2 value of 0.8487 which translates to carat explaining 84.87% of the variation in diamond prices. As predicted, carat is an extremely strong predictor of price. Our obtained adjusted R^2 is 0.8486 which is extremely close to our R^2. This means it’s good.

Confidence interval

confint(simple_model)
##                 2.5 %    97.5 %
## (Intercept) -2502.026 -2122.571
## carat        7667.823  8080.846

The 95% confidence interval for the intercept is (-2502.026, -2122.571). This represents the predicted price when carat = 0. Not really meaningful in anyway.

The 95% confidence interval for carat is (7667.823, 8080.846). This means that as carat increases, price increases. To elaborate, for each additional carat, the price is expected to increase by an additional $7,667 to $8,080 on average.

Prediction Interval

new <- data.frame(carat = 1)
predict(simple_model, new, interval = "prediction")
##        fit      lwr      upr
## 1 5562.036 2495.248 8628.824

For our prediction interval, we are predicting the price for a 1-carat diamond.

Here, we get a fit of 5562.036, which is the predicted price for a diamond with the given carat value (1).

lwr of 2496.248 which represents the lower bound of the 95% prediction interval.

upr of 8628.824 which represents the upper bound of the 95% prediction interval.

With everything together, for a diamond with a carat value of 1, we predict its price to be about $5,562.036 Conducting a 95% confidence will reveal that the carat is expected to have a price between $2,495 and $8628.

Residual analysis

plot(simple_model$residuals)

hist(simple_model$residuals)

plot(simple_model, which = 1)

plot(simple_model, which = 2)

Observing our plots and summary, the residual vs. fitted plot shows an increasing spread as fitted values increase. This would indicate heterosedasticity or, constant variance. This is as expected because higher-carat diamonds equate to higher varying prices.

Testing assumptions and applying transformations

Testing for linearity, we examine the Residuals vs Fitted plot. From the graph, we can see that the residuals show a cured pattern where residuals are negative for small carat values,, positive in the middle portion, and negative for larger carat values. Therefore, the assumption is violated. This means that the relationship between price and carat is not perfectly linear.

The Residuals vs Fitted plot also shows the residuals spread out more as fitted values increase. This reflects the price variability increases for larger diamonds. Therefore, the constant variance assumption is violated.

Looking at the QQ-plot, we can see that the plot shows noticeable deviation from the straight reference line in both tails. The lower tail bends downwards while the upper tail bends upward, indicating the presence of heavy tails and outliers. We sampled a size of n = 1000, where even small deviations become visible. In our case however, the departures are substantial enough to conclude that the residuals are not normally distributed. Therefore, normality assumption is violated.

Evaluating the regression assumptions utilizing our plots, I’ve discovered evidence of nonlinearity, heteroscedasticity (constant variance), and non-normal residuals. We can address these violations through log-transformation to the response variable price.

model_log <- lm(log(price) ~ carat, data = sample_data)
summary(model_log)
## 
## Call:
## lm(formula = log(price) ~ carat, data = sample_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.21420 -0.23875  0.03182  0.24433  1.06519 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.17011    0.02411  255.94   <2e-16 ***
## carat        2.02211    0.02624   77.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3894 on 998 degrees of freedom
## Multiple R-squared:  0.8561, Adjusted R-squared:  0.856 
## F-statistic:  5938 on 1 and 998 DF,  p-value: < 2.2e-16
plot(model_log, which = 1)

plot(model_log, which = 2)

After transformations, our QQ-plot is much straighter with less variability and our residuals vs fitted will show much more constant variance. Our R^2 should increase in turn.

  • Our intercept is 6.17011 which is when carat = 0, the expected log(price) = 6.17. This corresponds to an estimated price of e^6.17011 = 4,765. Note: Not very applicable as carat = 0 isn’t meaningful.

  • Achieved slope is 2.02211. This represents for each additional 1 carat, the expected price multiplies by: e^2.02211 = 7.55. Every 1 carat increase multiplies the price by about 7.5x which matches reality with how diamond prices increase exponentially.

  • RSE = 0.3894 which is smaller compared to our other model.

  • R^2 = 0.8561 which is similar to our original untransformed model.

  • The addition of log transformation causes residual variance to be more constant, normal, and the model assumptions are much better satisfied.

  • Before log transformation:

    • Right-skewed

    • Heteroscedasticity

    • non-normal residuals

  • After log transformation:

    • Tighter residuals

    • Straighter QQ-plot

    • Constant variance

    • less branching outliers.

Additional Variable for Model Improvement

  • Adding depth to the model decreased the adjusted R^2, so we excluded it.

  • Adding table increased the adjusted R^2 slight, but minimal improvement.

  • Adding cut increased the adjusted R^2 more noticeably.

    • This means cut quality provides meaningful influence.
  • Adding clarity produced the largest improvement in model performance, significantly increasing R^2.

The final model will include carat, cut, and clarity has these variables best improve the overall model without causing overfitting.

Comments

While conducting this portion of the project, I discovered that variables such as depth and table, had very little contribution to explanatory power to the model given the introduction of carat. These variable define the dimensions of the diamond yet, they have very little influence on R^2. Clarity on the other hand had the single most impactful increase in R^2.

PART II Continuation

Best Model

model_best <- lm(log(price) ~ carat + cut + clarity, data = sample_data)
summary(model_best)
## 
## Call:
## lm(formula = log(price) ~ carat + cut + clarity, data = sample_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.16232 -0.23639  0.04429  0.25762  0.96451 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.837657   0.163777  35.644  < 2e-16 ***
## carat         2.126935   0.026780  79.424  < 2e-16 ***
## cutGood      -0.038122   0.080519  -0.473  0.63599    
## cutIdeal      0.024371   0.072601   0.336  0.73718    
## cutPremium   -0.059935   0.073718  -0.813  0.41639    
## cutVery Good  0.003515   0.074230   0.047  0.96224    
## clarityIF     0.433874   0.164388   2.639  0.00844 ** 
## claritySI1    0.186457   0.152908   1.219  0.22298    
## claritySI2    0.046987   0.153384   0.306  0.75941    
## clarityVS1    0.345604   0.154034   2.244  0.02507 *  
## clarityVS2    0.297253   0.153484   1.937  0.05307 .  
## clarityVVS1   0.365375   0.157609   2.318  0.02064 *  
## clarityVVS2   0.462129   0.155625   2.970  0.00305 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3681 on 987 degrees of freedom
## Multiple R-squared:  0.8729, Adjusted R-squared:  0.8713 
## F-statistic: 564.7 on 12 and 987 DF,  p-value: < 2.2e-16

Above is the best model summary we produced with carat, cut, and clarity. The summary output produced gives us:

  • Adjusted R^2 of 0.8713, which means the model explains 87.13% of the variation in log(price). A very strong fit.

  • F-statistic is extremely large at 564.7 with a p-value < 2.2e-16, indicating the mode overall is highly significant.

  • Carat had an estimate of 2.1269 which is the strongest predictor in the model. For each 1 unit increase in carat, the expected log(price) increases by 2.1269. In pricing terms e^2.1269 = 8.39, meaning each additional carat multiplies the price by 8.4x.

  • All of the cut levels produced weren’t statistically significant despite being marketed as an important factor when purchasing diamonds. Price effect observed is smaller than anticipated.

  • Several clarity categories are significant with higher clarity grades leading to an increase in price.

  • RSE = 0.3681, low for log-transformed prices.

Multicollinearity and VIF

modelz <- lm(log(price) ~ carat + cut + clarity, data = sample_data)
vif(modelz)
##             GVIF Df GVIF^(1/(2*Df))
## carat   1.165628  1        1.079642
## cut     1.084322  4        1.010171
## clarity 1.216555  7        1.014100

Cut and clarity are categorical variables –> GVIF1/(2 x df)

  • Carat: 1.08

  • Cut: 1.01

  • Clarity: 1.01

All VIF values obtained are very close to 1, far below the rule of thumb threshold of 5 and 10. We can conclude that there is no evidence of multicollinearity among carat, cut, and clarity. Our predictors are statistically independent enough to where coefficient estimates are stable, noninflated SE, and no to adjust variables.

CI

CI_data <- data.frame (carat = 1, cut = "Ideal", clarity = "VS2")
predict(model_best, newdata = CI_data, interval = "confidence")
##        fit      lwr      upr
## 1 8.286216 8.226466 8.345966
predict(model_best, newdata = CI_data, interval = "prediction")
##        fit      lwr     upr
## 1 8.286216 7.561341 9.01109

Here, we produced confidence intervals for both the mean predicted value and the PIs of of a future predicted value for at least one combination of X’s. The combination we chose is carat = 1, cut = “Ideal”, and clarity = “VS2”.

For carat = 1, cut = “Ideal”, clarity = “VS2”

The 95% confidence intervals for the mean log(price) is given as:

  • Predicted log(price): 8.286

  • Achieved a lower bound of 8.226 and an upper bound of 8.346

  • Repeatedly taking a sample size n = 1000 and fit the same regression model, 95% of the confidence intervals for the mean log(price) of diamond with the set characteristics would contain the true mean log(price).

For carat = 1, cut = “Ideal”, clarity = “VS2”

The 95% prediction intervals for the mean log(price) is given as:

  • Achieved lower bound of 7.561 and an upper bound of 9.011.

Summary of Report

This project had many ups and downs, but we explored the impact of different diamond characteristics on price. Utilizing a real data set spanning thousands of observations, we worked with a random sample of 1,000 diamonds to make analysis manageable while still capturing the essential trends in the data.

Firstly we examined and visualized ALL variables to understand distributions and relationships. Setting a strong foundation and selecting most relevant predictors in order to produce a good model. I built simple and multiple linear regression models and applied transformation when assumptions were violated. Eventually, producing a final mode in which we took the log transform which proved to be both interpretable and highly effective. Confidence interval and prediction intervals allowed me to forecast future observations for diamonds under specific characteristics.

In the end, this project opened the multitude of insight hidden within the data set, revealing how measurable attributes like carat, cut, and clarity could impact price. Being able to extrapolate and apply to real world scenarios not only strengthened the practical values of analysis but also highlighted the importance of decision making through statistical modeling. Explaining the countless variables became a drag but effectively it allowed for proper comparisons.