diamonds <- read.csv("Diamonds Prices2022.csv")
For this experiment, we’ll be taking a sample of n = 1000
set.seed(333)
sample_data <- diamonds %>% sample_n(1000)
str(sample_data)
## 'data.frame': 1000 obs. of 11 variables:
## $ X : int 5545 37442 47420 20102 20012 31079 43583 8518 4686 17542 ...
## $ carat : num 1.02 0.33 0.57 1.22 0.32 0.33 0.5 1.04 1.01 1.01 ...
## $ cut : chr "Premium" "Ideal" "Ideal" "Premium" ...
## $ color : chr "I" "G" "E" "F" ...
## $ clarity: chr "SI2" "IF" "VS2" "VS2" ...
## $ depth : num 61.2 61.9 61.6 62 63.4 61.5 62.2 62.1 62.3 59.1 ...
## $ table : num 59 54 56 58 55 56 60 56 55 55 ...
## $ price : int 3856 984 1851 8576 421 752 1431 4426 3674 7059 ...
## $ x : num 6.51 4.42 5.33 6.86 4.32 4.45 5.15 6.54 6.35 6.6 ...
## $ y : num 6.46 4.46 5.38 6.89 4.35 4.49 5.05 6.47 6.29 6.53 ...
## $ z : num 3.97 2.75 3.3 4.26 2.75 2.75 3.17 4.04 3.94 3.88 ...
summary(sample_data)
## X carat cut color
## Min. : 95 Min. :0.2300 Length:1000 Length:1000
## 1st Qu.:13016 1st Qu.:0.3975 Class :character Class :character
## Median :26409 Median :0.7000 Mode :character Mode :character
## Mean :26372 Mean :0.7898
## 3rd Qu.:39420 3rd Qu.:1.0500
## Max. :53807 Max. :2.8000
## clarity depth table price
## Length:1000 Min. :56.30 Min. :50.10 Min. : 373
## Class :character 1st Qu.:61.10 1st Qu.:56.00 1st Qu.: 917
## Mode :character Median :61.80 Median :57.00 Median : 2338
## Mean :61.77 Mean :57.41 Mean : 3907
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5162
## Max. :69.70 Max. :67.00 Max. :18730
## x y z
## Min. :3.88 Min. :3.920 Min. : 2.350
## 1st Qu.:4.69 1st Qu.:4.710 1st Qu.: 2.900
## Median :5.68 Median :5.700 Median : 3.510
## Mean :5.71 Mean :5.713 Mean : 3.557
## 3rd Qu.:6.53 3rd Qu.:6.540 3rd Qu.: 4.040
## Max. :9.03 Max. :8.980 Max. :31.800
ggplot(diamonds, aes(price)) +
geom_histogram(bins = 30, fill = "lightgreen", color = "black") +
theme_minimal()
ggplot(diamonds, aes(depth)) +
geom_histogram(bins = 30, fill = "lightpink", color = "black") +
theme_minimal()
ggplot(diamonds, aes(table)) +
geom_histogram(bins = 30, fill = "lightgray", color = "black") +
theme_minimal()
ggplot(diamonds, aes(x = carat)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
theme_minimal()
ggplot(diamonds, aes(x)) +
geom_histogram(bins = 30, fill = "orange", color = "black") +
theme_minimal()
ggplot(diamonds, aes(y)) +
geom_histogram(bins = 30, fill = "magenta", color = "black") +
theme_minimal()
ggplot(diamonds, aes(z)) +
geom_histogram(bins = 30, fill = "yellow", color = "black") +
theme_minimal()
ggplot(diamonds, aes(cut)) +
geom_bar(fill = "steelblue") +
theme_minimal()
ggplot(diamonds, aes(clarity)) +
geom_bar(fill = "purple") +
theme_minimal()
ggplot(diamonds, aes(color)) +
geom_bar(fill = "orange") +
theme_minimal()
Examining the continuous variables, we have “carat”, “depth”, “table”, “price”, and the dimensions x, y, and z.
Our Categorical variables include “cut”, “color”, and “clarity”.
Some notes we have from the obtained histograms.
Price is right skewed as this indicates that most diamonds are small.
Depth is normally distributed at roughly 61% to 62%.
Table is roughly uniform with a emphasis on the range 56-60.
Carat is right-skewed as most diamonds are on the lower end with a few being very expensive. This is expected as we know diamonds overall are quite rare.
The dimension variables x, y, and z represent the length, width, and depth in mm respectively. All show a right-skewed distribution which reflects on the principle of how most diamonds fall within a typical size range with larger diamonds being very uncommon.
Some notes we have from the obtained barplots.
For cut, we can see that ideal and premium are the most popular cuts with very good trailing in third.
For clarity, SI1 and VS2 are the most abundant in terms of observation count.
For color, the most common grades tend to fall around the mid-range , being G. This alligns with market availability and preference.
Overall, the distribution shapes we’ve obtained gives us insight into what makes diamonds so expensive as well as variables that factor into what makes diamonds vary to such degrees. Progressing forward, we will be analyzing the multitude of variables involved and finding correlations with one another.
For our analysis, we will be utilizing the quantitative variables carat, depth, and table. Categorical variables will be cut and clarity.
Carat is highly correlated with price and almost universal when it comes to assessing diamonds.
Depth would be a great variable to observe it’s influence on cut quality.
Table allows us to balance the proportion of the variables involved.
Cut is a huge factor when it comes to quality grades of diamonds. Has a very strong association with price.
Clarity allows for good variety in categories.
model <- lm (price ~ carat + depth + table + cut + clarity, data = sample_data)
summary(model)
##
## Call:
## lm(formula = price ~ carat + depth + table + cut + clarity, data = sample_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4705.3 -712.3 -90.0 486.4 6667.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2893.73 3331.14 -0.869 0.385228
## carat 8563.86 95.21 89.944 < 2e-16 ***
## depth -18.04 36.38 -0.496 0.620189
## table -34.32 25.80 -1.330 0.183788
## cutGood 476.68 294.75 1.617 0.106145
## cutIdeal 713.62 292.24 2.442 0.014784 *
## cutPremium 530.90 283.57 1.872 0.061474 .
## cutVery Good 500.79 281.59 1.778 0.075635 .
## clarityIF 3380.02 581.70 5.811 8.40e-09 ***
## claritySI1 2082.17 541.15 3.848 0.000127 ***
## claritySI2 1157.24 543.26 2.130 0.033404 *
## clarityVS1 3056.89 545.30 5.606 2.69e-08 ***
## clarityVS2 2854.70 543.15 5.256 1.81e-07 ***
## clarityVVS1 3598.80 557.40 6.456 1.68e-10 ***
## clarityVVS2 3788.32 550.83 6.877 1.08e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1301 on 985 degrees of freedom
## Multiple R-squared: 0.8965, Adjusted R-squared: 0.895
## F-statistic: 609.2 on 14 and 985 DF, p-value: < 2.2e-16
Here, we ran the multiple linear regression model using price as the primary response variable and carat, depth, table, cut, and clarity as our predictors. Shown above is the summary output and here, we can examine the estimates for each coefficient as well as very useful summaries such as SE, R-squared, adjusted R-squared, RSE, etc.
Comments:
Overall, the multiple linear regression model we produced performed extraordinarily well given the output.
Achieved an R-Squared value of 0.8965 which means that 89.65% of the variability of Y can be explained by the predictors. This means that carat, depth, table, cut, and clarity are strong predictors of price in diamond.
We can see that carat heavily dominates the model with a positive estimate of 8563. This makes sense as carat is one of the largest factor when it comes to pricing in diamonds. The predictor’s depth and table shows very little contribution to our model. This was quite surprising as this suggests that the presence of carat and quality lowers the impact of depth and table when it comes to affecting price.
Our categorical variables gave us varying conclusions. The cut level did not differ substantially from the baseline which wasn’t expected of. Normally, we would attribute cut quality as a huge factor in pricing.
Our clarity categories showed large positive effects with higher clarity diamonds. This includes IF, VS1, and VS2 while the others aren’t far behind. Higher-clarity diamonds are associated with higher prices, aligning with what we would see in a real world setting.
simple_model <- lm(price ~ carat, data = sample_data)
summary(simple_model)
##
## Call:
## lm(formula = price ~ carat, data = sample_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4843.1 -910.8 -39.3 549.0 8048.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2312.30 96.68 -23.92 <2e-16 ***
## carat 7874.33 105.24 74.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1562 on 998 degrees of freedom
## Multiple R-squared: 0.8487, Adjusted R-squared: 0.8486
## F-statistic: 5599 on 1 and 998 DF, p-value: < 2.2e-16
ggplot(sample_data, aes(x = carat, y = price)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
We got an intercept of -2312.30 and a slope of 7874.33 This means that for every one-carat increase in diamond weight, the predicted price increases by approximately $7874.33 on average. A strong correlation between carat and price.
H0: b1 = 0 (carat has no effect on price)
Ha: b1 NE 0 (carat affects price)
Conducting our hypothesis test, our p-value is < 2e-16 which indicates that it’s very significant, so we reject the null hypothesis. We can conclude that carat has a statistically significant effect on price, indicating a strong linear relationship between the size and price. Adding on to that, our obtained estimate for carat is positively large.
We obtained an R^2 value of 0.8487 which translates to carat explaining 84.87% of the variation in diamond prices. As predicted, carat is an extremely strong predictor of price. Our obtained adjusted R^2 is 0.8486 which is extremely close to our R^2. This means it’s good.
confint(simple_model)
## 2.5 % 97.5 %
## (Intercept) -2502.026 -2122.571
## carat 7667.823 8080.846
The 95% confidence interval for the intercept is (-2502.026, -2122.571). This represents the predicted price when carat = 0. Not really meaningful in anyway.
The 95% confidence interval for carat is (7667.823, 8080.846). This means that as carat increases, price increases. To elaborate, for each additional carat, the price is expected to increase by an additional $7,667 to $8,080 on average.
new <- data.frame(carat = 1)
predict(simple_model, new, interval = "prediction")
## fit lwr upr
## 1 5562.036 2495.248 8628.824
For our prediction interval, we are predicting the price for a 1-carat diamond.
Here, we get a fit of 5562.036, which is the predicted price for a diamond with the given carat value (1).
lwr of 2496.248 which represents the lower bound of the 95% prediction interval.
upr of 8628.824 which represents the upper bound of the 95% prediction interval.
With everything together, for a diamond with a carat value of 1, we predict its price to be about $5,562.036 Conducting a 95% confidence will reveal that the carat is expected to have a price between $2,495 and $8628.
plot(simple_model$residuals)
hist(simple_model$residuals)
plot(simple_model, which = 1)
plot(simple_model, which = 2)
Observing our plots and summary, the residual vs. fitted plot shows an increasing spread as fitted values increase. This would indicate heterosedasticity or, constant variance. This is as expected because higher-carat diamonds equate to higher varying prices.
Testing for linearity, we examine the Residuals vs Fitted plot. From the graph, we can see that the residuals show a cured pattern where residuals are negative for small carat values,, positive in the middle portion, and negative for larger carat values. Therefore, the assumption is violated. This means that the relationship between price and carat is not perfectly linear.
The Residuals vs Fitted plot also shows the residuals spread out more as fitted values increase. This reflects the price variability increases for larger diamonds. Therefore, the constant variance assumption is violated.
Looking at the QQ-plot, we can see that the plot shows noticeable deviation from the straight reference line in both tails. The lower tail bends downwards while the upper tail bends upward, indicating the presence of heavy tails and outliers. We sampled a size of n = 1000, where even small deviations become visible. In our case however, the departures are substantial enough to conclude that the residuals are not normally distributed. Therefore, normality assumption is violated.
Evaluating the regression assumptions utilizing our plots, I’ve discovered evidence of nonlinearity, heteroscedasticity (constant variance), and non-normal residuals. We can address these violations through log-transformation to the response variable price.
model_log <- lm(log(price) ~ carat, data = sample_data)
summary(model_log)
##
## Call:
## lm(formula = log(price) ~ carat, data = sample_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.21420 -0.23875 0.03182 0.24433 1.06519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.17011 0.02411 255.94 <2e-16 ***
## carat 2.02211 0.02624 77.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3894 on 998 degrees of freedom
## Multiple R-squared: 0.8561, Adjusted R-squared: 0.856
## F-statistic: 5938 on 1 and 998 DF, p-value: < 2.2e-16
plot(model_log, which = 1)
plot(model_log, which = 2)
After transformations, our QQ-plot is much straighter with less variability and our residuals vs fitted will show much more constant variance. Our R^2 should increase in turn.
Our intercept is 6.17011 which is when carat = 0, the expected log(price) = 6.17. This corresponds to an estimated price of e^6.17011 = 4,765. Note: Not very applicable as carat = 0 isn’t meaningful.
Achieved slope is 2.02211. This represents for each additional 1 carat, the expected price multiplies by: e^2.02211 = 7.55. Every 1 carat increase multiplies the price by about 7.5x which matches reality with how diamond prices increase exponentially.
RSE = 0.3894 which is smaller compared to our other model.
R^2 = 0.8561 which is similar to our original untransformed model.
The addition of log transformation causes residual variance to be more constant, normal, and the model assumptions are much better satisfied.
Before log transformation:
Right-skewed
Heteroscedasticity
non-normal residuals
After log transformation:
Tighter residuals
Straighter QQ-plot
Constant variance
less branching outliers.
Adding depth to the model decreased the adjusted R^2, so we excluded it.
Adding table increased the adjusted R^2 slight, but minimal improvement.
Adding cut increased the adjusted R^2 more noticeably.
Adding clarity produced the largest improvement in model performance, significantly increasing R^2.
The final model will include carat, cut, and clarity has these variables best improve the overall model without causing overfitting.
model_best <- lm(log(price) ~ carat + cut + clarity, data = sample_data)
summary(model_best)
##
## Call:
## lm(formula = log(price) ~ carat + cut + clarity, data = sample_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.16232 -0.23639 0.04429 0.25762 0.96451
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.837657 0.163777 35.644 < 2e-16 ***
## carat 2.126935 0.026780 79.424 < 2e-16 ***
## cutGood -0.038122 0.080519 -0.473 0.63599
## cutIdeal 0.024371 0.072601 0.336 0.73718
## cutPremium -0.059935 0.073718 -0.813 0.41639
## cutVery Good 0.003515 0.074230 0.047 0.96224
## clarityIF 0.433874 0.164388 2.639 0.00844 **
## claritySI1 0.186457 0.152908 1.219 0.22298
## claritySI2 0.046987 0.153384 0.306 0.75941
## clarityVS1 0.345604 0.154034 2.244 0.02507 *
## clarityVS2 0.297253 0.153484 1.937 0.05307 .
## clarityVVS1 0.365375 0.157609 2.318 0.02064 *
## clarityVVS2 0.462129 0.155625 2.970 0.00305 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3681 on 987 degrees of freedom
## Multiple R-squared: 0.8729, Adjusted R-squared: 0.8713
## F-statistic: 564.7 on 12 and 987 DF, p-value: < 2.2e-16
Above is the best model summary we produced with carat, cut, and clarity. The summary output produced gives us:
Adjusted R^2 of 0.8713, which means the model explains 87.13% of the variation in log(price). A very strong fit.
F-statistic is extremely large at 564.7 with a p-value < 2.2e-16, indicating the mode overall is highly significant.
Carat had an estimate of 2.1269 which is the strongest predictor in the model. For each 1 unit increase in carat, the expected log(price) increases by 2.1269. In pricing terms e^2.1269 = 8.39, meaning each additional carat multiplies the price by 8.4x.
All of the cut levels produced weren’t statistically significant despite being marketed as an important factor when purchasing diamonds. Price effect observed is smaller than anticipated.
Several clarity categories are significant with higher clarity grades leading to an increase in price.
RSE = 0.3681, low for log-transformed prices.
modelz <- lm(log(price) ~ carat + cut + clarity, data = sample_data)
vif(modelz)
## GVIF Df GVIF^(1/(2*Df))
## carat 1.165628 1 1.079642
## cut 1.084322 4 1.010171
## clarity 1.216555 7 1.014100
Cut and clarity are categorical variables –> GVIF1/(2 x df)
Carat: 1.08
Cut: 1.01
Clarity: 1.01
All VIF values obtained are very close to 1, far below the rule of thumb threshold of 5 and 10. We can conclude that there is no evidence of multicollinearity among carat, cut, and clarity. Our predictors are statistically independent enough to where coefficient estimates are stable, noninflated SE, and no to adjust variables.
CI_data <- data.frame (carat = 1, cut = "Ideal", clarity = "VS2")
predict(model_best, newdata = CI_data, interval = "confidence")
## fit lwr upr
## 1 8.286216 8.226466 8.345966
predict(model_best, newdata = CI_data, interval = "prediction")
## fit lwr upr
## 1 8.286216 7.561341 9.01109
Here, we produced confidence intervals for both the mean predicted value and the PIs of of a future predicted value for at least one combination of X’s. The combination we chose is carat = 1, cut = “Ideal”, and clarity = “VS2”.
For carat = 1, cut = “Ideal”, clarity = “VS2”
The 95% confidence intervals for the mean log(price) is given as:
Predicted log(price): 8.286
Achieved a lower bound of 8.226 and an upper bound of 8.346
Repeatedly taking a sample size n = 1000 and fit the same regression model, 95% of the confidence intervals for the mean log(price) of diamond with the set characteristics would contain the true mean log(price).
For carat = 1, cut = “Ideal”, clarity = “VS2”
The 95% prediction intervals for the mean log(price) is given as:
This project had many ups and downs, but we explored the impact of different diamond characteristics on price. Utilizing a real data set spanning thousands of observations, we worked with a random sample of 1,000 diamonds to make analysis manageable while still capturing the essential trends in the data.
Firstly we examined and visualized ALL variables to understand distributions and relationships. Setting a strong foundation and selecting most relevant predictors in order to produce a good model. I built simple and multiple linear regression models and applied transformation when assumptions were violated. Eventually, producing a final mode in which we took the log transform which proved to be both interpretable and highly effective. Confidence interval and prediction intervals allowed me to forecast future observations for diamonds under specific characteristics.
In the end, this project opened the multitude of insight hidden within the data set, revealing how measurable attributes like carat, cut, and clarity could impact price. Being able to extrapolate and apply to real world scenarios not only strengthened the practical values of analysis but also highlighted the importance of decision making through statistical modeling. Explaining the countless variables became a drag but effectively it allowed for proper comparisons.
Comments
While conducting this portion of the project, I discovered that variables such as depth and table, had very little contribution to explanatory power to the model given the introduction of carat. These variable define the dimensions of the diamond yet, they have very little influence on R^2. Clarity on the other hand had the single most impactful increase in R^2.