## X carat cut color
## Min. : 3 Min. :0.2300 Length:1000 Length:1000
## 1st Qu.:13074 1st Qu.:0.4000 Class :character Class :character
## Median :25174 Median :0.7100 Mode :character Mode :character
## Mean :26143 Mean :0.8084
## 3rd Qu.:39066 3rd Qu.:1.0525
## Max. :53921 Max. :3.0000
## clarity depth table price
## Length:1000 Min. :56.20 Min. :52.00 Min. : 327.0
## Class :character 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 956.5
## Mode :character Median :61.90 Median :57.00 Median : 2602.0
## Mean :61.75 Mean :57.53 Mean : 4048.0
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5572.2
## Max. :67.40 Max. :66.00 Max. :18709.0
## x y z
## Min. :3.870 Min. :3.890 Min. :2.310
## 1st Qu.:4.718 1st Qu.:4.728 1st Qu.:2.910
## Median :5.695 Median :5.710 Median :3.550
## Mean :5.755 Mean :5.756 Mean :3.557
## 3rd Qu.:6.540 3rd Qu.:6.540 3rd Qu.:4.050
## Max. :9.130 Max. :9.030 Max. :6.160
carat: The weight of the diamond, measured in carats. A higher carat weight generally means a larger and more expensive diamond.
cut: The quality of the diamond cut, which affects how well it reflects light. Typical values include:
color: The color grade of the diamond, ranging from D (colorless) to Z (light yellow or brown). A lower grade (closer to D) indicates a more colorless diamond.
clarity: The clarity grade of the diamond, indicating the presence of inclusions or blemishes. Typical grades include:
depth: The total depth percentage of the diamond, calculated as: \[ \text{depth} = \frac{z}{\text{average of } (x, y)} \times 100 \] where:
table: The width of the diamond’s top facet as a percentage of its average diameter.
price: The price of the diamond in US dollars.
x: Length of the diamond in millimeters.
y: Width of the diamond in millimeters.
z: Height (depth) of the diamond in millimeters.
a1 = ggplot(diamond_sample, aes(x = depth)) +
geom_histogram(aes(y = ..density..), bins = 30, fill = "lavender", color = "black", alpha = 0.5) +
geom_density(color = 'red')+
labs(title = "Distribution of Diamond Depth", x = "Depth", y = "Frequency")+
theme_minimal()
a2 = ggplot(diamond_sample, aes(x = carat)) +
geom_histogram(aes(y = ..density..), bins = 30, fill = "darkblue", color = "black", alpha = 0.5) +
geom_density(color = 'red')+
labs(title = "Distribution of Diamond Carat", x = "Carat", y = "Frequency")+
theme_minimal()
a3 = ggplot(diamond_sample, aes(x = table)) +
geom_histogram(aes(y = ..density..), bins = 30, fill = "brown", color = "black", alpha = 0.5) +
geom_density(color = 'red')+
labs(title = "Distribution of Diamond Table", x = "Table", y = "Frequency")+
theme_minimal()
a4 = ggplot(diamond_sample, aes(x = price)) +
geom_histogram(aes(y = ..density..), bins = 30, fill = "darkgreen", color = "black", alpha = 0.5) +
geom_density(color = 'red')+
labs(title = "Distribution of Diamond Price", x = "Price", y = "Frequency")+
theme_minimal()
grid.arrange(a1, a2, a3, a4, nrow = 2, ncol = 2)The Distribution of Diamond Depth appears to be approximately normal, centered around a depth of 61%, with most values falling between 59% and 63%. The distribution is symmetric with a well-defined peak and minimal skewness, suggesting that diamond depth is consistently within a narrow range for most diamonds.
The Distribution of Diamond Carat shows a highly right-skewed distribution, with a majority of diamonds weighing less than 1 carat. There is a steep decline in frequency as carat weight increases, indicating that larger diamonds are significantly less common. The presence of multiple peaks suggests that some specific carat weights may be more popular or preferred in the market.
The Distribution of Diamond Table also appears to be approximately normal, though slightly right-skewed, centered around 57%. The range is relatively narrow, and the distribution shows a single prominent peak, indicating that most diamonds have a similar table percentage, reflecting standard cutting practices.
The Distribution of Diamond Price exhibits strong right skewness, with most diamonds priced under $5,000 and a long tail extending towards higher prices. This indicates that while the majority of diamonds are relatively affordable, there are a few exceptionally expensive ones that drive the tail, likely due to larger carat sizes or superior quality.
b1 = ggplot(diamond_sample, aes(x = x)) +
geom_histogram(aes(y = ..density..), bins = 30, fill = "lavender", color = "black", alpha = 0.5) +
geom_density(color = 'red')+
labs(title = "Distribution of Diamond Length ", x = "Length (mm)", y = "Frequency")+
theme_minimal()
b2 = ggplot(diamond_sample, aes(x = y)) +
geom_histogram(aes(y = ..density..), bins = 30, fill = "darkblue", color = "black", alpha = 0.5) +
geom_density(color = 'red')+
labs(title = "Distribution of Diamond Width", x = "Width (mm)", y = "Frequency")+
theme_minimal()
b3 = ggplot(diamond_sample, aes(x = z)) +
geom_histogram(aes(y = ..density..), bins = 30, fill = "brown", color = "black", alpha = 0.5) +
geom_density(color = 'red')+
labs(title = "Distribution of Diamond Height", x = "Height (mm)", y = "Frequency")+
theme_minimal()
grid.arrange(b1, b2, b3, nrow = 3)The Distribution of Diamond Length shows a right-skewed pattern, with most diamonds having lengths between 4 and 6 mm. There is a peak around 4.5 mm, and the frequency decreases as the length increases, indicating that longer diamonds are relatively uncommon.
The Distribution of Diamond Width also exhibits a right-skewed distribution, with most diamonds having widths between 4 and 6 mm. There is a noticeable peak near 4.5 mm, and the distribution gradually tails off as the width increases, suggesting that wider diamonds are less frequent.
The Distribution of Diamond Height demonstrates a pronounced right skew, with a peak around 3.5 mm. The frequency decreases steadily as height increases beyond this point, indicating that diamonds with greater heights are significantly less common. The distribution shows a few extreme values, hinting at the presence of taller diamonds that are rare in the dataset.
Cut_counts <- diamond_sample %>%
group_by(cut) %>%
summarise(Count = n()) %>%
mutate(Percentage = Count / sum(Count) * 100)
Color_counts <- diamond_sample %>%
group_by(color) %>%
summarise(Count = n()) %>%
mutate(Percentage = Count / sum(Count) * 100)
Clarity_counts <- diamond_sample %>%
group_by(clarity) %>%
summarise(Count = n()) %>%
mutate(Percentage = Count / sum(Count) * 100)
# Plot with percentage labels
ggplot(Cut_counts, aes(x = cut, y = Count, fill = factor(cut))) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(round(Percentage, 1), "%")), vjust = -0.5, size = 4) +
labs(
x = "Cut",
y = "Count",
fill = "Cut",
title = "Diamond Cut Distribution"
) +
scale_fill_brewer(palette = "Paired")+
theme_minimal()+
ylim(0,450) # Plot with percentage labels
ggplot(Color_counts, aes(x = color, y = Count, fill = factor(color))) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(round(Percentage, 1), "%")), vjust = -0.5, size = 4) +
labs(
x = "Color",
y = "Count",
fill = "Color",
title = "Diamond Color Distribution"
) +
scale_fill_viridis_d(option = "cividis")+
theme_minimal()+
ylim(0,220) # Plot with percentage labels
ggplot(Clarity_counts, aes(x = clarity, y = Count, fill = factor(clarity))) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(round(Percentage, 1), "%")), vjust = -0.5, size = 4) +
labs(
x = "Clarity",
y = "Count",
fill = "Clarity",
title = "Diamond Clarity Distribution"
) +
scale_fill_viridis_d(option = "inferno")+
theme_minimal()+
ylim(0,275) Our variables of interest will be:
numeric_cols = diamond_sample[, sapply(diamond_sample, is.numeric)]%>%
dplyr::select(-X, -x, - y, -z, -table)
# Correlation matrix
cor_matrix = cor(numeric_cols)
# Correlation visualization
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.6, addCoef.col = "black", number.cex = .4)## Df Sum Sq Mean Sq F value Pr(>F)
## cut 4 2.407e+08 60182136 3.83 0.00427 **
## Residuals 995 1.564e+10 15715381
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ANOVA for price by color
anova_color <- aov(price ~ color, data = diamond_sample)
summary(anova_color)## Df Sum Sq Mean Sq F value Pr(>F)
## color 6 5.397e+08 89956982 5.824 5.6e-06 ***
## Residuals 993 1.534e+10 15445912
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we cannot asses correlation between our variables for categorical ones with a correlation matrix, we have instead used an ANOVA. Our P-value indicates whether or not a relationship between the price and our categorical variables are statistically significant. we found statistical significance for Color to impact price, but not cut.
full_model <- lm(price ~ carat + depth + x + y + z + table + cut + color + clarity, data = diamond_sample)
summary(full_model)##
## Call:
## lm(formula = price ~ carat + depth + x + y + z + table + cut +
## color + clarity, data = diamond_sample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8902.6 -617.4 -144.1 389.6 7411.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5123.08 3719.72 1.377 0.168743
## carat 11527.97 388.93 29.640 < 2e-16 ***
## depth -57.00 46.36 -1.230 0.219134
## x -2949.33 843.82 -3.495 0.000495 ***
## y 1814.81 842.50 2.154 0.031478 *
## z 16.19 532.08 0.030 0.975734
## table -71.41 23.22 -3.076 0.002156 **
## cutGood -204.41 283.54 -0.721 0.471148
## cutIdeal 227.74 278.05 0.819 0.412934
## cutPremium 124.00 270.09 0.459 0.646250
## cutVery Good 63.04 273.94 0.230 0.818034
## colorE -325.17 135.64 -2.397 0.016701 *
## colorF -278.33 133.16 -2.090 0.036854 *
## colorG -519.03 131.66 -3.942 8.66e-05 ***
## colorH -1096.89 142.69 -7.687 3.66e-14 ***
## colorI -1541.38 160.57 -9.599 < 2e-16 ***
## colorJ -2494.53 197.47 -12.632 < 2e-16 ***
## clarityIF 5972.93 505.49 11.816 < 2e-16 ***
## claritySI1 3932.27 468.68 8.390 < 2e-16 ***
## claritySI2 2838.63 468.65 6.057 1.98e-09 ***
## clarityVS1 4769.50 474.15 10.059 < 2e-16 ***
## clarityVS2 4479.02 469.23 9.545 < 2e-16 ***
## clarityVVS1 5026.31 485.02 10.363 < 2e-16 ***
## clarityVVS2 5179.29 480.41 10.781 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1144 on 976 degrees of freedom
## Multiple R-squared: 0.9196, Adjusted R-squared: 0.9177
## F-statistic: 485.2 on 23 and 976 DF, p-value: < 2.2e-16
Upon intial inspection, I findi ti surprising how certain categorical variales categories such colorF and colorG can have vastly different relationships with Price. I also find it interesting how x is weighed in importance to the model but not y or z as these are alos measurements of size for the diamond, and on that note depth as well. And finally The extremely large coeficient for carat really stick out to me, further analysis must be done
##
## Call:
## lm(formula = price ~ carat, data = diamond_sample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9322.4 -800.0 -51.1 506.0 9902.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2156.8 96.8 -22.28 <2e-16 ***
## carat 7675.7 103.1 74.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1558 on 998 degrees of freedom
## Multiple R-squared: 0.8475, Adjusted R-squared: 0.8473
## F-statistic: 5544 on 1 and 998 DF, p-value: < 2.2e-16
Hypothesis Test
Our equation for our SLR is:
\[ \hat{y} = \hat{\beta_0 }+ \hat{\beta_1}x_1 \]
and thus we get the SLR for this model to be:
\[ \hat{price} = -2206.47+ 7593.80x_1 \]
Where:
plot(diamond_sample$carat, diamond_sample$price, main = "Price vs Carat",
xlab = "Carat", ylab = "Price", pch = 16, col = "purple")
abline(carat_model, col = "blue", lwd = 2)Hypothesis Testing
\(\begin{aligned}H_0 &: \beta_1 = 0 \end{aligned}\)
\(H_A:\beta_1 \neq 0\)
Our model has given us a t-value of 73.26, telling us the estimate for our coefficient of our carat is that many standard errors (103.66 units) away from 0. This t-value is extreme producing a p-value of <2e-16. This p-value gives us the probability of producing our value for \(\beta_1\) strictly due to chance based off of its standard error. Since our p-value is less than our statistical significance level of \(\alpha\) = 0.05, we reject the Null Hypothesis and conclude we have statistically significant evidence of there being a relationship between our diamonds carat value and its price.
R-squared and Adjusted R-squared:
An \(R^2\) of 0.8432 means that 84.32% of the variability in diamond price is explained by the model, suggesting that the model fits the data well.
Adjusted \(R^2\) accounts for the number of predictors and adjusts for over fitting. Since out model only has one predictor, the \(R^2\) and Adjusted \(R^2\) will be the same
Residual Standard Error (RSE)
Residual Standard Error (RSE) = 1586
Confidence Interval
## 2.5 % 97.5 %
## (Intercept) -2346.714 -1966.794
## carat 7473.425 7878.002
The 95% confidence interval for the intercept ranges from -2396.838 to -2016.110, indicating that the true intercept value lies within this range with 95% confidence.
The confidence interval for the carat coefficient ranges from 7390.383 to 7797.209, suggesting that for every one-unit increase in carat weight, the diamond price increases by approximately $7390 to $7797 95% of the time.
Prediction Interval:
new_data <- data.frame(carat = 0.5)
predict(carat_model, newdata = new_data, interval = "prediction")## fit lwr upr
## 1 1681.103 -1378.104 4740.31
We predict the outcome for this SLR with a carat measurement of .5 to fall between -1523.72 and 4704.568 95% of the time
Assumptions made for SLR are:
plot(carat_model$fitted.values, residuals(carat_model),
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted")
abline(h = 0, col = "red")Immediately seen the Residuals have a funnel shape, increasing in their deviation from their expected value as the fitted value increases. This is in violation of our assumptions fo SLR, Thus we will Transform our model using Log Transfomrations
Before doing so we will check the other assumptions of Normality and constant Variances:
We can see here that the assumptions of normality are violated in our QQ residual plot along with constant variance. Thus let us perform our log transfromation.
log_model<- lm(log(price) ~ log(carat), data = diamond_sample)
plot(log_model$fitted.values, residuals(log_model),
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted")
abline(h = 0, col = "red")Now the Linearity looks much better. Our Residuals deviation is constant throughout our graph and centered around 0. Let us check the normality condition as well
As seen now the Normality and other assumptions for are linear model are now met much better then before. we can move forward and anlyze our new model
##
## Call:
## lm(formula = log(price) ~ log(carat), data = diamond_sample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00112 -0.18199 0.00328 0.17844 1.15782
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.462485 0.009972 848.6 <2e-16 ***
## log(carat) 1.672504 0.014181 117.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2644 on 998 degrees of freedom
## Multiple R-squared: 0.9331, Adjusted R-squared: 0.933
## F-statistic: 1.391e+04 on 1 and 998 DF, p-value: < 2.2e-16
\[ \hat{price} = 8.424+ 1.685x_1 \]
Model Comparisons:
The log-log model has a significantly higher R-squared (0.9285) compared to the original linear model (0.843), indicating that it explains more variance in the data. The adjusted R-squared is also higher, confirming a better model fit even after accounting for the number of predictors.
The log-log model interprets the coefficient as an elasticity, showing how a percentage change in carat affects a percentage change in price. The original model interprets the coefficient as an absolute increase in price per unit increase in carat, which is less intuitive given the skewed nature of the data.
The log-log model has a much lower residual standard error (0.27), indicating less error in prediction compared to the original model (1586).
Since both the price and carat variables are skewed, the log transformation normalizes the distribution, leading to a more reliable and interpretative model. The log-log model is generally preferred when variables follow a power-law relationship or have exponential growth patterns.
In conclusion the Log model is much more suitable for the data then the original model.
Adding Clarity to the model increases \(R^2_{adj}\) from 0.9285 to 0.9663 This shows that clarity is a significant predictor for the price of a diamond.
The addition of depth and table variables keeps \(R^2_{adj}\) roughly the same at .931. This may indicate that the variables of depth and table are not significant to the model and do not have a strong relationship to the diamond price.
Adding \(y\) (width), \(x\) (length), and \(z\) (height) Also do not have much affect on the model bringing it to .931. This shows that they are not significant to the model change.
Adding color and cut slightly increases the \(R^2_{adj}\) to 0.946. This suggests that the color and cut could each have a relationship to price.
Since carat already took up a majority of the variation in the SLR graph, it wll be difficult to go one by one to check which model is the best, further analysis will be conducted.
Variance Inflation Factor (VIF) indicates how much multicollinearity (correlation between predictor variables) exists in the regression model. Generally:
VIF > 5: Moderate multicollinearity. VIF > 10: High multicollinearity (problematic). VIF > 100: Severe multicollinearity (critical).
For this model we will be looking at which factors have VIF scores much over 10
model2 <- lm(log(price) ~ log(carat)+ log(depth) + log(x) + log(y) + log(z) + log(table) + cut + color + clarity, data = diamond_sample)
faraway::vif(model2)## log(carat) log(depth) log(x) log(y) log(z) log(table)
## 1852.007622 9.993642 1192.799541 1141.867457 164.207292 2.327211
## cutGood cutIdeal cutPremium cutVery Good colorE colorF
## 5.287374 13.929067 10.414607 10.300024 2.012240 2.110587
## colorG colorH colorI colorJ clarityIF claritySI1
## 2.241717 1.932025 1.692887 1.394214 6.575367 28.422746
## claritySI2 clarityVS1 clarityVS2 clarityVVS1 clarityVVS2
## 24.452863 22.762664 29.272646 12.881169 15.352601
Severe Multicollinearity:
These values are critical and indicate severe multicollinearity, meaning that these variables are highly correlated with each other.
Moving Forward we will elimate xyz since we are aware these are all measurements of the diamond and see how this affects our VIF
model3 <- lm(log(price) ~ log(carat)+ log(depth) + log(table) + cut + color + clarity, data = diamond_sample)
faraway::vif(model3)## log(carat) log(depth) log(table) cutGood cutIdeal cutPremium
## 1.368182 1.409981 1.911568 5.052088 13.514338 10.315163
## cutVery Good colorE colorF colorG colorH colorI
## 9.735159 2.006444 2.108512 2.235184 1.927388 1.686389
## colorJ clarityIF claritySI1 claritySI2 clarityVS1 clarityVS2
## 1.389173 6.487150 28.236659 24.290196 22.562979 29.043764
## clarityVVS1 clarityVVS2
## 12.721164 15.198413
Although our model does still has 1 value over 10, it is only slightly over 10, and the VIF scores indicating Multicollinearitly look much better.
model0 = lm(log(price) ~ 1, diamond_sample)
step(model0,
scope = ~ log(carat)+ log(depth) + log(x) + log(y) + log(z) + log(table) + cut + color + clarity,
direction = "both", trace = 0)##
## Call:
## lm(formula = log(price) ~ log(y) + clarity + color + log(carat) +
## cut + log(x) + log(z), data = diamond_sample)
##
## Coefficients:
## (Intercept) log(y) clarityIF claritySI1 claritySI2
## 7.02698 -1.58665 1.25385 0.72747 0.54914
## clarityVS1 clarityVS2 clarityVVS1 clarityVVS2 colorE
## 0.94020 0.86204 1.14822 1.09788 -0.05107
## colorF colorG colorH colorI colorJ
## -0.10327 -0.16114 -0.24297 -0.38725 -0.50952
## log(carat) cutGood cutIdeal cutPremium cutVery Good
## 1.71658 0.04940 0.14102 0.09773 0.10599
## log(x) log(z)
## 1.65440 0.43654
model0 = lm(log(price) ~ 1, diamond_sample)
n = nrow(diamond_sample)
step(model0,
scope = ~ log(carat)+ log(depth) + log(x) + log(y) + log(z) + log(table) + cut + color + clarity,
direction = "both", trace = 0, k = log(n))##
## Call:
## lm(formula = log(price) ~ clarity + color + log(carat) + cut,
## data = diamond_sample)
##
## Coefficients:
## (Intercept) clarityIF claritySI1 claritySI2 clarityVS1
## 7.78781 1.23503 0.71441 0.53620 0.92372
## clarityVS2 clarityVVS1 clarityVVS2 colorE colorF
## 0.84809 1.13458 1.08083 -0.04796 -0.10073
## colorG colorH colorI colorJ log(carat)
## -0.15676 -0.24035 -0.38235 -0.50926 1.88454
## cutGood cutIdeal cutPremium cutVery Good
## 0.02116 0.11759 0.08255 0.07653
Our AIC and BIC models both include the following predictors:
Where our two models differ is in the fact that our AIC has decided to include preditor y as part of the model where our BIC model has decided to. This makes sense as BIC tends to penalize models more heavily for the inclusion of additional predictors. We can check the \(R^2\) values for both of these models to decide which to move forward with.
AIC <- lm(formula = log(price) ~ log(y) + clarity + color + log(carat) +
cut + log(x) + log(depth), data = diamond_sample)
BIC <- lm(formula = log(price) ~ clarity + color + log(carat) + cut +
log(x) + log(depth), data = diamond_sample)
summary(AIC)##
## Call:
## lm(formula = log(price) ~ log(y) + clarity + color + log(carat) +
## cut + log(x) + log(depth), data = diamond_sample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49555 -0.08487 0.00238 0.08583 0.45565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.61985 3.20538 1.441 0.149826
## log(y) -1.32534 0.67719 -1.957 0.050616 .
## clarityIF 1.25571 0.05787 21.697 < 2e-16 ***
## claritySI1 0.72945 0.05335 13.672 < 2e-16 ***
## claritySI2 0.55157 0.05351 10.308 < 2e-16 ***
## clarityVS1 0.94221 0.05400 17.448 < 2e-16 ***
## clarityVS2 0.86401 0.05353 16.142 < 2e-16 ***
## clarityVVS1 1.15294 0.05535 20.831 < 2e-16 ***
## clarityVVS2 1.10004 0.05482 20.065 < 2e-16 ***
## colorE -0.05129 0.01558 -3.293 0.001027 **
## colorF -0.10357 0.01527 -6.780 2.07e-11 ***
## colorG -0.16020 0.01508 -10.623 < 2e-16 ***
## colorH -0.24354 0.01635 -14.899 < 2e-16 ***
## colorI -0.38775 0.01834 -21.140 < 2e-16 ***
## colorJ -0.50990 0.02250 -22.662 < 2e-16 ***
## log(carat) 1.69127 0.27463 6.158 1.07e-09 ***
## cutGood 0.04592 0.03234 1.420 0.155962
## cutIdeal 0.13734 0.03103 4.426 1.07e-05 ***
## cutPremium 0.09465 0.03083 3.069 0.002203 **
## cutVery Good 0.10340 0.03122 3.312 0.000962 ***
## log(x) 1.90694 0.70212 2.716 0.006725 **
## log(depth) 0.49826 0.42773 1.165 0.244352
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1313 on 978 degrees of freedom
## Multiple R-squared: 0.9838, Adjusted R-squared: 0.9835
## F-statistic: 2830 on 21 and 978 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = log(price) ~ clarity + color + log(carat) + cut +
## log(x) + log(depth), data = diamond_sample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50222 -0.08587 0.00000 0.08693 0.47092
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99987 2.62166 0.381 0.70300
## clarityIF 1.24124 0.05748 21.593 < 2e-16 ***
## claritySI1 0.72010 0.05322 13.532 < 2e-16 ***
## claritySI2 0.54177 0.05335 10.155 < 2e-16 ***
## clarityVS1 0.93094 0.05377 17.313 < 2e-16 ***
## clarityVS2 0.85343 0.05333 16.003 < 2e-16 ***
## clarityVVS1 1.14076 0.05508 20.712 < 2e-16 ***
## clarityVVS2 1.08884 0.05460 19.941 < 2e-16 ***
## colorE -0.05107 0.01560 -3.274 0.00110 **
## colorF -0.10320 0.01530 -6.747 2.58e-11 ***
## colorG -0.16033 0.01510 -10.617 < 2e-16 ***
## colorH -0.24390 0.01637 -14.901 < 2e-16 ***
## colorI -0.38581 0.01834 -21.035 < 2e-16 ***
## colorJ -0.51242 0.02250 -22.778 < 2e-16 ***
## log(carat) 1.37876 0.22377 6.162 1.05e-09 ***
## cutGood 0.03395 0.03181 1.067 0.28609
## cutIdeal 0.12302 0.03020 4.074 5.00e-05 ***
## cutPremium 0.09018 0.03079 2.929 0.00348 **
## cutVery Good 0.08846 0.03032 2.918 0.00361 **
## log(x) 1.52901 0.67602 2.262 0.02393 *
## log(depth) 0.95463 0.35910 2.658 0.00798 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1315 on 979 degrees of freedom
## Multiple R-squared: 0.9837, Adjusted R-squared: 0.9834
## F-statistic: 2962 on 20 and 979 DF, p-value: < 2.2e-16
Both Produce a very good \(R^2\) value of .983, and i will be using the BIC model as it uses less predictors for my final model.
final_model = lm(log(price) ~ clarity + color + log(carat) + cut + log(x) + log(depth), data = diamond_sample)
# Define a new data point for prediction (hypothetical values)
new_data <- data.frame(
clarity = "SI1", # Example clarity level
color = "E", # Example color grade
cut = "Premium", # Example cut quality
carat = 0.75, # Log-transformed carat value
x = 5.5, # Log-transformed length
depth = 61 # Log-transformed depth percentage
)
# Predict with confidence and prediction intervals
predictions <- predict(final_model, newdata = new_data, interval = "prediction", level = 0.95)
confidence <- predict(final_model, newdata = new_data, interval = "confidence", level = 0.95)
predictions## fit lwr upr
## 1 7.893379 7.619736 8.167023
## fit lwr upr
## 1 7.893379 7.802518 7.98424
The CI provides the range in which the mean predicted price lies, given the specified combination of variables. It represents the precision of the mean estimate from the model. We say the Mean predicted log - price lies within this range 95% of the time.
The PI shows the range in which an individual predicted price is likely to fall. This interval is wider than the CI because it accounts for both the uncertainty of the mean estimate and the variability of individual responses. We say the predicted log - price lies within this range 95% of the time.
The final model we developed to predict diamond prices incorporates a carefully selected combination of predictors: clarity, color, cut, log(carat), log(x), and log(depth). These variables were chosen through a thorough process of exploratory data analysis, model testing, and diagnostic evaluation, ensuring that the final model is both accurate and robust. The primary criterion for selecting these variables was the adjusted R-squared value, which reflects the model’s ability to explain the maximum variability in price while penalizing for the inclusion of non-informative predictors. By achieving a high adjusted R-squared, we ensure that the model captures essential relationships without overfitting, making it generalizable to new data.
During the model development process, we examined a wide range of potential predictors, including carat, x (length), y (width), z (height), depth, table, and various categorical variables like clarity, color, and cut. However, multicollinearity posed a significant challenge, especially among carat, x, y, and z. To address this, we performed Variance Inflation Factor (VIF) analysis and dropped highly correlated variables, retaining log(carat) as the primary size predictor. Log transformations were applied to continuous variables like price, carat, x, and depth to normalize their distributions and establish linear relationships with the outcome variable. These transformations not only improved model accuracy but also enhanced interpretability by allowing the coefficients to represent percentage changes.
Additionally, we validated the model by calculating Confidence Intervals (CI) and Prediction Intervals (PI) to quantify the uncertainty in the predictions. The CI provides the range for the mean predicted price, while the PI gives the range for individual predictions, both at a 95% confidence level. Diagnostic tests confirmed the model’s assumptions of linearity, homoscedasticity, normality, and independence, reinforcing the model’s validity. By prioritizing variable selection, addressing multicollinearity, and performing thorough diagnostic checks, we ensured that our final model not only achieves high predictive accuracy but also maintains reliability and practical utility in real-world applications.