In this assignment, I will analyze the Diamonds dataset from R. I’ll begin by reviewing the data dictionary to understand variable definitions and explore how these features potentially interact with one another.
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
price: Integer, in US dollars ($326–$18,823).
carat: Numeric, weight (0.2–5.01).
cut: Ordinal factor (Fair to Ideal).
color: Ordinal factor (J to D, worst to best).
clarity: Ordinal factor (I1 to IF, worst to best).
x, y, z: Numeric,length, width, and depth in mm.
depth: Numeric, total depth percentage
table: Numeric, width of top relative to widest point.
diamonds$cut <- factor(diamonds$cut, ordered = FALSE )
diamonds$color <- factor(diamonds$color, ordered = FALSE)
diamonds$clarity <- factor(diamonds$clarity, ordered = FALSE)
str(diamonds)## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Factor w/ 5 levels "Fair","Good",..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Factor w/ 8 levels "I1","SI2","SI1",..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
matrix <- cor(diamonds[sapply(diamonds, is.numeric)])
corrplot(matrix, method = "color",
order = "hclust",
addCoef.col = "gray",
number.cex = 0.8,
tl.col = "black",
tl.srt = 45) Carat and Price show a very strong
positive relationship with a correlation of 0.92. This
indicates that as the carat of the diamond increases, the price tends to
increase significantly.
Physical Dimensions (x, y, z) and Price
also showing strong positive relationship with correlations ranging from
0.86 to 0.88. This means that larger diamonds in terms
of length, width, and depth are consistently associated with higher
prices.
Carat and Dimensions (x, y, z) are
showing extremely strong positive relationships (Cor = 0.95 to
0.98). This suggests that these variables are nearly
multicollinear; as a diamond’s physical size increases, its carat weight
increases almost proportionately.
Table and Depth show a moderate
negative relationship with a correlation of -0.30. This
implies that diamonds with a larger table tend to have a slightly lower
depth percentage.
Depth and Price show a negligible
relationship with a correlation of -0.01. This suggests that the depth
percentage of a diamond has almost no linear impact on its price.
Similarly, table has a very weak relationship with price (Cor =
0.13).
Surprisingly,
Idealcut diamonds have lowest median price, on the other handPremiumcut generally sits at the top of the price range, but every category contains outliers reaching up above $15,000. This suggests that while “cut” matters, other factors are likely the have affect on driving these prices up.
Claritygrade forSI2is showing highest median price andWS1andIFare showing lowest median price. All clarity grades consist of outliers.
Color
Jhas the highest median price followed byIandH. ColorDandEseems to have lowest median price.
Fit the full multiple linear regression using all provided predictors. Then interpret each regression coefficient except the intercept. Each interpretation must:
● explicitly say “holding other variables fixed,”
● explain direction (increase/decrease),
● comment on magnitude in the units of the variables (or per 1-unit change / per meaningful change if rescaling is needed).
##
## Call:
## lm(formula = price ~ ., data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21376.0 -592.4 -183.5 376.4 10694.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2184.477 408.197 5.352 8.76e-08 ***
## carat 11256.978 48.628 231.494 < 2e-16 ***
## cutGood 579.751 33.592 17.259 < 2e-16 ***
## cutVery Good 726.783 32.241 22.542 < 2e-16 ***
## cutPremium 762.144 32.228 23.649 < 2e-16 ***
## cutIdeal 832.912 33.407 24.932 < 2e-16 ***
## colorE -209.118 17.893 -11.687 < 2e-16 ***
## colorF -272.854 18.093 -15.081 < 2e-16 ***
## colorG -482.039 17.716 -27.209 < 2e-16 ***
## colorH -980.267 18.836 -52.043 < 2e-16 ***
## colorI -1466.244 21.162 -69.286 < 2e-16 ***
## colorJ -2369.398 26.131 -90.674 < 2e-16 ***
## claritySI2 2702.586 43.818 61.677 < 2e-16 ***
## claritySI1 3665.472 43.634 84.005 < 2e-16 ***
## clarityVS2 4267.224 43.853 97.306 < 2e-16 ***
## clarityVS1 4578.398 44.546 102.779 < 2e-16 ***
## clarityVVS2 4950.814 45.855 107.967 < 2e-16 ***
## clarityVVS1 5007.759 47.160 106.187 < 2e-16 ***
## clarityIF 5345.102 51.024 104.757 < 2e-16 ***
## depth -63.806 4.535 -14.071 < 2e-16 ***
## table -26.474 2.912 -9.092 < 2e-16 ***
## x -1008.261 32.898 -30.648 < 2e-16 ***
## y 9.609 19.333 0.497 0.619
## z -50.119 33.486 -1.497 0.134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1130 on 53916 degrees of freedom
## Multiple R-squared: 0.9198, Adjusted R-squared: 0.9198
## F-statistic: 2.688e+04 on 23 and 53916 DF, p-value: < 2.2e-16
Multiple R-squared = 0.9198 means that our model explains approximately 92% of the variation in diamond prices.
Adjusted R-squared = 0.9198 means that after adjusting for the number of predictors, the model still explains 92% of the variation, suggesting the included variables are highly relevant.
The coefficient for carat is
11256.978, meaning that holding all other variables
fixed, a 1 unit increase in carat weight is associated with
an increase in predicted price of about $11,257. The p-value < 2e-16
is highly significant, indicating carat weight is a primary driver of
price.
In terms of cut quality relative to the baseline
(Fair), all levels show a statistically significant positive effect (p
< 2e-16). Holding other variables fixed, Good cuts increase predicted
price by $579.75, Very Good by
$726.78, Premium by $762.14, and Ideal
cuts show the largest increase of $832.91 compared to
Fair cuts.
In terms of color relative to the baseline (D), all
coefficients are negative and highly significant (p < 2e-16). Holding
other variables fixed, as the color grade moves down the alphabet toward
J, the predicted price decreases steadily, with a color J diamond
costing about $2,369.40 less than a color D
diamond.
In terms of clarity relative to the baseline (I1),
all levels show a statistically significant increase in price (p <
2e-16). Holding other variables fixed, price increases as clarity
improves, ranging from an extra $2,702.59 for SI2 up to
$5,345.10 for IF (Internally Flawless) compared to the
baseline.
The coefficient for depth is
-63.806, meaning that holding other variables fixed, 1
percentage point increase in total depth percentage is associated with a
decrease in predicted price of $63.81. The p-value <
2e-16 is highly significant.
The coefficient for table is
-26.474, meaning that holding other variables fixed, 1
unit increase in table width is associated with a decrease in predicted
price of $26.47. The p-value < 2e-16 is highly
significant.
The coefficient for x (length) is
-1008.261, meaning that holding other variables fixed,
a 1mm increase in length is associated with a decrease in predicted
price of $1,008.26. The p-value < 2e-16 is highly
significant. `
The coefficient for y (width) is
9.609, meaning that holding other variables fixed, a
1mm increase in width is associated with an increase in predicted price
of $9.61. However, the p-value = 0.619 is
not statistically significant, suggesting width does not have a
meaningful independent effect in this model.
The coefficient for z (depth in mm) is
-50.119, meaning that holding other variables fixed, a
1mm increase in depth is associated with a decrease in predicted price
of $50.12. The p-value = 0.134 is not statistically
significant at the 0.05 level, indicating no meaningful independent
relationship with price.
Carat: With a t-value of 231.49 and a p-value <
2e-16, this is the most statistically significant predictor in the
model. It has a massive positive impact on price and is an essential
contributor.
Depth and Table: Both have highly
significant p-values (< 2e-16). Despite having smaller coefficients
than carat, their t-statistics (-14.07 and -9.09 respectively) indicate
they are important contributors to the model’s accuracy.
X (Length): This dimension is highly significant (p
< 2e-16) with a strong negative relationship (t-value of -30.65),
suggesting it provides critical information not captured by carats
alone. Non-Significant Predictors
Y (Width) and Z (Depth in mm): These
variables have p-values of 0.619 and 0.134, respectively. Since these
values are well above the standard 0.05 threshold, individual t-tests
suggest they do not provide a meaningful independent contribution to
predicting price.
reduced_model <- lm(price ~ carat + cut + color + clarity + depth + table + x, data=diamonds)
summary(reduced_model)##
## Call:
## lm(formula = price ~ carat + cut + color + clarity + depth +
## table + x, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21385.0 -592.4 -183.7 376.5 10694.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2366.086 390.351 6.061 1.36e-09 ***
## carat 11256.968 48.600 231.626 < 2e-16 ***
## cutGood 580.240 33.572 17.283 < 2e-16 ***
## cutVery Good 726.820 32.212 22.564 < 2e-16 ***
## cutPremium 762.759 32.225 23.670 < 2e-16 ***
## cutIdeal 833.260 33.396 24.951 < 2e-16 ***
## colorE -209.237 17.893 -11.694 < 2e-16 ***
## colorF -272.834 18.093 -15.080 < 2e-16 ***
## colorG -481.943 17.716 -27.204 < 2e-16 ***
## colorH -980.122 18.836 -52.035 < 2e-16 ***
## colorI -1466.182 21.162 -69.283 < 2e-16 ***
## colorJ -2369.504 26.131 -90.678 < 2e-16 ***
## claritySI2 2702.077 43.812 61.674 < 2e-16 ***
## claritySI1 3664.905 43.627 84.005 < 2e-16 ***
## clarityVS2 4266.612 43.847 97.308 < 2e-16 ***
## clarityVS1 4577.589 44.535 102.786 < 2e-16 ***
## clarityVVS2 4950.168 45.847 107.972 < 2e-16 ***
## clarityVVS1 5007.061 47.152 106.190 < 2e-16 ***
## clarityIF 5344.338 51.015 104.761 < 2e-16 ***
## depth -66.769 4.091 -16.322 < 2e-16 ***
## table -26.457 2.911 -9.089 < 2e-16 ***
## x -1029.478 20.549 -50.098 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1130 on 53918 degrees of freedom
## Multiple R-squared: 0.9198, Adjusted R-squared: 0.9198
## F-statistic: 2.944e+04 on 21 and 53918 DF, p-value: < 2.2e-16
Even though our reduced model performs just as well as the full model, I will proceed with the reduced model for the sake of parsimony. To justify this, I conducted an ANOVA (Partial F-test) to test our null hypothesis, which indicates that the variables dropped (y and z) do not provide additional predictive power beyond what is already captured by the other variables.
## Analysis of Variance Table
##
## Model 1: price ~ carat + cut + color + clarity + depth + table + x + y +
## z
## Model 2: price ~ carat + cut + color + clarity + depth + table + x
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 53916 6.8857e+10
## 2 53918 6.8860e+10 -2 -2977657 1.1658 0.3117
The ANOVA test confirms that y and z are redundant. The reduced model is the superior choice because it is simpler (more parsimonious) and maintains the same predictive accuracy as the more complex model.
Confidence Interval: This interval estimates the average price for the entire population of diamonds that share these specific characteristics. It is narrower because individual fluctuations tend to average out, leaving only the uncertainty of the population mean.
## 2.5 % 97.5 %
## (Intercept) 1600.99477 3131.17680
## carat 11161.71230 11352.22380
## cutGood 514.43842 646.04258
## cutVery Good 663.68519 789.95503
## cutPremium 699.59786 825.91928
## cutIdeal 767.80430 898.71630
## colorE -244.30722 -174.16686
## colorF -308.29603 -237.37227
## colorG -516.66657 -447.21932
## colorH -1017.03987 -943.20379
## colorI -1507.65989 -1424.70321
## colorJ -2420.72048 -2318.28704
## claritySI2 2616.20525 2787.94889
## claritySI1 3579.39574 3750.41493
## clarityVS2 4180.67208 4352.55142
## clarityVS1 4490.29974 4664.87872
## clarityVVS2 4860.30803 5040.02798
## clarityVVS1 4914.64293 5099.47934
## clarityIF 5244.34888 5444.32730
## depth -74.78711 -58.75152
## table -32.16290 -20.75175
## x -1069.75495 -989.20085
In our reduced model, the fact that none of the 95% confidence intervals intercept zero confirms that every included predictor is statistically significant and has a mathematically certain relationship with diamond price. Because the range of plausible values for each coefficient never crosses zero, we can be at least 95% confident that these variables exert a real, non-zero effect that is not due to random chance.
Prediction Interval: This interval is designed to capture the price of a single new diamond. It is always wider because it must account for both the uncertainty in the model’s parameters and the natural “noise” or residual error found in individual data points.
new_diamond <- data.frame(carat = 1.2, cut = "Ideal", color = "G",
clarity = "VS1", depth = 61.5, table = 57, x = 6.8)
pred1 <- predict(reduced_model, newdata = new_diamond, interval = "prediction")
print(pred1)## fit lwr upr
## 1 8188.524 5973.247 10403.8
## GVIF Df GVIF^(1/(2*Df))
## carat 22.413762 1 4.734317
## cut 1.928215 4 1.085537
## color 1.178108 6 1.013753
## clarity 1.346615 7 1.021484
## depth 1.450534 1 1.204381
## table 1.786944 1 1.336766
## x 22.442478 1 4.737349
Our
vifresults shows us thatcaratandx (length)are the primary sources of multicollinearity in our reduced model. In other words,caratandx (length)are telling the model almost the exact same thing about the diamond.
Residuals vs Fitted: Ideally, our red line should be flat and centered around zero. Instead, we are seeing slight upward curve which suggests heteroscedasticity. The relationship between our predictors and price might be non-linear, or that we are missing interaction terms.
Q-Q Residuals: While the middle section follows the straight line, the significant “tail” deviations indicate that our model produces more outliers than expected. This suggest that our model may under predict or over predict prices for certain diamond types.
Scale-Location: Similar to our first plot, the red line slopes upward and the point spread widens to the right, it confirms our prediction errors are not constant. The model is notably more accurate for lower priced diamonds and becomes less reliable as predicted prices increase.
Residuals vs Leverage: We have several labeled outliers with high residuals, no points cross the dashed Cook’s Distance boundaries. This means that although outliers exist, no single observation is currently exerting an undue influence on the overall model.
At this point we have to adjust our model to gain better prediction power. First step we have noticed high
GVIFvalues for variablesxandcarattherefore I will only utilizecaratand dropx.
reduced_model2 <-lm(price ~ carat + cut + color + clarity + depth + table, data=diamonds)
summary(reduced_model2)##
## Call:
## lm(formula = price ~ carat + cut + color + clarity + depth +
## table, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16828.8 -678.7 -199.4 464.6 10341.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4555.171 373.482 -12.197 < 2e-16 ***
## carat 8895.194 12.079 736.390 < 2e-16 ***
## cutGood 614.424 34.337 17.894 < 2e-16 ***
## cutVery Good 778.428 32.936 23.635 < 2e-16 ***
## cutPremium 806.024 32.954 24.459 < 2e-16 ***
## cutIdeal 877.569 34.152 25.696 < 2e-16 ***
## colorE -210.849 18.304 -11.519 < 2e-16 ***
## colorF -304.288 18.498 -16.450 < 2e-16 ***
## colorG -506.964 18.116 -27.984 < 2e-16 ***
## colorH -977.974 19.269 -50.754 < 2e-16 ***
## colorI -1438.277 21.642 -66.459 < 2e-16 ***
## colorJ -2322.565 26.715 -86.940 < 2e-16 ***
## claritySI2 2619.004 44.788 58.476 < 2e-16 ***
## claritySI1 3567.794 44.587 80.020 < 2e-16 ***
## clarityVS2 4210.194 44.840 93.893 < 2e-16 ***
## clarityVS1 4525.400 45.547 99.356 < 2e-16 ***
## clarityVVS2 4957.310 46.901 105.697 < 2e-16 ***
## clarityVVS1 5061.734 48.224 104.964 < 2e-16 ***
## clarityIF 5404.237 52.174 103.582 < 2e-16 ***
## depth -21.024 4.079 -5.154 2.56e-07 ***
## table -24.803 2.978 -8.329 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1156 on 53919 degrees of freedom
## Multiple R-squared: 0.9161, Adjusted R-squared: 0.916
## F-statistic: 2.942e+04 on 20 and 53919 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: price ~ carat + cut + color + clarity + depth + table + x
## Model 2: price ~ carat + cut + color + clarity + depth + table
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 53918 6.8860e+10
## 2 53919 7.2065e+10 -1 -3205284683 2509.8 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova results shows that removing
xwas a bad choice because it provided strong predicting power. At this point I have to consider interaction terms betweencaratand some other variable. Carat and clarity or cut might be correlated somehow since they both have to describe somehow diamonds physical characteristic, but I cannot interactcaratvsxbecause they are already showing multicolinearity.
interaction_model <- lm(price ~ carat * clarity + carat * cut + color + depth + table + x, data=diamonds)
summary(interaction_model)##
## Call:
## lm(formula = price ~ carat * clarity + carat * cut + color +
## depth + table + x, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17258.3 -380.5 -86.5 259.2 9091.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20552.936 344.347 59.687 < 2e-16 ***
## carat 7838.503 67.310 116.454 < 2e-16 ***
## claritySI2 -3134.233 83.179 -37.680 < 2e-16 ***
## claritySI1 -3259.175 83.021 -39.257 < 2e-16 ***
## clarityVS2 -3386.459 83.397 -40.606 < 2e-16 ***
## clarityVS1 -3365.269 84.402 -39.872 < 2e-16 ***
## clarityVVS2 -3786.800 86.522 -43.767 < 2e-16 ***
## clarityVVS1 -3824.524 88.324 -43.301 < 2e-16 ***
## clarityIF -3957.945 93.039 -42.541 < 2e-16 ***
## cutGood -237.053 61.960 -3.826 0.00013 ***
## cutVery Good -467.198 58.407 -7.999 1.28e-15 ***
## cutPremium -225.942 57.595 -3.923 8.76e-05 ***
## cutIdeal -499.731 58.073 -8.605 < 2e-16 ***
## colorE -153.553 14.728 -10.426 < 2e-16 ***
## colorF -228.935 14.901 -15.364 < 2e-16 ***
## colorG -467.784 14.593 -32.056 < 2e-16 ***
## colorH -901.130 15.539 -57.990 < 2e-16 ***
## colorI -1467.995 17.450 -84.126 < 2e-16 ***
## colorJ -2442.036 21.523 -113.462 < 2e-16 ***
## depth -127.728 3.391 -37.668 < 2e-16 ***
## table -37.813 2.404 -15.731 < 2e-16 ***
## x -2416.657 19.195 -125.900 < 2e-16 ***
## carat:claritySI2 4373.713 60.117 72.753 < 2e-16 ***
## carat:claritySI1 5315.914 61.168 86.907 < 2e-16 ***
## carat:clarityVS2 6174.966 61.965 99.652 < 2e-16 ***
## carat:clarityVS1 6589.075 64.099 102.795 < 2e-16 ***
## carat:clarityVVS2 8007.003 70.812 113.074 < 2e-16 ***
## carat:clarityVVS1 8522.676 79.969 106.575 < 2e-16 ***
## carat:clarityIF 9472.378 93.105 101.739 < 2e-16 ***
## carat:cutGood 556.529 56.295 9.886 < 2e-16 ***
## carat:cutVery Good 1006.764 51.831 19.424 < 2e-16 ***
## carat:cutPremium 789.073 50.396 15.657 < 2e-16 ***
## carat:cutIdeal 1242.457 50.869 24.425 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 929.2 on 53907 degrees of freedom
## Multiple R-squared: 0.9458, Adjusted R-squared: 0.9457
## F-statistic: 2.939e+04 on 32 and 53907 DF, p-value: < 2.2e-16
As it shown in the summary, our
interaction modelshowed very strong improvement in R^2 = 09458 and Adjusted R^2 = 09457 compare to previous models.
Now lets look at our residual plots once again.
Plot diagnostic didnt show strong improvement from the previous models. Therefore I will look at Box Cox Transformation to see if we can apply log transformation to our model.
Box Cox results showing us that peak part of the curve(lambda) is really close to “0”, therefore log transformation would be ideal for our next model transformation.
log_model <- lm(log(price) ~ carat * clarity + carat * cut + color + depth + table + x, data=diamonds)
summary(log_model)##
## Call:
## lm(formula = log(price) ~ carat * clarity + carat * cut + color +
## depth + table + x, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2008 -0.0891 0.0033 0.0901 9.7128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.2311224 0.0633279 -35.231 < 2e-16 ***
## carat -0.7349688 0.0123788 -59.373 < 2e-16 ***
## claritySI2 0.1811865 0.0152973 11.844 < 2e-16 ***
## claritySI1 0.2446716 0.0152681 16.025 < 2e-16 ***
## clarityVS2 0.3783190 0.0153374 24.666 < 2e-16 ***
## clarityVS1 0.4298039 0.0155222 27.690 < 2e-16 ***
## clarityVVS2 0.5058498 0.0159120 31.791 < 2e-16 ***
## clarityVVS1 0.5685932 0.0162434 35.005 < 2e-16 ***
## clarityIF 0.6805032 0.0171106 39.771 < 2e-16 ***
## cutGood 0.0194710 0.0113948 1.709 0.087502 .
## cutVery Good 0.0411980 0.0107415 3.835 0.000126 ***
## cutPremium 0.1438955 0.0105922 13.585 < 2e-16 ***
## cutIdeal 0.0993517 0.0106801 9.303 < 2e-16 ***
## colorE -0.0553481 0.0027086 -20.434 < 2e-16 ***
## colorF -0.0875649 0.0027404 -31.953 < 2e-16 ***
## colorG -0.1580774 0.0026837 -58.903 < 2e-16 ***
## colorH -0.2552167 0.0028578 -89.305 < 2e-16 ***
## colorI -0.3866249 0.0032092 -120.475 < 2e-16 ***
## colorJ -0.5290905 0.0039582 -133.669 < 2e-16 ***
## depth 0.0503853 0.0006236 80.797 < 2e-16 ***
## table 0.0071690 0.0004421 16.217 < 2e-16 ***
## x 1.1441774 0.0035301 324.120 < 2e-16 ***
## carat:claritySI2 0.1927007 0.0110560 17.429 < 2e-16 ***
## carat:claritySI1 0.2913761 0.0112492 25.902 < 2e-16 ***
## carat:clarityVS2 0.3003676 0.0113959 26.358 < 2e-16 ***
## carat:clarityVS1 0.3255308 0.0117883 27.615 < 2e-16 ***
## carat:clarityVVS2 0.4052854 0.0130229 31.121 < 2e-16 ***
## carat:clarityVVS1 0.4302571 0.0147069 29.255 < 2e-16 ***
## carat:clarityIF 0.3883890 0.0171226 22.683 < 2e-16 ***
## carat:cutGood 0.0695302 0.0103531 6.716 1.89e-11 ***
## carat:cutVery Good 0.0856714 0.0095321 8.988 < 2e-16 ***
## carat:cutPremium -0.0527069 0.0092683 -5.687 1.30e-08 ***
## carat:cutIdeal 0.0519600 0.0093551 5.554 2.80e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1709 on 53907 degrees of freedom
## Multiple R-squared: 0.9717, Adjusted R-squared: 0.9716
## F-statistic: 5.774e+04 on 32 and 53907 DF, p-value: < 2.2e-16
Our log model performed strongly with Multiple R-squared: 0.9717 and Adjusted R-squared: 0.9716. now lets look at our residual plots once again.
Our residuals plots show a massive improvement after applying the log transformation.
Residuals vs Fitted: The red line is now much flatter and centered at zero, indicating that the log transformation has successfully corrected the major non-linearity and heteroscedasticity from the first model. The relationship between our predictors and the log of price is now much more linear and stable.
Normal Q-Q: This plot shows a significant improvement in the normality of our residuals. The majority of the points now sit directly on the dashed line, meaning our error distribution is much closer to a normal bell curve than before. While there is still showing “heavy tails” at the very right end, the most of your data is now statistically well behaved, making our p-values and confidence intervals much more trustworthy.
Scale-Location: The log transformation has successfully stabilized the variance in our model, the horizontal red line and the more uniform vertical spread of points across the plot. We no longer see the heteroscedasticity from our previous model.
Residuals vs Leverage: This plot identifies a few specific observations that are still struggling to fit the model. However we are still gonna keep them as the rest of the points land within Cook’s Distance.
cooksD <- cooks.distance(log_model)
n <- nrow(diamonds)
threshold <- 4 / n
num_influential <- sum(cooksD > threshold)
num_influential## [1] 2061
influential_points <- which(cooksD > threshold)
diamonds_clean <- diamonds[-influential_points, ]
log_model_final <- update(log_model, data = diamonds_clean)
summary(log_model_final)##
## Call:
## lm(formula = log(price) ~ carat * clarity + carat * cut + color +
## depth + table + x, data = diamonds_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49170 -0.08103 0.00022 0.07998 0.51092
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.5966952 0.0501129 -71.772 < 2e-16 ***
## carat -1.0125801 0.0142056 -71.280 < 2e-16 ***
## claritySI2 0.1715415 0.0170053 10.088 < 2e-16 ***
## claritySI1 0.2987189 0.0169782 17.594 < 2e-16 ***
## clarityVS2 0.4559549 0.0170173 26.794 < 2e-16 ***
## clarityVS1 0.5148361 0.0171094 30.091 < 2e-16 ***
## clarityVVS2 0.6076226 0.0173321 35.058 < 2e-16 ***
## clarityVVS1 0.7000897 0.0175757 39.833 < 2e-16 ***
## clarityIF 0.8065384 0.0183267 44.009 < 2e-16 ***
## cutGood 0.0387989 0.0100558 3.858 0.000114 ***
## cutVery Good 0.0840323 0.0095755 8.776 < 2e-16 ***
## cutPremium 0.1912824 0.0094867 20.163 < 2e-16 ***
## cutIdeal 0.1476040 0.0095391 15.474 < 2e-16 ***
## colorE -0.0517333 0.0019325 -26.770 < 2e-16 ***
## colorF -0.0913307 0.0019583 -46.638 < 2e-16 ***
## colorG -0.1571669 0.0019213 -81.802 < 2e-16 ***
## colorH -0.2510189 0.0020493 -122.488 < 2e-16 ***
## colorI -0.3762973 0.0023105 -162.864 < 2e-16 ***
## colorJ -0.5164845 0.0029208 -176.832 < 2e-16 ***
## depth 0.0591495 0.0004770 124.016 < 2e-16 ***
## table 0.0079682 0.0003258 24.461 < 2e-16 ***
## x 1.3119393 0.0030050 436.589 < 2e-16 ***
## carat:claritySI2 0.1924624 0.0132403 14.536 < 2e-16 ***
## carat:claritySI1 0.2335036 0.0133088 17.545 < 2e-16 ***
## carat:clarityVS2 0.2213278 0.0133897 16.530 < 2e-16 ***
## carat:clarityVS1 0.2398993 0.0135888 17.654 < 2e-16 ***
## carat:clarityVVS2 0.3063160 0.0143521 21.343 < 2e-16 ***
## carat:clarityVVS1 0.2650088 0.0157789 16.795 < 2e-16 ***
## carat:clarityIF 0.2139468 0.0192142 11.135 < 2e-16 ***
## carat:cutGood 0.0763946 0.0097553 7.831 4.93e-15 ***
## carat:cutVery Good 0.0624647 0.0091085 6.858 7.07e-12 ***
## carat:cutPremium -0.0843867 0.0089598 -9.418 < 2e-16 ***
## carat:cutIdeal 0.0222708 0.0090071 2.473 0.013417 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1202 on 51846 degrees of freedom
## Multiple R-squared: 0.9854, Adjusted R-squared: 0.9854
## F-statistic: 1.094e+05 on 32 and 51846 DF, p-value: < 2.2e-16
Influential observations were identified using Cook’s Distance with a threshold of 4/n. Observations exceeding this threshold were flagged as potentially influential. These points were extracted and removed to evaluate their impact on the model. A revised model was then fitted on the cleaned dataset to assess improvements in stability and fit.
From the collinearity check, I learned that some variables especially carat and x were highly related to each other. This makes sense because both are measuring the size of the diamond. Because of this multicollinearity, the coefficients can become unstable and harder to interpret, even though the model may still predict well. So I realized there is a trade off between keeping variables for better prediction versus removing them for clearer interpretation.
From the model diagnostics, I learned that my original model had issues like non constant variance, abnormal residuals, and some influential points. The residual plots showed patterns, which means the model was missing some relationships. After adding interaction terms like carat with clarity, the model improved a lot, with higher R-squared and lower error. This showed me that the relationship between variables is not purely additive, and interactions are important.
Previously, we examined categorical interactions with the response variable using separate boxplots. We can now visualize Cut and Clarity in the same plot for further analyze.
ggplot(diamonds, aes(x = carat, y = log(price), color = clarity)) +
geom_smooth(method = "lm", se = FALSE, size = 1.2) +
facet_wrap(~ cut, ncol = 3) +
scale_color_viridis_d(option = "plasma") +
theme_minimal() +
labs(
title = "Visualizing Triple Interaction: Carat, Clarity, and Cut",
subtitle = "Diverging slopes across panels confirm that interaction terms are necessary",
x = "Carat (Weight)",
y = "Log of Price",
color = "Clarity Grade"
) +
theme(legend.position = "bottom")## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
This plot shows that a diamond’s weight (carat) and its quality (cut and clarity) work together to decide the price. You can see this because the lines in each box “fan out” rather than staying parallel. This means that as a diamond gets bigger, the price doesn’t just go up rather it jumps significantly for high-quality diamonds. Including these interactions is necessary for our model because the value of each extra carat depends entirely on the diamond’s specific cut and clarity.
models <- list(
Full = full_model,
Reduced1 = reduced_model,
Reduced2 = reduced_model2,
Interaction = interaction_model,
Log = log_model,
Log_Final = log_model_final
)
model_summary <- data.frame(
Model = names(models),
R2 = sapply(models, function(m) summary(m)$r.squared),
Adj_R2 = sapply(models, function(m) summary(m)$adj.r.squared),
Residual_SE = sapply(models, function(m) summary(m)$sigma)
)
model_summary## Model R2 Adj_R2 Residual_SE
## Full Full 0.9197915 0.9197573 1130.0944260
## Reduced1 Reduced1 0.9197880 0.9197568 1130.0979006
## Reduced2 Reduced2 0.9160543 0.9160232 1156.0899066
## Interaction Interaction 0.9457805 0.9457484 929.2196153
## Log Log 0.9716505 0.9716337 0.1708904
## Log_Final Log_Final 0.9854119 0.9854029 0.1201679
To arrive at my final model, I followed a structured EDA and refinement process that consistently boosted performance at every step. I started with a Full linear model (R-Squared = 0.919) and experimented with Reduced versions, but quickly realized that a simple additive approach couldn’t capture the true complexity of diamond pricing. By introducing Interaction terms (Carat vs Clarity and Carat vs Cut), I significantly improved the model’s logic, acknowledging that weight becomes exponentially more valuable as quality increases. However, the biggest breakthrough came from applying a Log transformation to the price; this stabilized the variance and jumped my Adjusted R-Squared to 0.971. Finally, after conducting regression diagnostics and removing influential outliers that were skewing the data, I reached my Log_Final model, which achieved a remarkable 0.985 Adjusted and a much lower Residual Standard Error.
Previously, we looked at how our variables interact with the response variable. After making some manual adjustments to boost predictive power, I’m going to use stepwise selection to double check my work and see if a different combination of variables yields a better result.
## Start: AIC=758426.5
## price ~ carat + cut + color + clarity + depth + table + x + y +
## z
##
## Df Sum of Sq RSS AIC
## - y 1 3.1549e+05 6.8857e+10 758425
## <none> 6.8857e+10 758426
## - z 1 2.8609e+06 6.8860e+10 758427
## - table 1 1.0558e+08 6.8962e+10 758507
## - depth 1 2.5286e+08 6.9110e+10 758622
## - cut 4 8.6357e+08 6.9720e+10 759091
## - x 1 1.1996e+09 7.0056e+10 759356
## - color 6 1.7082e+10 8.5939e+10 770368
## - clarity 7 3.5703e+10 1.0456e+11 780945
## - carat 1 6.8440e+10 1.3730e+11 795649
##
## Step: AIC=758424.7
## price ~ carat + cut + color + clarity + depth + table + x + z
##
## Df Sum of Sq RSS AIC
## <none> 6.8857e+10 758425
## - z 1 2.6622e+06 6.8860e+10 758425
## - table 1 1.0584e+08 6.8963e+10 758506
## - depth 1 2.5637e+08 6.9114e+10 758623
## - cut 4 8.6409e+08 6.9721e+10 759089
## - x 1 1.5413e+09 7.0398e+10 759617
## - color 6 1.7082e+10 8.5940e+10 770366
## - clarity 7 3.5708e+10 1.0457e+11 780946
## - carat 1 6.8520e+10 1.3738e+11 795679
##
## Call:
## lm(formula = price ~ carat + cut + color + clarity + depth +
## table + x + z, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21378.8 -592.5 -183.5 376.3 10694.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2198.886 407.163 5.401 6.67e-08 ***
## carat 11257.752 48.602 231.630 < 2e-16 ***
## cutGood 580.325 33.572 17.286 < 2e-16 ***
## cutVery Good 727.431 32.214 22.581 < 2e-16 ***
## cutPremium 762.287 32.226 23.654 < 2e-16 ***
## cutIdeal 833.352 33.396 24.954 < 2e-16 ***
## colorE -209.100 17.893 -11.686 < 2e-16 ***
## colorF -272.837 18.093 -15.080 < 2e-16 ***
## colorG -482.035 17.716 -27.209 < 2e-16 ***
## colorH -980.247 18.836 -52.042 < 2e-16 ***
## colorI -1466.257 21.162 -69.287 < 2e-16 ***
## colorJ -2369.412 26.131 -90.675 < 2e-16 ***
## claritySI2 2702.855 43.815 61.688 < 2e-16 ***
## claritySI1 3665.735 43.631 84.018 < 2e-16 ***
## clarityVS2 4267.476 43.850 97.319 < 2e-16 ***
## clarityVS1 4578.702 44.541 102.796 < 2e-16 ***
## clarityVVS2 4951.100 45.851 107.983 < 2e-16 ***
## clarityVVS1 5008.029 47.156 106.201 < 2e-16 ***
## clarityIF 5345.420 51.020 104.772 < 2e-16 ***
## depth -64.003 4.517 -14.168 < 2e-16 ***
## table -26.501 2.911 -9.103 < 2e-16 ***
## x -1000.354 28.795 -34.740 < 2e-16 ***
## z -47.925 33.194 -1.444 0.149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1130 on 53917 degrees of freedom
## Multiple R-squared: 0.9198, Adjusted R-squared: 0.9198
## F-statistic: 2.81e+04 on 22 and 53917 DF, p-value: < 2.2e-16
Using backward selection method, formula only drop non-significant p-value from the full model which was
yand kept the other ones. This is could be a good method for detecthing significant predictors in large dataset, however it lack detecting its relationships in context. now let’s try to forward selection method. First we have to create a Null model.
null_model <- lm(price ~ 1, data = diamonds)
forward_model <- step(null_model, scope = list(lower = null_model, upper = full_model),
direction = "forward")## Start: AIC=894477.9
## price ~ 1
##
## Df Sum of Sq RSS AIC
## + carat 1 7.2913e+11 1.2935e+11 792389
## + x 1 6.7152e+11 1.8695e+11 812259
## + y 1 6.4296e+11 2.1552e+11 819929
## + z 1 6.3677e+11 2.2170e+11 821454
## + color 6 2.6849e+10 8.3162e+11 892776
## + clarity 7 2.3308e+10 8.3517e+11 893007
## + table 1 1.3876e+10 8.4460e+11 893601
## + cut 4 1.1042e+10 8.4743e+11 893788
## + depth 1 9.7323e+07 8.5838e+11 894474
## <none> 8.5847e+11 894478
##
## Step: AIC=792389.4
## price ~ carat
##
## Df Sum of Sq RSS AIC
## + clarity 7 3.9082e+10 9.0264e+10 772998
## + color 6 1.2561e+10 1.1678e+11 786891
## + cut 4 6.1332e+09 1.2321e+11 789777
## + x 1 3.5206e+09 1.2583e+11 790903
## + z 1 2.8493e+09 1.2650e+11 791190
## + table 1 1.4377e+09 1.2791e+11 791789
## + y 1 1.2425e+09 1.2810e+11 791871
## + depth 1 1.1546e+09 1.2819e+11 791908
## <none> 1.2935e+11 792389
##
## Step: AIC=772998.5
## price ~ carat + clarity
##
## Df Sum of Sq RSS AIC
## + color 6 1.6402e+10 7.3862e+10 762193
## + x 1 1.8542e+09 8.8410e+10 771881
## + cut 4 1.7808e+09 8.8483e+10 771932
## + z 1 1.4814e+09 8.8783e+10 772108
## + y 1 7.4127e+08 8.9523e+10 772556
## + table 1 3.7751e+08 8.9886e+10 772774
## + depth 1 3.5822e+08 8.9906e+10 772786
## <none> 9.0264e+10 772998
##
## Step: AIC=762193.4
## price ~ carat + clarity + color
##
## Df Sum of Sq RSS AIC
## + x 1 2733710969 7.1128e+10 760161
## + z 1 1842294631 7.2020e+10 760833
## + cut 4 1699187372 7.2163e+10 760946
## + y 1 1145039064 7.2717e+10 761353
## + table 1 409645878 7.3452e+10 761895
## + depth 1 174658715 7.3687e+10 762068
## <none> 7.3862e+10 762193
##
## Step: AIC=760161.1
## price ~ carat + clarity + color + x
##
## Df Sum of Sq RSS AIC
## + cut 4 1918248123 6.9210e+10 758694
## + depth 1 722282102 7.0406e+10 759613
## + table 1 273738191 7.0855e+10 759955
## + z 1 199547343 7.0929e+10 760012
## + y 1 5354253 7.1123e+10 760159
## <none> 7.1128e+10 760161
##
## Step: AIC=758694.4
## price ~ carat + clarity + color + x + cut
##
## Df Sum of Sq RSS AIC
## + depth 1 244682865 6.8965e+10 758505
## + z 1 72666922 6.9137e+10 758640
## + table 1 9935285 6.9200e+10 758689
## <none> 6.9210e+10 758694
## + y 1 982101 6.9209e+10 758696
##
## Step: AIC=758505.4
## price ~ carat + clarity + color + x + cut + depth
##
## Df Sum of Sq RSS AIC
## + table 1 105497218 6.8860e+10 758425
## <none> 6.8965e+10 758505
## + z 1 2323719 6.8963e+10 758506
## + y 1 298553 6.8965e+10 758507
##
## Step: AIC=758424.8
## price ~ carat + clarity + color + x + cut + depth + table
##
## Df Sum of Sq RSS AIC
## + z 1 2662170 6.8857e+10 758425
## <none> 6.8860e+10 758425
## + y 1 116788 6.8860e+10 758427
##
## Step: AIC=758424.7
## price ~ carat + clarity + color + x + cut + depth + table + z
##
## Df Sum of Sq RSS AIC
## <none> 6.8857e+10 758425
## + y 1 315487 6.8857e+10 758426
##
## Call:
## lm(formula = price ~ carat + clarity + color + x + cut + depth +
## table + z, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21378.8 -592.5 -183.5 376.3 10694.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2198.886 407.163 5.401 6.67e-08 ***
## carat 11257.752 48.602 231.630 < 2e-16 ***
## claritySI2 2702.855 43.815 61.688 < 2e-16 ***
## claritySI1 3665.735 43.631 84.018 < 2e-16 ***
## clarityVS2 4267.476 43.850 97.319 < 2e-16 ***
## clarityVS1 4578.702 44.541 102.796 < 2e-16 ***
## clarityVVS2 4951.100 45.851 107.983 < 2e-16 ***
## clarityVVS1 5008.029 47.156 106.201 < 2e-16 ***
## clarityIF 5345.420 51.020 104.772 < 2e-16 ***
## colorE -209.100 17.893 -11.686 < 2e-16 ***
## colorF -272.837 18.093 -15.080 < 2e-16 ***
## colorG -482.035 17.716 -27.209 < 2e-16 ***
## colorH -980.247 18.836 -52.042 < 2e-16 ***
## colorI -1466.257 21.162 -69.287 < 2e-16 ***
## colorJ -2369.412 26.131 -90.675 < 2e-16 ***
## x -1000.354 28.795 -34.740 < 2e-16 ***
## cutGood 580.325 33.572 17.286 < 2e-16 ***
## cutVery Good 727.431 32.214 22.581 < 2e-16 ***
## cutPremium 762.287 32.226 23.654 < 2e-16 ***
## cutIdeal 833.352 33.396 24.954 < 2e-16 ***
## depth -64.003 4.517 -14.168 < 2e-16 ***
## table -26.501 2.911 -9.103 < 2e-16 ***
## z -47.925 33.194 -1.444 0.149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1130 on 53917 degrees of freedom
## Multiple R-squared: 0.9198, Adjusted R-squared: 0.9198
## F-statistic: 2.81e+04 on 22 and 53917 DF, p-value: < 2.2e-16
Our forward selection process started with an null model and systematically added variables that provided the most significant improvement to the model’s fit, measured by a decreasing AIC score. It first identified
caratas the single most important predictor, followed byclarity,color, and physical dimensions, continuing until adding new variables no longer meaningfully improved the model. Eventually, it ended with the same result as a backward selection. One again this is good method but still lacks insight of variable relationships.
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
train_control <- trainControl(method = "cv", number = 10)
log_final_model_cv <- train(log(price) ~ carat * clarity + carat * cut + color + depth + table + x,
data = diamonds,
method = "lm",
trControl = train_control)
log_final_model_cv$results## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 0.1699695 0.971408 0.1106054 0.0271839 0.009128032 0.001172535
I evaluated my models using cross-validation to see how well they predict new data. The final log-transformed model performed very well, with an RMSE of about 0.17 and an R² of about 0.97, which means it explains almost all of the variation in the log of price and has very low prediction error. Compared to earlier models, this log model improved performance and also helped address issues like non-constant variance and non-normal residuals. When I compared it with stepwise models, I noticed that while those models were simpler, they did not provide better predictive performance. This shows that some methods focus more on simplifying the model, while cross-validation focuses on prediction accuracy, and these goals do not always match.
Based on this, I chose the log interaction model because it provides the best predictive performance and better satisfies model assumptions. The trade-off is that the model is more complex due to interaction terms, but the key predictors like carat, clarity, and cut are still highly significant and consistent. In simple terms, the model shows that diamond price increases with size, but the rate of increase depends on quality—diamonds with better clarity and cut increase in price much faster as carat increases. Since the model uses a log transformation, these effects are interpreted in percentage terms rather than absolute changes. Overall, my final model leans more toward prediction, because I prioritized accuracy and model performance over simplicity.