Understanding Diamond Pricing: A Statistical Overview

1 Exploratory visualization and expected relationships

In this assignment, I will analyze the Diamonds dataset from R. I’ll begin by reviewing the data dictionary to understand variable definitions and explore how these features potentially interact with one another.

library(corrplot)
library(ggplot2)
library(boot)
library(car)
library(MASS)

data("diamonds", package = "ggplot2")
summary(diamonds)

##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
##

price: Integer, in US dollars ($326–$18,823).

carat: Numeric, weight (0.2–5.01).

cut: Ordinal factor (Fair to Ideal).

color: Ordinal factor (J to D, worst to best).

clarity: Ordinal factor (I1 to IF, worst to best).

x, y, z: Numeric,length, width, and depth in mm.

depth: Numeric, total depth percentage

table: Numeric, width of top relative to widest point.

diamonds$cut <- factor(diamonds$cut, ordered = FALSE )
diamonds$color <- factor(diamonds$color, ordered = FALSE)
diamonds$clarity <- factor(diamonds$clarity, ordered = FALSE)
str(diamonds)

## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Factor w/ 5 levels "Fair","Good",..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Factor w/ 8 levels "I1","SI2","SI1",..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

matrix <- cor(diamonds[sapply(diamonds, is.numeric)])
corrplot(matrix, method = "color",       
         order = "hclust",       
         addCoef.col = "gray", 
         number.cex = 0.8,       
         tl.col = "black",      
         tl.srt = 45)

1.1 Positive vs. Negative associations

Carat and Price show a very strong positive relationship with a correlation of 0.92. This indicates that as the carat of the diamond increases, the price tends to increase significantly.
Physical Dimensions (x, y, z) and Price also showing strong positive relationship with correlations ranging from 0.86 to 0.88. This means that larger diamonds in terms of length, width, and depth are consistently associated with higher prices.
Carat and Dimensions (x, y, z) are showing extremely strong positive relationships (Cor = 0.95 to 0.98). This suggests that these variables are nearly multicollinear; as a diamond’s physical size increases, its carat weight increases almost proportionately.
Table and Depth show a moderate negative relationship with a correlation of -0.30. This implies that diamonds with a larger table tend to have a slightly lower depth percentage.
Depth and Price show a negligible relationship with a correlation of -0.01. This suggests that the depth percentage of a diamond has almost no linear impact on its price. Similarly, table has a very weak relationship with price (Cor = 0.13).

ggplot(diamonds, aes(x = cut, y = price, fill = cut)) +
  geom_boxplot() +
  theme_minimal()

Surprisingly, Ideal cut diamonds have lowest median price, on the other hand Premium cut generally sits at the top of the price range, but every category contains outliers reaching up above $15,000. This suggests that while “cut” matters, other factors are likely the have affect on driving these prices up.

ggplot(diamonds, aes(x = clarity, y = price, fill = clarity)) +
  geom_boxplot() +
  theme_minimal()

Clarity grade for SI2 is showing highest median price and WS1 and IF are showing lowest median price. All clarity grades consist of outliers.

ggplot(diamonds, aes(x = color, y = price, fill = color)) +
  geom_boxplot() +
  theme_minimal()

Color J has the highest median price followed by I and H. Color D and E seems to have lowest median price.

2 Fit the full multiple linear regression model

Fit the full multiple linear regression using all provided predictors. Then interpret each regression coefficient except the intercept. Each interpretation must:

● explicitly say “holding other variables fixed,”

● explain direction (increase/decrease),

● comment on magnitude in the units of the variables (or per 1-unit change / per meaningful change if rescaling is needed).

full_model <- lm(price ~., data=diamonds)
summary(full_model)

## 
## Call:
## lm(formula = price ~ ., data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21376.0   -592.4   -183.5    376.4  10694.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2184.477    408.197   5.352 8.76e-08 ***
## carat        11256.978     48.628 231.494  < 2e-16 ***
## cutGood        579.751     33.592  17.259  < 2e-16 ***
## cutVery Good   726.783     32.241  22.542  < 2e-16 ***
## cutPremium     762.144     32.228  23.649  < 2e-16 ***
## cutIdeal       832.912     33.407  24.932  < 2e-16 ***
## colorE        -209.118     17.893 -11.687  < 2e-16 ***
## colorF        -272.854     18.093 -15.081  < 2e-16 ***
## colorG        -482.039     17.716 -27.209  < 2e-16 ***
## colorH        -980.267     18.836 -52.043  < 2e-16 ***
## colorI       -1466.244     21.162 -69.286  < 2e-16 ***
## colorJ       -2369.398     26.131 -90.674  < 2e-16 ***
## claritySI2    2702.586     43.818  61.677  < 2e-16 ***
## claritySI1    3665.472     43.634  84.005  < 2e-16 ***
## clarityVS2    4267.224     43.853  97.306  < 2e-16 ***
## clarityVS1    4578.398     44.546 102.779  < 2e-16 ***
## clarityVVS2   4950.814     45.855 107.967  < 2e-16 ***
## clarityVVS1   5007.759     47.160 106.187  < 2e-16 ***
## clarityIF     5345.102     51.024 104.757  < 2e-16 ***
## depth          -63.806      4.535 -14.071  < 2e-16 ***
## table          -26.474      2.912  -9.092  < 2e-16 ***
## x            -1008.261     32.898 -30.648  < 2e-16 ***
## y                9.609     19.333   0.497    0.619    
## z              -50.119     33.486  -1.497    0.134    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1130 on 53916 degrees of freedom
## Multiple R-squared:  0.9198, Adjusted R-squared:  0.9198 
## F-statistic: 2.688e+04 on 23 and 53916 DF,  p-value: < 2.2e-16

Multiple R-squared = 0.9198 means that our model explains approximately 92% of the variation in diamond prices.
Adjusted R-squared = 0.9198 means that after adjusting for the number of predictors, the model still explains 92% of the variation, suggesting the included variables are highly relevant.
The coefficient for carat is 11256.978, meaning that holding all other variables fixed, a 1 unit increase in carat weight is associated with an increase in predicted price of about $11,257. The p-value < 2e-16 is highly significant, indicating carat weight is a primary driver of price.
In terms of cut quality relative to the baseline (Fair), all levels show a statistically significant positive effect (p < 2e-16). Holding other variables fixed, Good cuts increase predicted price by $579.75, Very Good by $726.78, Premium by $762.14, and Ideal cuts show the largest increase of $832.91 compared to Fair cuts.
In terms of color relative to the baseline (D), all coefficients are negative and highly significant (p < 2e-16). Holding other variables fixed, as the color grade moves down the alphabet toward J, the predicted price decreases steadily, with a color J diamond costing about $2,369.40 less than a color D diamond.
In terms of clarity relative to the baseline (I1), all levels show a statistically significant increase in price (p < 2e-16). Holding other variables fixed, price increases as clarity improves, ranging from an extra $2,702.59 for SI2 up to $5,345.10 for IF (Internally Flawless) compared to the baseline.
The coefficient for depth is -63.806, meaning that holding other variables fixed, 1 percentage point increase in total depth percentage is associated with a decrease in predicted price of $63.81. The p-value < 2e-16 is highly significant.
The coefficient for table is -26.474, meaning that holding other variables fixed, 1 unit increase in table width is associated with a decrease in predicted price of $26.47. The p-value < 2e-16 is highly significant.
The coefficient for x (length) is -1008.261, meaning that holding other variables fixed, a 1mm increase in length is associated with a decrease in predicted price of $1,008.26. The p-value < 2e-16 is highly significant. `
The coefficient for y (width) is 9.609, meaning that holding other variables fixed, a 1mm increase in width is associated with an increase in predicted price of $9.61. However, the p-value = 0.619 is not statistically significant, suggesting width does not have a meaningful independent effect in this model.
The coefficient for z (depth in mm) is -50.119, meaning that holding other variables fixed, a 1mm increase in depth is associated with a decrease in predicted price of $50.12. The p-value = 0.134 is not statistically significant at the 0.05 level, indicating no meaningful independent relationship with price.

3 Regression importance testing

Carat: With a t-value of 231.49 and a p-value < 2e-16, this is the most statistically significant predictor in the model. It has a massive positive impact on price and is an essential contributor.
Depth and Table: Both have highly significant p-values (< 2e-16). Despite having smaller coefficients than carat, their t-statistics (-14.07 and -9.09 respectively) indicate they are important contributors to the model’s accuracy.
X (Length): This dimension is highly significant (p < 2e-16) with a strong negative relationship (t-value of -30.65), suggesting it provides critical information not captured by carats alone. Non-Significant Predictors
Y (Width) and Z (Depth in mm): These variables have p-values of 0.619 and 0.134, respectively. Since these values are well above the standard 0.05 threshold, individual t-tests suggest they do not provide a meaningful independent contribution to predicting price.

3.1 Fitting Reduced Model

reduced_model <- lm(price ~ carat + cut + color + clarity + depth + table + x, data=diamonds)
summary(reduced_model)

## 
## Call:
## lm(formula = price ~ carat + cut + color + clarity + depth + 
##     table + x, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21385.0   -592.4   -183.7    376.5  10694.6 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2366.086    390.351   6.061 1.36e-09 ***
## carat        11256.968     48.600 231.626  < 2e-16 ***
## cutGood        580.240     33.572  17.283  < 2e-16 ***
## cutVery Good   726.820     32.212  22.564  < 2e-16 ***
## cutPremium     762.759     32.225  23.670  < 2e-16 ***
## cutIdeal       833.260     33.396  24.951  < 2e-16 ***
## colorE        -209.237     17.893 -11.694  < 2e-16 ***
## colorF        -272.834     18.093 -15.080  < 2e-16 ***
## colorG        -481.943     17.716 -27.204  < 2e-16 ***
## colorH        -980.122     18.836 -52.035  < 2e-16 ***
## colorI       -1466.182     21.162 -69.283  < 2e-16 ***
## colorJ       -2369.504     26.131 -90.678  < 2e-16 ***
## claritySI2    2702.077     43.812  61.674  < 2e-16 ***
## claritySI1    3664.905     43.627  84.005  < 2e-16 ***
## clarityVS2    4266.612     43.847  97.308  < 2e-16 ***
## clarityVS1    4577.589     44.535 102.786  < 2e-16 ***
## clarityVVS2   4950.168     45.847 107.972  < 2e-16 ***
## clarityVVS1   5007.061     47.152 106.190  < 2e-16 ***
## clarityIF     5344.338     51.015 104.761  < 2e-16 ***
## depth          -66.769      4.091 -16.322  < 2e-16 ***
## table          -26.457      2.911  -9.089  < 2e-16 ***
## x            -1029.478     20.549 -50.098  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1130 on 53918 degrees of freedom
## Multiple R-squared:  0.9198, Adjusted R-squared:  0.9198 
## F-statistic: 2.944e+04 on 21 and 53918 DF,  p-value: < 2.2e-16

4 Nested model testing (full vs. reduced model)

Even though our reduced model performs just as well as the full model, I will proceed with the reduced model for the sake of parsimony. To justify this, I conducted an ANOVA (Partial F-test) to test our null hypothesis, which indicates that the variables dropped (y and z) do not provide additional predictive power beyond what is already captured by the other variables.

anova(full_model,reduced_model)

## Analysis of Variance Table
## 
## Model 1: price ~ carat + cut + color + clarity + depth + table + x + y + 
##     z
## Model 2: price ~ carat + cut + color + clarity + depth + table + x
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1  53916 6.8857e+10                           
## 2  53918 6.8860e+10 -2  -2977657 1.1658 0.3117

The ANOVA test confirms that y and z are redundant. The reduced model is the superior choice because it is simpler (more parsimonious) and maintains the same predictive accuracy as the more complex model.

5 Confidence intervals for regression coefficients

Confidence Interval: This interval estimates the average price for the entire population of diamonds that share these specific characteristics. It is narrower because individual fluctuations tend to average out, leaving only the uncertainty of the population mean.

confint(reduced_model, level = 0.95)

##                    2.5 %      97.5 %
## (Intercept)   1600.99477  3131.17680
## carat        11161.71230 11352.22380
## cutGood        514.43842   646.04258
## cutVery Good   663.68519   789.95503
## cutPremium     699.59786   825.91928
## cutIdeal       767.80430   898.71630
## colorE        -244.30722  -174.16686
## colorF        -308.29603  -237.37227
## colorG        -516.66657  -447.21932
## colorH       -1017.03987  -943.20379
## colorI       -1507.65989 -1424.70321
## colorJ       -2420.72048 -2318.28704
## claritySI2    2616.20525  2787.94889
## claritySI1    3579.39574  3750.41493
## clarityVS2    4180.67208  4352.55142
## clarityVS1    4490.29974  4664.87872
## clarityVVS2   4860.30803  5040.02798
## clarityVVS1   4914.64293  5099.47934
## clarityIF     5244.34888  5444.32730
## depth          -74.78711   -58.75152
## table          -32.16290   -20.75175
## x            -1069.75495  -989.20085

In our reduced model, the fact that none of the 95% confidence intervals intercept zero confirms that every included predictor is statistically significant and has a mathematically certain relationship with diamond price. Because the range of plausible values for each coefficient never crosses zero, we can be at least 95% confident that these variables exert a real, non-zero effect that is not due to random chance.

6 Prediction interval for a new observation

Prediction Interval: This interval is designed to capture the price of a single new diamond. It is always wider because it must account for both the uncertainty in the model’s parameters and the natural “noise” or residual error found in individual data points.

new_diamond <- data.frame(carat = 1.2, cut = "Ideal", color = "G",
  clarity = "VS1", depth = 61.5, table = 57, x = 6.8)

pred1 <- predict(reduced_model, newdata = new_diamond, interval = "prediction")

print(pred1)

##        fit      lwr     upr
## 1 8188.524 5973.247 10403.8

7 Multicollinearity investigation

vif(reduced_model)

##              GVIF Df GVIF^(1/(2*Df))
## carat   22.413762  1        4.734317
## cut      1.928215  4        1.085537
## color    1.178108  6        1.013753
## clarity  1.346615  7        1.021484
## depth    1.450534  1        1.204381
## table    1.786944  1        1.336766
## x       22.442478  1        4.737349

Our vif results shows us that carat and x (length) are the primary sources of multicollinearity in our reduced model. In other words, carat and x (length) are telling the model almost the exact same thing about the diamond.

8 Residual-based model diagnostics + remedy + final model interpretation

par(mfrow = c(2,2))
plot(reduced_model)

Residuals vs Fitted: Ideally, our red line should be flat and centered around zero. Instead, we are seeing slight upward curve which suggests heteroscedasticity. The relationship between our predictors and price might be non-linear, or that we are missing interaction terms.
Q-Q Residuals: While the middle section follows the straight line, the significant “tail” deviations indicate that our model produces more outliers than expected. This suggest that our model may under predict or over predict prices for certain diamond types.
Scale-Location: Similar to our first plot, the red line slopes upward and the point spread widens to the right, it confirms our prediction errors are not constant. The model is notably more accurate for lower priced diamonds and becomes less reliable as predicted prices increase.
Residuals vs Leverage: We have several labeled outliers with high residuals, no points cross the dashed Cook’s Distance boundaries. This means that although outliers exist, no single observation is currently exerting an undue influence on the overall model.

9 Model Remedy

At this point we have to adjust our model to gain better prediction power. First step we have noticed high GVIF values for variables x and carat therefore I will only utilize carat and drop x.

reduced_model2 <-lm(price ~ carat + cut + color + clarity + depth + table, data=diamonds)
summary(reduced_model2)

## 
## Call:
## lm(formula = price ~ carat + cut + color + clarity + depth + 
##     table, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16828.8   -678.7   -199.4    464.6  10341.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4555.171    373.482 -12.197  < 2e-16 ***
## carat         8895.194     12.079 736.390  < 2e-16 ***
## cutGood        614.424     34.337  17.894  < 2e-16 ***
## cutVery Good   778.428     32.936  23.635  < 2e-16 ***
## cutPremium     806.024     32.954  24.459  < 2e-16 ***
## cutIdeal       877.569     34.152  25.696  < 2e-16 ***
## colorE        -210.849     18.304 -11.519  < 2e-16 ***
## colorF        -304.288     18.498 -16.450  < 2e-16 ***
## colorG        -506.964     18.116 -27.984  < 2e-16 ***
## colorH        -977.974     19.269 -50.754  < 2e-16 ***
## colorI       -1438.277     21.642 -66.459  < 2e-16 ***
## colorJ       -2322.565     26.715 -86.940  < 2e-16 ***
## claritySI2    2619.004     44.788  58.476  < 2e-16 ***
## claritySI1    3567.794     44.587  80.020  < 2e-16 ***
## clarityVS2    4210.194     44.840  93.893  < 2e-16 ***
## clarityVS1    4525.400     45.547  99.356  < 2e-16 ***
## clarityVVS2   4957.310     46.901 105.697  < 2e-16 ***
## clarityVVS1   5061.734     48.224 104.964  < 2e-16 ***
## clarityIF     5404.237     52.174 103.582  < 2e-16 ***
## depth          -21.024      4.079  -5.154 2.56e-07 ***
## table          -24.803      2.978  -8.329  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1156 on 53919 degrees of freedom
## Multiple R-squared:  0.9161, Adjusted R-squared:  0.916 
## F-statistic: 2.942e+04 on 20 and 53919 DF,  p-value: < 2.2e-16

anova(reduced_model,reduced_model2)

## Analysis of Variance Table
## 
## Model 1: price ~ carat + cut + color + clarity + depth + table + x
## Model 2: price ~ carat + cut + color + clarity + depth + table
##   Res.Df        RSS Df   Sum of Sq      F    Pr(>F)    
## 1  53918 6.8860e+10                                    
## 2  53919 7.2065e+10 -1 -3205284683 2509.8 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova results shows that removing x was a bad choice because it provided strong predicting power. At this point I have to consider interaction terms between carat and some other variable. Carat and clarity or cut might be correlated somehow since they both have to describe somehow diamonds physical characteristic, but I cannot interact carat vs x because they are already showing multicolinearity.

interaction_model <- lm(price ~ carat * clarity + carat * cut + color + depth + table + x, data=diamonds)
summary(interaction_model)

## 
## Call:
## lm(formula = price ~ carat * clarity + carat * cut + color + 
##     depth + table + x, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17258.3   -380.5    -86.5    259.2   9091.9 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        20552.936    344.347   59.687  < 2e-16 ***
## carat               7838.503     67.310  116.454  < 2e-16 ***
## claritySI2         -3134.233     83.179  -37.680  < 2e-16 ***
## claritySI1         -3259.175     83.021  -39.257  < 2e-16 ***
## clarityVS2         -3386.459     83.397  -40.606  < 2e-16 ***
## clarityVS1         -3365.269     84.402  -39.872  < 2e-16 ***
## clarityVVS2        -3786.800     86.522  -43.767  < 2e-16 ***
## clarityVVS1        -3824.524     88.324  -43.301  < 2e-16 ***
## clarityIF          -3957.945     93.039  -42.541  < 2e-16 ***
## cutGood             -237.053     61.960   -3.826  0.00013 ***
## cutVery Good        -467.198     58.407   -7.999 1.28e-15 ***
## cutPremium          -225.942     57.595   -3.923 8.76e-05 ***
## cutIdeal            -499.731     58.073   -8.605  < 2e-16 ***
## colorE              -153.553     14.728  -10.426  < 2e-16 ***
## colorF              -228.935     14.901  -15.364  < 2e-16 ***
## colorG              -467.784     14.593  -32.056  < 2e-16 ***
## colorH              -901.130     15.539  -57.990  < 2e-16 ***
## colorI             -1467.995     17.450  -84.126  < 2e-16 ***
## colorJ             -2442.036     21.523 -113.462  < 2e-16 ***
## depth               -127.728      3.391  -37.668  < 2e-16 ***
## table                -37.813      2.404  -15.731  < 2e-16 ***
## x                  -2416.657     19.195 -125.900  < 2e-16 ***
## carat:claritySI2    4373.713     60.117   72.753  < 2e-16 ***
## carat:claritySI1    5315.914     61.168   86.907  < 2e-16 ***
## carat:clarityVS2    6174.966     61.965   99.652  < 2e-16 ***
## carat:clarityVS1    6589.075     64.099  102.795  < 2e-16 ***
## carat:clarityVVS2   8007.003     70.812  113.074  < 2e-16 ***
## carat:clarityVVS1   8522.676     79.969  106.575  < 2e-16 ***
## carat:clarityIF     9472.378     93.105  101.739  < 2e-16 ***
## carat:cutGood        556.529     56.295    9.886  < 2e-16 ***
## carat:cutVery Good  1006.764     51.831   19.424  < 2e-16 ***
## carat:cutPremium     789.073     50.396   15.657  < 2e-16 ***
## carat:cutIdeal      1242.457     50.869   24.425  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 929.2 on 53907 degrees of freedom
## Multiple R-squared:  0.9458, Adjusted R-squared:  0.9457 
## F-statistic: 2.939e+04 on 32 and 53907 DF,  p-value: < 2.2e-16

As it shown in the summary, our interaction model showed very strong improvement in R^2 = 09458 and Adjusted R^2 = 09457 compare to previous models.

Now lets look at our residual plots once again.

par(mfrow = c(2,2))
plot(interaction_model)

Plot diagnostic didnt show strong improvement from the previous models. Therefore I will look at Box Cox Transformation to see if we can apply log transformation to our model.

boxcox(interaction_model, lambda = seq(-2, 2, 0.1))

Box Cox results showing us that peak part of the curve(lambda) is really close to “0”, therefore log transformation would be ideal for our next model transformation.

log_model <- lm(log(price) ~ carat * clarity + carat * cut + color + depth + table + x, data=diamonds)
summary(log_model)

## 
## Call:
## lm(formula = log(price) ~ carat * clarity + carat * cut + color + 
##     depth + table + x, data = diamonds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2008 -0.0891  0.0033  0.0901  9.7128 
## 
## Coefficients:
##                      Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        -2.2311224  0.0633279  -35.231  < 2e-16 ***
## carat              -0.7349688  0.0123788  -59.373  < 2e-16 ***
## claritySI2          0.1811865  0.0152973   11.844  < 2e-16 ***
## claritySI1          0.2446716  0.0152681   16.025  < 2e-16 ***
## clarityVS2          0.3783190  0.0153374   24.666  < 2e-16 ***
## clarityVS1          0.4298039  0.0155222   27.690  < 2e-16 ***
## clarityVVS2         0.5058498  0.0159120   31.791  < 2e-16 ***
## clarityVVS1         0.5685932  0.0162434   35.005  < 2e-16 ***
## clarityIF           0.6805032  0.0171106   39.771  < 2e-16 ***
## cutGood             0.0194710  0.0113948    1.709 0.087502 .  
## cutVery Good        0.0411980  0.0107415    3.835 0.000126 ***
## cutPremium          0.1438955  0.0105922   13.585  < 2e-16 ***
## cutIdeal            0.0993517  0.0106801    9.303  < 2e-16 ***
## colorE             -0.0553481  0.0027086  -20.434  < 2e-16 ***
## colorF             -0.0875649  0.0027404  -31.953  < 2e-16 ***
## colorG             -0.1580774  0.0026837  -58.903  < 2e-16 ***
## colorH             -0.2552167  0.0028578  -89.305  < 2e-16 ***
## colorI             -0.3866249  0.0032092 -120.475  < 2e-16 ***
## colorJ             -0.5290905  0.0039582 -133.669  < 2e-16 ***
## depth               0.0503853  0.0006236   80.797  < 2e-16 ***
## table               0.0071690  0.0004421   16.217  < 2e-16 ***
## x                   1.1441774  0.0035301  324.120  < 2e-16 ***
## carat:claritySI2    0.1927007  0.0110560   17.429  < 2e-16 ***
## carat:claritySI1    0.2913761  0.0112492   25.902  < 2e-16 ***
## carat:clarityVS2    0.3003676  0.0113959   26.358  < 2e-16 ***
## carat:clarityVS1    0.3255308  0.0117883   27.615  < 2e-16 ***
## carat:clarityVVS2   0.4052854  0.0130229   31.121  < 2e-16 ***
## carat:clarityVVS1   0.4302571  0.0147069   29.255  < 2e-16 ***
## carat:clarityIF     0.3883890  0.0171226   22.683  < 2e-16 ***
## carat:cutGood       0.0695302  0.0103531    6.716 1.89e-11 ***
## carat:cutVery Good  0.0856714  0.0095321    8.988  < 2e-16 ***
## carat:cutPremium   -0.0527069  0.0092683   -5.687 1.30e-08 ***
## carat:cutIdeal      0.0519600  0.0093551    5.554 2.80e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1709 on 53907 degrees of freedom
## Multiple R-squared:  0.9717, Adjusted R-squared:  0.9716 
## F-statistic: 5.774e+04 on 32 and 53907 DF,  p-value: < 2.2e-16

Our log model performed strongly with Multiple R-squared: 0.9717 and Adjusted R-squared: 0.9716. now lets look at our residual plots once again.

par(mfrow=c(2,2))
plot(log_model)

Our residuals plots show a massive improvement after applying the log transformation.

Residuals vs Fitted: The red line is now much flatter and centered at zero, indicating that the log transformation has successfully corrected the major non-linearity and heteroscedasticity from the first model. The relationship between our predictors and the log of price is now much more linear and stable.
Normal Q-Q: This plot shows a significant improvement in the normality of our residuals. The majority of the points now sit directly on the dashed line, meaning our error distribution is much closer to a normal bell curve than before. While there is still showing “heavy tails” at the very right end, the most of your data is now statistically well behaved, making our p-values and confidence intervals much more trustworthy.
Scale-Location: The log transformation has successfully stabilized the variance in our model, the horizontal red line and the more uniform vertical spread of points across the plot. We no longer see the heteroscedasticity from our previous model.
Residuals vs Leverage: This plot identifies a few specific observations that are still struggling to fit the model. However we are still gonna keep them as the rest of the points land within Cook’s Distance.

9.1 Investigating influential points

cooksD <- cooks.distance(log_model)
n <- nrow(diamonds)
threshold <- 4 / n
num_influential <- sum(cooksD > threshold)
num_influential

## [1] 2061

influential_points <- which(cooksD > threshold)
diamonds_clean <- diamonds[-influential_points, ]
log_model_final <- update(log_model, data = diamonds_clean)
summary(log_model_final)

## 
## Call:
## lm(formula = log(price) ~ carat * clarity + carat * cut + color + 
##     depth + table + x, data = diamonds_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49170 -0.08103  0.00022  0.07998  0.51092 
## 
## Coefficients:
##                      Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        -3.5966952  0.0501129  -71.772  < 2e-16 ***
## carat              -1.0125801  0.0142056  -71.280  < 2e-16 ***
## claritySI2          0.1715415  0.0170053   10.088  < 2e-16 ***
## claritySI1          0.2987189  0.0169782   17.594  < 2e-16 ***
## clarityVS2          0.4559549  0.0170173   26.794  < 2e-16 ***
## clarityVS1          0.5148361  0.0171094   30.091  < 2e-16 ***
## clarityVVS2         0.6076226  0.0173321   35.058  < 2e-16 ***
## clarityVVS1         0.7000897  0.0175757   39.833  < 2e-16 ***
## clarityIF           0.8065384  0.0183267   44.009  < 2e-16 ***
## cutGood             0.0387989  0.0100558    3.858 0.000114 ***
## cutVery Good        0.0840323  0.0095755    8.776  < 2e-16 ***
## cutPremium          0.1912824  0.0094867   20.163  < 2e-16 ***
## cutIdeal            0.1476040  0.0095391   15.474  < 2e-16 ***
## colorE             -0.0517333  0.0019325  -26.770  < 2e-16 ***
## colorF             -0.0913307  0.0019583  -46.638  < 2e-16 ***
## colorG             -0.1571669  0.0019213  -81.802  < 2e-16 ***
## colorH             -0.2510189  0.0020493 -122.488  < 2e-16 ***
## colorI             -0.3762973  0.0023105 -162.864  < 2e-16 ***
## colorJ             -0.5164845  0.0029208 -176.832  < 2e-16 ***
## depth               0.0591495  0.0004770  124.016  < 2e-16 ***
## table               0.0079682  0.0003258   24.461  < 2e-16 ***
## x                   1.3119393  0.0030050  436.589  < 2e-16 ***
## carat:claritySI2    0.1924624  0.0132403   14.536  < 2e-16 ***
## carat:claritySI1    0.2335036  0.0133088   17.545  < 2e-16 ***
## carat:clarityVS2    0.2213278  0.0133897   16.530  < 2e-16 ***
## carat:clarityVS1    0.2398993  0.0135888   17.654  < 2e-16 ***
## carat:clarityVVS2   0.3063160  0.0143521   21.343  < 2e-16 ***
## carat:clarityVVS1   0.2650088  0.0157789   16.795  < 2e-16 ***
## carat:clarityIF     0.2139468  0.0192142   11.135  < 2e-16 ***
## carat:cutGood       0.0763946  0.0097553    7.831 4.93e-15 ***
## carat:cutVery Good  0.0624647  0.0091085    6.858 7.07e-12 ***
## carat:cutPremium   -0.0843867  0.0089598   -9.418  < 2e-16 ***
## carat:cutIdeal      0.0222708  0.0090071    2.473 0.013417 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1202 on 51846 degrees of freedom
## Multiple R-squared:  0.9854, Adjusted R-squared:  0.9854 
## F-statistic: 1.094e+05 on 32 and 51846 DF,  p-value: < 2.2e-16

Influential observations were identified using Cook’s Distance with a threshold of 4/n. Observations exceeding this threshold were flagged as potentially influential. These points were extracted and removed to evaluate their impact on the model. A revised model was then fitted on the cleaned dataset to assess improvements in stability and fit.

10 Conclusion

From the collinearity check, I learned that some variables especially carat and x were highly related to each other. This makes sense because both are measuring the size of the diamond. Because of this multicollinearity, the coefficients can become unstable and harder to interpret, even though the model may still predict well. So I realized there is a trade off between keeping variables for better prediction versus removing them for clearer interpretation.

From the model diagnostics, I learned that my original model had issues like non constant variance, abnormal residuals, and some influential points. The residual plots showed patterns, which means the model was missing some relationships. After adding interaction terms like carat with clarity, the model improved a lot, with higher R-squared and lower error. This showed me that the relationship between variables is not purely additive, and interactions are important.

11 Problem 1: Modeling with Multiple Categorical Predictors

Previously, we examined categorical interactions with the response variable using separate boxplots. We can now visualize Cut and Clarity in the same plot for further analyze.

ggplot(diamonds, aes(x = carat, y = log(price), color = clarity)) +
  geom_smooth(method = "lm", se = FALSE, size = 1.2) + 
  facet_wrap(~ cut, ncol = 3) +
  scale_color_viridis_d(option = "plasma") + 
  theme_minimal() +
  labs(
    title = "Visualizing Triple Interaction: Carat, Clarity, and Cut",
    subtitle = "Diverging slopes across panels confirm that interaction terms are necessary",
    x = "Carat (Weight)",
    y = "Log of Price",
    color = "Clarity Grade"
  ) +
  theme(legend.position = "bottom")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

This plot shows that a diamond’s weight (carat) and its quality (cut and clarity) work together to decide the price. You can see this because the lines in each box “fan out” rather than staying parallel. This means that as a diamond gets bigger, the price doesn’t just go up rather it jumps significantly for high-quality diamonds. Including these interactions is necessary for our model because the value of each extra carat depends entirely on the diamond’s specific cut and clarity.

models <- list(
  Full = full_model,
  Reduced1 = reduced_model,
  Reduced2 = reduced_model2,
  Interaction = interaction_model,
  Log = log_model,
  Log_Final = log_model_final
)

model_summary <- data.frame(
Model = names(models),
R2 = sapply(models, function(m) summary(m)$r.squared),
Adj_R2 = sapply(models, function(m) summary(m)$adj.r.squared),
Residual_SE = sapply(models, function(m) summary(m)$sigma)
)
model_summary

##                   Model        R2    Adj_R2  Residual_SE
## Full               Full 0.9197915 0.9197573 1130.0944260
## Reduced1       Reduced1 0.9197880 0.9197568 1130.0979006
## Reduced2       Reduced2 0.9160543 0.9160232 1156.0899066
## Interaction Interaction 0.9457805 0.9457484  929.2196153
## Log                 Log 0.9716505 0.9716337    0.1708904
## Log_Final     Log_Final 0.9854119 0.9854029    0.1201679

To arrive at my final model, I followed a structured EDA and refinement process that consistently boosted performance at every step. I started with a Full linear model (R-Squared = 0.919) and experimented with Reduced versions, but quickly realized that a simple additive approach couldn’t capture the true complexity of diamond pricing. By introducing Interaction terms (Carat vs Clarity and Carat vs Cut), I significantly improved the model’s logic, acknowledging that weight becomes exponentially more valuable as quality increases. However, the biggest breakthrough came from applying a Log transformation to the price; this stabilized the variance and jumped my Adjusted R-Squared to 0.971. Finally, after conducting regression diagnostics and removing influential outliers that were skewing the data, I reached my Log_Final model, which achieved a remarkable 0.985 Adjusted and a much lower Residual Standard Error.

12 Problem 2: Model Selection and Competing Criteria

Previously, we looked at how our variables interact with the response variable. After making some manual adjustments to boost predictive power, I’m going to use stepwise selection to double check my work and see if a different combination of variables yields a better result.

12.1 Bacward selection

backward_model <- step(full_model, direction = "backward")

## Start:  AIC=758426.5
## price ~ carat + cut + color + clarity + depth + table + x + y + 
##     z
## 
##           Df  Sum of Sq        RSS    AIC
## - y        1 3.1549e+05 6.8857e+10 758425
## <none>                  6.8857e+10 758426
## - z        1 2.8609e+06 6.8860e+10 758427
## - table    1 1.0558e+08 6.8962e+10 758507
## - depth    1 2.5286e+08 6.9110e+10 758622
## - cut      4 8.6357e+08 6.9720e+10 759091
## - x        1 1.1996e+09 7.0056e+10 759356
## - color    6 1.7082e+10 8.5939e+10 770368
## - clarity  7 3.5703e+10 1.0456e+11 780945
## - carat    1 6.8440e+10 1.3730e+11 795649
## 
## Step:  AIC=758424.7
## price ~ carat + cut + color + clarity + depth + table + x + z
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  6.8857e+10 758425
## - z        1 2.6622e+06 6.8860e+10 758425
## - table    1 1.0584e+08 6.8963e+10 758506
## - depth    1 2.5637e+08 6.9114e+10 758623
## - cut      4 8.6409e+08 6.9721e+10 759089
## - x        1 1.5413e+09 7.0398e+10 759617
## - color    6 1.7082e+10 8.5940e+10 770366
## - clarity  7 3.5708e+10 1.0457e+11 780946
## - carat    1 6.8520e+10 1.3738e+11 795679

summary(backward_model)

## 
## Call:
## lm(formula = price ~ carat + cut + color + clarity + depth + 
##     table + x + z, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21378.8   -592.5   -183.5    376.3  10694.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2198.886    407.163   5.401 6.67e-08 ***
## carat        11257.752     48.602 231.630  < 2e-16 ***
## cutGood        580.325     33.572  17.286  < 2e-16 ***
## cutVery Good   727.431     32.214  22.581  < 2e-16 ***
## cutPremium     762.287     32.226  23.654  < 2e-16 ***
## cutIdeal       833.352     33.396  24.954  < 2e-16 ***
## colorE        -209.100     17.893 -11.686  < 2e-16 ***
## colorF        -272.837     18.093 -15.080  < 2e-16 ***
## colorG        -482.035     17.716 -27.209  < 2e-16 ***
## colorH        -980.247     18.836 -52.042  < 2e-16 ***
## colorI       -1466.257     21.162 -69.287  < 2e-16 ***
## colorJ       -2369.412     26.131 -90.675  < 2e-16 ***
## claritySI2    2702.855     43.815  61.688  < 2e-16 ***
## claritySI1    3665.735     43.631  84.018  < 2e-16 ***
## clarityVS2    4267.476     43.850  97.319  < 2e-16 ***
## clarityVS1    4578.702     44.541 102.796  < 2e-16 ***
## clarityVVS2   4951.100     45.851 107.983  < 2e-16 ***
## clarityVVS1   5008.029     47.156 106.201  < 2e-16 ***
## clarityIF     5345.420     51.020 104.772  < 2e-16 ***
## depth          -64.003      4.517 -14.168  < 2e-16 ***
## table          -26.501      2.911  -9.103  < 2e-16 ***
## x            -1000.354     28.795 -34.740  < 2e-16 ***
## z              -47.925     33.194  -1.444    0.149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1130 on 53917 degrees of freedom
## Multiple R-squared:  0.9198, Adjusted R-squared:  0.9198 
## F-statistic: 2.81e+04 on 22 and 53917 DF,  p-value: < 2.2e-16

Using backward selection method, formula only drop non-significant p-value from the full model which was y and kept the other ones. This is could be a good method for detecthing significant predictors in large dataset, however it lack detecting its relationships in context. now let’s try to forward selection method. First we have to create a Null model.

12.2 Forward selection

null_model <- lm(price ~ 1, data = diamonds)
forward_model <- step(null_model, scope = list(lower = null_model, upper = full_model), 
                      direction = "forward")

## Start:  AIC=894477.9
## price ~ 1
## 
##           Df  Sum of Sq        RSS    AIC
## + carat    1 7.2913e+11 1.2935e+11 792389
## + x        1 6.7152e+11 1.8695e+11 812259
## + y        1 6.4296e+11 2.1552e+11 819929
## + z        1 6.3677e+11 2.2170e+11 821454
## + color    6 2.6849e+10 8.3162e+11 892776
## + clarity  7 2.3308e+10 8.3517e+11 893007
## + table    1 1.3876e+10 8.4460e+11 893601
## + cut      4 1.1042e+10 8.4743e+11 893788
## + depth    1 9.7323e+07 8.5838e+11 894474
## <none>                  8.5847e+11 894478
## 
## Step:  AIC=792389.4
## price ~ carat
## 
##           Df  Sum of Sq        RSS    AIC
## + clarity  7 3.9082e+10 9.0264e+10 772998
## + color    6 1.2561e+10 1.1678e+11 786891
## + cut      4 6.1332e+09 1.2321e+11 789777
## + x        1 3.5206e+09 1.2583e+11 790903
## + z        1 2.8493e+09 1.2650e+11 791190
## + table    1 1.4377e+09 1.2791e+11 791789
## + y        1 1.2425e+09 1.2810e+11 791871
## + depth    1 1.1546e+09 1.2819e+11 791908
## <none>                  1.2935e+11 792389
## 
## Step:  AIC=772998.5
## price ~ carat + clarity
## 
##         Df  Sum of Sq        RSS    AIC
## + color  6 1.6402e+10 7.3862e+10 762193
## + x      1 1.8542e+09 8.8410e+10 771881
## + cut    4 1.7808e+09 8.8483e+10 771932
## + z      1 1.4814e+09 8.8783e+10 772108
## + y      1 7.4127e+08 8.9523e+10 772556
## + table  1 3.7751e+08 8.9886e+10 772774
## + depth  1 3.5822e+08 8.9906e+10 772786
## <none>                9.0264e+10 772998
## 
## Step:  AIC=762193.4
## price ~ carat + clarity + color
## 
##         Df  Sum of Sq        RSS    AIC
## + x      1 2733710969 7.1128e+10 760161
## + z      1 1842294631 7.2020e+10 760833
## + cut    4 1699187372 7.2163e+10 760946
## + y      1 1145039064 7.2717e+10 761353
## + table  1  409645878 7.3452e+10 761895
## + depth  1  174658715 7.3687e+10 762068
## <none>                7.3862e+10 762193
## 
## Step:  AIC=760161.1
## price ~ carat + clarity + color + x
## 
##         Df  Sum of Sq        RSS    AIC
## + cut    4 1918248123 6.9210e+10 758694
## + depth  1  722282102 7.0406e+10 759613
## + table  1  273738191 7.0855e+10 759955
## + z      1  199547343 7.0929e+10 760012
## + y      1    5354253 7.1123e+10 760159
## <none>                7.1128e+10 760161
## 
## Step:  AIC=758694.4
## price ~ carat + clarity + color + x + cut
## 
##         Df Sum of Sq        RSS    AIC
## + depth  1 244682865 6.8965e+10 758505
## + z      1  72666922 6.9137e+10 758640
## + table  1   9935285 6.9200e+10 758689
## <none>               6.9210e+10 758694
## + y      1    982101 6.9209e+10 758696
## 
## Step:  AIC=758505.4
## price ~ carat + clarity + color + x + cut + depth
## 
##         Df Sum of Sq        RSS    AIC
## + table  1 105497218 6.8860e+10 758425
## <none>               6.8965e+10 758505
## + z      1   2323719 6.8963e+10 758506
## + y      1    298553 6.8965e+10 758507
## 
## Step:  AIC=758424.8
## price ~ carat + clarity + color + x + cut + depth + table
## 
##        Df Sum of Sq        RSS    AIC
## + z     1   2662170 6.8857e+10 758425
## <none>              6.8860e+10 758425
## + y     1    116788 6.8860e+10 758427
## 
## Step:  AIC=758424.7
## price ~ carat + clarity + color + x + cut + depth + table + z
## 
##        Df Sum of Sq        RSS    AIC
## <none>              6.8857e+10 758425
## + y     1    315487 6.8857e+10 758426

summary(forward_model)

## 
## Call:
## lm(formula = price ~ carat + clarity + color + x + cut + depth + 
##     table + z, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21378.8   -592.5   -183.5    376.3  10694.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2198.886    407.163   5.401 6.67e-08 ***
## carat        11257.752     48.602 231.630  < 2e-16 ***
## claritySI2    2702.855     43.815  61.688  < 2e-16 ***
## claritySI1    3665.735     43.631  84.018  < 2e-16 ***
## clarityVS2    4267.476     43.850  97.319  < 2e-16 ***
## clarityVS1    4578.702     44.541 102.796  < 2e-16 ***
## clarityVVS2   4951.100     45.851 107.983  < 2e-16 ***
## clarityVVS1   5008.029     47.156 106.201  < 2e-16 ***
## clarityIF     5345.420     51.020 104.772  < 2e-16 ***
## colorE        -209.100     17.893 -11.686  < 2e-16 ***
## colorF        -272.837     18.093 -15.080  < 2e-16 ***
## colorG        -482.035     17.716 -27.209  < 2e-16 ***
## colorH        -980.247     18.836 -52.042  < 2e-16 ***
## colorI       -1466.257     21.162 -69.287  < 2e-16 ***
## colorJ       -2369.412     26.131 -90.675  < 2e-16 ***
## x            -1000.354     28.795 -34.740  < 2e-16 ***
## cutGood        580.325     33.572  17.286  < 2e-16 ***
## cutVery Good   727.431     32.214  22.581  < 2e-16 ***
## cutPremium     762.287     32.226  23.654  < 2e-16 ***
## cutIdeal       833.352     33.396  24.954  < 2e-16 ***
## depth          -64.003      4.517 -14.168  < 2e-16 ***
## table          -26.501      2.911  -9.103  < 2e-16 ***
## z              -47.925     33.194  -1.444    0.149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1130 on 53917 degrees of freedom
## Multiple R-squared:  0.9198, Adjusted R-squared:  0.9198 
## F-statistic: 2.81e+04 on 22 and 53917 DF,  p-value: < 2.2e-16

Our forward selection process started with an null model and systematically added variables that provided the most significant improvement to the model’s fit, measured by a decreasing AIC score. It first identified carat as the single most important predictor, followed by clarity, color, and physical dimensions, continuing until adding new variables no longer meaningfully improved the model. Eventually, it ended with the same result as a backward selection. One again this is good method but still lacks insight of variable relationships.

12.3 Model evaluation and Cross-Validation

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:boot':
## 
##     melanoma

train_control <- trainControl(method = "cv", number = 10)

log_final_model_cv <- train(log(price) ~ carat * clarity + carat * cut + color + depth + table + x,
                       data = diamonds,
                       method = "lm",
                       trControl = train_control)

log_final_model_cv$results

##   intercept      RMSE Rsquared       MAE    RMSESD  RsquaredSD       MAESD
## 1      TRUE 0.1699695 0.971408 0.1106054 0.0271839 0.009128032 0.001172535

I evaluated my models using cross-validation to see how well they predict new data. The final log-transformed model performed very well, with an RMSE of about 0.17 and an R² of about 0.97, which means it explains almost all of the variation in the log of price and has very low prediction error. Compared to earlier models, this log model improved performance and also helped address issues like non-constant variance and non-normal residuals. When I compared it with stepwise models, I noticed that while those models were simpler, they did not provide better predictive performance. This shows that some methods focus more on simplifying the model, while cross-validation focuses on prediction accuracy, and these goals do not always match.

Based on this, I chose the log interaction model because it provides the best predictive performance and better satisfies model assumptions. The trade-off is that the model is more complex due to interaction terms, but the key predictors like carat, clarity, and cut are still highly significant and consistent. In simple terms, the model shows that diamond price increases with size, but the rate of increase depends on quality—diamonds with better clarity and cut increase in price much faster as carat increases. Since the model uses a log transformation, these effects are interpreted in percentage terms rather than absolute changes. Overall, my final model leans more toward prediction, because I prioritized accuracy and model performance over simplicity.