group10_2

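Why x, y, z were excluded and why logs were used

Before modeling, rows with invalid measurements (x, y, or z equal to 0) were removed to ensure data accuracy; the reasons for leaving x, y and z out of the model itself are given in the correlation analysis below. Since the original price variable exhibited a strong right skew, both price and carat were log-transformed. This transformation stabilized the variance, reduced skewness, and linearized the relationship between price and carat. Strictly speaking, OLS inference requires approximately normal residuals rather than normally distributed variables, but a more symmetric response makes that condition easier to meet.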

The histogram of log(price) shows a more symmetrical distribution compared to the original variable. This confirms that the log transformation was effective in normalizing the data. After transformation, the variable became more suitable for linear modeling as it reduced the influence of extremely high prices.

The scatter plot of log(price) vs log(carat) reveals a strong positive and approximately linear relationship. The red regression line fits the data points closely. On the original scale, price rises faster than proportionally with carat (a power-law relationship); once both variables are log-transformed, the relationship becomes linear and easier to model. This supports the view that carat weight is the major determinant of diamond price. (We anticipated this from the previous graph; see preliminary observation 5.3.)

Correlation analysis supports these findings.

Carat and price show a very strong positive correlation (r = 0.92), meaning that larger diamonds are generally more expensive. The dimensions x, y and z are also strongly correlated with carat (r = 0.95–0.98) and with price, because they all measure the physical size of the diamond. However, because x, y and z are almost perfectly correlated with carat, including them in the model would lead to multicollinearity, a condition where predictors provide redundant information. Multicollinearity inflates standard errors and makes the interpretation of coefficients unreliable. Since carat already captures the overall size of the diamond, x, y and z were excluded from the model to maintain simplicity and statistical validity.
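
The multicollinearity argument can be checked numerically with variance inflation factors (VIF). The block below is a minimal sketch, not part of the original analysis: it assumes the `diamonds` data from ggplot2, and the names `d`, `predictors`, and `vif_manual` are illustrative. It computes the VIF by hand as 1 / (1 - R²), where R² comes from regressing each predictor on the others; `car::vif()` on a fitted model should give a comparable check.

```r
# Sketch: VIFs for a hypothetical model that keeps x, y and z next to log(carat).
library(ggplot2)  # used only for the diamonds data

d <- subset(diamonds, x > 0 & y > 0 & z > 0)  # same cleaning rule as the report
d$log_carat <- log(d$carat)

predictors <- c("log_carat", "x", "y", "z")

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the rest
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = d))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 1))
```

A common rule of thumb treats VIF values above 5 to 10 as a sign of problematic collinearity.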

In contrast, depth and table show weak correlations with price (close to 0), suggesting that cutting proportions have less direct influence on diamond pricing. The negative correlation between depth and table (r = –0.3) indicates a geometric trade-off in diamond proportions.

Model Construction and Comparison

Three models were built incrementally:

Model 1: log(price) ~ log(carat)
→ Examines the basic relationship between size and price.

Model 2: log(price) ~ log(carat) + cut + color + clarity
→ Adds quality characteristics to improve prediction.

Model 3: log(price) ~ log(carat) + cut + color + clarity + depth + table
→ Further adds geometric proportions.

Model 1 explained 93% of the variation in price (R² = 0.933), confirming that carat alone is a powerful predictor.
Model 2 significantly improved model fit (Adjusted R² = 0.983) by including the quality variables cut, color, and clarity, which were all statistically significant (p < 0.001).
Model 3 added depth and table, but these variables were not significant (p > 0.05) and did not improve the model’s performance.
Therefore, Model 2 was selected as the optimal model for its simplicity, interpretability, and strong explanatory power.

Coefficient Interpretation

The regression coefficients from Model 2 show that:

log(carat) has the largest positive coefficient (p < 0.001), confirming that diamond size strongly increases price.

Cut quality increases price significantly (p < 0.001).

Clarity contributes positively and significantly to price.

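Color has a negative linear coefficient. Because color is ordered from D (best) to J (worst), this means that price falls as the color grade worsens.

log(carat) and the leading terms of cut, color, and clarity are all highly significant (p < 0.001), although a few higher-order contrast terms (cut^4, color^5, color^6, clarity^6) are not individually significant at the 5% level. These results confirm that both physical characteristics (carat) and optical quality factors (cut, clarity, color) jointly determine diamond pricing.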
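
Because both price and carat are logged, the log(carat) coefficient can be read as an elasticity. The short derivation below is a worked illustration using the Model 2 estimate of about 1.884; the symbols are introduced here for exposition and hold the quality variables fixed.

```latex
\log(\text{price}) = \beta_0 + \beta_1 \log(\text{carat}) + \dots
\quad\Longrightarrow\quad
\frac{\text{price}_2}{\text{price}_1}
  = \left(\frac{\text{carat}_2}{\text{carat}_1}\right)^{\beta_1}

\beta_1 \approx 1.884:\qquad
1.01^{1.884} \approx 1.019, \qquad 1.10^{1.884} \approx 1.197
```

In words: for a fixed cut, color, and clarity, a 1% increase in carat is associated with roughly a 1.9% higher price, and a 10% increase with roughly a 19.7% higher price.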

Model Fit and Performance

Model 2 explains 98.3% of the variation in log(price) (Adjusted R² = 0.983). Compared to Model 1, the residual standard error fell from 0.26 to 0.13 on the log scale, indicating a much tighter fit. Model 3 did not improve Adjusted R² (0.983) or AIC (−63,882 versus −63,883 for Model 2), meaning the additional predictors only added unnecessary complexity.

Diagnostic Analysis

The multiple regression diagnostic plots for Model 2 confirm that OLS assumptions are mostly met:

Residuals vs Fitted: Residuals are randomly scattered around zero, supporting the linearity assumption.

Q–Q Plot: Residuals closely follow the diagonal line, indicating approximate normality. (A Shapiro–Wilk test is reported below.)

Scale–Location Plot: Residuals display nearly constant variance, though a slight widening at higher fitted values suggests mild heteroskedasticity. (A Breusch–Pagan test is reported below.)

Residuals vs Leverage: No influential outliers are observed, confirming model stability.

Homoskedasticity Test (Breusch–Pagan)

The Breusch–Pagan test result was significant (BP = 1361.8, p < 0.001), indicating the presence of heteroskedasticity. Although this does not bias the coefficient estimates, it may affect the reliability of standard errors and p-values. Therefore, using robust standard errors is recommended for more precise inference.

Robust standard errors

Given that the Breusch–Pagan test indicated heteroskedasticity (p < 0.001), the Model 2 coefficients were re-tested with heteroskedasticity-robust (HC1) standard errors. All main predictors (log(carat), cut, color, and clarity) remained highly significant (p < 0.001), indicating that heteroskedasticity did not alter the substantive conclusions of the model.
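
To make clear what the HC1 correction does, the following sketch computes the robust covariance matrix by hand in base R. It is an illustration under assumptions, not the report's own code: it refits Model 2 on the ggplot2 `diamonds` data, and the names `d`, `fit`, `bread`, `meat`, and `vcov_hc1` are illustrative. The report itself uses `sandwich::vcovHC()` for this step.

```r
# Sketch: HC1 "sandwich" standard errors computed manually.
library(ggplot2)  # used only for the diamonds data

d <- subset(diamonds, x > 0 & y > 0 & z > 0)
d$log_price <- log(d$price)
d$log_carat <- log(d$carat)

fit <- lm(log_price ~ log_carat + cut + color + clarity, data = d)

X <- model.matrix(fit)   # n x k design matrix
e <- resid(fit)          # OLS residuals
n <- nrow(X)
k <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^-1
meat  <- crossprod(X * e)      # X' diag(e^2) X; row i of X is scaled by e_i

# HC1 = HC0 scaled by the degrees-of-freedom factor n / (n - k)
vcov_hc1  <- (n / (n - k)) * bread %*% meat %*% bread
se_robust <- sqrt(diag(vcov_hc1))

# Classical and robust standard errors side by side
print(cbind(classical = sqrt(diag(vcov(fit))), robust = se_robust))
```

The coefficient estimates are untouched; only the standard errors, and therefore the t statistics and p-values, change.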

Normality of Residuals (Shapiro–Wilk Test)

The Shapiro–Wilk test, performed on a random sample of 5,000 residuals, gave W = 0.969 (p < 0.001), indicating a deviation from perfect normality. However, with a sample this large, even minor deviations become statistically significant. Visual inspection of the Q–Q plot shows that the residuals are approximately normal, so the normality assumption is adequately satisfied in practice.

Cross-Validation

A 10-fold cross-validation was conducted using the caret package to evaluate predictive performance. The model maintained an average R² of about 0.98, confirming that it generalizes well to unseen data and is not overfitted.

Summary Judgment:

Rows with invalid zero dimensions (x, y, or z = 0) were removed to ensure data integrity.

Log transformations reduced skewness and linearized the price–carat relationship.

Excluding x, y and z from the predictors prevented multicollinearity.

Model 2 (log(carat) + cut + color + clarity) offered the best trade-off between interpretability and accuracy.

Heteroskedasticity was detected (Breusch–Pagan p < 0.001), but the conclusions are unchanged under robust standard errors.

Cross-validation confirmed high predictive power (R² ≈ 0.98).

Overall, Model 2 provides an accurate, interpretable, and statistically valid model for predicting diamond prices, successfully capturing both physical and qualitative aspects of valuation.

library(ggplot2)
library(dplyr)
library(car)
library(caret)
library(ggcorrplot)
library(broom)
library(lmtest)

# Remove rows with invalid measurements (x, y, or z = 0)
diamonds_clean <- diamonds %>%
  filter(x > 0, y > 0, z > 0)

# Log-transform price and carat to handle right-skewed distributions
diamonds_clean <- diamonds_clean %>%
  mutate(log_price = log(price),
         log_carat = log(carat))

This code removes rows with erroneous measurements (x, y, or z = 0) and then log-transforms the price and carat variables to reduce skewness. This makes the data more reliable and gives it a structure suitable for linear regression.

# Distribution of log(price)
ggplot(diamonds_clean, aes(x = log_price)) +
  geom_histogram(bins = 40, fill = "steelblue", color = "white") +
  labs(title = "Distribution of log(Price)", x = "log(Price)", y = "Count")

# Relationship between carat and price
ggplot(diamonds_clean, aes(x = log_carat, y = log_price)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "log(Price) vs log(Carat)",
       x = "log(Carat)", y = "log(Price)")

The histogram shows the distribution of the log-transformed diamond prices. The log transformation was applied to reduce right skewness and make the distribution more symmetric. Taking the logarithm of both price and carat also linearized their power-law relationship, making the association between them approximately linear.

# Correlation matrix for numerical variables
num_vars <- diamonds_clean %>% 
  select(carat, depth, table, x, y, z, price)
corr <- cor(num_vars)
ggcorrplot(corr, lab = TRUE, title = "Correlation Heatmap of Numerical Variables")

The heatmap shows that carat has a very strong positive correlation with price (r = 0.92), meaning that larger diamonds tend to be more expensive. The dimensions x, y and z are also highly correlated with both carat and price, as expected. In contrast, depth and table show very weak correlations with price, indicating that cut proportions have less direct influence on price than size and weight.

# Step-by-step model construction

# Baseline model: log(price) ~ log(carat)
model1 <- lm(log_price ~ log_carat, data = diamonds_clean)

# Add cut, color, clarity as categorical predictors
model2 <- lm(log_price ~ log_carat + cut + color + clarity, data = diamonds_clean)

# Full model: add geometric features
model3 <- lm(log_price ~ log_carat + cut + color + clarity + depth + table, data = diamonds_clean)

Model1 relates price to carat alone.
Model2 relates price to carat, cut, color, and clarity.
Model3 relates price to carat, cut, color, clarity, depth, and table.

# Compare models
summary(model1)

Call:
lm(formula = log_price ~ log_carat, data = diamonds_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.50852 -0.16951 -0.00594  0.16631  1.33796 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.448748   0.001365  6189.6   <2e-16 ***
log_carat   1.675908   0.001934   866.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2626 on 53918 degrees of freedom
Multiple R-squared:  0.933, Adjusted R-squared:  0.933 
F-statistic: 7.509e+05 on 1 and 53918 DF,  p-value: < 2.2e-16
summary(model2)

Call:
lm(formula = log_price ~ log_carat + cut + color + clarity, data = diamonds_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.01111 -0.08636 -0.00031  0.08342  1.94779 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  8.457037   0.001168 7238.916  < 2e-16 ***
log_carat    1.883745   0.001129 1668.528  < 2e-16 ***
cut.L        0.120737   0.002354   51.281  < 2e-16 ***
cut.Q       -0.035142   0.002072  -16.959  < 2e-16 ***
cut.C        0.013515   0.001799    7.512 5.92e-14 ***
cut^4       -0.001605   0.001441   -1.114   0.2653    
color.L     -0.439561   0.002027 -216.836  < 2e-16 ***
color.Q     -0.095702   0.001863  -51.378  < 2e-16 ***
color.C     -0.014760   0.001743   -8.468  < 2e-16 ***
color^4      0.011885   0.001601    7.424 1.15e-13 ***
color^5     -0.002219   0.001513   -1.467   0.1425    
color^6      0.002291   0.001375    1.666   0.0958 .  
clarity.L    0.916717   0.003582  255.950  < 2e-16 ***
clarity.Q   -0.243015   0.003334  -72.883  < 2e-16 ***
clarity.C    0.132430   0.002857   46.347  < 2e-16 ***
clarity^4   -0.066091   0.002285  -28.928  < 2e-16 ***
clarity^5    0.027474   0.001864   14.736  < 2e-16 ***
clarity^6   -0.001810   0.001624   -1.115   0.2650    
clarity^7    0.033555   0.001432   23.430  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1338 on 53901 degrees of freedom
Multiple R-squared:  0.9826,    Adjusted R-squared:  0.9826 
F-statistic: 1.693e+05 on 18 and 53901 DF,  p-value: < 2.2e-16
summary(model3)

Call:
lm(formula = log_price ~ log_carat + cut + color + clarity + 
    depth + table, data = diamonds_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.01154 -0.08632 -0.00032  0.08342  1.94116 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  8.5341918  0.0418202  204.068  < 2e-16 ***
log_carat    1.8838300  0.0011349 1659.875  < 2e-16 ***
cut.L        0.1184744  0.0026612   44.519  < 2e-16 ***
cut.Q       -0.0344439  0.0021294  -16.176  < 2e-16 ***
cut.C        0.0131590  0.0018302    7.190 6.56e-13 ***
cut^4       -0.0016521  0.0014640   -1.128   0.2591    
color.L     -0.4393955  0.0020292 -216.533  < 2e-16 ***
color.Q     -0.0956751  0.0018629  -51.358  < 2e-16 ***
color.C     -0.0148027  0.0017432   -8.492  < 2e-16 ***
color^4      0.0118843  0.0016011    7.423 1.16e-13 ***
color^5     -0.0021937  0.0015127   -1.450   0.1470    
color^6      0.0022783  0.0013753    1.657   0.0976 .  
clarity.L    0.9163025  0.0035879  255.384  < 2e-16 ***
clarity.Q   -0.2429325  0.0033346  -72.852  < 2e-16 ***
clarity.C    0.1322757  0.0028584   46.276  < 2e-16 ***
clarity^4   -0.0660370  0.0022849  -28.901  < 2e-16 ***
clarity^5    0.0273520  0.0018654   14.663  < 2e-16 ***
clarity^6   -0.0017544  0.0016238   -1.080   0.2799    
clarity^7    0.0335504  0.0014321   23.427  < 2e-16 ***
depth       -0.0009117  0.0004723   -1.931   0.0535 .  
table       -0.0003510  0.0003450   -1.017   0.3090    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1338 on 53899 degrees of freedom
Multiple R-squared:  0.9826,    Adjusted R-squared:  0.9826 
F-statistic: 1.523e+05 on 20 and 53899 DF,  p-value: < 2.2e-16

Model 1 shows a strong relationship between carat and price (R² = 0.93). Model 2 significantly improves the fit (Adj R² = 0.983) by including cut, color, and clarity, which are all highly significant predictors. Model 3 adds depth and table, but these variables are not statistically significant (p > 0.05) and do not improve the overall model fit. Therefore, Model 2 is selected as the final model.

# Get tidy summary
results <- tidy(model2)
print(results)
# A tibble: 19 × 5
   term        estimate std.error statistic   p.value
   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)  8.46      0.00117   7239.   0        
 2 log_carat    1.88      0.00113   1669.   0        
 3 cut.L        0.121     0.00235     51.3  0        
 4 cut.Q       -0.0351    0.00207    -17.0  2.44e- 64
 5 cut.C        0.0135    0.00180      7.51 5.92e- 14
 6 cut^4       -0.00161   0.00144     -1.11 2.65e-  1
 7 color.L     -0.440     0.00203   -217.   0        
 8 color.Q     -0.0957    0.00186    -51.4  0        
 9 color.C     -0.0148    0.00174     -8.47 2.55e- 17
10 color^4      0.0119    0.00160      7.42 1.15e- 13
11 color^5     -0.00222   0.00151     -1.47 1.42e-  1
12 color^6      0.00229   0.00138      1.67 9.58e-  2
13 clarity.L    0.917     0.00358    256.   0        
14 clarity.Q   -0.243     0.00333    -72.9  0        
15 clarity.C    0.132     0.00286     46.3  0        
16 clarity^4   -0.0661    0.00228    -28.9  1.35e-182
17 clarity^5    0.0275    0.00186     14.7  4.70e- 49
18 clarity^6   -0.00181   0.00162     -1.11 2.65e-  1
19 clarity^7    0.0336    0.00143     23.4  8.58e-121

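The coefficients show that log(carat) has the largest positive effect on log(price). Cut quality also increases price, while color has a strong negative linear effect (price falls from D towards J). Because cut, color, and clarity are ordered factors, R reports polynomial contrasts (.L, .Q, .C, ^4, ...) rather than one coefficient per level.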
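
Polynomial contrast terms such as `cut.L` or `clarity^4` are hard to read directly. One option, sketched below under assumptions, is to refit the same model with unordered copies of the factors so that R uses treatment (dummy) coding and reports each level relative to a baseline. The names `d`, `cut_f`, `color_f`, `clarity_f`, `fit_poly`, and `fit_dummy` are illustrative and not part of the original analysis.

```r
# Sketch: same model, re-parameterised with one coefficient per factor level.
library(ggplot2)  # used only for the diamonds data

d <- subset(diamonds, x > 0 & y > 0 & z > 0)
d$log_price <- log(d$price)
d$log_carat <- log(d$carat)

# Dropping the ordering switches lm() from polynomial to treatment contrasts
d$cut_f     <- factor(d$cut, ordered = FALSE)      # baseline: Fair
d$color_f   <- factor(d$color, ordered = FALSE)    # baseline: D
d$clarity_f <- factor(d$clarity, ordered = FALSE)  # baseline: I1

fit_poly  <- lm(log_price ~ log_carat + cut + color + clarity, data = d)
fit_dummy <- lm(log_price ~ log_carat + cut_f + color_f + clarity_f, data = d)

# The two fits are the same model, so fitted values agree
print(all.equal(unname(fitted(fit_poly)), unname(fitted(fit_dummy))))

# Per-level effects on log(price), relative to the baseline levels
print(round(coef(fit_dummy), 3))

# F tests for each predictor as a whole (all of its levels jointly)
print(drop1(fit_dummy, test = "F"))
```

The joint F tests answer whether a factor matters as a whole, which is a more direct question than the significance of any single contrast term.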

# Goodness-of-fit comparison
glance(model1)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.933         0.933 0.263   750873.       0     1 -4411. 8828. 8855.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(model2)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik     AIC     BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>
1     0.983         0.983 0.134   169267.       0    18 31961. -63883. -63705.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(model3)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik     AIC     BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>
1     0.983         0.983 0.134   152346.       0    20 31963. -63882. -63687.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Model1 includes only log(carat) and explains 93.3% of the variation in log(price).
While it shows a strong positive relationship between size and price (p < 0.001), the residual error (σ = 0.26) is relatively high, indicating that other factors influence price as well.

Model2 adds cut, color, and clarity, raising the Adjusted R² to 0.983 and cutting the residual error nearly in half (σ = 0.13). All main predictors are highly significant, confirming that diamond quality has a major effect on price. This model provides a substantial improvement in accuracy and fit.

Model3 further includes depth and table, but these variables are not statistically significant (p > 0.05) and do not improve Adjusted R², AIC, or residual error.
Their inclusion only increases model complexity without meaningful gains.

In conclusion, Model2 offers the best balance between simplicity and explanatory power, effectively capturing both size and quality effects on diamond price.

# Residual plots
par(mfrow=c(2,2))
plot(model2)

The diagnostic plots for Model2 indicate that the key linear regression assumptions are largely satisfied. Residuals are scattered randomly around zero with approximately constant variance, the Q–Q plot suggests near-normality, and no influential outliers are detected. Formal tests of constant variance and normality follow below. Overall, Model2 provides a stable fit for predicting log(price).

# Homoskedasticity test (Breusch-Pagan)
bptest(model2)

    studentized Breusch-Pagan test

data:  model2
BP = 1361.8, df = 18, p-value < 2.2e-16

While the Scale–Location plot already provides a visual check for constant variance, the Breusch–Pagan test offers a formal statistical test. Both methods assess homoskedasticity: if the plot shows random scatter and the BP test p-value is above 0.05, the model satisfies this assumption. Here the Breusch–Pagan test is significant (BP = 1361.8, p < 0.001), indicating heteroskedasticity in Model2. Although this does not bias the coefficient estimates, it may affect the reliability of the standard errors and p-values. Therefore, robust standard errors should be used for more reliable inference.

# Robust standard errors (HC1)
library(lmtest)
library(sandwich)
coeftest(model2, vcov = vcovHC(model2, type = "HC1"))

t test of coefficients:

              Estimate Std. Error   t value  Pr(>|t|)    
(Intercept)  8.4570371  0.0016233 5209.8097 < 2.2e-16 ***
log_carat    1.8837455  0.0012971 1452.3177 < 2.2e-16 ***
cut.L        0.1207374  0.0029211   41.3326 < 2.2e-16 ***
cut.Q       -0.0351415  0.0025280  -13.9008 < 2.2e-16 ***
cut.C        0.0135145  0.0019858    6.8057 1.016e-11 ***
cut^4       -0.0016055  0.0014253   -1.1264   0.26001    
color.L     -0.4395608  0.0021557 -203.9081 < 2.2e-16 ***
color.Q     -0.0957023  0.0019619  -48.7813 < 2.2e-16 ***
color.C     -0.0147601  0.0017917   -8.2379 < 2.2e-16 ***
color^4      0.0118852  0.0016039    7.4103 1.278e-13 ***
color^5     -0.0022186  0.0015053   -1.4739   0.14052    
color^6      0.0022907  0.0013370    1.7133   0.08667 .  
clarity.L    0.9167168  0.0053461  171.4743 < 2.2e-16 ***
clarity.Q   -0.2430149  0.0051072  -47.5830 < 2.2e-16 ***
clarity.C    0.1324298  0.0042271   31.3290 < 2.2e-16 ***
clarity^4   -0.0660911  0.0031009  -21.3137 < 2.2e-16 ***
clarity^5    0.0274741  0.0022230   12.3592 < 2.2e-16 ***
clarity^6   -0.0018098  0.0017524   -1.0328   0.30172    
clarity^7    0.0335550  0.0014129   23.7493 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
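
# Normality of residuals (Shapiro-Wilk test on a random sample;
# shapiro.test() accepts at most 5,000 observations)
set.seed(123)
# Note: the sample below is drawn from model3's residuals. The selected model
# is model2, whose residual summary above is nearly identical; use
# resid(model2) to test the final model directly.
sample_resid <- sample(resid(model3), 5000)
shapiro.test(sample_resid)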

    Shapiro-Wilk normality test

data:  sample_resid
W = 0.96887, p-value < 2.2e-16

The Shapiro–Wilk test on a random sample of 5,000 residuals indicated a deviation from perfect normality (W = 0.969, p < 0.001). However, given the large dataset, this result is likely not practically significant. The Q–Q plot suggests that residuals are approximately normal, meaning the normality assumption is adequately satisfied for inference.

# Cross-validation (10-fold)
library(caret)
set.seed(123)  # make the fold assignment reproducible
train_control <- trainControl(
  method = "cv",       # "cv" = k-fold cross-validation
  number = 10,         # 10 folds
  verboseIter = FALSE  # do not print training progress
)

cv_model <- train(
  log_price ~ log_carat + cut + color + clarity, data = diamonds_clean,
  method = "lm",
  trControl = train_control
)
# cv_model$results holds the cross-validated RMSE, R-squared, and MAE

A 10-fold cross-validation was performed using the caret package to evaluate the model’s predictive performance. The cross-validated results indicate that Model 2 predicts diamond prices accurately and generalizes well. The model, which includes size and quality factors (carat, cut, color, clarity), explains over 98% of the variance in log(price). Its residuals are small and well-behaved, indicating a strong and reliable fit.
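
For readers who want to see what the cross-validation does step by step, the sketch below repeats the idea in base R without caret. It is an illustration under assumptions: the fold assignment, the seed, and the names `d`, `fold`, and `cv_stats` are chosen here, so the numbers will not match caret's folds exactly. R-squared is computed as the squared correlation between predictions and held-out values.

```r
# Sketch: manual 10-fold cross-validation of Model 2.
library(ggplot2)  # used only for the diamonds data

d <- subset(diamonds, x > 0 & y > 0 & z > 0)
d$log_price <- log(d$price)
d$log_carat <- log(d$carat)

set.seed(123)
k <- 10
# Assign every row to one of k folds of (nearly) equal size, at random
fold <- sample(rep(seq_len(k), length.out = nrow(d)))

cv_stats <- t(sapply(seq_len(k), function(i) {
  train_set <- d[fold != i, ]
  holdout   <- d[fold == i, ]
  fit  <- lm(log_price ~ log_carat + cut + color + clarity, data = train_set)
  pred <- predict(fit, newdata = holdout)
  c(RMSE     = sqrt(mean((holdout$log_price - pred)^2)),
    Rsquared = cor(holdout$log_price, pred)^2)
}))

print(round(cv_stats, 4))            # per-fold performance
print(round(colMeans(cv_stats), 4))  # average across folds
```

If the per-fold values are close to each other and to the in-sample fit, that supports the claim that the model is not overfitted.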

# Predicting price for a hypothetical diamond (business use case)
new_diamond <- data.frame(
  log_carat = log(1.0),
  cut = "Ideal",
  color = "E",
  clarity = "VS1",
  depth = 61.5,  # depth and table are not used by model2;
  table = 57     # they would only matter for model3
)

predicted_log_price <- predict(model2, newdata = new_diamond)
predicted_price <- exp(predicted_log_price)

cat("💎 Predicted price for 1.0 carat, Ideal/E/VS1 diamond:", round(predicted_price, 2), "USD\n")
💎 Predicted price for 1.0 carat, Ideal/E/VS1 diamond: 6477.35 USD

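Using Model 2, the predicted price for a 1.0-carat diamond with Ideal cut, E color, and VS1 clarity is approximately $6,477. Because the model is fitted on the log scale, this back-transformed value is best read as an estimate of the typical (median) price rather than the mean price. The result illustrates the model’s practical use in estimating diamond prices based on key quality characteristics such as size, cut, color, and clarity.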
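
The point prediction can be complemented with an interval and with a correction for back-transforming from the log scale. The sketch below is one possible approach and not part of the original analysis: it refits Model 2 on the ggplot2 `diamonds` data, reports a 95% prediction interval in dollars, and applies Duan's smearing factor (the mean of the exponentiated residuals) to estimate the mean price. The names `d`, `fit`, `pred_log`, and `smear` are illustrative.

```r
# Sketch: prediction interval and smearing-corrected mean for the example diamond.
library(ggplot2)  # used only for the diamonds data

d <- subset(diamonds, x > 0 & y > 0 & z > 0)
d$log_price <- log(d$price)
d$log_carat <- log(d$carat)

fit <- lm(log_price ~ log_carat + cut + color + clarity, data = d)

new_diamond <- data.frame(
  log_carat = log(1.0),
  cut = "Ideal",
  color = "E",
  clarity = "VS1"
)

# Prediction on the log scale with a 95% prediction interval
pred_log <- predict(fit, newdata = new_diamond, interval = "prediction")

# Back-transform: exp() of the interval limits gives an interval in USD,
# and exp() of the fit estimates the median price
print(round(exp(pred_log), 2))

# Duan's smearing factor: mean of exp(residuals); at least 1 because the
# residuals average zero and exp() is convex
smear <- mean(exp(resid(fit)))
mean_price <- exp(pred_log[, "fit"]) * smear
print(round(c(smearing_factor = smear, mean_price = unname(mean_price)), 4))
```

With a residual standard error of about 0.13, the smearing factor should be close to 1, so the median and mean estimates differ only slightly here.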