Part 1 - Data Description and Descriptive Statistics

1.1) Selecting a random Sample:
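The analysis assumes the following packages are loaded; the list is inferred from the functions used throughout the report. Fixing a random seed is an optional extra step for reproducibility, and the seed value shown is arbitrary (not from the original analysis).

# Packages used in this report (inferred from the code below).
library(dplyr)      # data manipulation and the %>% pipe
library(ggplot2)    # histograms, density curves, and bar charts
library(gridExtra)  # grid.arrange() for arranging multiple plots
library(corrplot)   # correlation matrix visualization
library(faraway)    # vif() for multicollinearity checks (also called as faraway::vif())

# Optional: fix the random seed so the 1,000-row sample is reproducible.
# The seed value is arbitrary.
set.seed(2022)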

diamond <- read.csv("Diamonds Prices2022.csv")%>%
  na.omit()
diamond_sample = diamond[sample(nrow(diamond), 1000), ]

1.2) Description of Variables and Visualizations

summary(diamond_sample)
##        X             carat            cut               color          
##  Min.   :    3   Min.   :0.2300   Length:1000        Length:1000       
##  1st Qu.:13074   1st Qu.:0.4000   Class :character   Class :character  
##  Median :25174   Median :0.7100   Mode  :character   Mode  :character  
##  Mean   :26143   Mean   :0.8084                                        
##  3rd Qu.:39066   3rd Qu.:1.0525                                        
##  Max.   :53921   Max.   :3.0000                                        
##    clarity              depth           table           price        
##  Length:1000        Min.   :56.20   Min.   :52.00   Min.   :  327.0  
##  Class :character   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  956.5  
##  Mode  :character   Median :61.90   Median :57.00   Median : 2602.0  
##                     Mean   :61.75   Mean   :57.53   Mean   : 4048.0  
##                     3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5572.2  
##                     Max.   :67.40   Max.   :66.00   Max.   :18709.0  
##        x               y               z        
##  Min.   :3.870   Min.   :3.890   Min.   :2.310  
##  1st Qu.:4.718   1st Qu.:4.728   1st Qu.:2.910  
##  Median :5.695   Median :5.710   Median :3.550  
##  Mean   :5.755   Mean   :5.756   Mean   :3.557  
##  3rd Qu.:6.540   3rd Qu.:6.540   3rd Qu.:4.050  
##  Max.   :9.130   Max.   :9.030   Max.   :6.160
  1. carat: The weight of the diamond, measured in carats. A higher carat weight generally means a larger and more expensive diamond.

  2. cut: The quality of the diamond cut, which affects how well it reflects light. Typical values include:

    • Ideal
    • Premium
    • Very Good
    • Good
    • Fair
  3. color: The color grade of the diamond. The full GIA scale runs from D (colorless) to Z (light yellow or brown), but this dataset contains grades D through J; grades closer to D indicate a more colorless diamond.

  4. clarity: The clarity grade of the diamond, indicating the presence of inclusions or blemishes. Typical grades include:

    • IF (Internally Flawless)
    • VVS1 (Very, Very Slightly Included 1)
    • VVS2 (Very, Very Slightly Included 2)
    • VS1 (Very Slightly Included 1)
    • VS2 (Very Slightly Included 2)
    • SI1 (Slightly Included 1)
    • SI2 (Slightly Included 2)
    • I1 (Included 1)
  5. depth: The total depth percentage of the diamond, calculated as: \[ \text{depth} = \frac{z}{\text{mean}(x, y)} \times 100 = \frac{2z}{x + y} \times 100 \] (a quick numerical check of this formula appears after this list) where:

    • x: Length in millimeters
    • y: Width in millimeters
    • z: Height in millimeters
  6. table: The width of the diamond’s top facet as a percentage of its average diameter.

  7. price: The price of the diamond in US dollars.

  8. x: Length of the diamond in millimeters.

  9. y: Width of the diamond in millimeters.

  10. z: Height (depth) of the diamond in millimeters.
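As a quick sanity check of the depth formula above (a small illustration, not part of the original analysis), depth recomputed from x, y, and z should closely match the recorded depth column:

# Recompute depth from the dimensions and compare with the recorded values.
depth_check <- with(diamond_sample, 2 * z / (x + y) * 100)
summary(depth_check - diamond_sample$depth)   # differences should be near zero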

a1 = ggplot(diamond_sample, aes(x = depth)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lavender", color = "black", alpha = 0.5) +
  geom_density(color = 'red')+
  labs(title = "Distribution of Diamond Depth", x = "Depth", y = "Density")+
  theme_minimal()

a2 = ggplot(diamond_sample, aes(x = carat)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "darkblue", color = "black", alpha = 0.5) +
  geom_density(color = 'red')+
  labs(title = "Distribution of Diamond Carat", x = "Carat", y = "Density")+
  theme_minimal()

a3 = ggplot(diamond_sample, aes(x = table)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "brown", color = "black", alpha = 0.5) +
  geom_density(color = 'red')+
  labs(title = "Distribution of Diamond Table", x = "Table", y = "Density")+
  theme_minimal()

a4 = ggplot(diamond_sample, aes(x = price)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "darkgreen", color = "black", alpha = 0.5) +
  geom_density(color = 'red')+
  labs(title = "Distribution of Diamond Price", x = "Price", y = "Density")+
  theme_minimal()

grid.arrange(a1, a2, a3, a4,  nrow = 2, ncol = 2)

  • The Distribution of Diamond Depth appears to be approximately normal, centered around a depth of 61%, with most values falling between 59% and 63%. The distribution is symmetric with a well-defined peak and minimal skewness, suggesting that diamond depth is consistently within a narrow range for most diamonds.

  • The Distribution of Diamond Carat shows a highly right-skewed distribution, with a majority of diamonds weighing less than 1 carat. There is a steep decline in frequency as carat weight increases, indicating that larger diamonds are significantly less common. The presence of multiple peaks suggests that some specific carat weights may be more popular or preferred in the market.

  • The Distribution of Diamond Table also appears to be approximately normal, though slightly right-skewed, centered around 57%. The range is relatively narrow, and the distribution shows a single prominent peak, indicating that most diamonds have a similar table percentage, reflecting standard cutting practices.

  • The Distribution of Diamond Price exhibits strong right skewness, with most diamonds priced under $5,000 and a long tail extending towards higher prices. This indicates that while the majority of diamonds are relatively affordable, there are a few exceptionally expensive ones that drive the tail, likely due to larger carat sizes or superior quality.

b1 = ggplot(diamond_sample, aes(x = x)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lavender", color = "black", alpha = 0.5) +
  geom_density(color = 'red')+
  labs(title = "Distribution of Diamond Length", x = "Length (mm)", y = "Density")+
  theme_minimal()

b2 = ggplot(diamond_sample, aes(x = y)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "darkblue", color = "black", alpha = 0.5) +
  geom_density(color = 'red')+
  labs(title = "Distribution of Diamond Width", x = "Width (mm)", y = "Density")+
  theme_minimal()

b3 = ggplot(diamond_sample, aes(x = z)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "brown", color = "black", alpha = 0.5) +
  geom_density(color = 'red')+
  labs(title = "Distribution of Diamond Height", x = "Height (mm)", y = "Density")+
  theme_minimal()

grid.arrange(b1, b2, b3,  nrow = 3)

  • The Distribution of Diamond Length shows a right-skewed pattern, with most diamonds having lengths between 4 and 6 mm. There is a peak around 4.5 mm, and the frequency decreases as the length increases, indicating that longer diamonds are relatively uncommon.

  • The Distribution of Diamond Width also exhibits a right-skewed distribution, with most diamonds having widths between 4 and 6 mm. There is a noticeable peak near 4.5 mm, and the distribution gradually tails off as the width increases, suggesting that wider diamonds are less frequent.

  • The Distribution of Diamond Height demonstrates a pronounced right skew, with a peak around 3.5 mm. The frequency decreases steadily as height increases beyond this point, indicating that diamonds with greater heights are significantly less common. The distribution shows a few extreme values, hinting at the presence of taller diamonds that are rare in the dataset.
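Since cut and clarity are stored as character, the bar charts below order their categories alphabetically. An optional step (not in the original code) is to convert them to factors in grading order so the axes read from lowest to highest grade; the orderings below follow the standard diamond grading scales, and color (D-J) is already alphabetical, which matches its grading order.

# Optional: put the grades in quality order for the bar charts below.
diamond_sample <- diamond_sample %>%
  mutate(
    cut     = factor(cut, levels = c("Fair", "Good", "Very Good", "Premium", "Ideal")),
    clarity = factor(clarity, levels = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"))
  )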

Cut_counts <- diamond_sample %>%
  group_by(cut) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)

Color_counts <- diamond_sample %>%
  group_by(color) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)

Clarity_counts <- diamond_sample %>%
  group_by(clarity) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)


# Plot with percentage labels
ggplot(Cut_counts, aes(x = cut, y = Count, fill = factor(cut))) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), vjust = -0.5, size = 4) + 
  labs(
    x = "Cut",
    y = "Count",
    fill = "Cut",
    title = "Diamond Cut Distribution"
  ) +
  scale_fill_brewer(palette = "Paired")+
  theme_minimal()+
  ylim(0,450) 

# Plot with percentage labels
ggplot(Color_counts, aes(x = color, y = Count, fill = factor(color))) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), vjust = -0.5, size = 4) + 
  labs(
    x = "Color",
    y = "Count",
    fill = "Color",
    title = "Diamond Color Distribution"
  ) +
  scale_fill_viridis_d(option = "cividis")+
  theme_minimal()+
  ylim(0,220) 

# Plot with percentage labels
ggplot(Clarity_counts, aes(x = clarity, y = Count, fill = factor(clarity))) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), vjust = -0.5, size = 4) + 
  labs(
    x = "Clarity",
    y = "Count",
    fill = "Clarity",
    title = "Diamond Clarity Distribution"
  ) +
  scale_fill_viridis_d(option = "inferno")+
  theme_minimal()+
  ylim(0,275) 

1.3) Correlation Between Variables:

Our variables of interest will be:

  • Price
  • Carat
  • Depth
  • Cut
  • Clarity
numeric_cols = diamond_sample[, sapply(diamond_sample, is.numeric)]%>%
  dplyr::select(-X, -x, - y, -z, -table)

# Correlation matrix
cor_matrix = cor(numeric_cols)

# Correlation visualization
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.6, addCoef.col = "black", number.cex = .4)

# ANOVA for price by cut
anova_cut <- aov(price ~ cut, data = diamond_sample)
summary(anova_cut)
##              Df    Sum Sq  Mean Sq F value  Pr(>F)   
## cut           4 2.407e+08 60182136    3.83 0.00427 **
## Residuals   995 1.564e+10 15715381                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ANOVA for price by color
anova_color <- aov(price ~ color, data = diamond_sample)
summary(anova_color)
##              Df    Sum Sq  Mean Sq F value  Pr(>F)    
## color         6 5.397e+08 89956982   5.824 5.6e-06 ***
## Residuals   993 1.534e+10 15445912                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since a correlation matrix cannot assess the relationship between price and our categorical variables, we have instead used one-way ANOVAs. The p-value indicates whether the relationship between price and each categorical variable is statistically significant. Both cut (p = 0.0043) and color (p = 5.6e-06) are statistically significant at the \(\alpha\) = 0.05 level, with color showing the stronger association with price.
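A possible follow-up (a sketch, not part of the original analysis) is Tukey's HSD on the color ANOVA, which shows which pairs of color grades differ in mean price:

# Pairwise comparisons of mean price between color grades, with family-wise
# 95% confidence intervals.
TukeyHSD(anova_color, conf.level = 0.95)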

1.4) MLR Model Using all Variables:

full_model <- lm(price ~ carat + depth + x + y + z + table + cut + color + clarity, data = diamond_sample)
summary(full_model)
## 
## Call:
## lm(formula = price ~ carat + depth + x + y + z + table + cut + 
##     color + clarity, data = diamond_sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8902.6  -617.4  -144.1   389.6  7411.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5123.08    3719.72   1.377 0.168743    
## carat        11527.97     388.93  29.640  < 2e-16 ***
## depth          -57.00      46.36  -1.230 0.219134    
## x            -2949.33     843.82  -3.495 0.000495 ***
## y             1814.81     842.50   2.154 0.031478 *  
## z               16.19     532.08   0.030 0.975734    
## table          -71.41      23.22  -3.076 0.002156 ** 
## cutGood       -204.41     283.54  -0.721 0.471148    
## cutIdeal       227.74     278.05   0.819 0.412934    
## cutPremium     124.00     270.09   0.459 0.646250    
## cutVery Good    63.04     273.94   0.230 0.818034    
## colorE        -325.17     135.64  -2.397 0.016701 *  
## colorF        -278.33     133.16  -2.090 0.036854 *  
## colorG        -519.03     131.66  -3.942 8.66e-05 ***
## colorH       -1096.89     142.69  -7.687 3.66e-14 ***
## colorI       -1541.38     160.57  -9.599  < 2e-16 ***
## colorJ       -2494.53     197.47 -12.632  < 2e-16 ***
## clarityIF     5972.93     505.49  11.816  < 2e-16 ***
## claritySI1    3932.27     468.68   8.390  < 2e-16 ***
## claritySI2    2838.63     468.65   6.057 1.98e-09 ***
## clarityVS1    4769.50     474.15  10.059  < 2e-16 ***
## clarityVS2    4479.02     469.23   9.545  < 2e-16 ***
## clarityVVS1   5026.31     485.02  10.363  < 2e-16 ***
## clarityVVS2   5179.29     480.41  10.781  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1144 on 976 degrees of freedom
## Multiple R-squared:  0.9196, Adjusted R-squared:  0.9177 
## F-statistic: 485.2 on 23 and 976 DF,  p-value: < 2.2e-16

Upon initial inspection, I find it surprising that categories of the same categorical variable, such as colorF and colorG, can have such different relationships with price. It is also interesting that x is a significant predictor while z and depth are not (and y only marginally so), even though all of these measure the diamond's size. Finally, the extremely large coefficient for carat stands out to me; further analysis must be done.

Part 2:

2.1) SLR with Carat & Price

carat_model <- lm(price ~ carat , data = diamond_sample)
summary(carat_model)
## 
## Call:
## lm(formula = price ~ carat, data = diamond_sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9322.4  -800.0   -51.1   506.0  9902.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2156.8       96.8  -22.28   <2e-16 ***
## carat         7675.7      103.1   74.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1558 on 998 degrees of freedom
## Multiple R-squared:  0.8475, Adjusted R-squared:  0.8473 
## F-statistic:  5544 on 1 and 998 DF,  p-value: < 2.2e-16

2.2) Interpretation of Summary Statistics

Hypothesis Test

Our equation for our SLR is:

\[ \hat{y} = \hat{\beta_0 }+ \hat{\beta_1}x_1 \]

and thus we get the SLR for this model to be:

\[ \hat{price} = -2156.8 + 7675.7x_1 \]

Where:

  • \(\hat{\beta_0}\) is our intercept: the estimate of -2156.80 suggests that if a diamond had a carat of 0, its predicted price would be -$2,156.80 (an extrapolation with no physical meaning).
  • \(\hat{\beta_1}\) is the coefficient for carat: the estimate of 7675.70 indicates that for each additional carat, we expect an increase of about $7,675.70 in price.
  • \(x_1\) is our carat value
  • \(\hat{y}\) is our expected price output
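As a worked example using the fitted coefficients above, a 1-carat diamond has a predicted price of \(-2156.8 + 7675.7(1) \approx \$5{,}519\), while a 0.5-carat diamond has a predicted price of \(-2156.8 + 7675.7(0.5) \approx \$1{,}681\).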
plot(diamond_sample$carat, diamond_sample$price, main = "Price vs Carat",
     xlab = "Carat", ylab = "Price", pch = 16, col = "purple")
abline(carat_model, col = "blue", lwd = 2)

Hypothesis Testing

  • \(H_0: \beta_1 = 0\)

    • Our Null Hypothesis, that our slope (\(\beta_1\)) is equal to 0. This indicates that there is no relationship between our carat value and the price of our diamond.
  • \(H_A:\beta_1 \neq 0\)

    • Our Alternate Hypothesis, we have statistically significant evidence that there exists a relationship between our diamond’s carat value and its price.

Our model gives a t-value of 74.46, which tells us that the estimated carat coefficient is that many standard errors (of about 103.1 each) away from 0. This t-value is extreme, producing a p-value of <2e-16: the probability of observing an estimate of \(\beta_1\) at least this far from zero purely by chance if the null hypothesis were true. Since our p-value is less than our significance level of \(\alpha\) = 0.05, we reject the null hypothesis and conclude that we have statistically significant evidence of a relationship between a diamond's carat value and its price.

R-squared and Adjusted R-squared:

  • An \(R^2\) of 0.8475 means that 84.75% of the variability in diamond price is explained by the model, suggesting that the model fits the data well.

  • Adjusted \(R^2\) accounts for the number of predictors and penalizes overfitting. Since our model has only one predictor, the adjusted \(R^2\) (0.8473) is essentially the same as the \(R^2\).

Residual Standard Error (RSE)

Residual Standard Error (RSE) = 1558

  • RSE measures the average deviation of actual diamond price from the predicted price.
  • An RSE of 1558 means that the model's predictions typically deviate from the actual prices by about $1,558.
  • Lower RSE values indicate more precise predictions, while higher values suggest greater prediction errors.

Confidence Interval

confint(carat_model)
##                 2.5 %    97.5 %
## (Intercept) -2346.714 -1966.794
## carat        7473.425  7878.002

The 95% confidence interval for the intercept ranges from -2346.71 to -1966.79, meaning we are 95% confident that the true intercept lies within this range.

The 95% confidence interval for the carat coefficient ranges from 7473.43 to 7878.00, meaning we are 95% confident that each one-carat increase in weight is associated with an increase in price of roughly $7,473 to $7,878.

Prediction Interval:

new_data <- data.frame(carat = 0.5)
predict(carat_model, newdata = new_data, interval = "prediction")
##        fit       lwr     upr
## 1 1681.103 -1378.104 4740.31

For a diamond of 0.5 carats, the fitted value is about $1,681, and the 95% prediction interval for an individual diamond's price runs from -$1,378.10 to $4,740.31. The negative lower bound is not meaningful for a price and hints that this untransformed model is not well suited to the data, which we examine next.

2.3) Testing Assumptions on Residuals

Assumptions made for SLR are:

  • Linearity - \(E(\epsilon) = 0\)
  • Constant Variance \(( \text{Var}(\epsilon) = \sigma^2 I )\)
  • Independence \(\epsilon_i \perp \epsilon_j\)
  • Normality of Residuals \(\epsilon_i \sim N(0, \sigma^2 I)\)
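Besides the diagnostic plots below, formal tests can complement the visual checks. This is an optional sketch, not part of the original analysis; the lmtest package is an added dependency.

# Optional formal checks of the SLR assumptions (sketch; lmtest is an
# additional package not used elsewhere in this report).
library(lmtest)
bptest(carat_model)                    # Breusch-Pagan test for constant variance
shapiro.test(residuals(carat_model))   # Shapiro-Wilk test for normality of residuals
dwtest(carat_model)                    # Durbin-Watson test for independence (autocorrelation)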

Linearity:

plot(carat_model$fitted.values, residuals(carat_model),
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "red")

Immediately we can see that the residuals have a funnel shape, with their spread around zero increasing as the fitted values increase. This violates the constant-variance assumption of SLR, so we will transform the model using a log transformation.

Before doing so, we will check the other assumptions of normality and constant variance:

par(mfrow = c(2, 2)) 
plot(carat_model)

We can see here that the assumption of normality is violated in our Q-Q residual plot, along with constant variance. Thus, let us perform our log transformation.

Log Transformation:

log_model<- lm(log(price) ~ log(carat), data = diamond_sample)

plot(log_model$fitted.values, residuals(log_model),
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "red")

Now the linearity looks much better: the residuals' spread is constant across the fitted values and centered around 0. Let us check the normality condition as well.

par(mfrow = c(2, 2)) 
plot(log_model)

As seen, the normality and other assumptions for our linear model are now met much better than before. We can move forward and analyze our new model.

2.4) Summary of Transformed Model

summary(log_model)
## 
## Call:
## lm(formula = log(price) ~ log(carat), data = diamond_sample)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00112 -0.18199  0.00328  0.17844  1.15782 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.462485   0.009972   848.6   <2e-16 ***
## log(carat)  1.672504   0.014181   117.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2644 on 998 degrees of freedom
## Multiple R-squared:  0.9331, Adjusted R-squared:  0.933 
## F-statistic: 1.391e+04 on 1 and 998 DF,  p-value: < 2.2e-16

\[ \widehat{\log(\text{price})} = 8.462 + 1.673\,\log(\text{carat}) \]

Model Comparisons:

  1. The log-log model has a noticeably higher R-squared (0.9331) than the original linear model (0.8475), indicating that it explains more of the variance in the data. The adjusted R-squared is also higher (0.933 versus 0.8473), confirming a better fit even after accounting for the number of predictors.

  2. The log-log model's coefficient is interpreted as an elasticity, showing how a percentage change in carat translates into a percentage change in price (see the worked example after this list). The original model interprets the coefficient as an absolute dollar increase per unit increase in carat, which is less natural given the skewed nature of the data.

  3. The log-log model has a much lower residual standard error (0.26) than the original model (1558). The two values are on different scales (log-dollars versus dollars) and so are not directly comparable, but the log model's residual diagnostics are clearly better.

  4. Since both price and carat are skewed, the log transformation normalizes their distributions, leading to a more reliable and interpretable model. The log-log model is generally preferred when variables follow a power-law relationship or exhibit exponential growth patterns.
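As a worked example of the elasticity interpretation, using the fitted slope of about 1.67 from the summary above, a 10% increase in carat weight corresponds to a predicted price increase of roughly \(1.10^{1.67} - 1 \approx 17\%\).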

In conclusion, the log-log model is much more suitable for the data than the original model.

2.5) Model Improvements

  • Adding clarity to the model increases \(R^2_{adj}\) from 0.9285 to 0.9663. This shows that clarity is a significant predictor of the price of a diamond.

  • The addition of depth and table variables keeps \(R^2_{adj}\) roughly the same at .931. This may indicate that the variables of depth and table are not significant to the model and do not have a strong relationship to the diamond price.

  • Adding \(y\) (width), \(x\) (length), and \(z\) (height) also has little effect, leaving \(R^2_{adj}\) at about 0.931. This suggests they add little beyond what carat already captures.

  • Adding color and cut slightly increases the \(R^2_{adj}\) to 0.946. This suggests that the color and cut could each have a relationship to price.

Since carat already accounts for the majority of the variation in the SLR, it will be difficult to judge candidate models by adding predictors one at a time; further analysis will be conducted.

2.6) Checking for Multicollinearity

Variance Inflation Factor (VIF) indicates how much multicollinearity (correlation between predictor variables) exists in the regression model. Generally:

  • VIF > 5: moderate multicollinearity
  • VIF > 10: high multicollinearity (problematic)
  • VIF > 100: severe multicollinearity (critical)

For this model, we will look at which factors have VIF scores well over 10.
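For a numeric predictor, the VIF is computed from the \(R^2\) obtained by regressing that predictor on all of the other predictors:

\[ \text{VIF}_j = \frac{1}{1 - R_j^2} \]

so, for example, \(R_j^2 = 0.9\) gives a VIF of 10.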

model2 <- lm(log(price) ~ log(carat)+ log(depth) + log(x) + log(y) + log(z) + log(table) +  cut + color + clarity, data = diamond_sample)

faraway::vif(model2)
##   log(carat)   log(depth)       log(x)       log(y)       log(z)   log(table) 
##  1852.007622     9.993642  1192.799541  1141.867457   164.207292     2.327211 
##      cutGood     cutIdeal   cutPremium cutVery Good       colorE       colorF 
##     5.287374    13.929067    10.414607    10.300024     2.012240     2.110587 
##       colorG       colorH       colorI       colorJ    clarityIF   claritySI1 
##     2.241717     1.932025     1.692887     1.394214     6.575367    28.422746 
##   claritySI2   clarityVS1   clarityVS2  clarityVVS1  clarityVVS2 
##    24.452863    22.762664    29.272646    12.881169    15.352601

Severe Multicollinearity:

  • log(carat): VIF = 1852.01 (extremely high)
  • log(x): VIF = 1192.80 (extremely high)
  • log(y): VIF = 1141.87 (extremely high)
  • log(z): VIF = 164.21 (extremely high)

These values are critical and indicate severe multicollinearity, meaning that these variables are highly correlated with each other.

Moving forward, we will eliminate x, y, and z, since these are all measurements of the diamond's size, and see how this affects our VIF:

model3 <- lm(log(price) ~ log(carat)+ log(depth) + log(table) +  cut + color + clarity, data = diamond_sample)

faraway::vif(model3)
##   log(carat)   log(depth)   log(table)      cutGood     cutIdeal   cutPremium 
##     1.368182     1.409981     1.911568     5.052088    13.514338    10.315163 
## cutVery Good       colorE       colorF       colorG       colorH       colorI 
##     9.735159     2.006444     2.108512     2.235184     1.927388     1.686389 
##       colorJ    clarityIF   claritySI1   claritySI2   clarityVS1   clarityVS2 
##     1.389173     6.487150    28.236659    24.290196    22.562979    29.043764 
##  clarityVVS1  clarityVVS2 
##    12.721164    15.198413

Although several clarity and cut dummy variables still have VIF values above 10, the extreme multicollinearity among the size measurements has been eliminated, and the overall VIF picture looks much better.

Part 3

3.1) Best Model Using AIC and BIC

AIC

model0 = lm(log(price) ~ 1, diamond_sample)

step(model0,
     scope = ~ log(carat)+ log(depth) + log(x) + log(y) + log(z) + log(table) + cut + color + clarity,
     direction = "both", trace = 0)
## 
## Call:
## lm(formula = log(price) ~ log(y) + clarity + color + log(carat) + 
##     cut + log(x) + log(z), data = diamond_sample)
## 
## Coefficients:
##  (Intercept)        log(y)     clarityIF    claritySI1    claritySI2  
##      7.02698      -1.58665       1.25385       0.72747       0.54914  
##   clarityVS1    clarityVS2   clarityVVS1   clarityVVS2        colorE  
##      0.94020       0.86204       1.14822       1.09788      -0.05107  
##       colorF        colorG        colorH        colorI        colorJ  
##     -0.10327      -0.16114      -0.24297      -0.38725      -0.50952  
##   log(carat)       cutGood      cutIdeal    cutPremium  cutVery Good  
##      1.71658       0.04940       0.14102       0.09773       0.10599  
##       log(x)        log(z)  
##      1.65440       0.43654

BIC

model0 = lm(log(price) ~ 1, diamond_sample)

n = nrow(diamond_sample)

step(model0,
     scope = ~ log(carat)+ log(depth) + log(x) + log(y) + log(z) + log(table) + cut + color + clarity,
     direction = "both", trace = 0, k = log(n))
## 
## Call:
## lm(formula = log(price) ~ clarity + color + log(carat) + cut, 
##     data = diamond_sample)
## 
## Coefficients:
##  (Intercept)     clarityIF    claritySI1    claritySI2    clarityVS1  
##      7.78781       1.23503       0.71441       0.53620       0.92372  
##   clarityVS2   clarityVVS1   clarityVVS2        colorE        colorF  
##      0.84809       1.13458       1.08083      -0.04796      -0.10073  
##       colorG        colorH        colorI        colorJ    log(carat)  
##     -0.15676      -0.24035      -0.38235      -0.50926       1.88454  
##      cutGood      cutIdeal    cutPremium  cutVery Good  
##      0.02116       0.11759       0.08255       0.07653

Our AIC and BIC selections both include the following predictors:

  • log(carat)
  • clarity
  • color
  • cut

Where the two differ is that the AIC selection also retains the size measurements log(x), log(y), and log(z), which the more parsimonious BIC selection drops. This makes sense, as BIC penalizes models more heavily for the inclusion of additional predictors. We can check the \(R^2\) values of the two candidate models fit below to decide which to move forward with.

AIC <- lm(formula = log(price) ~ log(y) + clarity + color + log(carat) + 
    cut + log(x) + log(depth), data = diamond_sample)

BIC <- lm(formula = log(price) ~ clarity + color + log(carat) + cut + 
    log(x) + log(depth), data = diamond_sample)

summary(AIC)
## 
## Call:
## lm(formula = log(price) ~ log(y) + clarity + color + log(carat) + 
##     cut + log(x) + log(depth), data = diamond_sample)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49555 -0.08487  0.00238  0.08583  0.45565 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.61985    3.20538   1.441 0.149826    
## log(y)       -1.32534    0.67719  -1.957 0.050616 .  
## clarityIF     1.25571    0.05787  21.697  < 2e-16 ***
## claritySI1    0.72945    0.05335  13.672  < 2e-16 ***
## claritySI2    0.55157    0.05351  10.308  < 2e-16 ***
## clarityVS1    0.94221    0.05400  17.448  < 2e-16 ***
## clarityVS2    0.86401    0.05353  16.142  < 2e-16 ***
## clarityVVS1   1.15294    0.05535  20.831  < 2e-16 ***
## clarityVVS2   1.10004    0.05482  20.065  < 2e-16 ***
## colorE       -0.05129    0.01558  -3.293 0.001027 ** 
## colorF       -0.10357    0.01527  -6.780 2.07e-11 ***
## colorG       -0.16020    0.01508 -10.623  < 2e-16 ***
## colorH       -0.24354    0.01635 -14.899  < 2e-16 ***
## colorI       -0.38775    0.01834 -21.140  < 2e-16 ***
## colorJ       -0.50990    0.02250 -22.662  < 2e-16 ***
## log(carat)    1.69127    0.27463   6.158 1.07e-09 ***
## cutGood       0.04592    0.03234   1.420 0.155962    
## cutIdeal      0.13734    0.03103   4.426 1.07e-05 ***
## cutPremium    0.09465    0.03083   3.069 0.002203 ** 
## cutVery Good  0.10340    0.03122   3.312 0.000962 ***
## log(x)        1.90694    0.70212   2.716 0.006725 ** 
## log(depth)    0.49826    0.42773   1.165 0.244352    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1313 on 978 degrees of freedom
## Multiple R-squared:  0.9838, Adjusted R-squared:  0.9835 
## F-statistic:  2830 on 21 and 978 DF,  p-value: < 2.2e-16
summary(BIC)
## 
## Call:
## lm(formula = log(price) ~ clarity + color + log(carat) + cut + 
##     log(x) + log(depth), data = diamond_sample)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50222 -0.08587  0.00000  0.08693  0.47092 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.99987    2.62166   0.381  0.70300    
## clarityIF     1.24124    0.05748  21.593  < 2e-16 ***
## claritySI1    0.72010    0.05322  13.532  < 2e-16 ***
## claritySI2    0.54177    0.05335  10.155  < 2e-16 ***
## clarityVS1    0.93094    0.05377  17.313  < 2e-16 ***
## clarityVS2    0.85343    0.05333  16.003  < 2e-16 ***
## clarityVVS1   1.14076    0.05508  20.712  < 2e-16 ***
## clarityVVS2   1.08884    0.05460  19.941  < 2e-16 ***
## colorE       -0.05107    0.01560  -3.274  0.00110 ** 
## colorF       -0.10320    0.01530  -6.747 2.58e-11 ***
## colorG       -0.16033    0.01510 -10.617  < 2e-16 ***
## colorH       -0.24390    0.01637 -14.901  < 2e-16 ***
## colorI       -0.38581    0.01834 -21.035  < 2e-16 ***
## colorJ       -0.51242    0.02250 -22.778  < 2e-16 ***
## log(carat)    1.37876    0.22377   6.162 1.05e-09 ***
## cutGood       0.03395    0.03181   1.067  0.28609    
## cutIdeal      0.12302    0.03020   4.074 5.00e-05 ***
## cutPremium    0.09018    0.03079   2.929  0.00348 ** 
## cutVery Good  0.08846    0.03032   2.918  0.00361 ** 
## log(x)        1.52901    0.67602   2.262  0.02393 *  
## log(depth)    0.95463    0.35910   2.658  0.00798 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1315 on 979 degrees of freedom
## Multiple R-squared:  0.9837, Adjusted R-squared:  0.9834 
## F-statistic:  2962 on 20 and 979 DF,  p-value: < 2.2e-16

Both produce a very high \(R^2\) (0.9838 and 0.9837), and I will use the BIC model as my final model since it uses fewer predictors.
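As a cross-check (a sketch, not part of the original analysis), the two candidate models' information criteria can be compared directly. The stats:: prefix is written out because the fitted model objects above reuse the names AIC and BIC.

# Compare the information criteria of the two candidate models directly.
stats::AIC(AIC, BIC)
stats::BIC(AIC, BIC)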

3.2) CIs and PI for the Final Model

final_model = lm(log(price) ~ clarity + color + log(carat) + cut +  log(x) + log(depth), data = diamond_sample)

# Define a new data point for prediction (hypothetical values)
new_data <- data.frame(
  clarity = "SI1",   # Example clarity level
  color = "E",       # Example color grade
  cut = "Premium",   # Example cut quality
  carat = 0.75,      # Carat weight on the original scale (the model applies log())
  x = 5.5,           # Length in mm on the original scale
  depth = 61         # Depth percentage on the original scale
)

# Predict with confidence and prediction intervals
predictions <- predict(final_model, newdata = new_data, interval = "prediction", level = 0.95)


confidence <- predict(final_model, newdata = new_data, interval = "confidence", level = 0.95)


predictions
##        fit      lwr      upr
## 1 7.893379 7.619736 8.167023
confidence
##        fit      lwr     upr
## 1 7.893379 7.802518 7.98424

The CI gives the range for the mean log-price of diamonds with this specific combination of characteristics; it reflects the precision of the model's estimate of that mean. We are 95% confident that the mean log-price lies within this range.

The PI gives the range in which an individual diamond's log-price is likely to fall. It is wider than the CI because it accounts for both the uncertainty in the mean estimate and the variability of individual diamonds around that mean. We are 95% confident that the log-price of an individual diamond with these characteristics lies within this range.
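Because the model predicts log(price), the intervals above are on the log scale. Exponentiating the endpoints (a quick sketch, not part of the original code) converts them to dollars; note that exponentiating the confidence interval for the mean log-price gives an interval for the median price rather than the mean.

# Convert the log-scale intervals back to dollars.
exp(predictions)   # prediction interval for an individual diamond's price
exp(confidence)    # interval for the typical (median) price at these settings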

3.3) Final Interpretation

The final model we developed to predict diamond prices incorporates a carefully selected combination of predictors: clarity, color, cut, log(carat), log(x), and log(depth). These variables were chosen through a thorough process of exploratory data analysis, model testing, and diagnostic evaluation, ensuring that the final model is both accurate and robust. The primary criterion for selecting these variables was the adjusted R-squared value, which reflects the model’s ability to explain the maximum variability in price while penalizing for the inclusion of non-informative predictors. By achieving a high adjusted R-squared, we ensure that the model captures essential relationships without overfitting, making it generalizable to new data.

During the model development process, we examined a wide range of potential predictors, including carat, x (length), y (width), z (height), depth, table, and various categorical variables like clarity, color, and cut. However, multicollinearity posed a significant challenge, especially among carat, x, y, and z. To address this, we performed Variance Inflation Factor (VIF) analysis and dropped highly correlated variables, retaining log(carat) as the primary size predictor. Log transformations were applied to continuous variables like price, carat, x, and depth to normalize their distributions and establish linear relationships with the outcome variable. These transformations not only improved model accuracy but also enhanced interpretability by allowing the coefficients to represent percentage changes.

Additionally, we validated the model by calculating Confidence Intervals (CI) and Prediction Intervals (PI) to quantify the uncertainty in the predictions. The CI provides the range for the mean predicted price, while the PI gives the range for individual predictions, both at a 95% confidence level. Diagnostic tests confirmed the model’s assumptions of linearity, homoscedasticity, normality, and independence, reinforcing the model’s validity. By prioritizing variable selection, addressing multicollinearity, and performing thorough diagnostic checks, we ensured that our final model not only achieves high predictive accuracy but also maintains reliability and practical utility in real-world applications.