Abstract

This project applies regression analysis to the Kaggle Diamonds dataset to explore the factors influencing diamond prices. Starting with descriptive statistics and visualizations, I examine relationships among variables such as carat, cut, color, and clarity. Simple and multiple linear regression models are then developed to assess predictive power, followed by transformations, assumption testing, and model selection techniques including AIC and VIF diagnostics. The final model highlights the most significant predictors of price and provides insights into how diamond attributes affect valuation. These findings demonstrate both the statistical methods used and their practical applications in real-world pricing problems.

Data Loading and Cleaning

Load Necessary Libraries

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(corrplot)

Load Diamonds dataset

# Replace file path with your appropriate file path
data <- read_csv("C:/Users/jaych/Downloads/PSTAT126 Final Proj Dataset/Diamonds Prices2022.csv")

head(data)

Data Cleaning/Filtering:

Upon review of the dataset, there doesn’t seem to be any need for cleaning or filtering. All rows are non-null for every column, so there is no missing data, and there are no egregious outliers or unreasonable values, so we can proceed with the dataset as is. The only changes we will make are renaming the ‘x’, ‘y’, and ‘z’ column headers to more descriptive names based on the information gathered from Kaggle, and dropping the first column, since it is simply an index column for the dataset.
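As a quick sanity check (a small addition to back up the claim above), we can count the missing values per column:

# Count NA values in each column; every entry should be 0
colSums(is.na(data))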

# Rename columns
colnames(data)[colnames(data) == "x"] <- "length (mm)"
colnames(data)[colnames(data) == "y"] <- "width (mm)"
colnames(data)[colnames(data) == "z"] <- "depth (mm)"

# Drop index column
data <- data[, -1]

head(data)

Data Description and Descriptive Statistics

Selecting a Random Sample

set.seed(126)

sample_size <- 1000
sample <- data %>% 
  sample_n(size = sample_size, replace = FALSE)

nrow(sample)
## [1] 1000
head(sample)

Variable Description

# Display summary statistics
summary(sample)
##      carat           cut               color             clarity         
##  Min.   :0.220   Length:1000        Length:1000        Length:1000       
##  1st Qu.:0.400   Class :character   Class :character   Class :character  
##  Median :0.710   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.817                                                           
##  3rd Qu.:1.040                                                           
##  Max.   :3.240                                                           
##      depth           table           price        length (mm)   
##  Min.   :55.80   Min.   :53.00   Min.   :  379   Min.   :3.900  
##  1st Qu.:61.10   1st Qu.:56.00   1st Qu.:  975   1st Qu.:4.737  
##  Median :61.80   Median :57.00   Median : 2596   Median :5.740  
##  Mean   :61.75   Mean   :57.52   Mean   : 4017   Mean   :5.774  
##  3rd Qu.:62.52   3rd Qu.:59.00   3rd Qu.: 5367   3rd Qu.:6.530  
##  Max.   :72.20   Max.   :68.00   Max.   :18781   Max.   :9.440  
##    width (mm)      depth (mm)   
##  Min.   :3.850   Min.   :2.380  
##  1st Qu.:4.750   1st Qu.:2.940  
##  Median :5.750   Median :3.560  
##  Mean   :5.777   Mean   :3.566  
##  3rd Qu.:6.510   3rd Qu.:4.030  
##  Max.   :9.400   Max.   :5.850
# Display dataset structure
str(sample)
## tibble [1,000 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat      : num [1:1000] 0.5 0.5 1.01 3 0.28 0.79 0.72 0.4 0.3 1.12 ...
##  $ cut        : chr [1:1000] "Fair" "Ideal" "Premium" "Good" ...
##  $ color      : chr [1:1000] "F" "E" "G" "I" ...
##  $ clarity    : chr [1:1000] "I1" "VS2" "SI1" "I1" ...
##  $ depth      : num [1:1000] 63.8 62.1 59.6 57 61.7 61.5 61.7 64 63 63.7 ...
##  $ table      : num [1:1000] 58 57 61 64 58 55 55 55 55 61 ...
##  $ price      : num [1:1000] 713 1629 4672 10863 487 ...
##  $ length (mm): num [1:1000] 5.08 5.1 6.52 9.38 4.17 5.98 5.76 4.67 4.29 6.54 ...
##  $ width (mm) : num [1:1000] 4.97 5.05 6.47 9.31 4.19 5.95 5.81 4.7 4.31 6.5 ...
##  $ depth (mm) : num [1:1000] 3.21 3.15 3.87 5.33 2.58 3.67 3.57 3 2.71 4.15 ...
# Histograms (for continuous variables)
continuous_vars <- c("carat", "depth", "table", "price", "length (mm)", "width (mm)", "depth (mm)")

histograms <- list()
for (var in continuous_vars) {
  histograms[[var]] <- ggplot(sample, aes(x = .data[[var]])) +
    geom_histogram(bins = 30, fill = "blue", color = "black") +
    labs(title = paste("Distribution of", var), x = var, y = "Frequency") +
    theme_minimal()
}

histograms
## (Histograms of carat, depth, table, price, length (mm), width (mm), and depth (mm) rendered here.)
# Bar Plots (for categorical variables)
categorical_vars <- c("cut", "color", "clarity")

bar_plots <- list()
for (var in categorical_vars) {
  bar_plots[[var]] <- ggplot(sample, aes(x = .data[[var]])) +
    geom_bar(fill = "blue", color = "black") +
    labs(title = paste("Bar Plot of", var), x = var, y = "Count") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

bar_plots
## (Bar plots of cut, color, and clarity rendered here.)

Analysis/Interpretation

  • Carat: For a simple description of this variable, a diamond’s carat is essentially its overall weight. From the summary statistics, we can see that the mean is 0.817, with values ranging from 0.22 to 3.24. This indicates that a majority of the diamonds in our sample are on the smaller side, with some being larger. This can be seen in the histogram, with the distribution being clearly right-skewed.

  • Depth: From the summary statistics, we can see that the mean is 61.75, with values ranging from 55.8 to 72.2 and the quartiles clustered tightly around the median. This suggests the values are concentrated around the mean, and the histogram confirms an approximately normal distribution.

  • Table: From the summary statistics, we can see that the mean is 57.52, with values ranging from 53 to 68. Upon looking at the histogram, we can confirm that it resembles an approximately normal distribution that is slightly right-skewed.

  • Price: From the summary statistics, we can see that the mean is 4017, with values ranging from 379 to 18,781. This indicates that there is a large spread of values, but that the majority of diamonds in our sample are moderately priced with a few that are very expensive. This can be confirmed by our histogram, which strongly resembles a right-skewed distribution.

  • Length (mm): From the summary statistics, we can see that the mean is 5.774, with values ranging from 3.9 to 9.44. Looking at the histogram, there isn’t a single clear distributional shape, but this variable is closely tied to carat, since a diamond’s dimensions (length, width, depth) largely determine its carat (weight). Accordingly, we see a very slight right skew in the distribution.

  • Width (mm): From the summary statistics, we can see that the mean is 5.777, with values ranging from 3.85 to 9.4. These values are very similar to those of ‘length (mm)’. In turn, the histogram and its distribution are also very similar to those of the ‘length (mm)’ variable.

  • Depth (mm): From the summary statistics, we can see that the mean is 3.566, with values ranging from 2.38 to 5.85. The histogram and its distribution are also similar to those of the length and width variables.

  • Cut: This variable represents the quality grades of the diamonds, ranging from ‘Fair’ to ‘Ideal’. From our bar plot, we can see that the ‘Ideal’ category is much more prominent than the others, indicating that certain cut qualities dominate the dataset.

  • Color: This variable represents the “color” grade of a diamond, ranging from ‘D’ (best) to ‘J’ (worst). The distribution in the bar plot is fairly evenly spread across the grades.

  • Clarity: This variable represents the different clarity grades of a diamond, which range from worst to best: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF. From the bar plot, we can see that most of the diamonds in our sample fall within the middle of the clarity scale, with very few being either really “poor” (I1) or really “good” (IF, VVS).

Variable Correlation

numeric_vars <- sample[, sapply(sample, is.numeric)]

# Correlation matrix
cor_matrix <- cor(numeric_vars)

print(round(cor_matrix, 2))
##             carat depth table price length (mm) width (mm) depth (mm)
## carat        1.00  0.02  0.21  0.93        0.98       0.97       0.97
## depth        0.02  1.00 -0.25 -0.02       -0.03      -0.03       0.09
## table        0.21 -0.25  1.00  0.14        0.22       0.21       0.18
## price        0.93 -0.02  0.14  1.00        0.90       0.90       0.89
## length (mm)  0.98 -0.03  0.22  0.90        1.00       1.00       0.99
## width (mm)   0.97 -0.03  0.21  0.90        1.00       1.00       0.99
## depth (mm)   0.97  0.09  0.18  0.89        0.99       0.99       1.00
# Correlation heatmap
corrplot(cor_matrix, method = "color", addCoef.col = "black", number.cex = 0.7,
         tl.cex = 0.8, tl.col = "black", title = "Correlation Matrix Heatmap", mar = c(0,0,1,0))

To determine if there is any correlation between our variables, we have utilized a correlation matrix along with a correlation heatmap for a better visual understanding. As we can see from the matrix and heatmap, there are obvious correlations between certain variables. Specifically, there is a strong positive correlation among carat, price, length (mm), width (mm), and depth (mm), with all of them having a correlation coefficient greater than 0.9 with each other. This makes clear sense: it is well-known that a diamond’s price is largely dependent on its carat, and a diamond’s dimensions (length, width, and depth) often determine its carat (weight), and in turn, its price. So, these variables all correlate with each other. This also corroborates the assumptions made in our previous observations of the variables’ distributions.
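As an optional visual check (a sketch, not part of the original output), the most strongly correlated pairs can be plotted against each other:

# Scatterplot matrix of the most strongly correlated variables
pairs(numeric_vars[, c("carat", "price", "length (mm)", "width (mm)")])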

Multiple Linear Regression Model

sample <- sample %>%
  mutate_if(is.character, as.factor)

# Multiple linear regression with all variables
## Using 'price' as the response variable and all others as predictors
mult_model <- lm(price ~ ., data = sample)

summary(mult_model)
## 
## Call:
## lm(formula = price ~ ., data = sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9627.3  -599.3  -159.4   396.4  6384.8 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3542.06    7820.27  -0.453  0.65070    
## carat          9892.58     365.28  27.082  < 2e-16 ***
## cutGood         159.83     255.38   0.626  0.53155    
## cutIdeal        472.60     248.86   1.899  0.05786 .  
## cutPremium      343.73     238.88   1.439  0.15049    
## cutVery Good    416.80     244.33   1.706  0.08834 .  
## colorE         -240.22     132.46  -1.814  0.07005 .  
## colorF          -16.35     135.23  -0.121  0.90381    
## colorG         -428.59     136.66  -3.136  0.00176 ** 
## colorH         -962.69     137.93  -6.979 5.47e-12 ***
## colorI        -1165.65     159.71  -7.298 6.03e-13 ***
## colorJ        -2328.68     214.98 -10.832  < 2e-16 ***
## clarityIF      5631.88     359.25  15.677  < 2e-16 ***
## claritySI1     4131.68     290.08  14.243  < 2e-16 ***
## claritySI2     3170.09     290.39  10.917  < 2e-16 ***
## clarityVS1     4935.34     299.70  16.467  < 2e-16 ***
## clarityVS2     4617.70     292.13  15.807  < 2e-16 ***
## clarityVVS1    5145.43     334.79  15.369  < 2e-16 ***
## clarityVVS2    5089.68     312.59  16.282  < 2e-16 ***
## depth            10.73     121.63   0.088  0.92974    
## table           -35.51      22.57  -1.574  0.11593    
## `length (mm)`   415.21     985.28   0.421  0.67355    
## `width (mm)`   -598.94    1022.29  -0.586  0.55809    
## `depth (mm)`   -627.55    1936.77  -0.324  0.74599    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1153 on 976 degrees of freedom
## Multiple R-squared:  0.9192, Adjusted R-squared:  0.9173 
## F-statistic:   483 on 23 and 976 DF,  p-value: < 2.2e-16

Insights

All in all, the results were approximately what we expected. Carat had a very strong, positive relationship with price, with a large and highly significant coefficient, confirming that diamond size is a primary indicator of its price. Additionally, the categorical variables of cut, color, and clarity showed notable effects on the price, especially clarity, which we expected.

However, one thing that surprised us was that some levels of these categorical variables, such as the “J” color, had significant negative coefficients, with the “E” color being negative as well. This is less surprising once we recall that the baseline (reference) level is “D”, the best color grade in the dataset, so every other color level is measured relative to it.

Furthermore, we found that some variables were less impactful than expected, such as ‘depth’, ‘table’, and the dimension variables (length, width, depth). These had small or statistically insignificant coefficients, with the coefficients of ‘width (mm)’ and ‘depth (mm)’ even being negative, suggesting they may not be as important for price as we thought.

Modeling, Interpretations, and Testing

Simple Linear Regression Model

simple_model <- lm(price ~ carat, data=sample)
summary(simple_model)
## 
## Call:
## lm(formula = price ~ carat, data = sample)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10414.2   -779.7    -44.4    489.8   7448.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2148.61      93.00  -23.10   <2e-16 ***
## carat        7546.17      97.53   77.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1516 on 998 degrees of freedom
## Multiple R-squared:  0.8571, Adjusted R-squared:  0.857 
## F-statistic:  5987 on 1 and 998 DF,  p-value: < 2.2e-16

Running the Model, Summary Statistics, and Interpretations

Intercept Term

The estimated value for \(B_0\) indicates that when the carat value is 0, the diamond’s predicted price is $-2148.61. This interpretation is obviously not meaningful in the real world, but the intercept is still important to our model.

The standard error indicates that \(\hat{B_0}\) deviates by 93 units from the true \(B_0\) on average.

The t-value signifies that \(\hat{B_0}\) is 23.10 standard errors below 0.

The p-value here is representative of the result of our partial significance test, which examines if we can drop the intercept term from our model or not. The extremely small p-value of below \(2 * 10^{-16}\) causes us to reject our null hypothesis and conclude that our intercept is significantly different from zero.

Carat (\(B_1\))

The estimated value for \(B_1\) indicates that for every 1 unit increase in carat, price is estimated to increase by $7546.17.

The standard error indicates that \(\hat{B_1}\) deviates by 97.53 units from the true \(B_1\) on average.

The t-value signifies that \(\hat{B_1}\) is 77.37 standard errors above 0.

The p-value here is representative of the result of our other partial significance test to see if we can drop the carat predictor term from our model or not. The extremely small p-value of below \(2 * 10^{-16}\) causes us to reject our null hypothesis and conclude that carat is a meaningful predictor for price.

Residual Standard Error

The residual standard error tells us that the actual data differ from their model-predicted values by an average of 1516 units.

\(R^2\) Values

The adjusted \(R^2\) value signifies that about 85.7% of the variation in price is explained by carat. It is nearly identical to the multiple \(R^2\) value because our model contains only one predictor, so the adjustment penalty for additional predictors is negligible, as shown below.
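For reference, the adjusted value is computed from the multiple \(R^2\) as

\[R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}\]

so with \(n = 1000\) observations and only \(p = 1\) predictor, the penalty factor \((n-1)/(n-p-1) = 999/998\) is nearly 1, and the two values nearly coincide.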

Hypothesis Testing

Since this simple model only contains one predictor variable, the global significance test is the same as the partial significance test for \(B_1\), and produces an equally small p-value of below \(2 * 10^{-16}\). This tells us that again, carat is a meaningful predictor variable for price.
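Formally, the test is

\[H_0: B_1 = 0 \quad \text{vs.} \quad H_1: B_1 \neq 0\]

and, as expected with a single predictor, the F-statistic of 5987 on 1 and 998 degrees of freedom is (up to rounding) the square of the t-value for carat: \(77.37^2 \approx 5986\).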

Confidence Interval of \(E[y|x_0]\)

# Choose a carat value to predict price based off of
# In this case I set carat = 1
new_data <- data.frame(carat=1)
predict(simple_model, newdata = new_data, interval = 'confidence', level=0.95)
##        fit     lwr      upr
## 1 5397.551 5297.14 5497.961

From the output, the predicted price for a 1 carat diamond is about $5397.55. We are 95% confident that the true average price of a 1 carat diamond lies in the range of (5297.14, 5497.96) dollars.

Prediction Interval of \(y|x_0\)

# Again, I selected a carat value of 1
predict(simple_model, newdata = new_data, interval = 'prediction', level=0.95)
##        fit      lwr      upr
## 1 5397.551 2420.028 8375.073

From the output, the predicted price for a 1 carat diamond is the same, but now we say that we are 95% confident that any given diamond with a carat value of 1 will have a price in the range of (2420.02, 8375.07) dollars.
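The prediction interval is wider because it accounts for the variability of an individual response around the regression line, not just the uncertainty in the estimated mean. For simple linear regression, the prediction interval takes the form

\[\hat{y}_0 \pm t_{0.975,\,n-2}\,\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}\]

while the confidence interval for the mean response omits the leading 1 under the square root.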

Confidence Intervals for \(B_0\) and \(B_1\)

confint(simple_model)
##                 2.5 %    97.5 %
## (Intercept) -2331.107 -1966.122
## carat        7354.781  7737.549

The confidence interval for \(\hat{B_0}\) tells us that we are 95% confident that the true \(B_0\) value lies in the interval (-2331.10, -1966.12).

The confidence interval for \(\hat{B_1}\) tells us that we are 95% confident that the true \(B_1\) value lies in the interval (7354.78, 7737.55).

Assumptions and Transformations

The assumptions for our model are as follows:

  1. Linearity: the mean response is linear in the predictors, meaning that the model is correct

  2. Homoscedasticity: errors and thus responses across the model all have the same variance

  3. Independence: the error terms are independent of each other across observations

  4. Normality: the errors are normally distributed with mean zero

  5. Rank(X) = p + 1: the columns of the design matrix X are linearly independent, meaning our model does not contain multicollinearity

plot(simple_model, which = 1)

In the residuals vs. fitted values plot above, the residual spread increases as the fitted values grow, meaning that homoscedasticity is likely violated. Additionally, the LOESS smoother line curves slightly at the beginning and rises as the fitted values grow, indicating that linearity is also likely violated. To fix these issues, we first transform the carat variable.

model_transx <- lm(price ~ log(carat), data=sample)
plot(model_transx, which=1)

We can see that the model is still in need of transformation, so we will now transform the price variable.

model_transxy <- lm(log(price) ~ log(carat), data=sample)
plot(model_transxy, which=1)

This model proves to be much more accurate for our data. The LOESS smoother line is almost completely flat, meaning that the linearity condition is met. Additionally, the spread of the residuals is somewhat even for all fitted values, meaning that the homoscedasticity condition is also met.

The independence condition can be assumed in this case because we took a random sample from the Diamonds dataset, which we know followed an apt sampling design.

plot(model_transxy, which=2)

The Q-Q plot of the residuals shows that the residual quantiles stick very closely to the reference line, meaning that our normality condition is also met.
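As an optional, more formal supplement to the visual check (not part of the original diagnostics), a Shapiro-Wilk test can be run on the residuals; with n = 1000, such tests are sensitive to tiny departures from normality, so the Q-Q plot remains the primary evidence:

# Optional formal normality check on the residuals
shapiro.test(residuals(model_transxy))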

Finally, we also know that the rank of X is p + 1 = 2 in this case because our model has only one predictor variable, eliminating the possibility of multicollinearity.

Interpreting Transformed Variable

summary(model_transxy)
## 
## Call:
## lm(formula = log(price) ~ log(carat), data = sample)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.04761 -0.15498 -0.00151  0.16256  1.04205 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.438674   0.009634   875.9   <2e-16 ***
## log(carat)  1.671156   0.013858   120.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2571 on 998 degrees of freedom
## Multiple R-squared:  0.9358, Adjusted R-squared:  0.9357 
## F-statistic: 1.454e+04 on 1 and 998 DF,  p-value: < 2.2e-16

The estimates and standard errors for \(B_0\) and \(B_1\) are on a much smaller scale now, as they are calculated in terms of log(carat) and log(price) rather than the raw values. However, our p-values for both the partial and global significance tests remain extremely small, producing the same conclusions. Because both variables are logged, the slope is interpreted as an elasticity: a 1% increase in carat is associated with roughly a 1.67% increase in expected price. The residual standard error is also much smaller (though on the log scale, so not directly comparable). The \(R^2\) values are much greater than before, about 0.9357 after the transformation. This leads us to believe that the new model is a much better fit for our data, as it explains a much larger proportion of the variance.
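Note that predictions from this model are on the log-dollar scale. A rough sketch (the variable name new_log_data is illustrative, and the naive back-transform below estimates the median rather than the mean price) of recovering a dollar-scale estimate:

# Predict log(price) for a 1-carat diamond, then exponentiate back to dollars
# (naive back-transform; ignores retransformation bias)
new_log_data <- data.frame(carat = 1)
exp(predict(model_transxy, newdata = new_log_data, interval = "confidence", level = 0.95))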

Variable Addition and Model Assessment

Some other variables besides carat that generated low p-values within our original multiple linear regression model were color and clarity. We want to test if adding these values to the model improves the adjusted \(R^2\) value.

  • Original simple model (price ~ carat): adjusted \(R^2\) = 0.857

  • Model 1 (price ~ carat + color): adjusted \(R^2\) = 0.8685

  • Model 2 (price ~ carat + clarity): adjusted \(R^2\) = 0.8991

  • Full model (price ~ carat + color + clarity): adjusted \(R^2\) = 0.9155
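These adjusted \(R^2\) values can be reproduced by fitting each candidate model directly (a quick sketch; the list name models is illustrative):

# Fit each candidate model and extract its adjusted R-squared
models <- list(
  simple = lm(price ~ carat, data = sample),
  model1 = lm(price ~ carat + color, data = sample),
  model2 = lm(price ~ carat + clarity, data = sample),
  both   = lm(price ~ carat + color + clarity, data = sample)
)
sapply(models, function(m) summary(m)$adj.r.squared)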

As shown above, adding color and clarity as predictors increases the adjusted \(R^2\) value significantly. Including both of these variables along with carat produces an adjusted \(R^2\) of 0.9155, which is almost identical to the 0.9173 value produced by our full model with all predictors. This tells us that the remaining variables in the dataset have little additional impact on predicting price, and that carat, color, and clarity alone do a very good job of predicting price while also reducing multicollinearity thanks to the smaller number of predictor variables.

Determining the Best Model

Backward Elimination using AIC

step(mult_model, direction = "backward")
## Start:  AIC=14123.7
## price ~ carat + cut + color + clarity + depth + table + `length (mm)` + 
##     `width (mm)` + `depth (mm)`
## 
##                 Df Sum of Sq        RSS   AIC
## - depth          1     10338 1297183281 14122
## - `depth (mm)`   1    139536 1297312480 14122
## - `length (mm)`  1    236028 1297408972 14122
## - `width (mm)`   1    456215 1297629159 14122
## - cut            4   8497571 1305670514 14122
## <none>                       1297172943 14124
## - table          1   3290683 1300463627 14124
## - color          6 269201811 1566374754 14300
## - clarity        7 644847770 1942020714 14513
## - carat          1 974813569 2271986513 14682
## 
## Step:  AIC=14121.71
## price ~ carat + cut + color + clarity + table + `length (mm)` + 
##     `width (mm)` + `depth (mm)`
## 
##                 Df Sum of Sq        RSS   AIC
## - `length (mm)`  1    258950 1297442231 14120
## - cut            4   8619177 1305802458 14120
## - `width (mm)`   1    886745 1298070026 14120
## - `depth (mm)`   1   1227653 1298410934 14121
## <none>                       1297183281 14122
## - table          1   3390738 1300574020 14122
## - color          6 269823215 1567006496 14299
## - clarity        7 644843011 1942026293 14511
## - carat          1 975134720 2272318001 14680
## 
## Step:  AIC=14119.91
## price ~ carat + cut + color + clarity + table + `width (mm)` + 
##     `depth (mm)`
## 
##                Df Sum of Sq        RSS   AIC
## - cut           4   8481035 1305923266 14118
## - `depth (mm)`  1   1028192 1298470424 14119
## - `width (mm)`  1   1676169 1299118400 14119
## <none>                      1297442231 14120
## - table         1   3181917 1300624148 14120
## - color         6 271073269 1568515501 14298
## - clarity       7 646121733 1943563964 14510
## - carat         1 994830589 2292272820 14687
## 
## Step:  AIC=14118.42
## price ~ carat + color + clarity + table + `width (mm)` + `depth (mm)`
## 
##                Df  Sum of Sq        RSS   AIC
## - `width (mm)`  1     142188 1306065454 14116
## <none>                       1305923266 14118
## - `depth (mm)`  1    4570302 1310493567 14120
## - table         1   13864545 1319787811 14127
## - color         6  275221246 1581144512 14298
## - clarity       7  689255417 1995178683 14528
## - carat         1 1023181795 2329105060 14695
## 
## Step:  AIC=14116.53
## price ~ carat + color + clarity + table + `depth (mm)`
## 
##                Df  Sum of Sq        RSS   AIC
## <none>                       1306065454 14116
## - table         1   15209310 1321274764 14126
## - `depth (mm)`  1   19503293 1325568746 14129
## - color         6  276666790 1582732244 14297
## - clarity       7  700823545 2006888998 14532
## - carat         1 1120666261 2426731715 14734
## 
## Call:
## lm(formula = price ~ carat + color + clarity + table + `depth (mm)`, 
##     data = sample)
## 
## Coefficients:
##  (Intercept)         carat        colorE        colorF        colorG  
##     -1281.63       9895.85       -242.24        -23.25       -434.26  
##       colorH        colorI        colorJ     clarityIF    claritySI1  
##      -953.40      -1172.77      -2364.20       5732.30       4213.17  
##   claritySI2    clarityVS1    clarityVS2   clarityVVS1   clarityVVS2  
##      3264.18       5034.69       4714.34       5254.44       5212.02  
##        table  `depth (mm)`  
##       -58.60       -919.01

Above, I used backward elimination with AIC in order to find the best model for our dataset. The lowest overall AIC was achieved using carat, color, clarity, table, and depth (mm) as predictors for price. Therefore, the ideal model for predicting price is:

\[\begin{aligned} \widehat{\text{price}} = {} & -1281.63 + 9895.85\,\text{carat} - 242.24\,\text{colorE} - 23.25\,\text{colorF} - 434.26\,\text{colorG} \\ & - 953.40\,\text{colorH} - 1172.77\,\text{colorI} - 2364.20\,\text{colorJ} + 5732.30\,\text{clarityIF} \\ & + 4213.17\,\text{claritySI1} + 3264.18\,\text{claritySI2} + 5034.69\,\text{clarityVS1} + 4714.34\,\text{clarityVS2} \\ & + 5254.44\,\text{clarityVVS1} + 5212.02\,\text{clarityVVS2} - 58.60\,\text{table} - 919.01\,\text{depth(mm)} \end{aligned}\]

As color and clarity are both categorical variables, this model creates k - 1 dummy variables for the k levels of each of these predictors. Although this model produces an essentially equal adjusted \(R^2\) value to our full model with all variables, it uses fewer predictor variables, therefore introducing less multicollinearity.
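To confirm the selected model’s fit without reprinting the full elimination trace, the search can be re-run silently (a quick sketch; the name best_model is illustrative):

# Re-run backward elimination quietly and inspect the selected model's fit
best_model <- step(mult_model, direction = "backward", trace = 0)
summary(best_model)$adj.r.squared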

Multicollinearity using VIF

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
reduced_model <- lm(formula = price ~ carat + color + clarity + table + `depth (mm)`, 
    data = sample)
vif(reduced_model)
##                   GVIF Df GVIF^(1/(2*Df))
## carat        21.126787  1        4.596388
## color         1.287147  6        1.021259
## clarity       1.402832  7        1.024473
## table         1.077686  1        1.038116
## `depth (mm)` 21.148378  1        4.598737

The VIF values for carat and depth (mm) are both quite large and exceed the cut-off value of 5, meaning they contribute a large amount of multicollinearity to our model. We need to consider how essential they are to the model in order to decide whether they should be kept in the final reduced model.
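One way to weigh this trade-off (an exploratory sketch, not part of the original selection procedure) is to refit without depth (mm) and see how much the fit and the remaining VIF values change:

# Drop `depth (mm)`, then compare multicollinearity and fit
no_depth_model <- lm(price ~ carat + color + clarity + table, data = sample)
vif(no_depth_model)
summary(no_depth_model)$adj.r.squared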

Confidence and Prediction Intervals

Confidence Interval for Mean Predicted Price

library(tibble)
# Choose X values for the 5 predictors in our model
# In this case I set carat = 1, color = F, clarity = VS2, table = 58 and depth (mm) = 3.21
new_data <- tibble(carat = 1, color = "F", clarity = "VS2", table=58, `depth (mm)` = 3.21)
predict(reduced_model, newdata = new_data, interval = 'confidence', level=0.95)
##        fit      lwr      upr
## 1 6956.522 6594.666 7318.379

From the output, the predicted price for a diamond with 1 carat, a color of F, a VS2 clarity, a table of 58, and a depth of 3.21 mm is about $6956.52. We are 95% confident that the true average price of this diamond lies in the range of (6594.67, 7318.38) dollars.

Prediction Interval for Future Predicted Values

# I used the same X values for the 5 predictors in our model
predict(reduced_model, newdata = new_data, interval = 'prediction', level=0.95)
##        fit      lwr      upr
## 1 6956.522 4665.781 9247.264

From the output, the predicted price for a diamond with 1 carat, a color of F, a VS2 clarity, a table of 58, and a depth of 3.21 mm is the same, but now we say that we are 95% confident that any given diamond with these input values will have a price in the range of (4665.78, 9247.26) dollars.

Conclusion

Overall, when first checking for correlation between the numeric variables in our dataset, we saw that carat, price, length (mm), width (mm), and depth (mm) were all highly positively correlated with each other. This was the first indication of multicollinearity in the model, which should be mitigated by removing as many redundant predictor variables as possible while still accurately estimating our response variable, price.

A linear regression model that included all predictor variables produced an adjusted R-squared value of 0.9173, with carat, color and clarity appearing to produce the lowest p-values for the partial significance tests. Each of these p-values told us that these variables were likely significant predictors for price.

When utilizing a simple linear model with only carat as a predictor for price, we found that carat is a statistically significant predictor for price, as the model produced a p-value of less than \(2.2 * 10^{-16}\) and an adjusted R-squared value of 0.857. Adding the predictor variables of color and clarity to this model increased the adjusted R-squared value to 0.9155, almost the same as the value from our model using all predictors. However, this model used far fewer variables, therefore reducing multicollinearity.

Finally, after using backward elimination with AIC, we found that the ideal model to predict price included carat, color, clarity, table, and depth (mm) as predictor variables. However, the VIF values for carat and depth (mm) were very large, raising concerns about multicollinearity. These two variables were among the most meaningful predictors of price, so even though they introduce some multicollinearity, they are vital to the model and cannot be excluded. Additionally, this reduced model produces an adjusted R-squared value of 0.9174, which actually exceeds the adjusted R-squared value from the full model. This model explains variance slightly better than the full model while using fewer predictors, making it the best model.