RQ: “How do the carat weight and cut quality of a diamond influence its price?”

# Import the dataset
diamonds <- read.csv("/Users/markodjordjevic/Desktop/IMB/R data/MVA HW 3/MVA Homework 3/diamonds.csv")
mydata <- diamonds[, c(8, 2, 3)]  # price, carat, cut

head(mydata)
##   price carat   cut
## 1   326  0.23 Ideal
## 2   326  0.21 Ideal
## 3   327  0.23  Good
## 4   334  0.29 Ideal
## 5   335  0.31  Good
## 6   336  0.24  <NA>

Description of the dataset

The unit of observation: An individual diamond.

The sample size: 53,490.

Explanation of variables

  • carat: Weight of the diamond.

  • cut: Quality of the cut (Fair, Good, Very Good, Premium, Ideal).

  • price: Price in US dollars ($326–$18,823).

Source of the dataset

The data are a version of the well-known diamonds dataset distributed with the ggplot2 R package, read here from a CSV file.

Data manipulation

# Remove all units with missing values
library(tidyr)
mydata <- drop_na(mydata)
# Set the random seed and draw a simple random sample of 100 units
library(dplyr)
set.seed(1)
mydata <- sample_n(mydata, 100)


# Add an ID so that removed units can be traced later
mydata$ID <- seq_len(nrow(mydata))
head(mydata,10)
##    price carat   cut ID
## 1    776  0.30 Ideal  1
## 2   3804  0.74 Ideal  2
## 3   6176  1.01  Good  3
## 4    460  0.35 Ideal  4
## 5   1257  0.33 Ideal  5
## 6   7528  1.01 Ideal  6
## 7   4914  0.90  Good  7
## 8    816  0.34 Ideal  8
## 9    805  0.34 Ideal  9
## 10   698  0.31 Ideal 10
# Convert cut to a factor and show some descriptive statistics
mydata$cut <- as.factor(mydata$cut)

summary(mydata[, -4])  # exclude the ID column
##      price           carat           cut    
##  Min.   :  435   Min.   :0.2400   Good :21  
##  1st Qu.: 1002   1st Qu.:0.4075   Ideal:79  
##  Median : 2258   Median :0.7150             
##  Mean   : 3614   Mean   :0.7783             
##  3rd Qu.: 5616   3rd Qu.:1.0200             
##  Max.   :16384   Max.   :2.0100

Defining the multiple regression model

The price of a diamond (the dependent variable Y) is expected to be significantly influenced by its carat weight (independent variable X1) and the quality of its cut (independent variable X2). As the size of a diamond increases, so does its weight, measured in carats; larger diamonds, being rarer and more desirable, command higher prices. At the same time, the quality of a diamond’s cut is pivotal in its pricing: a superior cut enhances the diamond’s brilliance and sparkle (key factors in its aesthetic appearance), making such diamonds more sought after. Consequently, consumers are often willing to pay more for diamonds with high-quality cuts. Thus, both carat weight and cut quality play crucial roles in determining a diamond’s price.

The multiple regression model can be expressed as:

Price = β0 + β1·Carat + β2·Cut + ε

Since cut is categorical, it enters the model through dummy coding; in our sample only the levels Good and Ideal occur, so the model estimates a single dummy coefficient (cutIdeal, with Good as the reference level).

Assumptions:

  • Linearity: there is a linear relationship between the independent variables and the dependent variable.

  • Independence: the residuals (errors) are independent; in particular, there is no correlation between consecutive residuals (mainly a concern with time-series data).

  • Homoscedasticity: the residuals have constant variance at every level of the independent variables (checked with the Breusch-Pagan test).

  • Normality: the residuals of the model are normally distributed (checked with the Shapiro-Wilk test).

  • No multicollinearity: the independent variables are not highly correlated with each other (checked with the VIF).

The three formal checks (VIF, Breusch-Pagan, Shapiro-Wilk) are collected in the sketch below.
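A minimal sketch (not part of the original analysis) that bundles these checks into one helper; it assumes a fitted lm model such as the ones estimated below.

# Hypothetical helper bundling the diagnostic tests used in this report
library(car)    # vif()
library(olsrr)  # ols_test_breusch_pagan()

check_assumptions <- function(fit) {
  list(
    multicollinearity = vif(fit),                     # VIF
    homoscedasticity  = ols_test_breusch_pagan(fit),  # Breusch-Pagan test
    normality         = shapiro.test(rstandard(fit))  # Shapiro-Wilk test
  )
}
# Usage (hypothetical): check_assumptions(lm(price ~ carat + cut, data = mydata))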

Graphical representation of the relationships

# Scatterplot matrix for price and carat
library(car)
scatterplotMatrix(mydata[, c(-3, -4)],
                  smooth = FALSE)

The scatter plot reveals a positive correlation between a diamond’s carat weight and its selling price: as the carat weight increases, so does the price. The cut variable, however, is categorical, so a scatter plot is unsuitable for visualizing its relationship with price. Based on this scatter plot, the relationship between carat and price appears to be linear.
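For the categorical variable cut, a box plot is the usual graphical alternative. A minimal sketch (not part of the original analysis):

# Compare the distribution of price across cut levels
boxplot(price ~ cut,
        data = mydata,
        xlab = "Cut",
        ylab = "Price (USD)",
        main = "Price by cut quality")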

library(Hmisc)
rcorr(as.matrix(mydata[, c("price", "carat")]))
##       price carat
## price  1.00  0.91
## carat  0.91  1.00
## 
## n= 100 
## 
## 
## P
##       price carat
## price        0   
## carat  0

H0: Rho(price, carat) = 0 (There is no correlation between price and carat).

H1: Rho(price, carat) ≠ 0 (There is a correlation between price and carat).

From the correlation analysis above, the correlation coefficient between price and carat is 0.91, and the reported p-value is 0 (i.e., below the displayed precision), which is less than 0.05. Therefore, we reject the null hypothesis (H0): there is a correlation between price and carat. In other words, the carat weight of a diamond appears to be related to its price.
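As a cross-check (not in the original analysis), base R’s cor.test() reports the exact p-value rather than the rounded 0 printed by rcorr():

# Pearson correlation test with an exact p-value
cor.test(mydata$price, mydata$carat)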

Regression model

fit1 <- lm(price ~ carat + cut,  
           data = mydata) 
# Check multicollinearity with the VIF
vif(fit1)
##    carat      cut 
## 1.043463 1.043463
mean(vif(fit1))
## [1] 1.043463

The VIF value is 1.04, which is close to 1, so there is no indication of a multicollinearity problem in the data, let alone severe multicollinearity.

# Check for outliers and high-influence units
reg <- lm(price ~ carat + cut, data = mydata)

mydata$StdFittedValues <- scale(fitted.values(reg))

mydata$StdResiduals <- round(rstandard(reg),3)

mydata$CooksD <- round(cooks.distance(reg),3)
hist(mydata$StdResiduals,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "pink")

head(mydata[order(-mydata$StdResiduals), ], 6)  # largest positive residuals
##     price carat   cut  ID StdFittedValues StdResiduals CooksD
## 72  16384  1.71 Ideal  72       2.3436176        4.721  0.580
## 21  14071  1.50 Ideal  21       1.8241025        4.036  0.282
## 17   9405  1.24  Good  17       0.9965970        2.271  0.097
## 68  11303  1.53  Good  68       1.7140226        2.139  0.114
## 100  8011  1.02 Ideal 100       0.6366394        1.963  0.023
## 6    7528  1.01 Ideal   6       0.6119006        1.645  0.016
head(mydata[order(mydata$StdResiduals), ], 10)  # largest negative residuals
##    price carat   cut ID StdFittedValues StdResiduals CooksD
## 29  5268  1.50  Good 29       1.6398062       -2.471  0.147
## 45  4098  1.23 Ideal 45       1.1561545       -2.236  0.048
## 23 10412  2.01 Ideal 23       3.0857820       -1.778  0.137
## 77  3521  1.02 Ideal 77       0.6366394       -1.502  0.014
## 38  2912  1.00  Good 38       0.4028654       -1.469  0.036
## 46  2241  0.83 Ideal 46       0.1666019       -1.430  0.009
## 99  5547  1.27 Ideal 99       1.2551097       -1.337  0.019
## 71  4465  1.09 Ideal 71       0.8098111       -1.164  0.009
## 56  7553  1.51 Ideal 56       1.8488413       -1.133  0.023
## 34  7577  1.51 Ideal 34       1.8488413       -1.114  0.022
mydata <- mydata %>%
  filter(!ID %in% c(21, 72))  # drop the two units with the largest standardized residuals
reg1 <- lm(price ~ carat + cut, data = mydata)

mydata$StdFittedValues <- scale(fitted.values(reg1))

mydata$StdResiduals <- round(rstandard(reg1),3)

mydata$CooksD <- round(cooks.distance(reg1),3)
head(mydata[order(-mydata$CooksD), ], 6)  # units with the highest Cook's distances
##    price carat   cut ID StdFittedValues StdResiduals CooksD
## 67 11303  1.53  Good 68        1.887208        3.220  0.268
## 28  5268  1.50  Good 29        1.810723       -2.823  0.199
## 17  9405  1.24  Good 17        1.147860        3.178  0.192
## 49 10424  1.56 Ideal 50        2.055641        1.856  0.077
## 22 10412  2.01 Ideal 23        3.202904       -1.219  0.074
## 44  4098  1.23 Ideal 45        1.214314       -2.410  0.063
mydata <- mydata %>%
  filter(!ID %in% c(68, 29, 17))  # drop the units with the highest Cook's distances
reg2 <- lm(price ~ carat + cut, data = mydata)

mydata$StdFittedValues <- scale(fitted.values(reg2))

mydata$StdResiduals <- round(rstandard(reg2),3)

mydata$CooksD <- round(cooks.distance(reg2),3)
mydata <- mydata %>%
  filter(ID != 100)
reg3 <- lm(price ~ carat + cut, data = mydata)

mydata$StdFittedValues <- scale(fitted.values(reg3))

mydata$StdResiduals <- round(rstandard(reg3),3)

mydata$CooksD <- round(cooks.distance(reg3),3)
mydata <- mydata %>%
  filter(!ID %in% c(6, 50, 61))
reg4 <- lm(price ~ carat + cut, data = mydata)

mydata$StdFittedValues <- scale(fitted.values(reg4))

mydata$StdResiduals <- round(rstandard(reg4),3)

mydata$CooksD <- round(cooks.distance(reg4),3)
# Check whether any high-influence units remain
hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distances")

mydata <- mydata %>%
  filter(!ID %in% c(3, 45, 38))
reg5 <- lm(price ~ carat + cut, data = mydata)

mydata$StdFittedValues <- scale(fitted.values(reg5))

mydata$StdResiduals <- round(rstandard(reg5),3)

mydata$CooksD <- round(cooks.distance(reg5),3)
head(mydata[order(-mydata$CooksD), ], 10)  # units with the highest Cook's distances
##    price carat   cut ID StdFittedValues StdResiduals CooksD
## 23  7119  1.03 Ideal 27       0.8858904        3.425  0.094
## 5   4914  0.90  Good  7       0.3461064        1.928  0.089
## 57  2140  0.82  Good 66       0.1326256       -1.835  0.080
## 19 10412  2.01 Ideal 23       3.5010305       -1.118  0.078
## 72  5251  1.01  Good 83       0.6396425        1.363  0.046
## 66  3521  1.02 Ideal 77       0.8592053       -2.281  0.041
## 84  6333  1.02 Ideal 95       0.8592053        2.255  0.040
## 88  5547  1.27 Ideal 99       1.5263329       -1.521  0.034
## 39  2241  0.83 Ideal 46       0.3521884       -2.442  0.031
## 54  6589  1.11 Ideal 63       1.0993713        1.776  0.031
mydata <- mydata %>%
  filter(!ID %in% c(7, 27, 23, 66))  # drop the four units with the highest Cook's distances
reg6 <- lm(price ~ carat + cut, data = mydata)

mydata$StdFittedValues <- scale(fitted.values(reg6))

mydata$StdResiduals <- round(rstandard(reg6),3)

mydata$CooksD <- round(cooks.distance(reg6),3)
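The remove-and-refit steps above repeat the same pattern each time. As a sketch (an illustration, not the author’s code), they could be wrapped in a helper:

# Hypothetical helper: drop a set of IDs, refit, and recompute the diagnostics
refit_without <- function(data, drop_ids) {
  data <- dplyr::filter(data, !ID %in% drop_ids)
  fit <- lm(price ~ carat + cut, data = data)
  data$StdFittedValues <- scale(fitted.values(fit))
  data$StdResiduals <- round(rstandard(fit), 3)
  data$CooksD <- round(cooks.distance(fit), 3)
  list(data = data, fit = fit)
}
# Usage (hypothetical): step1 <- refit_without(mydata, c(21, 72))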
shapiro.test(mydata$StdResiduals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResiduals
## W = 0.98491, p-value = 0.4341

H0: The standardized residuals are normally distributed.

H1: The standardized residuals are not normally distributed.

Since p = 0.4341 > 0.05, we fail to reject H0 and can assume that the standardized residuals are normally distributed.
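A Q-Q plot offers a complementary visual check of normality. A minimal sketch (not part of the original analysis):

# Q-Q plot of the standardized residuals against the normal distribution
qqnorm(mydata$StdResiduals, main = "Normal Q-Q plot of standardized residuals")
qqline(mydata$StdResiduals)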

Re-estimating the regression model

fit2 <- lm(price ~ carat + cut,  
           data = mydata)

mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2),3)

Now we draw the histograms of the standardized residuals and the Cook’s distances again.

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals",
     col= "purple",
     border = "black")

hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distances")

All Cook’s distance values are below 1. There is a gap in the distribution, but it is not substantial, so we do not need to remove any further units.
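As an additional check, a common rule of thumb flags units whose Cook’s distance exceeds 4/n; this cutoff is an editorial suggestion, not part of the original analysis:

# Units whose Cook's distance exceeds the conventional 4/n cutoff
which(cooks.distance(fit2) > 4 / nrow(mydata))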

mydata$StdFitted <- scale(fit2$fitted.values)

# Plot standardized residuals against standardized fitted values
# to inspect homoscedasticity visually (formal test below)
scatterplot(StdResid ~ StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            smooth = FALSE,
            regLine = FALSE,
            boxplots = FALSE,
            data = mydata)

library(olsrr)
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : price 
##  Variables: fitted values of price 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    1 
##  Chi2          =    12.11175 
##  Prob > Chi2   =    0.000501052

H0: the variance is constant. H1: the variance is not constant.

We reject the null hypothesis (p < 0.001), which means the variance of the residuals is not constant. Because of this heteroskedasticity, we should use robust standard errors.

As for the assumption of error independence, it is satisfied because each unit is observed only once and we are not dealing with time-series or panel data.

Lastly, the assumption that the number of units is greater than the number of estimated parameters (n>k) is also met.

Results for regression model

library(estimatr)
fit_robust <- lm_robust(price ~ carat + cut,
                        se_type = "HC1",  # White's (HC1) robust standard errors
                        data = mydata)

summary(fit_robust)
## 
## Call:
## lm_robust(formula = price ~ carat + cut, data = mydata, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper DF
## (Intercept)  -1821.8      175.2 -10.397 1.418e-16  -2170.4  -1473.1 81
## carat         6181.8      171.5  36.044 1.218e-51   5840.6   6523.1 81
## cutIdeal       430.3      139.9   3.076 2.865e-03    151.9    708.6 81
## 
## Multiple R-squared:  0.9385 ,    Adjusted R-squared:  0.937 
## F-statistic: 771.5 on 2 and 81 DF,  p-value: < 2.2e-16

Answer to the research question: Based on our data, both the carat weight and the cut quality of a diamond have a statistically significant influence on its price. Holding cut constant, each additional carat is associated with a price increase of about $6,182; holding carat constant, an Ideal cut sells for about $430 more than a Good cut. The model explains about 94% of the variation in price (adjusted R-squared = 0.937).
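To illustrate the fitted model, a hypothetical prediction for a one-carat, Ideal-cut diamond (a usage sketch, not part of the original analysis):

# Predicted price of a 1-carat diamond with an Ideal cut
predict(fit_robust,
        newdata = data.frame(carat = 1,
                             cut = factor("Ideal", levels = levels(mydata$cut))))
# roughly -1822 + 6182 + 430 = $4,790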