# Import the dataset
diamonds <- read.csv("/Users/markodjordjevic/Desktop/IMB/R data/MVA HW 3/MVA Homework 3/diamonds.csv")
mydata <- diamonds[c(8, 2, 3)] # keep price, carat and cut
head(mydata)
## price carat cut
## 1 326 0.23 Ideal
## 2 326 0.21 Ideal
## 3 327 0.23 Good
## 4 334 0.29 Ideal
## 5 335 0.31 Good
## 6 336 0.24 <NA>
The unit of observation: An individual diamond.
The sample size: 53,490.
carat: Weight of the diamond.
cut: Quality of the cut (Fair, Good, Very Good, Premium, Ideal).
price: Price in US dollars ($326–$18,823).
# Remove all units with missing values (NA)
library(tidyr)
mydata <- drop_na(mydata)
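Before dropping rows, it can be useful to see how many missing values each variable contains; a quick base-R check (not part of the original output):
# Count missing values per column before removal
colSums(is.na(mydata))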
# Set the seed for reproducibility and draw a random sample of 100 units
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(1)
mydata <- sample_n(mydata, 100)
mydata$ID <- seq_len(nrow(mydata))
head(mydata,10)
## price carat cut ID
## 1 776 0.30 Ideal 1
## 2 3804 0.74 Ideal 2
## 3 6176 1.01 Good 3
## 4 460 0.35 Ideal 4
## 5 1257 0.33 Ideal 5
## 6 7528 1.01 Ideal 6
## 7 4914 0.90 Good 7
## 8 816 0.34 Ideal 8
## 9 805 0.34 Ideal 9
## 10 698 0.31 Ideal 10
# Show some descriptive statistics
mydata$cut <- as.factor(mydata$cut)
summary(mydata[, -4]) # exclude the ID column
## price carat cut
## Min. : 435 Min. :0.2400 Good :21
## 1st Qu.: 1002 1st Qu.:0.4075 Ideal:79
## Median : 2258 Median :0.7150
## Mean : 3614 Mean :0.7783
## 3rd Qu.: 5616 3rd Qu.:1.0200
## Max. :16384 Max. :2.0100
Carat: The weight of the diamonds ranges from 0.24 to 2.01 carats, with an average of approximately 0.78 carats.
Price: The price of the diamonds ranges from $435 to $16,384, with an average of approximately $3,614.
Cut: There are 21 diamonds with a ‘Good’ cut quality and 79 with an ‘Ideal’ cut quality.
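To complement the overall summary, group averages by cut can be computed; a minimal base-R sketch (not shown in the original output):
# Average price and carat by cut quality
aggregate(cbind(price, carat) ~ cut, data = mydata, FUN = mean)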
The price of a diamond (the dependent variable Y) is expected to be strongly influenced by its carat weight (independent variable X1) and the quality of its cut (independent variable X2). As the size of a diamond increases, so does its weight, measured in carats; larger diamonds, being rarer and more desirable, command a higher price. The quality of a diamond’s cut is also pivotal to its pricing: a superior cut enhances the diamond’s brilliance and sparkle (key factors in its aesthetic appearance), making such diamonds more sought after, so consumers are often willing to pay more for them. Thus, both the carat weight and the cut quality play crucial roles in determining a diamond’s price.
The multiple regression model can be expressed as: Price = β0 + β1(Carat) + β2(CutIdeal) + ε, where CutIdeal is a dummy variable equal to 1 for an Ideal cut and 0 for a Good cut (the two levels present in this sample).
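Since cut is a factor, R encodes it as a dummy variable automatically; the coding it will use in the regression can be inspected like this:
# Inspect the dummy coding R uses for the cut factor
contrasts(mydata$cut)
# First few rows of the design matrix for the regression
head(model.matrix(~ carat + cut, data = mydata))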
Assumptions:
Linearity: There exists a linear relationship between the independent variables and the dependent variable.
Independence: The residuals (errors) are independent; in particular, there is no correlation between consecutive residuals, as there would be in time-series data.
Homoscedasticity: The residuals have constant variance at every level of the independent variables (checked with the Breusch-Pagan test).
Normality: The residuals of the model are normally distributed (checked with the Shapiro-Wilk test).
No multicollinearity: The independent variables are not highly correlated with each other (checked with VIF).
# Scatterplot matrix for price and carat
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(mydata[, c(-3, -4)],
                  smooth = FALSE)
The scatter plot reveals a positive correlation between a diamond’s carat (weight) and its selling price: as carat weight increases, so does the price. The other variable influencing diamond prices, cut, is categorical, making a scatter plot unsuitable for visualizing its relationship with price. Based on this scatter plot, the relationship between price and carat appears to be linear.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[, c("price", "carat")]))
## price carat
## price 1.00 0.91
## carat 0.91 1.00
##
## n= 100
##
##
## P
## price carat
## price 0
## carat 0
H0: Rho(price, carat) = 0 (There is no correlation between price and carat).
H1: Rho(price, carat) ≠ 0 (There is a correlation between price and carat).
From the correlation analysis above, the correlation coefficient between price and carat is 0.91 and the reported p-value is 0 (i.e., below 0.05). Therefore, we can reject the null hypothesis (H0). This indicates a strong positive correlation between price and carat: the carat weight of a diamond appears to be associated with its price.
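The same test can be reproduced with base R's cor.test(), which also reports the t statistic and a confidence interval for the correlation:
# Base-R test of H0: rho(price, carat) = 0
cor.test(mydata$price, mydata$carat)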
fit1 <- lm(price ~ carat + cut,
data = mydata)
# Check multicollinearity with VIF
vif(fit1)
## carat cut
## 1.043463 1.043463
mean(vif(fit1))
## [1] 1.043463
The VIF value is 1.043, which is close to 1. This suggests that there is no multicollinearity problem in the data, and certainly no strong multicollinearity.
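As a cross-check on the reported value, the VIF for carat can be computed by hand as 1 / (1 - R²) from an auxiliary regression of carat on the remaining predictor; a minimal sketch:
# VIF for carat by hand: regress carat on the other predictor
aux <- lm(carat ~ cut, data = mydata)
1 / (1 - summary(aux)$r.squared) # should be close to 1.043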
# Check for outliers and high-influence units
reg <- lm(price ~ carat + cut, data = mydata)
mydata$StdFittedValues <- scale(fitted.values(reg))
mydata$StdResiduals <- round(rstandard(reg),3)
mydata$CooksD <- round(cooks.distance(reg),3)
hist(mydata$StdResiduals,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "pink")
head(mydata[order(-mydata$StdResiduals), ], 6)
## price carat cut ID StdFittedValues StdResiduals CooksD
## 72 16384 1.71 Ideal 72 2.3436176 4.721 0.580
## 21 14071 1.50 Ideal 21 1.8241025 4.036 0.282
## 17 9405 1.24 Good 17 0.9965970 2.271 0.097
## 68 11303 1.53 Good 68 1.7140226 2.139 0.114
## 100 8011 1.02 Ideal 100 0.6366394 1.963 0.023
## 6 7528 1.01 Ideal 6 0.6119006 1.645 0.016
head(mydata[order(mydata$StdResiduals), ], 10)
## price carat cut ID StdFittedValues StdResiduals CooksD
## 29 5268 1.50 Good 29 1.6398062 -2.471 0.147
## 45 4098 1.23 Ideal 45 1.1561545 -2.236 0.048
## 23 10412 2.01 Ideal 23 3.0857820 -1.778 0.137
## 77 3521 1.02 Ideal 77 0.6366394 -1.502 0.014
## 38 2912 1.00 Good 38 0.4028654 -1.469 0.036
## 46 2241 0.83 Ideal 46 0.1666019 -1.430 0.009
## 99 5547 1.27 Ideal 99 1.2551097 -1.337 0.019
## 71 4465 1.09 Ideal 71 0.8098111 -1.164 0.009
## 56 7553 1.51 Ideal 56 1.8488413 -1.133 0.023
## 34 7577 1.51 Ideal 34 1.8488413 -1.114 0.022
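Below, units are removed stepwise based on the ordered tables. As an aside, a rule-of-thumb filter could flag all candidates in one pass; a sketch using the common cut-offs |standardized residual| > 3 and Cook's D > 4/n (these thresholds are conventions, not part of the original analysis):
# Flag high-residual / high-influence units in one pass
n <- nrow(mydata)
flagged <- abs(rstandard(reg)) > 3 | cooks.distance(reg) > 4 / n
mydata[flagged, c("ID", "price", "carat", "cut")]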
# Remove the two most extreme outliers identified above (ID 21 and 72)
mydata <- mydata %>%
  filter(!ID %in% c(21, 72))
reg1 <- lm(price ~ carat + cut, data = mydata)
mydata$StdFittedValues <- scale(fitted.values(reg1))
mydata$StdResiduals <- round(rstandard(reg1),3)
mydata$CooksD <- round(cooks.distance(reg1),3)
head(mydata[order(-mydata$CooksD),], 6)
## price carat cut ID StdFittedValues StdResiduals CooksD
## 67 11303 1.53 Good 68 1.887208 3.220 0.268
## 28 5268 1.50 Good 29 1.810723 -2.823 0.199
## 17 9405 1.24 Good 17 1.147860 3.178 0.192
## 49 10424 1.56 Ideal 50 2.055641 1.856 0.077
## 22 10412 2.01 Ideal 23 3.202904 -1.219 0.074
## 44 4098 1.23 Ideal 45 1.214314 -2.410 0.063
# Remove the high-influence units identified above (ID 68, 29 and 17)
mydata <- mydata %>%
  filter(!ID %in% c(68, 29, 17))
reg2 <- lm(price ~ carat + cut, data = mydata)
mydata$StdFittedValues <- scale(fitted.values(reg2))
mydata$StdResiduals <- round(rstandard(reg2),3)
mydata$CooksD <- round(cooks.distance(reg2),3)
# Remove unit ID 100
mydata <- mydata %>%
  filter(ID != 100)
reg3 <- lm(price ~ carat + cut, data = mydata)
mydata$StdFittedValues <- scale(fitted.values(reg3))
mydata$StdResiduals <- round(rstandard(reg3),3)
mydata$CooksD <- round(cooks.distance(reg3),3)
# Remove units ID 6, 50 and 61
mydata <- mydata %>%
  filter(!ID %in% c(6, 50, 61))
reg4 <- lm(price ~ carat + cut, data = mydata)
mydata$StdFittedValues <- scale(fitted.values(reg4))
mydata$StdResiduals <- round(rstandard(reg4),3)
mydata$CooksD <- round(cooks.distance(reg4),3)
# Check whether any high-influence units remain
hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distances")
# Remove units ID 3, 45 and 38
mydata <- mydata %>%
  filter(!ID %in% c(3, 45, 38))
reg5 <- lm(price ~ carat + cut, data = mydata)
mydata$StdFittedValues <- scale(fitted.values(reg5))
mydata$StdResiduals <- round(rstandard(reg5),3)
mydata$CooksD <- round(cooks.distance(reg5),3)
head(mydata[order(-mydata$CooksD),], 10)
## price carat cut ID StdFittedValues StdResiduals CooksD
## 23 7119 1.03 Ideal 27 0.8858904 3.425 0.094
## 5 4914 0.90 Good 7 0.3461064 1.928 0.089
## 57 2140 0.82 Good 66 0.1326256 -1.835 0.080
## 19 10412 2.01 Ideal 23 3.5010305 -1.118 0.078
## 72 5251 1.01 Good 83 0.6396425 1.363 0.046
## 66 3521 1.02 Ideal 77 0.8592053 -2.281 0.041
## 84 6333 1.02 Ideal 95 0.8592053 2.255 0.040
## 88 5547 1.27 Ideal 99 1.5263329 -1.521 0.034
## 39 2241 0.83 Ideal 46 0.3521884 -2.442 0.031
## 54 6589 1.11 Ideal 63 1.0993713 1.776 0.031
# Remove units ID 7, 27, 23 and 66
mydata <- mydata %>%
  filter(!ID %in% c(7, 27, 23, 66))
reg6 <- lm(price ~ carat + cut, data = mydata)
mydata$StdFittedValues <- scale(fitted.values(reg6))
mydata$StdResiduals <- round(rstandard(reg6),3)
mydata$CooksD <- round(cooks.distance(reg6),3)
shapiro.test(mydata$StdResiduals)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResiduals
## W = 0.98491, p-value = 0.4341
H0: The standardized residuals are normally distributed.
H1: The standardized residuals are not normally distributed.
Since p = 0.4341 > 0.05, we fail to reject H0 and can assume that the standardized residuals are normally distributed.
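A normal Q-Q plot offers a visual complement to the Shapiro-Wilk test (not shown in the original output):
# Visual normality check: points should lie close to the line
qqnorm(mydata$StdResiduals, main = "Normal Q-Q plot of standardized residuals")
qqline(mydata$StdResiduals)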
# Final model on the cleaned data
fit2 <- lm(price ~ carat + cut,
           data = mydata)
mydata$StdResiduals <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)
Now, we plot the histograms again.
hist(mydata$StdResiduals,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "purple",
     border = "black")
hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distances")
All Cook's distance values are well below 1. There is a gap in the distribution, but it is not substantial, so we do not need to remove any further units.
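The car package also provides influencePlot(), which combines studentized residuals, leverage, and Cook's distance in a single display; a short sketch as an additional check:
# Bubble plot: residuals vs leverage, circle size proportional to Cook's D
influencePlot(fit2)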
mydata$StdFittedValues <- scale(fit2$fitted.values)
scatterplot(StdResiduals ~ StdFittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            smooth = FALSE,
            regLine = FALSE,
            boxplots = FALSE,
            data = mydata)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : price
## Variables: fitted values of price
##
## Test Summary
## ------------------------------
## DF = 1
## Chi2 = 12.11175
## Prob > Chi2 = 0.000501052
H0: The variance of the residuals is constant. H1: The variance is not constant.
We reject the null hypothesis (p < 0.001): the variance of the residuals is not constant (heteroskedasticity). We therefore need to use robust standard errors.
As for the assumption of error independence, it is satisfied because each unit is observed only once, and we are not dealing with panel data.
Lastly, the assumption that the number of units is greater than the number of estimated parameters (n>k) is also met.
library(estimatr)
fit_robust <- lm_robust(price ~ carat + cut,
                        se_type = "HC1", # White's (HC1) robust standard errors
                        data = mydata)
summary(fit_robust)
##
## Call:
## lm_robust(formula = price ~ carat + cut, data = mydata, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) -1821.8 175.2 -10.397 1.418e-16 -2170.4 -1473.1 81
## carat 6181.8 171.5 36.044 1.218e-51 5840.6 6523.1 81
## cutIdeal 430.3 139.9 3.076 2.865e-03 151.9 708.6 81
##
## Multiple R-squared: 0.9385 , Adjusted R-squared: 0.937
## F-statistic: 771.5 on 2 and 81 DF, p-value: < 2.2e-16
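The same HC1 robust standard errors can also be obtained with the sandwich and lmtest packages (assuming they are installed), by combining coeftest() with vcovHC() on the ordinary lm fit:
# Alternative: robust inference on the ordinary least-squares fit
library(lmtest)
library(sandwich)
coeftest(fit2, vcov = vcovHC(fit2, type = "HC1"))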
For carat:
H0: beta1 = 0
H1: beta1 ≠ 0
Based on the data, we reject H0 at p < 0.001, indicating that carat has a significant effect on price.
Interpretation of carat: if the weight of a diamond increases by 1 carat, the price increases on average by $6,181.8, holding the type of cut constant (p < 0.001).
For cutIdeal:
H0: beta2 = 0
H1: beta2 ≠ 0
Based on the data, we reject H0 at p < 0.01, indicating that cutIdeal has a significant effect on price.
Interpretation of cutIdeal: holding carat constant, a diamond with an Ideal cut costs on average $430.3 more than a diamond with a Good cut (p = 0.003).
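To illustrate the fitted equation, the predicted price of a hypothetical 1.00-carat, Ideal-cut diamond can be computed directly from the robust coefficients above:
# Predicted price for a hypothetical 1.00-carat, Ideal-cut diamond
-1821.8 + 6181.8 * 1.00 + 430.3 # about $4,790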
The Multiple R-squared value is 0.9385, which means that 93.85% of the variability in price is explained by carat and cut.
F statistic:
H0: Rho squared = 0
H1: Rho squared > 0
Based on our data, we reject H0 (p < 0.001) and conclude that at least one of the explanatory variables (carat, cut) affects price.
Answer to the research question: based on our data, both the carat weight and the cut of a diamond have a statistically significant influence on its price.