Importing the data set.
mydata <- read.csv("~/Desktop/IMB/2. SEMESTER/MULTIVARIATE ANALYSIS/HOMEWORKS/HW3/diamonds.csv",
header = TRUE,
sep = ",",
dec = ".")
Cleaning the data set for better usability and understanding by removing the unneeded columns and renaming the remaining ones.
mydata1 <- mydata[,c(-6,-7)]
colnames(mydata1)<-c("ID", "Carat","Cut_Quality","Color","Clarity","Price","Length","Width","Depth")
head(mydata1,10)
## ID Carat Cut_Quality Color Clarity Price Length Width Depth
## 1 1 0.23 Ideal E SI2 326 3.95 3.98 2.43
## 2 2 0.21 Premium E SI1 326 3.89 3.84 2.31
## 3 3 0.23 Good E VS1 327 4.05 4.07 2.31
## 4 4 0.29 Premium I VS2 334 4.20 4.23 2.63
## 5 5 0.31 Good J SI2 335 4.34 4.35 2.75
## 6 6 0.24 Very Good J VVS2 336 3.94 3.96 2.48
## 7 7 0.24 Very Good I VVS1 336 3.95 3.98 2.47
## 8 8 0.26 Very Good H SI1 337 4.07 4.11 2.53
## 9 9 0.22 Fair E VS2 337 3.87 3.78 2.49
## 10 10 0.23 Very Good H VS1 338 4.00 4.05 2.39
The unit of observation in my sample is a single diamond. The original size of the data set is 53940 units with 11 variables. However, for the purpose of this analysis I will be using a sample of 1000 randomly selected observations from the data set.
The variables are the following:
Carat: Weight of the diamond in carats (1 carat equals 200mg or 0,2g).
Cut Quality: The quality of the cut of the diamond in the order from best to worst: Ideal, Premium, Very Good, Good & Fair (ordinal variable).
Color: The color that the observed diamond is from J (which is the worst) to D (which is the best).
Clarity: The measurement of clarity of the observed diamond in the following order from worst to best: I1, SI2, SI1, VS2, VS1, VVS2, VVS1 & IF.
Price: Selling price of the diamond in US dollars.
Length: Length of the diamond in millimeters.
Width: Width of the diamond in millimeters.
Depth: Depth of the diamond in millimeters.
The source of the above data was found on the Kaggle website, the author is Swati Khedekar. Retrieved January 3rd, 2023, from https://www.kaggle.com/datasets/swatikhedekar/price-prediction-of-diamond.
The main goal of the data analysis is to see how variables like carat, cut quality, color and clarity affect the price of the diamonds on the market.
Removing all variables that have the value 0.
mydata2 <- filter_if(mydata1, is.numeric, all_vars((.) != 0))
This removed 20 units of observation from the data set.
Combining the categorical variables into larger groups for easier and clearer interpretation of regression results.
mydata2[mydata2 == 'Fair'] <- 'Low'
mydata2[mydata2 == 'Good'] <- 'Medium'
mydata2[mydata2 == 'Very Good'] <- 'Medium'
mydata2[mydata2 == 'Premium'] <- 'High'
mydata2[mydata2 == 'Ideal'] <- 'High'
mydata2[mydata2 == 'D'] <- 'Colorless'
mydata2[mydata2 == 'E'] <- 'Colorless'
mydata2[mydata2 == 'F'] <- 'Colorless'
mydata2[mydata2 == 'G'] <- 'Near_Colorless'
mydata2[mydata2 == 'H'] <- 'Near_Colorless'
mydata2[mydata2 == 'I'] <- 'Near_Colorless'
mydata2[mydata2 == 'J'] <- 'Near_Colorless'
mydata2[mydata2 == 'I1'] <- 'Has_Imperfections'
mydata2[mydata2 == 'SI2'] <- 'Slight_Imperfections'
mydata2[mydata2 == 'SI1'] <- 'Slight_Imperfections'
mydata2[mydata2 == 'VS2'] <- 'Slight_Imperfections'
mydata2[mydata2 == 'VS1'] <- 'Slight_Imperfections'
mydata2[mydata2 == 'VVS2'] <- 'No_Imperfections'
mydata2[mydata2 == 'VVS1'] <- 'No_Imperfections'
mydata2[mydata2 == 'IF'] <- 'No_Imperfections'
We are left with 8 larger groups that will be transformed into factors for more control over the included variables in the regression process.
Transforming the categorical variables into factor, to be able to control for the base variable comparison in the regression model.
mydata2$Cut_QualityF <- factor(mydata2$Cut_Quality,
levels = c('Low', 'Medium', 'High'),
labels = c('Low', 'Medium', 'High'))
mydata2$ColorF <- factor(mydata2$Color,
levels = c('Near_Colorless', 'Colorless'),
labels = c('Near_Colorless', 'Colorless'))
mydata2$ClarityF <- factor(mydata2$Clarity,
levels = c('Has_Imperfections', 'Slight_Imperfections', 'No_Imperfections'),
labels = c('Has_Imperfections', 'Slight_Imperfections', 'No_Imperfections'))
Selecting 1000 random observations from the data set for further analysis.
set.seed(7)
mydata3 <- sample_n(mydata2, 1000)
head(mydata3,10)
## ID Carat Cut_Quality Color Clarity Price Length
## 1 26078 1.02 Medium Colorless No_Imperfections 15306 6.36
## 2 4574 0.92 High Near_Colorless Slight_Imperfections 3648 6.40
## 3 51926 0.90 Low Near_Colorless Slight_Imperfections 2438 5.92
## 4 30105 0.32 High Near_Colorless Slight_Imperfections 720 4.37
## 5 15179 1.01 High Colorless Slight_Imperfections 6097 6.40
## 6 6316 1.00 Medium Colorless Slight_Imperfections 4026 6.31
## 7 5555 1.02 Medium Near_Colorless Slight_Imperfections 3857 6.37
## 8 571 0.70 Medium Colorless Slight_Imperfections 2833 5.77
## 9 20757 1.53 Low Near_Colorless Slight_Imperfections 8996 7.60
## 10 19357 0.30 Medium Colorless No_Imperfections 622 4.35
## Width Depth Cut_QualityF ColorF ClarityF
## 1 6.46 4.02 Medium Colorless No_Imperfections
## 2 6.34 3.73 High Near_Colorless Slight_Imperfections
## 3 5.87 3.81 Low Near_Colorless Slight_Imperfections
## 4 4.35 2.74 High Near_Colorless Slight_Imperfections
## 5 6.39 4.01 High Colorless Slight_Imperfections
## 6 6.26 4.01 Medium Colorless Slight_Imperfections
## 7 6.43 4.00 Medium Near_Colorless Slight_Imperfections
## 8 5.80 3.45 Medium Colorless Slight_Imperfections
## 9 7.51 4.36 Low Near_Colorless Slight_Imperfections
## 10 4.39 2.60 Medium Colorless No_Imperfections
The price of a diamond (the independent variable) depends on a variety of factors (independent variables) the carat (weight), the quality of cut, the color of the diamond and the clarity of the observed diamond. The larger the diamond, the heavier it will be and therefore the price charge for it will be higher. The higher the cut quality, the more people will want the diamond and be prepared to pay a higher price for it. Furthermore, a diamond that is clearer and of a better color (colorless) will also have the same affect on the customers, thus the price that a diamond will be able to achieve will be higher. The main source that will back the theoretical explanation behind the regression idea is Days Jewelers official website. The information was retrieved on January 19th, 2023, from https://www.daysjewelers.com/education-center/diamonds/34-factors
mydata3 %>%
ggplot() +
aes(x = Carat, y = Price) +
geom_point(color = "#570861") +
geom_smooth(method = "lm", formula = y ~ x, se = F, color = "#570861") +
ggtitle('Scatterplot of Diamond Prices as a Function of Carats')
From the above scatter plot we can observe a positive relationship
between the carat (weight) of a diamond and the price that a diamond
will be able to sell for. Since the rest of the variables that affect
the price of diamonds are categorical it would not make sense to present
the relationship between them in the form of a scatter plot or scatter
plot matrix. From the scatter plot there are signs that show a potential
non-linear relationship. To solve this the natural logarithms of price
and carats will be taken into the model.
mydata3$CaratLOG <- log(mydata3$Carat)
mydata3$PriceLOG <- log(mydata3$Price)
mydata3 <- mydata3[c(1,2,13,3,4,5,6,14,7,8,9,10,11,12)]
mydata3 %>%
ggplot() +
aes(y = PriceLOG, x = CaratLOG) +
geom_point(color = "#570861") +
geom_smooth(method = "lm", formula = y ~ x, se = F, color = "#570861") +
ggtitle('Scatterplot of Diamond Prices as a Function of Carats')
reg <- lm(data = mydata3, PriceLOG ~ CaratLOG + Cut_QualityF + ColorF + ClarityF)
summary(reg)
##
## Call:
## lm(formula = PriceLOG ~ CaratLOG + Cut_QualityF + ColorF + ClarityF,
## data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62872 -0.13331 -0.01118 0.12620 0.87100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.58326 0.05129 147.837 < 2e-16 ***
## CaratLOG 1.78071 0.01153 154.478 < 2e-16 ***
## Cut_QualityFMedium 0.11963 0.03704 3.230 0.00128 **
## Cut_QualityFHigh 0.18331 0.03614 5.072 4.69e-07 ***
## ColorFColorless 0.19105 0.01308 14.610 < 2e-16 ***
## ClarityFSlight_Imperfections 0.59658 0.04569 13.058 < 2e-16 ***
## ClarityFNo_Imperfections 0.94663 0.04861 19.473 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2007 on 993 degrees of freedom
## Multiple R-squared: 0.9617, Adjusted R-squared: 0.9614
## F-statistic: 4150 on 6 and 993 DF, p-value: < 2.2e-16
Creating and checking the first regression model with all of the mentioned explanatory variables, all of the regression coefficient estimates are significant. However, before we can confirm that these estimates of coefficients are correct and begin to explain them, diagnostics have to be performed on the model.
Creating additional variables for standardized fitted values, standardized residuals, and Cooks distances that will be used to perform diagnostics on the first regression model.
mydata3$StdFittedValues <- scale(fitted.values(reg))
mydata3$StdResiduals <- round(rstandard(reg),3)
mydata3$CooksD <- round(cooks.distance(reg),3)
Creating a histogram, to check whether the distribution of the standardized residuals is normal and confirming this assumption using the Shapiro-Wilk normality test.
hist(mydata3$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col = "#570861")
shapiro.test(mydata3$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata3$StdResid
## W = 0.99557, p-value = 0.005485
Both the histogram as well as the Shapiro-Wilk test reject our null hypothesis, meaning the distribution of the standardized residuals is not normal. Furthermore, we can clearly see that there are outliers that spread way beyond the +/-3 standard deviations that will have to be removed. The normality test is not going to affect the results much, especially since the sample is large, the p-values will be low.
Before removing the outliers the check regarding Cooks distances is also done using a histogram where we can check for units that potentially have a high impact on the results.
hist(mydata3$CooksD,
xlab = "Cooks Distances",
ylab = "Frequency",
main = "Histogram of Cooks Distances",
col = "#570861")
While theory stipulates that Cooks distances are normal (not a high
impact unit) up to 1 there is a significant amount of space (units lie
far from the majority of the others) between the units observed in the
sample and we should therefore also remove those lying farther away to
improve the regression model.
To find the specific values of where to cut the data and confirm the assumptions made above regarding the standardized residuals and Cooks distances the data is ordered according to the values of these variables.
head(mydata3[order(mydata3$StdResiduals),], 20)
## ID Carat CaratLOG Cut_Quality Color Clarity
## 842 4803 1.31 0.2700271 High Near_Colorless Slight_Imperfections
## 303 5236 1.16 0.1484200 High Colorless Slight_Imperfections
## 850 49496 1.00 0.0000000 Low Near_Colorless Slight_Imperfections
## 395 19709 0.42 -0.8675006 Medium Colorless Slight_Imperfections
## 19 28642 0.33 -1.1086626 High Colorless Slight_Imperfections
## 660 3413 0.38 -0.9675840 High Colorless Slight_Imperfections
## 567 39617 0.35 -1.0498221 High Colorless Slight_Imperfections
## 127 22686 0.40 -0.9162907 High Colorless Slight_Imperfections
## 389 25193 2.61 0.9593502 Medium Near_Colorless Slight_Imperfections
## 534 21281 2.03 0.7080358 High Near_Colorless Slight_Imperfections
## 130 36708 0.46 -0.7765288 High Near_Colorless No_Imperfections
## 83 46820 0.72 -0.3285041 High Colorless Slight_Imperfections
## 392 13127 1.21 0.1906204 High Near_Colorless No_Imperfections
## 712 17974 1.73 0.5481214 High Near_Colorless Slight_Imperfections
## 458 49311 0.40 -0.9162907 High Near_Colorless Slight_Imperfections
## 757 15532 1.62 0.4824261 Medium Near_Colorless Slight_Imperfections
## 963 38301 0.37 -0.9942523 High Near_Colorless Slight_Imperfections
## 726 48920 0.70 -0.3566749 Medium Near_Colorless No_Imperfections
## 804 42637 0.58 -0.5447272 High Colorless Slight_Imperfections
## 226 18077 1.52 0.4187103 Low Near_Colorless No_Imperfections
## Price PriceLOG Length Width Depth Cut_QualityF ColorF
## 842 3697 8.215277 7.06 7.01 4.20 High Near_Colorless
## 303 3787 8.239329 6.74 6.63 4.15 High Colorless
## 850 2117 7.657755 6.13 5.86 4.23 Low Near_Colorless
## 395 625 6.437752 4.91 4.98 2.90 Medium Colorless
## 19 434 6.073045 4.40 4.43 2.75 High Colorless
## 660 568 6.342121 4.69 4.74 2.87 High Colorless
## 567 491 6.196444 4.52 4.56 2.77 High Colorless
## 127 629 6.444131 4.75 4.78 2.96 High Colorless
## 389 13784 9.531264 8.66 8.57 5.53 Medium Near_Colorless
## 534 9435 9.152181 8.15 8.12 4.91 High Near_Colorless
## 130 953 6.859615 4.93 4.97 3.06 High Near_Colorless
## 83 1808 7.499977 5.88 5.85 3.48 High Colorless
## 392 5431 8.599879 6.82 6.85 4.22 High Near_Colorless
## 712 7270 8.891512 7.64 7.54 4.64 High Near_Colorless
## 458 540 6.291569 4.75 4.77 2.96 High Near_Colorless
## 757 6223 8.736007 7.41 7.33 4.73 Medium Near_Colorless
## 963 487 6.188264 4.54 4.57 2.85 High Near_Colorless
## 726 2042 7.621685 5.63 5.68 3.58 Medium Near_Colorless
## 804 1332 7.194437 5.35 5.32 3.31 High Colorless
## 226 7303 8.896041 7.28 7.21 4.70 Low Near_Colorless
## ClarityF StdFittedValues StdResiduals CooksD
## 842 Slight_Imperfections 1.0466005 -3.138 0.005
## 303 Slight_Imperfections 1.0211569 -2.892 0.005
## 850 Slight_Imperfections 0.3838258 -2.644 0.033
## 395 Slight_Imperfections -0.8477112 -2.537 0.004
## 19 Slight_Imperfections -1.2127168 -2.531 0.004
## 660 Slight_Imperfections -0.9620159 -2.442 0.003
## 567 Slight_Imperfections -1.1081555 -2.438 0.004
## 127 Slight_Imperfections -0.8708662 -2.388 0.003
## 389 Slight_Imperfections 2.2080017 -2.384 0.007
## 534 Slight_Imperfections 1.8249550 -2.357 0.004
## 130 No_Imperfections -0.4638416 -2.354 0.006
## 83 Slight_Imperfections 0.1736484 -2.342 0.003
## 392 No_Imperfections 1.2548116 -2.266 0.007
## 712 Slight_Imperfections 1.5407824 -2.236 0.003
## 458 Slight_Imperfections -1.0615221 -2.196 0.003
## 757 Slight_Imperfections 1.3604932 -2.111 0.004
## 963 Slight_Imperfections -1.2000621 -2.019 0.003
## 726 No_Imperfections 0.2187047 -1.966 0.005
## 804 Slight_Imperfections -0.2105865 -1.945 0.002
## 226 No_Imperfections 1.4772060 -1.930 0.022
head(mydata3[order(-mydata3$StdResiduals),], 20)
## ID Carat CaratLOG Cut_Quality Color Clarity
## 599 26999 1.02 0.019802627 Medium Colorless No_Imperfections
## 1 26078 1.02 0.019802627 Medium Colorless No_Imperfections
## 793 8593 0.56 -0.579818495 Medium Colorless No_Imperfections
## 731 18986 1.01 0.009950331 High Near_Colorless Slight_Imperfections
## 624 21856 1.19 0.173953307 High Near_Colorless Slight_Imperfections
## 634 18305 1.02 0.019802627 High Near_Colorless Slight_Imperfections
## 20 20951 1.08 0.076961041 Medium Colorless Slight_Imperfections
## 992 25797 1.52 0.418710335 High Near_Colorless Slight_Imperfections
## 884 18210 1.03 0.029558802 High Near_Colorless Slight_Imperfections
## 310 13268 0.91 -0.094310679 Medium Near_Colorless Slight_Imperfections
## 453 20227 1.02 0.019802627 High Colorless Slight_Imperfections
## 78 32691 0.31 -1.171182982 Medium Near_Colorless Slight_Imperfections
## 198 26507 1.51 0.412109651 Medium Colorless Slight_Imperfections
## 122 27347 1.74 0.553885113 High Near_Colorless Slight_Imperfections
## 513 16564 1.01 0.009950331 High Near_Colorless Slight_Imperfections
## 590 32678 0.31 -1.171182982 High Near_Colorless Slight_Imperfections
## 849 26167 1.21 0.190620360 High Colorless No_Imperfections
## 252 35718 0.30 -1.203972804 High Colorless Slight_Imperfections
## 543 35710 0.30 -1.203972804 High Colorless Slight_Imperfections
## 153 22475 1.01 0.009950331 Medium Colorless No_Imperfections
## Price PriceLOG Length Width Depth Cut_QualityF ColorF
## 599 17100 9.746834 6.42 6.52 3.99 Medium Colorless
## 1 15306 9.636000 6.36 6.46 4.02 Medium Colorless
## 793 4440 8.398410 5.34 5.36 3.25 Medium Colorless
## 731 7814 8.963672 6.39 6.43 3.98 High Near_Colorless
## 624 9918 9.202107 6.78 6.81 4.24 High Near_Colorless
## 634 7440 8.914626 6.41 6.43 4.01 High Near_Colorless
## 20 9168 9.123474 6.61 6.67 4.00 Medium Colorless
## 992 14787 9.601504 7.39 7.43 4.53 High Near_Colorless
## 884 7391 8.908018 6.50 6.54 3.99 High Near_Colorless
## 310 5484 8.609590 6.13 6.17 3.83 Medium Near_Colorless
## 453 8658 9.066239 6.48 6.53 3.98 High Colorless
## 78 802 6.687109 4.29 4.27 2.70 Medium Near_Colorless
## 198 16129 9.688374 7.48 7.54 4.47 Medium Colorless
## 122 17904 9.792779 7.78 7.69 4.77 High Near_Colorless
## 513 6618 8.797548 6.42 6.38 3.99 High Near_Colorless
## 590 802 6.687109 4.36 4.32 2.70 High Near_Colorless
## 849 15497 9.648402 6.75 6.84 4.27 High Colorless
## 252 911 6.814543 4.34 4.32 2.64 High Colorless
## 543 911 6.814543 4.38 4.33 2.65 High Colorless
## 153 10499 9.259035 6.49 6.58 3.91 Medium Colorless
## ClarityF StdFittedValues StdResiduals CooksD
## 599 No_Imperfections 1.0783726 4.364 0.031
## 1 No_Imperfections 1.0783726 3.809 0.024
## 793 No_Imperfections 0.0128278 2.955 0.011
## 731 Slight_Imperfections 0.5844362 2.908 0.003
## 624 Slight_Imperfections 0.8758744 2.641 0.003
## 634 Slight_Imperfections 0.6019440 2.575 0.003
## 20 Slight_Imperfections 0.8306255 2.477 0.004
## 992 Slight_Imperfections 1.3108150 2.460 0.004
## 884 Slight_Imperfections 0.6192810 2.456 0.002
## 310 Slight_Imperfections 0.3356145 2.387 0.004
## 453 Slight_Imperfections 0.7925999 2.380 0.003
## 78 Slight_Imperfections -1.5780200 2.367 0.006
## 198 Slight_Imperfections 1.4261946 2.319 0.005
## 122 Slight_Imperfections 1.5510246 2.214 0.003
## 513 Slight_Imperfections 0.5844362 2.079 0.002
## 590 Slight_Imperfections -1.5144733 2.046 0.003
## 849 No_Imperfections 1.4454675 2.028 0.007
## 252 Slight_Imperfections -1.3820859 2.019 0.003
## 543 Slight_Imperfections -1.3820859 2.019 0.003
## 153 No_Imperfections 1.0608648 2.008 0.007
head(mydata3[order(-mydata3$CooksD),], 20)
## ID Carat CaratLOG Cut_Quality Color Clarity
## 827 13758 2.22 0.797507196 Low Near_Colorless Has_Imperfections
## 850 49496 1.00 0.000000000 Low Near_Colorless Slight_Imperfections
## 599 26999 1.02 0.019802627 Medium Colorless No_Imperfections
## 115 19922 2.50 0.916290732 High Near_Colorless Has_Imperfections
## 1 26078 1.02 0.019802627 Medium Colorless No_Imperfections
## 226 18077 1.52 0.418710335 Low Near_Colorless No_Imperfections
## 698 15981 1.00 0.000000000 Low Colorless Slight_Imperfections
## 260 30341 0.50 -0.693147181 Low Near_Colorless Has_Imperfections
## 281 15507 1.00 0.000000000 Low Colorless Slight_Imperfections
## 443 15234 1.00 0.000000000 Low Colorless Slight_Imperfections
## 752 16318 1.15 0.139761942 Low Near_Colorless Slight_Imperfections
## 975 19599 2.33 0.845868268 High Near_Colorless Has_Imperfections
## 48 2802 1.04 0.039220713 High Near_Colorless Has_Imperfections
## 609 47656 0.34 -1.078809661 High Colorless Has_Imperfections
## 793 8593 0.56 -0.579818495 Medium Colorless No_Imperfections
## 206 4702 1.13 0.122217633 High Near_Colorless Has_Imperfections
## 635 45426 0.51 -0.673344553 Low Colorless Slight_Imperfections
## 153 22475 1.01 0.009950331 Medium Colorless No_Imperfections
## 389 25193 2.61 0.959350221 Medium Near_Colorless Slight_Imperfections
## 392 13127 1.21 0.190620360 High Near_Colorless No_Imperfections
## Price PriceLOG Length Width Depth Cut_QualityF ColorF
## 827 5607 8.631771 8.04 8.02 5.36 Low Near_Colorless
## 850 2117 7.657755 6.13 5.86 4.23 Low Near_Colorless
## 599 17100 9.746834 6.42 6.52 3.99 Medium Colorless
## 115 8467 9.043932 8.75 8.67 5.22 High Near_Colorless
## 1 15306 9.636000 6.36 6.46 4.02 Medium Colorless
## 226 7303 8.896041 7.28 7.21 4.70 Low Near_Colorless
## 698 6397 8.763584 6.25 6.21 4.07 Low Colorless
## 260 727 6.588926 5.01 4.97 3.24 Low Near_Colorless
## 281 6210 8.733916 6.24 6.33 4.08 Low Colorless
## 443 6115 8.718500 6.26 6.21 4.04 Low Colorless
## 752 6530 8.784162 6.60 6.50 4.23 Low Near_Colorless
## 975 8220 9.014325 8.64 8.56 5.22 High Near_Colorless
## 48 3261 8.089789 6.47 6.45 3.98 High Near_Colorless
## 609 530 6.272877 4.52 4.50 2.77 High Colorless
## 793 4440 8.398410 5.34 5.36 3.25 Medium Colorless
## 206 3678 8.210124 6.65 6.69 4.15 High Near_Colorless
## 635 1668 7.419381 5.06 5.03 3.25 Low Colorless
## 153 10499 9.259035 6.49 6.58 3.91 Medium Colorless
## 389 13784 9.531264 8.66 8.57 5.53 Medium Near_Colorless
## 392 5431 8.599879 6.82 6.85 4.22 High Near_Colorless
## ClarityF StdFittedValues StdResiduals CooksD
## 827 Has_Imperfections 1.20566987 -1.915 0.036
## 850 Slight_Imperfections 0.38382581 -2.644 0.033
## 599 No_Imperfections 1.07837258 4.364 0.031
## 115 Has_Imperfections 1.59968009 -1.813 0.026
## 1 No_Imperfections 1.07837258 3.809 0.024
## 226 No_Imperfections 1.47720602 -1.930 0.022
## 698 Slight_Imperfections 0.57448175 1.989 0.019
## 260 Has_Imperfections -1.44326782 1.239 0.016
## 281 Slight_Imperfections 0.57448175 1.839 0.016
## 443 Slight_Imperfections 0.57448175 1.761 0.015
## 752 Slight_Imperfections 0.63218699 1.800 0.015
## 975 Has_Imperfections 1.47453725 -1.323 0.014
## 48 Has_Imperfections 0.04110027 1.296 0.013
## 609 Has_Imperfections -1.75501742 1.212 0.012
## 793 No_Imperfections 0.01282780 2.955 0.011
## 206 Has_Imperfections 0.18858829 1.155 0.010
## 635 Slight_Imperfections -0.62207179 1.254 0.008
## 153 No_Imperfections 1.06086476 2.008 0.007
## 389 Slight_Imperfections 2.20800166 -2.384 0.007
## 392 No_Imperfections 1.25481157 -2.266 0.007
In the below steps the data is cleaned of the values lying above the +/- 3 standard deviations for standardized residuals and all units above the value of 0.016 in Cooks distances.
mydata4 <- mydata3[!(mydata3$StdResiduals >= 3.0),]
mydata4 <- mydata4[!(mydata4$StdResiduals <= -3.0),]
mydata4 <- mydata4[!(mydata4$CooksD > 0.016),]
Showing that the data has been cleaned of these units.
hist(mydata4$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col = "#570861")
shapiro.test(mydata3$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata3$StdResid
## W = 0.99557, p-value = 0.005485
hist(mydata4$CooksD,
xlab = "Cooks Distances",
ylab = "Frequency",
main = "Histogram of Cooks Distances",
col = "#570861")
head(mydata4[order(mydata4$StdResiduals),], 10)
## ID Carat CaratLOG Cut_Quality Color Clarity
## 303 5236 1.16 0.1484200 High Colorless Slight_Imperfections
## 395 19709 0.42 -0.8675006 Medium Colorless Slight_Imperfections
## 19 28642 0.33 -1.1086626 High Colorless Slight_Imperfections
## 660 3413 0.38 -0.9675840 High Colorless Slight_Imperfections
## 567 39617 0.35 -1.0498221 High Colorless Slight_Imperfections
## 127 22686 0.40 -0.9162907 High Colorless Slight_Imperfections
## 389 25193 2.61 0.9593502 Medium Near_Colorless Slight_Imperfections
## 534 21281 2.03 0.7080358 High Near_Colorless Slight_Imperfections
## 130 36708 0.46 -0.7765288 High Near_Colorless No_Imperfections
## 83 46820 0.72 -0.3285041 High Colorless Slight_Imperfections
## Price PriceLOG Length Width Depth Cut_QualityF ColorF
## 303 3787 8.239329 6.74 6.63 4.15 High Colorless
## 395 625 6.437752 4.91 4.98 2.90 Medium Colorless
## 19 434 6.073045 4.40 4.43 2.75 High Colorless
## 660 568 6.342121 4.69 4.74 2.87 High Colorless
## 567 491 6.196444 4.52 4.56 2.77 High Colorless
## 127 629 6.444131 4.75 4.78 2.96 High Colorless
## 389 13784 9.531264 8.66 8.57 5.53 Medium Near_Colorless
## 534 9435 9.152181 8.15 8.12 4.91 High Near_Colorless
## 130 953 6.859615 4.93 4.97 3.06 High Near_Colorless
## 83 1808 7.499977 5.88 5.85 3.48 High Colorless
## ClarityF StdFittedValues StdResiduals CooksD
## 303 Slight_Imperfections 1.0211569 -2.892 0.005
## 395 Slight_Imperfections -0.8477112 -2.537 0.004
## 19 Slight_Imperfections -1.2127168 -2.531 0.004
## 660 Slight_Imperfections -0.9620159 -2.442 0.003
## 567 Slight_Imperfections -1.1081555 -2.438 0.004
## 127 Slight_Imperfections -0.8708662 -2.388 0.003
## 389 Slight_Imperfections 2.2080017 -2.384 0.007
## 534 Slight_Imperfections 1.8249550 -2.357 0.004
## 130 No_Imperfections -0.4638416 -2.354 0.006
## 83 Slight_Imperfections 0.1736484 -2.342 0.003
head(mydata4[order(-mydata4$StdResiduals),], 10)
## ID Carat CaratLOG Cut_Quality Color Clarity
## 793 8593 0.56 -0.579818495 Medium Colorless No_Imperfections
## 731 18986 1.01 0.009950331 High Near_Colorless Slight_Imperfections
## 624 21856 1.19 0.173953307 High Near_Colorless Slight_Imperfections
## 634 18305 1.02 0.019802627 High Near_Colorless Slight_Imperfections
## 20 20951 1.08 0.076961041 Medium Colorless Slight_Imperfections
## 992 25797 1.52 0.418710335 High Near_Colorless Slight_Imperfections
## 884 18210 1.03 0.029558802 High Near_Colorless Slight_Imperfections
## 310 13268 0.91 -0.094310679 Medium Near_Colorless Slight_Imperfections
## 453 20227 1.02 0.019802627 High Colorless Slight_Imperfections
## 78 32691 0.31 -1.171182982 Medium Near_Colorless Slight_Imperfections
## Price PriceLOG Length Width Depth Cut_QualityF ColorF
## 793 4440 8.398410 5.34 5.36 3.25 Medium Colorless
## 731 7814 8.963672 6.39 6.43 3.98 High Near_Colorless
## 624 9918 9.202107 6.78 6.81 4.24 High Near_Colorless
## 634 7440 8.914626 6.41 6.43 4.01 High Near_Colorless
## 20 9168 9.123474 6.61 6.67 4.00 Medium Colorless
## 992 14787 9.601504 7.39 7.43 4.53 High Near_Colorless
## 884 7391 8.908018 6.50 6.54 3.99 High Near_Colorless
## 310 5484 8.609590 6.13 6.17 3.83 Medium Near_Colorless
## 453 8658 9.066239 6.48 6.53 3.98 High Colorless
## 78 802 6.687109 4.29 4.27 2.70 Medium Near_Colorless
## ClarityF StdFittedValues StdResiduals CooksD
## 793 No_Imperfections 0.0128278 2.955 0.011
## 731 Slight_Imperfections 0.5844362 2.908 0.003
## 624 Slight_Imperfections 0.8758744 2.641 0.003
## 634 Slight_Imperfections 0.6019440 2.575 0.003
## 20 Slight_Imperfections 0.8306255 2.477 0.004
## 992 Slight_Imperfections 1.3108150 2.460 0.004
## 884 Slight_Imperfections 0.6192810 2.456 0.002
## 310 Slight_Imperfections 0.3356145 2.387 0.004
## 453 Slight_Imperfections 0.7925999 2.380 0.003
## 78 Slight_Imperfections -1.5780200 2.367 0.006
head(mydata4[order(-mydata4$CooksD),], 10)
## ID Carat CaratLOG Cut_Quality Color Clarity
## 260 30341 0.50 -0.69314718 Low Near_Colorless Has_Imperfections
## 281 15507 1.00 0.00000000 Low Colorless Slight_Imperfections
## 443 15234 1.00 0.00000000 Low Colorless Slight_Imperfections
## 752 16318 1.15 0.13976194 Low Near_Colorless Slight_Imperfections
## 975 19599 2.33 0.84586827 High Near_Colorless Has_Imperfections
## 48 2802 1.04 0.03922071 High Near_Colorless Has_Imperfections
## 609 47656 0.34 -1.07880966 High Colorless Has_Imperfections
## 793 8593 0.56 -0.57981850 Medium Colorless No_Imperfections
## 206 4702 1.13 0.12221763 High Near_Colorless Has_Imperfections
## 635 45426 0.51 -0.67334455 Low Colorless Slight_Imperfections
## Price PriceLOG Length Width Depth Cut_QualityF ColorF
## 260 727 6.588926 5.01 4.97 3.24 Low Near_Colorless
## 281 6210 8.733916 6.24 6.33 4.08 Low Colorless
## 443 6115 8.718500 6.26 6.21 4.04 Low Colorless
## 752 6530 8.784162 6.60 6.50 4.23 Low Near_Colorless
## 975 8220 9.014325 8.64 8.56 5.22 High Near_Colorless
## 48 3261 8.089789 6.47 6.45 3.98 High Near_Colorless
## 609 530 6.272877 4.52 4.50 2.77 High Colorless
## 793 4440 8.398410 5.34 5.36 3.25 Medium Colorless
## 206 3678 8.210124 6.65 6.69 4.15 High Near_Colorless
## 635 1668 7.419381 5.06 5.03 3.25 Low Colorless
## ClarityF StdFittedValues StdResiduals CooksD
## 260 Has_Imperfections -1.44326782 1.239 0.016
## 281 Slight_Imperfections 0.57448175 1.839 0.016
## 443 Slight_Imperfections 0.57448175 1.761 0.015
## 752 Slight_Imperfections 0.63218699 1.800 0.015
## 975 Has_Imperfections 1.47453725 -1.323 0.014
## 48 Has_Imperfections 0.04110027 1.296 0.013
## 609 Has_Imperfections -1.75501742 1.212 0.012
## 793 No_Imperfections 0.01282780 2.955 0.011
## 206 Has_Imperfections 0.18858829 1.155 0.010
## 635 Slight_Imperfections -0.62207179 1.254 0.008
Checking the relationship between the standardized residuals and standardized fitted values to see whether the assumption regarding heteroskedasticity is met.
scatterplot(y = mydata4$StdResiduals, x = mydata4$StdFittedValues,
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
boxplot = FALSE,
smooth = FALSE,
regLine = TRUE)
The above scatter plot doesn’t show clear signs of heteroskedasticity
being present in the model, however, to be confirm the Breusch Pagan
test will be performed to double-check this assumption.
ols_test_breusch_pagan(reg)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ------------------------------------
## Response : PriceLOG
## Variables: fitted values of PriceLOG
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 8.631458
## Prob > Chi2 = 0.00330407
The variance is not constant, therefore, we reject the null hypothesis, meaning there is heteroskedasticity present in the model. This implies that the lm robust function will have to be used in order to be able to interpret the results of this regression.
Using the VIF function to check for the assumption of multicolinearity.
vif(reg)
## GVIF Df GVIF^(1/(2*Df))
## CaratLOG 1.161688 1 1.077816
## Cut_QualityF 1.068949 2 1.016809
## ColorF 1.056859 1 1.028036
## ClarityF 1.161891 2 1.038225
The only numeric independent variable is carats, meaning the GVIF can be used directly, the number is well below the required 5 for detecting multicolinearity. Since, the other variables that have been used are categorical we check the √GVIF and can see that there are no issues with multicolinearity also with the remaining variables in the observed sample, as can be observed, all values are very close to 1.
reg2 <- lm(data = mydata4, PriceLOG ~ CaratLOG + Cut_QualityF + ColorF + ClarityF)
summary(lm_robust(data = mydata4, PriceLOG ~ CaratLOG + Cut_QualityF + ColorF + ClarityF, se_type = "HC1"))
##
## Call:
## lm_robust(formula = PriceLOG ~ CaratLOG + Cut_QualityF + ColorF +
## ClarityF, data = mydata4, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower
## (Intercept) 7.64285 0.04144 184.447 0.000e+00 7.56154
## CaratLOG 1.77945 0.01211 147.001 0.000e+00 1.75570
## Cut_QualityFMedium 0.09181 0.03462 2.652 8.137e-03 0.02387
## Cut_QualityFHigh 0.16082 0.03393 4.740 2.451e-06 0.09424
## ColorFColorless 0.18245 0.01288 14.161 1.419e-41 0.15717
## ClarityFSlight_Imperfections 0.56526 0.03543 15.956 3.766e-51 0.49574
## ClarityFNo_Imperfections 0.90644 0.03983 22.758 1.832e-92 0.82828
## CI Upper DF
## (Intercept) 7.7242 985
## CaratLOG 1.8032 985
## Cut_QualityFMedium 0.1598 985
## Cut_QualityFHigh 0.2274 985
## ColorFColorless 0.2077 985
## ClarityFSlight_Imperfections 0.6348 985
## ClarityFNo_Imperfections 0.9846 985
##
## Multiple R-squared: 0.9638 , Adjusted R-squared: 0.9636
## F-statistic: 4034 on 6 and 985 DF, p-value: < 2.2e-16
Explanations of the estimated regression coefficients:
While the intercept is very significant (p-value > 0.001), it cannot be explained since it would make no sense to interpret if a diamond has all values equal to 0 how much its price on the market would be.
If the carat of the diamond increases by 1% then the price of the diamond increases on average by approximately 178%, ceteris paribus (p < 0.001).
If the quality of cut improves from low to medium, then the price of the diamond increases on average by 9.1%, ceteris paribus (p < 0.01).
If the quality of cut improves from low to high, then the price of the diamond increases on average by 16.1%, ceteris paribus (p < 0.001).
If the color of a diamond improves from nearly colorless to colorless, then the price of the diamond increases on average by 18.3%, ceteris paribus (p < 0.001).
If the clarity of a diamond improves from having imperfections to only having slight imperfections then the price of the diamond increases on average by 56.5%, ceteris paribus (p < 0.001).
If the clarity of a diamond improves from having imperfections to only having slight imperfections then the price of the diamond increases on average by 90.6%, ceteris paribus (p < 0.001).
All of the estimated regression coefficients are significant at p-values < 0.01 (1%).
The multiple coefficient of determination also the adjusted R^2 has the value of 0.9636 meaning that 96.4% of the variability within the natural logarithm of the PriceLOG of diamonds can be explained by the explanatory variables of CaratLOG, Cut Quality, Color and Clarity. It is clear that the build model fits the data extremely well.
The multiple correlation coefficient also √(R^2) has the value of 0.9816, from this we can deduce that there is a strong linear relationship between the explained variable and the explanatory variables.
Multiple correlation coefficient: is calculated as the square root of R-squared, and it equals to 0.93. From this coefficient we can conclude, that the maximum degree of linear relationship between the variable LogFert, and independent variables is 0.93 - strong relationship.
The analysis of variance or the F-test statistic evaluates how good the regression model is where the hypotheses are:
H0: ρ2 = 0
H1: ρ2 > 0
Since the p-value of this is very low (p < 0.001) we can reject the null hypothesis and assume that the model is well structured and that there is a relationship between the dependent and at least one of the independent variables.