THE DATA SET

Importing the data set.

mydata <- read.csv("~/Desktop/IMB/2. SEMESTER/MULTIVARIATE ANALYSIS/HOMEWORKS/HW3/diamonds.csv",
          header = TRUE, 
          sep = ",", 
          dec = ".")

Cleaning the data set for better usability and understanding by removing the unneeded columns and renaming the remaining ones.

mydata1 <- mydata[,c(-6,-7)]
colnames(mydata1)<-c("ID", "Carat","Cut_Quality","Color","Clarity","Price","Length","Width","Depth")
head(mydata1,10)
##    ID Carat Cut_Quality Color Clarity Price Length Width Depth
## 1   1  0.23       Ideal     E     SI2   326   3.95  3.98  2.43
## 2   2  0.21     Premium     E     SI1   326   3.89  3.84  2.31
## 3   3  0.23        Good     E     VS1   327   4.05  4.07  2.31
## 4   4  0.29     Premium     I     VS2   334   4.20  4.23  2.63
## 5   5  0.31        Good     J     SI2   335   4.34  4.35  2.75
## 6   6  0.24   Very Good     J    VVS2   336   3.94  3.96  2.48
## 7   7  0.24   Very Good     I    VVS1   336   3.95  3.98  2.47
## 8   8  0.26   Very Good     H     SI1   337   4.07  4.11  2.53
## 9   9  0.22        Fair     E     VS2   337   3.87  3.78  2.49
## 10 10  0.23   Very Good     H     VS1   338   4.00  4.05  2.39

The unit of observation in my sample is a single diamond. The original size of the data set is 53940 units with 11 variables. However, for the purpose of this analysis I will be using a sample of 1000 randomly selected observations from the data set.

The variables are the following:

The source of the above data was found on the Kaggle website, the author is Swati Khedekar. Retrieved January 3rd, 2023, from https://www.kaggle.com/datasets/swatikhedekar/price-prediction-of-diamond.

The main goal of the data analysis is to see how variables like carat, cut quality, color and clarity affect the price of the diamonds on the market.

DATA MANIPULATIONS

Removing all variables that have the value 0.

mydata2 <- filter_if(mydata1, is.numeric, all_vars((.) != 0))

This removed 20 units of observation from the data set.

Combining the categorical variables into larger groups for easier and clearer interpretation of regression results.

mydata2[mydata2 == 'Fair'] <- 'Low'
mydata2[mydata2 == 'Good'] <- 'Medium'
mydata2[mydata2 == 'Very Good'] <- 'Medium'
mydata2[mydata2 == 'Premium'] <- 'High'
mydata2[mydata2 == 'Ideal'] <- 'High'

mydata2[mydata2 == 'D'] <- 'Colorless'
mydata2[mydata2 == 'E'] <- 'Colorless'
mydata2[mydata2 == 'F'] <- 'Colorless'
mydata2[mydata2 == 'G'] <- 'Near_Colorless'
mydata2[mydata2 == 'H'] <- 'Near_Colorless'
mydata2[mydata2 == 'I'] <- 'Near_Colorless'
mydata2[mydata2 == 'J'] <- 'Near_Colorless'

mydata2[mydata2 == 'I1'] <- 'Has_Imperfections'
mydata2[mydata2 == 'SI2'] <- 'Slight_Imperfections'
mydata2[mydata2 == 'SI1'] <- 'Slight_Imperfections'
mydata2[mydata2 == 'VS2'] <- 'Slight_Imperfections'
mydata2[mydata2 == 'VS1'] <- 'Slight_Imperfections'
mydata2[mydata2 == 'VVS2'] <- 'No_Imperfections'
mydata2[mydata2 == 'VVS1'] <- 'No_Imperfections'
mydata2[mydata2 == 'IF'] <- 'No_Imperfections'

We are left with 8 larger groups that will be transformed into factors for more control over the included variables in the regression process.

Transforming the categorical variables into factor, to be able to control for the base variable comparison in the regression model.

mydata2$Cut_QualityF <- factor(mydata2$Cut_Quality,
                               levels = c('Low', 'Medium', 'High'),
                               labels = c('Low', 'Medium', 'High'))

mydata2$ColorF <- factor(mydata2$Color,
                               levels = c('Near_Colorless', 'Colorless'),
                               labels = c('Near_Colorless', 'Colorless'))

mydata2$ClarityF <- factor(mydata2$Clarity,
                               levels = c('Has_Imperfections', 'Slight_Imperfections', 'No_Imperfections'),
                               labels = c('Has_Imperfections', 'Slight_Imperfections', 'No_Imperfections'))

Selecting 1000 random observations from the data set for further analysis.

set.seed(7)
mydata3 <- sample_n(mydata2, 1000)
head(mydata3,10)
##       ID Carat Cut_Quality          Color              Clarity Price Length
## 1  26078  1.02      Medium      Colorless     No_Imperfections 15306   6.36
## 2   4574  0.92        High Near_Colorless Slight_Imperfections  3648   6.40
## 3  51926  0.90         Low Near_Colorless Slight_Imperfections  2438   5.92
## 4  30105  0.32        High Near_Colorless Slight_Imperfections   720   4.37
## 5  15179  1.01        High      Colorless Slight_Imperfections  6097   6.40
## 6   6316  1.00      Medium      Colorless Slight_Imperfections  4026   6.31
## 7   5555  1.02      Medium Near_Colorless Slight_Imperfections  3857   6.37
## 8    571  0.70      Medium      Colorless Slight_Imperfections  2833   5.77
## 9  20757  1.53         Low Near_Colorless Slight_Imperfections  8996   7.60
## 10 19357  0.30      Medium      Colorless     No_Imperfections   622   4.35
##    Width Depth Cut_QualityF         ColorF             ClarityF
## 1   6.46  4.02       Medium      Colorless     No_Imperfections
## 2   6.34  3.73         High Near_Colorless Slight_Imperfections
## 3   5.87  3.81          Low Near_Colorless Slight_Imperfections
## 4   4.35  2.74         High Near_Colorless Slight_Imperfections
## 5   6.39  4.01         High      Colorless Slight_Imperfections
## 6   6.26  4.01       Medium      Colorless Slight_Imperfections
## 7   6.43  4.00       Medium Near_Colorless Slight_Imperfections
## 8   5.80  3.45       Medium      Colorless Slight_Imperfections
## 9   7.51  4.36          Low Near_Colorless Slight_Imperfections
## 10  4.39  2.60       Medium      Colorless     No_Imperfections

DEFINING THE MULTIPLE REGRESSION MODEL

The price of a diamond (the independent variable) depends on a variety of factors (independent variables) the carat (weight), the quality of cut, the color of the diamond and the clarity of the observed diamond. The larger the diamond, the heavier it will be and therefore the price charge for it will be higher. The higher the cut quality, the more people will want the diamond and be prepared to pay a higher price for it. Furthermore, a diamond that is clearer and of a better color (colorless) will also have the same affect on the customers, thus the price that a diamond will be able to achieve will be higher. The main source that will back the theoretical explanation behind the regression idea is Days Jewelers official website. The information was retrieved on January 19th, 2023, from https://www.daysjewelers.com/education-center/diamonds/34-factors

GRAPHICAL REPRESENTATION OF THE RELATIONSHIPS BETWEEN THE ANALYZED VARIABLES

mydata3 %>% 
  ggplot() + 
  aes(x = Carat, y = Price) +
  geom_point(color = "#570861") +
  geom_smooth(method = "lm", formula = y ~ x, se = F, color = "#570861") +
  ggtitle('Scatterplot of Diamond Prices as a Function of Carats')

From the above scatter plot we can observe a positive relationship between the carat (weight) of a diamond and the price that a diamond will be able to sell for. Since the rest of the variables that affect the price of diamonds are categorical it would not make sense to present the relationship between them in the form of a scatter plot or scatter plot matrix. From the scatter plot there are signs that show a potential non-linear relationship. To solve this the natural logarithms of price and carats will be taken into the model.

mydata3$CaratLOG <- log(mydata3$Carat)
mydata3$PriceLOG <- log(mydata3$Price)
mydata3 <- mydata3[c(1,2,13,3,4,5,6,14,7,8,9,10,11,12)]
mydata3 %>% 
  ggplot() + 
  aes(y = PriceLOG, x = CaratLOG) +
  geom_point(color = "#570861") +
  geom_smooth(method = "lm", formula = y ~ x, se = F, color = "#570861") +
  ggtitle('Scatterplot of Diamond Prices as a Function of Carats')

reg <- lm(data = mydata3, PriceLOG ~ CaratLOG + Cut_QualityF + ColorF + ClarityF)
summary(reg)
## 
## Call:
## lm(formula = PriceLOG ~ CaratLOG + Cut_QualityF + ColorF + ClarityF, 
##     data = mydata3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62872 -0.13331 -0.01118  0.12620  0.87100 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   7.58326    0.05129 147.837  < 2e-16 ***
## CaratLOG                      1.78071    0.01153 154.478  < 2e-16 ***
## Cut_QualityFMedium            0.11963    0.03704   3.230  0.00128 ** 
## Cut_QualityFHigh              0.18331    0.03614   5.072 4.69e-07 ***
## ColorFColorless               0.19105    0.01308  14.610  < 2e-16 ***
## ClarityFSlight_Imperfections  0.59658    0.04569  13.058  < 2e-16 ***
## ClarityFNo_Imperfections      0.94663    0.04861  19.473  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2007 on 993 degrees of freedom
## Multiple R-squared:  0.9617, Adjusted R-squared:  0.9614 
## F-statistic:  4150 on 6 and 993 DF,  p-value: < 2.2e-16

Creating and checking the first regression model with all of the mentioned explanatory variables, all of the regression coefficient estimates are significant. However, before we can confirm that these estimates of coefficients are correct and begin to explain them, diagnostics have to be performed on the model.

DIAGNOSTICS OF THE INITIAL REGRESSION MODEL

Creating additional variables for standardized fitted values, standardized residuals, and Cooks distances that will be used to perform diagnostics on the first regression model.

mydata3$StdFittedValues <- scale(fitted.values(reg))

mydata3$StdResiduals <- round(rstandard(reg),3)

mydata3$CooksD <- round(cooks.distance(reg),3)

Creating a histogram, to check whether the distribution of the standardized residuals is normal and confirming this assumption using the Shapiro-Wilk normality test.

hist(mydata3$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals",
     col = "#570861")

shapiro.test(mydata3$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata3$StdResid
## W = 0.99557, p-value = 0.005485

Both the histogram as well as the Shapiro-Wilk test reject our null hypothesis, meaning the distribution of the standardized residuals is not normal. Furthermore, we can clearly see that there are outliers that spread way beyond the +/-3 standard deviations that will have to be removed. The normality test is not going to affect the results much, especially since the sample is large, the p-values will be low.

Before removing the outliers the check regarding Cooks distances is also done using a histogram where we can check for units that potentially have a high impact on the results.

hist(mydata3$CooksD, 
     xlab = "Cooks Distances", 
     ylab = "Frequency", 
     main = "Histogram of Cooks Distances",
     col = "#570861")

While theory stipulates that Cooks distances are normal (not a high impact unit) up to 1 there is a significant amount of space (units lie far from the majority of the others) between the units observed in the sample and we should therefore also remove those lying farther away to improve the regression model.

To find the specific values of where to cut the data and confirm the assumptions made above regarding the standardized residuals and Cooks distances the data is ordered according to the values of these variables.

head(mydata3[order(mydata3$StdResiduals),], 20)
##        ID Carat   CaratLOG Cut_Quality          Color              Clarity
## 842  4803  1.31  0.2700271        High Near_Colorless Slight_Imperfections
## 303  5236  1.16  0.1484200        High      Colorless Slight_Imperfections
## 850 49496  1.00  0.0000000         Low Near_Colorless Slight_Imperfections
## 395 19709  0.42 -0.8675006      Medium      Colorless Slight_Imperfections
## 19  28642  0.33 -1.1086626        High      Colorless Slight_Imperfections
## 660  3413  0.38 -0.9675840        High      Colorless Slight_Imperfections
## 567 39617  0.35 -1.0498221        High      Colorless Slight_Imperfections
## 127 22686  0.40 -0.9162907        High      Colorless Slight_Imperfections
## 389 25193  2.61  0.9593502      Medium Near_Colorless Slight_Imperfections
## 534 21281  2.03  0.7080358        High Near_Colorless Slight_Imperfections
## 130 36708  0.46 -0.7765288        High Near_Colorless     No_Imperfections
## 83  46820  0.72 -0.3285041        High      Colorless Slight_Imperfections
## 392 13127  1.21  0.1906204        High Near_Colorless     No_Imperfections
## 712 17974  1.73  0.5481214        High Near_Colorless Slight_Imperfections
## 458 49311  0.40 -0.9162907        High Near_Colorless Slight_Imperfections
## 757 15532  1.62  0.4824261      Medium Near_Colorless Slight_Imperfections
## 963 38301  0.37 -0.9942523        High Near_Colorless Slight_Imperfections
## 726 48920  0.70 -0.3566749      Medium Near_Colorless     No_Imperfections
## 804 42637  0.58 -0.5447272        High      Colorless Slight_Imperfections
## 226 18077  1.52  0.4187103         Low Near_Colorless     No_Imperfections
##     Price PriceLOG Length Width Depth Cut_QualityF         ColorF
## 842  3697 8.215277   7.06  7.01  4.20         High Near_Colorless
## 303  3787 8.239329   6.74  6.63  4.15         High      Colorless
## 850  2117 7.657755   6.13  5.86  4.23          Low Near_Colorless
## 395   625 6.437752   4.91  4.98  2.90       Medium      Colorless
## 19    434 6.073045   4.40  4.43  2.75         High      Colorless
## 660   568 6.342121   4.69  4.74  2.87         High      Colorless
## 567   491 6.196444   4.52  4.56  2.77         High      Colorless
## 127   629 6.444131   4.75  4.78  2.96         High      Colorless
## 389 13784 9.531264   8.66  8.57  5.53       Medium Near_Colorless
## 534  9435 9.152181   8.15  8.12  4.91         High Near_Colorless
## 130   953 6.859615   4.93  4.97  3.06         High Near_Colorless
## 83   1808 7.499977   5.88  5.85  3.48         High      Colorless
## 392  5431 8.599879   6.82  6.85  4.22         High Near_Colorless
## 712  7270 8.891512   7.64  7.54  4.64         High Near_Colorless
## 458   540 6.291569   4.75  4.77  2.96         High Near_Colorless
## 757  6223 8.736007   7.41  7.33  4.73       Medium Near_Colorless
## 963   487 6.188264   4.54  4.57  2.85         High Near_Colorless
## 726  2042 7.621685   5.63  5.68  3.58       Medium Near_Colorless
## 804  1332 7.194437   5.35  5.32  3.31         High      Colorless
## 226  7303 8.896041   7.28  7.21  4.70          Low Near_Colorless
##                 ClarityF StdFittedValues StdResiduals CooksD
## 842 Slight_Imperfections       1.0466005       -3.138  0.005
## 303 Slight_Imperfections       1.0211569       -2.892  0.005
## 850 Slight_Imperfections       0.3838258       -2.644  0.033
## 395 Slight_Imperfections      -0.8477112       -2.537  0.004
## 19  Slight_Imperfections      -1.2127168       -2.531  0.004
## 660 Slight_Imperfections      -0.9620159       -2.442  0.003
## 567 Slight_Imperfections      -1.1081555       -2.438  0.004
## 127 Slight_Imperfections      -0.8708662       -2.388  0.003
## 389 Slight_Imperfections       2.2080017       -2.384  0.007
## 534 Slight_Imperfections       1.8249550       -2.357  0.004
## 130     No_Imperfections      -0.4638416       -2.354  0.006
## 83  Slight_Imperfections       0.1736484       -2.342  0.003
## 392     No_Imperfections       1.2548116       -2.266  0.007
## 712 Slight_Imperfections       1.5407824       -2.236  0.003
## 458 Slight_Imperfections      -1.0615221       -2.196  0.003
## 757 Slight_Imperfections       1.3604932       -2.111  0.004
## 963 Slight_Imperfections      -1.2000621       -2.019  0.003
## 726     No_Imperfections       0.2187047       -1.966  0.005
## 804 Slight_Imperfections      -0.2105865       -1.945  0.002
## 226     No_Imperfections       1.4772060       -1.930  0.022
head(mydata3[order(-mydata3$StdResiduals),], 20)
##        ID Carat     CaratLOG Cut_Quality          Color              Clarity
## 599 26999  1.02  0.019802627      Medium      Colorless     No_Imperfections
## 1   26078  1.02  0.019802627      Medium      Colorless     No_Imperfections
## 793  8593  0.56 -0.579818495      Medium      Colorless     No_Imperfections
## 731 18986  1.01  0.009950331        High Near_Colorless Slight_Imperfections
## 624 21856  1.19  0.173953307        High Near_Colorless Slight_Imperfections
## 634 18305  1.02  0.019802627        High Near_Colorless Slight_Imperfections
## 20  20951  1.08  0.076961041      Medium      Colorless Slight_Imperfections
## 992 25797  1.52  0.418710335        High Near_Colorless Slight_Imperfections
## 884 18210  1.03  0.029558802        High Near_Colorless Slight_Imperfections
## 310 13268  0.91 -0.094310679      Medium Near_Colorless Slight_Imperfections
## 453 20227  1.02  0.019802627        High      Colorless Slight_Imperfections
## 78  32691  0.31 -1.171182982      Medium Near_Colorless Slight_Imperfections
## 198 26507  1.51  0.412109651      Medium      Colorless Slight_Imperfections
## 122 27347  1.74  0.553885113        High Near_Colorless Slight_Imperfections
## 513 16564  1.01  0.009950331        High Near_Colorless Slight_Imperfections
## 590 32678  0.31 -1.171182982        High Near_Colorless Slight_Imperfections
## 849 26167  1.21  0.190620360        High      Colorless     No_Imperfections
## 252 35718  0.30 -1.203972804        High      Colorless Slight_Imperfections
## 543 35710  0.30 -1.203972804        High      Colorless Slight_Imperfections
## 153 22475  1.01  0.009950331      Medium      Colorless     No_Imperfections
##     Price PriceLOG Length Width Depth Cut_QualityF         ColorF
## 599 17100 9.746834   6.42  6.52  3.99       Medium      Colorless
## 1   15306 9.636000   6.36  6.46  4.02       Medium      Colorless
## 793  4440 8.398410   5.34  5.36  3.25       Medium      Colorless
## 731  7814 8.963672   6.39  6.43  3.98         High Near_Colorless
## 624  9918 9.202107   6.78  6.81  4.24         High Near_Colorless
## 634  7440 8.914626   6.41  6.43  4.01         High Near_Colorless
## 20   9168 9.123474   6.61  6.67  4.00       Medium      Colorless
## 992 14787 9.601504   7.39  7.43  4.53         High Near_Colorless
## 884  7391 8.908018   6.50  6.54  3.99         High Near_Colorless
## 310  5484 8.609590   6.13  6.17  3.83       Medium Near_Colorless
## 453  8658 9.066239   6.48  6.53  3.98         High      Colorless
## 78    802 6.687109   4.29  4.27  2.70       Medium Near_Colorless
## 198 16129 9.688374   7.48  7.54  4.47       Medium      Colorless
## 122 17904 9.792779   7.78  7.69  4.77         High Near_Colorless
## 513  6618 8.797548   6.42  6.38  3.99         High Near_Colorless
## 590   802 6.687109   4.36  4.32  2.70         High Near_Colorless
## 849 15497 9.648402   6.75  6.84  4.27         High      Colorless
## 252   911 6.814543   4.34  4.32  2.64         High      Colorless
## 543   911 6.814543   4.38  4.33  2.65         High      Colorless
## 153 10499 9.259035   6.49  6.58  3.91       Medium      Colorless
##                 ClarityF StdFittedValues StdResiduals CooksD
## 599     No_Imperfections       1.0783726        4.364  0.031
## 1       No_Imperfections       1.0783726        3.809  0.024
## 793     No_Imperfections       0.0128278        2.955  0.011
## 731 Slight_Imperfections       0.5844362        2.908  0.003
## 624 Slight_Imperfections       0.8758744        2.641  0.003
## 634 Slight_Imperfections       0.6019440        2.575  0.003
## 20  Slight_Imperfections       0.8306255        2.477  0.004
## 992 Slight_Imperfections       1.3108150        2.460  0.004
## 884 Slight_Imperfections       0.6192810        2.456  0.002
## 310 Slight_Imperfections       0.3356145        2.387  0.004
## 453 Slight_Imperfections       0.7925999        2.380  0.003
## 78  Slight_Imperfections      -1.5780200        2.367  0.006
## 198 Slight_Imperfections       1.4261946        2.319  0.005
## 122 Slight_Imperfections       1.5510246        2.214  0.003
## 513 Slight_Imperfections       0.5844362        2.079  0.002
## 590 Slight_Imperfections      -1.5144733        2.046  0.003
## 849     No_Imperfections       1.4454675        2.028  0.007
## 252 Slight_Imperfections      -1.3820859        2.019  0.003
## 543 Slight_Imperfections      -1.3820859        2.019  0.003
## 153     No_Imperfections       1.0608648        2.008  0.007
head(mydata3[order(-mydata3$CooksD),], 20)
##        ID Carat     CaratLOG Cut_Quality          Color              Clarity
## 827 13758  2.22  0.797507196         Low Near_Colorless    Has_Imperfections
## 850 49496  1.00  0.000000000         Low Near_Colorless Slight_Imperfections
## 599 26999  1.02  0.019802627      Medium      Colorless     No_Imperfections
## 115 19922  2.50  0.916290732        High Near_Colorless    Has_Imperfections
## 1   26078  1.02  0.019802627      Medium      Colorless     No_Imperfections
## 226 18077  1.52  0.418710335         Low Near_Colorless     No_Imperfections
## 698 15981  1.00  0.000000000         Low      Colorless Slight_Imperfections
## 260 30341  0.50 -0.693147181         Low Near_Colorless    Has_Imperfections
## 281 15507  1.00  0.000000000         Low      Colorless Slight_Imperfections
## 443 15234  1.00  0.000000000         Low      Colorless Slight_Imperfections
## 752 16318  1.15  0.139761942         Low Near_Colorless Slight_Imperfections
## 975 19599  2.33  0.845868268        High Near_Colorless    Has_Imperfections
## 48   2802  1.04  0.039220713        High Near_Colorless    Has_Imperfections
## 609 47656  0.34 -1.078809661        High      Colorless    Has_Imperfections
## 793  8593  0.56 -0.579818495      Medium      Colorless     No_Imperfections
## 206  4702  1.13  0.122217633        High Near_Colorless    Has_Imperfections
## 635 45426  0.51 -0.673344553         Low      Colorless Slight_Imperfections
## 153 22475  1.01  0.009950331      Medium      Colorless     No_Imperfections
## 389 25193  2.61  0.959350221      Medium Near_Colorless Slight_Imperfections
## 392 13127  1.21  0.190620360        High Near_Colorless     No_Imperfections
##     Price PriceLOG Length Width Depth Cut_QualityF         ColorF
## 827  5607 8.631771   8.04  8.02  5.36          Low Near_Colorless
## 850  2117 7.657755   6.13  5.86  4.23          Low Near_Colorless
## 599 17100 9.746834   6.42  6.52  3.99       Medium      Colorless
## 115  8467 9.043932   8.75  8.67  5.22         High Near_Colorless
## 1   15306 9.636000   6.36  6.46  4.02       Medium      Colorless
## 226  7303 8.896041   7.28  7.21  4.70          Low Near_Colorless
## 698  6397 8.763584   6.25  6.21  4.07          Low      Colorless
## 260   727 6.588926   5.01  4.97  3.24          Low Near_Colorless
## 281  6210 8.733916   6.24  6.33  4.08          Low      Colorless
## 443  6115 8.718500   6.26  6.21  4.04          Low      Colorless
## 752  6530 8.784162   6.60  6.50  4.23          Low Near_Colorless
## 975  8220 9.014325   8.64  8.56  5.22         High Near_Colorless
## 48   3261 8.089789   6.47  6.45  3.98         High Near_Colorless
## 609   530 6.272877   4.52  4.50  2.77         High      Colorless
## 793  4440 8.398410   5.34  5.36  3.25       Medium      Colorless
## 206  3678 8.210124   6.65  6.69  4.15         High Near_Colorless
## 635  1668 7.419381   5.06  5.03  3.25          Low      Colorless
## 153 10499 9.259035   6.49  6.58  3.91       Medium      Colorless
## 389 13784 9.531264   8.66  8.57  5.53       Medium Near_Colorless
## 392  5431 8.599879   6.82  6.85  4.22         High Near_Colorless
##                 ClarityF StdFittedValues StdResiduals CooksD
## 827    Has_Imperfections      1.20566987       -1.915  0.036
## 850 Slight_Imperfections      0.38382581       -2.644  0.033
## 599     No_Imperfections      1.07837258        4.364  0.031
## 115    Has_Imperfections      1.59968009       -1.813  0.026
## 1       No_Imperfections      1.07837258        3.809  0.024
## 226     No_Imperfections      1.47720602       -1.930  0.022
## 698 Slight_Imperfections      0.57448175        1.989  0.019
## 260    Has_Imperfections     -1.44326782        1.239  0.016
## 281 Slight_Imperfections      0.57448175        1.839  0.016
## 443 Slight_Imperfections      0.57448175        1.761  0.015
## 752 Slight_Imperfections      0.63218699        1.800  0.015
## 975    Has_Imperfections      1.47453725       -1.323  0.014
## 48     Has_Imperfections      0.04110027        1.296  0.013
## 609    Has_Imperfections     -1.75501742        1.212  0.012
## 793     No_Imperfections      0.01282780        2.955  0.011
## 206    Has_Imperfections      0.18858829        1.155  0.010
## 635 Slight_Imperfections     -0.62207179        1.254  0.008
## 153     No_Imperfections      1.06086476        2.008  0.007
## 389 Slight_Imperfections      2.20800166       -2.384  0.007
## 392     No_Imperfections      1.25481157       -2.266  0.007

In the below steps the data is cleaned of the values lying above the +/- 3 standard deviations for standardized residuals and all units above the value of 0.016 in Cooks distances.

mydata4 <- mydata3[!(mydata3$StdResiduals >= 3.0),]
mydata4 <- mydata4[!(mydata4$StdResiduals <= -3.0),]
mydata4 <- mydata4[!(mydata4$CooksD > 0.016),]

Showing that the data has been cleaned of these units.

hist(mydata4$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals",
     col = "#570861")

shapiro.test(mydata3$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata3$StdResid
## W = 0.99557, p-value = 0.005485
hist(mydata4$CooksD, 
     xlab = "Cooks Distances", 
     ylab = "Frequency", 
     main = "Histogram of Cooks Distances",
     col = "#570861")

head(mydata4[order(mydata4$StdResiduals),], 10)
##        ID Carat   CaratLOG Cut_Quality          Color              Clarity
## 303  5236  1.16  0.1484200        High      Colorless Slight_Imperfections
## 395 19709  0.42 -0.8675006      Medium      Colorless Slight_Imperfections
## 19  28642  0.33 -1.1086626        High      Colorless Slight_Imperfections
## 660  3413  0.38 -0.9675840        High      Colorless Slight_Imperfections
## 567 39617  0.35 -1.0498221        High      Colorless Slight_Imperfections
## 127 22686  0.40 -0.9162907        High      Colorless Slight_Imperfections
## 389 25193  2.61  0.9593502      Medium Near_Colorless Slight_Imperfections
## 534 21281  2.03  0.7080358        High Near_Colorless Slight_Imperfections
## 130 36708  0.46 -0.7765288        High Near_Colorless     No_Imperfections
## 83  46820  0.72 -0.3285041        High      Colorless Slight_Imperfections
##     Price PriceLOG Length Width Depth Cut_QualityF         ColorF
## 303  3787 8.239329   6.74  6.63  4.15         High      Colorless
## 395   625 6.437752   4.91  4.98  2.90       Medium      Colorless
## 19    434 6.073045   4.40  4.43  2.75         High      Colorless
## 660   568 6.342121   4.69  4.74  2.87         High      Colorless
## 567   491 6.196444   4.52  4.56  2.77         High      Colorless
## 127   629 6.444131   4.75  4.78  2.96         High      Colorless
## 389 13784 9.531264   8.66  8.57  5.53       Medium Near_Colorless
## 534  9435 9.152181   8.15  8.12  4.91         High Near_Colorless
## 130   953 6.859615   4.93  4.97  3.06         High Near_Colorless
## 83   1808 7.499977   5.88  5.85  3.48         High      Colorless
##                 ClarityF StdFittedValues StdResiduals CooksD
## 303 Slight_Imperfections       1.0211569       -2.892  0.005
## 395 Slight_Imperfections      -0.8477112       -2.537  0.004
## 19  Slight_Imperfections      -1.2127168       -2.531  0.004
## 660 Slight_Imperfections      -0.9620159       -2.442  0.003
## 567 Slight_Imperfections      -1.1081555       -2.438  0.004
## 127 Slight_Imperfections      -0.8708662       -2.388  0.003
## 389 Slight_Imperfections       2.2080017       -2.384  0.007
## 534 Slight_Imperfections       1.8249550       -2.357  0.004
## 130     No_Imperfections      -0.4638416       -2.354  0.006
## 83  Slight_Imperfections       0.1736484       -2.342  0.003
head(mydata4[order(-mydata4$StdResiduals),], 10)
##        ID Carat     CaratLOG Cut_Quality          Color              Clarity
## 793  8593  0.56 -0.579818495      Medium      Colorless     No_Imperfections
## 731 18986  1.01  0.009950331        High Near_Colorless Slight_Imperfections
## 624 21856  1.19  0.173953307        High Near_Colorless Slight_Imperfections
## 634 18305  1.02  0.019802627        High Near_Colorless Slight_Imperfections
## 20  20951  1.08  0.076961041      Medium      Colorless Slight_Imperfections
## 992 25797  1.52  0.418710335        High Near_Colorless Slight_Imperfections
## 884 18210  1.03  0.029558802        High Near_Colorless Slight_Imperfections
## 310 13268  0.91 -0.094310679      Medium Near_Colorless Slight_Imperfections
## 453 20227  1.02  0.019802627        High      Colorless Slight_Imperfections
## 78  32691  0.31 -1.171182982      Medium Near_Colorless Slight_Imperfections
##     Price PriceLOG Length Width Depth Cut_QualityF         ColorF
## 793  4440 8.398410   5.34  5.36  3.25       Medium      Colorless
## 731  7814 8.963672   6.39  6.43  3.98         High Near_Colorless
## 624  9918 9.202107   6.78  6.81  4.24         High Near_Colorless
## 634  7440 8.914626   6.41  6.43  4.01         High Near_Colorless
## 20   9168 9.123474   6.61  6.67  4.00       Medium      Colorless
## 992 14787 9.601504   7.39  7.43  4.53         High Near_Colorless
## 884  7391 8.908018   6.50  6.54  3.99         High Near_Colorless
## 310  5484 8.609590   6.13  6.17  3.83       Medium Near_Colorless
## 453  8658 9.066239   6.48  6.53  3.98         High      Colorless
## 78    802 6.687109   4.29  4.27  2.70       Medium Near_Colorless
##                 ClarityF StdFittedValues StdResiduals CooksD
## 793     No_Imperfections       0.0128278        2.955  0.011
## 731 Slight_Imperfections       0.5844362        2.908  0.003
## 624 Slight_Imperfections       0.8758744        2.641  0.003
## 634 Slight_Imperfections       0.6019440        2.575  0.003
## 20  Slight_Imperfections       0.8306255        2.477  0.004
## 992 Slight_Imperfections       1.3108150        2.460  0.004
## 884 Slight_Imperfections       0.6192810        2.456  0.002
## 310 Slight_Imperfections       0.3356145        2.387  0.004
## 453 Slight_Imperfections       0.7925999        2.380  0.003
## 78  Slight_Imperfections      -1.5780200        2.367  0.006
head(mydata4[order(-mydata4$CooksD),], 10)
##        ID Carat    CaratLOG Cut_Quality          Color              Clarity
## 260 30341  0.50 -0.69314718         Low Near_Colorless    Has_Imperfections
## 281 15507  1.00  0.00000000         Low      Colorless Slight_Imperfections
## 443 15234  1.00  0.00000000         Low      Colorless Slight_Imperfections
## 752 16318  1.15  0.13976194         Low Near_Colorless Slight_Imperfections
## 975 19599  2.33  0.84586827        High Near_Colorless    Has_Imperfections
## 48   2802  1.04  0.03922071        High Near_Colorless    Has_Imperfections
## 609 47656  0.34 -1.07880966        High      Colorless    Has_Imperfections
## 793  8593  0.56 -0.57981850      Medium      Colorless     No_Imperfections
## 206  4702  1.13  0.12221763        High Near_Colorless    Has_Imperfections
## 635 45426  0.51 -0.67334455         Low      Colorless Slight_Imperfections
##     Price PriceLOG Length Width Depth Cut_QualityF         ColorF
## 260   727 6.588926   5.01  4.97  3.24          Low Near_Colorless
## 281  6210 8.733916   6.24  6.33  4.08          Low      Colorless
## 443  6115 8.718500   6.26  6.21  4.04          Low      Colorless
## 752  6530 8.784162   6.60  6.50  4.23          Low Near_Colorless
## 975  8220 9.014325   8.64  8.56  5.22         High Near_Colorless
## 48   3261 8.089789   6.47  6.45  3.98         High Near_Colorless
## 609   530 6.272877   4.52  4.50  2.77         High      Colorless
## 793  4440 8.398410   5.34  5.36  3.25       Medium      Colorless
## 206  3678 8.210124   6.65  6.69  4.15         High Near_Colorless
## 635  1668 7.419381   5.06  5.03  3.25          Low      Colorless
##                 ClarityF StdFittedValues StdResiduals CooksD
## 260    Has_Imperfections     -1.44326782        1.239  0.016
## 281 Slight_Imperfections      0.57448175        1.839  0.016
## 443 Slight_Imperfections      0.57448175        1.761  0.015
## 752 Slight_Imperfections      0.63218699        1.800  0.015
## 975    Has_Imperfections      1.47453725       -1.323  0.014
## 48     Has_Imperfections      0.04110027        1.296  0.013
## 609    Has_Imperfections     -1.75501742        1.212  0.012
## 793     No_Imperfections      0.01282780        2.955  0.011
## 206    Has_Imperfections      0.18858829        1.155  0.010
## 635 Slight_Imperfections     -0.62207179        1.254  0.008

Checking the relationship between the standardized residuals and standardized fitted values to see whether the assumption regarding heteroskedasticity is met.

scatterplot(y = mydata4$StdResiduals, x = mydata4$StdFittedValues,
            xlab = "Standardized Fitted Values",
            ylab = "Standardized Residuals",
            boxplot = FALSE,
            smooth = FALSE,
            regLine = TRUE)

The above scatter plot doesn’t show clear signs of heteroskedasticity being present in the model, however, to be confirm the Breusch Pagan test will be performed to double-check this assumption.

ols_test_breusch_pagan(reg)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                 Data                 
##  ------------------------------------
##  Response : PriceLOG 
##  Variables: fitted values of PriceLOG 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    8.631458 
##  Prob > Chi2   =    0.00330407

The variance is not constant, therefore, we reject the null hypothesis, meaning there is heteroskedasticity present in the model. This implies that the lm robust function will have to be used in order to be able to interpret the results of this regression.

Using the VIF function to check for the assumption of multicolinearity.

vif(reg)
##                  GVIF Df GVIF^(1/(2*Df))
## CaratLOG     1.161688  1        1.077816
## Cut_QualityF 1.068949  2        1.016809
## ColorF       1.056859  1        1.028036
## ClarityF     1.161891  2        1.038225

The only numeric independent variable is carats, meaning the GVIF can be used directly, the number is well below the required 5 for detecting multicolinearity. Since, the other variables that have been used are categorical we check the √GVIF and can see that there are no issues with multicolinearity also with the remaining variables in the observed sample, as can be observed, all values are very close to 1.

RE-ESTIMATING THE REGRESSION MODEL

reg2 <- lm(data = mydata4, PriceLOG ~ CaratLOG + Cut_QualityF + ColorF + ClarityF)
summary(lm_robust(data = mydata4, PriceLOG ~ CaratLOG + Cut_QualityF + ColorF + ClarityF, se_type = "HC1"))
## 
## Call:
## lm_robust(formula = PriceLOG ~ CaratLOG + Cut_QualityF + ColorF + 
##     ClarityF, data = mydata4, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                              Estimate Std. Error t value  Pr(>|t|) CI Lower
## (Intercept)                   7.64285    0.04144 184.447 0.000e+00  7.56154
## CaratLOG                      1.77945    0.01211 147.001 0.000e+00  1.75570
## Cut_QualityFMedium            0.09181    0.03462   2.652 8.137e-03  0.02387
## Cut_QualityFHigh              0.16082    0.03393   4.740 2.451e-06  0.09424
## ColorFColorless               0.18245    0.01288  14.161 1.419e-41  0.15717
## ClarityFSlight_Imperfections  0.56526    0.03543  15.956 3.766e-51  0.49574
## ClarityFNo_Imperfections      0.90644    0.03983  22.758 1.832e-92  0.82828
##                              CI Upper  DF
## (Intercept)                    7.7242 985
## CaratLOG                       1.8032 985
## Cut_QualityFMedium             0.1598 985
## Cut_QualityFHigh               0.2274 985
## ColorFColorless                0.2077 985
## ClarityFSlight_Imperfections   0.6348 985
## ClarityFNo_Imperfections       0.9846 985
## 
## Multiple R-squared:  0.9638 ,    Adjusted R-squared:  0.9636 
## F-statistic:  4034 on 6 and 985 DF,  p-value: < 2.2e-16

EXPLAINING THE RE-ESTIMATED REGRESSION MODEL

Explanations of the estimated regression coefficients:

All of the estimated regression coefficients are significant at p-values < 0.01 (1%).

The multiple coefficient of determination also the adjusted R^2 has the value of 0.9636 meaning that 96.4% of the variability within the natural logarithm of the PriceLOG of diamonds can be explained by the explanatory variables of CaratLOG, Cut Quality, Color and Clarity. It is clear that the build model fits the data extremely well.

The multiple correlation coefficient also √(R^2) has the value of 0.9816, from this we can deduce that there is a strong linear relationship between the explained variable and the explanatory variables.

Multiple correlation coefficient: is calculated as the square root of R-squared, and it equals to 0.93. From this coefficient we can conclude, that the maximum degree of linear relationship between the variable LogFert, and independent variables is 0.93 - strong relationship.

The analysis of variance or the F-test statistic evaluates how good the regression model is where the hypotheses are:

H0: ρ2 = 0

H1: ρ2 > 0

Since the p-value of this is very low (p < 0.001) we can reject the null hypothesis and assume that the model is well structured and that there is a relationship between the dependent and at least one of the independent variables.