library(readr)
library(ggplot2)
library(naniar)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode

Importing the data set.

mydata <- read.csv("~/Desktop/IMB/2. SEMESTER/MULTIVARIATE ANALYSIS/HOMEWORKS/HW1/diamonds.csv",
          header = TRUE, 
          sep = ",", 
          dec = ".")
head(mydata, 10)
##     X carat       cut color clarity depth table price    x    y    z
## 1   1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2   2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3   3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4   4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5   5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6   6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
## 7   7  0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47
## 8   8  0.26 Very Good     H     SI1  61.9    55   337 4.07 4.11 2.53
## 9   9  0.22      Fair     E     VS2  65.1    61   337 3.87 3.78 2.49
## 10 10  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05 2.39

Cleaning the data by selecting only the variables that hold potential for further analysis and renaming the columns for easier understanding.

mydata1 <- mydata[,c(-1,-4,-5,-6,-7)]
colnames(mydata1)<-c("Carat","Cut Quality","Price","Length","Width","Depth")
head(mydata1,10)
##    Carat Cut Quality Price Length Width Depth
## 1   0.23       Ideal   326   3.95  3.98  2.43
## 2   0.21     Premium   326   3.89  3.84  2.31
## 3   0.23        Good   327   4.05  4.07  2.31
## 4   0.29     Premium   334   4.20  4.23  2.63
## 5   0.31        Good   335   4.34  4.35  2.75
## 6   0.24   Very Good   336   3.94  3.96  2.48
## 7   0.24   Very Good   336   3.95  3.98  2.47
## 8   0.26   Very Good   337   4.07  4.11  2.53
## 9   0.22        Fair   337   3.87  3.78  2.49
## 10  0.23   Very Good   338   4.00  4.05  2.39

The unit of observation in my sample is a single diamond. The original data set contains 53940 units and 11 variables. However, for the purpose of this analysis I will be using a sample of 200 randomly selected observations from the data set. The number of variables will also be reduced to 6.

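The variables are the following:

Carat: weight of the diamond in carats
Cut Quality: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
Price: price of the diamond in USD
Length: length of the diamond in millimeters (x in the original data)
Width: width of the diamond in millimeters (y in the original data)
Depth: depth of the diamond in millimeters (z in the original data)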

The data come from the Kaggle website; the author of the data set is Swati Khedekar. Retrieved January 3rd, 2022, from https://www.kaggle.com/datasets/swatikhedekar/price-prediction-of-diamond.

The idea behind the data analysis is to see how variables like carat and cut quality affect the price of the diamonds on the market.

Cleaning the data by removing all units of observation that have a value of 0 in any numeric variable, which is not possible for a real diamond.

mydata2 <- filter_if(mydata1, is.numeric, all_vars((.) != 0))
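
The scoped verb `filter_if()` with `all_vars()` still works, but recent dplyr releases mark it as superseded. A minimal sketch of the same filter written with `if_all()`, run on a small made-up data frame (`toy` and its values are hypothetical and are not taken from the diamonds data):

```r
library(dplyr)

# Hypothetical toy data: the second row has a zero in a numeric column
toy <- data.frame(Carat = c(0.23, 0.31, 0.90),
                  Cut   = c("Ideal", "Good", "Fair"),
                  Depth = c(2.43, 0, 3.81))

# Keep only the rows in which every numeric column is non-zero
toy_clean <- filter(toy, if_all(where(is.numeric), ~ .x != 0))
print(toy_clean)
```

On the real data the same call would take `mydata1` in place of `toy`; this assumes a dplyr version that provides `if_all()`.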

Selecting 200 random observations from the data set for further analysis.

set.seed(7)
mydata3 <- sample_n(mydata2, 200)
head(mydata3,10)
##    Carat Cut Quality Price Length Width Depth
## 1   1.02   Very Good 15306   6.36  6.46  4.02
## 2   0.92     Premium  3648   6.40  6.34  3.73
## 3   0.90        Fair  2438   5.92  5.87  3.81
## 4   0.32     Premium   720   4.37  4.35  2.74
## 5   1.01     Premium  6097   6.40  6.39  4.01
## 6   1.00        Good  4026   6.31  6.26  4.01
## 7   1.02   Very Good  3857   6.37  6.43  4.00
## 8   0.70   Very Good  2833   5.77  5.80  3.45
## 9   1.53        Fair  8996   7.60  7.51  4.36
## 10  0.30        Good   622   4.35  4.39  2.60

Describing the remaining 200 units using the summary function.

summary(mydata3[, -2])  # drop the non-numeric Cut Quality column
##      Carat            Price           Length          Width           Depth      
##  Min.   :0.2000   Min.   :  367   Min.   :3.810   Min.   :3.780   Min.   :2.240  
##  1st Qu.:0.3675   1st Qu.:  827   1st Qu.:4.605   1st Qu.:4.615   1st Qu.:2.808  
##  Median :0.7150   Median : 2494   Median :5.735   Median :5.760   Median :3.570  
##  Mean   :0.8420   Mean   : 4268   Mean   :5.805   Mean   :5.812   Mean   :3.572  
##  3rd Qu.:1.1025   3rd Qu.: 5260   3rd Qu.:6.595   3rd Qu.:6.625   3rd Qu.:4.103  
##  Max.   :2.5000   Max.   :18787   Max.   :8.870   Max.   :8.810   Max.   :5.220

The minimum carat value among the observed diamonds was 0.20 carats and the maximum was 2.50 carats, so the range of this variable was 2.30 carats. The median price was 2,494 USD, meaning half of the diamonds in the sample were priced at or below 2,494 USD and the other half cost more. The average price, however, was 4,268 USD, well above the median, and the most expensive diamond cost over 18.5 thousand USD. We can also see that the average diamond measured about 5.81 millimeters in both length and width and 3.57 millimeters in depth.

sd(mydata3$Price)
## [1] 4555.857
mean(mydata3$Price)
## [1] 4267.95
CoefVariation <- sd(mydata3$Price)/mean(mydata3$Price)
print(CoefVariation)
## [1] 1.067458

The coefficient of variation of the diamond price in the sample is 1.067, or 106.7%. The standard deviation is therefore larger than the mean, which indicates that prices are very widely dispersed around the mean.
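
The calculation above can be wrapped in a small helper. This is only a sketch: the function name `coef_variation` and the three-value price vector are made up for illustration, while the standard deviation and mean are copied from the output above.

```r
# Coefficient of variation: sample standard deviation divided by the mean
coef_variation <- function(x) sd(x) / mean(x)

# Hypothetical prices: mean 1000, standard deviation 500
cv_toy <- coef_variation(c(500, 1000, 1500))
print(cv_toy)

# Values reported above for Price in the 200-diamond sample
cv_price <- 4555.857 / 4267.95
print(round(cv_price, 3))
```

On the real data, `coef_variation(mydata3$Price)` would give the same figure as `CoefVariation`.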

hist(mydata3$Carat,
     xlab = "Carat",
     ylab = "Frequency",
     main = "Number of Diamonds",
     breaks = seq(0,3,0.1))

The histogram shows the number of diamonds in each carat (weight) class, using classes that are 0.1 carats wide. The distribution is positively (right) skewed. The modal class can also be read from the histogram: it is the 0.3 to 0.4 carat class, with over 40 observations.
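
Instead of reading the modal class off the plot, it can also be computed. The sketch below uses `hist()` with `plot = FALSE` to obtain the bin counts without drawing anything; the `carat` vector is a hypothetical stand-in for `mydata3$Carat`.

```r
# Hypothetical carat values, not the actual sample
carat <- c(0.23, 0.31, 0.32, 0.35, 0.52, 0.71, 1.01, 1.53)

# Same 0.1-carat breaks as the histogram above, but without drawing it
h <- hist(carat, breaks = seq(0, 3, 0.1), plot = FALSE)

# Index of the bin with the highest count, with its bounds and count
i <- which.max(h$counts)
modal_bin <- c(lower = h$breaks[i], upper = h$breaks[i + 1], count = h$counts[i])
print(modal_bin)
```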

ggplot(mydata3, aes(x = Price)) +    
  geom_histogram(binwidth = 1000, colour="black", fill="white") + 
  facet_wrap(~ `Cut Quality`, ncol = 5) + 
  ylab("Frequency") +
  ggtitle("Number of Diamonds in Price Bracket by their Cut Quality")

The graphs above show the distribution of diamonds across price brackets of 1,000 USD, separately for each Cut Quality category. The largest group of diamonds in the observed sample is of Ideal Cut Quality. All five distributions are right skewed, so within every Cut Quality category most diamonds lie in the lower price ranges.
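
The number of diamonds per Cut Quality category can be checked directly with a frequency table. A minimal sketch on made-up labels (the `cut_quality` vector is hypothetical; on the real data it would be `mydata3$"Cut Quality"`):

```r
# Hypothetical cut-quality labels, not the actual sample
cut_quality <- c("Ideal", "Premium", "Ideal", "Good", "Very Good", "Ideal", "Fair")

# Absolute counts per category, sorted from most to least frequent
counts <- sort(table(cut_quality), decreasing = TRUE)
print(counts)

# The same counts expressed as shares of the total
shares <- round(prop.table(counts), 2)
print(shares)
```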

Below are two versions of a scatter plot. The first, which also shows box plots along the axes, suggests a positive relationship between the carats (weight) and the price of a diamond, which leads into the next part of the analysis (the regression). Both variables also show outliers that may have to be removed before further analysis. The second scatter plot shows the same relationship between carats and price, with each diamond colored by its Cut Quality.

scatterplot(x=mydata3$Carat, y=mydata3$Price,
        main = "Relation between Carats and Price",
        ylab = "Price",
        xlab = "Carat",
        smooth = FALSE)

ggplot(data=mydata3, aes(x=Carat, y=Price, color=`Cut Quality`)) + 
  geom_point()+ 
  xlab("Carats") + 
  ylab("Price")
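
The outliers mentioned above can also be flagged numerically. The sketch below applies a common 1.5 x IQR rule, similar to the one box plots use for their whiskers; the `price` vector is hypothetical, and whether flagged points should actually be removed remains a judgment call.

```r
# Hypothetical prices in USD, not the actual sample
price <- c(326, 720, 2438, 2833, 3648, 3857, 4026, 6097, 8996, 30000)

# First and third quartile, and the allowed distance beyond them
q <- unname(quantile(price, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])

# Flag values that lie more than 1.5 * IQR outside the quartiles
is_outlier <- price < q[1] - fence | price > q[2] + fence
print(price[is_outlier])
```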

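Creating a factor for the variable Cut Quality, with Fair as the base category, so that the categories can be used in the regression. Note that the model below is fitted on the original `Cut Quality` column rather than on CutQualityF. lm() converts that column to a factor with alphabetically ordered levels, so Fair is still the base category.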

mydata3$CutQualityF <- factor(mydata3$`Cut Quality`,
                              levels=c("Fair","Good","Very Good","Premium","Ideal"),
                              labels=c(0,1,2,3,4))
reg1 <- lm(Price ~ Carat + `Cut Quality`,
           data = mydata3)
summary(reg1)
## 
## Call:
## lm(formula = Price ~ Carat + `Cut Quality`, data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8733.7  -910.8    27.1   557.2  9543.0 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -4502.1      888.5  -5.067 9.38e-07 ***
## Carat                    7857.6      261.2  30.080  < 2e-16 ***
## `Cut Quality`Good        1826.1      932.2   1.959  0.05155 .  
## `Cut Quality`Ideal       2403.1      878.3   2.736  0.00679 ** 
## `Cut Quality`Premium     2058.9      881.8   2.335  0.02057 *  
## `Cut Quality`Very Good   2250.4      882.6   2.550  0.01155 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1880 on 194 degrees of freedom
## Multiple R-squared:  0.834,  Adjusted R-squared:  0.8297 
## F-statistic: 194.9 on 5 and 194 DF,  p-value: < 2.2e-16
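
The order of the cut quality rows in the coefficient table follows from how R orders factor levels when a character column is converted to a factor. A minimal sketch on a made-up vector (`cut_quality` and `cut_f` are hypothetical names):

```r
# Hypothetical labels, not the actual sample
cut_quality <- c("Ideal", "Fair", "Very Good", "Premium", "Good")

# Default conversion: levels are sorted alphabetically, so "Fair" comes first
default_levels <- levels(factor(cut_quality))
print(default_levels)

# Explicit levels: quality order from worst to best, "Fair" is still first
cut_f <- factor(cut_quality,
                levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))
print(levels(cut_f))
```

In both cases the first level acts as the base category in `lm()`; only the order of the remaining categories changes.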

The regression returned the following results. The carat (weight) of a diamond has a statistically significant effect on price at p < 0.001. The coefficient of 7,857.6 means that each additional carat increases the expected market price of a diamond by 7,857.6 USD, holding the quality of the cut constant. The cut quality coefficients are measured relative to the Fair cut, which is the base category. Very Good, Premium and Ideal cuts all have a statistically significant effect on price at p < 0.05. The Good cut is, strictly speaking, not statistically significant at the 5% level (p = 0.05155), but it is very close and could therefore still be considered relevant. The estimates do not, however, pass a check of practical significance: the Very Good cut (2,250 USD) should not add more value to a diamond than the Premium cut (2,059 USD), which is one grade better.
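
To make the coefficients more concrete, the fitted equation can be evaluated by hand. The sketch below copies the estimates from the summary(reg1) output and computes the predicted price of a 1-carat diamond; the variable names are made up for illustration.

```r
# Coefficients copied from the summary(reg1) output above
b_intercept <- -4502.1
b_carat     <- 7857.6
b_ideal     <- 2403.1   # Ideal cut relative to the Fair base category

# Predicted price of a 1-carat diamond with a Fair cut (base category)
price_fair <- b_intercept + b_carat * 1

# Predicted price of a 1-carat diamond with an Ideal cut
price_ideal <- price_fair + b_ideal

print(c(Fair = price_fair, Ideal = price_ideal))
```

This gives about 3,355.5 USD for the Fair cut and 5,758.6 USD for the Ideal cut. With the fitted model available, `predict(reg1, newdata = ...)` should return the same values up to rounding of the coefficients.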