library(readr)
library(ggplot2)
library(naniar)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
Importing the data set.
mydata <- read.csv("~/Desktop/IMB/2. SEMESTER/MULTIVARIATE ANALYSIS/HOMEWORKS/HW1/diamonds.csv",
header = TRUE,
sep = ",",
dec = ".")
head(mydata, 10)
## X carat cut color clarity depth table price x y z
## 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
Cleaning the data by selecting only the variables that hold potential for further analysis and renaming the columns for easier understanding.
mydata1 <- mydata[,c(-1,-4,-5,-6,-7)]
colnames(mydata1)<-c("Carat","Cut Quality","Price","Length","Width","Depth")
head(mydata1,10)
## Carat Cut Quality Price Length Width Depth
## 1 0.23 Ideal 326 3.95 3.98 2.43
## 2 0.21 Premium 326 3.89 3.84 2.31
## 3 0.23 Good 327 4.05 4.07 2.31
## 4 0.29 Premium 334 4.20 4.23 2.63
## 5 0.31 Good 335 4.34 4.35 2.75
## 6 0.24 Very Good 336 3.94 3.96 2.48
## 7 0.24 Very Good 336 3.95 3.98 2.47
## 8 0.26 Very Good 337 4.07 4.11 2.53
## 9 0.22 Fair 337 3.87 3.78 2.49
## 10 0.23 Very Good 338 4.00 4.05 2.39
The unit of observation in my sample is a single diamond. The original data set contains 53,940 units and 11 variables. For this analysis, however, I will use a random sample of 200 observations, and the number of variables will be reduced to 6.
The variables are the following:
Carat: Weight of the diamond in carats (1 carat equals 200 mg, or 0.2 g).
Cut Quality: The quality of the cut of the diamond in the order from best to worst: Ideal, Premium, Very Good, Good and Fair (ordinal variable).
Price: Selling price of the diamond in US dollars.
Length: Length of the diamond in millimeters.
Width: Width of the diamond in millimeters.
Depth: Depth of the diamond in millimeters.
The data were obtained from the Kaggle website; the author of the data set is Swati Khedekar. Retrieved January 3rd, 2022, from https://www.kaggle.com/datasets/swatikhedekar/price-prediction-of-diamond.
The idea behind the data analysis is to see how variables like carat and cut quality affect the price of the diamonds on the market.
Cleaning the data by removing all units of observation in which any numeric variable has a value of 0, since a value of 0 is not possible for these measurements.
mydata2 <- filter_if(mydata1, is.numeric, all_vars((.) != 0))
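The same filtering logic can be sketched in base R. The small data frame below is made up purely for illustration (its values are not taken from the diamonds data), and the names `toy`, `num_cols`, `keep` and `toy_clean` are hypothetical. A row is kept only if none of its numeric columns equals zero.

```r
# Toy data frame; values are invented for illustration only.
toy <- data.frame(
  Carat = c(0.30, 0.50, 0.70),
  Cut   = c("Good", "Ideal", "Fair"),
  Depth = c(2.6, 0, 3.4),
  stringsAsFactors = FALSE
)

# Identify the numeric columns, then keep rows where every one of them is non-zero.
num_cols  <- vapply(toy, is.numeric, logical(1))
keep      <- rowSums(toy[, num_cols, drop = FALSE] == 0) == 0
toy_clean <- toy[keep, ]

nrow(toy)        # rows before filtering
nrow(toy_clean)  # rows after filtering
```

Comparing the row counts before and after is a quick way to see how many observations the filter removed.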
Selecting 200 random observations from the data set for further analysis.
set.seed(7)
mydata3 <- sample_n(mydata2, 200)
head(mydata3,10)
## Carat Cut Quality Price Length Width Depth
## 1 1.02 Very Good 15306 6.36 6.46 4.02
## 2 0.92 Premium 3648 6.40 6.34 3.73
## 3 0.90 Fair 2438 5.92 5.87 3.81
## 4 0.32 Premium 720 4.37 4.35 2.74
## 5 1.01 Premium 6097 6.40 6.39 4.01
## 6 1.00 Good 4026 6.31 6.26 4.01
## 7 1.02 Very Good 3857 6.37 6.43 4.00
## 8 0.70 Very Good 2833 5.77 5.80 3.45
## 9 1.53 Fair 8996 7.60 7.51 4.36
## 10 0.30 Good 622 4.35 4.39 2.60
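Because the seed is fixed before sampling, re-running the sampling step should return the same rows (assuming the same R version and default sampling settings). A minimal base R sketch of this behaviour, with the indices 1 to 1000 standing in for the rows of the data set (a hypothetical stand-in, not the actual data):

```r
# Drawing with the same seed twice gives the same result.
set.seed(7)
first_draw <- sample(1:1000, 5)

set.seed(7)
second_draw <- sample(1:1000, 5)

identical(first_draw, second_draw)
```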
Describing the remaining 200 units using the summary function.
summary(mydata3[-2])
## Carat Price Length Width Depth
## Min. :0.2000 Min. : 367 Min. :3.810 Min. :3.780 Min. :2.240
## 1st Qu.:0.3675 1st Qu.: 827 1st Qu.:4.605 1st Qu.:4.615 1st Qu.:2.808
## Median :0.7150 Median : 2494 Median :5.735 Median :5.760 Median :3.570
## Mean :0.8420 Mean : 4268 Mean :5.805 Mean :5.812 Mean :3.572
## 3rd Qu.:1.1025 3rd Qu.: 5260 3rd Qu.:6.595 3rd Qu.:6.625 3rd Qu.:4.103
## Max. :2.5000 Max. :18787 Max. :8.870 Max. :8.810 Max. :5.220
The minimum carat value of the observed diamonds was 0.20 carats and the maximum was 2.50 carats, so the range of this variable was 2.30 carats. The median price was 2,494 USD, meaning that half of the diamonds in the sample were priced at or below 2,494 USD and the other half at or above it. The average price, however, was 4,268 USD, and the most expensive diamond cost 18,787 USD. A mean this far above the median points to a right-skewed price distribution. On average, a diamond measured about 5.81 millimeters in both length and width and 3.57 millimeters in depth.
sd(mydata3$Price)
## [1] 4555.857
mean(mydata3$Price)
## [1] 4267.95
CoefVariation <- sd(mydata3$Price)/mean(mydata3$Price)
print(CoefVariation)
## [1] 1.067458
The coefficient of variation for the price of diamonds in the sample is 1.067, or 106.7%, which means that the standard deviation is larger than the mean and the dispersion of prices around the mean is very high.
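As a sketch, the calculation can be wrapped in a small helper and checked against the standard deviation and mean printed above. The function name `coef_variation` is hypothetical and not part of any package.

```r
# Coefficient of variation: sample standard deviation divided by the mean.
coef_variation <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

# Check against the values reported above for Price.
cv_reported <- 4555.857 / 4267.95
round(cv_reported, 3)

# Simple check on a vector with mean 4 and standard deviation 2.
coef_variation(c(2, 4, 6))
```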
hist(mydata3$Carat,
xlab = "Carat",
ylab = "Frequency",
main = "Number of Diamonds",
breaks = seq(0,3,0.1))
The histogram shows the frequency of diamonds in each carat (weight) category, with bins 0.1 carats wide. The distribution is positively (right) skewed. The modal class can also be read from the histogram: it is the 0.3 to 0.4 carat category, with over 40 observations.
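The modal class can also be found numerically rather than read off the plot. The sketch below uses a few invented carat values (not the sample data) and the same bin edges as the histogram; `carat_toy`, `bins` and `counts` are hypothetical names.

```r
# Toy carat values; invented for illustration only.
carat_toy <- c(0.31, 0.35, 0.32, 0.52, 0.71, 1.01, 0.38, 1.20)

# Same bin edges as the histogram: 0 to 3 in steps of 0.1.
# hist() uses right-closed intervals by default, so cut() is set up the same way.
bins   <- cut(carat_toy, breaks = seq(0, 3, 0.1), right = TRUE, include.lowest = TRUE)
counts <- table(bins)

# Label of the bin with the highest count (the modal class).
names(counts)[which.max(counts)]
```

Applied to the sampled data, the same steps would be run on the Carat column instead of `carat_toy`.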
ggplot(mydata3, aes(x = Price)) +
geom_histogram(binwidth = 1000, colour="black", fill="white") +
facet_wrap(~`Cut Quality`, ncol = 5) +
ylab("Frequency") +
ggtitle("Number of Diamonds in Price Bracket by their Cut Quality")
The graphs above show the distribution of diamonds across Price brackets of 1,000 USD, separately for each Cut Quality category. Ideal appears to be the most common Cut Quality in the observed sample. All five Cut Quality distributions are right skewed, from which we can conclude that most diamonds lie in the lower Price ranges.
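Right skew can also be checked numerically instead of visually. The sketch below defines a moment-based skewness measure (a hypothetical helper, not a function from the packages loaded here) and applies it to invented prices; a positive value, or a mean above the median, indicates a longer right tail.

```r
# Moment-based sample skewness: positive values indicate a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy prices; invented for illustration only.
price_toy <- c(400, 500, 600, 700, 900, 1500, 5000)

skewness(price_toy) > 0
mean(price_toy) > median(price_toy)
```

To compare groups, the same helper could be applied per category, for example with `tapply()` on Price by Cut Quality.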
Below are two versions of a scatter plot. From the first scatter plot and its marginal box plots we can see a positive relationship between the Carat (weight) and Price of a diamond, which leads into the next part of the analysis (the regression). We can also identify outliers in both variables that might have to be removed before further analysis. The second scatter plot shows the relationship between Carat and Price while also indicating the Cut Quality of each diamond in the observed sample.
scatterplot(x=mydata3$Carat, y=mydata3$Price,
main = "Relation between Carats and Price",
ylab = "Price",
xlab = "Carat",
smooth = FALSE)
ggplot(data=mydata3, aes(x=Carat, y=Price, color=`Cut Quality`)) +
geom_point()+
xlab("Carats") +
ylab("Price")
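Creating a factor version of the variable Cut Quality (CutQualityF), with levels ordered from worst to best and Fair as the base category, so that the categories can be used in a regression. Note that the regression below is fitted on the original `Cut Quality` column; R converts it to a factor with alphabetically ordered levels, so Fair is the base category there as well.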
mydata3$CutQualityF <- factor(mydata3$`Cut Quality`,
levels=c("Fair","Good","Very Good","Premium","Ideal"),
labels=c(0,1,2,3,4))
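To see which category acts as the base, it helps to inspect the level order, since the first level is the one the default treatment coding in `lm()` uses as the baseline. A small sketch on invented values (`cut_toy` and `cut_f` are hypothetical names):

```r
# Toy cut-quality values; invented for illustration only.
cut_toy <- c("Ideal", "Fair", "Very Good", "Premium", "Good", "Ideal")

# Default conversion sorts levels alphabetically; the first level is the baseline.
levels(factor(cut_toy))

# Explicit ordering from worst to best, keeping the original labels.
cut_f <- factor(cut_toy,
                levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))
levels(cut_f)

# relevel() switches the baseline without reordering the other levels.
levels(relevel(cut_f, ref = "Ideal"))
```

Keeping the original labels instead of the codes 0 to 4 makes the coefficient names in the regression output easier to read.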
reg1 <- lm(Price ~ Carat + `Cut Quality`,
data = mydata3)
summary(reg1)
##
## Call:
## lm(formula = Price ~ Carat + `Cut Quality`, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8733.7 -910.8 27.1 557.2 9543.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4502.1 888.5 -5.067 9.38e-07 ***
## Carat 7857.6 261.2 30.080 < 2e-16 ***
## `Cut Quality`Good 1826.1 932.2 1.959 0.05155 .
## `Cut Quality`Ideal 2403.1 878.3 2.736 0.00679 **
## `Cut Quality`Premium 2058.9 881.8 2.335 0.02057 *
## `Cut Quality`Very Good 2250.4 882.6 2.550 0.01155 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1880 on 194 degrees of freedom
## Multiple R-squared: 0.834, Adjusted R-squared: 0.8297
## F-statistic: 194.9 on 5 and 194 DF, p-value: < 2.2e-16
The regression returned the following results. The Carat (weight) of a diamond has a statistically significant effect on price (p < 0.001). The coefficient of 7,857.6 means that each additional carat is associated with an increase of 7,857.6 USD in the market price of the diamond, holding all else (the quality of the cut) constant. The cut quality coefficients are interpreted relative to the base category, Fair. Very Good, Premium and Ideal cut quality each have a statistically significant effect on the price of the diamond at p < 0.05. The effect of Good cut quality is, strictly speaking, not statistically significant at the 5% level (p = 0.05155); however, it is very close and could therefore also be considered relevant. The results do not fully pass a check of practical significance, however: the Very Good cut quality (2,250 USD) should not add more value to a diamond than the Premium cut quality (2,059 USD), which is one grade better.
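As a worked illustration of how the coefficients combine, the sketch below computes the fitted price for a hypothetical 1.00-carat diamond, using only the estimates printed in the regression output. The vector `b` and the chosen diamond are assumptions made for this example.

```r
# Coefficients copied from the regression output above (Fair is the baseline).
b <- c(intercept = -4502.1, carat = 7857.6,
       good = 1826.1, ideal = 2403.1, premium = 2058.9, very_good = 2250.4)

# Fitted price for a hypothetical 1.00-carat diamond with an Ideal cut.
price_ideal <- b[["intercept"]] + b[["carat"]] * 1.00 + b[["ideal"]]

# Same diamond with a Fair cut (no dummy term is added for the baseline).
price_fair <- b[["intercept"]] + b[["carat"]] * 1.00

price_ideal
price_fair
price_ideal - price_fair  # equals the Ideal coefficient
```

The difference between the two fitted prices is exactly the Ideal coefficient, which is what "relative to Fair, holding carat constant" means in practice.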