mydata <- read.table("C:/R/diamonds.csv", header=TRUE, sep=",", dec = ".")
head(mydata)
## X carat cut color clarity depth table price x y z
## 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
colnames(mydata) [1] <- "ID"
colnames(mydata) [2] <- "Carat"
colnames(mydata) [3] <- "Cut"
colnames(mydata) [4] <- "Colour"
colnames(mydata) [5] <- "Clarity"
colnames(mydata) [6] <- "Total depth perception"
colnames(mydata) [7] <- "Table"
colnames(mydata) [8] <- "Price"
colnames(mydata) [9] <- "Length"
colnames(mydata) [10] <- "Width"
colnames(mydata) [11] <- "Depth"
head(mydata)
## ID Carat Cut Colour Clarity Total depth perception Table Price Length
## 1 1 0.23 Ideal E SI2 61.5 55 326 3.95
## 2 2 0.21 Premium E SI1 59.8 61 326 3.89
## 3 3 0.23 Good E VS1 56.9 65 327 4.05
## 4 4 0.29 Premium I VS2 62.4 58 334 4.20
## 5 5 0.31 Good J SI2 63.3 58 335 4.34
## 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94
## Width Depth
## 1 3.98 2.43
## 2 3.84 2.31
## 3 4.07 2.31
## 4 4.23 2.63
## 5 4.35 2.75
## 6 3.96 2.48
Agrawal, S. (2017, May 25). Diamonds. Kaggle. https://www.kaggle.com/datasets/shivam2503/diamonds/data.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
mutate(
Cut = as.factor(Cut),
Colour = as.factor(Colour),
Clarity = as.factor(Clarity)
)
summary(mydata[ , -1])
## Carat Cut Colour Clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Ideal :21551 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Very Good:12082 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## Total depth perception Table Price Length
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## Width Depth
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
Mean: The mean length of a diamond is approximately 5.731 mm, and the mean width is approximately 5.735 mm.
3rd Quartile length: 75% of diamonds have up to 6.540 mm of length based on this sample.
Median width: If arranged in ascending order, the median petal width would be 5.710 mm.
The average price of diamonds in 2024 was $9.502 and the median was $4,500 (from the reference: https://www.diamondse.info/diamonds-price-index.asp). Has the average price of diamonds changed, since this data which was collected 8 years ago.
library(psych)
describe(mydata$Price)
## vars n mean sd median trimmed mad min max range skew
## X1 1 53940 3932.8 3989.44 2401 3158.99 2475.94 326 18823 18497 1.62
## kurtosis se
## X1 2.18 17.18
From this we can see that the average seems quite lower ($3932.8), but we need to check with a formal test. Also skew isn’t 0, therefore we can expect that the distribution is not normal, but we have to also check with a formal test.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes(x = Price)) +
geom_histogram(binwidth = 7500, fill = "green4", color = "black") +
ylab("Frequency") +
xlab("Price od diamonds in $")
Based on the histrogram the data seems asymmetrical to the left, we will check with shapiro test.
mydata_reduced <- sample(mydata$Price, size = 5000) #the sample data was too big, for shapiro test maximum is 5000, so I reduced the number of sample data
shapiro.test(mydata_reduced)
##
## Shapiro-Wilk normality test
##
## data: mydata_reduced
## W = 0.80328, p-value < 2.2e-16
We can reject H0 at p<0.001.
t.test(mydata_reduced,
mu = 9502,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata_reduced
## t = -100.82, df = 4999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 9502
## 95 percent confidence interval:
## 3802.591 4020.020
## sample estimates:
## mean of x
## 3911.306
We can reject H0 at p<0.001.
#install.packages("effectsize")
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cohens_d(mydata_reduced, mu = 9502)
## Cohen's d | 95% CI
## --------------------------
## -1.43 | [-1.47, -1.39]
##
## - Deviation from a difference of 9502.
interpret_cohens_d(1.36, rules = "sawilowsky2009")
## [1] "very large"
## (Rules: sawilowsky2009)
Conclusion: We can reject the null hypothesis at p < 0.001, that means that there is a difference in the average price of diamonds in 2024 compared to 8 years ago. The difference in distribution is very large (r = 1.36).
median(mydata_reduced)
## [1] 2394.5
The median seems lower than 4500, we need to check with a formal test.
wilcox.test(mydata_reduced,
mu = 4500,
correct = FALSE)
##
## Wilcoxon signed rank test
##
## data: mydata_reduced
## V = 4162930, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 4500
We can reject H0 at p<0.001.
library(effectsize)
effectsize(wilcox.test(mydata_reduced,
mu = 4500,
correct = FALSE))
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.33 | [-0.36, -0.31]
##
## - Deviation from a difference of 4500.
interpret_rank_biserial(0.31, rules = "funder2019")
## [1] "large"
## (Rules: funder2019)
Using the data frame, we found that there is a difference in the median in 2024 compared to 8 years ago (p < 0.001). The effect size is large (r = 0.31).
For hypothesis about the population arithmetic mean the assumptions are that variable is numeric and that variable on the population is normally distributed. For my data sample the variable is numeric, however the Shapiro-Wilk test showed that normality is violated. For this reason the non-parametric test is more suitable in this case.
Is there a correlation between the carat weight of a diamond and its price?
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(mydata[ , c(2,8)], smooth=FALSE)
From the scatterplot there seems to be a positive linear relationship. With the Pearson correlation test for two numeric variables we can check this.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ ,c(2,8)]),
type = "pearson")
## Carat Price
## Carat 1.00 0.92
## Price 0.92 1.00
##
## n= 53940
##
##
## P
## Carat Price
## Carat 0
## Price 0
There seems to be a linear relationship between the diamonds carat weight and price, which is positive and very strong (r=0.92). For this reason we reject H0 at p<0.001.
Is there an association between the cut quality and the colour grade of diamonds?
results <- chisq.test(mydata$Cut, mydata$Colour,
correct = TRUE)
results
##
## Pearson's Chi-squared test
##
## data: mydata$Cut and mydata$Colour
## X-squared = 310.32, df = 24, p-value < 2.2e-16
addmargins(results$observed)
## mydata$Colour
## mydata$Cut D E F G H I J Sum
## Fair 163 224 312 314 303 175 119 1610
## Good 662 933 909 871 702 522 307 4906
## Ideal 2834 3903 3826 4884 3115 2093 896 21551
## Premium 1603 2337 2331 2924 2360 1428 808 13791
## Very Good 1513 2400 2164 2299 1824 1204 678 12082
## Sum 6775 9797 9542 11292 8304 5422 2808 53940
round(results$expected, 2)
## mydata$Colour
## mydata$Cut D E F G H I J
## Fair 202.22 292.42 284.81 337.04 247.86 161.84 83.81
## Good 616.21 891.07 867.87 1027.04 755.27 493.15 255.40
## Ideal 2706.86 3914.26 3812.38 4511.57 3317.75 2166.29 1121.90
## Premium 1732.18 2504.83 2439.63 2887.06 2123.11 1386.26 717.93
## Very Good 1517.53 2194.43 2137.31 2529.29 1860.01 1214.47 628.96
round(results$res, 2)
## mydata$Colour
## mydata$Cut D E F G H I J
## Fair -2.76 -4.00 1.61 -1.26 3.50 1.03 3.84
## Good 1.84 1.40 1.40 -4.87 -1.94 1.30 3.23
## Ideal 2.44 -0.18 0.22 5.54 -3.52 -1.57 -6.74
## Premium -3.10 -3.35 -2.20 0.69 5.14 1.12 3.36
## Very Good -0.12 4.39 0.58 -4.58 -0.83 -0.30 1.96
Assumptions for the association between two categorical variables are that the observations are independent of each other and that all expected frequencies are greater than 5. Both assumptions are met in this data sample (expected frequencies table is < 5).
We can reject H0 at p<0.001.
Explanations for category ideal cut & D (best colour): 1. Observed frequency: the observed value for ideal cut & D (best colour) is 2834. 2. Expected frequency: if there would be no association, we would expect 2706.86 diamonds in the category completed ideal cut & D (best colour). 3. Standard residual: In the combination ideal cut & D (best colour) there is more than expected number of diamonds in this category at α = 0.05.
library(effectsize)
effectsize::cramers_v(mydata$Cut, mydata$Colour)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.04 | [0.03, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.04)
## [1] "tiny"
## (Rules: funder2019)
We found that there is an association between diamond cut quality and colour, the effect size of this association is tiny (r = 0.04).