mydata <- read.table("C:/R/diamonds.csv", header=TRUE, sep=",", dec = ".")

head(mydata)
##   X carat       cut color clarity depth table price    x    y    z
## 1 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
colnames(mydata) [1] <- "ID"
colnames(mydata) [2] <- "Carat"
colnames(mydata) [3] <- "Cut"
colnames(mydata) [4] <- "Colour"
colnames(mydata) [5] <- "Clarity"
colnames(mydata) [6] <- "Total depth perception"
colnames(mydata) [7] <- "Table"
colnames(mydata) [8] <- "Price"
colnames(mydata) [9] <- "Length"
colnames(mydata) [10] <- "Width"
colnames(mydata) [11] <- "Depth"

head(mydata)
##   ID Carat       Cut Colour Clarity Total depth perception Table Price Length
## 1  1  0.23     Ideal      E     SI2                   61.5    55   326   3.95
## 2  2  0.21   Premium      E     SI1                   59.8    61   326   3.89
## 3  3  0.23      Good      E     VS1                   56.9    65   327   4.05
## 4  4  0.29   Premium      I     VS2                   62.4    58   334   4.20
## 5  5  0.31      Good      J     SI2                   63.3    58   335   4.34
## 6  6  0.24 Very Good      J    VVS2                   62.8    57   336   3.94
##   Width Depth
## 1  3.98  2.43
## 2  3.84  2.31
## 3  4.07  2.31
## 4  4.23  2.63
## 5  4.35  2.75
## 6  3.96  2.48

Description

Definition of variables:

Source:

Agrawal, S. (2017, May 25). Diamonds. Kaggle. https://www.kaggle.com/datasets/shivam2503/diamonds/data.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>%
  mutate(
    Cut = as.factor(Cut),
    Colour = as.factor(Colour),
    Clarity = as.factor(Clarity)
  )

summary(mydata[ , -1])
##      Carat               Cut        Colour       Clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Ideal    :21551   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Very Good:12082   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##  Total depth perception     Table           Price           Length      
##  Min.   :43.00          Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00          1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80          Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75          Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50          3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00          Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                         
##      Width            Depth       
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 

Research question 1

The average price of diamonds in 2024 was $9.502 and the median was $4,500 (from the reference: https://www.diamondse.info/diamonds-price-index.asp). Has the average price of diamonds changed, since this data which was collected 8 years ago.

Parametric test

library(psych)
describe(mydata$Price)
##    vars     n   mean      sd median trimmed     mad min   max range skew
## X1    1 53940 3932.8 3989.44   2401 3158.99 2475.94 326 18823 18497 1.62
##    kurtosis    se
## X1     2.18 17.18

From this we can see that the average seems quite lower ($3932.8), but we need to check with a formal test. Also skew isn’t 0, therefore we can expect that the distribution is not normal, but we have to also check with a formal test.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(mydata, aes(x = Price)) +
  geom_histogram(binwidth = 7500, fill = "green4", color = "black") +
  ylab("Frequency") + 
  xlab("Price od diamonds in $")

Based on the histrogram the data seems asymmetrical to the left, we will check with shapiro test.

mydata_reduced <- sample(mydata$Price, size = 5000) #the sample data was too big, for shapiro test maximum is 5000, so I reduced the number of sample data

shapiro.test(mydata_reduced)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata_reduced
## W = 0.80328, p-value < 2.2e-16
  • H0: Distribution of price is normal.
  • H1: Distribution of price is not normal.

We can reject H0 at p<0.001.

t.test(mydata_reduced,
       mu = 9502,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata_reduced
## t = -100.82, df = 4999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 9502
## 95 percent confidence interval:
##  3802.591 4020.020
## sample estimates:
## mean of x 
##  3911.306
  • H0: μ = 9502
  • H1: μ ≠ 9502

We can reject H0 at p<0.001.

#install.packages("effectsize")
library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cohens_d(mydata_reduced, mu = 9502)
## Cohen's d |         95% CI
## --------------------------
## -1.43     | [-1.47, -1.39]
## 
## - Deviation from a difference of 9502.
interpret_cohens_d(1.36, rules = "sawilowsky2009")
## [1] "very large"
## (Rules: sawilowsky2009)

Conclusion: We can reject the null hypothesis at p < 0.001, that means that there is a difference in the average price of diamonds in 2024 compared to 8 years ago. The difference in distribution is very large (r = 1.36).

Non-parametric test

median(mydata_reduced)
## [1] 2394.5

The median seems lower than 4500, we need to check with a formal test.

wilcox.test(mydata_reduced,
            mu = 4500,
            correct = FALSE)
## 
##  Wilcoxon signed rank test
## 
## data:  mydata_reduced
## V = 4162930, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 4500
  • H0: Me = 4500
  • H1: Me ≠ 4500

We can reject H0 at p<0.001.

library(effectsize)
effectsize(wilcox.test(mydata_reduced,
                       mu = 4500,
                       correct = FALSE))
## r (rank biserial) |         95% CI
## ----------------------------------
## -0.33             | [-0.36, -0.31]
## 
## - Deviation from a difference of 4500.
interpret_rank_biserial(0.31, rules = "funder2019")
## [1] "large"
## (Rules: funder2019)

Using the data frame, we found that there is a difference in the median in 2024 compared to 8 years ago (p < 0.001). The effect size is large (r = 0.31).

For hypothesis about the population arithmetic mean the assumptions are that variable is numeric and that variable on the population is normally distributed. For my data sample the variable is numeric, however the Shapiro-Wilk test showed that normality is violated. For this reason the non-parametric test is more suitable in this case.

Research question 2

Is there a correlation between the carat weight of a diamond and its price?

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplotMatrix(mydata[ , c(2,8)], smooth=FALSE)

From the scatterplot there seems to be a positive linear relationship. With the Pearson correlation test for two numeric variables we can check this.

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydata[ ,c(2,8)]), 
      type = "pearson")
##       Carat Price
## Carat  1.00  0.92
## Price  0.92  1.00
## 
## n= 53940 
## 
## 
## P
##       Carat Price
## Carat        0   
## Price  0

There seems to be a linear relationship between the diamonds carat weight and price, which is positive and very strong (r=0.92). For this reason we reject H0 at p<0.001.

Research question 3

Is there an association between the cut quality and the colour grade of diamonds?

results <- chisq.test(mydata$Cut, mydata$Colour, 
                      correct = TRUE)

results
## 
##  Pearson's Chi-squared test
## 
## data:  mydata$Cut and mydata$Colour
## X-squared = 310.32, df = 24, p-value < 2.2e-16
addmargins(results$observed)
##            mydata$Colour
## mydata$Cut      D     E     F     G     H     I     J   Sum
##   Fair        163   224   312   314   303   175   119  1610
##   Good        662   933   909   871   702   522   307  4906
##   Ideal      2834  3903  3826  4884  3115  2093   896 21551
##   Premium    1603  2337  2331  2924  2360  1428   808 13791
##   Very Good  1513  2400  2164  2299  1824  1204   678 12082
##   Sum        6775  9797  9542 11292  8304  5422  2808 53940
round(results$expected, 2)
##            mydata$Colour
## mydata$Cut        D       E       F       G       H       I       J
##   Fair       202.22  292.42  284.81  337.04  247.86  161.84   83.81
##   Good       616.21  891.07  867.87 1027.04  755.27  493.15  255.40
##   Ideal     2706.86 3914.26 3812.38 4511.57 3317.75 2166.29 1121.90
##   Premium   1732.18 2504.83 2439.63 2887.06 2123.11 1386.26  717.93
##   Very Good 1517.53 2194.43 2137.31 2529.29 1860.01 1214.47  628.96
round(results$res, 2)
##            mydata$Colour
## mydata$Cut      D     E     F     G     H     I     J
##   Fair      -2.76 -4.00  1.61 -1.26  3.50  1.03  3.84
##   Good       1.84  1.40  1.40 -4.87 -1.94  1.30  3.23
##   Ideal      2.44 -0.18  0.22  5.54 -3.52 -1.57 -6.74
##   Premium   -3.10 -3.35 -2.20  0.69  5.14  1.12  3.36
##   Very Good -0.12  4.39  0.58 -4.58 -0.83 -0.30  1.96

Assumptions for the association between two categorical variables are that the observations are independent of each other and that all expected frequencies are greater than 5. Both assumptions are met in this data sample (expected frequencies table is < 5).

We can reject H0 at p<0.001.

Explanations for category ideal cut & D (best colour): 1. Observed frequency: the observed value for ideal cut & D (best colour) is 2834. 2. Expected frequency: if there would be no association, we would expect 2706.86 diamonds in the category completed ideal cut & D (best colour). 3. Standard residual: In the combination ideal cut & D (best colour) there is more than expected number of diamonds in this category at α = 0.05.

library(effectsize)
effectsize::cramers_v(mydata$Cut, mydata$Colour)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.04              | [0.03, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.04)
## [1] "tiny"
## (Rules: funder2019)

We found that there is an association between diamond cut quality and colour, the effect size of this association is tiny (r = 0.04).