Ashish Bhambhaney

Research Question:

If the way a diamond is cut has a significant impact on its price

#install.packages("car")
#install.packages("viridis")
library(ggplot2)
mydataA <- force(diamonds)
head(mydataA)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Description:

Unit of observation: A diamond

Sample Size: 53940

Variables:

  1. Carat: Weight of the diamond (0.2–5.01)

  2. Cut: Quality of the cut (Fair, Good, Very Good, Premium, Ideal)

  3. Color: Diamond colour, from D (best) to J (worst)

  4. Clarity: A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

  5. Depth: Total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)

  6. Table: Width of top of diamond relative to widest point (43–95)

  7. Price: Price in US dollars ($326–$18,823)

  8. x: Length in mm (0–10.74)

  9. y: Width in mm (0–58.9)

  10. z: Depth in mm (0–31.8)

Source: ggplot2 dataset

Since the sample size is quite large, we take 200 random diamonds for our analysis

set.seed(1)
mydataA <- mydataA[sample(nrow(mydataA), 200), ]
mydataA$cutFactor <- factor(mydataA$cut, 
                            levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"), 
                            labels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))
mydataA$cutFactor <- factor(mydataA$cutFactor, levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))

Showing descriptive statistics:

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describeBy(x = mydataA$price, group = mydataA$cutFactor)
## 
##  Descriptive statistics by group 
## group: Fair
##    vars n    mean      sd median trimmed     mad  min   max range skew kurtosis
## X1    1 9 5722.89 4801.22   4751 5722.89 1810.25 2338 18018 15680 1.77     1.84
##         se
## X1 1600.41
## ------------------------------------------------------------ 
## group: Good
##    vars  n    mean      sd median trimmed     mad min   max range skew kurtosis
## X1    1 22 4222.18 3368.98 4142.5 3745.67 1747.24 568 13976 13408 1.26     1.27
##        se
## X1 718.27
## ------------------------------------------------------------ 
## group: Very Good
##    vars  n   mean      sd median trimmed     mad min   max range skew kurtosis
## X1    1 52 4567.9 3912.34   4231 4020.86 4003.76 363 17019 16656 1.32     1.68
##        se
## X1 542.54
## ------------------------------------------------------------ 
## group: Premium
##    vars  n    mean     sd median trimmed     mad min   max range skew kurtosis
## X1    1 47 5259.72 5247.1   3557 4585.54 3989.68 522 17197 16675 1.01    -0.35
##        se
## X1 765.37
## ------------------------------------------------------------ 
## group: Ideal
##    vars  n    mean      sd median trimmed     mad min   max range skew kurtosis
## X1    1 70 3977.76 4553.83   1660 3098.89 1447.76 457 18757 18300 1.51     1.48
##        se
## X1 544.29
summary(mydataA)
##      carat               cut     color     clarity       depth     
##  Min.   :0.2300   Fair     : 9   D:23   SI1    :51   Min.   :56.9  
##  1st Qu.:0.4025   Good     :22   E:30   VS2    :46   1st Qu.:61.3  
##  Median :0.8850   Very Good:52   F:31   VS1    :33   Median :62.0  
##  Mean   :0.8831   Premium  :47   G:49   SI2    :32   Mean   :62.1  
##  3rd Qu.:1.1500   Ideal    :70   H:37   VVS2   :20   3rd Qu.:62.8  
##  Max.   :5.0100                  I:15   VVS1   : 9   Max.   :70.2  
##                                  J:15   (Other): 9                 
##      table           price             x                y         
##  Min.   :53.00   Min.   :  363   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  938   1st Qu.: 4.702   1st Qu.: 4.723  
##  Median :57.00   Median : 3504   Median : 5.975   Median : 5.930  
##  Mean   :57.34   Mean   : 4538   Mean   : 5.844   Mean   : 5.853  
##  3rd Qu.:58.25   3rd Qu.: 5676   3rd Qu.: 6.732   3rd Qu.: 6.755  
##  Max.   :64.00   Max.   :18757   Max.   :10.740   Max.   :10.540  
##                                                                   
##        z             cutFactor 
##  Min.   :0.000   Fair     : 9  
##  1st Qu.:2.897   Good     :22  
##  Median :3.720   Very Good:52  
##  Mean   :3.613   Premium  :47  
##  3rd Qu.:4.133   Ideal    :70  
##  Max.   :6.980                 
## 

Explaining a few parameters:

  1. Price mean: Based on this sample of 200 diamonds, the average price of a diamond across all cut categories is $4538

  2. Min carat: The minimum weight of the diamond observed in this sample is 0.23

  3. 1st and third quartile of x: Based on this sample of 200 diamonds, 75% of the diamonds have a length between 4.702 and 6.732 mm

Hypothesis Testing

  1. Test details:

Since this data is an independent sample with three or more arithmetic means, we do the one way analysis of variance- ANOVA as a parametric test, and if the assumption of normality is not met, we do the Kruskal-Wallis Rank Sum Test as a non parametric test

  1. Checking the assumptions:
    1. Analysed variable is numeric: Satisfied, because price is a numeric variable

    2. Variable in the population is normally distributed within each group. We do the Shapiro WIlk Test for checking this

      H0: The data within the group is normally distributed

      H1: The data within the group is not normally distributed

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
mydataA %>%
  group_by(cutFactor) %>%
  shapiro_test(price)
## # A tibble: 5 × 4
##   cutFactor variable statistic             p
##   <ord>     <chr>        <dbl>         <dbl>
## 1 Fair      price        0.656 0.000438     
## 2 Good      price        0.837 0.00198      
## 3 Very Good price        0.865 0.0000295    
## 4 Premium   price        0.816 0.00000362   
## 5 Ideal     price        0.756 0.00000000190
  Based on the above values, we reject H0 for all the groups(at p< 0.002), which means
  neither of the groups have a normal distribution of the variable price
  1. Homoscedasticity: the variance of analyzed variable is the same within all groups: We do the Levene test for checking this

    H0: The variance of price within the cut groups Fair, Good, Very Good, Premium, Ideal is the same

    H1: At least one group has a different variance

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:psych':
## 
##     logit
leveneTest(mydataA$price, group = mydataA$cutFactor)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   4  1.2179 0.3045
##       195
  As p = 0.3045(more than 5%), we cannot reject H0 and which implies we cannot reject
  that Homoscedasticity is met     
  1. No significant outliers:
library(ggplot2)
ggplot(mydataA, aes(x = price)) +
  geom_histogram(binwidth = 2500, colour="gray") +
  facet_wrap(~cutFactor, ncol = 20) + 
  ylab("Frequency")

There are no significant outliers in this data, there seems to be one appearing in the group Fair, but its because its sample size is really small

Conclusion: Since the assumption of normality is not met, a non parametric test is suitable for this sample, but we would do both the parametric and non parametric test

  1. Parametric Test

    H0: Mean(Fair) = Mean(Good) = Mean(Very Good) = Mean(Premium) = Mean(Ideal)

    H1: At least one mean is different

ANOVA_results <- aov(price ~ cutFactor, 
                     data = mydataA)

summary(ANOVA_results)
##              Df    Sum Sq  Mean Sq F value Pr(>F)
## cutFactor     4 6.133e+07 15332270   0.766  0.548
## Residuals   195 3.901e+09 20003825

Based on the above values, we cannot reject H0, which implies none of the means is statistically different than the others

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:psych':
## 
##     phi
eta_squared(ANOVA_results)
## For one-way between subjects designs, partial eta squared is equivalent
##   to eta squared. Returning eta squared.
## # Effect Size for ANOVA
## 
## Parameter | Eta2 |       95% CI
## -------------------------------
## cutFactor | 0.02 | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_eta_squared(0.02, rules = "cohen1992")
## [1] "small"
## (Rules: cohen1992)
pwc <- mydataA %>%
  pairwise_t_test(price ~ cutFactor, 
                  paired = FALSE,
                  p.adjust.method = "bonferroni")

ANOVA_results <- anova_test(price ~ cutFactor, 
                            data = mydataA)

library(rstatix)
pwc <- pwc %>% 
       add_y_position(fun = "median", step.increase = 0.35)

library(ggpubr)
ggboxplot(mydataA, x = "cutFactor", y = "price", add = "point", ylim=c(0, 20000)) +
  stat_pvalue_manual(pwc, hide.ns = FALSE) +
  stat_summary(fun = mean, geom = "point", shape = 16, size = 4,
               aes(group = cutFactor), color = "darkred",
               position = position_dodge(width = 0.5)) +
  stat_summary(fun = mean, colour = "darkred", 
               position = position_dodge(width = 0.5),
               geom = "text", vjust = -0.2, hjust = -0.2,
               aes(label = round(after_stat(y), digits = 2), group = cutFactor)) +
  labs(subtitle = get_test_label(ANOVA_results,  detailed = TRUE),
       caption = get_pwc_label(pwc))

The above figure further conveys the conclusion that none of the means is statistically different than any other

  1. Non Parametric Test

As the assumption of normality is not met, the Kruskal-Wallis test is suitable for this sample

H0: The location distribution of price is the same for all the cut groups

H1: At least one is different

kruskal.test(price ~ cutFactor, 
             data = mydataA)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  price by cutFactor
## Kruskal-Wallis chi-squared = 4.9929, df = 4, p-value = 0.288

As p>0.05, we cannot reject H0, which means we cannot reject the statement that the location distribution of price is the same for all the cut groups

kruskal_effsize(price ~ cutFactor, 
                data = mydataA)
## # A tibble: 1 × 5
##   .y.       n effsize method  magnitude
## * <chr> <int>   <dbl> <chr>   <ord>    
## 1 price   200 0.00509 eta2[H] small
pwc <- mydataA %>%
  wilcox_test(price ~ cutFactor, 
              paired = FALSE, 
              p.adjust.method = "bonferroni")
 
Kruskal_results <- kruskal_test(price ~ cutFactor, 
                                data = mydataA)

library(rstatix)
pwc <- pwc %>% 
       add_y_position(fun = "median", step.increase = 0.35)

library(ggpubr)
ggboxplot(mydataA, x = "cutFactor", y = "price", add = "point", ylim=c(0, 20000)) +
  stat_pvalue_manual(pwc, hide.ns = FALSE) +
  stat_summary(fun = median, geom = "point", shape = 16, size = 4,
               aes(group = cutFactor), color = "darkred",
               position = position_dodge(width = 0.8)) +
  stat_summary(fun = median, colour = "darkred", 
               position = position_dodge(width = 0.8),
               geom = "text", vjust = -0.5, hjust = -1,
               aes(label = round(after_stat(y), digits = 2), group = cutFactor)) +
  labs(subtitle = get_test_label(Kruskal_results,  detailed = TRUE),
       caption = get_pwc_label(pwc))

Conclusion(Answer of the question)

Based on the parametric test and the non parametric test we can conclude that there is no impact of the way a diamond is cut on its price