If the way a diamond is cut has a significant impact on its price
#install.packages("car")
#install.packages("viridis")
library(ggplot2)
mydataA <- force(diamonds)
head(mydataA)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Unit of observation: A diamond
Sample Size: 53940
Variables:
Carat: Weight of the diamond (0.2–5.01)
Cut: Quality of the cut (Fair, Good, Very Good, Premium, Ideal)
Color: Diamond colour, from D (best) to J (worst)
Clarity: A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
Depth: Total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
Table: Width of top of diamond relative to widest point (43–95)
Price: Price in US dollars ($326–$18,823)
x: Length in mm (0–10.74)
y: Width in mm (0–58.9)
z: Depth in mm (0–31.8)
Source: ggplot2 dataset
Since the sample size is quite large, we take 200 random diamonds for our analysis
set.seed(1)
mydataA <- mydataA[sample(nrow(mydataA), 200), ]
mydataA$cutFactor <- factor(mydataA$cut,
levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
labels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))
mydataA$cutFactor <- factor(mydataA$cutFactor, levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describeBy(x = mydataA$price, group = mydataA$cutFactor)
##
## Descriptive statistics by group
## group: Fair
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 9 5722.89 4801.22 4751 5722.89 1810.25 2338 18018 15680 1.77 1.84
## se
## X1 1600.41
## ------------------------------------------------------------
## group: Good
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 22 4222.18 3368.98 4142.5 3745.67 1747.24 568 13976 13408 1.26 1.27
## se
## X1 718.27
## ------------------------------------------------------------
## group: Very Good
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 52 4567.9 3912.34 4231 4020.86 4003.76 363 17019 16656 1.32 1.68
## se
## X1 542.54
## ------------------------------------------------------------
## group: Premium
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 47 5259.72 5247.1 3557 4585.54 3989.68 522 17197 16675 1.01 -0.35
## se
## X1 765.37
## ------------------------------------------------------------
## group: Ideal
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 70 3977.76 4553.83 1660 3098.89 1447.76 457 18757 18300 1.51 1.48
## se
## X1 544.29
summary(mydataA)
## carat cut color clarity depth
## Min. :0.2300 Fair : 9 D:23 SI1 :51 Min. :56.9
## 1st Qu.:0.4025 Good :22 E:30 VS2 :46 1st Qu.:61.3
## Median :0.8850 Very Good:52 F:31 VS1 :33 Median :62.0
## Mean :0.8831 Premium :47 G:49 SI2 :32 Mean :62.1
## 3rd Qu.:1.1500 Ideal :70 H:37 VVS2 :20 3rd Qu.:62.8
## Max. :5.0100 I:15 VVS1 : 9 Max. :70.2
## J:15 (Other): 9
## table price x y
## Min. :53.00 Min. : 363 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 938 1st Qu.: 4.702 1st Qu.: 4.723
## Median :57.00 Median : 3504 Median : 5.975 Median : 5.930
## Mean :57.34 Mean : 4538 Mean : 5.844 Mean : 5.853
## 3rd Qu.:58.25 3rd Qu.: 5676 3rd Qu.: 6.732 3rd Qu.: 6.755
## Max. :64.00 Max. :18757 Max. :10.740 Max. :10.540
##
## z cutFactor
## Min. :0.000 Fair : 9
## 1st Qu.:2.897 Good :22
## Median :3.720 Very Good:52
## Mean :3.613 Premium :47
## 3rd Qu.:4.133 Ideal :70
## Max. :6.980
##
Price mean: Based on this sample of 200 diamonds, the average price of a diamond across all cut categories is $4538
Min carat: The minimum weight of the diamond observed in this sample is 0.23
1st and third quartile of x: Based on this sample of 200 diamonds, 75% of the diamonds have a length between 4.702 and 6.732 mm
Since this data is an independent sample with three or more arithmetic means, we do the one way analysis of variance- ANOVA as a parametric test, and if the assumption of normality is not met, we do the Kruskal-Wallis Rank Sum Test as a non parametric test
Analysed variable is numeric: Satisfied, because price is a numeric variable
Variable in the population is normally distributed within each group. We do the Shapiro WIlk Test for checking this
H0: The data within the group is normally distributed
H1: The data within the group is not normally distributed
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydataA %>%
group_by(cutFactor) %>%
shapiro_test(price)
## # A tibble: 5 × 4
## cutFactor variable statistic p
## <ord> <chr> <dbl> <dbl>
## 1 Fair price 0.656 0.000438
## 2 Good price 0.837 0.00198
## 3 Very Good price 0.865 0.0000295
## 4 Premium price 0.816 0.00000362
## 5 Ideal price 0.756 0.00000000190
Based on the above values, we reject H0 for all the groups(at p< 0.002), which means
neither of the groups have a normal distribution of the variable price
Homoscedasticity: the variance of analyzed variable is the same within all groups: We do the Levene test for checking this
H0: The variance of price within the cut groups Fair, Good, Very Good, Premium, Ideal is the same
H1: At least one group has a different variance
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
leveneTest(mydataA$price, group = mydataA$cutFactor)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.2179 0.3045
## 195
As p = 0.3045(more than 5%), we cannot reject H0 and which implies we cannot reject
that Homoscedasticity is met
library(ggplot2)
ggplot(mydataA, aes(x = price)) +
geom_histogram(binwidth = 2500, colour="gray") +
facet_wrap(~cutFactor, ncol = 20) +
ylab("Frequency")
There are no significant outliers in this data, there seems to be one appearing in the group Fair, but its because its sample size is really small
Conclusion: Since the assumption of normality is not met, a non parametric test is suitable for this sample, but we would do both the parametric and non parametric test
Parametric Test
H0: Mean(Fair) = Mean(Good) = Mean(Very Good) = Mean(Premium) = Mean(Ideal)
H1: At least one mean is different
ANOVA_results <- aov(price ~ cutFactor,
data = mydataA)
summary(ANOVA_results)
## Df Sum Sq Mean Sq F value Pr(>F)
## cutFactor 4 6.133e+07 15332270 0.766 0.548
## Residuals 195 3.901e+09 20003825
Based on the above values, we cannot reject H0, which implies none of the means is statistically different than the others
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
eta_squared(ANOVA_results)
## For one-way between subjects designs, partial eta squared is equivalent
## to eta squared. Returning eta squared.
## # Effect Size for ANOVA
##
## Parameter | Eta2 | 95% CI
## -------------------------------
## cutFactor | 0.02 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_eta_squared(0.02, rules = "cohen1992")
## [1] "small"
## (Rules: cohen1992)
pwc <- mydataA %>%
pairwise_t_test(price ~ cutFactor,
paired = FALSE,
p.adjust.method = "bonferroni")
ANOVA_results <- anova_test(price ~ cutFactor,
data = mydataA)
library(rstatix)
pwc <- pwc %>%
add_y_position(fun = "median", step.increase = 0.35)
library(ggpubr)
ggboxplot(mydataA, x = "cutFactor", y = "price", add = "point", ylim=c(0, 20000)) +
stat_pvalue_manual(pwc, hide.ns = FALSE) +
stat_summary(fun = mean, geom = "point", shape = 16, size = 4,
aes(group = cutFactor), color = "darkred",
position = position_dodge(width = 0.5)) +
stat_summary(fun = mean, colour = "darkred",
position = position_dodge(width = 0.5),
geom = "text", vjust = -0.2, hjust = -0.2,
aes(label = round(after_stat(y), digits = 2), group = cutFactor)) +
labs(subtitle = get_test_label(ANOVA_results, detailed = TRUE),
caption = get_pwc_label(pwc))
The above figure further conveys the conclusion that none of the means is statistically different than any other
As the assumption of normality is not met, the Kruskal-Wallis test is suitable for this sample
H0: The location distribution of price is the same for all the cut groups
H1: At least one is different
kruskal.test(price ~ cutFactor,
data = mydataA)
##
## Kruskal-Wallis rank sum test
##
## data: price by cutFactor
## Kruskal-Wallis chi-squared = 4.9929, df = 4, p-value = 0.288
As p>0.05, we cannot reject H0, which means we cannot reject the statement that the location distribution of price is the same for all the cut groups
kruskal_effsize(price ~ cutFactor,
data = mydataA)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 price 200 0.00509 eta2[H] small
pwc <- mydataA %>%
wilcox_test(price ~ cutFactor,
paired = FALSE,
p.adjust.method = "bonferroni")
Kruskal_results <- kruskal_test(price ~ cutFactor,
data = mydataA)
library(rstatix)
pwc <- pwc %>%
add_y_position(fun = "median", step.increase = 0.35)
library(ggpubr)
ggboxplot(mydataA, x = "cutFactor", y = "price", add = "point", ylim=c(0, 20000)) +
stat_pvalue_manual(pwc, hide.ns = FALSE) +
stat_summary(fun = median, geom = "point", shape = 16, size = 4,
aes(group = cutFactor), color = "darkred",
position = position_dodge(width = 0.8)) +
stat_summary(fun = median, colour = "darkred",
position = position_dodge(width = 0.8),
geom = "text", vjust = -0.5, hjust = -1,
aes(label = round(after_stat(y), digits = 2), group = cutFactor)) +
labs(subtitle = get_test_label(Kruskal_results, detailed = TRUE),
caption = get_pwc_label(pwc))
Based on the parametric test and the non parametric test we can conclude that there is no impact of the way a diamond is cut on its price