Source: Database “diamonds” in R studio.
#data()
#data(package= .packages(all.available=TRUE))
#install.packages("ggplot2")
library(ggplot2)
data("diamonds") #Show the first 6 rows of data and explain the data set
head(diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
set.seed(1) #Setting initial point of sampling
mydata <- diamonds[sample(nrow(diamonds), 100), ] #Random sample of 100 units
A data set contains prices and other attributes of almost 54,000 diamonds, but for the sake of statistical analysis, I chose to do a simple random sample out of 100 units, and the variables available are the following:
carat: It is a weight of the diamond (from 0.2 to 5.01).
cut: It shows the quality of the cut (Fair, Good, Very Good, Premium, Ideal). It is a categorical (ordinal type) variable.
colour: Diamond colour, from D (the best one), to J (the worst one). It is a categorical (ordinal type) variable.
clarity: A measurement of how clear the diamond is (I1 = the worst, SI2, SI1, VS2, VS1, VVS2, VVS1, IF = the best). It is also a categorical (ordinal type) variable.
depth: Total depth percentage
table: Width of top diamond relative to the widest point (43-95)
price: The price in US dollars (between $326 and $18,823) - an interval variable.
x: The length in mm (0-10.74)
y: Width in mm (0-58.9).
This data frame used to have 53940 rows and 10 variables (columns). Now it has 100 rows and 10 columns.
library(ggplot2)
ggplot(mydata[mydata$Cut == "Ideal",], aes(x=Price)) +
geom_histogram(bins = 30, fill = "pink", alpha = 1) +
labs(title = "Histogram of Prices for Ideal Cut Quality of diamonds",
x = "Price",
y = "Frequency")
library(ggplot2)
ggplot(mydata[mydata$Cut == "Premium",], aes(x=Price)) +
geom_histogram(bins = 30, fill = "pink", alpha = 1) +
labs(title = "Histogram of Prices for Premium Cut Quality of diamonds",
x = "Price",
y = "Frequency")
mydata$Cut <- as.factor(mydata$Cut) #Choose only "Ideal" and "Premium" cut quality
mydata1 <- mydata[mydata$Cut %in% c("Ideal", "Premium"), ]
mydata1$CutF <- factor(mydata1$Cut,
levels = c("Ideal","Premium"),
labels = c("Ideal", "Premium"))
shapiro.test(mydata1$Price) #Shapiro-Wilk normality test
##
## Shapiro-Wilk normality test
##
## data: mydata1$Price
## W = 0.77522, p-value = 1.955e-08
H0: The price distribution is normal.
H1: The price distribution is not normal.
We reject null hypothesis, at p value < 0,001, and therefore our distribution is not normal and we violate the second assumption. It is appropriate to do non-parametric test.
nrow(mydata) #Number of units analyzed in the sample
## [1] 100
sd(mydata$Price)
## [1] 4595.068
ybar = mean(mydata$Price); sd = sd(mydata$Price); n = nrow(mydata)
se = sd/sqrt(n)
ybar_lower_t = ybar + qt(0.025, df=n-1)*se
ybar_upper_t = ybar + qt(0.975, df=n-1)*se
3562.56 < 𝜇 < 5386.11
The mean (4474.35) is included in the confidence interval.
We assume by the theory that the variances are the not same, therefore we will use the Welch-correction test.
However, we will do both parametric and nonparametric, to show the procedure if the assumptions would not be violated.
𝜇ideal = 𝜇premium
𝜇ideal =/= 𝜇premium
mydata$Cut <- as.factor(mydata$Cut) #Choose only "Ideal" and "Premium" cut quality
mydata1 <- mydata[mydata$Cut %in% c("Ideal", "Premium"), ]
t.test(mydata1$Price ~ mydata1$Cut, #Independent samples t-test with Welch correction
paired = FALSE,
var.equal = FALSE,
alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: mydata1$Price by mydata1$Cut
## t = 1.7277, df = 44.128, p-value = 0.09104
## alternative hypothesis: true difference in means between group Premium and group Ideal is not equal to 0
## 95 percent confidence interval:
## -347.1009 4518.8009
## sample estimates:
## mean in group Premium mean in group Ideal
## 5402.821 3316.971
p = 0.09104, We cannot reject the null hypothesis, because p-value is higher than 0.05.
𝜇ideal = 3316.971 𝜇premium = 5402.821
However, since we cannot reject H0, we can say that the population means are statistically significant. (Note that our original data base had 53940 units, but we made a random sample out of 100 units, which may also not be as representative).
#install.packages("effectsize")
library(effectsize)
effectsize:: cohens_d(mydata1$Price ~ mydata1$Cut, #Determine the effect size
pooled_sd = FALSE)
## Cohen's d | 95% CI
## -------------------------
## 0.45 | [-0.07, 0.96]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(0.45, rules = "sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)
Conclusion:
Based on a sample data, we found that we cannot reject H0, which means that our population means are statistically significant and the effect size is small, d = 0.45. Therefore, the average price of Ideal and Premium cut quality of diamonds, is statistically significant.
mydata1$CutF <- factor(mydata1$Cut, #Show only diamonds of Ideal and Premium quality cut
levels = c("Ideal","Premium"),
labels = c("Ideal", "Premium"))
head(mydata1)
## Carat Cut Colour Clarity Depthin% WidthTopRelative Price Length Width
## 4 1.10 Ideal I SI1 62.1 57 5037 6.60 6.64
## 6 0.30 Ideal H VS2 62.1 55 457 4.30 4.33
## 7 0.87 Premium J SI1 61.4 57 2321 6.17 6.14
## 11 1.62 Premium G VS2 62.3 58 13578 7.49 7.54
## 13 0.30 Ideal H VS2 62.4 55 491 4.28 4.31
## 14 1.15 Premium H VS2 60.2 58 5191 6.86 6.80
## Depth CutF
## 4 4.11 Ideal
## 6 2.68 Ideal
## 7 3.78 Premium
## 11 4.68 Premium
## 13 2.68 Ideal
## 14 4.11 Premium
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:effectsize':
##
## phi
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describeBy(mydata1$Price, g = mydata1$CutF)
##
## Descriptive statistics by group
## group: Ideal
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 35 3316.97 3573.05 1221 2746.97 763.54 457 14388 13931 1.35 1
## se
## X1 603.96
## ------------------------------------------------------------
## group: Premium
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 28 5402.82 5531.6 2595.5 4839.75 2797.67 544 17197 16653 0.92 -0.7
## se
## X1 1045.37
H0: Location distribution for prices is the same for both Ideal and Premium cut quality.
H1: Location distribution for prices is not the same for both Ideal and Premium cut quality.
Median1 = 1221 Median2 = 2595.5
wilcox.test(mydata1$Price ~ mydata1$Cut, #Wilcoxon Rank Sum test
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata1$Price by mydata1$Cut
## W = 597, p-value = 0.1389
## alternative hypothesis: true location shift is not equal to 0
H0: The location of distribution of price between Ideal and Premium is the same.
H1: The location of distribution of price between Ideal and Premium is not the same.
p-value = 0.1389, it is higher than 0.05, therefore we cannot reject H0, which states that the distribution location of price is the same between Ideal and Premium cut.
library(effectsize)
effectsize(wilcox.test(mydata1$Price ~ mydata1$Cut,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## 0.22 | [-0.07, 0.47]
interpret_rank_biserial(0.22)
## [1] "medium"
## (Rules: funder2019)
Conclusion: