Source: Database “diamonds” in R studio.

#data()
#data(package= .packages(all.available=TRUE))
#install.packages("ggplot2")
library(ggplot2)
data("diamonds")      #Show the first 6 rows of data and explain the data set
head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
set.seed(1)                                         #Setting initial point of sampling
mydata <- diamonds[sample(nrow(diamonds), 100), ]   #Random sample of 100 units

A data set contains prices and other attributes of almost 54,000 diamonds, but for the sake of statistical analysis, I chose to do a simple random sample out of 100 units, and the variables available are the following:

This data frame used to have 53940 rows and 10 variables (columns). Now it has 100 rows and 10 columns.

Research question: Does the average price of diamonds differ between Ideal and Premium cut quality?

H0: 𝜇ideal = 𝜇premium

H1: 𝜇ideal =/= 𝜇premium

mydata <- data.frame(mydata)
colnames(mydata) <- c("Carat", "Cut", "Colour", "Clarity", "Depthin%", "WidthTopRelative", "Price", "Length", "Width", "Depth")       #Change of names of the variables

head(mydata)
##   Carat       Cut Colour Clarity Depthin% WidthTopRelative Price Length Width
## 1  0.41 Very Good      D     SI2     62.3               61   638   4.72  4.75
## 2  0.50 Very Good      F     VS2     62.8               57  1402   5.05  5.08
## 3  1.03      Fair      I     SI2     65.2               56  3530   6.42  6.35
## 4  1.10     Ideal      I     SI1     62.1               57  5037   6.60  6.64
## 5  1.51 Very Good      E     VS2     63.3               61 13757   7.24  7.17
## 6  0.30     Ideal      H     VS2     62.1               55   457   4.30  4.33
##   Depth
## 1  2.95
## 2  3.18
## 3  4.16
## 4  4.11
## 5  4.56
## 6  2.68
any_missing <- any(is.na(diamonds))   #There is no missing data as it shows "FALSE"
mydata$CutF <- factor(mydata$Cut,                       #Create a new variable
                      levels = c("Ideal", "Premium"),
                      labels = c("Ideal", "Premium"))
summary(mydata[ , c(-3, -4, -5, -6, -8, -9, -10)])       #Deleted some of the variables
##      Carat               Cut         Price              CutF   
##  Min.   :0.2300   Fair     : 5   Min.   :  363.0   Ideal  :35  
##  1st Qu.:0.4025   Good     :10   1st Qu.:  930.8   Premium:28  
##  Median :0.7200   Very Good:22   Median : 2369.5   NA's   :37  
##  Mean   :0.8681   Premium  :28   Mean   : 4474.4               
##  3rd Qu.:1.1500   Ideal    :35   3rd Qu.: 5675.5               
##  Max.   :2.5100                  Max.   :17197.0
sd(mydata$Price)
## [1] 4595.068
mean(mydata$Price)
## [1] 4474.35

Descriptive statistics:

Check the assumptions

1. Variables being numeric

  • The variables used in analysis are Price and Cut, the first one if a numeric type, however Cut is a categorical variable (ordinal type). For a t-test, one variable that is being manipulated (Price) is numeric, which is according to the assumption. Therefore, the first assumption is not being violated.

2. Normality (normal distribution) in both populations

library(ggplot2)
ggplot(mydata[mydata$Cut == "Ideal",], aes(x=Price)) + 
  geom_histogram(bins = 30, fill = "pink", alpha = 1) +
  labs(title = "Histogram of Prices for Ideal Cut Quality of diamonds",
       x = "Price",
       y = "Frequency") 

  • From the graph we can see, that we do not deal with a normal distribution - it is skewed to the right. We cannot conclude that the sample does not have a normal distribution unless we do a normality test, since our sample size is large.
library(ggplot2)
ggplot(mydata[mydata$Cut == "Premium",], aes(x=Price)) + 
  geom_histogram(bins = 30, fill = "pink", alpha = 1) +
  labs(title = "Histogram of Prices for Premium Cut Quality of diamonds",
       x = "Price",
       y = "Frequency") 

  • Again, we can see that our distribution is not normal even in the other population (of the Premium cut quality), it is skewed to the right, however, normality could be violated for both populations. Let’s check it further with Shapiro-Wilk normality test.
mydata$Cut <- as.factor(mydata$Cut)        #Choose only "Ideal" and "Premium" cut quality
mydata1 <- mydata[mydata$Cut %in% c("Ideal", "Premium"), ]

mydata1$CutF <- factor(mydata1$Cut,
                      levels = c("Ideal","Premium"),
                      labels = c("Ideal", "Premium"))
shapiro.test(mydata1$Price)     #Shapiro-Wilk normality test
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata1$Price
## W = 0.77522, p-value = 1.955e-08
  • H0: The price distribution is normal.

  • H1: The price distribution is not normal.

  • We reject null hypothesis, at p value < 0,001, and therefore our distribution is not normal and we violate the second assumption. It is appropriate to do non-parametric test.

nrow(mydata)                  #Number of units analyzed in the sample
## [1] 100
sd(mydata$Price)
## [1] 4595.068
ybar = mean(mydata$Price); sd = sd(mydata$Price); n = nrow(mydata)

se = sd/sqrt(n)

ybar_lower_t = ybar + qt(0.025, df=n-1)*se
ybar_upper_t = ybar + qt(0.975, df=n-1)*se
  • 95% confidence interval:

3562.56 < 𝜇 < 5386.11

The mean (4474.35) is included in the confidence interval.

3. The variances within both populations must be the same

We assume by the theory that the variances are the not same, therefore we will use the Welch-correction test.

However, we will do both parametric and nonparametric, to show the procedure if the assumptions would not be violated.

Parametric test

𝜇ideal = 𝜇premium

𝜇ideal =/= 𝜇premium

mydata$Cut <- as.factor(mydata$Cut)        #Choose only "Ideal" and "Premium" cut quality
mydata1 <- mydata[mydata$Cut %in% c("Ideal", "Premium"), ]
t.test(mydata1$Price ~ mydata1$Cut,         #Independent samples t-test with Welch correction
       paired = FALSE,
       var.equal = FALSE,
       alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  mydata1$Price by mydata1$Cut
## t = 1.7277, df = 44.128, p-value = 0.09104
## alternative hypothesis: true difference in means between group Premium and group Ideal is not equal to 0
## 95 percent confidence interval:
##  -347.1009 4518.8009
## sample estimates:
## mean in group Premium   mean in group Ideal 
##              5402.821              3316.971

However, since we cannot reject H0, we can say that the population means are statistically significant. (Note that our original data base had 53940 units, but we made a random sample out of 100 units, which may also not be as representative).

#install.packages("effectsize")
library(effectsize)
effectsize:: cohens_d(mydata1$Price ~ mydata1$Cut,    #Determine the effect size
                      pooled_sd = FALSE)
## Cohen's d |        95% CI
## -------------------------
## 0.45      | [-0.07, 0.96]
## 
## - Estimated using un-pooled SD.
interpret_cohens_d(0.45, rules = "sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)

Conclusion:

Based on a sample data, we found that we cannot reject H0, which means that our population means are statistically significant and the effect size is small, d = 0.45. Therefore, the average price of Ideal and Premium cut quality of diamonds, is statistically significant.

Nonparametric test

mydata1$CutF <- factor(mydata1$Cut,      #Show only diamonds of Ideal and Premium quality cut
                      levels = c("Ideal","Premium"),
                      labels = c("Ideal", "Premium"))
head(mydata1)
##    Carat     Cut Colour Clarity Depthin% WidthTopRelative Price Length Width
## 4   1.10   Ideal      I     SI1     62.1               57  5037   6.60  6.64
## 6   0.30   Ideal      H     VS2     62.1               55   457   4.30  4.33
## 7   0.87 Premium      J     SI1     61.4               57  2321   6.17  6.14
## 11  1.62 Premium      G     VS2     62.3               58 13578   7.49  7.54
## 13  0.30   Ideal      H     VS2     62.4               55   491   4.28  4.31
## 14  1.15 Premium      H     VS2     60.2               58  5191   6.86  6.80
##    Depth    CutF
## 4   4.11   Ideal
## 6   2.68   Ideal
## 7   3.78 Premium
## 11  4.68 Premium
## 13  2.68   Ideal
## 14  4.11 Premium
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:effectsize':
## 
##     phi
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describeBy(mydata1$Price, g = mydata1$CutF)
## 
##  Descriptive statistics by group 
## group: Ideal
##    vars  n    mean      sd median trimmed    mad min   max range skew kurtosis
## X1    1 35 3316.97 3573.05   1221 2746.97 763.54 457 14388 13931 1.35        1
##        se
## X1 603.96
## ------------------------------------------------------------ 
## group: Premium
##    vars  n    mean     sd median trimmed     mad min   max range skew kurtosis
## X1    1 28 5402.82 5531.6 2595.5 4839.75 2797.67 544 17197 16653 0.92     -0.7
##         se
## X1 1045.37

Median1 = 1221 Median2 = 2595.5

wilcox.test(mydata1$Price ~ mydata1$Cut,          #Wilcoxon Rank Sum test
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  mydata1$Price by mydata1$Cut
## W = 597, p-value = 0.1389
## alternative hypothesis: true location shift is not equal to 0
library(effectsize)
effectsize(wilcox.test(mydata1$Price ~ mydata1$Cut,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) |        95% CI
## ---------------------------------
## 0.22              | [-0.07, 0.47]
interpret_rank_biserial(0.22)
## [1] "medium"
## (Rules: funder2019)

Conclusion:

For this example analysis, more appropriate it is to do the non-parametric test because there are some violations of the assumptions.