Source: Database “diamonds” in R studio.

#data()
#data(package= .packages(all.available=TRUE))

#install.packages("ggplot2")
library(ggplot2)

data("diamonds")      #Show the first 6 rows of data and explain the data set
head(diamonds)

## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

set.seed(1)                                         #Setting initial point of sampling
mydata <- diamonds[sample(nrow(diamonds), 100), ]   #Random sample of 100 units

A data set contains prices and other attributes of almost 54,000 diamonds, but for the sake of statistical analysis, I chose to do a simple random sample out of 100 units, and the variables available are the following:

carat: It is a weight of the diamond (from 0.2 to 5.01).
cut: It shows the quality of the cut (Fair, Good, Very Good, Premium, Ideal). It is a categorical (ordinal type) variable.
colour: Diamond colour, from D (the best one), to J (the worst one). It is a categorical (ordinal type) variable.
clarity: A measurement of how clear the diamond is (I1 = the worst, SI2, SI1, VS2, VS1, VVS2, VVS1, IF = the best). It is also a categorical (ordinal type) variable.
depth: Total depth percentage
table: Width of top diamond relative to the widest point (43-95)
price: The price in US dollars (between $326 and $18,823) - an interval variable.
x: The length in mm (0-10.74)
y: Width in mm (0-58.9).

This data frame used to have 53940 rows and 10 variables (columns). Now it has 100 rows and 10 columns.

Research question: Does the average price of diamonds differ between Ideal and Premium cut quality?

H0: The average price is the same for Ideal and Premium cut quality.

H0: 𝜇ideal = 𝜇premium

H1: The average price differs between Ideal and Premium cut quality.

H1: 𝜇ideal =/= 𝜇premium

mydata <- data.frame(mydata)

colnames(mydata) <- c("Carat", "Cut", "Colour", "Clarity", "Depthin%", "WidthTopRelative", "Price", "Length", "Width", "Depth")       #Change of names of the variables

head(mydata)

##   Carat       Cut Colour Clarity Depthin% WidthTopRelative Price Length Width
## 1  0.41 Very Good      D     SI2     62.3               61   638   4.72  4.75
## 2  0.50 Very Good      F     VS2     62.8               57  1402   5.05  5.08
## 3  1.03      Fair      I     SI2     65.2               56  3530   6.42  6.35
## 4  1.10     Ideal      I     SI1     62.1               57  5037   6.60  6.64
## 5  1.51 Very Good      E     VS2     63.3               61 13757   7.24  7.17
## 6  0.30     Ideal      H     VS2     62.1               55   457   4.30  4.33
##   Depth
## 1  2.95
## 2  3.18
## 3  4.16
## 4  4.11
## 5  4.56
## 6  2.68

any_missing <- any(is.na(diamonds))   #There is no missing data as it shows "FALSE"

mydata$CutF <- factor(mydata$Cut,                       #Create a new variable
                      levels = c("Ideal", "Premium"),
                      labels = c("Ideal", "Premium"))

summary(mydata[ , c(-3, -4, -5, -6, -8, -9, -10)])       #Deleted some of the variables

##      Carat               Cut         Price              CutF   
##  Min.   :0.2300   Fair     : 5   Min.   :  363.0   Ideal  :35  
##  1st Qu.:0.4025   Good     :10   1st Qu.:  930.8   Premium:28  
##  Median :0.7200   Very Good:22   Median : 2369.5   NA's   :37  
##  Mean   :0.8681   Premium  :28   Mean   : 4474.4               
##  3rd Qu.:1.1500   Ideal    :35   3rd Qu.: 5675.5               
##  Max.   :2.5100                  Max.   :17197.0

sd(mydata$Price)

## [1] 4595.068

mean(mydata$Price)

## [1] 4474.35

Descriptive statistics:

50% of the sample has less than 0.7200 carats, and the other 50% of the sample has more than 0.7200 carats. However, the average price of the sample is 4474.4 USD.
In the sample, we have the smallest number of Fair cut quality diamonds (5) and the highest number of Ideal cut quality diamonds (35). However, since we will further analyze only the Ideal and Premium cut quality, we also have a new variable (CutF), which shows the frequency of each of the cut we want to analyze further. Out of our sample size of 100, we have 35 diamonds of Ideal cut and 28 diamonds of the Premium cut quality, the rest is of a different cut quality.
75% of units in the sample have a price lower than 5675.5 USD, and the other 25% have a price higher than 5675.5 USD. However, the maximum price in the sample is 17,197 USD. Standard deviation of the price is 4595.068 USD. And the average price of the sample of 100 is 4474.4 USD.

Check the assumptions

1. Variables being numeric

The variables used in analysis are Price and Cut, the first one if a numeric type, however Cut is a categorical variable (ordinal type). For a t-test, one variable that is being manipulated (Price) is numeric, which is according to the assumption. Therefore, the first assumption is not being violated.

2. Normality (normal distribution) in both populations

library(ggplot2)
ggplot(mydata[mydata$Cut == "Ideal",], aes(x=Price)) + 
  geom_histogram(bins = 30, fill = "pink", alpha = 1) +
  labs(title = "Histogram of Prices for Ideal Cut Quality of diamonds",
       x = "Price",
       y = "Frequency")

From the graph we can see, that we do not deal with a normal distribution - it is skewed to the right. We cannot conclude that the sample does not have a normal distribution unless we do a normality test, since our sample size is large.

library(ggplot2)
ggplot(mydata[mydata$Cut == "Premium",], aes(x=Price)) + 
  geom_histogram(bins = 30, fill = "pink", alpha = 1) +
  labs(title = "Histogram of Prices for Premium Cut Quality of diamonds",
       x = "Price",
       y = "Frequency")

Again, we can see that our distribution is not normal even in the other population (of the Premium cut quality), it is skewed to the right, however, normality could be violated for both populations. Let’s check it further with Shapiro-Wilk normality test.

mydata$Cut <- as.factor(mydata$Cut)        #Choose only "Ideal" and "Premium" cut quality
mydata1 <- mydata[mydata$Cut %in% c("Ideal", "Premium"), ]

mydata1$CutF <- factor(mydata1$Cut,
                      levels = c("Ideal","Premium"),
                      labels = c("Ideal", "Premium"))

shapiro.test(mydata1$Price)     #Shapiro-Wilk normality test

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata1$Price
## W = 0.77522, p-value = 1.955e-08

H0: The price distribution is normal.
H1: The price distribution is not normal.
We reject null hypothesis, at p value < 0,001, and therefore our distribution is not normal and we violate the second assumption. It is appropriate to do non-parametric test.

nrow(mydata)                  #Number of units analyzed in the sample

## [1] 100

sd(mydata$Price)

## [1] 4595.068

ybar = mean(mydata$Price); sd = sd(mydata$Price); n = nrow(mydata)

se = sd/sqrt(n)

ybar_lower_t = ybar + qt(0.025, df=n-1)*se
ybar_upper_t = ybar + qt(0.975, df=n-1)*se

95% confidence interval:

3562.56 < 𝜇 < 5386.11

The mean (4474.35) is included in the confidence interval.

3. The variances within both populations must be the same

We assume by the theory that the variances are the not same, therefore we will use the Welch-correction test.

However, we will do both parametric and nonparametric, to show the procedure if the assumptions would not be violated.

Parametric test

H0: The average price is the same for Ideal and Premium cut quality.

𝜇ideal = 𝜇premium

H1: The average price is not the same for Ideal and Premium cut quality

𝜇ideal =/= 𝜇premium

mydata$Cut <- as.factor(mydata$Cut)        #Choose only "Ideal" and "Premium" cut quality
mydata1 <- mydata[mydata$Cut %in% c("Ideal", "Premium"), ]

t.test(mydata1$Price ~ mydata1$Cut,         #Independent samples t-test with Welch correction
       paired = FALSE,
       var.equal = FALSE,
       alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  mydata1$Price by mydata1$Cut
## t = 1.7277, df = 44.128, p-value = 0.09104
## alternative hypothesis: true difference in means between group Premium and group Ideal is not equal to 0
## 95 percent confidence interval:
##  -347.1009 4518.8009
## sample estimates:
## mean in group Premium   mean in group Ideal 
##              5402.821              3316.971

p = 0.09104, We cannot reject the null hypothesis, because p-value is higher than 0.05.
𝜇ideal = 3316.971 𝜇premium = 5402.821

However, since we cannot reject H0, we can say that the population means are statistically significant. (Note that our original data base had 53940 units, but we made a random sample out of 100 units, which may also not be as representative).

#install.packages("effectsize")
library(effectsize)

effectsize:: cohens_d(mydata1$Price ~ mydata1$Cut,    #Determine the effect size
                      pooled_sd = FALSE)

## Cohen's d |        95% CI
## -------------------------
## 0.45      | [-0.07, 0.96]
## 
## - Estimated using un-pooled SD.

interpret_cohens_d(0.45, rules = "sawilowsky2009")

## [1] "small"
## (Rules: sawilowsky2009)

Conclusion:

Based on a sample data, we found that we cannot reject H0, which means that our population means are statistically significant and the effect size is small, d = 0.45. Therefore, the average price of Ideal and Premium cut quality of diamonds, is statistically significant.

Nonparametric test

mydata1$CutF <- factor(mydata1$Cut,      #Show only diamonds of Ideal and Premium quality cut
                      levels = c("Ideal","Premium"),
                      labels = c("Ideal", "Premium"))
head(mydata1)

##    Carat     Cut Colour Clarity Depthin% WidthTopRelative Price Length Width
## 4   1.10   Ideal      I     SI1     62.1               57  5037   6.60  6.64
## 6   0.30   Ideal      H     VS2     62.1               55   457   4.30  4.33
## 7   0.87 Premium      J     SI1     61.4               57  2321   6.17  6.14
## 11  1.62 Premium      G     VS2     62.3               58 13578   7.49  7.54
## 13  0.30   Ideal      H     VS2     62.4               55   491   4.28  4.31
## 14  1.15 Premium      H     VS2     60.2               58  5191   6.86  6.80
##    Depth    CutF
## 4   4.11   Ideal
## 6   2.68   Ideal
## 7   3.78 Premium
## 11  4.68 Premium
## 13  2.68   Ideal
## 14  4.11 Premium

library(psych)

## 
## Attaching package: 'psych'

## The following object is masked from 'package:effectsize':
## 
##     phi

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

describeBy(mydata1$Price, g = mydata1$CutF)

## 
##  Descriptive statistics by group 
## group: Ideal
##    vars  n    mean      sd median trimmed    mad min   max range skew kurtosis
## X1    1 35 3316.97 3573.05   1221 2746.97 763.54 457 14388 13931 1.35        1
##        se
## X1 603.96
## ------------------------------------------------------------ 
## group: Premium
##    vars  n    mean     sd median trimmed     mad min   max range skew kurtosis
## X1    1 28 5402.82 5531.6 2595.5 4839.75 2797.67 544 17197 16653 0.92     -0.7
##         se
## X1 1045.37

H0: Location distribution for prices is the same for both Ideal and Premium cut quality.
H1: Location distribution for prices is not the same for both Ideal and Premium cut quality.

Median1 = 1221 Median2 = 2595.5

wilcox.test(mydata1$Price ~ mydata1$Cut,          #Wilcoxon Rank Sum test
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata1$Price by mydata1$Cut
## W = 597, p-value = 0.1389
## alternative hypothesis: true location shift is not equal to 0

H0: The location of distribution of price between Ideal and Premium is the same.
H1: The location of distribution of price between Ideal and Premium is not the same.
p-value = 0.1389, it is higher than 0.05, therefore we cannot reject H0, which states that the distribution location of price is the same between Ideal and Premium cut.

library(effectsize)
effectsize(wilcox.test(mydata1$Price ~ mydata1$Cut,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))

## r (rank biserial) |        95% CI
## ---------------------------------
## 0.22              | [-0.07, 0.47]

interpret_rank_biserial(0.22)

## [1] "medium"
## (Rules: funder2019)

Conclusion:

Based on the sample data, we found out that on average the price of Ideal and Premium cut quality of diamonds is the same (very similar). However, the effect is medium (r = 0.22).

For this example analysis, more appropriate it is to do the non-parametric test because there are some violations of the assumptions.

HW 1: Hypothesis testing

Nika Boh

2024-01-07