If there is a correlation between the weight of a diamond and its price
library(ggplot2)
mydataB <- force(diamonds)
head(mydataB)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Unit of observation: A diamond
Sample Size: 53940
Variables:
Carat: Weight of the diamond (0.2–5.01)
Cut: Quality of the cut (Fair, Good, Very Good, Premium, Ideal)
Color: Diamond colour, from D (best) to J (worst)
Clarity: A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
Depth: Total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
Table: Width of top of diamond relative to widest point (43–95)
Price: Price in US dollars ($326–$18,823)
x: Length in mm (0–10.74)
y: Width in mm (0–58.9)
z: Depth in mm (0–31.8)
Source: ggplot2 dataset
In the next dataset, we keep only the required variables
mydataC <- mydataB[, c("carat", "price")]
head(mydataC)
## # A tibble: 6 × 2
## carat price
## <dbl> <int>
## 1 0.23 326
## 2 0.21 326
## 3 0.23 327
## 4 0.29 334
## 5 0.31 335
## 6 0.24 336
summary(mydataC)
## carat price
## Min. :0.2000 Min. : 326
## 1st Qu.:0.4000 1st Qu.: 950
## Median :0.7000 Median : 2401
## Mean :0.7979 Mean : 3933
## 3rd Qu.:1.0400 3rd Qu.: 5324
## Max. :5.0100 Max. :18823
Price mean: Based on this sample, the average price of a diamond is $3,933
Min carat: The lightest diamond observed in this sample weighs 0.2 carats
We test if there is a correlation between the weight of a diamond and its price
library(ggplot2)
ggplot(mydataC, aes(x = carat, y = price)) +
geom_point()
Based on the above figure, the relationship between carat and price appears to be positive and roughly linear
library(ggplot2)
ggplot(mydataC, aes(x = price)) +
geom_histogram(binwidth = 2500, colour="gray") +
ylab("Frequency")
Based on the above histogram, it is clear that price is strongly right-skewed rather than normally distributed, so the Spearman coefficient is the more suitable measure. We nevertheless report both the Pearson and the Spearman coefficients
Because the sample is very large, the normality requirement matters less here than it would for a small sample
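A numeric complement to the histogram is the sample skewness. The sketch below is only an illustration: `skewness_g1()` is a hypothetical helper written for this example (it does not come from a package), and `sim_price` is a simulated right-skewed vector standing in for price so that the block runs on its own.

```r
# Hypothetical helper (not from any package): moment-based sample skewness.
# Values near 0 suggest symmetry; clearly positive values suggest a right tail.
skewness_g1 <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated stand-in for a right-skewed variable such as price (assumption).
set.seed(1)
sim_price <- rlnorm(1000, meanlog = 8, sdlog = 1)

skewness_g1(sim_price)               # positive for a right-skewed sample
mean(sim_price) > median(sim_price)  # the long right tail pulls the mean up
```

With the data loaded, `skewness_g1(mydataC$price)` applies the same check. The summary above already shows the mean price (3933) well above the median (2401), which points in the same direction.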
cor(mydataC$carat, mydataC$price,
method = "pearson")
## [1] 0.9215913
The above value of 0.92 shows there is a strong positive linear correlation between the weight of a diamond and its price
# cor.test() has no `use` argument; it drops incomplete pairs on its own
cor.test(mydataC$carat, mydataC$price,
         method = "pearson")
##
## Pearson's product-moment correlation
##
## data: mydataC$carat and mydataC$price
## t = 551.41, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9203098 0.9228530
## sample estimates:
## cor
## 0.9215913
H0: There is no correlation between the weight of a diamond and its price
H1: There is a correlation between the weight of a diamond and its price
At p<0.001, we can reject H0 and say there is a linear correlation between the weight of a diamond and its price based on the Pearson correlation coefficient
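To see where the test statistic and confidence interval come from, they can be rebuilt from the correlation and the sample size alone. This is a sketch that copies `r` and `n` from the output above; it uses the standard t statistic for a correlation and the Fisher z-transformation for the interval. Small differences in the last digits are possible because `r` is entered in rounded form.

```r
# Values copied from the cor.test() output above.
r <- 0.9215913
n <- 53940

# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

t_stat  # close to the reported t = 551.41
ci      # close to the reported interval 0.9203 to 0.9229
```

The very large t value reflects the sample size as much as the strength of the correlation, which is why the coefficient itself, not the p-value, is used to describe the strength of the relationship.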
Using the Spearman correlation coefficient because the data is not normally distributed:
cor(mydataC$price, mydataC$carat,
method = "spearman",
use = "complete.obs")
## [1] 0.9628828
Like the Pearson coefficient, the Spearman coefficient (0.96) shows a strong positive association between the weight and price of a diamond; because it is rank-based, it measures a monotonic relationship rather than a strictly linear one
H0: There is no correlation between the weight of a diamond and its price
H1: There is a correlation between the weight of a diamond and its price
# cor.test() has no `use` argument; it drops incomplete pairs on its own
cor.test(mydataC$price, mydataC$carat,
         method = "spearman",
         exact = FALSE)
##
## Spearman's rank correlation rho
##
## data: mydataC$price and mydataC$carat
## S = 9.7086e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9628828
At p<0.001, we can reject H0 and say there is a monotonic correlation between the weight of a diamond and its price based on the Spearman correlation coefficient
Based on the Pearson (0.92) and Spearman (0.96) correlation coefficients, we can say there is a strong positive correlation between the weight and price of a diamond
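The two coefficients answer slightly different questions: Pearson measures how close the points are to a straight line, while Spearman measures how consistently one variable increases with the other. The toy example below uses made-up data (not the diamonds data) in which `y` always increases with `x` but along a curve.

```r
# Hypothetical data: y rises with x at every step, but not along a line.
x <- 1:20
y <- exp(x / 3)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

pearson   # below 1: the points do not fall on a straight line
spearman  # equal to 1 (up to floating point): the ranks agree perfectly
```

A Spearman coefficient that is higher than the Pearson coefficient, as in the diamonds results (0.96 versus 0.92), is consistent with a relationship that is monotonic but not perfectly linear.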
If there is a relationship between gender and political party choice in the United Kingdom
library(carData)
mydataE <- force(BEPS)
head(mydataE)
## vote age economic.cond.national economic.cond.household Blair
## 1 Liberal Democrat 43 3 3 4
## 2 Labour 36 4 4 4
## 3 Labour 35 4 4 5
## 4 Labour 24 4 2 2
## 5 Labour 41 2 2 1
## 6 Labour 47 3 4 4
## Hague Kennedy Europe political.knowledge gender
## 1 1 4 2 2 female
## 2 4 4 5 2 male
## 3 2 3 3 2 male
## 4 1 3 4 0 female
## 5 1 4 6 2 male
## 6 4 2 4 2 male
Unit of observation: One person
Sample size: 1525
vote: Party choice: Conservative, Labour, or Liberal Democrat
age: in years
economic.cond.national: Assessment of current national economic conditions, 1 to 5.
economic.cond.household: Assessment of current household economic conditions, 1 to 5.
Blair: Assessment of the Labour leader, 1 to 5.
Hague: Assessment of the Conservative leader, 1 to 5.
Kennedy: Assessment of the leader of the Liberal Democrats, 1 to 5.
Europe: an 11-point scale that measures respondents’ attitudes toward European integration. High scores represent ‘Eurosceptic’ sentiment.
political.knowledge: Knowledge of parties’ positions on European integration, 0 to 3.
gender: female or male
Source: J. Fox and R. Andersen (2006) Effect displays for multinomial and proportional-odds logit models. Sociological Methodology 36, 225–255
mydataE$voteF <- factor(mydataE$vote,
levels = c("Conservative", "Labour", "Liberal Democrat"),
labels = c("Conservative", "Labour", "Liberal Democrat"))
mydataE$genderF <- factor(mydataE$gender,
levels = c("male", "female"),
labels = c("male", "female"))
results <- chisq.test(mydataE$genderF, mydataE$voteF,
correct = FALSE)
results
##
## Pearson's Chi-squared test
##
## data: mydataE$genderF and mydataE$voteF
## X-squared = 2.2228, df = 2, p-value = 0.3291
H0: There is no association between gender and party choice
H1: There is an association between gender and party choice
As p = 0.329 > 0.05, we fail to reject H0: the data provide no evidence of an association between gender and political party choice in the United Kingdom
addmargins(results$observed)
## mydataE$voteF
## mydataE$genderF Conservative Labour Liberal Democrat Sum
## male 203 348 162 713
## female 259 372 181 812
## Sum 462 720 343 1525
round(results$expected)
## mydataE$voteF
## mydataE$genderF Conservative Labour Liberal Democrat
## male 216 337 160
## female 246 383 183
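The expected frequencies above are the counts that would arise if gender and vote were independent: each one is the row total times the column total divided by the sample size. The sketch below rebuilds them, and the test statistic, in base R from the observed counts reported in this section; the object names are chosen for this example only.

```r
# Observed counts from the contingency table (rows: gender, columns: vote).
observed <- matrix(
  c(203, 348, 162,
    259, 372, 181),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("male", "female"),
                  vote   = c("Conservative", "Labour", "Liberal Democrat"))
)

n        <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic, degrees of freedom, and p-value.
x2 <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p  <- pchisq(x2, df = df, lower.tail = FALSE)

round(expected)                       # same values as the expected table
round(c(X2 = x2, df = df, p = p), 4)  # agrees with the chisq.test() output
all(expected >= 5)                    # common rule of thumb for the test
```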
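Assumptions: every expected frequency is greater than 5 (the smallest, for male Liberal Democrat voters, is about 160), and each respondent is counted in exactly one cell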
round(results$residuals, 2)
## mydataE$voteF
## mydataE$genderF Conservative Labour Liberal Democrat
## male -0.88 0.62 0.13
## female 0.83 -0.58 -0.12
Since all residuals are smaller than 1.96 in absolute value, none of the differences between the observed and expected frequencies is statistically significant
More males than expected vote Labour or Liberal Democrat, and fewer than expected vote Conservative
More females than expected vote Conservative, and fewer than expected vote Labour or Liberal Democrat
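The residuals printed above are Pearson residuals, (observed - expected) / sqrt(expected). As a sketch, they can be recomputed in base R from the observed counts; the last line also extracts the adjusted standardized residuals that `chisq.test()` stores in `stdres`, which are the ones strictly comparable to the 1.96 cut-off.

```r
# Observed counts from the contingency table (rows: gender, columns: vote).
observed <- matrix(
  c(203, 348, 162,
    259, 372, 181),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("male", "female"),
                  vote   = c("Conservative", "Labour", "Liberal Democrat"))
)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residuals: signed distance of each cell from independence.
pearson_res <- (observed - expected) / sqrt(expected)
round(pearson_res, 2)

# No cell exceeds the usual 1.96 threshold in absolute value.
any(abs(pearson_res) > 1.96)

# Adjusted standardized residuals, as computed by chisq.test().
round(chisq.test(observed)$stdres, 2)
```

A positive residual means more people in that cell than independence would predict, and a negative residual means fewer.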
addmargins(round(prop.table(results$observed), 3))
## mydataE$voteF
## mydataE$genderF Conservative Labour Liberal Democrat Sum
## male 0.133 0.228 0.106 0.467
## female 0.170 0.244 0.119 0.533
## Sum 0.303 0.472 0.225 1.000
Of all respondents, 13.3% are males who vote Conservative
addmargins(round(prop.table(results$observed, 1), 3), 2)
## mydataE$voteF
## mydataE$genderF Conservative Labour Liberal Democrat Sum
## male 0.285 0.488 0.227 1.000
## female 0.319 0.458 0.223 1.000
Of the males, 28.5% vote Conservative, 48.8% vote Labour, and 22.7% vote Liberal Democrat
Of the females, 31.9% vote Conservative, 45.8% vote Labour, and 22.3% vote Liberal Democrat
addmargins(round(prop.table(results$observed, 2), 3), 1)
## mydataE$voteF
## mydataE$genderF Conservative Labour Liberal Democrat
## male 0.439 0.483 0.472
## female 0.561 0.517 0.528
## Sum 1.000 1.000 1.000
Of all Conservative voters, 43.9% are male and 56.1% are female
library(effectsize)
effectsize::cramers_v(mydataE$voteF, mydataE$genderF)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.01 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.01)
## [1] "tiny"
## (Rules: funder2019)
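The adjusted Cramér's V of 0.01 is a tiny effect, in line with the non-significant chi-squared test. As a rough check, the value can be approximated by hand from the chi-squared statistic and the sample size. The sketch below computes the classical Cramér's V and a bias-corrected version following Bergsma (2013); that this correction is what the `(adj.)` label in the output refers to is an assumption here, not something the output itself states.

```r
x2 <- 2.2228  # X-squared from the chi-squared test above
n  <- 1525    # sample size
r  <- 2       # number of gender levels
k  <- 3       # number of vote levels

# Classical Cramer's V.
v <- sqrt(x2 / (n * min(r - 1, k - 1)))

# Bias-corrected Cramer's V (Bergsma, 2013); assumed to match "(adj.)".
phi2_adj <- max(0, x2 / n - (r - 1) * (k - 1) / (n - 1))
r_adj    <- r - (r - 1)^2 / (n - 1)
k_adj    <- k - (k - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

round(c(V = v, V_adj = v_adj), 2)
```

Both versions are far below the conventional thresholds for even a small effect.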
fisher.test(mydataE$voteF, mydataE$genderF)
##
## Fisher's Exact Test for Count Data
##
## data: mydataE$voteF and mydataE$genderF
## p-value = 0.3291
## alternative hypothesis: two.sided
H0: There is no association between gender and party choice (for a 2 × 2 table this is equivalent to an odds ratio of 1)
H1: There is an association between gender and party choice
As p = 0.329 > 0.05, we fail to reject H0
Fisher's exact test therefore gives no evidence of an association between gender and the way a person votes
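Fisher's exact test can also be run directly on a table of counts. The sketch below rebuilds the 3 × 2 table from the observed frequencies reported earlier and shows that, for a table larger than 2 × 2, `fisher.test()` returns a p-value but no odds-ratio estimate; the object names are chosen for this example only.

```r
# Observed counts from the contingency table (rows: vote, columns: gender).
observed <- matrix(
  c(203, 259,
    348, 372,
    162, 181),
  nrow = 3, byrow = TRUE,
  dimnames = list(vote   = c("Conservative", "Labour", "Liberal Democrat"),
                  gender = c("male", "female"))
)

ft <- fisher.test(observed)

ft$p.value            # exact p-value for the test of independence
is.null(ft$estimate)  # TRUE: an odds ratio is only estimated for 2 x 2 tables
```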