Ashish Bhambhaney

Research Question 1 (Two Numerical Variables):

Is there a correlation between the weight of a diamond and its price?

library(ggplot2)
mydataB <- force(diamonds)
head(mydataB)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Description:

Unit of observation: A diamond

Sample Size: 53940

Variables:

  1. Carat: Weight of the diamond (0.2–5.01)

  2. Cut: Quality of the cut (Fair, Good, Very Good, Premium, Ideal)

  3. Color: Diamond colour, from D (best) to J (worst)

  4. Clarity: A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

  5. Depth: Total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)

  6. Table: Width of top of diamond relative to widest point (43–95)

  7. Price: Price in US dollars ($326–$18,823)

  8. x: Length in mm (0–10.74)

  9. y: Width in mm (0–58.9)

  10. z: Depth in mm (0–31.8)
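The depth formula in item 5 can be checked against the six rows printed by head(mydataB). The following is a minimal sketch in base R, with the x, y, z and depth values copied from that output. The helper name depth_pct is only illustrative, and the factor of 100 assumes that the depth column is stored as a percentage:

```r
# x, y, z and depth for the six diamonds shown by head(mydataB)
x     <- c(3.95, 3.89, 4.05, 4.20, 4.34, 3.94)
y     <- c(3.98, 3.84, 4.07, 4.23, 4.35, 3.96)
z     <- c(2.43, 2.31, 2.31, 2.63, 2.75, 2.48)
depth <- c(61.5, 59.8, 56.9, 62.4, 63.3, 62.8)

# Hypothetical helper: total depth percentage = 100 * z / mean(x, y)
depth_pct <- function(x, y, z) 100 * 2 * z / (x + y)

# Recomputed and recorded depth side by side
data.frame(recomputed = round(depth_pct(x, y, z), 1), recorded = depth)
```

Small differences between the two columns are to be expected, because the dimensions are printed rounded to two decimals.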

Source: diamonds dataset from the ggplot2 package

In the next dataset, we keep only the required variables

# Select by name rather than by position (columns 1 and 7), so the intent is explicit
mydataC <- mydataB[, c("carat", "price")]
head(mydataC)
## # A tibble: 6 × 2
##   carat price
##   <dbl> <int>
## 1  0.23   326
## 2  0.21   326
## 3  0.23   327
## 4  0.29   334
## 5  0.31   335
## 6  0.24   336

Showing descriptive statistics:

summary(mydataC)
##      carat            price      
##  Min.   :0.2000   Min.   :  326  
##  1st Qu.:0.4000   1st Qu.:  950  
##  Median :0.7000   Median : 2401  
##  Mean   :0.7979   Mean   : 3933  
##  3rd Qu.:1.0400   3rd Qu.: 5324  
##  Max.   :5.0100   Max.   :18823

Explaining a few parameters:

  1. Price mean: Based on this sample, the average price of a diamond is $3933

  2. Min carat: The lightest diamond observed in this sample weighs 0.2 carats

Correlation Analysis

We test whether there is a correlation between the weight of a diamond and its price

library(ggplot2)
ggplot(mydataC, aes(x = carat, y = price)) +
  geom_point()

Based on the above figure, the relationship between carat and price appears to be positive and roughly linear

library(ggplot2)
ggplot(mydataC, aes(x = price)) +
  geom_histogram(binwidth = 2500, colour = "gray") +
  ylab("Frequency")

Based on the above histogram, it is clear that price is not normally distributed (it is strongly right-skewed), so the Spearman coefficient is more suitable. Nevertheless, we do the analysis using both the Pearson and the Spearman coefficients

However, because this is a very large sample, the requirement of normality is less important than it would be for a small sample

cor(mydataC$carat, mydataC$price, 
    method = "pearson")
## [1] 0.9215913

The above value of 0.92 shows there is a strong positive linear correlation between the weight of a diamond and its price

# cor.test() has no `use` argument; it drops incomplete pairs on its own
cor.test(mydataC$carat, mydataC$price, 
         method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  mydataC$carat and mydataC$price
## t = 551.41, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9203098 0.9228530
## sample estimates:
##       cor 
## 0.9215913

H0: There is no correlation between the weight of a diamond and its price

H1: There is a correlation between the weight of a diamond and its price

Since p<0.001, we reject H0 and conclude that there is a linear correlation between the weight of a diamond and its price, based on the Pearson correlation coefficient
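The test statistic and the confidence interval printed by cor.test() can be reproduced from the correlation and the sample size alone. The sketch below uses only the values reported above (r = 0.9215913, n = 53940). It assumes the usual t statistic for a Pearson correlation and a confidence interval based on the Fisher z-transformation:

```r
# Values reported by cor.test() for carat and price
r <- 0.9215913   # sample Pearson correlation
n <- 53940       # number of diamonds (df = n - 2 = 53938)

# t statistic for H0: true correlation equals 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 2)
p_value
round(ci, 4)
```

With a sample this large the standard error is tiny, which is why the interval around 0.92 is so narrow and the p-value is effectively zero.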

Using the Spearman correlation coefficient because the data is not normally distributed:

cor(mydataC$price, mydataC$carat, 
    method = "spearman",
    use = "complete.obs")
## [1] 0.9628828

Similar to the Pearson coefficient, the Spearman coefficient shows a strong positive association between the weight and price of a diamond. Because Spearman's rho is based on ranks, it measures a monotonic relationship rather than a strictly linear one
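To illustrate what the Spearman coefficient measures, the sketch below computes it in two ways on the six (carat, price) pairs shown by head(mydataC): with the built-in method and as a Pearson correlation of the ranks. The six rows are used only as a small illustration, not as a substitute for the full analysis:

```r
# The six (carat, price) pairs shown by head(mydataC)
carat <- c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24)
price <- c(326, 326, 327, 334, 335, 336)

# Spearman's rho is Pearson's r computed on the ranks (ties get average ranks)
rho_builtin <- cor(carat, price, method = "spearman")
rho_manual  <- cor(rank(carat), rank(price), method = "pearson")

all.equal(rho_builtin, rho_manual)
```

Because only the ranks enter the calculation, rho describes how consistently price increases with carat, not whether the increase follows a straight line.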

H0: There is no correlation between the weight of a diamond and its price

H1: There is a correlation between the weight of a diamond and its price

# cor.test() has no `use` argument; it drops incomplete pairs on its own
cor.test(mydataC$price, mydataC$carat, 
         method = "spearman",
         exact = FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  mydataC$price and mydataC$carat
## S = 9.7086e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9628828

Since p<0.001, we reject H0 and conclude that there is a monotonic correlation between the weight of a diamond and its price, based on the Spearman correlation coefficient
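The statistic S in the Spearman output is tied to rho through S = (n^3 - n)(1 - rho) / 6, which, as far as we can tell, is how R derives S from rho. A short sketch that recovers rho from the printed S and the sample size:

```r
S <- 9.7086e11   # test statistic from the cor.test() printout (rounded there)
n <- 53940       # number of diamonds

# Invert S = (n^3 - n) * (1 - rho) / 6
rho_from_S <- 1 - 6 * S / (n^3 - n)

round(rho_from_S, 4)
```

The result should agree with the reported rho up to the rounding of S in the printout.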

Conclusion (Answer to the question):

Based on the Pearson and Spearman correlation coefficients, we conclude that there is a strong positive correlation between the weight and the price of a diamond: heavier diamonds tend to be more expensive

Research Question 2 (Two Categorical Variables):

Is there a relationship between gender and political party choice in the United Kingdom?

library(carData)
mydataE <- force(BEPS)
head(mydataE)
##               vote age economic.cond.national economic.cond.household Blair
## 1 Liberal Democrat  43                      3                       3     4
## 2           Labour  36                      4                       4     4
## 3           Labour  35                      4                       4     5
## 4           Labour  24                      4                       2     2
## 5           Labour  41                      2                       2     1
## 6           Labour  47                      3                       4     4
##   Hague Kennedy Europe political.knowledge gender
## 1     1       4      2                   2 female
## 2     4       4      5                   2   male
## 3     2       3      3                   2   male
## 4     1       3      4                   0 female
## 5     1       4      6                   2   male
## 6     4       2      4                   2   male

Description:

Unit of observation: One person

Sample size: 1525

  1. vote: Party choice: Conservative, Labour, or Liberal Democrat

  2. age: in years

  3. economic.cond.national: Assessment of current national economic conditions, 1 to 5.

  4. economic.cond.household: Assessment of current household economic conditions, 1 to 5.

  5. Blair: Assessment of the Labour leader, 1 to 5.

  6. Hague: Assessment of the Conservative leader, 1 to 5.

  7. Kennedy: Assessment of the leader of the Liberal Democrats, 1 to 5.

  8. Europe: an 11-point scale that measures respondents’ attitudes toward European integration. High scores represent ‘Eurosceptic’ sentiment.

  9. political.knowledge: Knowledge of parties’ positions on European integration, 0 to 3.

  10. gender: female or male

Source: J. Fox and R. Andersen (2006) Effect displays for multinomial and proportional-odds logit models. Sociological Methodology 36, 225–255

mydataE$voteF <- factor(mydataE$vote, 
                        levels = c("Conservative", "Labour", "Liberal Democrat"), 
                        labels = c("Conservative", "Labour", "Liberal Democrat"))
mydataE$genderF <- factor(mydataE$gender, 
                          levels = c("male", "female"), 
                          labels = c("male", "female"))

Pearson’s chi-squared test

results <- chisq.test(mydataE$genderF, mydataE$voteF,
           correct = FALSE)

results
## 
##  Pearson's Chi-squared test
## 
## data:  mydataE$genderF and mydataE$voteF
## X-squared = 2.2228, df = 2, p-value = 0.3291

H0: There is no association between the categorical variables

H1: There is an association between the categorical variables

As p>0.05, we fail to reject H0: the data provide no evidence of an association between gender and political party choice in the United Kingdom

Empirical Frequencies

addmargins(results$observed)
##                mydataE$voteF
## mydataE$genderF Conservative Labour Liberal Democrat  Sum
##          male            203    348              162  713
##          female          259    372              181  812
##          Sum             462    720              343 1525

Expected Frequencies

round(results$expected)
##                mydataE$voteF
## mydataE$genderF Conservative Labour Liberal Democrat
##          male            216    337              160
##          female          246    383              183
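The expected frequencies and the test statistic can be reproduced by hand from the observed counts. The sketch below copies the counts from the table of empirical frequencies and applies the standard formulas (expected = row total × column total / n). The object names are illustrative:

```r
# Observed counts copied from the table of empirical frequencies
observed <- matrix(c(203, 348, 162,
                     259, 372, 181),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gender = c("male", "female"),
                                   vote = c("Conservative", "Labour",
                                            "Liberal Democrat")))
n <- sum(observed)

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic, degrees of freedom and p-value
chi_sq <- sum((observed - expected)^2 / expected)
dof    <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_val  <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

round(expected)
round(c(chi_sq = chi_sq, df = dof, p_value = p_val), 4)
```

Each cell contributes (observed - expected)^2 / expected to the statistic, so cells where the two tables differ most contribute most.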

Assumptions:

  1. Independent observations -> satisfied
  2. All expected frequencies are at least 1 -> satisfied
  3. At most 20% of expected frequencies are below 5 -> satisfied
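Assumptions 2 and 3 can be checked directly on the expected frequencies. This is a small sketch that rebuilds the table from the observed counts; the thresholds (1 and 5, at most 20% of cells) follow the rule of thumb listed above:

```r
# Observed counts (rows: male, female; columns: Conservative, Labour, Liberal Democrat)
observed <- matrix(c(203, 348, 162,
                     259, 372, 181), nrow = 2, byrow = TRUE)

expected <- chisq.test(observed, correct = FALSE)$expected

# Assumption 2: no expected frequency below 1
all_at_least_1 <- all(expected >= 1)

# Assumption 3: share of cells with an expected frequency below 5
share_below_5 <- mean(expected < 5)

all_at_least_1
share_below_5 <= 0.20
```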

Residuals

round(results$residuals, 2)
##                mydataE$voteF
## mydataE$genderF Conservative Labour Liberal Democrat
##          male          -0.88   0.62             0.13
##          female         0.83  -0.58            -0.12

Since all residuals are smaller than 1.96 in absolute value, none of the differences between the observed and the expected frequencies is statistically significant

More males than expected vote for Labour or the Liberal Democrats, and fewer than expected vote Conservative

More females than expected vote Conservative, and fewer than expected vote for Labour or the Liberal Democrats
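chisq.test() returns two kinds of residuals: Pearson residuals (the ones printed above) and standardized residuals in the `stdres` component. The standardized version is approximately standard normal under H0, so it is the one usually compared with ±1.96. A sketch on the observed counts, with illustrative object names:

```r
observed <- matrix(c(203, 348, 162,
                     259, 372, 181),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gender = c("male", "female"),
                                   vote = c("Conservative", "Labour",
                                            "Liberal Democrat")))
test <- chisq.test(observed, correct = FALSE)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson_res <- (observed - test$expected) / sqrt(test$expected)

# Standardized residuals, additionally scaled by the row and column proportions
std_res <- test$stdres

round(pearson_res, 2)
round(std_res, 2)
all(abs(std_res) < 1.96)
```

If the last line returns TRUE, the conclusion that no cell deviates significantly holds for the standardized residuals as well.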

Frequency tables

addmargins(round(prop.table(results$observed), 3))
##                mydataE$voteF
## mydataE$genderF Conservative Labour Liberal Democrat   Sum
##          male          0.133  0.228            0.106 0.467
##          female        0.170  0.244            0.119 0.533
##          Sum           0.303  0.472            0.225 1.000

Out of all respondents, 13.3% are males who vote Conservative

addmargins(round(prop.table(results$observed, 1), 3), 2)
##                mydataE$voteF
## mydataE$genderF Conservative Labour Liberal Democrat   Sum
##          male          0.285  0.488            0.227 1.000
##          female        0.319  0.458            0.223 1.000

  1. 28.5% of males vote Conservative, 48.8% vote Labour, and 22.7% vote Liberal Democrat

  2. 31.9% of females vote Conservative, 45.8% vote Labour, and 22.3% vote Liberal Democrat

addmargins(round(prop.table(results$observed, 2), 3), 1)
##                mydataE$voteF
## mydataE$genderF Conservative Labour Liberal Democrat
##          male          0.439  0.483            0.472
##          female        0.561  0.517            0.528
##          Sum           1.000  1.000            1.000

Of all Conservative voters, 43.9% are male and 56.1% are female
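The three tables above differ only in the `margin` argument of prop.table(), which decides what each count is divided by. A compact sketch on the observed counts (object names are illustrative):

```r
observed <- matrix(c(203, 348, 162,
                     259, 372, 181),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gender = c("male", "female"),
                                   vote = c("Conservative", "Labour",
                                            "Liberal Democrat")))

joint  <- prop.table(observed)              # divide by the grand total
by_row <- prop.table(observed, margin = 1)  # divide by row totals: vote shares within gender
by_col <- prop.table(observed, margin = 2)  # divide by column totals: gender shares within party

# The same cell (male, Conservative) under the three views
round(c(joint  = joint["male", "Conservative"],
        by_row = by_row["male", "Conservative"],
        by_col = by_col["male", "Conservative"]), 3)
```

The same count of 203 male Conservative voters therefore yields three different percentages, depending on whether the reference group is all respondents, all males, or all Conservative voters.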

library(effectsize)
effectsize::cramers_v(mydataE$voteF, mydataE$genderF)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.01              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
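Cramér's V can also be computed by hand from the chi-squared statistic. The sketch below uses the values reported earlier (X-squared = 2.2228, n = 1525). It assumes that the "(adj.)" value is a small-sample bias-corrected version of V (Bergsma's correction); that is our reading of the output, not something the printout states:

```r
chi_sq <- 2.2228   # X-squared reported by chisq.test()
n      <- 1525     # sample size
r      <- 2        # number of gender categories
k      <- 3        # number of party categories

# Unadjusted Cramer's V
v_raw <- sqrt(chi_sq / (n * min(r - 1, k - 1)))

# Bias-corrected version (assumed to correspond to the "(adj.)" output)
phi2_adj <- max(0, chi_sq / n - (r - 1) * (k - 1) / (n - 1))
r_adj    <- r - (r - 1)^2 / (n - 1)
k_adj    <- k - (k - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

round(c(raw = v_raw, adjusted = v_adj), 2)
```

Both versions are close to zero, so the choice between them does not change the interpretation here.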
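
interpret_cramers_v(0.01)
## [1] "tiny"
## (Rules: funder2019)

The adjusted Cramér's V of 0.01 indicates a tiny effect size, which is consistent with the non-significant chi-squared test

fisher.test(mydataE$voteF, mydataE$genderF)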
## 
##  Fisher's Exact Test for Count Data
## 
## data:  mydataE$voteF and mydataE$genderF
## p-value = 0.3291
## alternative hypothesis: two.sided

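H0: There is no association between gender and party choice

H1: There is an association between gender and party choice

As p>0.05, we fail to reject H0. Because the table is 2×3 rather than 2×2, Fisher's exact test here concerns the independence of the two variables, not a single odds ratio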
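Fisher's exact test can also be run on the contingency table itself. The sketch below rebuilds the table from the observed counts and inspects the result. For tables larger than 2×2, fisher.test() returns a p-value but no odds-ratio estimate, which the last line checks:

```r
observed <- matrix(c(203, 348, 162,
                     259, 372, 181),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gender = c("male", "female"),
                                   vote = c("Conservative", "Labour",
                                            "Liberal Democrat")))

ft <- fisher.test(observed)

ft$p.value             # exact p-value for independence of gender and vote
is.null(ft$estimate)   # TRUE means no odds ratio is reported for this table
```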

Conclusion (Answer to the question):

We fail to reject the null hypothesis of no association: in this sample there is no evidence that gender is related to the way a person votes