Correlation Analysis

MVA Homework 2

Data

We have a data set that gives us insight into the similarities and differences between female and male employees across several factors. With correlation analysis, we will analyse the relationship between the age of employees and their average monthly salary.

Data source: soleman sarkovish. (2023). Male vs Female [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/5665153

setwd("C:/Users/Neimenovan uporabnik/Documents/R data/R data/Basics/Reading csv data")
mydata <- read.csv("dataset csv.csv")
head(mydata)

##        Date Gender Driving.test.result    Bmi Childeren    Salary    region smoker age
## 1 1/11/2022 female                   5 27.900         0 16884.924 southwest    yes  19
## 2 1/11/2022 female                   4 33.770         1  1725.552 southeast     no  18
## 3 1/11/2022   male                   8 33.000         3  4449.462 southeast     no  28
## 4 1/11/2022   male                   9 22.705         0 21984.471 northwest     no  33
## 5 1/11/2022 female                   4 28.880         0  3866.855 northwest     no  32
## 6 2/11/2022 female                   2 25.740         0  3756.622 southeast     no  31

#install.packages("dplyr")
# Loading the dplyr package
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Selecting only the relevant columns
mydata <- select(mydata, Gender, smoker, Salary, age)

head(mydata)

##   Gender smoker    Salary age
## 1 female    yes 16884.924  19
## 2 female     no  1725.552  18
## 3   male     no  4449.462  28
## 4   male     no 21984.471  33
## 5 female     no  3866.855  32
## 6 female     no  3756.622  31

#Converting categorical variables to factors

mydata$FGender <- factor(mydata$Gender, 
                         levels = c("male", "female"),
                         labels = c("Male","Female"))

# Labels are matched to levels by position: "no" -> Nonsmoker, "yes" -> Smoker
mydata$FSmoker <- factor(mydata$smoker, 
                         levels = c("no", "yes"),
                         labels = c("Nonsmoker","Smoker"))
head(mydata)

##   Gender smoker    Salary age FGender   FSmoker
## 1 female    yes 16884.924  19  Female    Smoker
## 2 female     no  1725.552  18  Female Nonsmoker
## 3   male     no  4449.462  28    Male Nonsmoker
## 4   male     no 21984.471  33    Male Nonsmoker
## 5 female     no  3866.855  32  Female Nonsmoker
## 6 female     no  3756.622  31  Female Nonsmoker
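
R pairs `levels` and `labels` by position, so the order of the two vectors decides which raw value receives which label. A minimal sketch on a toy vector (`raw`, `f` and `check` are illustrative names, not part of the homework data) shows how a cross-tabulation of raw values against the factor can confirm the coding:

```r
# Toy vector standing in for a yes/no column (hypothetical values)
raw <- c("yes", "no", "no", "yes", "no")

# levels[i] is relabelled as labels[i]: "no" -> "Nonsmoker", "yes" -> "Smoker"
f <- factor(raw,
            levels = c("no", "yes"),
            labels = c("Nonsmoker", "Smoker"))

# Each raw value should fall under exactly one, correctly named, label
check <- table(raw, f)
print(check)
```

On the real data the analogous check would presumably be `table(mydata$smoker, mydata$FSmoker)`.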

Unit of observation: Individual employee
Sample size: 354 employees
Variables:
- Salary: Average monthly salary of an individual employee in the sample.
- Age: Age of an individual employee in the sample (years).
- Gender: Gender of an individual employee in the sample (Male, Female).
- Smoker: Smoking or non-smoking individual employee (Smoker, Nonsmoker).

Descriptive statistics

#Showing descriptive statistics
library(psych)
summary(mydata)

##     Gender             smoker              Salary           age          FGender         FSmoker   
##  Length:354         Length:354         Min.   : 1137   Min.   :18.00   Male  :174   Nonsmoker:258  
##  Class :character   Class :character   1st Qu.: 3580   1st Qu.:23.00   Female:180   Smoker   : 96  
##  Mode  :character   Mode  :character   Median :10602   Median :34.00                               
##                                        Mean   :15390   Mean   :37.19                               
##                                        3rd Qu.:23568   3rd Qu.:55.00                               
##                                        Max.   :51195   Max.   :63.00

result <- describe(mydata)

print(result)

##          vars   n     mean       sd   median  trimmed      mad     min      max    range  skew kurtosis     se
## Gender*     1 354     1.49     0.50     1.00     1.49     0.00    1.00     2.00     1.00  0.03    -2.00   0.03
## smoker*     2 354     1.27     0.45     1.00     1.21     0.00    1.00     2.00     1.00  1.03    -0.95   0.02
## Salary      3 354 15389.76 14498.79 10602.39 13510.62 11203.00 1137.01 51194.56 50057.55  0.97    -0.40 770.60
## age         4 354    37.19    15.28    34.00    36.53    17.79   18.00    63.00    45.00  0.39    -1.28   0.81
## FGender*    5 354     1.51     0.50     2.00     1.51     0.00    1.00     2.00     1.00 -0.03    -2.00   0.03
## FSmoker*    6 354     1.27     0.45     1.00     1.21     0.00    1.00     2.00     1.00  1.03    -0.95   0.02

Explanation of selected statistics:

Min: The youngest employee in the sample is 18 years old. The lowest average monthly salary in the sample is 1137 monetary units.
1st quartile: 25% of employees in the sample are 23 years old or younger. 25% of employees in the sample earn 3580 monetary units per month or less on average.
Mean: The average age of employees in the sample is 37.19 years. The average monthly salary of employees in the sample is 15390 monetary units.
In the sample, there are 96 smokers and 258 non-smokers.
In the sample, there are 174 men and 180 women.

Correlation Analysis

Research question: What is the relationship between the age of employees and their average monthly salary?

We would like to use Pearson’s correlation test, which requires all of the following assumptions to be met:

Both variables are numeric.
Normality: both variables, Age and Salary, should be normally distributed.
Linearity: The relationship between Salary and Age should be linear.

#Checking the normality with histogram

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

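ggplot(mydata, aes(x = Salary)) + 
  geom_histogram(binwidth = 5000, colour = "grey71", fill = "hotpink") + 
  scale_x_continuous(breaks = seq(0, 100000, 10000)) +
  ylab("Frequency")

ggplot(mydata, aes(x = age)) + 
  geom_histogram(binwidth = 5, colour = "grey71", fill = "hotpink") + 
  labs(x = "Age", y = "Frequency")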

#Checking the linearity with scatterplot

ggplot(mydata, aes(x = age, y = Salary)) +
  geom_point()

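The plots and the descriptive statistics (Salary, for example, is right-skewed with a skewness of 0.97) indicate that normality is violated. We therefore use a non-parametric alternative, the Spearman rank correlation test.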
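
One way to back up the visual check is a Shapiro-Wilk test, where a small p-value speaks against normality. The sketch below runs on simulated right-skewed values (`skewed` is a made-up stand-in, not the Salary column), so only the pattern carries over:

```r
set.seed(1)
# Simulated right-skewed sample of the same size as the data set (354)
skewed <- rexp(354, rate = 1 / 15000)

# H0: the sample comes from a normal distribution
sw <- shapiro.test(skewed)
print(sw)
```

With the homework data, the same call would presumably be `shapiro.test(mydata$Salary)` and `shapiro.test(mydata$age)`.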

#Performing Spearman correlation test

cor(mydata$Salary, mydata$age,
    method = "spearman")

## [1] 0.5280009

Based on the sample data, there is a positive, moderate relationship between the age of employees and their average monthly salary (rho = 0.53).
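
Spearman's coefficient is Pearson's coefficient computed on ranks, which is why it only requires a monotonic relationship rather than a linear one. A small sketch on made-up numbers (`x` and `y` are illustrative, not the homework data):

```r
# Made-up monotonic but non-linear data
x <- c(18, 23, 28, 34, 41, 55, 63)
y <- x^3

rho_spearman <- cor(x, y, method = "spearman")
rho_on_ranks <- cor(rank(x), rank(y), method = "pearson")
r_pearson    <- cor(x, y, method = "pearson")

print(c(spearman = rho_spearman, ranks = rho_on_ranks, pearson = r_pearson))
```

The first two values coincide because the ranks of `x` and `y` are identical, while Pearson's coefficient on the raw values stays below 1.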

#Checking the significance of correlation coefficient

cor.test(mydata$Salary, mydata$age,
         method = "spearman",
         exact = FALSE)

## 
##  Spearman's rank correlation rho
## 
## data:  mydata$Salary and mydata$age
## S = 3489766, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5280009
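
The S statistic in the output is tied to rho through S = (n^3 - n)(1 - rho) / 6. Plugging in the reported values gives a quick consistency check; because rho is only printed to seven decimals, the result should agree with the reported S only up to rounding:

```r
n   <- 354        # sample size
rho <- 0.5280009  # Spearman's rho as printed in the test output

# Sum of squared rank differences implied by rho
S <- (n^3 - n) * (1 - rho) / 6
print(S)
```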

H0: ρ = 0

H1: ρ ≠ 0

Conclusion

Based on the sample data, we can reject the null hypothesis (p < 0.001) and conclude that there is a positive monotonic relationship between the age of employees and their average monthly salary (rho = 0.53).

Chi squared test

Research question: Is there a significant association between the gender of an employee and whether the employee smokes?

#Creating the contingency table

contingency_table <- addmargins(table(mydata$FGender, mydata$FSmoker))

print(contingency_table)

##         
##          Nonsmoker Smoker Sum
##   Male         120     54 174
##   Female       138     42 180
##   Sum          258     96 354

Checking the assumptions for Pearson’s Chi-squared test:

Independence: every employee is counted in exactly one cell of the table.
Expected frequencies: all expected frequencies should be greater than 5.

- Expected frequency for Male Non-smokers: (174 * 258) / 354 = 126.81
- Expected frequency for Male Smokers: (174 * 96) / 354 = 47.19
- Expected frequency for Female Non-smokers: (180 * 258) / 354 = 131.19
- Expected frequency for Female Smokers: (180 * 96) / 354 = 48.81

All assumptions are met, which means that we can perform Pearson’s Chi-squared test.
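
Expected counts under independence are (row total * column total) / n. A short sketch that rebuilds them from the observed counts in the contingency table; the matrix `observed` is typed in by hand here, and its columns are named by the raw `smoker` coding ("no", "yes") as an assumption about which column is which:

```r
observed <- matrix(c(120, 138, 54, 42), nrow = 2,
                   dimnames = list(Gender = c("Male", "Female"),
                                   smoker = c("no", "yes")))

n <- sum(observed)
# Outer product of the margins divided by n gives all four expected counts
expected <- outer(rowSums(observed), colSums(observed)) / n
print(round(expected, 2))

# Rule of thumb: every expected count should exceed 5
all_above_5 <- all(expected > 5)
print(all_above_5)
```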

#Performing the Pearson Chi squared test

results <- chisq.test(mydata$FGender, mydata$FSmoker,
                      correct = TRUE)

print(results)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata$FGender and mydata$FSmoker
## X-squared = 2.2795, df = 1, p-value = 0.1311
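
The reported statistic can be retraced by hand: with Yates' correction, 0.5 is subtracted from each |observed - expected| before squaring. A sketch from the observed counts (a hand-typed matrix, columns named by the raw `smoker` coding as an assumption):

```r
observed <- matrix(c(120, 138, 54, 42), nrow = 2,
                   dimnames = list(Gender = c("Male", "Female"),
                                   smoker = c("no", "yes")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' continuity correction for a 2x2 table
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Upper-tail probability of a chi-squared distribution with 1 df
p_value <- pchisq(x2_yates, df = 1, lower.tail = FALSE)

print(c(X2 = x2_yates, p = p_value))
```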

H0: There is no association between the gender of an employee and their smoking status.

H1: There is an association between the gender of an employee and their smoking status.

Based on the sample data, we cannot reject the null hypothesis (p = 0.131), so we have no evidence of an association between the gender of an employee and their smoking status.

#Calculating standardized residuals

std_residuals <- results$stdres

print(std_residuals)

##               mydata$FSmoker
## mydata$FGender Nonsmoker    Smoker
##         Male   -1.629381  1.629381
##         Female  1.629381 -1.629381

Based on the standardized residuals, we cannot say that any combination of gender and smoking status occurs more or less often than expected under independence, because all four standardized residuals are smaller than 1.96 in absolute value.
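
Standardized residuals divide each (observed - expected) by its estimated standard error, sqrt(expected * (1 - row share) * (1 - column share)). A sketch that recomputes them from the observed counts (hand-typed matrix, columns named by the raw `smoker` coding as an assumption):

```r
observed <- matrix(c(120, 138, 54, 42), nrow = 2,
                   dimnames = list(Gender = c("Male", "Female"),
                                   smoker = c("no", "yes")))
n <- sum(observed)
row_share <- rowSums(observed) / n
col_share <- colSums(observed) / n
expected  <- outer(row_share, col_share) * n

# (O - E) scaled by the standard error of each cell under independence
std_res <- (observed - expected) /
  sqrt(expected * outer(1 - row_share, 1 - col_share))
print(round(std_res, 6))
```

In a 2x2 table all four residuals share the same absolute value, so a single comparison with 1.96 is enough.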

#Proportion table 1

addmargins(round(prop.table(results$observed), 2))

##               mydata$FSmoker
## mydata$FGender Nonsmoker Smoker  Sum
##         Male        0.34   0.15 0.49
##         Female      0.39   0.12 0.51
##         Sum         0.73   0.27 1.00

Interpretation:

Male Non-Smokers: 34% of all employees are male non-smokers.
Female Smokers: 12% of all employees are female smokers.
Sum of Non-Smokers: 73% of all employees are non-smokers.
Sum of Males: 49% of all employees are male.

#Proportion table 2

addmargins(round(prop.table(results$observed, 1), 2), 2)

##               mydata$FSmoker
## mydata$FGender Nonsmoker Smoker  Sum
##         Male        0.69   0.31 1.00
##         Female      0.77   0.23 1.00

Interpretation:

Male Non-Smokers: 69% of male employees are non-smokers.
Male Smokers: 31% of male employees are smokers.
Female Non-Smokers: 77% of female employees are non-smokers.
Female Smokers: 23% of female employees are smokers.

#Proportion table 3

addmargins(round(prop.table(results$observed, 2), 2), 1)

##               mydata$FSmoker
## mydata$FGender Nonsmoker Smoker
##         Male        0.47   0.56
##         Female      0.53   0.44
##         Sum         1.00   1.00

Interpretation:

Male Non-Smokers: 47% of non-smoking employees are male.
Female Non-Smokers: 53% of non-smoking employees are female.
Male Smokers: 56% of smoking employees are male.
Female Smokers: 44% of smoking employees are female.
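
The three proportion tables differ only in the denominator: the grand total, the row total, or the column total. A compact sketch on the observed counts (hand-typed matrix, columns named by the raw `smoker` coding as an assumption):

```r
observed <- matrix(c(120, 138, 54, 42), nrow = 2,
                   dimnames = list(Gender = c("Male", "Female"),
                                   smoker = c("no", "yes")))

prop_total <- prop.table(observed)      # each cell / grand total
prop_row   <- prop.table(observed, 1)   # each cell / its row total
prop_col   <- prop.table(observed, 2)   # each cell / its column total

print(round(prop_total, 2))
print(round(prop_row, 2))
print(round(prop_col, 2))
```

Row proportions answer "what share of each gender falls into each smoking category", column proportions answer "what share of each smoking category is male or female".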

#Calculating effect size

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cramers_v(mydata$FGender, mydata$FSmoker)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.07              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.07)

## [1] "very small"
## (Rules: funder2019)

The association between gender and smoking status of employees is very small (Cramér's V = 0.07).
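
For a rough cross-check in base R, Cramér's V can be computed from the uncorrected chi-squared statistic as sqrt(X2 / (n * min(r - 1, k - 1))). The sketch below also applies a small-sample bias adjustment following Bergsma (2013); that this is the adjustment behind the "(adj.)" in the output is our assumption, not something the output itself confirms. The matrix is hand-typed, with columns named by the raw `smoker` coding:

```r
observed <- matrix(c(120, 138, 54, 42), nrow = 2,
                   dimnames = list(Gender = c("Male", "Female"),
                                   smoker = c("no", "yes")))
n <- sum(observed)
r <- nrow(observed)
k <- ncol(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Uncorrected chi-squared statistic (no Yates correction)
x2   <- sum((observed - expected)^2 / expected)
phi2 <- x2 / n

# Plain Cramér's V
v_plain <- sqrt(phi2 / min(r - 1, k - 1))

# Bias-adjusted version (assumed formula, Bergsma 2013)
phi2_adj <- max(0, phi2 - (r - 1) * (k - 1) / (n - 1))
r_adj <- r - (r - 1)^2 / (n - 1)
k_adj <- k - (k - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

print(round(c(plain = v_plain, adjusted = v_adj), 2))
```

Under these assumptions the adjusted value rounds to the same 0.07 that `effectsize::cramers_v()` reports, while the unadjusted value is slightly larger.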

Maja Zupan