MVA Homework 2

Data

The data set describes similarities and differences between female and male employees across several factors. Using correlation analysis, we examine the relationship between the age of employees and their average monthly salary.

Data source: soleman sarkovish. (2023). Male vs Female [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/5665153

setwd("C:/Users/Neimenovan uporabnik/Documents/R data/R data/Basics/Reading csv data")
mydata <- read.csv("dataset csv.csv")
head(mydata)
##        Date Gender Driving.test.result    Bmi Childeren    Salary    region smoker age
## 1 1/11/2022 female                   5 27.900         0 16884.924 southwest    yes  19
## 2 1/11/2022 female                   4 33.770         1  1725.552 southeast     no  18
## 3 1/11/2022   male                   8 33.000         3  4449.462 southeast     no  28
## 4 1/11/2022   male                   9 22.705         0 21984.471 northwest     no  33
## 5 1/11/2022 female                   4 28.880         0  3866.855 northwest     no  32
## 6 2/11/2022 female                   2 25.740         0  3756.622 southeast     no  31
#install.packages("dplyr")
# Loading the dplyr package
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Selecting only the relevant columns
mydata <- select(mydata, Gender, smoker, Salary, age)

head(mydata)
##   Gender smoker    Salary age
## 1 female    yes 16884.924  19
## 2 female     no  1725.552  18
## 3   male     no  4449.462  28
## 4   male     no 21984.471  33
## 5 female     no  3866.855  32
## 6 female     no  3756.622  31
#Factorizing

mydata$FGender <- factor(mydata$Gender, 
                         levels = c("male", "female"),
                         labels = c("Male","Female"))

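# Note: as written, "no" is labelled "Smoker" and "yes" is labelled "Nonsmoker";
# every smoking-status table below uses this labelling.
mydata$FSmoker <- factor(mydata$smoker, 
                         levels = c("no", "yes"),
                         labels = c("Smoker","Nonsmoker"))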
head(mydata)
##   Gender smoker    Salary age FGender   FSmoker
## 1 female    yes 16884.924  19  Female Nonsmoker
## 2 female     no  1725.552  18  Female    Smoker
## 3   male     no  4449.462  28    Male    Smoker
## 4   male     no 21984.471  33    Male    Smoker
## 5 female     no  3866.855  32  Female    Smoker
## 6 female     no  3756.622  31  Female    Smoker

Descriptive statistics

#Showing descriptive statistics
library(psych)
summary(mydata)
##     Gender             smoker              Salary           age          FGender         FSmoker   
##  Length:354         Length:354         Min.   : 1137   Min.   :18.00   Male  :174   Smoker   :258  
##  Class :character   Class :character   1st Qu.: 3580   1st Qu.:23.00   Female:180   Nonsmoker: 96  
##  Mode  :character   Mode  :character   Median :10602   Median :34.00                               
##                                        Mean   :15390   Mean   :37.19                               
##                                        3rd Qu.:23568   3rd Qu.:55.00                               
##                                        Max.   :51195   Max.   :63.00
result <- describe(mydata)
print(result)
##          vars   n     mean       sd   median  trimmed      mad     min      max    range  skew kurtosis     se
## Gender*     1 354     1.49     0.50     1.00     1.49     0.00    1.00     2.00     1.00  0.03    -2.00   0.03
## smoker*     2 354     1.27     0.45     1.00     1.21     0.00    1.00     2.00     1.00  1.03    -0.95   0.02
## Salary      3 354 15389.76 14498.79 10602.39 13510.62 11203.00 1137.01 51194.56 50057.55  0.97    -0.40 770.60
## age         4 354    37.19    15.28    34.00    36.53    17.79   18.00    63.00    45.00  0.39    -1.28   0.81
## FGender*    5 354     1.51     0.50     2.00     1.51     0.00    1.00     2.00     1.00 -0.03    -2.00   0.03
## FSmoker*    6 354     1.27     0.45     1.00     1.21     0.00    1.00     2.00     1.00  1.03    -0.95   0.02

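Explanation of a few parameters:

Mean of Salary: the average monthly salary of the 354 employees in the sample is 15389.76.

Median of age: half of the employees in the sample are 34 years old or younger, and the other half are older than 34.

Max of Salary: the highest average monthly salary in the sample is 51194.56.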
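As a small worked check, the se and range columns of the table can be reproduced from the other columns. This is only a sketch that re-uses the rounded values printed above, so the results agree with the table up to rounding.

```r
# Values copied from the descriptive statistics table above
n         <- 354
sd_salary <- 14498.79
sd_age    <- 15.28

# Standard error of the mean: se = sd / sqrt(n)
se_salary <- sd_salary / sqrt(n)
se_age    <- sd_age / sqrt(n)

# Range: max - min
range_salary <- 51194.56 - 1137.01

print(round(c(se_salary = se_salary, se_age = se_age,
              range_salary = range_salary), 2))
```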

Correlation Analysis

We can use Pearson’s correlation test only if all of its assumptions are met: both variables are numeric, the relationship between them is linear, and both variables are normally distributed.

#Checking the normality with histogram

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
# A histogram needs only one variable on the x axis; counts go on the y axis
ggplot(mydata, aes(x = Salary)) + 
  geom_histogram(binwidth = 5000, colour = "grey71", fill = "hotpink") + 
  scale_x_continuous(breaks = seq(0, 100000, 10000)) +
  ylab("Frequency")

#Checking the linearity with scatterplot

library(ggplot2)
ggplot(mydata, aes(x = age, y = Salary)) +
  geom_point()

Salary is right-skewed (skewness 0.97, mean well above the median), so the normality assumption is violated. We therefore use a non-parametric test, the Spearman correlation test.

#Performing Spearman correlation test

cor(mydata$Salary, mydata$age,
    method = "spearman")
## [1] 0.5280009

Based on the sample data, there is a positive, moderate relationship between age and average monthly salary of employees (rho = 0.53).

#Checking the significance of correlation coefficient

cor.test(mydata$Salary, mydata$age,
         method = "spearman",
         exact = FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  mydata$Salary and mydata$age
## S = 3489766, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5280009

H0: rho = 0

H1: rho ≠ 0

Conclusion

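Based on the sample data, we can reject the null hypothesis (p < 0.001) and conclude that there is a monotonic relationship between the age of employees and their average monthly salary. Spearman's test works on ranks, so it does not establish that the relationship is linear.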
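To make the rank-based nature of the test concrete, the sketch below shows that Spearman's rho is Pearson's correlation computed on ranks, using the six rows printed by head(mydata) as a tiny illustration. It also recovers the reported S statistic from rho and n, assuming the relation S = (n^3 - n)(1 - rho) / 6.

```r
# First six rows of age and Salary, copied from head(mydata) above
age    <- c(19, 18, 28, 33, 32, 31)
salary <- c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)

# Spearman's rho is Pearson's r computed on the ranks
rho_builtin <- cor(age, salary, method = "spearman")
rho_ranks   <- cor(rank(age), rank(salary), method = "pearson")
print(c(builtin = rho_builtin, on_ranks = rho_ranks))

# Recovering the S statistic from the reported rho and sample size
n   <- 354
rho <- 0.5280009
S   <- (n^3 - n) * (1 - rho) / 6
print(S)  # close to the reported S = 3489766
```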

Chi squared test

#Creating the contingency table

contingency_table <- addmargins(table(mydata$FGender, mydata$FSmoker))

print(contingency_table)
##         
##          Smoker Nonsmoker Sum
##   Male      120        54 174
##   Female    138        42 180
##   Sum       258        96 354

Checking the assumptions for Pearson’s Chi-Squared test:

++ Expected frequency for Male Smokers: (174 * 258) / 354 = 126.81

++ Expected frequency for Male Non-smokers: (174 * 96) / 354 = 47.19

++ Expected frequency for Female Smokers: (180 * 258) / 354 = 131.19

++ Expected frequency for Female Non-smokers: (180 * 96) / 354 = 48.81

All expected frequencies are greater than 5, so the assumptions are met and we can perform Pearson’s Chi-squared test.
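The expected frequencies can also be computed in one step from the contingency table. The sketch below rebuilds the table from the counts printed above (the object names observed and expected are illustrative) and applies row total * column total / n to every cell.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(120, 138, 54, 42), nrow = 2,
                   dimnames = list(Gender  = c("Male", "Female"),
                                   Smoking = c("Smoker", "Nonsmoker")))

# Expected count per cell: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 2))

# Rule of thumb used in the text: every expected count should exceed 5
print(all(expected > 5))
```

The same matrix is available as chisq.test(observed)$expected.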

#Performing the Pearson Chi squared test

results <- chisq.test(mydata$FGender, mydata$FSmoker,
                      correct = TRUE)

print(results)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata$FGender and mydata$FSmoker
## X-squared = 2.2795, df = 1, p-value = 0.1311

H0: There is no association between gender of an employee and his/her smoking status.

H1: There is an association between gender of an employee and his/her smoking status.

Based on the sample data, we cannot reject the null hypothesis (p = 0.131). We therefore have no evidence of an association between gender of an employee and his/her smoking status.
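For reference, the test statistic with Yates' continuity correction can be reproduced by hand from the counts in the contingency table. This is a minimal sketch; the object names are illustrative, and the 0.5 correction is applied directly because every |observed - expected| here is larger than 0.5.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(120, 138, 54, 42), nrow = 2,
                   dimnames = list(Gender  = c("Male", "Female"),
                                   Smoking = c("Smoker", "Nonsmoker")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' correction: subtract 0.5 from each |O - E| before squaring
chi_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

# df = (rows - 1) * (columns - 1) = 1 for a 2 x 2 table
p_value <- pchisq(chi_yates, df = 1, lower.tail = FALSE)

print(round(c(X_squared = chi_yates, p_value = p_value), 4))
```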

#Calculating standardized residuals

std_residuals <- results$stdres

print(std_residuals)
##               mydata$FSmoker
## mydata$FGender    Smoker Nonsmoker
##         Male   -1.629381  1.629381
##         Female  1.629381 -1.629381

Based on the calculated standardized residuals, we cannot say that gender had an effect on the smoking status of an employee, because all four standardized residuals are smaller than 1.96 in absolute value.
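The standardized residuals can be derived from the observed and expected counts. The sketch below assumes the usual definition, (O - E) / sqrt(E * (1 - row proportion) * (1 - column proportion)), and rebuilds the table from the counts printed earlier; the object names are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(120, 138, 54, 42), nrow = 2,
                   dimnames = list(Gender  = c("Male", "Female"),
                                   Smoking = c("Smoker", "Nonsmoker")))
n        <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Marginal proportions of each row and each column
row_prop <- rowSums(observed) / n
col_prop <- colSums(observed) / n

# Standardized residual for every cell
std_res <- (observed - expected) /
  sqrt(expected * outer(1 - row_prop, 1 - col_prop))
print(round(std_res, 6))

# Cells with |residual| > 1.96 would deviate significantly at the 5% level
print(abs(std_res) > 1.96)
```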

#Proportion table 1

addmargins(round(prop.table(results$observed), 2))
##               mydata$FSmoker
## mydata$FGender Smoker Nonsmoker  Sum
##         Male     0.34      0.15 0.49
##         Female   0.39      0.12 0.51
##         Sum      0.73      0.27 1.00

Interpretation: Of all 354 employees in the sample, 34% are males in the Smoker category.

#Proportion table 2

addmargins(round(prop.table(results$observed, 1), 2), 2)
##               mydata$FSmoker
## mydata$FGender Smoker Nonsmoker  Sum
##         Male     0.69      0.31 1.00
##         Female   0.77      0.23 1.00

Interpretation: Of all male employees, 69% are in the Smoker category and 31% are in the Nonsmoker category.

#Proportion table 3

addmargins(round(prop.table(results$observed, 2), 2), 1)
##               mydata$FSmoker
## mydata$FGender Smoker Nonsmoker
##         Male     0.47      0.56
##         Female   0.53      0.44
##         Sum      1.00      1.00

Interpretation: Of all employees in the Smoker category, 47% are male and 53% are female.

#Calculating effect size

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cramers_v(mydata$FGender, mydata$FSmoker)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.07              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.07)
## [1] "very small"
## (Rules: funder2019)

The association between gender and smoking status of employees is very small (V = 0.07).
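The reported value is the adjusted Cramér's V. As a rough check in base R, the sketch below computes the unadjusted V from the uncorrected chi-squared statistic and then applies a small-sample bias correction. The correction formula used here is an assumption about how the adjusted value is obtained, not something confirmed by the output above; the object names are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(120, 138, 54, 42), nrow = 2,
                   dimnames = list(Gender  = c("Male", "Female"),
                                   Smoking = c("Smoker", "Nonsmoker")))
n <- sum(observed)
r <- nrow(observed)
k <- ncol(observed)

# Chi-squared statistic without continuity correction
chi2 <- unname(chisq.test(observed, correct = FALSE)$statistic)

# Unadjusted Cramer's V
v_raw <- sqrt(chi2 / (n * min(r - 1, k - 1)))

# Bias-corrected version (assumed formula)
phi2_adj <- max(0, chi2 / n - (r - 1) * (k - 1) / (n - 1))
r_adj    <- r - (r - 1)^2 / (n - 1)
k_adj    <- k - (k - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

print(round(c(V_unadjusted = v_raw, V_adjusted = v_adj), 3))
```

Under these assumptions the adjusted value rounds to 0.07, in line with the reported result, while the unadjusted value is slightly larger.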