If the person’s education level has a significant impact on their salary in India
library(ggplot2)
mydataS <- read.csv("C:/Users/ACER/Desktop/R data/Salary_Data.csv")
head(mydataS)
## Age Gender Education.Level Job.Title Years.of.Experience Salary
## 1 32 Male Bachelor's Software Engineer 5 90000
## 2 28 Female Master's Data Analyst 3 65000
## 3 45 Male PhD Senior Manager 15 150000
## 4 36 Female Bachelor's Sales Associate 7 60000
## 5 52 Male Master's Director 20 200000
## 6 29 Male Bachelor's Marketing Analyst 2 55000
Unit of observation: A person
Sample Size: 6702
Variables:
Age: Age in years
Gender: Male or Female
Education.Level: bachelors, Masters, Phd, or High School
Job Title: Position Held
Years of Experience: Number of years of work experience
Salary: Salary in INR
Source: https://www.kaggle.com/datasets/mohithsairamreddy/salary-data
Since the sample size is quite large, we take 550 random people for our analysis
set.seed(1)
mydataNew <- mydataS[sample(nrow(mydataS), 550), ]
mydataNew$EduLevel <- factor(mydataNew$Education.Level,
levels = c("Bachelor's", "Master's", "PhD", "High School"),
labels = c("Bachelor's", "Master's", "PhD", "High School"))
mydataNew$GenderF <- factor(mydataNew$Gender,
levels = c("Male", "Female"),
labels = c("Male", "Female"))
summary(mydataNew[,-c(2,3,4)])
## Age Years.of.Experience Salary EduLevel
## Min. :21.00 Min. : 0.000 Min. : 25000 Bachelor's :254
## 1st Qu.:28.00 1st Qu.: 3.000 1st Qu.: 65000 Master's :143
## Median :31.00 Median : 6.000 Median :115000 PhD :117
## Mean :33.16 Mean : 7.655 Mean :113561 High School: 36
## 3rd Qu.:37.00 3rd Qu.:11.000 3rd Qu.:160000
## Max. :58.00 Max. :32.000 Max. :250000
## GenderF
## Male :292
## Female:258
##
##
##
##
Years of experience mean: Based on this sample, the average work experience of a person in India is 7.655 years
Salary Min: The minimum salary of a person in India observed in this sample is Rs 25000
Since this data is an independent sample with three or more arithmetic means, we do the one way analysis of variance- ANOVA as a parametric test, and if the assumption of normality is not met, we do the Kruskal-Wallis Rank Sum Test as a non parametric test
Analysed variable is numeric: Satisfied, because price is a numeric variable
Variable in the population is normally distributed within each group. We do the Shapiro WIlk Test for checking this
H0: The data within the group is normally distributed
H1: The data within the group is not normally distributed
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydataNew %>%
group_by(EduLevel) %>%
shapiro_test(Salary)
## # A tibble: 4 × 4
## EduLevel variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Bachelor's Salary 0.913 5.72e-11
## 2 Master's Salary 0.985 1.33e- 1
## 3 PhD Salary 0.878 2.33e- 8
## 4 High School Salary 0.506 7.39e-10
Based on the above values, we reject H0 for all the groups except Masters, which means
neither of the groups other than Masters have a normal distribution of the variable salary
c) Homoscedasticity: the variance of analyzed variable is the same within all groups:
We do the Levene test for checking this
H0: The variance of salary within the groups Bachelors, Masters, PhD, High School
is the same
H1: At least one group has a different variance
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(mydataNew$Salary, group = mydataNew$EduLevel)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 8.8463 9.829e-06 ***
## 546
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the above value, we reject H0 at p<0.001. We perform Welch’s𝐹 test, which is less impacted by the conditions of heteroskedasticity
library(onewaytests)
welch.test(Salary ~ EduLevel,
data = mydataNew)
##
## Welch's Heteroscedastic F Test (alpha = 0.05)
## -------------------------------------------------------------
## data : Salary and EduLevel
##
## statistic : 172.449
## num df : 3
## denom df : 160.7844
## p.value : 4.993296e-50
##
## Result : Difference is statistically significant.
## -------------------------------------------------------------
library(ggplot2)
ggplot(mydataNew, aes(x = Salary)) +
geom_histogram(binwidth = 2500, colour="gray") +
facet_wrap(~EduLevel, ncol = 20) +
ylab("Frequency")
There are few outliers in the High School category due to the sample size being small
Conclusion: Since the assumption of normality is not met, a non parametric test is suitable for this sample
As the assumption of normality is not met, we use Kruskal-Wallis test
H0: The location distribution of salary is the same for all the groups
H1: At least one is different
kruskal.test(Salary ~ EduLevel,
data = mydataNew)
##
## Kruskal-Wallis rank sum test
##
## data: Salary by EduLevel
## Kruskal-Wallis chi-squared = 208.08, df = 3, p-value < 2.2e-16
As p<0.01, we reject H0, which means we reject the statement that the location distribution of salary is the same for all the groups
library(rstatix)
kruskal_effsize(Salary ~ EduLevel,
data = mydataNew)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 Salary 550 0.376 eta2[H] large
pwc <- mydataNew %>%
wilcox_test(Salary ~ EduLevel,
paired = FALSE,
p.adjust.method = "bonferroni")
Kruskal_results <- kruskal_test(Salary ~ EduLevel,
data = mydataNew)
library(rstatix)
pwc <- pwc %>%
add_y_position(fun = "median", step.increase = 0.35)
library(ggpubr)
ggboxplot(mydataNew, x = "EduLevel", y = "Salary", add = "point", ylim=c(0, 300000)) +
stat_pvalue_manual(pwc, hide.ns = FALSE) +
stat_summary(fun = median, geom = "point", shape = 16, size = 4,
aes(group = EduLevel), color = "darkred",
position = position_dodge(width = 0.8)) +
stat_summary(fun = median, colour = "darkred",
position = position_dodge(width = 0.8),
geom = "text", vjust = -0.5, hjust = -1,
aes(label = round(after_stat(y), digits = 2), group = EduLevel)) +
labs(subtitle = get_test_label(Kruskal_results, detailed = TRUE),
caption = get_pwc_label(pwc))
Along with the non parametric test we perform the parametric test as well
H0: Mean(Phd) = Mean(Masters) = Mean(Bachelors) = Mean(High School)
H1: At least one mean is different
ANOVA_results <- aov(Salary ~ EduLevel,
data = mydataNew)
summary(ANOVA_results)
## Df Sum Sq Mean Sq F value Pr(>F)
## EduLevel 3 5.675e+11 1.892e+11 111.2 <2e-16 ***
## Residuals 546 9.288e+11 1.701e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the above values, we reject H0, which implies we reject Mean(Phd) = Mean(Masters) = Mean(Bachelors) = Mean(High School)
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
eta_squared(ANOVA_results)
## For one-way between subjects designs, partial eta squared is equivalent
## to eta squared. Returning eta squared.
## # Effect Size for ANOVA
##
## Parameter | Eta2 | 95% CI
## -------------------------------
## EduLevel | 0.38 | [0.33, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_eta_squared(0.379, rules = "cohen1992")
## [1] "large"
## (Rules: cohen1992)
pwc <- mydataNew %>%
pairwise_t_test(Salary ~ EduLevel,
paired = FALSE,
p.adjust.method = "bonferroni")
ANOVA_results <- anova_test(Salary ~ EduLevel,
data = mydataNew)
library(rstatix)
pwc <- pwc %>%
add_y_position(fun = "median", step.increase = 0.35)
library(ggpubr)
ggboxplot(mydataNew, x = "EduLevel", y = "Salary", add = "point", ylim=c(0, 300000)) +
stat_pvalue_manual(pwc, hide.ns = FALSE) +
stat_summary(fun = mean, geom = "point", shape = 16, size = 4,
aes(group = EduLevel), color = "darkred",
position = position_dodge(width = 0.5)) +
stat_summary(fun = mean, colour = "darkred",
position = position_dodge(width = 0.5),
geom = "text", vjust = -0.2, hjust = -0.2,
aes(label = round(after_stat(y), digits = 2), group = EduLevel)) +
labs(subtitle = get_test_label(ANOVA_results, detailed = TRUE),
caption = get_pwc_label(pwc))
Based on both the parametric test and the non parametric test we can conclude there is a statistically significant impact of Education Level on salary in India
If there is a correlation between the age of a person and salary in India
library(ggplot2)
ggplot(mydataNew, aes(x = Age, y = Salary)) +
geom_point()
library(ggplot2)
ggplot(mydataNew, aes(x = Salary)) +
geom_histogram(binwidth = 2500, colour="gray") +
ylab("Frequency")
The data does not seem to be normally distributed, accordingly a Spearman’s coeffiecient is more appropriate but we use both Pearson and Spearman
cor(mydataNew$Age, mydataNew$Salary,
method = "pearson")
## [1] 0.7157744
cor(mydataNew$Salary, mydataNew$Age,
method = "spearman",
use = "complete.obs")
## [1] 0.733125
cor.test(mydataNew$Age, mydataNew$Salary,
method = "pearson",
use = "complete.obs")
##
## Pearson's product-moment correlation
##
## data: mydataNew$Age and mydataNew$Salary
## t = 23.994, df = 548, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6724071 0.7542443
## sample estimates:
## cor
## 0.7157744
cor.test(mydataNew$Age, mydataNew$Salary,
method = "spearman",
use = "complete.obs")
## Warning in cor.test.default(mydataNew$Age, mydataNew$Salary, method =
## "spearman", : Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: mydataNew$Age and mydataNew$Salary
## S = 7400198, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.733125
Based on the values from Pearson and Spearman correlation coefficients we can say there is a strong linear correlation between the age and salary of a person
If there is a relationship between gender and the education level of a person
results <- chisq.test(mydataNew$GenderF, mydataNew$EduLevel,
correct = FALSE)
results
##
## Pearson's Chi-squared test
##
## data: mydataNew$GenderF and mydataNew$EduLevel
## X-squared = 12.059, df = 3, p-value = 0.007185
H0: There is no association between the categorical variables
H1: There is an association between the categorical variables
As p<0.05, we reject H0 and as a result we reject that there is no association between gender and the education level of a person
addmargins(results$observed)
## mydataNew$EduLevel
## mydataNew$GenderF Bachelor's Master's PhD High School Sum
## Male 148 61 68 15 292
## Female 106 82 49 21 258
## Sum 254 143 117 36 550
round(results$expected)
## mydataNew$EduLevel
## mydataNew$GenderF Bachelor's Master's PhD High School
## Male 135 76 62 19
## Female 119 67 55 17
Assumptions:
round(results$res, 2)
## mydataNew$EduLevel
## mydataNew$GenderF Bachelor's Master's PhD High School
## Male 1.13 -1.71 0.75 -0.94
## Female -1.20 1.82 -0.79 1.00
Since all residuals<1.96, the difference between the observed and the expected frequencies are not statistically significant. There are more than expected males who have bachelors degree, Phd but less than expected with masters, high school. There are less than expected females with a bachelors degree, Phd, but more than expected with Masters, High School
addmargins(round(prop.table(results$observed), 3))
## mydataNew$EduLevel
## mydataNew$GenderF Bachelor's Master's PhD High School Sum
## Male 0.269 0.111 0.124 0.027 0.531
## Female 0.193 0.149 0.089 0.038 0.469
## Sum 0.462 0.260 0.213 0.065 1.000
26.9% of the population is males who have finished bachelors
addmargins(round(prop.table(results$observed, 1), 3), 2)
## mydataNew$EduLevel
## mydataNew$GenderF Bachelor's Master's PhD High School Sum
## Male 0.507 0.209 0.233 0.051 1.000
## Female 0.411 0.318 0.190 0.081 1.000
41.1% of females have done bachelor’s, 31.8% have done Masters, 19% have done Phd, 8.1% have finished high school
addmargins(round(prop.table(results$observed, 2), 3), 1)
## mydataNew$EduLevel
## mydataNew$GenderF Bachelor's Master's PhD High School
## Male 0.583 0.427 0.581 0.417
## Female 0.417 0.573 0.419 0.583
## Sum 1.000 1.000 1.000 1.000
within people who have done bachelors, 58.3% are male, 41.7% are female
library(effectsize)
effectsize::cramers_v(mydataNew$EduLevel, mydataNew$GenderF)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.13 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.13)
## [1] "small"
## (Rules: funder2019)
fisher.test(mydataNew$EduLevel, mydataNew$GenderF)
##
## Fisher's Exact Test for Count Data
##
## data: mydataNew$EduLevel and mydataNew$GenderF
## p-value = 0.007238
## alternative hypothesis: two.sided
H0: Odds ratio is equals 1
H1: Odds ratio does not equal 1
As p<0.05, we fail reject H0, as a result we reject the odds ratio is equal to 1
We reject there is no association between the gender and the education level