Minka Borec

Research Question 1:

If the person’s education level has a significant impact on their salary in India

library(ggplot2)
mydataS <- read.csv("C:/Users/ACER/Desktop/R data/Salary_Data.csv")
head(mydataS)
##   Age Gender Education.Level         Job.Title Years.of.Experience Salary
## 1  32   Male      Bachelor's Software Engineer                   5  90000
## 2  28 Female        Master's      Data Analyst                   3  65000
## 3  45   Male             PhD    Senior Manager                  15 150000
## 4  36 Female      Bachelor's   Sales Associate                   7  60000
## 5  52   Male        Master's          Director                  20 200000
## 6  29   Male      Bachelor's Marketing Analyst                   2  55000

Description:

Unit of observation: A person

Sample Size: 6702

Variables:

  1. Age: Age in years

  2. Gender: Male or Female

  3. Education.Level: bachelors, Masters, Phd, or High School

  4. Job Title: Position Held

  5. Years of Experience: Number of years of work experience

  6. Salary: Salary in INR

Source: https://www.kaggle.com/datasets/mohithsairamreddy/salary-data

Since the sample size is quite large, we take 550 random people for our analysis

set.seed(1)
mydataNew <- mydataS[sample(nrow(mydataS), 550), ]

Converting categorical variables into factors

mydataNew$EduLevel <- factor(mydataNew$Education.Level, 
                            levels = c("Bachelor's", "Master's", "PhD", "High School"), 
                            labels = c("Bachelor's", "Master's", "PhD", "High School"))
mydataNew$GenderF <- factor(mydataNew$Gender, 
                            levels = c("Male", "Female"), 
                            labels = c("Male", "Female"))

Showing descriptive statistics:

summary(mydataNew[,-c(2,3,4)])
##       Age        Years.of.Experience     Salary              EduLevel  
##  Min.   :21.00   Min.   : 0.000      Min.   : 25000   Bachelor's :254  
##  1st Qu.:28.00   1st Qu.: 3.000      1st Qu.: 65000   Master's   :143  
##  Median :31.00   Median : 6.000      Median :115000   PhD        :117  
##  Mean   :33.16   Mean   : 7.655      Mean   :113561   High School: 36  
##  3rd Qu.:37.00   3rd Qu.:11.000      3rd Qu.:160000                    
##  Max.   :58.00   Max.   :32.000      Max.   :250000                    
##    GenderF   
##  Male  :292  
##  Female:258  
##              
##              
##              
## 

Explaining a few parameters:

  1. Years of experience mean: Based on this sample, the average work experience of a person in India is 7.655 years

  2. Salary Min: The minimum salary of a person in India observed in this sample is Rs 25000

Hypothesis Testing

  1. Test details:

Since this data is an independent sample with three or more arithmetic means, we do the one way analysis of variance- ANOVA as a parametric test, and if the assumption of normality is not met, we do the Kruskal-Wallis Rank Sum Test as a non parametric test

  1. Checking the assumptions:
    1. Analysed variable is numeric: Satisfied, because price is a numeric variable

    2. Variable in the population is normally distributed within each group. We do the Shapiro WIlk Test for checking this

      H0: The data within the group is normally distributed

      H1: The data within the group is not normally distributed

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
  mydataNew %>%
  group_by(EduLevel) %>%
  shapiro_test(Salary)
## # A tibble: 4 × 4
##   EduLevel    variable statistic        p
##   <fct>       <chr>        <dbl>    <dbl>
## 1 Bachelor's  Salary       0.913 5.72e-11
## 2 Master's    Salary       0.985 1.33e- 1
## 3 PhD         Salary       0.878 2.33e- 8
## 4 High School Salary       0.506 7.39e-10
  Based on the above values, we reject H0 for all the groups except Masters, which means
  neither of the groups other than Masters have a normal distribution of the variable salary
  
  c) Homoscedasticity: the variance of analyzed variable is the same within all groups:
  We do the Levene test for checking this
  
  H0: The variance of salary within the groups Bachelors, Masters, PhD, High School
  is the same
  
  H1: At least one group has a different variance 
  
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
leveneTest(mydataNew$Salary, group = mydataNew$EduLevel)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   3  8.8463 9.829e-06 ***
##       546                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the above value, we reject H0 at p<0.001. We perform Welch’s𝐹 test, which is less impacted by the conditions of heteroskedasticity

library(onewaytests)
welch.test(Salary ~ EduLevel, 
           data = mydataNew)
## 
##   Welch's Heteroscedastic F Test (alpha = 0.05) 
## ------------------------------------------------------------- 
##   data : Salary and EduLevel 
## 
##   statistic  : 172.449 
##   num df     : 3 
##   denom df   : 160.7844 
##   p.value    : 4.993296e-50 
## 
##   Result     : Difference is statistically significant. 
## -------------------------------------------------------------
  1. No significant outliers:
library(ggplot2)
ggplot(mydataNew, aes(x = Salary)) +
  geom_histogram(binwidth = 2500, colour="gray") +
  facet_wrap(~EduLevel, ncol = 20) + 
  ylab("Frequency")

There are few outliers in the High School category due to the sample size being small

Conclusion: Since the assumption of normality is not met, a non parametric test is suitable for this sample

Non Parametric Test

As the assumption of normality is not met, we use Kruskal-Wallis test

H0: The location distribution of salary is the same for all the groups

H1: At least one is different

kruskal.test(Salary ~ EduLevel, 
             data = mydataNew)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Salary by EduLevel
## Kruskal-Wallis chi-squared = 208.08, df = 3, p-value < 2.2e-16

As p<0.01, we reject H0, which means we reject the statement that the location distribution of salary is the same for all the groups

library(rstatix)
kruskal_effsize(Salary ~ EduLevel, 
                data = mydataNew)
## # A tibble: 1 × 5
##   .y.        n effsize method  magnitude
## * <chr>  <int>   <dbl> <chr>   <ord>    
## 1 Salary   550   0.376 eta2[H] large
pwc <- mydataNew %>%
  wilcox_test(Salary ~ EduLevel, 
              paired = FALSE, 
              p.adjust.method = "bonferroni")
 
Kruskal_results <- kruskal_test(Salary ~ EduLevel, 
                                data = mydataNew)

library(rstatix)
pwc <- pwc %>% 
       add_y_position(fun = "median", step.increase = 0.35)

library(ggpubr)
ggboxplot(mydataNew, x = "EduLevel", y = "Salary", add = "point", ylim=c(0, 300000)) +
  stat_pvalue_manual(pwc, hide.ns = FALSE) +
  stat_summary(fun = median, geom = "point", shape = 16, size = 4,
               aes(group = EduLevel), color = "darkred",
               position = position_dodge(width = 0.8)) +
  stat_summary(fun = median, colour = "darkred", 
               position = position_dodge(width = 0.8),
               geom = "text", vjust = -0.5, hjust = -1,
               aes(label = round(after_stat(y), digits = 2), group = EduLevel)) +
  labs(subtitle = get_test_label(Kruskal_results,  detailed = TRUE),
       caption = get_pwc_label(pwc))

Parametric Test

Along with the non parametric test we perform the parametric test as well

H0: Mean(Phd) = Mean(Masters) = Mean(Bachelors) = Mean(High School)

H1: At least one mean is different

ANOVA_results <- aov(Salary ~ EduLevel, 
                     data = mydataNew)

summary(ANOVA_results)
##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## EduLevel      3 5.675e+11 1.892e+11   111.2 <2e-16 ***
## Residuals   546 9.288e+11 1.701e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the above values, we reject H0, which implies we reject Mean(Phd) = Mean(Masters) = Mean(Bachelors) = Mean(High School)

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
eta_squared(ANOVA_results)
## For one-way between subjects designs, partial eta squared is equivalent
##   to eta squared. Returning eta squared.
## # Effect Size for ANOVA
## 
## Parameter | Eta2 |       95% CI
## -------------------------------
## EduLevel  | 0.38 | [0.33, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_eta_squared(0.379, rules = "cohen1992")
## [1] "large"
## (Rules: cohen1992)
pwc <- mydataNew %>%
  pairwise_t_test(Salary ~ EduLevel, 
                  paired = FALSE,
                  p.adjust.method = "bonferroni")

ANOVA_results <- anova_test(Salary ~ EduLevel, 
                            data = mydataNew)

library(rstatix)
pwc <- pwc %>% 
       add_y_position(fun = "median", step.increase = 0.35)

library(ggpubr)
ggboxplot(mydataNew, x = "EduLevel", y = "Salary", add = "point", ylim=c(0, 300000)) +
  stat_pvalue_manual(pwc, hide.ns = FALSE) +
  stat_summary(fun = mean, geom = "point", shape = 16, size = 4,
               aes(group = EduLevel), color = "darkred",
               position = position_dodge(width = 0.5)) +
  stat_summary(fun = mean, colour = "darkred", 
               position = position_dodge(width = 0.5),
               geom = "text", vjust = -0.2, hjust = -0.2,
               aes(label = round(after_stat(y), digits = 2), group = EduLevel)) +
  labs(subtitle = get_test_label(ANOVA_results,  detailed = TRUE),
       caption = get_pwc_label(pwc))

Conclusion:

Based on both the parametric test and the non parametric test we can conclude there is a statistically significant impact of Education Level on salary in India

Research Question 2

If there is a correlation between the age of a person and salary in India

library(ggplot2)
ggplot(mydataNew, aes(x = Age, y = Salary)) +
  geom_point()

library(ggplot2)
ggplot(mydataNew, aes(x = Salary)) +
  geom_histogram(binwidth = 2500, colour="gray") + 
  ylab("Frequency")

The data does not seem to be normally distributed, accordingly a Spearman’s coeffiecient is more appropriate but we use both Pearson and Spearman

cor(mydataNew$Age, mydataNew$Salary, 
    method = "pearson")
## [1] 0.7157744
cor(mydataNew$Salary, mydataNew$Age, 
    method = "spearman",
    use = "complete.obs")
## [1] 0.733125
cor.test(mydataNew$Age, mydataNew$Salary, 
         method = "pearson",
         use = "complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  mydataNew$Age and mydataNew$Salary
## t = 23.994, df = 548, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6724071 0.7542443
## sample estimates:
##       cor 
## 0.7157744
cor.test(mydataNew$Age, mydataNew$Salary, 
         method = "spearman",
         use = "complete.obs")
## Warning in cor.test.default(mydataNew$Age, mydataNew$Salary, method =
## "spearman", : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  mydataNew$Age and mydataNew$Salary
## S = 7400198, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##      rho 
## 0.733125

Conclusion:

Based on the values from Pearson and Spearman correlation coefficients we can say there is a strong linear correlation between the age and salary of a person

Research Question 3

If there is a relationship between gender and the education level of a person

Pearson’s chi squared test

results <- chisq.test(mydataNew$GenderF, mydataNew$EduLevel,
           correct = FALSE)

results
## 
##  Pearson's Chi-squared test
## 
## data:  mydataNew$GenderF and mydataNew$EduLevel
## X-squared = 12.059, df = 3, p-value = 0.007185

H0: There is no association between the categorical variables

H1: There is an association between the categorical variables

As p<0.05, we reject H0 and as a result we reject that there is no association between gender and the education level of a person

Empirical Frequencies

addmargins(results$observed)
##                  mydataNew$EduLevel
## mydataNew$GenderF Bachelor's Master's PhD High School Sum
##            Male          148       61  68          15 292
##            Female        106       82  49          21 258
##            Sum           254      143 117          36 550
round(results$expected)
##                  mydataNew$EduLevel
## mydataNew$GenderF Bachelor's Master's PhD High School
##            Male          135       76  62          19
##            Female        119       67  55          17

Assumptions:

  1. Independent observations -> satisfied
  2. All expected frequencies>1 -> satisfied
  3. Max 20% frequencies between 1-5 -> satisfied
round(results$res, 2)
##                  mydataNew$EduLevel
## mydataNew$GenderF Bachelor's Master's   PhD High School
##            Male         1.13    -1.71  0.75       -0.94
##            Female      -1.20     1.82 -0.79        1.00

Since all residuals<1.96, the difference between the observed and the expected frequencies are not statistically significant. There are more than expected males who have bachelors degree, Phd but less than expected with masters, high school. There are less than expected females with a bachelors degree, Phd, but more than expected with Masters, High School

addmargins(round(prop.table(results$observed), 3))
##                  mydataNew$EduLevel
## mydataNew$GenderF Bachelor's Master's   PhD High School   Sum
##            Male        0.269    0.111 0.124       0.027 0.531
##            Female      0.193    0.149 0.089       0.038 0.469
##            Sum         0.462    0.260 0.213       0.065 1.000

26.9% of the population is males who have finished bachelors

addmargins(round(prop.table(results$observed, 1), 3), 2)
##                  mydataNew$EduLevel
## mydataNew$GenderF Bachelor's Master's   PhD High School   Sum
##            Male        0.507    0.209 0.233       0.051 1.000
##            Female      0.411    0.318 0.190       0.081 1.000

41.1% of females have done bachelor’s, 31.8% have done Masters, 19% have done Phd, 8.1% have finished high school

addmargins(round(prop.table(results$observed, 2), 3), 1)
##                  mydataNew$EduLevel
## mydataNew$GenderF Bachelor's Master's   PhD High School
##            Male        0.583    0.427 0.581       0.417
##            Female      0.417    0.573 0.419       0.583
##            Sum         1.000    1.000 1.000       1.000

within people who have done bachelors, 58.3% are male, 41.7% are female

library(effectsize)
effectsize::cramers_v(mydataNew$EduLevel, mydataNew$GenderF)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.13              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.13)
## [1] "small"
## (Rules: funder2019)
fisher.test(mydataNew$EduLevel, mydataNew$GenderF)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  mydataNew$EduLevel and mydataNew$GenderF
## p-value = 0.007238
## alternative hypothesis: two.sided

H0: Odds ratio is equals 1

H1: Odds ratio does not equal 1

As p<0.05, we fail reject H0, as a result we reject the odds ratio is equal to 1

Conclusion

We reject there is no association between the gender and the education level