Brief introduction 📖

Motivation

The goal of this analysis is to identify how education level, profession, and years of experience influence salary.

Data

  • Data source

  • Data origin: India

  • Currency: Indian Rupee (INR)

    Please note that the data is outdated and does not represent the current state of the job market in India.

Goals

  • Identify the best-paid occupation

  • Identify the career that is the most rewarding in the long term

  • Answer whether a degree is required for a good career

Loading and processing data 🧹

Cleaning checklist

✅ Duplicates
✅ Missing Values
✅ Data types
✅ Data Consistency

# Loading Libraries and data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read.csv("C:/Users/marci/Desktop/Google Capstone Project/Salaries/Salary_Data.csv")
# Checking data types
str(data)
## 'data.frame':    6704 obs. of  6 variables:
##  $ Age                : int  32 28 45 36 52 29 42 31 26 38 ...
##  $ Gender             : chr  "Male" "Female" "Male" "Female" ...
##  $ Education.Level    : chr  "Bachelor's" "Master's" "PhD" "Bachelor's" ...
##  $ Job.Title          : chr  "Software Engineer" "Data Analyst" "Senior Manager" "Sales Associate" ...
##  $ Years.of.Experience: num  5 3 15 7 20 2 12 4 1 10 ...
##  $ Salary             : int  90000 65000 150000 60000 200000 55000 120000 80000 45000 110000 ...
paste0("There are ", duplicated(data) %>% sum(), " duplicates in the dataset.")
## [1] "There are 4912 duplicates in the dataset."
paste0("There are ", is.na(data) %>% sum(), " missing value in the dataset.")
## [1] "There are 10 missing value in the dataset."

Duplicates

Ten missing values are negligible given the size of the dataset (6704 rows). However, the 4912 duplicate rows need to be investigated.

data <- drop_na(data)
paste0((duplicated(data) %>% sum() / nrow(data) * 100) %>% round(2), "%") 
## [1] "73.31%"

73%: that is how much of the data would be lost if the duplicates were dropped.


Is it bad that there are duplicates?

How likely is it for two or three arbitrarily picked people to share the same age, occupation, salary, gender, education level, and years of experience? It’s very unlikely, so the repeated rows are more likely copies of the same record than different people.

Hence the decision to drop the duplicates.

data <- data %>% filter(!duplicated(data))
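
To see what is being removed, it can help to count how many copies each repeated row has before dropping them. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the salary data) and also shows that `distinct()` from dplyr keeps the same rows as the `filter(!duplicated(...))` call above.

```r
library(dplyr)

# Hypothetical toy data standing in for the salary dataset
toy <- data.frame(
  Age    = c(30, 30, 30, 41, 41, 25),
  Salary = c(50000, 50000, 50000, 90000, 90000, 40000)
)

# Number of identical copies of each row that appears more than once
copies <- toy %>%
  group_by(across(everything())) %>%
  summarise(copies = n(), .groups = "drop") %>%
  filter(copies > 1)

# Two equivalent ways to keep only the first occurrence of each row
deduped_filter   <- toy %>% filter(!duplicated(toy))
deduped_distinct <- toy %>% distinct()
```

If most repeated rows turn out to have only two or three copies, that is at least consistent with the reasoning above, although it cannot prove that the copies are the same person.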

Data consistency

unique(data$Gender)
## [1] "Male"   "Female" "Other"
print("--------------------------------------------------------------------------")
## [1] "--------------------------------------------------------------------------"
print(unique(data$Education.Level))
## [1] "Bachelor's"        "Master's"          "PhD"              
## [4] "Bachelor's Degree" "Master's Degree"   ""                 
## [7] "High School"       "phD"

As can be seen, the same degree appears under several spellings: Bachelor’s and Bachelor’s Degree, Master’s and Master’s Degree, PhD and phD. There is also an empty string.
Let’s bring those titles to a common format.

# convert everything to lower case
data$Education.Level <- data$Education.Level %>% tolower() 
# Replace "bachelor's degree" and "master's degree" with "bachelor's" and "master's"
data$Education.Level[data$Education.Level == "bachelor's degree"] <- "bachelor's"
data$Education.Level[data$Education.Level == "master's degree"] <- "master's"

unique(data$Education.Level)
## [1] "bachelor's"  "master's"    "phd"         ""            "high school"

Almost there! Let’s look at the empty strings present in the column.

data %>% filter(Education.Level == "")
# Only one entry contains an empty string. Dropping it won't affect the results of the analysis.

data <- data %>% filter(Education.Level != "")

unique(data$Education.Level)
## [1] "bachelor's"  "master's"    "phd"         "high school"

Great! Now the data is consistent.
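
The same normalization can also be written as one reusable step. This is a sketch under the assumption that the only variants are inconsistent capitalization and a trailing " degree", as in the output above; `normalize_education` is a hypothetical helper name, and `raw` simply repeats the values printed earlier.

```r
library(stringr)

# Hypothetical helper: lower-case, trim whitespace, drop a trailing " degree"
normalize_education <- function(x) {
  x <- str_trim(tolower(x))
  str_remove(x, " degree$")
}

# The distinct raw values printed by unique(data$Education.Level)
raw <- c("Bachelor's", "Master's", "PhD", "Bachelor's Degree",
         "Master's Degree", "", "High School", "phD")

cleaned <- unique(normalize_education(raw))
cleaned
```

The empty string survives this step, so it still has to be filtered out separately, as done above.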

Exploring the data

Data distribution

options(scipen=100000)

ggplot(data, aes(Age)) +
  geom_histogram(color = "black", fill = "#3b83f9") +
  # linewidth replaces the size argument for lines, deprecated since ggplot2 3.4.0
  geom_vline(xintercept = median(data$Age), linetype = "dashed", color = "orange", linewidth = 1) +
  labs(title = "Age distribution") +
  geom_text(x = 40, y = 150, label = "Median Age", color = "orange")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data, aes(Salary)) +
  geom_histogram(color = "black", fill = "#3b83f9") +
  geom_vline(xintercept = mean(data$Salary), linetype = "dashed", color = "orange", linewidth = 1) +
  labs(title = "Salary distribution") +
  geom_text(x = mean(data$Salary) + 30000, y = 120, label = "Average Salary", color = "orange")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
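
The `stat_bin()` message appears because no bin width was given. Below is a minimal sketch of setting one explicitly. The width of 2 years is an arbitrary assumption and `ages` is made-up data. `annotate()` is used for the label so that it is drawn once, whereas `geom_text()` with fixed coordinates draws it once per row of the data.

```r
library(ggplot2)

# Made-up ages, for illustration only
ages <- data.frame(Age = c(23, 25, 27, 29, 31, 33, 33, 36, 41, 45, 52, 60))

p <- ggplot(ages, aes(Age)) +
  # binwidth = 2 is an arbitrary choice; tune it to the data
  geom_histogram(binwidth = 2, color = "black", fill = "#3b83f9") +
  geom_vline(xintercept = median(ages$Age), linetype = "dashed",
             color = "orange", linewidth = 1) +
  # a single text label placed to the right of the median line
  annotate("text", x = median(ages$Age) + 6, y = 2, label = "Median Age",
           color = "orange") +
  labs(title = "Age distribution")
p
```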

ggplot(data, aes(Education.Level, fill=Education.Level)) + geom_bar() + labs(title= "Education Level distribution")

bar <- ggplot(data, aes(0, fill=Gender)) + geom_bar(width=1)
bar + coord_polar(theta = "y") + labs(title="Distribution of Genders") + theme_void()

summary(data)
##       Age           Gender          Education.Level     Job.Title        
##  Min.   :21.00   Length:1787        Length:1787        Length:1787       
##  1st Qu.:29.00   Class :character   Class :character   Class :character  
##  Median :33.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :35.14                                                           
##  3rd Qu.:41.00                                                           
##  Max.   :62.00                                                           
##  Years.of.Experience     Salary      
##  Min.   : 0.000      Min.   :   350  
##  1st Qu.: 3.000      1st Qu.: 70000  
##  Median : 8.000      Median :110000  
##  Mean   : 9.156      Mean   :113185  
##  3rd Qu.:13.000      3rd Qu.:160000  
##  Max.   :34.000      Max.   :250000

Well paid occupations

top_jobs <- data %>% group_by(Job.Title) %>% summarize("Median salary" = median(Salary)) %>% arrange(desc(`Median salary`)) 
#Jobs sorted by Median Salary
top_jobs
#Top 10 jobs salary wise
top_10_jobs <- top_jobs %>% head(10)
top_10_jobs
# Fill argument wouldn't work for me here. I would appreciate if somebody could explain why =)
ggplot(top_10_jobs, aes(Job.Title, `Median salary`,fill=`Median salary`)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust=1),axis.title = element_blank()) + labs(title="Best paid occupations")
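
By default ggplot2 places a character x axis in alphabetical order, so the bars above are not sorted by salary. Below is a sketch of sorting them with `reorder()`. `toy_jobs`, its column name `Median.salary`, and its numbers are invented for illustration.

```r
library(ggplot2)

# Invented values, for illustration only
toy_jobs <- data.frame(
  Job.Title     = c("Analyst", "Director", "Engineer"),
  Median.salary = c(60000, 180000, 120000)
)

# reorder() turns Job.Title into a factor whose levels follow -Median.salary,
# which puts the highest-paid title first on the x axis
p <- ggplot(toy_jobs, aes(reorder(Job.Title, -Median.salary), Median.salary,
                          fill = Median.salary)) +
  geom_col() +  # same as geom_bar(stat = "identity")
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_blank()) +
  labs(title = "Best paid occupations")
p
```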

How experience level affects the salary
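
The code in this section relies on a `popular_jobs` data frame whose definition does not appear in the text. Below is a minimal sketch of one way it could be built, assuming that "popular" means the job titles with the most rows. The name `popular_jobs_sketch`, the cutoff `n_titles`, and the toy data are all hypothetical.

```r
library(dplyr)

# Made-up stand-in for the cleaned salary data
toy <- data.frame(
  Job.Title = c("Data Analyst", "Data Analyst", "Data Analyst",
                "Software Engineer", "Software Engineer", "Web Developer")
)

n_titles <- 2  # hypothetical cutoff, not taken from the original analysis

# Count rows per title, sort by frequency, keep the n_titles most frequent
popular_jobs_sketch <- toy %>%
  count(Job.Title, sort = TRUE) %>%
  slice_head(n = n_titles)

popular_jobs_sketch
```

With the real data, `toy` would be replaced by `data` and the result assigned to `popular_jobs`.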

popular_jobs_general <- data %>% filter(Job.Title %in% popular_jobs$Job.Title) %>% group_by(Job.Title)
ggplot(popular_jobs_general,aes(Years.of.Experience,Salary)) + geom_point() + facet_wrap(~Job.Title) + labs(title="Effect of experience on Salary") + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
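
The strength of the trend in each panel can also be summarized numerically. Below is a sketch that computes the Pearson correlation between experience and salary per job title with `cor()`. The toy values are invented and only illustrate the shape of the call.

```r
library(dplyr)

# Invented values: title "A" is perfectly linear, title "B" has no linear trend
toy <- data.frame(
  Job.Title           = rep(c("A", "B"), each = 4),
  Years.of.Experience = c(1, 2, 3, 4, 1, 2, 3, 4),
  Salary              = c(10, 20, 30, 40, 30, 10, 40, 20)
)

# One row per job title with its sample size and Pearson correlation
cor_by_job <- toy %>%
  group_by(Job.Title) %>%
  summarise(
    n = n(),
    correlation = cor(Years.of.Experience, Salary),
    .groups = "drop"
  ) %>%
  arrange(desc(correlation))

cor_by_job
```

A coefficient close to 1 with a reasonable `n` would support a "strong correlation" reading of a panel. A value near 0 would not.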

Conclusions and observations

  • Marketing Manager is the occupation where the correlation between salary and years of experience looks strong and the data is not widely dispersed. Based on that, we can hypothesize that the more experience you have as a marketing manager, the more you are likely to earn.

  • To put the above into perspective, let’s look at the Data Analyst plot. Salary does tend to go up with experience, but the data is dispersed, so it is harder to be confident that experience will be rewarded.

  • A noticeable difference in pay for the same amount of experience can also be observed in the Software Engineer plot. Is it because of the education level? Let’s check that!

ggplot(filter(popular_jobs_general, Job.Title=="Software Engineer"),aes(Years.of.Experience,Salary, color=Education.Level)) + geom_point() + labs(title="Effect of Education Level on Salary")

As can be observed, pay varies even among bachelor’s degree holders with the same experience. Hence, education level alone doesn’t explain the disparity.
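
The same check can be made numerically by looking at the salary range within each combination of education level and experience. Below is a sketch on invented values (`toy_se` is hypothetical). A wide range inside a single group means education level alone cannot account for the spread.

```r
library(dplyr)

# Invented values for one job title, all with 3 years of experience
toy_se <- data.frame(
  Education.Level     = c("bachelor's", "bachelor's", "bachelor's",
                          "master's", "master's"),
  Years.of.Experience = c(3, 3, 3, 3, 3),
  Salary              = c(60000, 90000, 120000, 95000, 100000)
)

# Salary range within each education level / experience combination
spread <- toy_se %>%
  group_by(Education.Level, Years.of.Experience) %>%
  summarise(
    n = n(),
    min_salary = min(Salary),
    max_salary = max(Salary),
    .groups = "drop"
  ) %>%
  mutate(range = max_salary - min_salary)

spread
```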

Degree and Self-taught

For the last part of the analysis, we are going to take a closer look at those without a degree within the most popular professions: how many of them there are, which professions they work in, and how their pay compares with that of degree holders.

ggplot(filter(popular_jobs_general,Education.Level=="high school"),aes(Years.of.Experience,Salary)) + geom_point() + facet_wrap(~Job.Title) + labs(title="Salary without Degree") 

# Share of people without a degree within the most popular professions (a proportion, not a percentage).
(filter(popular_jobs_general,Education.Level=="high school") %>% nrow()) / (popular_jobs_general %>% nrow())
## [1] 0.02120891
ggplot(filter(popular_jobs_general,Job.Title %in% c("Back end Developer","Front end Developer","Full Stack Engineer","Senior Project Engineer","Senior Software Engineer","Web Developer")),aes(Years.of.Experience,Salary,color=Education.Level)) + geom_point() + facet_wrap(~Job.Title) + labs(title="Salary without Degree") 

Conclusions and observations

  • Only about 2% of employees within the most popular professions have no degree.

  • Back end Developer is the profession with the highest number of autodidacts.

  • Senior project engineers and senior software engineers without a degree tend to be paid less than those with a degree. However, given the sample size, we can’t be certain.
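
The per-profession counts behind these observations can be tabulated directly. Below is a sketch, on made-up rows, of counting people whose highest education is high school and their share within each job title. `toy` and its values are hypothetical.

```r
library(dplyr)

# Made-up rows, for illustration only
toy <- data.frame(
  Job.Title       = c("Back end Developer", "Back end Developer",
                      "Back end Developer", "Web Developer", "Web Developer"),
  Education.Level = c("high school", "high school", "bachelor's",
                      "high school", "master's")
)

# Count and share of people without a degree per job title
no_degree <- toy %>%
  group_by(Job.Title) %>%
  summarise(
    total = n(),
    without_degree = sum(Education.Level == "high school"),
    .groups = "drop"
  ) %>%
  mutate(share = without_degree / total) %>%
  arrange(desc(without_degree))

no_degree
```

Reporting `total` next to `share` makes it easier to judge how much weight a small group can bear.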