Exploring the job market in India

Brief introduction 📖

Motivation

The goal of this analysis is to identify how education level, profession and years of experience in that profession influence salary.

Data

Data source
Data origin: India
Currency: Indian Rupee (INR)

Please note that the data is outdated and does not represent the current state of the job market in India.

Goals

Identify the best-paid occupation
Identify the career that is the most rewarding in the long term
Determine whether a degree is required for a good career

Loading and processing data🧹

Cleaning checklist

✅Duplicates
✅Missing Values
✅Data types
✅Data Consistency

# Loading Libraries and data
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data <- read.csv("C:/Users/marci/Desktop/Google Capstone Project/Salaries/Salary_Data.csv")

# Checking data types
str(data)

## 'data.frame':    6704 obs. of  6 variables:
##  $ Age                : int  32 28 45 36 52 29 42 31 26 38 ...
##  $ Gender             : chr  "Male" "Female" "Male" "Female" ...
##  $ Education.Level    : chr  "Bachelor's" "Master's" "PhD" "Bachelor's" ...
##  $ Job.Title          : chr  "Software Engineer" "Data Analyst" "Senior Manager" "Sales Associate" ...
##  $ Years.of.Experience: num  5 3 15 7 20 2 12 4 1 10 ...
##  $ Salary             : int  90000 65000 150000 60000 200000 55000 120000 80000 45000 110000 ...

paste0("There are ", duplicated(data) %>% sum(), " duplicates in the dataset.")

## [1] "There are 4912 duplicates in the dataset."

paste0("There are ", is.na(data) %>% sum(), " missing values in the dataset.")

## [1] "There are 10 missing values in the dataset."
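The count above only covers cells that R stores as `NA`. Empty strings in text columns are not included, which matters for the `Education.Level` column later on. Here is a minimal sketch on a made-up toy data frame (the values are hypothetical, not taken from the dataset) that counts missing values per column and then converts empty strings to `NA` so they are counted too.

```r
library(dplyr)

# Hypothetical toy data with one NA and one empty string
toy <- data.frame(
  Age = c(32, NA, 45),
  Education.Level = c("Bachelor's", "", "PhD"),
  stringsAsFactors = FALSE
)

# Missing values per column; the empty string is NOT counted here
na_per_column <- colSums(is.na(toy))
na_per_column

# Convert empty strings in character columns to NA, then count again
toy_clean <- toy %>%
  mutate(across(where(is.character), ~ na_if(.x, "")))
na_after <- colSums(is.na(toy_clean))
na_after
```

Whether empty strings should be treated as missing is a judgment call; in this analysis they are handled separately in the data consistency step.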

Duplicates

Ten missing values are negligible given the size of the dataset (6704 rows). The 4912 duplicate rows, however, must be investigated.

data <- drop_na(data)
paste0((duplicated(data) %>% sum() / nrow(data) * 100) %>% round(2), "%")

## [1] "73.31%"

Dropping the duplicates means losing about 73% of the data.

Is it bad that there are duplicates?

How likely is it for 2 or 3 people, arbitrarily picked, to be of the same age, occupation, salary, gender, education level, and years of experience? It’s very unlikely.

Hence the decision to drop duplicates.

data <- data %>% filter(!duplicated(data))
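To make the behaviour of that step easier to see, here is a small sketch on hypothetical toy rows (not the real data). It shows how often each row repeats, what share of rows `duplicated()` flags, and that only the first copy of each row is kept. `distinct()` is used here as an equivalent dplyr shortcut for the `filter(!duplicated(...))` pattern.

```r
library(dplyr)

# Hypothetical toy data: the first row appears three times
toy <- data.frame(
  Age = c(32, 32, 28, 32),
  Job.Title = c("Software Engineer", "Software Engineer", "Data Analyst", "Software Engineer"),
  Salary = c(90000, 90000, 65000, 90000)
)

# How many times each distinct row occurs
copies <- toy %>% count(Age, Job.Title, Salary, name = "copies")
copies

# Share of rows flagged as repeats (every copy after the first one)
dup_share <- mean(duplicated(toy))
dup_share

# Keep only the first occurrence of each row
deduped <- distinct(toy)
deduped
```

Looking at the `copies` column before dropping anything is a cheap way to check whether a few rows are repeated many times or many rows are repeated once.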

Data consistency

unique(data$Gender)

## [1] "Male"   "Female" "Other"

print("--------------------------------------------------------------------------")

## [1] "--------------------------------------------------------------------------"

print(unique(data$Education.Level))

## [1] "Bachelor's"        "Master's"          "PhD"              
## [4] "Bachelor's Degree" "Master's Degree"   ""                 
## [7] "High School"       "phD"

As can be seen, the Bachelor’s, Master’s and PhD levels each appear under more than one label, and there is an empty string as well.
Let’s modify those labels so they share a common format.

# convert everything to lower case
data$Education.Level <- data$Education.Level %>% tolower() 
# Replace "bachelor's degree" and "master's degree" with "bachelor's" and "master's"
data$Education.Level[data$Education.Level == "bachelor's degree"] <- "bachelor's"
data$Education.Level[data$Education.Level == "master's degree"] <- "master's"

unique(data$Education.Level)

## [1] "bachelor's"  "master's"    "phd"         ""            "high school"

Almost there! Let’s look at empty strings present in the column.

data %>% filter(Education.Level == "")

# Only one entry contains an empty string. Dropping it won't affect the results of the analysis.

data <- data %>% filter(Education.Level != "")

unique(data$Education.Level)

## [1] "bachelor's"  "master's"    "phd"         "high school"

Great! Now the data is consistent.
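For reference, the same clean-up can be written as a single pipeline. This is only a sketch on a hypothetical toy column that mimics the labels seen above; it assumes stringr (part of the tidyverse) and strips a trailing " degree" instead of replacing each label by hand.

```r
library(dplyr)
library(stringr)

# Hypothetical toy column with the same kinds of inconsistencies
toy <- data.frame(
  Education.Level = c("Bachelor's", "Bachelor's Degree", "Master's Degree",
                      "phD", "PhD", "", "High School"),
  stringsAsFactors = FALSE
)

cleaned <- toy %>%
  mutate(
    Education.Level = Education.Level %>%
      str_to_lower() %>%          # "phD" and "PhD" both become "phd"
      str_remove(" degree$")      # drop a trailing " degree"
  ) %>%
  filter(Education.Level != "")   # drop rows with an empty label

unique(cleaned$Education.Level)
```

The regular expression only removes " degree" at the end of a label, so a label that merely contains the word elsewhere would be left alone.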

Exploring the data

Data distribution

options(scipen=100000)

ggplot(data, aes(Age)) +
  geom_histogram(color = "black", fill = "#3b83f9") +
  # linewidth replaces size for lines since ggplot2 3.4.0
  geom_vline(xintercept = median(data$Age), linetype = "dashed", color = "orange", linewidth = 1) +
  # annotate() draws the label once instead of once per row
  annotate("text", x = 40, y = 150, label = "Median Age", color = "orange") +
  labs(title = "Age distribution")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data, aes(Salary)) +
  geom_histogram(color = "black", fill = "#3b83f9") +
  # linewidth replaces size for lines since ggplot2 3.4.0
  geom_vline(xintercept = mean(data$Salary), linetype = "dashed", color = "orange", linewidth = 1) +
  # annotate() draws the label once instead of once per row
  annotate("text", x = mean(data$Salary) + 30000, y = 120, label = "Average Salary", color = "orange") +
  labs(title = "Salary distribution")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
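The `stat_bin()` message suggests choosing a bin width explicitly instead of relying on the default of 30 bins. Below is a minimal sketch on randomly generated toy salaries (hypothetical values, not the dataset). The bin width of 10000 and `boundary = 0` are assumptions picked for illustration, so that every bar covers a round 10000 INR range.

```r
library(ggplot2)

# Hypothetical toy salaries
set.seed(1)
toy <- data.frame(Salary = round(runif(200, 30000, 250000)))

# Explicit bin width: each bar covers exactly 10000, aligned to multiples of 10000
p <- ggplot(toy, aes(Salary)) +
  geom_histogram(binwidth = 10000, boundary = 0, color = "black", fill = "#3b83f9") +
  labs(title = "Salary distribution (toy data)")
p

# Inspect the computed bins
bins <- layer_data(p, 1)
head(bins[, c("xmin", "xmax", "count")])
```

A fixed bin width also silences the message and makes histograms comparable across reruns.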

ggplot(data, aes(Education.Level, fill=Education.Level)) + geom_bar() + labs(title= "Education Level distribution")

bar <- ggplot(data, aes(0, fill=Gender)) + geom_bar(width=1)
bar + coord_polar(theta = "y") + labs(title="Distribution of Genders") + theme_void()

summary(data)

##       Age           Gender          Education.Level     Job.Title        
##  Min.   :21.00   Length:1787        Length:1787        Length:1787       
##  1st Qu.:29.00   Class :character   Class :character   Class :character  
##  Median :33.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :35.14                                                           
##  3rd Qu.:41.00                                                           
##  Max.   :62.00                                                           
##  Years.of.Experience     Salary      
##  Min.   : 0.000      Min.   :   350  
##  1st Qu.: 3.000      1st Qu.: 70000  
##  Median : 8.000      Median :110000  
##  Mean   : 9.156      Mean   :113185  
##  3rd Qu.:13.000      3rd Qu.:160000  
##  Max.   :34.000      Max.   :250000

Well paid occupations

top_jobs <- data %>% group_by(Job.Title) %>% summarize("Median salary" = median(Salary)) %>% arrange(desc(`Median salary`)) 
#Jobs sorted by Median Salary
top_jobs

# Top 12 jobs by median salary
top_paid_jobs <- top_jobs %>% head(12)
top_paid_jobs

# The fill argument wouldn't work for me here. I would appreciate it if somebody could explain why =)
ggplot(top_paid_jobs, aes(Job.Title, `Median salary`, fill = `Median salary`)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1), axis.title = element_blank()) + labs(title = "Best paid occupations")
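One possible explanation for the fill question in the comment above, offered as a guess since the exact problem isn't described: when `fill` is mapped to a numeric column, ggplot2 uses a continuous colour gradient, while mapping it to a categorical column gives one distinct colour per bar. The sketch below uses hypothetical toy data and also shows `reorder()`, which sorts the bars by salary instead of alphabetically.

```r
library(ggplot2)

# Hypothetical toy data standing in for the top-jobs summary table
toy <- data.frame(
  Job.Title = c("Job A", "Job B", "Job C"),
  median_salary = c(120000, 200000, 90000)
)

# reorder() turns the titles into a factor sorted by descending salary
ordered_titles <- reorder(toy$Job.Title, -toy$median_salary)
levels(ordered_titles)

# fill mapped to a number: one continuous colour gradient across bars
p_gradient <- ggplot(toy, aes(reorder(Job.Title, -median_salary), median_salary, fill = median_salary)) +
  geom_col() +
  labs(title = "Fill mapped to a numeric column", x = NULL, y = NULL)

# fill mapped to a category: one distinct colour per bar
p_discrete <- ggplot(toy, aes(reorder(Job.Title, -median_salary), median_salary, fill = Job.Title)) +
  geom_col() +
  guides(fill = "none") +
  labs(title = "Fill mapped to a categorical column", x = NULL, y = NULL)

p_gradient
p_discrete
```

`geom_col()` is the shorthand for `geom_bar(stat = "identity")`, so the bars themselves are the same as in the plot above.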

The most popular Occupations

# Top 12 jobs by number of entries
popular_jobs <- data %>% group_by(Job.Title) %>% count() %>% arrange(desc(n)) %>% head(12) 
# Salary among the most popular jobs
popular_jobs_salary <- data %>% filter(Job.Title %in% popular_jobs$Job.Title) %>% group_by(Job.Title) %>% summarise("median salary" = median(Salary)) %>% arrange(desc(`median salary`))

ggplot(popular_jobs, aes(Job.Title,n,fill=n)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust=1)) + labs(title="Most Popular Occupations") + guides(fill="none") + theme(axis.title.x=element_blank(),axis.title.y = element_blank())

ggplot(popular_jobs_salary, aes(Job.Title,`median salary`,fill=`median salary`)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle=45,hjust=1), axis.title = element_blank()) + guides(fill='none') + labs(title="Salary amongst the most popular Occupations")

How experience level affects the salary

popular_jobs_general <- data %>% filter(Job.Title %in% popular_jobs$Job.Title) %>% group_by(Job.Title)
ggplot(popular_jobs_general,aes(Years.of.Experience,Salary)) + geom_point() + facet_wrap(~Job.Title) + labs(title="Effect of experience on Salary") + geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
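The plots give a visual impression of how tightly salary follows experience. To put a number on it, a per-job correlation coefficient could be computed. The following is a sketch on hypothetical toy data with two made-up job titles, one with a tight linear trend and one with a scattered upward trend. It uses Pearson correlation, which is an assumption here; the original analysis relies on the plots only.

```r
library(dplyr)

# Hypothetical toy data: two jobs, five people each
toy <- data.frame(
  Job.Title = rep(c("Job A", "Job B"), each = 5),
  Years.of.Experience = rep(c(1, 3, 5, 7, 9), times = 2),
  Salary = c(50000, 70000, 90000, 110000, 130000,   # Job A: tight linear trend
             60000, 40000, 120000, 70000, 110000)   # Job B: upward but scattered
)

# Pearson correlation between experience and salary within each job
experience_vs_salary <- toy %>%
  group_by(Job.Title) %>%
  summarise(
    n = n(),
    pearson_r = cor(Years.of.Experience, Salary),
    .groups = "drop"
  )
experience_vs_salary
```

A coefficient close to 1 corresponds to the tight pattern, a noticeably lower one to the dispersed pattern. Keeping `n` next to it helps avoid over-reading a coefficient based on a handful of rows.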

Conclusions and observations

Marketing Manager is the occupation where the relationship between salary and years of experience appears strongest and the data is the least dispersed. Based on that, we can hypothesize that the more experience you have as a marketing manager, the more you are likely to earn.
To put the above into perspective, let’s look at the Data Analyst plot. Salary does tend to go up with experience, but the data is dispersed, so it is harder to be confident that you’ll be rewarded for your experience.
A noticeable difference in pay for the same amount of experience can also be observed on the Software Engineer plot. Is it because of the education level? Let’s check that!

ggplot(filter(popular_jobs_general, Job.Title=="Software Engineer"),aes(Years.of.Experience,Salary, color=Education.Level)) + geom_point() + labs(title="Effect of Education Level on Salary")

As can be observed, pay differs noticeably even among bachelor’s degree holders with the same experience. Hence, education level alone isn’t the cause of the disparity.

Degree and Self-taught

For the last part of the analysis we are going to take a closer look at those without a degree within the most popular professions. We are going to answer questions such as:

How well are they paid compared to those with a degree?
Which job title has the most autodidacts?
How common is it for an autodidact to get into the profession?

ggplot(filter(popular_jobs_general,Education.Level=="high school"),aes(Years.of.Experience,Salary)) + geom_point() + facet_wrap(~Job.Title) + labs(title="Salary without Degree")

# Percentage of people without a degree within the most popular professions.
(filter(popular_jobs_general,Education.Level=="high school") %>% nrow()) / (popular_jobs_general %>% nrow())

## [1] 0.02120891
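The figure above is the overall share. To see which job title has the most people without a degree, the same idea can be applied per job. Here is a minimal sketch on hypothetical toy data; as in the analysis, "high school" is taken to mean "no degree".

```r
library(dplyr)

# Hypothetical toy data with two made-up job titles
toy <- data.frame(
  Job.Title = c("Job A", "Job A", "Job A", "Job B", "Job B", "Job B", "Job B"),
  Education.Level = c("high school", "bachelor's", "high school",
                      "master's", "bachelor's", "phd", "high school")
)

# Count and share of employees without a degree, per job title
no_degree_by_job <- toy %>%
  group_by(Job.Title) %>%
  summarise(
    employees = n(),
    no_degree = sum(Education.Level == "high school"),
    share_no_degree = mean(Education.Level == "high school"),
    .groups = "drop"
  ) %>%
  arrange(desc(no_degree))
no_degree_by_job
```

Reporting both the count and the share is useful because a job with few entries can have a high share based on only one or two people.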

ggplot(filter(popular_jobs_general,Job.Title %in% c("Back end Developer","Front end Developer","Full Stack Engineer","Senior Project Engineer","Senior Software Engineer","Web Developer")),aes(Years.of.Experience,Salary,color=Education.Level)) + geom_point() + facet_wrap(~Job.Title) + labs(title="Salary with and without Degree")

Conclusions and observations

Only 2% of employees within the most popular professions have no degree.
Back end Developer is the profession with the highest number of autodidacts.
Senior Project Engineers and Senior Software Engineers without a degree tend to be paid less than those with a degree. However, given the sample size, we can’t be certain.
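The sample-size caveat can be made visible by reporting the group size next to each median. The sketch below uses hypothetical toy data for a single made-up job title; the `has_degree` column is a helper introduced for illustration and is not part of the dataset.

```r
library(dplyr)

# Hypothetical toy data for one made-up job title
toy <- data.frame(
  Job.Title = rep("Job A", 5),
  Education.Level = c("high school", "bachelor's", "bachelor's", "master's", "phd"),
  Salary = c(80000, 100000, 120000, 140000, 160000)
)

# Median salary with and without a degree, together with the group size
salary_by_degree <- toy %>%
  mutate(has_degree = Education.Level != "high school") %>%
  group_by(Job.Title, has_degree) %>%
  summarise(n = n(), median_salary = median(Salary), .groups = "drop")
salary_by_degree
```

When `n` for the no-degree group is very small, as it is here, the difference in medians should be read as a hint rather than a finding.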

Marcin Kubowicz

2025-01-07
