The goal of this analysis is to identify how education level, profession, and experience within that profession influence salary.
Data origin: India
Currency: Indian Rupee (INR)
Please note that the data is outdated and does not represent the current state of the market in India.
Identify the most paid occupation
Identify the career that is the most rewarding in the long term
Answer whether a degree is required for a good career
✅ Duplicates
✅ Missing Values
✅ Data types
✅ Data Consistency
# Loading Libraries and data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read.csv("C:/Users/marci/Desktop/Google Capstone Project/Salaries/Salary_Data.csv")
# Checking data types
str(data)
## 'data.frame': 6704 obs. of 6 variables:
## $ Age : int 32 28 45 36 52 29 42 31 26 38 ...
## $ Gender : chr "Male" "Female" "Male" "Female" ...
## $ Education.Level : chr "Bachelor's" "Master's" "PhD" "Bachelor's" ...
## $ Job.Title : chr "Software Engineer" "Data Analyst" "Senior Manager" "Sales Associate" ...
## $ Years.of.Experience: num 5 3 15 7 20 2 12 4 1 10 ...
## $ Salary : int 90000 65000 150000 60000 200000 55000 120000 80000 45000 110000 ...
paste0("There are ", duplicated(data) %>% sum(), " duplicates in the dataset.")
## [1] "There are 4912 duplicates in the dataset."
paste0("There are ", is.na(data) %>% sum(), " missing values in the dataset.")
## [1] "There are 10 missing values in the dataset."
Ten missing values are negligible given the size of the dataset. The 4,912 duplicate rows, however, need to be investigated.
data <- drop_na(data)
paste0((duplicated(data) %>% sum() / nrow(data) * 100) %>% round(2), "%")
## [1] "73.31%"
Dropping the duplicates would remove about 73% of the data.
Is it bad that there are duplicates?
How likely is it for two or three arbitrarily picked people to share the same age, occupation, salary, gender, education level, and years of experience? It’s very unlikely.
Hence the decision to drop duplicates.
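To make the effect of dropping duplicates concrete, here is a minimal sketch on a made-up data frame (the `toy` data and its values are illustrative assumptions, not rows from the dataset). It shows that `duplicated()` flags only the second and later copies of a row, so one copy of every record survives.

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 3 are identical.
toy <- data.frame(
  Age    = c(30, 41, 30),
  Salary = c(50000, 90000, 50000)
)

# duplicated() marks repeats only; the first occurrence stays FALSE.
dup_flags <- duplicated(toy)

# Share of rows that would be removed, as a percentage.
dup_share <- round(sum(dup_flags) / nrow(toy) * 100, 2)

# distinct() keeps the first occurrence of every unique row.
toy_unique <- distinct(toy)
```

Because the first copy is kept, the percentage reported for the real data is the share of redundant rows, not the share of people who have a twin in the dataset.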
data <- distinct(data)
unique(data$Gender)
## [1] "Male" "Female" "Other"
print("--------------------------------------------------------------------------")
## [1] "--------------------------------------------------------------------------"
print(unique(data$Education.Level))
## [1] "Bachelor's" "Master's" "PhD"
## [4] "Bachelor's Degree" "Master's Degree" ""
## [7] "High School" "phD"
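As can be seen, the same education levels appear under several spellings: “Bachelor’s” and “Bachelor’s Degree”, “Master’s” and “Master’s Degree”, “PhD” and “phD”. There is also an empty string.
Let’s modify those labels so they share a common format.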
# convert everything to lower case
data$Education.Level <- data$Education.Level %>% tolower()
# Replace "bachelor's degree" and "master's degree" with "bachelor's" and "master's"
data$Education.Level[data$Education.Level == "bachelor's degree"] <- "bachelor's"
data$Education.Level[data$Education.Level == "master's degree"] <- "master's"
unique(data$Education.Level)
## [1] "bachelor's" "master's" "phd" "" "high school"
Almost there! Let’s look at empty strings present in the column.
data %>% filter(Education.Level == "")
# Only one entry contains an empty string. Dropping it won't affect the results of the analysis.
data <- data %>% filter(Education.Level != "")
unique(data$Education.Level)
## [1] "bachelor's" "master's" "phd" "high school"
Great! Now the data is consistent.
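The same clean-up can also be written as a single pipeline. The sketch below is one possible alternative, not the code used in this analysis; `raw_levels` is a hypothetical vector that imitates the variants seen in the column, and the `stringr` helpers replace the manual assignments.

```r
library(dplyr)
library(stringr)

# Hypothetical raw labels mirroring the variants found in Education.Level.
raw_levels <- c("Bachelor's", "Bachelor's Degree", "Master's Degree",
                "phD", "PhD", "High School")

clean_levels <- raw_levels %>%
  str_to_lower() %>%          # "phD" and "PhD" both become "phd"
  str_remove(" degree$") %>%  # drop a trailing " degree"
  str_trim()                  # guard against stray whitespace

unique(clean_levels)
```

A pattern-based rule like this also catches any other label that ends in " degree", which may or may not be what you want, so it is worth checking `unique()` afterwards.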
options(scipen=100000)
# linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0);
# annotate() draws the label once instead of once per row.
ggplot(data, aes(Age)) +
  geom_histogram(bins = 30, color = "black", fill = "#3b83f9") +
  geom_vline(xintercept = median(data$Age), linetype = "dashed", color = "orange", linewidth = 1) +
  annotate("text", x = 40, y = 150, label = "Median Age", color = "orange") +
  labs(title = "Age distribution")
ggplot(data, aes(Salary)) +
  geom_histogram(bins = 30, color = "black", fill = "#3b83f9") +
  geom_vline(xintercept = mean(data$Salary), linetype = "dashed", color = "orange", linewidth = 1) +
  annotate("text", x = mean(data$Salary) + 30000, y = 120, label = "Average Salary", color = "orange") +
  labs(title = "Salary distribution")
ggplot(data, aes(Education.Level, fill=Education.Level)) + geom_bar() + labs(title= "Education Level distribution")
bar <- ggplot(data, aes(0, fill=Gender)) + geom_bar(width=1)
bar + coord_polar(theta = "y") + labs(title="Distribution of Genders") + theme_void()
summary(data)
## Age Gender Education.Level Job.Title
## Min. :21.00 Length:1787 Length:1787 Length:1787
## 1st Qu.:29.00 Class :character Class :character Class :character
## Median :33.00 Mode :character Mode :character Mode :character
## Mean :35.14
## 3rd Qu.:41.00
## Max. :62.00
## Years.of.Experience Salary
## Min. : 0.000 Min. : 350
## 1st Qu.: 3.000 1st Qu.: 70000
## Median : 8.000 Median :110000
## Mean : 9.156 Mean :113185
## 3rd Qu.:13.000 3rd Qu.:160000
## Max. :34.000 Max. :250000
top_jobs <- data %>% group_by(Job.Title) %>% summarize("Median salary" = median(Salary)) %>% arrange(desc(`Median salary`))
#Jobs sorted by Median Salary
top_jobs
#Top 12 jobs salary wise
top_paid_jobs <- top_jobs %>% head(12)
top_paid_jobs
# The fill argument didn't work for me here. I would appreciate it if somebody could explain why =)
ggplot(top_paid_jobs, aes(Job.Title, `Median salary`, fill = `Median salary`)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust=1),axis.title = element_blank()) + labs(title="Best paid occupations")
# Top 12 jobs popularity wise
popular_jobs <- data %>% group_by(Job.Title) %>% count() %>% arrange(desc(n)) %>% head(12)
# Salary amongst the most popular jobs
popular_jobs_salary <- data %>% filter(Job.Title %in% popular_jobs$Job.Title) %>% group_by(Job.Title) %>% summarise("median salary" = median(Salary)) %>% arrange(desc(`median salary`))
ggplot(popular_jobs, aes(Job.Title,n,fill=n)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust=1)) + labs(title="Most Popular Occupations") + guides(fill="none") + theme(axis.title.x=element_blank(),axis.title.y = element_blank())
ggplot(popular_jobs_salary, aes(Job.Title,`median salary`,fill=`median salary`)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle=45,hjust=1), axis.title = element_blank()) + guides(fill='none') + labs(title="Salary amongst the most popular Occupations")
popular_jobs_general <- data %>% filter(Job.Title %in% popular_jobs$Job.Title) %>% group_by(Job.Title)
ggplot(popular_jobs_general,aes(Years.of.Experience,Salary)) + geom_point() + facet_wrap(~Job.Title) + labs(title="Effect of experience on Salary") + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Marketing Manager is the occupation where the relationship between salary and years of experience looks strongest and the points are the least dispersed. Based on that, we can hypothesize that the more experience you have as a marketing manager, the more you are likely to earn.
To put the above into perspective, let’s look at the Data Analyst plot. Salary does tend to go up with experience, but the data is dispersed, so it is harder to count on being rewarded for your experience.
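The comparison above is based on reading the plots. One way to put a number on it is to compute a correlation coefficient per job title. The sketch below is an assumption about how that could be done, using Pearson correlation (which only measures linear association) on a made-up `toy` data frame whose column names follow the dataset.

```r
library(dplyr)

# Hypothetical toy data: job "A" follows a tight line, job "B" is scattered.
toy <- data.frame(
  Job.Title           = rep(c("A", "B"), each = 5),
  Years.of.Experience = rep(1:5, times = 2),
  Salary              = c(10, 20, 30, 40, 50,   # A: perfectly linear
                          30, 10, 50, 20, 40)   # B: noisy
)

cor_by_job <- toy %>%
  group_by(Job.Title) %>%
  summarise(
    n       = n(),                                  # sample size per job
    pearson = cor(Years.of.Experience, Salary),     # linear association
    .groups = "drop"
  ) %>%
  arrange(desc(pearson))

cor_by_job
```

Applied to `popular_jobs_general`, a table like this would show whether Marketing Manager really has the highest coefficient. Keeping `n` next to the coefficient matters, because a high correlation from a handful of points is weak evidence.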
A noticeable difference in pay for the same amount of experience can also be observed on the Software Engineer plot. Is it because of the education level? Let’s check that!
ggplot(filter(popular_jobs_general, Job.Title=="Software Engineer"),aes(Years.of.Experience,Salary, color=Education.Level)) + geom_point() + labs(title="Effect of Education Level on Salary")
As can be observed, pay varies widely even among bachelor’s degree holders with the same experience. Hence, education level alone doesn’t explain the disparity.
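The same check can be done in a table by measuring the salary spread inside each combination of education level and experience. This is a minimal sketch on made-up numbers (the `toy` values are assumptions, only the column names come from the dataset).

```r
library(dplyr)

# Hypothetical toy data: everyone has 3 years of experience.
toy <- data.frame(
  Education.Level     = c("bachelor's", "bachelor's", "bachelor's",
                          "master's", "master's"),
  Years.of.Experience = c(3, 3, 3, 3, 3),
  Salary              = c(60000, 90000, 120000, 100000, 110000)
)

spread_by_group <- toy %>%
  group_by(Education.Level, Years.of.Experience) %>%
  summarise(
    n            = n(),
    min_salary   = min(Salary),
    max_salary   = max(Salary),
    salary_range = max_salary - min_salary,  # spread within one group
    .groups = "drop"
  )

spread_by_group
```

A large `salary_range` within a single education level, as in the bachelor's row here, is the pattern that points away from education as the explanation.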
For the last part of the analysis, we will take a closer look at people without a degree within the most popular professions. We will answer questions such as:
How well are they paid compared to those with a degree?
Which job title has the most autodidacts?
How common is it for an autodidact to get into the profession?
ggplot(filter(popular_jobs_general,Education.Level=="high school"),aes(Years.of.Experience,Salary)) + geom_point() + facet_wrap(~Job.Title) + labs(title="Salary without Degree")
# Percentage of people without a degree within the most popular professions.
(filter(popular_jobs_general,Education.Level=="high school") %>% nrow()) / (popular_jobs_general %>% nrow())
## [1] 0.02120891
ggplot(filter(popular_jobs_general,Job.Title %in% c("Back end Developer","Front end Developer","Full Stack Engineer","Senior Project Engineer","Senior Software Engineer","Web Developer")),aes(Years.of.Experience,Salary,color=Education.Level)) + geom_point() + facet_wrap(~Job.Title) + labs(title="Salary by Education Level")
Only about 2% of employees within the most popular professions have no degree.
Back end Developer is the profession with the highest number of autodidacts.
Senior Project Engineers and Senior Software Engineers without a degree tend to be paid less than those with a degree. However, given the small sample size, we can’t be certain.
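The counts behind these statements can be tabulated per job title rather than read off the plots. The sketch below shows one way to do it on a made-up `toy` data frame; the job titles and rows are hypothetical, and "high school" is assumed to mean "no degree", as in the filters above.

```r
library(dplyr)

# Hypothetical toy data with two job titles.
toy <- data.frame(
  Job.Title       = c("Dev", "Dev", "Dev", "Dev", "Analyst", "Analyst"),
  Education.Level = c("high school", "high school", "bachelor's",
                      "master's", "bachelor's", "phd")
)

no_degree_by_job <- toy %>%
  group_by(Job.Title) %>%
  summarise(
    n_total         = n(),
    n_no_degree     = sum(Education.Level == "high school"),
    share_no_degree = n_no_degree / n_total,
    .groups = "drop"
  ) %>%
  arrange(desc(n_no_degree))

no_degree_by_job
```

Run on `popular_jobs_general`, the first row would name the job title with the most people without a degree, and `n_no_degree` makes the small sample sizes explicit.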