DS_job_analyze

Harvard Business Review called the data scientist role “the sexiest job of the 21st century”. Data scientists have become more and more popular in the job market this decade, and the role is essential across fields such as finance, banking, industry, and sports. Data scientists play an important role in improving their companies.

In this project, I use a dataset downloaded from Kaggle to get better insight into the salaries and current employment status of data science jobs across many countries.

#read.csv is part of base R, so no extra package is needed
df<-read.csv("ds_salaries.csv")
print(head(df))

##   X work_year experience_level employment_type                  job_title
## 1 0      2020               MI              FT             Data Scientist
## 2 1      2020               SE              FT Machine Learning Scientist
## 3 2      2020               SE              FT          Big Data Engineer
## 4 3      2020               MI              FT       Product Data Analyst
## 5 4      2020               SE              FT  Machine Learning Engineer
## 6 5      2020               EN              FT               Data Analyst
##   salary salary_currency salary_in_usd employee_residence remote_ratio
## 1  70000             EUR         79833                 DE            0
## 2 260000             USD        260000                 JP            0
## 3  85000             GBP        109024                 GB           50
## 4  20000             USD         20000                 HN            0
## 5 150000             USD        150000                 US           50
## 6  72000             USD         72000                 US          100
##   company_location company_size
## 1               DE            L
## 2               JP            S
## 3               GB            M
## 4               HN            S
## 5               US            L
## 6               US            L

Let’s look at the structure of the dataset.

str(df)

## 'data.frame':    607 obs. of  12 variables:
##  $ X                 : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ work_year         : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ experience_level  : chr  "MI" "SE" "SE" "MI" ...
##  $ employment_type   : chr  "FT" "FT" "FT" "FT" ...
##  $ job_title         : chr  "Data Scientist" "Machine Learning Scientist" "Big Data Engineer" "Product Data Analyst" ...
##  $ salary            : int  70000 260000 85000 20000 150000 72000 190000 11000000 135000 125000 ...
##  $ salary_currency   : chr  "EUR" "USD" "GBP" "USD" ...
##  $ salary_in_usd     : int  79833 260000 109024 20000 150000 72000 190000 35735 135000 125000 ...
##  $ employee_residence: chr  "DE" "JP" "GB" "HN" ...
##  $ remote_ratio      : int  0 0 50 0 50 100 100 50 100 50 ...
##  $ company_location  : chr  "DE" "JP" "GB" "HN" ...
##  $ company_size      : chr  "L" "S" "M" "S" ...

The dataset contains 607 records with 12 columns (including the row index X). In the employment_type column, the categorical values are denoted as follows:

unique(df$employment_type)

## [1] "FT" "CT" "PT" "FL"

FT: Full-time

PT: Part-time

CT: Contract Basis

FL: Freelancer

The experience_level column has the following categories:

unique(df$experience_level)

## [1] "MI" "SE" "EN" "EX"

EN: Entry Level

MI: Mid Level

SE: Senior Level

EX: Executive Level

Company Size also has three values:

unique(df$company_size)

## [1] "L" "S" "M"

S: Small

M: Medium

L: Large

Before analyzing the data, we need to check whether there are any missing values in the dataset.

print("Count of missing values by column ")

## [1] "Count of missing values by column "

sapply(df, function(x) sum(is.na(x)))

##                  X          work_year   experience_level    employment_type 
##                  0                  0                  0                  0 
##          job_title             salary    salary_currency      salary_in_usd 
##                  0                  0                  0                  0 
## employee_residence       remote_ratio   company_location       company_size 
##                  0                  0                  0                  0

Our dataset contains no NA values, so we can begin the analysis.

First, we look at the distribution of experience levels in the dataset.
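One caveat with the check above: read.csv leaves blank fields in character columns as empty strings rather than NA, so is.na alone would not flag them. Here is a minimal sketch of a check that also counts blank strings. The data frame toy and the helper count_missing are made up for illustration and are not part of the original analysis.

```r
# Toy data frame (hypothetical values) standing in for the real dataset
toy <- data.frame(
  job_title    = c("Data Scientist", "", "Data Analyst", NA),
  company_size = c("L", "S", " ", "M"),
  salary       = c(70000, NA, 85000, 20000),
  stringsAsFactors = FALSE
)

# Count values that are NA or, for character columns, blank after trimming
count_missing <- function(x) {
  if (is.character(x)) {
    sum(is.na(x) | trimws(x) == "")
  } else {
    sum(is.na(x))
  }
}

missing_counts <- sapply(toy, count_missing)
print(missing_counts)
```

If every count is zero on the real data, the NA check and this stricter check agree.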

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.2.2

prop<-prop.table(table(df$experience_level))
print(prop)

## 
##         EN         EX         MI         SE 
## 0.14497529 0.04283361 0.35090610 0.46128501

prop<-data.frame(prop)
##Pie chart plot
ggplot(prop, aes(x = "", y = Freq, fill = Var1)) + geom_col(color = "black") +geom_text(aes(label = round(Freq,2)), position = position_stack(vjust = 0.5)) +
    coord_polar(theta = "y") +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        panel.grid = element_blank())+
  guides(fill = guide_legend(title = "Level"))

Most of the data scientists here are at senior level (about 46%), while only a small share (about 4%) are at executive level. This is plausible, since a company needs only a few executive-level data scientists to lead a whole data science team. Moreover, entry-level positions make up only about 14% of the records, which suggests that fresh graduates could struggle to find a data science job, since most companies look for employees with more experience.

Next, we analyze which employment type is the most common in the data science job market. We can expect full-time positions to account for the largest share, since this job requires secure handling of the company’s data.

prop<-prop.table(table(df$employment_type))
print(prop)

## 
##          CT          FL          FT          PT 
## 0.008237232 0.006589786 0.968698517 0.016474465

prop<-data.frame(prop)
##Pie chart plot
ggplot(prop, aes(x = "", y = Freq, fill = Var1)) + geom_col(color = "black")+
    coord_polar(theta = "y") +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        panel.grid = element_blank())+
  guides(fill = guide_legend(title = "Employment type"))

As expected, full-time employees make up more than 96% of the records. Part-time comes second at about 1.6%, followed by contract (about 0.8%) and freelance (about 0.7%). Contract roles may exist because some companies run temporary projects and hire contractors only for those projects.

Now we want to know how salaries vary across these groups. First, we look at the salary range by experience level.

library("scales")
ggplot(df,aes(x=experience_level,y=salary_in_usd,fill=experience_level))+geom_boxplot() + 
  scale_y_continuous(labels = comma) +labs(title="Salary range based on experience level",x="Experience level",y="Salary",fill="Experience level")

#Median salary of the experience level
tapply(df$salary_in_usd,df$experience_level,median,na.rm=TRUE)

##       EN       EX       MI       SE 
##  56500.0 171437.5  76940.0 135500.0

The boxplot shows that the higher the experience level, the higher the salary. The executive level also has a wider spread than the other levels.

Next, we look at salary by employment type.

library("scales")
ggplot(df,aes(x=employment_type,y=salary_in_usd,fill=employment_type))+geom_boxplot() + 
  scale_y_continuous(labels = comma)+ labs(title="Salary range based on employment type",x="Employment type",y="Salary",fill = "Employment type")

#Median salary of the employment type
tapply(df$salary_in_usd,df$employment_type,median,na.rm=TRUE)

##       CT       FL       FT       PT 
## 105000.0  40000.0 104196.5  18817.5

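Contract employees have the highest median salary of all the types, only slightly above full-time employees, possibly because companies hire them for a short period to work on a specific project. Surprisingly, freelancers seem to earn more than part-time employees in the data science field. These comparisons should be read with caution: based on the proportions above, the dataset contains only about 5 contract, 4 freelance, and 10 part-time records out of 607.

Bigger companies usually tend to pay more than small ones. We will check whether this holds here.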
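Because the non-full-time groups are so small, it helps to print the group size next to each median. The sketch below shows one way to do that with tapply. The data frame toy holds made-up values for illustration only, and its numbers are not taken from the dataset.

```r
# Toy data (hypothetical values): employment type and salary in USD
toy <- data.frame(
  employment_type = c("FT", "FT", "FT", "FT", "FT", "CT", "PT", "PT"),
  salary_in_usd   = c(90000, 100000, 110000, 120000, 130000,
                      105000, 15000, 25000)
)

# Group size and median salary per employment type
n   <- tapply(toy$salary_in_usd, toy$employment_type, length)
med <- tapply(toy$salary_in_usd, toy$employment_type, median)

summary_by_type <- data.frame(
  n         = as.vector(n),
  median    = as.vector(med),
  row.names = names(n)
)
print(summary_by_type)
```

A median computed from one or two rows, like CT here, says little about the group as a whole.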

ggplot(df,aes(x=company_size,y=salary_in_usd,fill=company_size))+geom_boxplot() + labs(title="Salary range in three types of company",x="Company size",y="Salary",fill = "Company size") + scale_y_continuous(labels = comma)

#Median salary of the company size
tapply(df$salary_in_usd,df$company_size,median,na.rm=TRUE)

##      L      M      S 
## 100000 113188  65000

This is not quite what we expected: the median salary at medium-sized companies is higher than at large companies. However, the large-company group contains some high outliers, which could make its average salary higher than that of the medium group even though its median is lower. Employees at small companies earn much less.

Now let’s look at how each experience level is paid across the three company sizes.
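The difference between median and mean in the presence of outliers can be seen in a small worked example. The two vectors below are made-up numbers chosen for illustration and are not values from the dataset.

```r
# Hypothetical salaries: "large" contains one very high outlier
medium <- c(100000, 110000, 115000, 120000, 125000)
large  <- c(80000, 90000, 100000, 105000, 400000)

# The median ignores the size of the outlier ...
median_medium <- median(medium)  # 115000
median_large  <- median(large)   # 100000

# ... but the mean is pulled upward by it
mean_medium <- mean(medium)      # 114000
mean_large  <- mean(large)       # 155000

cat("Median: medium", median_medium, "large", median_large, "\n")
cat("Mean:   medium", mean_medium, "large", mean_large, "\n")
```

Here the large group has the lower median but the higher mean, so a ranking by company size can flip depending on which statistic is used.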

ggplot(df,aes(x=experience_level,y=salary_in_usd,fill=company_size))+geom_boxplot()+scale_y_continuous(labels=comma) + labs(x="Experience level",y="Salary",fill="Company size")

#Median value
tapply(df$salary_in_usd,list(df$company_size,df$experience_level),median,na.rm=TRUE)

##      EN       EX      MI       SE
## L 63831 196979.0 86000.0 147000.0
## M 49823 171437.5 78658.5 135500.0
## S 60000 118187.0 56738.0 108603.5

The boxplot is surprising: small companies are willing to pay more for entry-level positions than medium-sized ones. One possible explanation is that fresh graduates usually want to work at big companies, so small companies pay more at entry level to attract talent to their teams.

Moreover, the salary spread at executive level in small companies is also large. This may be because such a company has only one or two data scientists, who are extremely important to it, so the company needs to pay them more to keep them.

By summing the salaries by company location, we first plot a map of the total salary per country, and then a map of the average salary of data scientists per country.

#Calling the library
library(sf)

## Warning: package 'sf' was built under R version 4.2.2

## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE

library(terra)

## Warning: package 'terra' was built under R version 4.2.2

## terra 1.6.47

## 
## Attaching package: 'terra'

## The following object is masked from 'package:scales':
## 
##     rescale

library(spData)

## Warning: package 'spData' was built under R version 4.2.2

library(spDataLarge)

## Warning: package 'spDataLarge' was built under R version 4.2.2

#Sum all DS salary group by country
sum<-aggregate(df$salary_in_usd, list(df$company_location), sum)
colnames(sum)<-c("iso_a2","sum")
#left join
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.2

## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──

## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ✔ purrr   0.3.4

## Warning: package 'readr' was built under R version 4.2.2

## Warning: package 'forcats' was built under R version 4.2.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ tidyr::extract()    masks terra::extract()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()

world<-left_join(world,sum,by="iso_a2")
#map plot
ggplot(data = world) +
  geom_sf(aes(fill=log10(sum))) +
  theme_void() +scale_fill_viridis_c(label=comma) + labs(title="Salaries around the world",fill="Scale of salary")

#Average DS salary grouped by country
avg<-aggregate(df$salary_in_usd, list(df$company_location), mean)
colnames(avg)<-c("iso_a2","avg")
#left join (tidyverse is already loaded above)
world<-left_join(world,avg,by="iso_a2")
#map plot
ggplot(data = world) +
  geom_sf(aes(fill=log10(avg))) +
  theme_void() +scale_fill_viridis_c(label=comma) + labs(title="The average salaries around the world",fill="Scale of average salary")

#Average salary in USD grouped by company location
group<-aggregate(df$salary_in_usd,list(df$company_location),mean)
group<-group[order(-group$x),]
#ggplot
ggplot(data=group,aes(x=reorder(Group.1,-x),y=x,fill=Group.1))+geom_bar(stat = "identity") + theme(legend.position = 'none') +scale_y_continuous(labels=comma)+labs(x="Country",y="Average salary",title="Average salary in USD based on the location ")+ theme(axis.text.x = element_text(angle = 60, hjust = 1))

Surprisingly, companies in Russia pay the most on average for data science positions, followed by the USA.

The plots above show the total and the average salaries across countries. However, they could misrepresent salaries, because the dataset has only a few records for many countries. We will check which countries have the most data science jobs in this survey.

ggplot(df,aes(x=salary_in_usd,y=company_location,color=company_location)) + geom_jitter(position=position_dodge(0.8))+
  theme(legend.position = "none") + labs(title="Salary of data science job across countries",x="Salary",y="Country") + scale_x_continuous(labels = comma)

It is clear that the USA has the most data science jobs and the highest salaries compared to other countries. Canada and the United Kingdom follow in terms of both the highest salaries and the number of jobs.
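To put a number on how many records stand behind each country’s average, the counts can be tabulated next to the means. This is a minimal sketch on made-up data. The data frame toy and its values are hypothetical, and on the real data it would be replaced by df.

```r
# Toy data (hypothetical values): company location and salary in USD
toy <- data.frame(
  company_location = c("US", "US", "US", "GB", "GB", "CA", "RU"),
  salary_in_usd    = c(150000, 120000, 135000, 80000, 90000, 100000, 200000)
)

# Number of records per country, largest first
counts <- sort(table(toy$company_location), decreasing = TRUE)

# Average salary per country, aligned to the same country order
avg <- tapply(toy$salary_in_usd, toy$company_location, mean)

by_country <- data.frame(
  n         = as.vector(counts),
  avg       = as.vector(avg[names(counts)]),
  row.names = names(counts)
)
print(by_country)
```

In this toy example RU has the highest average but rests on a single record, which is the kind of case where a country-level average should not be over-interpreted.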

#Average salary in USD grouped by salary currency
group<-aggregate(df$salary_in_usd,list(df$salary_currency),mean)
group<-group[order(-group$x),]
#ggplot
ggplot(data=group,aes(x=reorder(Group.1,-x),y=x,fill=Group.1))+geom_bar(stat = "identity") + theme(legend.position = 'none') +scale_y_continuous(labels=comma)+labs(x="Currency",y="Average salary",title="Average salary in USD based on the salary currency")

We see that employees paid in USD earn the most, followed by those paid in Swiss francs and Singapore dollars. This graph is likely influenced by the value of each currency, as most currencies on the left-hand side of the graph have a relatively high value against the USD.

In this dataset, the remote_ratio attribute gives the share of work done remotely:

0: On-site

50: Hybrid

100: Remote

#Transform remote ratio into working type
working_type<-function(x){
  if(x==0){
    return("On-site")
  }
  if(x==50){
    return("Hybrid")
  }
  if(x==100){
    return("Remote")
  }
  #Any other value is treated as unknown
  return(NA_character_)
}
df$working_type<-sapply(df$remote_ratio,working_type)
#Proportion of each working type within each year
a<-data.frame(prop.table(table(df$work_year,df$working_type),margin = 1))
ggplot(a,aes(x=Var1,y=Freq,fill=Var2))+geom_bar(stat="identity",position="dodge") + labs(title="The proportion of working type each year",x="Year",y="Proportion",fill="Working type")

Because this dataset only covers data scientists who joined a company between 2020 and 2022, the trend it can show is limited. Since remote_ratio is the share of work done remotely, the plot suggests that on-site and hybrid arrangements held a larger share in 2020 and 2021, while the category that grows in 2022 is fully remote work. In other words, remote work did not fade after the COVID-19 outbreak. In this dataset many companies still allow their employees to work from home, possibly because of its convenience.

quoc_nguyen

2023-01-14