The data used in this Exploratory Data Analysis (EDA) is salary data for data science workers at companies located in various parts of the world.
salary <- read.csv("ds_salaries.csv")
head(salary)
Using glimpse, we can quickly inspect the data:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(salary)
## Rows: 3,755
## Columns: 11
## $ work_year <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
## $ experience_level <chr> "SE", "MI", "MI", "SE", "SE", "SE", "SE", "SE", "SE…
## $ employment_type <chr> "FT", "CT", "CT", "FT", "FT", "FT", "FT", "FT", "FT…
## $ job_title <chr> "Principal Data Scientist", "ML Engineer", "ML Engi…
## $ salary <int> 80000, 30000, 25500, 175000, 120000, 222200, 136000…
## $ salary_currency <chr> "EUR", "USD", "USD", "USD", "USD", "USD", "USD", "U…
## $ salary_in_usd <int> 85847, 30000, 25500, 175000, 120000, 222200, 136000…
## $ employee_residence <chr> "ES", "US", "US", "CA", "CA", "US", "US", "CA", "CA…
## $ remote_ratio <int> 100, 100, 100, 100, 100, 0, 0, 0, 0, 0, 0, 100, 100…
## $ company_location <chr> "ES", "US", "US", "CA", "CA", "US", "US", "CA", "CA…
## $ company_size <chr> "L", "S", "S", "M", "M", "L", "L", "M", "M", "M", "…
This summary tells us that the Data Science Job Salaries dataset contains 3,755 rows and 11 columns:
- work_year: The year the salary was paid.
- experience_level: The experience level in the job during the year.
- employment_type: The type of employment for the role.
- job_title: The role worked in during the year.
- salary: The total gross salary amount paid.
- salary_currency: The currency of the salary paid, as an ISO 4217 currency code.
- salary_in_usd: The salary in USD.
- employee_residence: The employee's primary country of residence during the work year, as an ISO 3166 country code.
- remote_ratio: The overall amount of work done remotely.
- company_location: The country of the employer's main office or contracting branch.
- company_size: The median number of people that worked for the company during the year.
The glimpse above also shows that some columns are not stored in the correct data type, so we need to convert them.
salary$experience_level <- as.factor(salary$experience_level)
salary$employment_type <- as.factor(salary$employment_type)
salary$company_size <- as.factor(salary$company_size)
salary$remote_ratio <- as.factor(salary$remote_ratio)
salary$work_year <- as.factor(salary$work_year)
salary$company_location <- as.factor(salary$company_location)
We convert these columns to factors because each one holds a limited set of repeated values:
unique(salary$experience_level)
## [1] SE MI EN EX
## Levels: EN EX MI SE
unique(salary$employment_type)
## [1] FT CT FL PT
## Levels: CT FL FT PT
unique(salary$company_size)
## [1] L S M
## Levels: L M S
unique(salary$remote_ratio)
## [1] 100 0 50
## Levels: 0 50 100
unique(salary$work_year)
## [1] 2023 2022 2020 2021
## Levels: 2020 2021 2022 2023
unique(salary$company_location)
## [1] ES US CA DE GB NG IN HK NL CH CF FR FI UA IE IL GH CO SG AU SE SI MX BR PT
## [26] RU TH HR VN EE AM BA KE GR MK LV RO PK IT MA PL AL AR LT AS CR IR BS HU AT
## [51] SK CZ TR PR DK BO PH BE ID EG AE LU MY HN JP DZ IQ CN NZ CL MD MT
## 72 Levels: AE AL AM AR AS AT AU BA BE BO BR BS CA CF CH CL CN CO CR CZ ... VN
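The six separate as.factor() calls can also be written as one step. The sketch below is only an illustration: it uses a small made-up data frame named toy (not the real salary data) and a hypothetical factor_cols vector, and it relies on base R alone.

```r
# Toy data standing in for the real `salary` data frame (values are made up)
toy <- data.frame(
  work_year        = c(2023L, 2022L, 2023L),
  experience_level = c("SE", "MI", "EN"),
  company_size     = c("L", "S", "M"),
  salary_in_usd    = c(90000L, 30000L, 25000L),
  stringsAsFactors = FALSE
)

# Columns we want to treat as categorical (hypothetical helper vector)
factor_cols <- c("work_year", "experience_level", "company_size")

# Convert every listed column to a factor in one assignment
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Check the resulting column types
str(toy)
```

Keeping the column names in one vector makes it easy to add or remove a column later without repeating the conversion line.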
We use glimpse again to check that each column now has the desired data type:
glimpse(salary)
## Rows: 3,755
## Columns: 11
## $ work_year <fct> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
## $ experience_level <fct> SE, MI, MI, SE, SE, SE, SE, SE, SE, SE, SE, SE, SE,…
## $ employment_type <fct> FT, CT, CT, FT, FT, FT, FT, FT, FT, FT, FT, FT, FT,…
## $ job_title <chr> "Principal Data Scientist", "ML Engineer", "ML Engi…
## $ salary <int> 80000, 30000, 25500, 175000, 120000, 222200, 136000…
## $ salary_currency <chr> "EUR", "USD", "USD", "USD", "USD", "USD", "USD", "U…
## $ salary_in_usd <int> 85847, 30000, 25500, 175000, 120000, 222200, 136000…
## $ employee_residence <chr> "ES", "US", "US", "CA", "CA", "US", "US", "CA", "CA…
## $ remote_ratio <fct> 100, 100, 100, 100, 100, 0, 0, 0, 0, 0, 0, 100, 100…
## $ company_location <fct> ES, US, US, CA, CA, US, US, CA, CA, US, US, US, US,…
## $ company_size <fct> L, S, S, M, M, L, L, M, M, M, M, M, M, L, L, M, M, …
We can see that each column has been converted to the desired data type.
Check for missing values:
colSums(is.na(salary))
## work_year experience_level employment_type job_title
## 0 0 0 0
## salary salary_currency salary_in_usd employee_residence
## 0 0 0 0
## remote_ratio company_location company_size
## 0 0 0
There are no missing values in this dataset, so we can proceed to analyze the data.
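One caveat: is.na() only counts values that R already treats as NA. In a CSV read with read.csv, a blank text cell arrives as an empty string rather than NA, so it may be worth counting blanks as well. The following is a minimal sketch on a made-up data frame named toy; the blank_count helper is hypothetical and not part of the original analysis.

```r
# Toy data with one NA and one blank string per the comments (values are made up)
toy <- data.frame(
  job_title     = c("ML Engineer", "", NA),   # one blank, one NA
  salary_in_usd = c(30000, NA, 25500),        # one NA
  stringsAsFactors = FALSE
)

# Values that R already recognises as missing
na_count <- colSums(is.na(toy))

# Non-NA values that are empty or whitespace-only strings
blank_count <- sapply(toy, function(col) {
  sum(!is.na(col) & trimws(as.character(col)) == "")
})

na_count
blank_count
```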
Brief summary of the data:
summary(salary)
## work_year experience_level employment_type job_title
## 2020: 76 EN: 320 CT: 10 Length:3755
## 2021: 230 EX: 114 FL: 10 Class :character
## 2022:1664 MI: 805 FT:3718 Mode :character
## 2023:1785 SE:2516 PT: 17
##
##
##
## salary salary_currency salary_in_usd employee_residence
## Min. : 6000 Length:3755 Min. : 5132 Length:3755
## 1st Qu.: 100000 Class :character 1st Qu.: 95000 Class :character
## Median : 138000 Mode :character Median :135000 Mode :character
## Mean : 190696 Mean :137570
## 3rd Qu.: 180000 3rd Qu.:175000
## Max. :30400000 Max. :450000
##
## remote_ratio company_location company_size
## 0 :1923 US :3040 L: 454
## 50 : 189 GB : 172 M:3153
## 100:1643 CA : 87 S: 148
## ES : 77
## IN : 58
## DE : 56
## (Other): 265
From this summary, we can see that:
- The highest salary in the data science field is $450,000 USD.
- The lowest salary in the data science field is $5,132 USD.
- The average salary is $137,570 USD worldwide.
- Most data science workers are employed full-time (FT): 3,718 of 3,755 rows.
- Most rows come from medium-sized companies (3,153), followed by large companies (454) and small companies (148).
- Fully remote (1,643) and on-site (1,923) workers are almost equally common, while hybrid workers are a small minority (189).
- The majority of companies in this dataset are located in the US (3,040 rows).
Next, we check the salary data for outliers. We use the salary in USD from the salary_in_usd column, because the salary column mixes many different currencies and its values are not directly comparable.
boxplot(salary$salary_in_usd)
boxplot.stats(salary$salary_in_usd)
## $stats
## [1] 5132 95000 135000 175000 293000
##
## $n
## [1] 3755
##
## $conf
## [1] 132937.3 137062.7
##
## $out
## [1] 342810 309400 300000 342300 318300 309400 300000 329500 304000 353200
## [11] 297300 317070 423834 376080 299500 297300 299500 340000 310000 310000
## [21] 300240 300240 370000 323300 299500 310000 375000 318300 385000 370000
## [31] 314100 350000 310000 300000 299500 300000 300000 297300 297300 310000
## [41] 310000 430967 300000 310000 299500 300000 375000 350000 315000 300000
## [51] 345600 300000 297500 300000 300000 324000 405000 380000 450000 416000
## [61] 325000 423000 412000
sd(salary$salary_in_usd) # calculate the standard deviation
## [1] 63055.63
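The salary data contains 63 outliers, all of them above the upper whisker at $293,000. The standard deviation is 63,055.63, which is smaller than the interquartile range between Q1 and Q3 (175,000 - 95,000 = 80,000). So, in my opinion, the outliers are tolerable.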
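To see where the outlier cutoff comes from, the fences can be recomputed by hand from the hinges printed in $stats. This is a small worked example that assumes the default coef = 1.5 of boxplot.stats(); the variable names are chosen here for illustration.

```r
# Lower and upper hinges as reported in $stats above
q1 <- 95000
q3 <- 175000

# Interquartile range and the 1.5 * IQR fences
iqr         <- q3 - q1            # 80000
lower_fence <- q1 - 1.5 * iqr     # -25000
upper_fence <- q3 + 1.5 * iqr     # 295000

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence)
```

The upper whisker (293,000) is the largest observation that does not exceed the upper fence of 295,000, and the smallest value listed in $out (297,300) lies just above it. The lower fence is negative, which is why no low salaries are flagged.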
aggregate(salary_in_usd~company_location,salary,mean) %>%
  arrange(desc(salary_in_usd))
We can see that Israel (IL) has the highest average salary for data scientists, followed by Puerto Rico (PR) and the United States (US). What about the lowest?
aggregate(salary_in_usd~company_location,salary,mean) %>%
  arrange(salary_in_usd)
North Macedonia (MK) has the lowest average salary for data scientists, followed by Bolivia (BO) and Albania (AL).
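A country ranking by mean salary can be misleading when a country has only a handful of rows, since the summary above shows that most rows belong to a few countries. One way to keep this in view is to report the row count next to each mean. The sketch below uses a made-up data frame named toy with invented salaries; it is not the real ranking.

```r
# Toy data: three rows for one country, one row each for two others (made up)
toy <- data.frame(
  company_location = c("US", "US", "US", "IL", "MK"),
  salary_in_usd    = c(100000, 120000, 140000, 200000, 10000)
)

# Mean salary per country
means <- aggregate(salary_in_usd ~ company_location, toy, mean)

# Number of rows per country
counts <- aggregate(salary_in_usd ~ company_location, toy, length)
names(counts)[2] <- "n"

# Combine both and sort from highest to lowest mean
by_country <- merge(means, counts, by = "company_location")
by_country <- by_country[order(-by_country$salary_in_usd), ]
by_country
```

A mean that rests on one or two rows says more about those individual salaries than about the country as a whole.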
Average salary according to working style:
aggregate(salary_in_usd~remote_ratio,salary,mean)
We can see that on-site work (WFO, remote_ratio = 0) has the highest average salary among the working styles. What about the highest and lowest salary for each working style?
aggregate(salary_in_usd~remote_ratio,salary,max)
aggregate(salary_in_usd~remote_ratio,salary,min)
The highest salary is on-site (WFO) at $450,000 USD, and the lowest salary is fully remote (WFH) at $5,132 USD.
aggregate(salary_in_usd~work_year,salary,mean) %>%
  arrange(salary_in_usd)
As you can see, the average data scientist salary increased every year.
aggregate(salary_in_usd~experience_level,salary,mean) %>%
  arrange(salary_in_usd)
These data show that more experience is associated with a higher average salary.
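By default, as.factor() orders the levels alphabetically (EN, EX, MI, SE), which does not match seniority. If the codes are read as entry (EN), mid (MI), senior (SE) and executive (EX), which is an assumption about the dataset's coding, the levels can be set explicitly so that tables and plots follow that order. A minimal sketch on made-up values:

```r
# Toy data with invented salaries
toy <- data.frame(
  experience_level = c("SE", "MI", "EN", "EX", "SE"),
  salary_in_usd    = c(150000, 100000, 60000, 200000, 170000)
)

# Set the level order by assumed seniority instead of alphabetically
toy$experience_level <- factor(
  toy$experience_level,
  levels = c("EN", "MI", "SE", "EX")
)

# aggregate() returns the groups in factor-level order
by_level <- aggregate(salary_in_usd ~ experience_level, toy, mean)
by_level
```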
xtabs(~experience_level+remote_ratio,salary)
## remote_ratio
## experience_level 0 50 100
## EN 111 65 144
## EX 56 6 52
## MI 396 74 335
## SE 1360 44 1112
plot(xtabs(~experience_level+remote_ratio,salary))
From the table and the plot, we can see that most workers are at the
senior level (SE) and work on-site (WFO). In addition, the hybrid
working style is the least common one at every experience level.
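Because the experience levels differ a lot in size, row shares are easier to compare than raw counts. The sketch below re-enters the counts from the cross-tabulation above as a matrix and applies prop.table() by row; the names counts and row_share are chosen here for illustration.

```r
# Counts copied from the xtabs output above
counts <- matrix(
  c( 111, 65,  144,
      56,  6,   52,
     396, 74,  335,
    1360, 44, 1112),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    experience_level = c("EN", "EX", "MI", "SE"),
    remote_ratio     = c("0", "50", "100")
  )
)

# Share of each working style within each experience level (rows sum to 1)
row_share <- prop.table(counts, margin = 1)
round(row_share, 3)
```

On the real data the same result should follow from prop.table(xtabs(~experience_level+remote_ratio,salary), margin = 1).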
According to the data above, we can conclude that:
- The average data scientist salary increased every year from 2020 to 2023, with an overall average salary of $137,570 USD.
- The country with the most workers in this dataset is the United States.
- The highest data scientist salary in the dataset is $450,000 USD.
- The countries with the highest average data scientist salary are Israel, Puerto Rico, and the United States.
- The countries with the lowest average data scientist salary are North Macedonia, Bolivia, and Albania.