1 Data Explanation

The data used in this Exploratory Data Analysis (EDA) is salary data for data science workers at companies located in various parts of the world.

2 Data Inspection, Coercion, and Cleansing

salary <- read.csv("ds_salaries.csv")

2.1 Data Inspection

head(salary)

Using glimpse, we can quickly inspect the data:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

glimpse(salary)

## Rows: 3,755
## Columns: 11
## $ work_year          <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
## $ experience_level   <chr> "SE", "MI", "MI", "SE", "SE", "SE", "SE", "SE", "SE…
## $ employment_type    <chr> "FT", "CT", "CT", "FT", "FT", "FT", "FT", "FT", "FT…
## $ job_title          <chr> "Principal Data Scientist", "ML Engineer", "ML Engi…
## $ salary             <int> 80000, 30000, 25500, 175000, 120000, 222200, 136000…
## $ salary_currency    <chr> "EUR", "USD", "USD", "USD", "USD", "USD", "USD", "U…
## $ salary_in_usd      <int> 85847, 30000, 25500, 175000, 120000, 222200, 136000…
## $ employee_residence <chr> "ES", "US", "US", "CA", "CA", "US", "US", "CA", "CA…
## $ remote_ratio       <int> 100, 100, 100, 100, 100, 0, 0, 0, 0, 0, 0, 100, 100…
## $ company_location   <chr> "ES", "US", "US", "CA", "CA", "US", "US", "CA", "CA…
## $ company_size       <chr> "L", "S", "S", "M", "M", "L", "L", "M", "M", "M", "…

This summary tells us that the Data Science Job Salaries dataset contains 3,755 rows and 11 columns:

work_year: The year the salary was paid.
experience_level: The experience level in the job during the year.
employment_type: The type of employment for the role.
job_title: The role worked in during the year.
salary: The total gross salary amount paid.
salary_currency: The currency of the salary paid, as an ISO 4217 currency code.
salary_in_usd: The salary in USD.
employee_residence: The employee's primary country of residence during the work year, as an ISO 3166 country code.
remote_ratio: The overall amount of work done remotely.
company_location: The country of the employer's main office or contracting branch.
company_size: The median number of people that worked for the company during the year.

2.2 Data Coercion

The glimpse above tells us that some columns in the dataset are not stored in the correct data type, so we need to convert them to the correct one.

salary$experience_level <- as.factor(salary$experience_level)
salary$employment_type <- as.factor(salary$employment_type)
salary$company_size <- as.factor(salary$company_size)
salary$remote_ratio <- as.factor(salary$remote_ratio)
salary$work_year <- as.factor(salary$work_year)
salary$company_location <- as.factor(salary$company_location)

We convert these columns to the factor data type because each of them has only a limited set of unique values:

unique(salary$experience_level)

## [1] SE MI EN EX
## Levels: EN EX MI SE

unique(salary$employment_type)

## [1] FT CT FL PT
## Levels: CT FL FT PT

unique(salary$company_size)

## [1] L S M
## Levels: L M S

unique(salary$remote_ratio)

## [1] 100 0   50 
## Levels: 0 50 100

unique(salary$work_year)

## [1] 2023 2022 2020 2021
## Levels: 2020 2021 2022 2023

unique(salary$company_location)

##  [1] ES US CA DE GB NG IN HK NL CH CF FR FI UA IE IL GH CO SG AU SE SI MX BR PT
## [26] RU TH HR VN EE AM BA KE GR MK LV RO PK IT MA PL AL AR LT AS CR IR BS HU AT
## [51] SK CZ TR PR DK BO PH BE ID EG AE LU MY HN JP DZ IQ CN NZ CL MD MT
## 72 Levels: AE AL AM AR AS AT AU BA BE BO BR BS CA CF CH CL CN CO CR CZ ... VN
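The column-by-column checks above can also be done in one pass. The following is a minimal sketch in base R, assuming a small made-up data frame `toy` in place of `salary` and a hypothetical cut-off of 4 distinct values; neither is part of the original analysis.

```r
# Toy data frame standing in for `salary` (values are made up for illustration)
toy <- data.frame(
  experience_level = c("SE", "MI", "MI", "EN", "SE", "EX"),
  company_size     = c("L", "S", "M", "M", "L", "M"),
  job_title        = c("Data Scientist", "ML Engineer", "Data Analyst",
                       "Data Engineer", "Research Scientist", "Head of Data"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each column
n_distinct_values <- sapply(toy, function(column) length(unique(column)))

# Hypothetical cut-off: columns with at most 4 distinct values are factor candidates
factor_candidates <- names(n_distinct_values)[n_distinct_values <= 4]

print(n_distinct_values)
print(factor_candidates)
```

The cut-off is a judgment call: on the real data, company_location has 72 levels and was still treated as a factor.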

We use glimpse again to check whether each column in the dataset has the desired data type:

glimpse(salary)

## Rows: 3,755
## Columns: 11
## $ work_year          <fct> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
## $ experience_level   <fct> SE, MI, MI, SE, SE, SE, SE, SE, SE, SE, SE, SE, SE,…
## $ employment_type    <fct> FT, CT, CT, FT, FT, FT, FT, FT, FT, FT, FT, FT, FT,…
## $ job_title          <chr> "Principal Data Scientist", "ML Engineer", "ML Engi…
## $ salary             <int> 80000, 30000, 25500, 175000, 120000, 222200, 136000…
## $ salary_currency    <chr> "EUR", "USD", "USD", "USD", "USD", "USD", "USD", "U…
## $ salary_in_usd      <int> 85847, 30000, 25500, 175000, 120000, 222200, 136000…
## $ employee_residence <chr> "ES", "US", "US", "CA", "CA", "US", "US", "CA", "CA…
## $ remote_ratio       <fct> 100, 100, 100, 100, 100, 0, 0, 0, 0, 0, 0, 100, 100…
## $ company_location   <fct> ES, US, US, CA, CA, US, US, CA, CA, US, US, US, US,…
## $ company_size       <fct> L, S, S, M, M, L, L, M, M, M, M, M, M, L, L, M, M, …

We can see that each of these columns has been converted to the desired data type.
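As an alternative to one assignment per column, the same coercion can be written once over a vector of column names. This is a sketch on a made-up data frame `toy` (not the real dataset), using only base R.

```r
# Toy data frame standing in for `salary` (a few illustrative rows)
toy <- data.frame(
  work_year        = c(2023L, 2022L, 2023L),
  experience_level = c("SE", "MI", "EN"),
  remote_ratio     = c(100L, 0L, 50L),
  salary_in_usd    = c(85847L, 30000L, 25500L),
  stringsAsFactors = FALSE
)

# Columns to convert; every other column keeps its current type
factor_columns <- c("work_year", "experience_level", "remote_ratio")
toy[factor_columns] <- lapply(toy[factor_columns], as.factor)

str(toy)
```

One thing to keep in mind: once work_year is a factor, getting the numeric year back requires `as.numeric(as.character(toy$work_year))`, because `as.numeric()` on a factor returns the internal level codes.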

2.3 Data Cleansing

Check for missing values:

colSums(is.na(salary))

##          work_year   experience_level    employment_type          job_title 
##                  0                  0                  0                  0 
##             salary    salary_currency      salary_in_usd employee_residence 
##                  0                  0                  0                  0 
##       remote_ratio   company_location       company_size 
##                  0                  0                  0

There are no missing values in this dataset, so we can proceed to analyze the data.
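For a dataset that does contain missing values, the same check extends naturally. The sketch below uses a made-up data frame `toy` with deliberately inserted NA values, since the real data has none.

```r
# Toy data frame with deliberately missing values (illustration only)
toy <- data.frame(
  salary_in_usd = c(85847, NA, 25500),
  company_size  = c("L", "S", NA)
)

# Count of missing values per column, as in the check above
missing_per_column <- colSums(is.na(toy))

# Single TRUE/FALSE answer for the whole data frame
has_missing <- anyNA(toy)

# Keep only the rows without any missing value
complete_rows <- toy[complete.cases(toy), ]

print(missing_per_column)
print(has_missing)
print(complete_rows)
```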

3 Data Explanation

A brief summary of the data:

summary(salary)

##  work_year   experience_level employment_type  job_title        
##  2020:  76   EN: 320          CT:  10         Length:3755       
##  2021: 230   EX: 114          FL:  10         Class :character  
##  2022:1664   MI: 805          FT:3718         Mode  :character  
##  2023:1785   SE:2516          PT:  17                           
##                                                                 
##                                                                 
##                                                                 
##      salary         salary_currency    salary_in_usd    employee_residence
##  Min.   :    6000   Length:3755        Min.   :  5132   Length:3755       
##  1st Qu.:  100000   Class :character   1st Qu.: 95000   Class :character  
##  Median :  138000   Mode  :character   Median :135000   Mode  :character  
##  Mean   :  190696                      Mean   :137570                     
##  3rd Qu.:  180000                      3rd Qu.:175000                     
##  Max.   :30400000                      Max.   :450000                     
##                                                                           
##  remote_ratio company_location company_size
##  0  :1923     US     :3040     L: 454      
##  50 : 189     GB     : 172     M:3153      
##  100:1643     CA     :  87     S: 148      
##               ES     :  77                 
##               IN     :  58                 
##               DE     :  56                 
##               (Other): 265

From this summary, we can see that:
- The highest salary in the data science field is $450,000 USD.
- The lowest salary in the data science field is $5,132 USD.
- The average salary in the data science field is $137,570 USD worldwide.
- Most data scientists work full-time (FT): 3,718 of the 3,755 records.
- Most data scientists work at medium-sized companies (3,153), followed by large companies (454) and small companies (148).
- Remote (1,643) and on-site (1,923) workers occur in similar numbers, while hybrid workers are a minority (189).
- The majority of companies in this dataset are in the US (3,040 of the 3,755 records).

Next, we check the salary data for outliers. We use the salary_in_usd column for this analysis, because the salary column mixes values in many different currencies, so its values are not directly comparable.

boxplot(salary$salary_in_usd)

boxplot.stats(salary$salary_in_usd)

## $stats
## [1]   5132  95000 135000 175000 293000
## 
## $n
## [1] 3755
## 
## $conf
## [1] 132937.3 137062.7
## 
## $out
##  [1] 342810 309400 300000 342300 318300 309400 300000 329500 304000 353200
## [11] 297300 317070 423834 376080 299500 297300 299500 340000 310000 310000
## [21] 300240 300240 370000 323300 299500 310000 375000 318300 385000 370000
## [31] 314100 350000 310000 300000 299500 300000 300000 297300 297300 310000
## [41] 310000 430967 300000 310000 299500 300000 375000 350000 315000 300000
## [51] 345600 300000 297500 300000 300000 324000 405000 380000 450000 416000
## [61] 325000 423000 412000

sd(salary$salary_in_usd) # calculate the standard deviation

## [1] 63055.63

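The salary data contains outliers: 63 values lie above the upper whisker of 293,000 USD. However, the standard deviation (63,055.63) is smaller than the interquartile range between Q1 and Q3 (175,000 - 95,000 = 80,000), so in my opinion the outliers are tolerable.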
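To make the outlier rule concrete, the arithmetic behind the boxplot.stats output can be reproduced by hand. This sketch assumes the default coef = 1.5 of boxplot.stats and uses only the numbers printed above; the variable names are illustrative.

```r
# Five-number summary as printed in $stats above
stats <- c(5132, 95000, 135000, 175000, 293000)
lower_hinge <- stats[2]
upper_hinge <- stats[4]

# Spread between the hinges (the interquartile range)
iqr <- upper_hinge - lower_hinge

# Values beyond 1.5 * IQR from a hinge are reported in $out
upper_fence <- upper_hinge + 1.5 * iqr
lower_fence <- lower_hinge - 1.5 * iqr

# Share of observations flagged as outliers (63 values in $out, n = 3755)
outlier_share <- 63 / 3755

print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(round(outlier_share * 100, 1))
```

The upper fence works out to 295,000 USD, which is why the whisker stops at 293,000 (the largest salary below the fence) and the smallest flagged value is 297,300. The lower fence is negative, so no low salary is flagged. About 1.7% of the observations are outliers.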

4 Data Exploration

4.1 Highest average salary of data scientists worldwide

aggregate(salary_in_usd~company_location,salary,mean) %>%
  arrange(desc(salary_in_usd))

We can see that Israel (IL) is the country with the highest average salary for data scientists, followed by Puerto Rico (PR) and the United States (US). How about the lowest?

aggregate(salary_in_usd~company_location,salary,mean) %>%
  arrange(salary_in_usd)

North Macedonia (MK) is the country with the lowest average salary for data scientists, followed by Bolivia (BO) and Albania (AL).
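A ranking by mean alone does not show how many records each average rests on. In this dataset 3,040 of the 3,755 rows are from the US, and the 265 rows in the "(Other)" group are spread over 66 countries, so some country averages may be based on very few salaries. The sketch below shows one way to report the count next to the mean; it uses a made-up data frame `toy`, not the real data.

```r
# Toy data standing in for `salary` (made-up values)
toy <- data.frame(
  company_location = c("US", "US", "US", "IL", "GB", "GB"),
  salary_in_usd    = c(150000, 130000, 140000, 270000, 90000, 80000)
)

# Mean salary and number of records per country
mean_by_country  <- aggregate(salary_in_usd ~ company_location, toy, mean)
count_by_country <- aggregate(salary_in_usd ~ company_location, toy, length)
names(mean_by_country)[2]  <- "mean_salary"
names(count_by_country)[2] <- "n"

# Combine both and sort from highest to lowest mean
country_summary <- merge(mean_by_country, count_by_country, by = "company_location")
country_summary <- country_summary[order(-country_summary$mean_salary), ]

print(country_summary)
```

In the toy data the top-ranked country has a single record, which is exactly the situation where an average should be read with caution.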

4.2 Salary difference between WFH, WFO, and Hybrid

Average salary according to working style:

aggregate(salary_in_usd~remote_ratio,salary,mean)

We can see that WFO (remote_ratio = 0) has the highest average salary among the working styles. How about the highest and lowest salaries according to working style?

aggregate(salary_in_usd~remote_ratio,salary,max)

aggregate(salary_in_usd~remote_ratio,salary,min)

The highest salary is WFO at $450,000 USD, and the lowest salary is WFH at $5,132 USD.
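The three separate aggregate calls for mean, maximum, and minimum can be folded into one table. This is a sketch using base R on a made-up data frame `toy`; only the 450,000 and 5,132 values are taken from the results above, the rest are invented.

```r
# Toy data standing in for `salary` (mostly made-up values)
toy <- data.frame(
  remote_ratio  = factor(c(0, 0, 50, 50, 100, 100)),
  salary_in_usd = c(120000, 450000, 60000, 90000, 5132, 130000)
)

# Split the salaries by working style, then compute three statistics per group
salary_groups <- split(toy$salary_in_usd, toy$remote_ratio)
by_style <- t(sapply(salary_groups, function(x) {
  c(mean = mean(x), min = min(x), max = max(x))
}))

# One row per remote_ratio level, one column per statistic
print(by_style)
```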

4.3 Average Data Scientist Salary Year by Year

aggregate(salary_in_usd~work_year,salary,mean) %>%
  arrange(salary_in_usd)

As you can see, the average data scientist salary increases every year.
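The size of the yearly increase can be quantified as a percentage change from one year to the next. The sketch below uses hypothetical yearly means, not the values computed from ds_salaries.csv, and sorts by year rather than by salary so that the rows are guaranteed to be in chronological order.

```r
# Hypothetical yearly means, NOT the values from ds_salaries.csv
yearly <- data.frame(
  work_year   = c(2020, 2021, 2022, 2023),
  mean_salary = c(90000, 95000, 130000, 150000)
)

# Make sure the rows are in chronological order before differencing
yearly <- yearly[order(yearly$work_year), ]

# Percentage change relative to the previous year (no value for the first year)
previous <- head(yearly$mean_salary, -1)
yearly$pct_change <- c(NA, round(diff(yearly$mean_salary) / previous * 100, 1))

print(yearly)
```

On the real data, the input would be the result of the aggregate call above. Because work_year was converted to a factor, it would need `as.numeric(as.character(...))` before being used as a number.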

4.4 Average Salary by Experience Level

aggregate(salary_in_usd~experience_level,salary,mean) %>%
  arrange(salary_in_usd)

This data shows us that more experience means a higher salary.

4.5 Experience level distributed by working style

xtabs(~experience_level+remote_ratio,salary)

##                 remote_ratio
## experience_level    0   50  100
##               EN  111   65  144
##               EX   56    6   52
##               MI  396   74  335
##               SE 1360   44 1112

plot(xtabs(~experience_level+remote_ratio,salary))

From this table and plot, we can see that workers are concentrated at the senior level (SE) and in the WFO working style. Additionally, hybrid is the least used working style regardless of experience level.
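Because the experience levels differ greatly in size, row percentages make the working-style mix easier to compare than raw counts. The sketch below rebuilds the contingency table from the counts printed above and converts each row to percentages with prop.table.

```r
# Counts copied from the xtabs output above
counts <- matrix(
  c( 111, 65,  144,
      56,  6,   52,
     396, 74,  335,
    1360, 44, 1112),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    experience_level = c("EN", "EX", "MI", "SE"),
    remote_ratio     = c("0", "50", "100")
  )
)

# Share of each working style within an experience level, in percent
row_share <- round(prop.table(counts, margin = 1) * 100, 1)

print(row_share)
```

On the full data, the same call can take the table directly: `prop.table(xtabs(~experience_level+remote_ratio, salary), margin = 1)`. The counts also show that entry level (EN) is the only group with more fully remote workers (144) than on-site workers (111).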

5 Conclusion and Recommendation

5.1 Conclusion

According to the data above, we can conclude that:

The average data scientist salary increased every year from 2020 to 2023, with an overall average salary of $137,570 USD.
The country with the most workers in this data is the United States.
The highest data scientist salary in the data is $450,000 USD.
The countries with the highest average data scientist salary are Israel, Puerto Rico, and the United States.
The countries with the lowest average data scientist salary are North Macedonia, Bolivia, and Albania.

5.2 Recommendation

For job seekers: the highest average data scientist salaries are in Israel, Puerto Rico, and the United States. You might want to seek a job in those countries for a better income.
For recruiters and companies: this data can help you decide where to look for employees and how to set their salaries, e.g. you can recruit in countries with a low average salary to limit budget expenses.

Programming for Data Science: LBB 1

Rizky Fadilah

12 June 2023