Hello everyone
In this RMD, I use the dataset from kaggle
datasets. The dataset contains information about salaries of jobs in
the data science domain (Detail information about this data will be
explained in the next section).
Previously, I have downloaded the data and then I saved it in one
folder named data_input. So, let’s call the data and assign to one
object, I was named it as ds.
ds <- read.csv("LBB 1/data_input/ds_salaries.csv")Let’s see the top six data from ds
head(ds)#checking dimention of data `ds`
dim(ds)#> [1] 607 12
The result shows that the data has 607 rows and 12 columns.
Explanation from each columns
work_year : The year the salary was paid
experience_level : The experience level in the job during the year
employment_type : The type of employment
job_title : The role worked in during the year
salary : The total gross salary amount paid
salary_currency : The currency of the salary paid as an ISO 4217 currency code
salary_in_usd : The salary in USD
employee_residence : Employee’s primary country of residence in during the work year as an ISO 3166 country code
remote_ratio : The overall amount of work done remotely, possible values are as follows: 0 No remote work (less than 20%), 50 Partially remote, and 100 Fully remote (more than 80%)
company_location : The country of the employer’s main office or contracting branch as an ISO 3166 country code
company_size : The average number of people that worked for the company during the year: S less than 50 employees (small), M 50 to 250 employees (medium) , and L more than 250 employees (large)
str(ds)#> 'data.frame': 607 obs. of 12 variables:
#> $ X : int 0 1 2 3 4 5 6 7 8 9 ...
#> $ work_year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
#> $ experience_level : chr "MI" "SE" "SE" "MI" ...
#> $ employment_type : chr "FT" "FT" "FT" "FT" ...
#> $ job_title : chr "Data Scientist" "Machine Learning Scientist" "Big Data Engineer" "Product Data Analyst" ...
#> $ salary : int 70000 260000 85000 20000 150000 72000 190000 11000000 135000 125000 ...
#> $ salary_currency : chr "EUR" "USD" "GBP" "USD" ...
#> $ salary_in_usd : int 79833 260000 109024 20000 150000 72000 190000 35735 135000 125000 ...
#> $ employee_residence: chr "DE" "JP" "GB" "HN" ...
#> $ remote_ratio : int 0 0 50 0 50 100 100 50 100 50 ...
#> $ company_location : chr "DE" "JP" "GB" "HN" ...
#> $ company_size : chr "L" "S" "M" "S" ...
From the data, I found some columns that does not have datatype exactly.
The datatype of these columns have to change to
factor
ds$experience_level <- as.factor(ds$experience_level)
ds$employment_type <- as.factor(ds$employment_type)
ds$job_title <- as.factor(ds$job_title)
ds$salary_currency <- as.factor(ds$salary_currency)
ds$employee_residence <- as.factor(ds$employee_residence)
ds$company_location <- as.factor(ds$company_location)
ds$company_size <- as.factor(ds$company_size)The datatype of these columns have to change to
numeric
ds$work_year <- as.numeric(ds$work_year)
ds$salary <- as.numeric(ds$salary)
ds$salary_in_usd <- as.numeric(ds$salary_in_usd)
ds$remote_ratio <- as.numeric(ds$remote_ratio)Let’s check back the structure of data
str(ds)#> 'data.frame': 607 obs. of 12 variables:
#> $ X : int 0 1 2 3 4 5 6 7 8 9 ...
#> $ work_year : num 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
#> $ experience_level : Factor w/ 4 levels "EN","EX","MI",..: 3 4 4 3 4 1 4 3 3 4 ...
#> $ employment_type : Factor w/ 4 levels "CT","FL","FT",..: 3 3 3 3 3 3 3 3 3 3 ...
#> $ job_title : Factor w/ 50 levels "3D Computer Vision Researcher",..: 23 41 8 48 38 13 35 23 9 34 ...
#> $ salary : num 70000 260000 85000 20000 150000 72000 190000 11000000 135000 125000 ...
#> $ salary_currency : Factor w/ 17 levels "AUD","BRL","CAD",..: 8 17 9 17 17 17 17 10 17 17 ...
#> $ salary_in_usd : num 79833 260000 109024 20000 150000 ...
#> $ employee_residence: Factor w/ 57 levels "AE","AR","AT",..: 15 33 21 24 56 56 56 26 56 42 ...
#> $ remote_ratio : num 0 0 50 0 50 100 100 50 100 50 ...
#> $ company_location : Factor w/ 50 levels "AE","AS","AT",..: 13 30 19 21 49 49 49 23 49 39 ...
#> $ company_size : Factor w/ 3 levels "L","M","S": 1 3 2 3 1 1 3 1 1 3 ...
anyNA(ds)#> [1] FALSE
The result shows that the data does not have missing value
If I pay attention the data ds, column X is
unnecessary, so it is better if I remove it from the data.
ds <- ds[ , -c(1)]
dsAfter did 3 steps data cleaning, the data is already clean. Let’s go to the next section!
Let’s do data analytics to answer the business questions through the
informations that available in data ds.
#Check summary from data ds
summary(ds)#> work_year experience_level employment_type
#> Min. :2020 EN: 88 CT: 5
#> 1st Qu.:2021 EX: 26 FL: 4
#> Median :2022 MI:213 FT:588
#> Mean :2021 SE:280 PT: 10
#> 3rd Qu.:2022
#> Max. :2022
#>
#> job_title salary salary_currency
#> Data Scientist :143 Min. : 4000 USD :398
#> Data Engineer :132 1st Qu.: 70000 EUR : 95
#> Data Analyst : 97 Median : 115000 GBP : 44
#> Machine Learning Engineer: 41 Mean : 324000 INR : 27
#> Research Scientist : 16 3rd Qu.: 165000 CAD : 18
#> Data Science Manager : 12 Max. :30400000 JPY : 3
#> (Other) :166 (Other): 22
#> salary_in_usd employee_residence remote_ratio company_location
#> Min. : 2859 US :332 Min. : 0.00 US :355
#> 1st Qu.: 62726 GB : 44 1st Qu.: 50.00 GB : 47
#> Median :101570 IN : 30 Median :100.00 CA : 30
#> Mean :112298 CA : 29 Mean : 70.92 DE : 28
#> 3rd Qu.:150000 DE : 25 3rd Qu.:100.00 IN : 24
#> Max. :600000 FR : 18 Max. :100.00 FR : 15
#> (Other):129 (Other):108
#> company_size
#> L:198
#> M:326
#> S: 83
#>
#>
#>
#>
From this summary, I can know that this data start from 2020 until
2022 (work year). There are 4 experience levels, EN for Entry Level, MI
for Mid-level, SE for Senior-level, and EX for Executive-level. There
are also 4 employment types, CT for Contract, FL for Freelance, FT for
Full Time, and PT for Part Time. This summary also shows the lowest and
the highest salary that has been paid and other information from each
column in data ds.
As I know, there are some people who more interest to read information from the graphics than the table or text. So, below I will make simple plot to show information from some columns such as experience_level, employment-type, and company_size.
# experience_level
plot(ds$experience_level)plot(ds$employment_type)# job_title
plot(ds$company_size)
- From the data
ds, check the average salary per job title in each work year start from 2020, 2021, and 2022, then get to know what is job title in the data science domain with highest average salary in every year?
# 2020
ds2020 <- ds[(ds$work_year == 2020) , ]
hasil1 <- aggregate(x = salary_in_usd ~ job_title, data = ds2020, FUN = mean)
head(hasil1[order(hasil1$salary_in_usd, decreasing = T), ])# 2021
ds2021 <- ds[(ds$work_year == 2021) , ]
hasil2 <- aggregate(x = salary_in_usd ~ job_title, data = ds2021, FUN = mean)
head(hasil2[order(hasil2$salary_in_usd, decreasing = T), ])# 2022
ds2022 <- ds[(ds$work_year == 2022) , ]
hasil3 <- aggregate(x = salary_in_usd ~ job_title, data = ds2022, FUN = mean)
head(hasil3[order(hasil3$salary_in_usd, decreasing = T), ])Answer:
There are so many jobs in the data science domain, the results shown above indicate average salary (in USD) in every job in data science domain globally not differentiated by experience levels, employment type, company size and other. Based on the results, the highest average salary in 2020 is Director of Data Science, in 2021 is Financial Data Analyst, and in 2022 is Data Analytics Lead.
- What country does company from that give salary more than average salary for job as a data scientist in 2021?
salary_ds <- ds[(ds$work_year == 2021) & (ds$job_title == 'Data Scientist') , ]
mean_salary <- mean(salary_ds$salary_in_usd)
mean_salary#> [1] 70671.73
salary_ds[salary_ds$salary_in_usd > 70671.73 , "company_location"]#> [1] US US US US US CA IL DE US US DE GB CA DE US CA US US US US FR US US
#> 50 Levels: AE AS AT AU BE BR CA CH CL CN CO CZ DE DK DZ EE ES FR GB GR ... VN
Answer :
The result shows 6 countries that give more than average salary for job as data scientist in 2021
- There is a student, she wants to take a part time job in the data science domain because she has understood about data science and her experience is Middle Level. But, she feels confuse which company’s country she has to choose (Part time job online). She wants to get high salary for her job. Let’s help her to choose the right choice based on average salary in data
ds.
PT2022 <- ds[(ds$employment_type == 'PT') & (ds$experience_level == "MI") , ]
result <- aggregate(x = salary_in_usd ~ job_title + company_location, data = PT2022, FUN = mean)
result[order(result$salary_in_usd, decreasing = T), ]Answer :
Based on data, if she wants to get high salary when she takes part time job, I suggest her to take part time job at company in Netherlands as a Data Engineer (based on data in 2020-2022).
- Let’s see the correlation between our experience level and salary in one company (size of company is L) that will we get
ds$ex_level <- sapply(ds$experience_level,
FUN = switch,
"EN" = 1,
"MI" = 2,
"SE" = 3,
"EX" = 4)company_l <- ds[(ds$company_size == "L"), c('salary_in_usd', 'ex_level')]
head(company_l)cor(company_l$ex_level, company_l$salary_in_usd)#> [1] 0.2113103
Answer :
Based on the result, I can know the correlation between experience level and salary (in USD) is weak and positive. It is mean that if your experience level is high, your salary will be high too, and if your experience level is low, your salary will be low too. But, this correlation is weak.
From all results in data analytics, I can make a conclusion that working in the data science domain is still recommended, because so many jobs in the data science domain with different employment types and the average salary is still high in the last 3 years. Of course, United states can be the first choice if you want to choose your company based on country with high salary. The tips for you if you want to get higher salary, you can upgrade your skill in data science and add more your experience.