The data we are going to analyze comes from Kaggle.com. It records salaries for different job roles in the data science domain.
The Data Science Job Salaries dataset contains 11 columns:
- work_year: The year the salary was paid.
- experience_level: The experience level in the job during the year.
- employment_type: The type of employment for the role.
- job_title: The role worked in during the year.
- salary: The total gross salary amount paid.
- salary_currency: The currency of the salary paid, as an ISO 4217 currency code.
- salary_in_usd: The salary in USD.
- employee_residence: The employee’s primary country of residence during the work year, as an ISO 3166 country code.
- remote_ratio: The overall amount of work done remotely.
- company_location: The country of the employer’s main office or contracting branch.
- company_size: The median number of people that worked for the company during the year.
Make sure that the data is placed in the same folder as our R project. We are going to use the dataset ds_salaries.csv.
Use the read.csv() function to read the CSV file into R, then save the result as the salary object.
salary <- read.csv("ds_salaries.csv")
Instead of looking at the whole data set, it is better to “peek” at a few rows that show its overall shape.
To see the first few rows of the data, we use the head() function.
head(salary)
To see the last few rows of the data, we use the tail() function.
tail(salary)
To check the dimensions and the column names, we use the dim() and names() functions.
dim(salary)
## [1] 3755   11
names(salary)
##  [1] "work_year"          "experience_level"   "employment_type"
##  [4] "job_title"          "salary"             "salary_currency"
##  [7] "salary_in_usd"      "employee_residence" "remote_ratio"
## [10] "company_location"   "company_size"
From the inspection above, we can conclude that:
- Our data has 3755 rows and 11 columns.
- The column names are: work_year, experience_level, employment_type, job_title, salary, salary_currency, salary_in_usd, employee_residence, remote_ratio, company_location, company_size.
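The read-and-inspect steps above can be tried without the Kaggle file. The following is a minimal sketch that assumes a made-up three-row data frame (`toy`, `toy_path`, and `toy_salary` are hypothetical names, not part of the dataset) written to a temporary CSV and read back with the same functions.

```r
# Hypothetical toy data, written to a temporary CSV so the example is self-contained.
toy <- data.frame(
  work_year     = c(2022L, 2023L, 2023L),
  job_title     = c("Data Engineer", "Data Analyst", "Data Scientist"),
  salary_in_usd = c(120000L, 90000L, 150000L)
)
toy_path <- tempfile(fileext = ".csv")
write.csv(toy, toy_path, row.names = FALSE)

# Read the file back, the same way ds_salaries.csv is read in the text.
toy_salary <- read.csv(toy_path)

head(toy_salary, 2)   # first two rows
tail(toy_salary, 1)   # last row
dim(toy_salary)       # number of rows, then number of columns
names(toy_salary)     # column names
```

Passing `row.names = FALSE` to write.csv() keeps the row index out of the file, so the column count stays the same after reading it back.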
First, we check the data type of each column using the str() function.
str(salary)
## 'data.frame': 3755 obs. of 11 variables:
## $ work_year : int 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
## $ experience_level : chr "SE" "MI" "MI" "SE" ...
## $ employment_type : chr "FT" "CT" "CT" "FT" ...
## $ job_title : chr "Principal Data Scientist" "ML Engineer" "ML Engineer" "Data Scientist" ...
## $ salary : int 80000 30000 25500 175000 120000 222200 136000 219000 141000 147100 ...
## $ salary_currency : chr "EUR" "USD" "USD" "USD" ...
## $ salary_in_usd : int 85847 30000 25500 175000 120000 222200 136000 219000 141000 147100 ...
## $ employee_residence: chr "ES" "US" "US" "CA" ...
## $ remote_ratio : int 100 100 100 100 100 0 0 0 0 0 ...
## $ company_location : chr "ES" "US" "US" "CA" ...
## $ company_size : chr "L" "S" "S" "M" ...
Some of the columns do not have the correct data type. We need to convert experience_level, employment_type, and company_size to factor because they are categorical variables.
salary$experience_level <- as.factor(salary$experience_level)
salary$employment_type <- as.factor(salary$employment_type)
salary$company_size <- as.factor(salary$company_size)
str(salary)
## 'data.frame': 3755 obs. of 11 variables:
## $ work_year : int 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
## $ experience_level : Factor w/ 4 levels "EN","EX","MI",..: 4 3 3 4 4 4 4 4 4 4 ...
## $ employment_type : Factor w/ 4 levels "CT","FL","FT",..: 3 1 1 3 3 3 3 3 3 3 ...
## $ job_title : chr "Principal Data Scientist" "ML Engineer" "ML Engineer" "Data Scientist" ...
## $ salary : int 80000 30000 25500 175000 120000 222200 136000 219000 141000 147100 ...
## $ salary_currency : chr "EUR" "USD" "USD" "USD" ...
## $ salary_in_usd : int 85847 30000 25500 175000 120000 222200 136000 219000 141000 147100 ...
## $ employee_residence: chr "ES" "US" "US" "CA" ...
## $ remote_ratio : int 100 100 100 100 100 0 0 0 0 0 ...
## $ company_location : chr "ES" "US" "US" "CA" ...
## $ company_size : Factor w/ 3 levels "L","M","S": 1 3 3 2 2 1 1 2 2 2 ...
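When several columns need the same conversion, they can be converted in one step. The sketch below uses a made-up data frame (`toy` and `cat_cols` are hypothetical names). The ordered factor at the end is an optional variation rather than something the analysis requires, and the seniority order EN < MI < SE < EX is an assumption.

```r
# Hypothetical toy data with the same three categorical columns.
toy <- data.frame(
  experience_level = c("SE", "MI", "EN", "EX"),
  employment_type  = c("FT", "CT", "FT", "PT"),
  company_size     = c("L", "S", "M", "M"),
  stringsAsFactors = FALSE
)

# Convert several character columns to factors in one step.
cat_cols <- c("experience_level", "employment_type", "company_size")
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

# Optional: an ordered factor with an explicit level order
# (assumed seniority order: EN < MI < SE < EX).
toy$experience_level <- factor(
  toy$experience_level,
  levels  = c("EN", "MI", "SE", "EX"),
  ordered = TRUE
)

str(toy)
```

With as.factor() the levels are sorted alphabetically, which is why the output above lists EN, EX, MI, SE. An explicit `levels` argument is only needed when the order matters, for example in comparisons or plots.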
Now that we have changed the columns to the desired data type, we can check the categories (levels) of each factor column using the unique() function.
unique(salary$experience_level)
## [1] SE MI EN EX
## Levels: EN EX MI SE
unique(salary$employment_type)
## [1] FT CT FL PT
## Levels: CT FL FT PT
unique(salary$company_size)
## [1] L S M
## Levels: L M S
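For a factor, unique() returns the values that actually occur, while levels() returns every declared level. The two agree in the output above, but they can differ. A small sketch on a made-up factor (`employment` is a hypothetical name) shows the difference.

```r
# Hypothetical toy factor: "FL" and "PT" are declared as levels but never observed.
employment <- factor(c("FT", "CT", "FT"), levels = c("CT", "FL", "FT", "PT"))

unique(employment)   # values that occur, in order of first appearance
levels(employment)   # every declared level, including unused ones
table(employment)    # count per level; unused levels show 0
```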
Then, we check whether there are any missing values in our data.
anyNA(salary)
## [1] FALSE
colSums(is.na(salary))
##          work_year   experience_level    employment_type          job_title
##                  0                  0                  0                  0
##             salary    salary_currency      salary_in_usd employee_residence
##                  0                  0                  0                  0
##       remote_ratio   company_location       company_size
##                  0                  0                  0
We do not have any missing values in our data.
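Because the salary data has no missing values, the checks above only ever return FALSE and zeros. To show what they report when values are missing, here is a sketch on a made-up data frame with two NA entries (`toy`, `na_per_column`, and `complete_rows` are hypothetical names).

```r
# Hypothetical toy data with one missing value in each column.
toy <- data.frame(
  job_title     = c("Data Engineer", NA, "Data Analyst"),
  salary_in_usd = c(120000, 90000, NA)
)

anyNA(toy)                             # is there any NA at all?
na_per_column <- colSums(is.na(toy))   # number of NA values per column
na_per_column

# Keep only the rows that have no missing value in any column.
complete_rows <- toy[complete.cases(toy), ]
complete_rows
```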
To get a summary of our data, we can use the summary() function.
summary(salary)
##    work_year    experience_level employment_type  job_title
##  Min.   :2020   EN: 320          CT:  10          Length:3755
##  1st Qu.:2022   EX: 114          FL:  10          Class :character
##  Median :2022   MI: 805          FT:3718          Mode  :character
##  Mean   :2022   SE:2516          PT:  17
##  3rd Qu.:2023
##  Max.   :2023
##      salary         salary_currency    salary_in_usd    employee_residence
##  Min.   :    6000   Length:3755        Min.   :  5132   Length:3755
##  1st Qu.:  100000   Class :character   1st Qu.: 95000   Class :character
##  Median :  138000   Mode  :character   Median :135000   Mode  :character
##  Mean   :  190696                      Mean   :137570
##  3rd Qu.:  180000                      3rd Qu.:175000
##  Max.   :30400000                      Max.   :450000
##   remote_ratio    company_location   company_size
##  Min.   :  0.00   Length:3755        L: 454
##  1st Qu.:  0.00   Class :character   M:3153
##  Median :  0.00   Mode  :character   S: 148
##  Mean   : 46.27
##  3rd Qu.:100.00
##  Max.   :100.00
From the summary above, we can conclude that:
- The earliest year in which a salary was paid is 2020, and the most recent is 2023.
- SE is the most common experience level with 2516 people, followed by MI, EN, and EX with 805, 320, and 114 people.
- FT is by far the most common employment type, with 3718 of the 3755 records.
- Most records come from M-sized companies (3153 records).
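The counts that summary() prints for factor columns can also be sorted and turned into percentages directly. A minimal sketch on a made-up factor follows (`experience` and `counts` are hypothetical names).

```r
# Hypothetical toy factor of experience levels.
experience <- factor(c("SE", "SE", "MI", "EN", "SE", "EX"))

# Count each level and sort from most to least frequent.
counts <- sort(table(experience), decreasing = TRUE)
counts

# The same counts expressed as percentages of all records.
round(100 * prop.table(counts), 1)
```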
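We want to know which job title has the highest salary.
library(dplyr)
salary %>%
  group_by(job_title, salary_currency) %>%
  # name the summary column so it can be referred to later
  summarise(max_salary = max(salary_in_usd)) %>%
  ungroup() %>%
  # keep the row with the largest max_salary (top_n() is superseded)
  slice_max(max_salary, n = 1)
From the data above, Research Scientist has the highest salary, which is 450,000 USD.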
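The same question can be answered in base R by locating the row with the largest salary_in_usd. This is a sketch on made-up numbers (`toy`, `top_row`, and `top_rows` are hypothetical names), not the values from the dataset.

```r
# Hypothetical toy data; the salaries are made up.
toy <- data.frame(
  job_title     = c("Data Engineer", "Research Scientist", "Data Analyst"),
  salary_in_usd = c(200000, 300000, 150000)
)

# which.max() returns the position of the first maximum.
top_row <- toy[which.max(toy$salary_in_usd), ]
top_row

# To keep every row that ties for the maximum, compare against max() instead.
top_rows <- toy[toy$salary_in_usd == max(toy$salary_in_usd), ]
top_rows
```

which.max() returns only the first maximum, so the comparison against max() is the safer choice when ties are possible.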
We want to know the maximum salary a data scientist could get.
salary %>%
  filter(job_title == "Data Scientist") %>%
  select(job_title, salary_in_usd) %>%
  # keep the row with the highest salary
  slice_max(salary_in_usd, n = 1)
From the data above, the highest salary a data scientist could get is 412,000 USD.
We want to know the maximum salary a data analyst could get.
salary %>%
  filter(job_title == "Data Analyst") %>%
  select(job_title, salary_in_usd) %>%
  # keep the row with the highest salary
  slice_max(salary_in_usd, n = 1)
From the data above, the highest salary a data analyst could get is 430,967 USD.
We want to know the maximum salary a data engineer could get.
salary %>%
  filter(job_title == "Data Engineer") %>%
  select(job_title, salary_in_usd) %>%
  # keep the row with the highest salary
  slice_max(salary_in_usd, n = 1)
From the data above, the highest salary a data engineer could get is 324,000 USD.
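The three per-title queries follow the same pattern, so they could be wrapped in a small helper. The function below is a hypothetical sketch (`max_salary_for` is not part of the original analysis), written in base R and run on made-up numbers.

```r
# Hypothetical helper: highest salary_in_usd recorded for one job title.
max_salary_for <- function(data, title) {
  rows <- data[data$job_title == title, ]
  if (nrow(rows) == 0) {
    return(NA_real_)   # the title does not occur in the data
  }
  max(rows$salary_in_usd)
}

# Hypothetical toy data; the salaries are made up.
toy <- data.frame(
  job_title     = c("Data Scientist", "Data Scientist", "Data Analyst", "Data Engineer"),
  salary_in_usd = c(150000, 180000, 90000, 130000)
)

# Apply the helper to several titles at once; the result is a named vector.
titles <- c("Data Scientist", "Data Analyst", "Data Engineer")
sapply(titles, max_salary_for, data = toy)
```

Returning NA for an unknown title avoids the warning that max() gives on an empty vector.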
We want to know the highest salary for each experience level.
salary %>%
  group_by(experience_level) %>%
  summarise(max_salary = max(salary_in_usd)) %>%
  arrange(desc(max_salary))
From the data above, we can see that the maximum salary is 300,000 USD at entry level, 416,000 USD at executive level, 450,000 USD at mid level, and 423,834 USD at senior level.
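A grouped maximum like the one above can also be computed in base R with aggregate(). The sketch below runs on made-up numbers (`toy` and `max_by_level` are hypothetical names).

```r
# Hypothetical toy data; the salaries are made up.
toy <- data.frame(
  experience_level = c("EN", "EN", "MI", "SE", "SE", "EX"),
  salary_in_usd    = c(60000, 80000, 110000, 150000, 170000, 200000)
)

# Maximum salary per experience level.
max_by_level <- aggregate(salary_in_usd ~ experience_level, data = toy, FUN = max)

# Sort from the highest to the lowest maximum.
max_by_level <- max_by_level[order(-max_by_level$salary_in_usd), ]
max_by_level
```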
We want to know which job titles have the most entry-level (EN) employees.
salary %>%
  filter(experience_level == "EN") %>%
  group_by(job_title) %>%
  summarize(Freq = n()) %>%
  arrange(desc(Freq)) %>%
  # keep the three most frequent titles
  slice_max(Freq, n = 3)
Data Engineer, Data Analyst, and Data Scientist have the most entry-level employees compared to other job titles.
We want to know which job is the most popular.
salary %>%
  group_by(job_title) %>%
  summarize(Freq = n()) %>%
  arrange(desc(Freq)) %>%
  # keep the three most frequent titles
  slice_max(Freq, n = 3)
From the table above, Data Engineer is the most popular job, with 1040 records.
We want to know which job titles are most commonly found in M-sized companies.
salary %>%
  filter(company_size == "M") %>%
  group_by(job_title) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq)) %>%
  # keep the three most frequent titles
  slice_max(Freq, n = 3)
Data Engineer is the most common job title in M-sized companies.
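The filter-count-sort pattern used in the last three queries has a compact base R equivalent built on table(). The sketch below runs on made-up rows (`toy`, `medium`, and `freq` are hypothetical names).

```r
# Hypothetical toy data; the rows are made up.
toy <- data.frame(
  job_title = c("Data Engineer", "Data Engineer", "Data Engineer", "Data Engineer",
                "Data Scientist", "Data Scientist", "Data Analyst", "ML Engineer"),
  company_size = c("M", "M", "M", "L", "M", "M", "M", "S")
)

# Keep only the records from M-sized companies.
medium <- toy[toy$company_size == "M", ]

# Count each job title and sort from most to least frequent.
freq <- sort(table(medium$job_title), decreasing = TRUE)

# Show the two most frequent titles.
head(freq, 2)
```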
Most people in the data science field are at the senior level. However, Data Engineer, Data Analyst, and Data Scientist have more entry-level employees than any other job title. This suggests that many newcomers start out as Data Engineers, Data Analysts, or Data Scientists.
The data above also shows that Data Engineer, Data Analyst, and Data Scientist are the three most popular jobs over the period covered by the data (2020 to 2023). These three job titles are also the most common in M-sized companies.
The highest-paid job is Research Scientist, with a salary of 450,000 USD. The highest salaries for a Data Engineer, Data Analyst, and Data Scientist are 324,000 USD, 430,967 USD, and 412,000 USD, respectively.