1 Explanation

1.1 Brief Explanation about the Data

The data we are going to analyze comes from Kaggle.com. It records salaries for different job titles in the data science domain.

1.2 Column Description

The Data Science Job Salaries dataset contains 11 columns:

work_year: The year the salary was paid.
experience_level: The experience level in the job during the year
- SE: Senior Experience
- MI: Mid-Level Experience
- EN: Entry-Level Experience
- EX: Executive Experience
employment_type: The type of employment for the role
- FT: Full-Time
- CT: Contract or Casual
- FL: Freelance
- PT: Part-Time
job_title: The role worked in during the year
salary: The total gross salary amount paid.
salary_currency: The currency of the salary paid as an ISO 4217 currency code.
salary_in_usd: The salary in USD
employee_residence: Employee’s primary country of residence during the work year as an ISO 3166 country code.
remote_ratio: The overall amount of work done remotely
company_location: The country of the employer’s main office or contracting branch
company_size: The median number of people that worked for the company during the year
- S: Small
- M: Medium
- L: Large

2 Input Data

Make sure that the data file is placed in the same folder as our R project. We are going to use the dataset ds_salaries.csv. We read the CSV file into R with the read.csv() function and save the result as the salary object.

salary <- read.csv("ds_salaries.csv")
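As a minimal sketch of the same reading step, the example below first checks that the file exists and then reads it. So that it runs on its own, it writes a tiny made-up CSV to a temporary file; `toy_path`, `toy_salary`, and the two rows are illustrative assumptions, not part of the Kaggle data.

```r
# Hypothetical two-row file standing in for ds_salaries.csv
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("work_year,job_title,salary_in_usd",
    "2023,Data Scientist,100000",
    "2022,Data Engineer,90000"),
  toy_path
)

# Stop early with a clear error if the file is not where we expect it
stopifnot(file.exists(toy_path))

# Keep text columns as character so we can decide later which ones become factors
toy_salary <- read.csv(toy_path, stringsAsFactors = FALSE)
dim(toy_salary)
```

With the real data, the same check would be `file.exists("ds_salaries.csv")` before calling read.csv().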

2.1 Data Inspection

Instead of looking at the whole data, it’s better for us to “peek” at some rows that can represent the overall shape of the data.

To see the first few rows of the data, we use the head() function.

head(salary)

To see the last few rows of the data, we use the tail() function.

tail(salary)

To see the number of rows and columns, we use the dim() function.

dim(salary)

## [1] 3755   11

To see the column names, we use the names() function.

names(salary)

##  [1] "work_year"          "experience_level"   "employment_type"   
##  [4] "job_title"          "salary"             "salary_currency"   
##  [7] "salary_in_usd"      "employee_residence" "remote_ratio"      
## [10] "company_location"   "company_size"

From the inspection above, we can conclude that:

- Our data has 3755 rows and 11 columns.
- The column names are: work_year, experience_level, employment_type, job_title, salary, salary_currency, salary_in_usd, employee_residence, remote_ratio, company_location, company_size.

2.2 Data Cleansing & Coercion

First, we want to check the data type of each column using the str() function.

str(salary)

## 'data.frame':    3755 obs. of  11 variables:
##  $ work_year         : int  2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
##  $ experience_level  : chr  "SE" "MI" "MI" "SE" ...
##  $ employment_type   : chr  "FT" "CT" "CT" "FT" ...
##  $ job_title         : chr  "Principal Data Scientist" "ML Engineer" "ML Engineer" "Data Scientist" ...
##  $ salary            : int  80000 30000 25500 175000 120000 222200 136000 219000 141000 147100 ...
##  $ salary_currency   : chr  "EUR" "USD" "USD" "USD" ...
##  $ salary_in_usd     : int  85847 30000 25500 175000 120000 222200 136000 219000 141000 147100 ...
##  $ employee_residence: chr  "ES" "US" "US" "CA" ...
##  $ remote_ratio      : int  100 100 100 100 100 0 0 0 0 0 ...
##  $ company_location  : chr  "ES" "US" "US" "CA" ...
##  $ company_size      : chr  "L" "S" "S" "M" ...

Some of the columns do not have the correct data type. We need to convert experience_level, employment_type, and company_size to factor because they are categorical variables.

salary$experience_level <- as.factor(salary$experience_level)
salary$employment_type <- as.factor(salary$employment_type)
salary$company_size <- as.factor(salary$company_size)

str(salary)

## 'data.frame':    3755 obs. of  11 variables:
##  $ work_year         : int  2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
##  $ experience_level  : Factor w/ 4 levels "EN","EX","MI",..: 4 3 3 4 4 4 4 4 4 4 ...
##  $ employment_type   : Factor w/ 4 levels "CT","FL","FT",..: 3 1 1 3 3 3 3 3 3 3 ...
##  $ job_title         : chr  "Principal Data Scientist" "ML Engineer" "ML Engineer" "Data Scientist" ...
##  $ salary            : int  80000 30000 25500 175000 120000 222200 136000 219000 141000 147100 ...
##  $ salary_currency   : chr  "EUR" "USD" "USD" "USD" ...
##  $ salary_in_usd     : int  85847 30000 25500 175000 120000 222200 136000 219000 141000 147100 ...
##  $ employee_residence: chr  "ES" "US" "US" "CA" ...
##  $ remote_ratio      : int  100 100 100 100 100 0 0 0 0 0 ...
##  $ company_location  : chr  "ES" "US" "US" "CA" ...
##  $ company_size      : Factor w/ 3 levels "L","M","S": 1 3 3 2 2 1 1 2 2 2 ...
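The three conversions can also be done in one step, and the levels can be given a meaningful order instead of the alphabetical default. The sketch below uses a made-up data frame `toy`. The seniority order EN < MI < SE < EX and the size order S < M < L are assumptions based on the column descriptions, not something the dataset enforces.

```r
# Toy data frame with made-up rows (an assumption, not the Kaggle data)
toy <- data.frame(
  experience_level = c("SE", "MI", "EN", "EX"),
  employment_type  = c("FT", "CT", "FT", "PT"),
  company_size     = c("L", "S", "M", "M"),
  stringsAsFactors = FALSE
)

# Convert several categorical columns at once
cat_cols <- c("experience_level", "employment_type", "company_size")
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

# Optional: ordered factors, so comparisons and sorted output follow seniority and size
toy$experience_level <- factor(toy$experience_level,
                               levels = c("EN", "MI", "SE", "EX"),
                               ordered = TRUE)
toy$company_size <- factor(toy$company_size,
                           levels = c("S", "M", "L"),
                           ordered = TRUE)

levels(toy$experience_level)
```

An ordered factor is only worth it when the order matters, for example when summary tables should list entry level first.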

Now that we have changed the columns into our desired data type, we can check the categories (levels) of each factor column using the unique() function.

unique(salary$experience_level)

## [1] SE MI EN EX
## Levels: EN EX MI SE

unique(salary$employment_type)

## [1] FT CT FL PT
## Levels: CT FL FT PT

unique(salary$company_size)

## [1] L S M
## Levels: L M S

Then, we are going to check whether there are missing values in our data.

anyNA(salary)

## [1] FALSE

colSums(is.na(salary))

##          work_year   experience_level    employment_type          job_title 
##                  0                  0                  0                  0 
##             salary    salary_currency      salary_in_usd employee_residence 
##                  0                  0                  0                  0 
##       remote_ratio   company_location       company_size 
##                  0                  0                  0

We do not have any missing values in our data.
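Because the real data has no missing values, the two checks above only ever return FALSE and zeros. A small sketch on a made-up data frame shows what they report when values are missing; `toy` and `toy_complete` are hypothetical names used only for illustration.

```r
# Toy data frame with one missing value in each column (made-up rows)
toy <- data.frame(
  job_title     = c("Data Analyst", NA, "Data Engineer"),
  salary_in_usd = c(70000, 85000, NA)
)

anyNA(toy)            # is there at least one missing value anywhere?
colSums(is.na(toy))   # number of missing values per column

# One possible treatment: keep only the rows without any missing value
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)
```

Dropping incomplete rows is just one option. Whether it is appropriate depends on how many rows would be lost.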

2.3 Data Explanation

To get the summary of our data, we could use the summary() function.

summary(salary)

##    work_year    experience_level employment_type  job_title        
##  Min.   :2020   EN: 320          CT:  10         Length:3755       
##  1st Qu.:2022   EX: 114          FL:  10         Class :character  
##  Median :2022   MI: 805          FT:3718         Mode  :character  
##  Mean   :2022   SE:2516          PT:  17                           
##  3rd Qu.:2023                                                      
##  Max.   :2023                                                      
##      salary         salary_currency    salary_in_usd    employee_residence
##  Min.   :    6000   Length:3755        Min.   :  5132   Length:3755       
##  1st Qu.:  100000   Class :character   1st Qu.: 95000   Class :character  
##  Median :  138000   Mode  :character   Median :135000   Mode  :character  
##  Mean   :  190696                      Mean   :137570                     
##  3rd Qu.:  180000                      3rd Qu.:175000                     
##  Max.   :30400000                      Max.   :450000                     
##   remote_ratio    company_location   company_size
##  Min.   :  0.00   Length:3755        L: 454      
##  1st Qu.:  0.00   Class :character   M:3153      
##  Median :  0.00   Mode  :character   S: 148      
##  Mean   : 46.27                                  
##  3rd Qu.:100.00                                  
##  Max.   :100.00

From the summary above, we can conclude that:

- The earliest year in which a salary was paid is 2020, and the most recent is 2023.
- The most common experience level is SE with 2516 people, followed by MI, EN, and EX with 805, 320, and 114 people.
- FT is the most common employment type.
- Most of the companies in the data are of size M.
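The counts in the summary are easier to compare as percentages. The sketch below copies the experience-level counts from the summary() output into a named vector and converts them with prop.table(); the names `experience_counts` and `experience_share` are introduced here for illustration.

```r
# Counts copied from the summary() output above
experience_counts <- c(EN = 320, EX = 114, MI = 805, SE = 2516)

# Share of each experience level in percent, rounded to one decimal place
experience_share <- round(100 * prop.table(experience_counts), 1)
experience_share
```

On the full data frame, the same idea would be `prop.table(table(salary$experience_level))`.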

3 Data Exploration

3.1 Highest Salary

We want to know which job title has the highest salary.

library(dplyr)

# Find the maximum USD salary for each job title, then keep the highest one
salary %>%
  group_by(job_title) %>%
  summarise(max_salary = max(salary_in_usd)) %>%
  slice_max(max_salary, n = 1)

From the data above, Research Scientist has the highest salary, at 450,000 USD.
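Since we only need the single highest-paid row, grouping is not strictly required. A base R sketch of the same lookup uses which.max() to pick the row with the largest salary; the data frame `toy` below holds made-up values, not the Kaggle data.

```r
# Made-up rows for illustration
toy <- data.frame(
  job_title     = c("Data Analyst", "Research Scientist", "Data Engineer"),
  salary_in_usd = c(120000, 200000, 150000)
)

# which.max() returns the position of the first maximum
top_row <- toy[which.max(toy$salary_in_usd), ]
top_row
```

Note that which.max() returns only the first maximum, so tied salaries would need a filter such as `toy$salary_in_usd == max(toy$salary_in_usd)` instead.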

3.2 Highest Salary of Data Scientist

We want to know the maximum salary a data scientist could get.

# Keep only Data Scientist rows, then pick the row with the highest USD salary
salary %>%
  filter(job_title == "Data Scientist") %>%
  select(job_title, salary_in_usd) %>%
  slice_max(salary_in_usd, n = 1)

From the data above, the highest salary a data scientist could get is 412,000 USD.

3.3 Highest Salary of Data Analyst

We want to know the maximum salary a data analyst could get.

# Keep only Data Analyst rows, then pick the row with the highest USD salary
salary %>%
  filter(job_title == "Data Analyst") %>%
  select(job_title, salary_in_usd) %>%
  slice_max(salary_in_usd, n = 1)

From the data above, the highest salary a data analyst could get is 430,967 USD.

3.4 Highest Salary of Data Engineer

We want to know the maximum salary a data engineer could get.

# Keep only Data Engineer rows, then pick the row with the highest USD salary
salary %>%
  filter(job_title == "Data Engineer") %>%
  select(job_title, salary_in_usd) %>%
  slice_max(salary_in_usd, n = 1)

From the data above, the highest salary a data engineer could get is 324,000 USD.
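The three lookups in sections 3.2 to 3.4 differ only in the job title, so they could be wrapped in one small helper. The sketch below is one possible way to do this in base R; `max_salary_for()` is a hypothetical function written for this example, and `toy` contains made-up salaries.

```r
# Hypothetical helper: highest USD salary for one job title (NA if the title is absent)
max_salary_for <- function(data, title) {
  rows <- data[data$job_title == title, ]
  if (nrow(rows) == 0) {
    return(NA_real_)
  }
  max(rows$salary_in_usd)
}

# Made-up rows for illustration
toy <- data.frame(
  job_title     = c("Data Scientist", "Data Scientist", "Data Analyst", "Data Engineer"),
  salary_in_usd = c(130000, 110000, 95000, 125000)
)

titles <- c("Data Scientist", "Data Analyst", "Data Engineer")
max_by_title <- sapply(titles, max_salary_for, data = toy)
max_by_title
```

Returning NA for an unknown title avoids the warning that max() gives on an empty vector.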

3.5 Highest Salary of each Experience Level

We want to know the highest salary of each experience level.

salary %>% 
  group_by(experience_level) %>% 
  summarise(max_salary = max(salary_in_usd)) %>% 
  arrange(desc(max_salary))

From the data above, the maximum salary is 450,000 USD at the mid-level, 423,834 USD at the senior level, 416,000 USD at the executive level, and 300,000 USD at the entry level.
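The same group-wise maximum can be sketched in base R with tapply(), which applies a function to one vector split by the values of another. The data frame `toy` below contains made-up salaries for illustration only.

```r
# Made-up rows for illustration
toy <- data.frame(
  experience_level = c("EN", "EN", "MI", "SE", "SE", "EX"),
  salary_in_usd    = c(50000, 65000, 90000, 140000, 120000, 200000)
)

# Maximum salary within each experience level
max_by_level <- tapply(toy$salary_in_usd, toy$experience_level, max)

# Sort from the highest to the lowest maximum, like arrange(desc(max_salary))
sort(max_by_level, decreasing = TRUE)
```

A maximum reflects a single row per group, so one unusually high salary can dominate the comparison between levels.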

3.6 Top 3 job titles with the most EN

We want to know which job titles have the most entry-level (EN) employees.

# Count entry-level rows per job title and keep the three largest counts
salary %>%
  filter(experience_level == "EN") %>%
  group_by(job_title) %>%
  summarise(Freq = n()) %>%
  slice_max(Freq, n = 3)

Data Engineer, Data Analyst, and Data Scientist have the most entry-level employees compared to other job titles.

3.7 Most popular job

We want to know which job is the most popular.

# Count rows per job title and keep the three largest counts
salary %>%
  group_by(job_title) %>%
  summarise(Freq = n()) %>%
  slice_max(Freq, n = 3)

From the table above, Data Engineer is the most popular job, held by 1040 people.
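Counting and ranking categories can also be sketched in base R with table() and sort(). The vector `toy_titles` below is made up for illustration. One detail to keep in mind is that dplyr's top_n() keeps ties and may therefore return more rows than requested, while head() always cuts at exactly three.

```r
# Made-up job titles for illustration
toy_titles <- c("Data Engineer", "Data Engineer", "Data Engineer",
                "Data Scientist", "Data Scientist",
                "Data Analyst", "Data Analyst",
                "ML Engineer")

# Frequency of each title, sorted from most to least common
title_counts <- sort(table(toy_titles), decreasing = TRUE)

# The three most frequent titles (ties beyond the third position are cut off)
head(title_counts, 3)
```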

3.8 Most frequent job title in M-sized companies

We want to know which job titles are commonly found in M-sized companies.

# Count rows per job title within M-sized companies and keep the three largest counts
salary %>%
  filter(company_size == "M") %>%
  group_by(job_title) %>%
  summarise(Freq = n()) %>%
  slice_max(Freq, n = 3)

Data Engineer is the most common job title in M-sized companies.
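To compare all company sizes at once instead of filtering one size at a time, a two-way frequency table is one option. The sketch below uses a made-up data frame `toy`; `size_by_title` and `most_common_in_m` are names introduced for this example.

```r
# Made-up rows for illustration
toy <- data.frame(
  job_title    = c("Data Engineer", "Data Engineer", "Data Scientist",
                   "Data Analyst", "Data Engineer"),
  company_size = c("M", "M", "M", "L", "S")
)

# Rows are job titles, columns are company sizes
size_by_title <- table(toy$job_title, toy$company_size)
size_by_title

# Job title with the highest count in the "M" column
most_common_in_m <- names(which.max(size_by_title[, "M"]))
most_common_in_m
```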

4 Conclusion

Most people in the data science field are at the senior level. However, Data Engineers, Data Analysts and Data Scientists have more entry-level employees than any other job title. This suggests that many newcomers start out as Data Engineers, Data Analysts or Data Scientists.

The data above also shows that Data Engineer, Data Analyst and Data Scientist are the top 3 most popular jobs between 2020 and 2023. These 3 job titles are also commonly found in M-sized companies.

The highest-paid job is Research Scientist, with a salary of 450,000 USD. The highest salaries for a Data Engineer, Data Analyst, and Data Scientist are 324,000 USD, 430,967 USD, and 412,000 USD, respectively.

Exploratory Data Analysis - Data Science Salaries 2023 Dataset

Michelle Intan Handa

2023-06-11
