This project aims to gather data that answers the question, “Which are the most valued data science skills?”
We obtained the Data Science Job Salaries dataset from Kaggle (https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries?resource=download), which contains salary information for jobs in the data science domain. The dataset includes work year, company size, job title, salary in USD, employee residence, and company location.
We focused on job titles and their salaries in USD, together with the work year, company size, and company location.
We stored the data in MySQL Workbench and an Azure database, then queried the company and job tables from R and joined them.
## [1] "chess_tournament" "clean_company_ids" "clean_job_dets"
## [4] "company" "job" "movie_ratings"
job <- dbGetQuery(mydb,'select * from job')
company <- dbGetQuery(mydb,'select * from company')
total_df <- left_join(company, job, by='cid')
head(total_df)
## company_id employee_residence remote_ratio company_location company_size
## 1 1 DE 0 DE L
## 2 1 DE 0 DE L
## 3 1 DE 0 DE L
## 4 2 JP 0 JP S
## 5 2 JP 0 JP S
## 6 3 GB 50 GB M
## cid job_title_id work_year experience_level employment_type
## 1 DE_0_DE_L 1 2020 MI FT
## 2 DE_0_DE_L 258 2021 EX FT
## 3 DE_0_DE_L 271 2021 EN FT
## 4 JP_0_JP_S 2 2020 SE FT
## 5 JP_0_JP_S 151 2021 SE FT
## 6 GB_50_GB_M 3 2020 SE FT
## job_title salary salary_currency salary_in_usd
## 1 Data Scientist 70000 EUR 79833
## 2 Director of Data Science 120000 EUR 141846
## 3 Data Science Consultant 65000 EUR 76833
## 4 Machine Learning Scientist 260000 USD 260000
## 5 Director of Data Science 168000 USD 168000
## 6 Big Data Engineer 85000 GBP 109024
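Because one company record can match several job records, a left join on `cid` repeats the company columns once for each matching job, which is why company 1 appears three times in the output above. Here is a minimal sketch of that behavior on made-up data; `company_toy` and `job_toy` are illustrative stand-ins, not the real tables.

```r
library(dplyr)

# Toy stand-ins for the `company` and `job` tables (all values are made up)
company_toy <- data.frame(
  cid = c("DE_0_DE_L", "JP_0_JP_S", "FR_0_FR_M"),
  company_size = c("L", "S", "M")
)
job_toy <- data.frame(
  cid = c("DE_0_DE_L", "DE_0_DE_L", "JP_0_JP_S"),
  job_title = c("Data Scientist", "Data Analyst", "Data Engineer")
)

# Every company row is kept; a company with no matching job gets NA job columns
joined_toy <- left_join(company_toy, job_toy, by = "cid")
print(joined_toy)
```

If companies without any job rows should be dropped instead, `inner_join()` with the same arguments would be the alternative.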
We checked for missing and duplicate values. As shown below in the results, there were no missing or duplicate values.
## 'data.frame': 565 obs. of 14 variables:
## $ company_id : int 1 1 1 2 2 3 4 5 5 5 ...
## $ employee_residence: chr "DE" "DE" "DE" "JP" ...
## $ remote_ratio : int 0 0 0 0 0 50 0 50 50 50 ...
## $ company_location : chr "DE" "DE" "DE" "JP" ...
## $ company_size : chr "L" "L" "L" "S" ...
## $ cid : chr "DE_0_DE_L" "DE_0_DE_L" "DE_0_DE_L" "JP_0_JP_S" ...
## $ job_title_id : int 1 258 271 2 151 3 4 5 38 59 ...
## $ work_year : int 2020 2021 2021 2020 2021 2020 2020 2020 2020 2020 ...
## $ experience_level : chr "MI" "EX" "EN" "SE" ...
## $ employment_type : chr "FT" "FT" "FT" "FT" ...
## $ job_title : chr "Data Scientist" "Director of Data Science" "Data Science Consultant" "Machine Learning Scientist" ...
## $ salary : int 70000 120000 65000 260000 168000 85000 20000 150000 250000 120000 ...
## $ salary_currency : chr "EUR" "EUR" "EUR" "USD" ...
## $ salary_in_usd : int 79833 141846 76833 260000 168000 109024 20000 150000 250000 120000 ...
## company_id employee_residence remote_ratio company_location
## Min. : 1.00 Length:565 Min. : 0.00 Length:565
## 1st Qu.: 21.00 Class :character 1st Qu.: 50.00 Class :character
## Median : 42.00 Mode :character Median :100.00 Mode :character
## Mean : 52.98 Mean : 69.91
## 3rd Qu.: 71.00 3rd Qu.:100.00
## Max. :161.00 Max. :100.00
## company_size cid job_title_id work_year
## Length:565 Length:565 Min. : 1 Min. :2020
## Class :character Class :character 1st Qu.:142 1st Qu.:2021
## Mode :character Mode :character Median :283 Median :2021
## Mean :283 Mean :2021
## 3rd Qu.:424 3rd Qu.:2022
## Max. :565 Max. :2022
## experience_level employment_type job_title salary
## Length:565 Length:565 Length:565 Min. : 4000
## Class :character Class :character Class :character 1st Qu.: 67000
## Mode :character Mode :character Mode :character Median : 110925
## Mean : 338116
## 3rd Qu.: 165000
## Max. :30400000
## salary_currency salary_in_usd
## Length:565 Min. : 2859
## Class :character 1st Qu.: 60757
## Mode :character Median :100000
## Mean :110610
## 3rd Qu.:150000
## Max. :600000
## [1] 0
## [1] "company_id" "employee_residence" "remote_ratio"
## [4] "company_location" "company_size" "cid"
## [7] "job_title_id" "work_year" "experience_level"
## [10] "employment_type" "job_title" "salary"
## [13] "salary_currency" "salary_in_usd"
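The code behind the checks above is not shown in the text. One way to count missing values is sketched below on a made-up data frame (`toy` and its values are illustrative only).

```r
# Toy data frame with one missing salary (all values are made up)
toy <- data.frame(
  job_title = c("Data Scientist", "Data Analyst", "Data Engineer"),
  salary_in_usd = c(100000, NA, 120000)
)

# Total number of missing cells in the whole data frame
total_missing <- sum(is.na(toy))

# Missing cells per column, which helps locate where the gaps are
missing_by_column <- colSums(is.na(toy))

print(total_missing)
print(missing_by_column)
```

Applied to `total_df`, a result of 0 from `sum(is.na(total_df))` would be consistent with the `[1] 0` printed above.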
To check for duplicate rows, I used the duplicated() function. Subsetting total_df with its result returns a data frame containing any duplicated rows, and wrapping it in sum() counts them.
num_duplicates <- sum(duplicated(total_df))
# Check for duplicates
duplicates <- total_df[duplicated(total_df), ]
print(duplicates)
## [1] company_id employee_residence remote_ratio company_location
## [5] company_size cid job_title_id work_year
## [9] experience_level employment_type job_title salary
## [13] salary_currency salary_in_usd
## <0 rows> (or 0-length row.names)
There are no duplicate values in the dataset.
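Note that duplicated() flags only the second and later copies of a row, never the first. A small sketch on made-up data (the `toy` data frame is illustrative only) shows what a non-empty result would look like and how the extra copies could be dropped.

```r
# Toy data frame in which the second row repeats the first (values are made up)
toy <- data.frame(
  cid = c("A", "A", "B"),
  salary_in_usd = c(50000, 50000, 70000)
)

# TRUE marks a row that has already appeared earlier in the data frame
dup_flags <- duplicated(toy)

# Count the repeated rows, then keep only the first copy of each row
num_duplicates <- sum(dup_flags)
deduplicated <- toy[!dup_flags, ]

print(dup_flags)
print(deduplicated)
```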
I narrowed the data to three job titles (Data Scientist, Data Analyst, and Machine Learning Engineer) as a proxy for the skills most relevant to data science.
# Analysis
# Extract relevant skills (based on job titles)
data_science_roles <- c("Data Scientist", "Data Analyst", "Machine Learning Engineer")
data_science_data <- total_df[total_df$job_title %in% data_science_roles, ]
head(data_science_data)
## company_id employee_residence remote_ratio company_location company_size
## 1 1 DE 0 DE L
## 8 5 US 50 US L
## 9 5 US 50 US L
## 10 5 US 50 US L
## 12 5 US 50 US L
## 13 5 US 50 US L
## cid job_title_id work_year experience_level employment_type
## 1 DE_0_DE_L 1 2020 MI FT
## 8 US_50_US_L 5 2020 SE FT
## 9 US_50_US_L 38 2020 EN FT
## 10 US_50_US_L 59 2020 SE FT
## 12 US_50_US_L 196 2021 MI FT
## 13 US_50_US_L 250 2021 MI FT
## job_title salary salary_currency salary_in_usd
## 1 Data Scientist 70000 EUR 79833
## 8 Machine Learning Engineer 150000 USD 150000
## 9 Machine Learning Engineer 250000 USD 250000
## 10 Data Scientist 120000 USD 120000
## 12 Data Scientist 147000 USD 147000
## 13 Data Scientist 115000 USD 115000
I filtered the data to these roles and used the dplyr library to compute summary statistics (median, mean, minimum, and maximum salary) by job title. Of the three roles, Data Scientist had the highest mean and median salary.
# Filter data for relevant job titles
data_science_roles <- c("Data Scientist", "Data Analyst", "Machine Learning Engineer")
data_science_data <- total_df[total_df$job_title %in% data_science_roles, ]
# Summary statistics of salary by job title
summary_stats <- data_science_data %>%
group_by(job_title) %>%
summarise(
median_salary = median(salary_in_usd),
mean_salary = mean(salary_in_usd),
min_salary = min(salary_in_usd),
max_salary = max(salary_in_usd)
)
print(summary_stats)
## # A tibble: 3 × 5
## job_title median_salary mean_salary min_salary max_salary
## <chr> <dbl> <dbl> <int> <int>
## 1 Data Analyst 90000 90090. 6072 200000
## 2 Data Scientist 100000 103336. 2859 412000
## 3 Machine Learning Engineer 87425 101165. 20000 250000
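Group sizes and skew affect how far these means can be trusted, so it may help to report a row count next to the mean and median. The sketch below uses made-up salaries (`toy` is illustrative, not the real data), where a single large value pulls the mean well above the median.

```r
library(dplyr)

# Made-up salaries; the 250000 value skews the Data Scientist mean upward
toy <- data.frame(
  job_title = c("Data Analyst", "Data Analyst", "Data Scientist",
                "Data Scientist", "Data Scientist"),
  salary_in_usd = c(80000, 100000, 90000, 110000, 250000)
)

# Row count, median, and mean per job title
toy_stats <- toy %>%
  group_by(job_title) %>%
  summarise(
    n = n(),
    median_salary = median(salary_in_usd),
    mean_salary = mean(salary_in_usd),
    .groups = "drop"
  )

print(toy_stats)
```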
We created visualizations to display the summary statistics of salary by job title using ggplot2. Below is a boxplot where each box represents the distribution of salaries for each job title. It provides a visual comparison of the median, quartiles, and potential outliers for each job title’s salary distribution.
# Boxplot visualization with color and removed scientific notation
boxplot <- ggplot(data_science_data, aes(x = job_title, y = salary_in_usd, fill = job_title)) +
geom_boxplot() +
scale_y_continuous(labels = scales::comma) + # Remove scientific notation
labs(title = "Salary Distribution by Job Title",
x = "Job Title",
y = "Salary (USD)") +
theme_minimal()
print(boxplot)
We then looked up the highest and lowest salaries in the data_science_data data frame using the which.max() and which.min() functions. Both extremes belong to the “Data Scientist” title: the highest salary is $412,000 (work year 2020, experience level SE), and the lowest is $2,859 (work year 2021, experience level MI).
# Find the row index of the highest salary
highest_salary_index <- which.max(data_science_data$salary_in_usd)
# Get the corresponding row with the highest salary
highest_salary_row <- data_science_data[highest_salary_index, ]
print(highest_salary_row)
## company_id employee_residence remote_ratio company_location company_size
## 34 6 US 100 US L
## cid job_title_id work_year experience_level employment_type
## 34 US_100_US_L 64 2020 SE FT
## job_title salary salary_currency salary_in_usd
## 34 Data Scientist 412000 USD 412000
# Find the row index of the lowest salary
lowest_salary_index <- which.min(data_science_data$salary_in_usd)
# Get the corresponding row with the lowest salary
lowest_salary_row <- data_science_data[lowest_salary_index, ]
print(lowest_salary_row)
## company_id employee_residence remote_ratio company_location company_size
## 166 24 MX 0 MX S
## cid job_title_id work_year experience_level employment_type
## 166 MX_0_MX_S 177 2021 MI FT
## job_title salary salary_currency salary_in_usd
## 166 Data Scientist 58000 MXN 2859
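which.max() and which.min() return only the first matching row, so a tie for the top or bottom salary would go unnoticed. As an alternative sketch, dplyr's slice_max() and slice_min() keep tied rows by default. The `toy` data frame and its values below are made up for illustration.

```r
library(dplyr)

# Made-up salaries for three titles
toy <- data.frame(
  job_title = c("Data Analyst", "Data Scientist", "Machine Learning Engineer"),
  salary_in_usd = c(60000, 150000, 30000)
)

# Row or rows with the largest salary (ties are kept by default)
highest <- toy %>% slice_max(salary_in_usd, n = 1)

# Row or rows with the smallest salary
lowest <- toy %>% slice_min(salary_in_usd, n = 1)

print(highest)
print(lowest)
```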
The bar plot below shows salaries by work year (2020 to 2022) and job title. In 2020 and 2022, the Data Scientist title shows higher salaries than the Data Analyst and Machine Learning Engineer titles. In 2021, the Data Analyst and Machine Learning Engineer bars appear at about the same level.
# Create a bar plot for each job title
ggplot(data_science_data, aes(x = factor(work_year), y = salary_in_usd, fill = job_title)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Salary Distribution by Work Year and Job Title",
x = "Work Year",
y = "Salary (USD)",
fill = "Job Title") +
scale_y_continuous(labels = scales::comma_format()) +
theme_minimal()
The bar plot below shows that Data Scientist roles appear most frequently.
# Create a bar plot of job titles with count values
ggplot(data_science_data, aes(x = job_title)) +
geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.1) + # Add count values on top of bars
labs(title = "Distribution of Data Science Job Titles",
x = "Job Title",
y = "Frequency") +
theme(axis.text.x = element_text())
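The salary-by-year bar plot earlier in this section uses `geom_bar(stat = "identity", position = "dodge")` on row-level data. Rows that share a year and a title are then drawn on top of one another, so the visible bar height is the largest salary in the group rather than a typical one. One possible alternative, sketched here on made-up data, is to summarise first and plot one bar per group. The choice of the median and the `toy` data frame are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Made-up row-level salaries
toy <- data.frame(
  work_year = c(2020, 2020, 2020, 2021, 2021),
  job_title = c("Data Analyst", "Data Analyst", "Data Scientist",
                "Data Analyst", "Data Scientist"),
  salary_in_usd = c(50000, 70000, 120000, 90000, 100000)
)

# Collapse to one row per year and title before plotting
median_by_year <- toy %>%
  group_by(work_year, job_title) %>%
  summarise(median_salary = median(salary_in_usd), .groups = "drop")

# geom_col() draws bar heights directly from the summarised values
p <- ggplot(median_by_year,
            aes(x = factor(work_year), y = median_salary, fill = job_title)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Median Salary by Work Year and Job Title",
       x = "Work Year",
       y = "Median Salary (USD)",
       fill = "Job Title") +
  theme_minimal()

print(p)
```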
The scatter plot shows many more points for the US than for other locations, which suggests that salary data is unevenly represented across geographic regions, potentially reflecting underlying socioeconomic and industrial factors.
# Keep only rows where the salary currency is USD
df_usd <- total_df[total_df$salary_currency == "USD", ]
# Scatter plot: Salary (USD) vs. Location (with reversed axes)
ggplot(df_usd, aes(x = salary_in_usd, y = company_location)) +
geom_point(alpha = 0.5) +
labs(title = "Salary Distribution by Location (USD)",
x = "Salary (USD)",
y = "Location") +
scale_x_continuous(labels = scales::comma) + # Remove scientific notation
theme(axis.text.y = element_text(hjust = 1)) # Adjust text alignment
In the scatter plot of salary by job title, the points are densest for three roles: Data Analyst, Data Scientist, and Data Engineer. These titles are therefore more heavily represented in the dataset than other data science job titles, which suggests that they are common roles in the industry.
The spread of points within each job title shows that compensation varies widely, which may reflect factors such as experience, education, location, and specific industry.
The clustering of points around these job titles could also indicate higher demand or competition in the job market for these roles. Companies may be offering a wide range of salaries to attract talent in these fields.
It might also reflect industry trends in which data analysis and data science roles are in high demand, whether because of advances in technology, the increasing importance of data-driven decision-making, or emerging sectors such as artificial intelligence and machine learning.
# Keep only rows where the salary currency is USD
df_usd <- total_df[total_df$salary_currency == "USD", ]
# Scatter plot: Salary (USD) vs. Job Title (with reversed axes)
ggplot(df_usd, aes(x = salary_in_usd, y = reorder(job_title, desc(salary_in_usd)))) +
geom_point(alpha = 0.5) +
labs(title = "Salary Distribution by Job Title",
x = "Salary (USD)",
y = "Job Title") +
scale_x_continuous(labels = scales::comma_format()) # Remove scientific notation
We also calculated the average salary by work year, experience level, company size, and employment type for each job title, and visualized the results as heat maps.
Each heat map plots job title against one of these variables, and each cell represents the average salary for that combination.
The color intensity of each cell represents the average salary, with darker cells indicating higher average salaries.
# Calculate average salary by work year and job title
average_salary <- aggregate(salary_in_usd ~ work_year + job_title, data = df_usd, FUN = mean)
print(average_salary)
## work_year job_title salary_in_usd
## 1 2021 AI Scientist 26333.33
## 2 2022 AI Scientist 160000.00
## 3 2022 Analytics Engineer 175000.00
## 4 2022 Applied Data Scientist 238000.00
## 5 2021 Applied Machine Learning Scientist 230700.00
## 6 2022 Applied Machine Learning Scientist 75000.00
## 7 2020 BI Data Analyst 98000.00
## 8 2021 BI Data Analyst 78568.00
## 9 2020 Big Data Engineer 70000.00
## 10 2021 Big Data Engineer 39000.00
## 11 2020 Business Data Analyst 117500.00
## 12 2021 Cloud Data Engineer 160000.00
## 13 2020 Computer Vision Engineer 60000.00
## 14 2021 Computer Vision Engineer 24000.00
## 15 2022 Computer Vision Engineer 67500.00
## 16 2021 Computer Vision Software Engineer 70000.00
## 17 2022 Computer Vision Software Engineer 150000.00
## 18 2020 Data Analyst 53200.00
## 19 2021 Data Analyst 91250.00
## 20 2022 Data Analyst 107203.70
## 21 2021 Data Analytics Engineer 80000.00
## 22 2022 Data Analytics Engineer 20000.00
## 23 2022 Data Analytics Lead 405000.00
## 24 2021 Data Analytics Manager 126666.67
## 25 2022 Data Analytics Manager 127485.00
## 26 2021 Data Architect 166666.67
## 27 2022 Data Architect 182076.62
## 28 2020 Data Engineer 133700.00
## 29 2021 Data Engineer 107089.06
## 30 2022 Data Engineer 146999.05
## 31 2021 Data Engineering Manager 159000.00
## 32 2020 Data Science Consultant 103000.00
## 33 2021 Data Science Consultant 90000.00
## 34 2022 Data Science Engineer 60000.00
## 35 2020 Data Science Manager 190200.00
## 36 2021 Data Science Manager 177500.00
## 37 2022 Data Science Manager 170196.60
## 38 2020 Data Scientist 149158.57
## 39 2021 Data Scientist 97883.33
## 40 2022 Data Scientist 148052.00
## 41 2021 Data Specialist 165000.00
## 42 2021 Director of Data Engineering 200000.00
## 43 2020 Director of Data Science 325000.00
## 44 2021 Director of Data Science 209000.00
## 45 2021 Financial Data Analyst 450000.00
## 46 2022 Financial Data Analyst 100000.00
## 47 2021 Head of Data 232500.00
## 48 2022 Head of Data 200000.00
## 49 2021 Head of Data Science 97500.00
## 50 2022 Head of Data Science 195937.50
## 51 2020 Lead Data Analyst 87000.00
## 52 2021 Lead Data Analyst 170000.00
## 53 2020 Lead Data Engineer 90500.00
## 54 2021 Lead Data Engineer 218000.00
## 55 2020 Lead Data Scientist 152500.00
## 56 2021 Machine Learning Developer 100000.00
## 57 2020 Machine Learning Engineer 179333.33
## 58 2021 Machine Learning Engineer 98980.50
## 59 2022 Machine Learning Engineer 156249.56
## 60 2021 Machine Learning Infrastructure Engineer 195000.00
## 61 2020 Machine Learning Scientist 260000.00
## 62 2021 Machine Learning Scientist 145500.00
## 63 2022 Machine Learning Scientist 141766.67
## 64 2021 ML Engineer 263000.00
## 65 2021 Principal Data Analyst 170000.00
## 66 2022 Principal Data Analyst 75000.00
## 67 2021 Principal Data Engineer 328333.33
## 68 2021 Principal Data Scientist 255500.00
## 69 2020 Product Data Analyst 20000.00
## 70 2020 Research Scientist 246000.00
## 71 2021 Research Scientist 73333.00
## 72 2022 Research Scientist 132000.00
## 73 2021 Staff Data Scientist 105000.00
# Create a heat map
ggplot(average_salary, aes(x = work_year, y = job_title, fill = salary_in_usd)) +
geom_tile() +
scale_fill_gradient(low = "lightblue", high = "darkblue", labels = scales::comma) + # Remove scientific notation
labs(title = "Average Salary by Work Year and Job Title",
x = "Work Year",
y = "Job Title",
fill = "Average Salary (USD)") +
theme(legend.position = "right") # Adjust legend position
The abbreviations “EN”, “MI”, “SE”, and “EX” represent experience levels: entry-level (EN), mid-level (MI), senior-level (SE), and executive-level (EX).
These abbreviations are commonly used in job postings or HR contexts to describe the level of experience required or preferred for a particular role.
# Calculate average salary by experience level, job title, and company size
average_salary <- aggregate(salary_in_usd ~ experience_level + job_title + company_size, data = df_usd, FUN = mean)
print(average_salary)
## experience_level job_title company_size
## 1 MI AI Scientist L
## 2 SE AI Scientist L
## 3 MI Applied Data Scientist L
## 4 SE Applied Data Scientist L
## 5 MI Applied Machine Learning Scientist L
## 6 EX BI Data Analyst L
## 7 EN Big Data Engineer L
## 8 EN Business Data Analyst L
## 9 MI Business Data Analyst L
## 10 EN Data Analyst L
## 11 MI Data Analyst L
## 12 SE Data Analyst L
## 13 MI Data Analytics Engineer L
## 14 SE Data Analytics Lead L
## 15 SE Data Analytics Manager L
## 16 MI Data Architect L
## 17 EN Data Engineer L
## 18 MI Data Engineer L
## 19 SE Data Engineer L
## 20 SE Data Engineering Manager L
## 21 MI Data Science Consultant L
## 22 SE Data Science Engineer L
## 23 SE Data Science Manager L
## 24 EN Data Scientist L
## 25 MI Data Scientist L
## 26 SE Data Scientist L
## 27 SE Data Specialist L
## 28 SE Director of Data Engineering L
## 29 EX Director of Data Science L
## 30 EN Financial Data Analyst L
## 31 MI Financial Data Analyst L
## 32 EX Head of Data L
## 33 MI Lead Data Analyst L
## 34 SE Lead Data Analyst L
## 35 SE Lead Data Engineer L
## 36 MI Lead Data Scientist L
## 37 EN Machine Learning Engineer L
## 38 SE Machine Learning Engineer L
## 39 EN Machine Learning Scientist L
## 40 MI Machine Learning Scientist L
## 41 SE Machine Learning Scientist L
## 42 MI ML Engineer L
## 43 EX Principal Data Engineer L
## 44 SE Principal Data Engineer L
## 45 MI Principal Data Scientist L
## 46 SE Principal Data Scientist L
## 47 EN Research Scientist L
## 48 MI Research Scientist L
## 49 SE Research Scientist L
## 50 EN AI Scientist M
## 51 MI AI Scientist M
## 52 EX Analytics Engineer M
## 53 SE Analytics Engineer M
## 54 MI Applied Machine Learning Scientist M
## 55 MI BI Data Analyst M
## 56 MI Big Data Engineer M
## 57 EN Computer Vision Engineer M
## 58 SE Computer Vision Engineer M
## 59 EN Computer Vision Software Engineer M
## 60 EN Data Analyst M
## 61 EX Data Analyst M
## 62 MI Data Analyst M
## 63 SE Data Analyst M
## 64 EN Data Analytics Engineer M
## 65 SE Data Analytics Engineer M
## 66 SE Data Analytics Manager M
## 67 SE Data Architect M
## 68 EN Data Engineer M
## 69 EX Data Engineer M
## 70 MI Data Engineer M
## 71 SE Data Engineer M
## 72 MI Data Science Manager M
## 73 SE Data Science Manager M
## 74 EN Data Scientist M
## 75 MI Data Scientist M
## 76 SE Data Scientist M
## 77 SE Head of Data M
## 78 EX Head of Data Science M
## 79 MI Lead Data Engineer M
## 80 EN Machine Learning Engineer M
## 81 SE Machine Learning Engineer M
## 82 SE Machine Learning Infrastructure Engineer M
## 83 MI Machine Learning Scientist M
## 84 SE Principal Data Analyst M
## 85 SE Principal Data Engineer M
## 86 MI Research Scientist M
## 87 SE Staff Data Scientist M
## 88 EN AI Scientist S
## 89 EN BI Data Analyst S
## 90 MI Big Data Engineer S
## 91 SE Cloud Data Engineer S
## 92 SE Computer Vision Engineer S
## 93 EN Computer Vision Software Engineer S
## 94 EN Data Analyst S
## 95 MI Data Analyst S
## 96 SE Data Analyst S
## 97 EN Data Engineer S
## 98 SE Data Engineer S
## 99 EN Data Science Consultant S
## 100 EN Data Scientist S
## 101 MI Data Scientist S
## 102 SE Director of Data Science S
## 103 MI Head of Data Science S
## 104 SE Lead Data Engineer S
## 105 SE Lead Data Scientist S
## 106 EN Machine Learning Developer S
## 107 EN Machine Learning Engineer S
## 108 MI Machine Learning Engineer S
## 109 SE Machine Learning Engineer S
## 110 SE Machine Learning Scientist S
## 111 SE ML Engineer S
## 112 MI Principal Data Analyst S
## 113 EX Principal Data Scientist S
## 114 MI Product Data Analyst S
## 115 SE Research Scientist S
## salary_in_usd
## 1 200000.00
## 2 55000.00
## 3 157000.00
## 4 278500.00
## 5 249000.00
## 6 150000.00
## 7 70000.00
## 8 100000.00
## 9 135000.00
## 10 81500.00
## 11 76857.14
## 12 200000.00
## 13 110000.00
## 14 405000.00
## 15 130000.00
## 16 166666.67
## 17 76250.00
## 18 109777.78
## 19 157387.50
## 20 159000.00
## 21 103000.00
## 22 60000.00
## 23 177500.00
## 24 37133.33
## 25 113777.78
## 26 187863.64
## 27 165000.00
## 28 200000.00
## 29 287500.00
## 30 100000.00
## 31 450000.00
## 32 232500.00
## 33 87000.00
## 34 170000.00
## 35 276000.00
## 36 115000.00
## 37 250000.00
## 38 178333.33
## 39 225000.00
## 40 136150.00
## 41 225000.00
## 42 270000.00
## 43 600000.00
## 44 185000.00
## 45 151000.00
## 46 227500.00
## 47 87333.33
## 48 69999.00
## 49 144000.00
## 50 12000.00
## 51 120000.00
## 52 155000.00
## 53 195000.00
## 54 38400.00
## 55 99000.00
## 56 60000.00
## 57 67500.00
## 58 24000.00
## 59 70000.00
## 60 62250.00
## 61 120000.00
## 62 105584.44
## 63 112859.03
## 64 20000.00
## 65 50000.00
## 66 125988.00
## 67 182076.62
## 68 120000.00
## 69 245500.00
## 70 111232.40
## 71 142032.03
## 72 200000.00
## 73 160295.75
## 74 71000.00
## 75 127519.23
## 76 158403.45
## 77 200000.00
## 78 158958.33
## 79 56000.00
## 80 21844.00
## 81 183541.00
## 82 195000.00
## 83 82500.00
## 84 170000.00
## 85 200000.00
## 86 450000.00
## 87 105000.00
## 88 12000.00
## 89 32136.00
## 90 18000.00
## 91 160000.00
## 92 60000.00
## 93 150000.00
## 94 53333.33
## 95 39000.00
## 96 80000.00
## 97 65000.00
## 98 115000.00
## 99 90000.00
## 100 98333.33
## 101 58753.33
## 102 168000.00
## 103 110000.00
## 104 142500.00
## 105 190000.00
## 106 100000.00
## 107 89800.00
## 108 97000.00
## 109 92500.00
## 110 190000.00
## 111 256000.00
## 112 75000.00
## 113 416000.00
## 114 20000.00
## 115 50000.00
# Create a heatmap
ggplot(average_salary, aes(x = experience_level, y = job_title, fill = salary_in_usd)) +
geom_tile() +
scale_fill_gradient(low = "lightblue", high = "darkblue", labels = scales::comma) + # Remove scientific notation
labs(title = "Average Salary by Job Title and Experience Level",
x = "Experience Level",
y = "Job Title",
fill = "Average Salary (USD)") +
theme(legend.position = "right") # Adjust legend position
The abbreviations “L”, “M”, and “S” represent company sizes: large (L), medium (M), and small (S).
# Calculate average salary by company size, job title, and experience level
average_salary <- aggregate(salary_in_usd ~ company_size + job_title + experience_level, data = df_usd, FUN = mean)
head(average_salary)
## company_size job_title experience_level salary_in_usd
## 1 M AI Scientist EN 12000
## 2 S AI Scientist EN 12000
## 3 S BI Data Analyst EN 32136
## 4 L Big Data Engineer EN 70000
## 5 L Business Data Analyst EN 100000
## 6 M Computer Vision Engineer EN 67500
# Create a heatmap
ggplot(average_salary, aes(x = company_size, y = job_title, fill = salary_in_usd)) +
geom_tile() +
scale_fill_gradient(low = "lightblue", high = "darkblue", labels = scales::comma) + # Remove scientific notation
labs(title = "Average Salary by Job Title and Company Size",
x = "Company Size",
y = "Job Title",
fill = "Average Salary (USD)") +
theme(legend.position = "right") # Adjust legend position
The abbreviations “CT”, “FL”, “FT”, and “PT” represent employment types: contract (CT), freelance (FL), full-time (FT), and part-time (PT).
These abbreviations are commonly used in employment contexts to describe the nature of the work arrangement or employment status.
# Calculate average salary by employment type, job title, and experience level
average_salary <- aggregate(salary_in_usd ~ employment_type + job_title + experience_level, data = df_usd, FUN = mean)
head(average_salary)
## employment_type job_title experience_level
## 1 PT AI Scientist EN
## 2 FT BI Data Analyst EN
## 3 FT Big Data Engineer EN
## 4 CT Business Data Analyst EN
## 5 FT Computer Vision Engineer EN
## 6 FT Computer Vision Software Engineer EN
## salary_in_usd
## 1 12000
## 2 32136
## 3 70000
## 4 100000
## 5 67500
## 6 110000
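The heat maps in this section plot job title against one other variable, while some of the averages are computed over three variables. When that happens, several averages can fall on the same tile and only one of them remains visible. Here is a minimal sketch, on made-up data, of aggregating over only the two plotted variables so that each tile corresponds to exactly one row. The `toy` data frame is illustrative.

```r
# Made-up salaries with three grouping columns
toy <- data.frame(
  employment_type = c("FT", "FT", "FT", "PT"),
  job_title = c("Data Analyst", "Data Analyst", "Data Scientist", "Data Analyst"),
  experience_level = c("EN", "SE", "MI", "EN"),
  salary_in_usd = c(50000, 90000, 120000, 20000)
)

# Three-variable aggregate: full-time Data Analyst appears twice (EN and SE)
by_three <- aggregate(salary_in_usd ~ employment_type + job_title + experience_level,
                      data = toy, FUN = mean)

# Two-variable aggregate: one row, and so one tile, per employment type and title
by_two <- aggregate(salary_in_usd ~ employment_type + job_title,
                    data = toy, FUN = mean)

print(by_three)
print(by_two)
```

Passing a data frame shaped like `by_two` to geom_tile() would give each cell a single, well-defined average.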
# Create a heatmap
ggplot(average_salary, aes(x = employment_type, y = job_title, fill = salary_in_usd)) +
geom_tile() +
scale_fill_gradient(low = "lightblue", high = "darkblue", labels = scales::comma) + # Remove scientific notation
labs(title = "Average Salary by Job Title and Employment Type",
x = "Employment Type",
y = "Job Title",
fill = "Average Salary (USD)") +
theme(legend.position = "right") # Adjust legend position
# Calculate average salary by job title
average_salary <- df_usd %>%
group_by(job_title) %>%
summarise(average_salary = mean(salary_in_usd)) %>%
arrange(desc(average_salary)) %>%
slice(1:10) # Select top 10 job titles
# Create a bar plot
ggplot(average_salary, aes(x = reorder(job_title, -average_salary), y = average_salary)) +
geom_bar(stat = "identity", fill = "darkblue") +
geom_text(aes(label = sprintf("$%.2f", average_salary)), vjust = -0.1, size = 3) + # Add salary labels
labs(title = "Top 10 Data Science Job Titles by Average Salary",
x = "Job Title",
y = "Average Salary (USD)") +
scale_y_continuous(labels = scales::comma_format()) + # Remove scientific notation
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.x = element_blank()) # Remove x-axis label for better readability
Based on our analysis and visualization of this Kaggle dataset, the Data Analyst, Data Scientist, and Machine Learning Engineer roles point to the most valued data science skills, given their high representation, salary distribution, job market demand, and industry trends.
By work year, the highest average salaries belonged to Financial Data Analyst in 2021 ($450,000) and Data Analytics Lead in 2022 ($405,000). By experience level (executive-level (EX), mid-level (MI), and senior-level (SE)), the highest average salaries belonged to Principal Data Engineer, Research Scientist, Financial Data Analyst, and Data Analytics Lead. Average salaries were highest mostly at large companies, where Financial Data Analyst and Data Analytics Lead were the best-paid titles. Research Scientists, on the other hand, were paid the most at medium-sized companies, and Principal Data Scientists at small companies.
In the employment type heat map, the full-time cells for Financial Data Analyst and Data Analytics Lead are the darkest, which is consistent with these titles receiving the highest salaries. In contrast, Principal Data Scientists receive a high salary as contractors. Therefore, skills related to analytics, data science, finance, machine learning, and research appear to be valued the most.