Final project data 110: salaries

Author

Aminata Diatta

Introduction:

The dataset I am working with covers salaries in data science careers from 2020 to 2024. It includes job titles such as data scientist, data analyst, data engineer, cloud data specialist, and big data architect, among others. Each entry records the employee’s experience level, from entry level to executive level, and the type of employment (part-time, full-time, contract, or freelance). The dataset also gives the gross salary paid, the currency in which it was paid, and the equivalent salary in USD, along with the employee’s primary country of residence, the extent of remote work, the country of the employer’s main office, and the size of the company. Together, these variables make the dataset a useful resource for understanding salary structures and employment patterns in data science, and for career planning in this field.

I selected this dataset because of my interest in data science and my wish to learn more about its career paths. As I pursue my associate degree, I recognize that I might later work as a data engineer, data analyst, data scientist, or in another related position, so it makes sense to explore the salary prospects in these fields. Through this dataset, I aim to see how salaries have changed over the years and how they vary with factors such as job title and company size. By analyzing this data, I hope to better understand the financial landscape of the data science industry and make informed decisions about my career path.

Load the libraries:

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.3.3

Warning: package 'ggplot2' was built under R version 4.3.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
setwd("C:/Users/satad/Downloads")
data_salary <- read_csv("salaries.csv")

Rows: 16534 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): experience_level, employment_type, job_title, salary_currency, empl...
dbl (4): work_year, salary, salary_in_usd, remote_ratio

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

view(data_salary)

Clean the dataset:

I will focus on work_year, experience_level, job_title, salary_in_usd, employee_residence, remote_ratio, company_location, and company_size. Later I will also restrict the experience level so that it includes mid-level positions.

data_salary2 <- data_salary %>%
   select(work_year, experience_level, job_title, salary_in_usd, employee_residence, remote_ratio, company_location, company_size)
 view(data_salary2)

I will focus only on the United States: the company must be located there and the employee must also live there. The job titles I want to learn more about are Data Analyst, Data Scientist, and Data Engineer. The company size will be large, the experience level will be senior or mid-level, both remote and in-person jobs are included, and the employee must earn more than 100,000 USD.

data_salary3 <- data_salary2 |>
  filter(job_title %in% c("Data Analyst", "Data Engineer", "Data Scientist"),
         employee_residence == "US", company_location == "US", company_size == "L",
         experience_level %in% c("MI", "SE"), salary_in_usd > 100000)
View(data_salary3)
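To check that the filter keeps only the intended rows, the same conditions can be tried on a tiny hand-made table. This is a minimal sketch: the `toy` data frame and its values are invented for illustration (they are not rows from salaries.csv), and base R's `subset()` is used so that the example runs without the dataset.

```r
# Hypothetical toy rows, invented for illustration (not taken from salaries.csv)
toy <- data.frame(
  job_title          = c("Data Analyst", "Data Engineer", "Data Scientist", "ML Engineer"),
  employee_residence = c("US", "US", "CA", "US"),
  company_location   = c("US", "US", "US", "US"),
  company_size       = c("L", "L", "L", "L"),
  experience_level   = c("MI", "SE", "SE", "SE"),
  salary_in_usd      = c(95000, 150000, 180000, 200000),
  stringsAsFactors   = FALSE
)

# Same conditions as the filter() call above, written with base R
kept <- subset(
  toy,
  job_title %in% c("Data Analyst", "Data Engineer", "Data Scientist") &
    employee_residence == "US" &
    company_location == "US" &
    company_size == "L" &
    experience_level %in% c("MI", "SE") &
    salary_in_usd > 100000
)

# Only the Data Engineer row passes every condition
kept$job_title
```

The analyst row is dropped by the salary threshold, the scientist row by the residence condition, and the last row by the job title list, which shows that every condition must hold at once.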

I want to see my dataset arranged by year, from 2020 to 2024:

data_salary4 <- data_salary3 |> 
  arrange(work_year)
view(data_salary4)

First visualisation:

My first visualisation is a graph of salaries in the data field by year:

library(ggplot2)
# Print axis values in full instead of scientific notation
options(scipen = 999)
first_visualisation <- data_salary4 |>
  ggplot(aes(x = work_year, y = salary_in_usd)) +
  geom_bin2d() +  # shade each year/salary bin by the number of records in it
  labs(title = "Salaries in data careers by year",
       x = "Year",
       y = "Salary (USD)") +
  theme_gray()
first_visualisation

Comments:

Looking at the visualization, there are noticeably more records in 2024 than in previous years. 2020 stands out as the only year in which a salary as high as $400,000 appears. Although typical pay looks consistent in 2023 and 2024, no senior or mid-level position has a salary above $400,000 anywhere in the 2020 to 2024 span. This raises questions about who works in data science, and invites further inquiry into what drives salary trends and how job levels are distributed over time.
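The two observations above (how many records each year has, and the highest salary in each year) can also be read off as numbers rather than from the shading of the plot. The sketch below uses a small hypothetical `toy` data frame with invented values. With the real data, `toy` would be replaced by `data_salary4`.

```r
# Hypothetical values standing in for data_salary4 (invented for illustration)
toy <- data.frame(
  work_year     = c(2020, 2021, 2021, 2023, 2023, 2024, 2024, 2024),
  salary_in_usd = c(400000, 120000, 150000, 130000, 220000, 110000, 180000, 250000)
)

# Number of records in each year
records_per_year <- table(toy$work_year)
records_per_year

# Highest salary observed in each year
max_per_year <- tapply(toy$salary_in_usd, toy$work_year, max)
max_per_year
```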

Second Visualisation:

My second visualisation focuses on job title and salary. I want to find out which job in the data domain pays the most.

Second graph

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

second_plot <- ggplot(data_salary4, aes(x = job_title, y = salary_in_usd, fill = job_title)) +
  geom_violin() +
  scale_fill_manual(values = c("Data Scientist" = "blue", "Data Analyst" = "brown", "Data Engineer" = "pink")) +
  labs(title = "Salary Distribution by Job Title", caption = "Source: AI JOBS",
       x = "Job Title",
       y = "Salary (USD)") +
  theme_dark()
second_visualisation <- ggplotly(second_plot)  # convert the static plot to an interactive one
second_visualisation

Comments:

Based on the second visualization, it’s clear that data scientists tend to earn more than both data engineers and data analysts. While the salary difference between data analysts and data engineers is relatively small, data scientists can earn well over $250,000. Interestingly, despite the higher potential earnings for data scientists, the majority of individuals earning over $200,000 fall under the category of data engineers. This highlights the significant earning potential within the data engineering domain, even compared to roles traditionally associated with higher salaries like data science.
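The statement that most people earning over $200,000 are data engineers can be checked with a simple count by job title. The following is a sketch on invented values; the `toy` data frame is hypothetical and would be replaced by `data_salary4` in the real analysis.

```r
# Hypothetical rows, invented for illustration (not taken from salaries.csv)
toy <- data.frame(
  job_title     = c("Data Engineer", "Data Engineer", "Data Scientist",
                    "Data Analyst", "Data Scientist", "Data Engineer"),
  salary_in_usd = c(210000, 230000, 260000, 120000, 150000, 140000)
)

# Keep only the rows above $200,000, then count them by job title
high_earners <- toy[toy$salary_in_usd > 200000, ]
high_counts  <- table(high_earners$job_title)
high_counts

# Share of each job title among the high earners
prop.table(high_counts)
```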

I want to fit a linear regression to see whether there is a relationship between two of my variables: work_year and salary_in_usd.

ggplot(data_salary4, aes(x = salary_in_usd, y = work_year, color = job_title)) + 
  labs(title = "Salary between 2020 and 2024 based on the job title", 
       caption = "Source: AI JOBS",
       x = "Salary (USD)",
       y = "Work year") + 
  theme_minimal(base_size = 12) + 
  geom_point() + 
  geom_smooth(method = 'lm', formula = y ~ x, color = "black") +
  scale_color_manual(values = c("Data Scientist" = "blue", 
                                 "Data Engineer" = "green",
                                 "Data Analyst" = "orange"))

Comments:

The outcome is largely as expected. The plot suggests that salary in the data domain varies with job title and year, which is consistent with our hypothesis that pay depends on one’s career path and specialization. Career level itself is not shown in this plot, so that part of the hypothesis is not tested here.

Correlation between salary and remote ratio

Calculate the correlation coefficient:

correlation <- cor(data_salary4$remote_ratio, data_salary4$salary_in_usd)
print(correlation)

[1] -0.04557423

Comments :

A correlation coefficient of -0.04557423 suggests a very weak negative relationship between the remote ratio and salary. In other words, there is almost no linear association between these two variables. This indicates that changes in the remote ratio do not significantly predict changes in salary, and vice versa. Therefore, it seems that the remote work arrangement does not have a substantial impact on salary in this dataset.
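A correlation coefficient on its own does not say whether it differs from zero by more than chance. One way to check this is `cor.test()` from base R, which reports a p-value and a confidence interval alongside the coefficient. The sketch below runs on simulated stand-in values, so its numbers will not match the -0.0456 above. The sample size of 162 is taken from the 160 degrees of freedom in the regression output, and the assumption that remote_ratio only takes the values 0, 50, and 100 is mine.

```r
set.seed(110)

# Simulated stand-ins for data_salary4 (assumption: remote_ratio is 0, 50, or 100)
remote_ratio  <- sample(c(0, 50, 100), size = 162, replace = TRUE)
salary_in_usd <- round(runif(162, min = 100000, max = 300000))

# Pearson correlation with a significance test
test_result <- cor.test(remote_ratio, salary_in_usd)

test_result$estimate   # correlation coefficient
test_result$p.value    # p-value for the null hypothesis of zero correlation
test_result$conf.int   # 95% confidence interval for the correlation
```

With the real data the call would be `cor.test(data_salary4$remote_ratio, data_salary4$salary_in_usd)`.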

Make a linear regression equation:

lm_model <- lm(remote_ratio ~ salary_in_usd, data = data_salary4)
summary(lm_model)


Call:
lm(formula = remote_ratio ~ salary_in_usd, data = data_salary4)

Residuals:
   Min     1Q Median     3Q    Max 
-33.99 -33.04 -29.49  66.29  78.79 

Coefficients:
                 Estimate  Std. Error t value Pr(>|t|)   
(Intercept)   38.32868939 12.39613245   3.092  0.00235 **
salary_in_usd -0.00004155  0.00007200  -0.577  0.56470   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 45.67 on 160 degrees of freedom
Multiple R-squared:  0.002077,  Adjusted R-squared:  -0.00416 
F-statistic: 0.333 on 1 and 160 DF,  p-value: 0.5647

The equation:

remote_ratio = 38.32868939 - 0.00004155 × salary_in_usd
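To see what the equation implies in practice, it can be evaluated for a given salary. The short sketch below copies the two coefficients from the summary output above. The helper name `predict_remote_ratio` is hypothetical and only used for this illustration.

```r
# Coefficients copied from the summary(lm_model) output above
intercept <- 38.32868939
slope     <- -0.00004155

# Hypothetical helper: predicted remote ratio for a given salary in USD
predict_remote_ratio <- function(salary_in_usd) {
  intercept + slope * salary_in_usd
}

predict_remote_ratio(150000)  # 38.32868939 - 6.2325 = 32.09618939
slope * 100000                # change in remote ratio per extra $100,000: -4.155
```

An extra $100,000 of salary moves the predicted remote ratio by only about 4 points on a 0 to 100 scale, which matches the very weak relationship found above.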

Analyze the model based on p-values, adjusted R-squared values, and diagnostic plots

The linear regression of remote ratio on salary did not find a meaningful relationship. The coefficient of salary_in_usd has a p-value of 0.5647, so the effect of salary on remote ratio is not statistically significant. The adjusted R-squared of -0.00416 indicates that the model explains essentially none of the variance in remote ratio, which confirms that salary has little predictive power for remote work ratios. These results suggest that remote work arrangements depend on factors other than salary alone.
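The standard diagnostic plots for a linear model can be drawn with base R's `plot()` method for `lm` objects. The residual summary in the output above (median near -29, third quartile near 66) already hints that the residuals are not symmetric, and these plots would make that visible. The sketch below fits a model on simulated stand-in data, under my assumption that remote_ratio only takes the values 0, 50, and 100. With the real data, `plot(lm_model)` would be used instead.

```r
set.seed(110)

# Simulated stand-in for data_salary4 (invented values, 162 rows as in the output above)
toy <- data.frame(
  remote_ratio  = sample(c(0, 50, 100), size = 162, replace = TRUE),
  salary_in_usd = round(runif(162, min = 100000, max = 300000))
)
toy_model <- lm(remote_ratio ~ salary_in_usd, data = toy)

# Draw the four default diagnostic plots in a 2 x 2 grid:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
old_par <- par(mfrow = c(2, 2))
plot(toy_model)
par(old_par)  # restore the previous plotting layout
```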

Third Visualisation:

library(plotly)
# geom_bar() counts rows, so each bar shows the number of records, not a salary amount
Third_visualisation <- data_salary4 %>%
  ggplot(aes(job_title, fill = job_title)) +
  geom_bar() +
  theme_classic() +
  scale_fill_manual(values = c("Data Scientist" = "pink", 
                                "Data Engineer" = "red",
                                "Data Analyst" = "purple")) + 
  facet_wrap(~work_year) +
  labs(title = "Number of records by job title for each year",
       x = "Job Title",
       y = "Number of records",
       fill = "Job Title",
       caption = "Source: AI JOBS")
Third_visualisation

Comments:

The breakdown of each job title by year reveals some interesting patterns. Because geom_bar() counts rows, the bar heights reflect how many records in each group pass the filter (salary above $100,000), not the salary amounts themselves. In 2020, no data analyst records were above $100,000. Between 2020 and 2021, the bars for both data engineers and data scientists grew. By 2022, data scientists had the tallest bar, ahead of both data analysts and data engineers. In 2023, the data engineer bar reached its highest level, and the data analyst bar also rose noticeably. Between 2023 and 2024, the data engineer bar dropped and the data analyst bar did not grow. The 2024 data is still incomplete, but early indications suggest that data scientists are on track to pass their 2023 level. Overall, the mix of well-paid data roles shifts from year to year.
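Because `geom_bar()` with only an x aesthetic counts rows, a chart of actual salary levels needs a summary computed first. One possible approach, sketched below on invented values, is the median salary for each year and job title. The choice of the median is my assumption, and the `toy` data frame is hypothetical, standing in for `data_salary4`.

```r
# Hypothetical rows, invented for illustration (not taken from salaries.csv)
toy <- data.frame(
  work_year     = c(2023, 2023, 2023, 2024, 2024, 2024),
  job_title     = c("Data Engineer", "Data Engineer", "Data Scientist",
                    "Data Engineer", "Data Scientist", "Data Scientist"),
  salary_in_usd = c(120000, 160000, 200000, 150000, 180000, 220000)
)

# Median salary for every combination of year and job title
median_salary <- aggregate(salary_in_usd ~ work_year + job_title,
                           data = toy, FUN = median)
median_salary
```

The resulting table has one row per year and job title, so it could be plotted with `geom_col()` and the same `facet_wrap(~work_year)` to show salary rather than counts.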

Fourth Visualisation:

For my last visualisation, I animate salaries over time with gganimate.

library(dplyr)

# work_year stays numeric so that transition_time() can move smoothly between years

library(ggthemes)
options(repos = "https://cloud.r-project.org")
install.packages("gganimate")

Installing package into 'C:/Users/satad/AppData/Local/R/win-library/4.3'
(as 'lib' is unspecified)

package 'gganimate' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\satad\AppData\Local\Temp\RtmpqE26uS\downloaded_packages

library(gganimate)

Warning: package 'gganimate' was built under R version 4.3.3

library(gapminder)

Warning: package 'gapminder' was built under R version 4.3.3

fourth_vis <- ggplot(data_salary4, aes(x = work_year, y = salary_in_usd, color = job_title, size = salary_in_usd)) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c("Data Scientist" = "pink", 
                                 "Data Engineer" = "red",
                                 "Data Analyst" = "purple")) +
  scale_size(range = c(1, 15)) +
  labs(title = "Salary of Data Careers Over Time",
       x = "Work Year",
       y = "Salary (USD)", caption = "Source: AI JOBS",
       color = "Job Title") +  # the legend belongs to the color aesthetic, not fill
  theme_dark() +
  transition_time(work_year) +  # animate the plot over work_year
  ease_aes('linear')
fourth_vis

Comments:

In my last visualization, I show the salaries of Data Scientists, Data Analysts, and Data Engineers in the United States from 2020 to 2024. In the animation, the salaries of Data Scientists rise noticeably from 2020 to 2021. None of the bubbles goes above the $400,000 mark. The visualization illustrates how salaries in these data-related roles have changed and gives an idea of the potential earnings of a Data Scientist working today.

Adding interactivity:

library(plotly)
# ggplotly() animates through the frame aesthetic, so gganimate's transition_time() is not used here
fifth_vis <- ggplotly(
  ggplot(data_salary4, aes(x = work_year, y = salary_in_usd, color = job_title, size = salary_in_usd)) +
    geom_point(aes(frame = work_year), alpha = 0.7) +  # one animation frame per year
    scale_color_manual(values = c("Data Scientist" = "pink", 
                                   "Data Engineer" = "red",
                                   "Data Analyst" = "purple")) +
    scale_size(range = c(1, 15)) +
    scale_x_continuous(breaks = seq(2020, 2024, by = 1)) +
    labs(title = "Salary of Data Careers Over Time",
         x = "Work Year",
         y = "Salary (USD)",
         caption = "Source: AI JOBS",
         color = "Job Title") +
    theme_dark())
fifth_vis

Essay :

Working with the salaries dataset for my project has been an enlightening experience. Focusing on roles like data scientist, data engineer, and data analyst has deepened my understanding of these fields and given me insight into potential earnings. I have found the information in this dataset invaluable for my project.

An article I found through the Montgomery library explains that the rise of the digital era has led to a surge in available data, creating demand for skilled data science professionals. According to the article, Glassdoor has recognized Data Scientist as the top job in America, considering factors like job openings, salary, and satisfaction. The article goes on to promote the Ultimate Data and Analytics Bundle, which offers over 130 courses covering various data science techniques, with training in software such as SAS, R, Oracle, and databases, along with expert support. Originally priced at $1699, the bundle was offered for $39 at the time of the article, which urges readers not to miss the opportunity to stay ahead in the booming field of data science.

During the course of my project, I was surprised to discover that there isn’t a significant correlation between an employee’s salary and whether they work remotely or in an office setting. Initially, I assumed that those working in-office would receive higher pay than remote workers. Additionally, creating my final visualization posed some challenges. After investing several hours in the process, I managed to produce an animated visualization. However, I encountered difficulty in making it interactive. Consequently, I resolved to revise it by incorporating interactive features.

In conclusion, working with the salaries dataset has provided valuable insights into the fields of data science, data engineering, and data analysis. The rise of the digital era has created a demand for skilled professionals in these areas, with roles like Data Scientist being recognized as top jobs in America, so it is a good time for you to learn more about data as well!

Sources :

“Data Science is Booming, Don’t Get Left Behind.” Daily Beast, 3 Mar. 2018. Gale In Context: Opposing Viewpoints, link.gale.com/apps/doc/A529680861/OVIC?u=rock77357&sid=bookmark-OVIC&xid=96e01489. Accessed 6 May 2024.