Data Science is currently one of the most popular technical fields, and it offers plenty of opportunity for those who want to put their technical skills to work. This report examines trends in Data Science from 2020 through 2022, focusing on the various Data Science positions, average salaries, company spending on personnel, and remote work opportunities. It asks how advantageous it is to work in Data Science and whether the field leads to promising career growth.
The dataset provides information on data science salaries throughout the world. The fields collected are ‘work_year’; ‘experience_level’, expressed as entry level through executive level; ‘employment_type’, indicating part-time or full-time status; ‘job_title’ for the data science position; ‘salary’, the unadjusted salary; ‘salary_currency’ for the currency in which the salary is paid; ‘salary_in_usd’, the salary converted to US dollars to create a common baseline for comparison; ‘employee_residence’ for the country of residence; ‘remote_ratio’, indicating whether the individual teleworks or works in the office; ‘company_location’ for the country in which the company operates; and ‘company_size’, representing whether the company is small, mid-size, or large. The data is a mix of numeric and character (string) fields. The average salary ranges from about $50K to $215K, and there are 606 records in this dataset.
The data can show which data science positions are growing fastest. It can also inform reasonable salary expectations for a given level of experience. With this information, one can target the skillsets for the Data Science positions that offer the greatest opportunity to grow professionally in both experience and income.
setwd("C:\\Users\\santi\\OneDrive\\Desktop\\MBA Program\\GB736 Data Visualization\\R Assignment")
library(data.table)
library(dplyr)
library(ggplot2)
library(scales)
library(ggthemes)
library(RColorBrewer)
library(plotly)
DSsalaries1 <- read.csv("ds_salaries.csv")
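Before any analysis, it can help to confirm that the loaded data actually contains the fields the report relies on. The sketch below is a minimal illustration only: `toy_salaries` is a hypothetical, made-up stand-in so the example runs without the CSV file, and the level codes in its rows are assumptions. With the real data, `DSsalaries1` would take its place.

```r
# Hypothetical stand-in for read.csv("ds_salaries.csv"); all values are made up
toy_salaries <- data.frame(
  work_year = c(2020L, 2022L),
  experience_level = c("EN", "SE"),
  employment_type = c("FT", "PT"),          # assumed codes
  job_title = c("Data Analyst", "Data Engineer"),
  salary = c(50000, 120000),
  salary_currency = c("EUR", "USD"),
  salary_in_usd = c(57000, 120000),
  employee_residence = c("DE", "US"),
  remote_ratio = c(0L, 100L),
  company_location = c("DE", "US"),
  company_size = c("S", "M"),               # assumed codes
  stringsAsFactors = FALSE
)

# Fields named in the dataset description
expected_cols <- c("work_year", "experience_level", "employment_type",
                   "job_title", "salary", "salary_currency", "salary_in_usd",
                   "employee_residence", "remote_ratio", "company_location",
                   "company_size")

# Stop early if any expected field is absent
missing_cols <- setdiff(expected_cols, names(toy_salaries))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}

# Report the storage class of each column (numeric vs. character)
col_types <- sapply(toy_salaries, class)
print(col_types)
```

Checking the columns first means a renamed or missing field fails loudly here rather than inside a later aggregation.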
The findings in this report are based on a limited set of data that spans many different companies. The report will show that there are plenty of opportunities for flexible work and growth in the field of Data Science when positioned in the right company. While the positions of Data Scientist, Data Engineer, and Data Analyst are popular, there are emerging specialty positions, like Data Architect, that are more lucrative. In general, the more technical the position, the greater the opportunity for higher pay.
The dataset defines four experience levels for a data scientist: Entry Level, Mid-Level, Senior Level, and Executive Level/Director. Entry Level corresponds to a novice data scientist, while an Executive/Director is considered the foremost data science expert at a given company and is responsible for setting its strategic direction. According to the “Average Data Scientist Salary by Experience” chart below, the average salary ranges from $61K to $199K, with considerable jumps in salary ($50K+) from Mid-Level to Senior Level and from Senior Level to Director. Overall, there is a significant opportunity for an up-and-coming Data Scientist to earn a six-figure salary by mid-career.
#create data frame that counts records and averages salary by experience level
df_exp_avg_salary <- DSsalaries1 %>%
group_by(experience_level) %>%
summarise(n = n(), totsalary = sum(salary_in_usd),
AvgSalary = mean(salary_in_usd),
.groups = 'drop') %>%
data.frame()
#format the y-axis
ylab <- seq(0, max(df_exp_avg_salary$AvgSalary)+50000, 25000)
ylab2 <- format(ylab, big.mark = ",", scientific = FALSE)
my_labels <- paste0("$", ylab2)
#plot bar chart with salary labels at the top of each bar associated with experience level
chart1 <- ggplot(df_exp_avg_salary,
aes(x = reorder(experience_level, -AvgSalary),
y = AvgSalary)) +
geom_bar(stat ="identity",colour = "darkblue",
fill="darkblue", width = 0.5) +
theme_tufte()+
labs(title = "Average Data Scientist Salary by Experience",
x = "Experience", y = "Average Salary")+
theme(plot.title = element_text(hjust = 0.5))+
scale_y_continuous(labels = my_labels,
breaks = ylab,
limits = c(0, max(df_exp_avg_salary$AvgSalary)+25000)) +
geom_text(data = df_exp_avg_salary,
aes(x = experience_level,
y = AvgSalary,
label= paste0("$",scales::comma(AvgSalary)),
fill = NULL), vjust = -0.5, size = 3 ) +
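#name each label by its level code so labels stay correct if the bar order changes
scale_x_discrete(
labels=c(EX = 'Director', SE = 'Senior-Level', MI = 'Mid-Level', EN = 'Entry-Level'))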
chart1
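The averages quoted for each experience level can be cross-checked without any packages. The following is a minimal base R sketch on a made-up data frame (`toy`, hypothetical values, not real records); it assumes the level codes EN, MI, SE, and EX and maps them to the labels used on the chart.

```r
# Hypothetical sample; salaries are illustrative only
toy <- data.frame(
  experience_level = c("EN", "EN", "MI", "MI", "SE", "EX"),
  salary_in_usd = c(50000, 70000, 80000, 100000, 140000, 200000),
  stringsAsFactors = FALSE
)

# Mean salary per experience level
avg_by_level <- aggregate(salary_in_usd ~ experience_level, data = toy, FUN = mean)
names(avg_by_level)[2] <- "AvgSalary"

# Look up a readable label by level code (named vector, so order cannot drift)
level_labels <- c(EN = "Entry-Level", MI = "Mid-Level",
                  SE = "Senior-Level", EX = "Director")
avg_by_level$label <- unname(level_labels[as.character(avg_by_level$experience_level)])

# Highest average first, matching the bar order in the chart
avg_by_level <- avg_by_level[order(-avg_by_level$AvgSalary), ]
print(avg_by_level)
```

Looking labels up by code rather than by position avoids mislabeling a bar if the ranking of the levels ever changes.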
The “Data Science Expenditure Growth by Position and Year” chart below illustrates the growth in the top ten data science positions over the three years covered (2020-2022). These are the highest-demand positions within industry based on total money allocated. The positions of Data Scientist (~$11.2M), Data Engineer (~$10.5M), Data Analyst (~$7.3M), and Machine Learning Engineer (~$2.3M) experienced a steep increase in overall demand from 2021 to 2022. Additionally, the position of Data Architect (~$1.5M), which did not have any data in 2020, is also experiencing a surge in demand, whereas more senior positions like Principal Data Scientist, Director of Data Science, Research Scientist, and Machine Learning Scientist have experienced a decline in demand.
#data frame sums salaries by position and work year to get expenditures
df_DS_expenditures_by_JobTitle <- DSsalaries1%>%
select(job_title, work_year,
salary_in_usd) %>%
group_by(job_title, work_year) %>%
summarise(DSexpenditures = sum(salary_in_usd),
.groups = 'keep') %>%
data.frame()
#aggregate salaries by position to get total expenditures
top_DS_Jobs <- df_DS_expenditures_by_JobTitle %>%
select(job_title, work_year, DSexpenditures) %>%
group_by(job_title) %>%
summarise(JobDSexpenditures = sum(DSexpenditures),
.groups = 'keep') %>%
data.frame()
#sort expenditures in descending order
top_DS_Jobs <- top_DS_Jobs[order(top_DS_Jobs$JobDSexpenditures,
decreasing = TRUE),]
#commit top 10 positions to data frame
top10_DS_jobs <-top_DS_Jobs$job_title[1:10]
#filter Data Science Expenditures Data Frame to the Top 10 positions (by expenditures)
Newtop10_jobs <- df_DS_expenditures_by_JobTitle %>%
filter(job_title %in% top10_DS_jobs) %>%
select(job_title, work_year,DSexpenditures) %>%
data.frame()
#plot multiple line plot showing growth/decline trends from 2020-2022
chart5 <- ggplot(Newtop10_jobs, aes(x = work_year,
y = DSexpenditures,
group = job_title)) +
geom_line(aes(color = job_title), linewidth = 1) +
labs(title = "Data Science Expenditure Growth by Position and Year",
x = "Year", y = "Total Expenditure")+
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
geom_point(shape = 21, size = 2,
color = "black", fill = "white") +
scale_x_continuous(breaks = seq(min(Newtop10_jobs$work_year),
max(Newtop10_jobs$work_year),
by = 1)) +
scale_y_continuous(labels = dollar_format(suffix = "",
prefix = "$"),
breaks = seq(0,
max(Newtop10_jobs$DSexpenditures),
by = 2e6)) +
scale_color_brewer(palette = "Set3",
name = "Job Title",
guide = guide_legend(reverse = TRUE))
chart5
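The growth and decline read off the line chart can also be expressed as a year-over-year change per position. The sketch below is a hedged base R illustration on made-up totals (`toy_exp` is hypothetical); it assumes one row per job title and year, and it does not handle gaps between years.

```r
# Hypothetical expenditure totals; figures are illustrative only
toy_exp <- data.frame(
  job_title = c("Data Scientist", "Data Scientist", "Data Scientist",
                "Data Engineer", "Data Engineer"),
  work_year = c(2020, 2021, 2022, 2021, 2022),
  DSexpenditures = c(1e6, 2e6, 5e6, 1e6, 3e6),
  stringsAsFactors = FALSE
)

# Sort so that each title's years are in ascending order
toy_exp <- toy_exp[order(toy_exp$job_title, toy_exp$work_year), ]

# Change from the previous recorded year within each job title;
# the first year of each title has no predecessor and is NA
toy_exp$yoy_change <- ave(toy_exp$DSexpenditures, toy_exp$job_title,
                          FUN = function(x) c(NA, diff(x)))
print(toy_exp)
```

A positive `yoy_change` corresponds to a rising line segment in the chart and a negative value to a falling one.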
Remote working opportunities have become increasingly popular, especially since the pandemic. Data Science has long been considered a field that lends itself to remote work, but this does not hold at every company. The “Percentage of Telework Positions in Data Science (2020-2022)” chart below shows the evolution of remote work in Data Science from 2020 through 2022. In 2020, only 50% of positions were fully remote, 29% were partially remote, and about 21% were onsite. Over time, the percentage of fully remote positions increased to approximately 72%, while the percentage of onsite positions also increased, to about 25%. Both trends are interesting, as they point to the tension between telework and in-person work. Other factors could also play a role, such as new industries entering data science. Personally, in my role in the DoD, data science is becoming more important in combatting adversary threats, but these positions require onsite support due to the nature of the information.
#create a data frame that counts positions by remote ratio and work year, and express each count as a percent of the year's total
remote_df <- DSsalaries1 %>%
select(work_year, remote_ratio) %>%
group_by(work_year, remote_ratio) %>%
summarise(remote_count=length(remote_ratio),
.groups = 'keep')%>%
group_by(work_year) %>%
mutate(percent_of_tot =
round(100*remote_count/sum(remote_count), 1),
remotelabs = ifelse(remote_ratio == 100, "Fully Remote", ifelse(remote_ratio == 50, "Partially Remote",
"No Remote work")))%>%
data.frame()
#create an interactive three-layer nested pie chart that shows percent telework for each of 2020, 2021, and 2022 (outer ring = 2022, inner ring = 2020)
RemoteFig <- plot_ly(hole =0.7)%>%
layout(title = "Percentage of Telework Positions in Data Science (2020-2022)") %>%
add_trace(data = remote_df[remote_df$work_year == 2022, ],
labels = ~remotelabs,
values = ~remote_count,
type = "pie",
textposition = "inside",
hovertemplate = "Year:2022<br>Telework Status:%{label}<br>Percent:%{percent}<extra></extra>") %>%
add_trace(data = remote_df[remote_df$work_year == 2021, ],
labels = ~remotelabs,
values = ~remote_count,
type = "pie",
textposition = "inside",
hovertemplate = "Year:2021<br>Telework Status:%{label}<br>Percent:%{percent}<extra></extra>",
domain = list(
x = c(0.16, 0.84),
y = c(0.16, 0.84))) %>%
add_trace(data = remote_df[remote_df$work_year == 2020, ],
labels = ~remotelabs,
values = ~remote_count,
type = "pie",
textposition = "inside",
hovertemplate = "Year:2020<br>Telework Status:%{label}<br>Percent:%{percent}<extra></extra>",
domain = list(
x = c(0.27, 0.73),
y = c(0.27, 0.73)))
RemoteFig
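Because the nested pie chart shows its percentages only on hover, a plain table of the same shares can be useful for checking the figures quoted in the text. This is a minimal base R sketch on made-up rows (`toy_remote` is hypothetical), assuming `remote_ratio` takes the values 0, 50, and 100 as in the labeling code above.

```r
# Hypothetical records; counts are illustrative only
toy_remote <- data.frame(
  work_year = c(2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021),
  remote_ratio = c(100, 100, 50, 0, 100, 100, 100, 0)
)

# Same labeling rule as the report's remote_df
toy_remote$remotelabs <- ifelse(toy_remote$remote_ratio == 100, "Fully Remote",
                         ifelse(toy_remote$remote_ratio == 50, "Partially Remote",
                                "No Remote work"))

# Count positions by year and status, then convert each row (year) to percentages
counts <- table(toy_remote$work_year, toy_remote$remotelabs)
pct <- round(100 * prop.table(counts, margin = 1), 1)
print(pct)
```

Each row of `pct` sums to 100, so the rows can be compared directly across years.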
Data Science has been one of the most in-demand fields over the last five years, and companies are allocating more of their budgets to Data Science personnel. One would assume that larger companies allocate more funding, but the “Data Science Spending by Company Size in the United States” chart below shows otherwise. As expected, small companies spend less on data science positions overall, yet they spend about the same amount on entry-level personnel, about $1.8M, as their medium and large-sized counterparts. Additionally, medium-sized companies are outspending large companies on data science positions by nearly $15M. This difference is most noticeable in spending on Senior-level data scientists: mid-size companies have spent $25.6M versus $10.6M by large companies.
#create data frame that aggregates salary by company size and experience level (limited to employees residing in the United States)
df_company_totalspending <- DSsalaries1%>%
filter(employee_residence == "US") %>%
select(company_size,experience_level, salary_in_usd) %>%
group_by(company_size,experience_level) %>%
summarise(CntExp = length(experience_level),
SumSalary = sum(salary_in_usd), .groups = 'keep') %>%
group_by(company_size) %>%
mutate(avg_salary = SumSalary/CntExp,
percentTotalSalary = round(100*SumSalary/sum(SumSalary), 1 ),
numexplevel = ifelse(experience_level == "EN", 1,
ifelse(experience_level == "MI", 2,
ifelse(experience_level == "SE",3,4))))%>%
data.frame()
#reorganize x-axis labels
explevels <- c('EN', 'MI', 'SE', 'EX')
df_company_totalspending$experience_level <-factor(df_company_totalspending$experience_level,
levels = explevels)
#establish breaks in heatmap color step shading
breaks <- c(seq(0, max(df_company_totalspending$SumSalary),
by = 2e6))
#adjust legend number format
ylab <- seq(0, max(df_company_totalspending$SumSalary)/1e6,2)
my_labels <-paste0("$", ylab, "M")
#construct heatmap depicting total personnel spending by company size and experience level
chart4 <- ggplot(df_company_totalspending,
aes(x = numexplevel,
y = company_size,
fill = SumSalary)) +
geom_tile(color = "black")+
coord_equal(ratio = 1)+
labs(title = "Data Science Spending by Company Size in the United States",
x= "Experience Level
(1 = Entry Level, 2 = Mid Level, 3 = Senior Level, 4 = Executive Level)",
y = "Company Size",
fill = "Total Spending on Personnel") +
theme_minimal() +
theme(axis.title.x = element_text(size = 12),
plot.title = element_text(hjust = 0.5))+
scale_y_discrete(labels = c("Large", "Medium", "Small"))+
scale_fill_continuous(labels = my_labels, low = "white",
high = "darkblue", breaks = breaks)+
geom_text(data = df_company_totalspending,
aes(x = numexplevel,
y = company_size,
label= paste0("$", format((SumSalary/1e6), digits=1), "M"),
fill = NULL), vjust = -0.5, size = 3 ) +
guides(fill = guide_legend(reverse = TRUE,
override.aes = list(colour = "black")))
chart4
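The totals shown in the heatmap cells can be laid out as a small cross-tabulation for checking. The sketch below is an illustrative base R version on made-up salaries (`toy_spend` is hypothetical); the company size codes S, M, and L are assumptions.

```r
# Hypothetical records; salaries are illustrative only
toy_spend <- data.frame(
  company_size = c("S", "S", "M", "M", "M", "L"),
  experience_level = c("EN", "MI", "EN", "SE", "SE", "SE"),
  salary_in_usd = c(60000, 80000, 65000, 150000, 160000, 155000),
  stringsAsFactors = FALSE
)

# Fix the column order from entry level to executive level
toy_spend$experience_level <- factor(toy_spend$experience_level,
                                     levels = c("EN", "MI", "SE", "EX"))

# Sum salaries into a company size x experience level table;
# combinations with no records appear as 0
spend <- xtabs(salary_in_usd ~ company_size + experience_level, data = toy_spend)
print(spend)

# Row totals give overall spending per company size
total_by_size <- rowSums(spend)
print(total_by_size)
```

The row totals make it easy to compare overall spending between company sizes, which the heatmap itself only shows cell by cell.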
The “Top 25 Data Science Positions by Count” chart below confirms our previous findings. Data Scientist, Data Engineer, Data Analyst, Machine Learning Engineer, and Research Scientist are the leading positions held by those in the Data Science field. Mid-size companies employ the majority of data scientists in those positions, followed by large companies. The average salary for these positions is between $93K and $113K. Positions like Data Architect, Principal Data Scientist, Director of Data Science, and Applied Data Scientist lead the way in terms of average salary, at $175K to $215K a year. These appear to be the most vaunted and scarce positions in Data Science, requiring the highest level of expertise.
#count positions and average salary by company size and job title
df_jobtitlecount <- DSsalaries1 %>%
select(company_size, job_title,salary_in_usd) %>%
group_by(company_size, job_title) %>%
summarise(n=length(job_title), AvgSal = sum(salary_in_usd)/n, .groups = 'keep')%>%
data.frame()
#total positions and average salary by job title across all company sizes
df_companysalary <- DSsalaries1 %>%
select(job_title, salary_in_usd ) %>%
group_by(job_title) %>%
summarise(Totaljobs = length(job_title), Totalsalary = sum(salary_in_usd), AvgSalbyjob = Totalsalary/Totaljobs, .groups = 'keep') %>%
data.frame()
#keep only the job title and its average salary
df_companysalary2 <- df_companysalary %>%
select(job_title,AvgSalbyjob) %>%
data.frame()
#total number of positions per job title
top_jobs <- df_jobtitlecount %>%
select(job_title,n) %>%
group_by(job_title) %>%
summarise(m=sum(n), .groups = 'keep') %>%
data.frame()
#sort by count in descending order and keep the 25 most common job titles
top_jobs <- top_jobs[order(top_jobs$m, decreasing = TRUE),]
top_25jobs <-top_jobs$job_title[1:25]
#filter the counts by company size to the top 25 job titles
Newtop_25jobs <- df_jobtitlecount %>%
filter(job_title %in% top_25jobs) %>%
select(job_title, company_size, n, AvgSal) %>%
data.frame()
#filter the average salaries to the top 25 job titles
Newtop_25salaries <- df_companysalary2 %>%
filter(job_title %in% top_25jobs) %>%
select(job_title,AvgSalbyjob) %>%
data.frame()
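#plot stacked bars of position counts by company size, with average salary overlaid as a line read against the secondary axis
chart2 <- ggplot(Newtop_25jobs, aes(x=job_title, y = n, fill = company_size)) +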
geom_bar(stat = "identity", position = position_stack(reverse = TRUE)) +
coord_flip() +
theme_light() +
labs(title= "Top 25 Data Science Positions by Count", x = "Data Scientist Positions", y = "Number of Job Positions", fill = "Company Size") +
theme(plot.title = element_text(hjust = 0.5))+
scale_fill_brewer(palette = "Paired", guide = guide_legend (reverse = TRUE))+
geom_line(inherit.aes = FALSE, data = Newtop_25salaries,
aes(x= job_title, y = AvgSalbyjob/1000, colour = "Average Salary", group = 1), linewidth = 1)+
scale_color_manual(NULL, values = "black")+
scale_y_continuous(labels = comma,
sec.axis = sec_axis(~. *1000, name = "Average Salary", labels = dollar_format(suffix = "", prefix = "$")))+
geom_point(inherit.aes = FALSE, data=Newtop_25salaries, aes(x=job_title, y = AvgSalbyjob/1000, group = 1), size = 2, shape = 21, fill = "white", color = "black") +
theme(legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent", colour = NA),
legend.spacing = unit(-1, "lines"))
chart2
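Since the highest average salaries belong to positions with few records, it can help to view the count next to the average for each job title. The following is a minimal base R sketch on made-up rows (`toy_jobs` is hypothetical, and the cutoff of two titles stands in for the report's 25).

```r
# Hypothetical records; salaries are illustrative only
toy_jobs <- data.frame(
  job_title = c("Data Scientist", "Data Scientist", "Data Scientist",
                "Data Engineer", "Data Engineer", "Data Architect"),
  salary_in_usd = c(100000, 110000, 120000, 105000, 115000, 180000),
  stringsAsFactors = FALSE
)

# Count and average salary per job title (both results share the same title order)
n_by_title <- tapply(toy_jobs$salary_in_usd, toy_jobs$job_title, length)
avg_by_title <- tapply(toy_jobs$salary_in_usd, toy_jobs$job_title, mean)

by_title <- data.frame(
  job_title = names(n_by_title),
  n = as.vector(n_by_title),
  AvgSalbyjob = as.vector(avg_by_title),
  stringsAsFactors = FALSE
)

# Most common titles first, then keep the top two
by_title <- by_title[order(-by_title$n), ]
top_titles <- head(by_title, 2)
print(top_titles)
```

Reading `n` alongside `AvgSalbyjob` shows when a high average rests on only a handful of records.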
The data presented suggests that Data Science is a lucrative field to enter. Those interested in pursuing a career in Data Science would be better served by applying to positions at mid-size companies, which appear to offer greater opportunities for growth. The most lucrative positions require a holistic understanding of all aspects of data science, such as machine learning, data architecture, and data analysis. The field offers a great deal of flexibility when it comes to remote work, and there is a high monetary payoff for those who succeed in it.