The skills were counted for each position: data scientist, data analyst, and data engineer. Vague skills that we felt didn’t provide much insight for this research were removed.
library(dplyr)
# Count how often each skill appears per role. The minimum count differs
# by role, and the search terms themselves are dropped as too vague.
dsskills <- df %>%
  filter(query == "Data Science") %>%
  count(skills) %>%
  filter(n > 5
         & skills != "Data Science"
         & skills != "Data Analysis"
         & skills != "Data Engineering")
daskills <- df %>%
  filter(query == "Data Analyst") %>%
  count(skills) %>%
  filter(n > 2
         & skills != "Data Science"
         & skills != "Data Analysis"
         & skills != "Data Engineering")
deskills <- df %>%
  filter(query == "Data Engineer") %>%
  count(skills) %>%
  filter(n > 8
         & skills != "Data Science"
         & skills != "Data Analysis"
         & skills != "Data Engineering")
library(ggplot2)
# Horizontal bar chart of skill counts, ordered by frequency.
# No fill aesthetic is mapped, so no manual fill scale is needed.
ggplot(dsskills, aes(x = reorder(skills, n), y = n)) +
  geom_col(width = .5, fill = "lightblue", color = "darkblue") +
  labs(title = "Desired Skills for a Data Scientist", x = "Skill", y = "Count") +
  coord_flip()
For each job position, the estimated salary posted on the website was recorded and counted. Some minor cleaning was required to get accurate and consistent counts.
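library(stringr)
# Strip everything up to and including the last "$", then any letters
# (such as "K"), then any decimal part.
df$wage <- str_replace_all(df$wage, ".*\\$", "")
df$wage <- str_replace_all(df$wage, "[:alpha:]", "")
df$wage <- str_replace_all(df$wage, "\\.[:digit:]{1,3}", "")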
dsskills <- df %>%
filter(query == "Data Science") %>%
count(wage) %>%
filter(wage != "null" & wage != " " & wage != "")
daskills <- df %>%
filter(query == "Data Analyst") %>%
count(wage) %>%
filter(wage != "null" & wage != " " & wage != "")
deskills <- df %>%
filter(query == "Data Engineer") %>%
count(wage) %>%
filter(wage != "null" & wage != " " & wage != "")
dsskills$salary <- as.numeric(dsskills$wage)
daskills$salary <- as.numeric(daskills$wage)
deskills$salary <- as.numeric(deskills$wage)
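# Density of posted salaries; the dashed line marks the mean.
# na.rm = TRUE guards against wages that failed to parse as numbers.
ggplot(dsskills, aes(x = salary)) +
  geom_density(color = "darkblue", fill = "lightblue") +
  xlim(0, 350) +
  geom_vline(aes(xintercept = mean(salary, na.rm = TRUE)),
             color = "blue", linetype = "dashed", size = 1) +
  labs(title = "Data Scientist Salaries")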
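To make the wage cleaning step concrete, here is a minimal sketch that applies the same three regular expressions to a few wage strings. The sample strings and the `clean_wage` helper are made up for illustration; the real postings may be formatted differently.

```r
library(stringr)

# Hypothetical raw wage strings (not taken from the scraped data).
raw_wage <- c("$120K", "$95.5K", "Estimated: $80K - $110K", "null")

# Hypothetical helper wrapping the three replacements used in this report.
clean_wage <- function(x) {
  x <- str_replace_all(x, ".*\\$", "")              # drop everything up to the last "$"
  x <- str_replace_all(x, "[:alpha:]", "")          # drop letters such as "K"
  x <- str_replace_all(x, "\\.[:digit:]{1,3}", "")  # drop a decimal part
  x
}

cleaned <- clean_wage(raw_wage)
salary <- as.numeric(cleaned)  # an empty string becomes NA

print(data.frame(raw_wage, cleaned, salary))
```

Note that the first pattern is greedy, so a range such as "$80K - $110K" keeps only its upper end (110), and a "null" entry ends up as an empty string that converts to `NA`.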
The location of each post was another topic of interest. We counted the total number of job postings from each city to better understand which cities are hiring data scientists, engineers, and analysts.
# Count postings per city first; count() creates the `n` column used below.
# Assumes df has one row per posting; de-duplicate first if it does not.
df %>%
  count(location) %>%
  ggplot(aes(x = reorder(location, n), y = n)) +
  geom_col(width = .5, fill = "lightblue", color = "darkblue") +
  labs(title = "Listed Job Openings", x = "City", y = "Count") +
  coord_flip()
Looking at Python, R, and SQL, we see that Python and SQL are neck and neck. For data scientists and data engineers, Python experience was listed several more times than SQL. For data analyst positions, however, SQL was listed the most. Overall, Python is the most popular language for data scientists, SQL for data analysts, and Java for data engineers.
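The language comparison above can be reproduced with a grouped count. The following is a minimal sketch, assuming `df` holds one row per posting and skill pair with the `query` and `skills` columns used earlier; the `toy` data frame and its values are made up for illustration.

```r
library(dplyr)

# Toy stand-in for df: one row per (posting, skill) pair. Values are made up.
toy <- data.frame(
  query  = c("Data Science", "Data Science", "Data Analyst",
             "Data Analyst", "Data Engineer", "Data Engineer"),
  skills = c("Python", "SQL", "SQL", "SQL", "Java", "Python"),
  stringsAsFactors = FALSE
)

# Keep only the three languages of interest and count them per role.
language_counts <- toy %>%
  filter(skills %in% c("Python", "R", "SQL")) %>%
  count(query, skills) %>%
  arrange(query, desc(n))

print(language_counts)
```

Swapping `toy` for the real `df` should give the per-role counts behind the comparison, provided the skill names are spelled consistently in the data.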
Data Scientist: The top 5 skills employers desire from their data scientists are Machine Learning, Python, Data Mining, Deep Learning, and SQL.
Data Analyst: The top 5 skills employers desire from their data analysts are SQL, Tableau, Python, Dashboard, and Data Visualization.
Data Engineer: The top 5 skills employers desire from their data engineers are Java, Python, Big Data, Spark, and Hadoop.
Salaries range from 50k to 200k for each job position. Some exceed 200k and some fall below 50k; however, for a student’s sake, this is the range we focus on. Mean salaries increase from data analyst to data scientist to data engineer (data analyst < data scientist < data engineer).
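One way to check the ordering of mean salaries is a grouped summary restricted to the 50k to 200k range. This is only a sketch under assumptions: `toy_wages` and its numbers are invented, and `salary` is taken to be in thousands of dollars, as the plot limits earlier suggest.

```r
library(dplyr)

# Made-up salaries in thousands of dollars; not the scraped data.
toy_wages <- data.frame(
  query  = c("Data Analyst", "Data Analyst", "Data Science",
             "Data Science", "Data Engineer", "Data Engineer"),
  salary = c(60, 80, 100, 120, 130, NA),
  stringsAsFactors = FALSE
)

# Drop missing and out-of-range salaries, then average per role.
mean_by_role <- toy_wages %>%
  filter(!is.na(salary), salary >= 50, salary <= 200) %>%
  group_by(query) %>%
  summarise(mean_salary = mean(salary), postings = n()) %>%
  arrange(mean_salary)

print(mean_by_role)
```

Sorting by `mean_salary` makes the analyst, scientist, engineer ordering easy to read off, and the `postings` column shows how many salaries back each mean.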
The top 5 cities with the most job postings were San Francisco, Boston, New York City, Los Angeles, and Austin. Each of these cities had at least 45 job postings and is a great place to start a career in the tech industry.
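A ranking like this can be pulled straight from the posting counts. Below is a minimal sketch, assuming one row per posting with a `location` column; `toy_posts` is invented, and the toy example keeps the top 2 cities where the real analysis would keep 5.

```r
library(dplyr)

# Made-up postings, one row per posting.
toy_posts <- data.frame(
  location = c("Boston", "Boston", "Boston", "Austin", "Austin", "Denver"),
  stringsAsFactors = FALSE
)

# Count postings per city, sort from most to fewest, keep the top rows.
top_cities <- toy_posts %>%
  count(location, sort = TRUE) %>%
  head(2)

print(top_cities)
```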