The aim of our project is to investigate data science jobs in the US. We are interested in the topic because we are studying to become data scientists and would therefore like to know more about the job market. We use data on data science job postings from Glassdoor, collected by web-scraping Glassdoor job posts for data science roles.
We start our analysis with a graph that gives a general overview of data science (and related) jobs and their salaries. From the graph above, it can be seen that Manager is the position that, on average, pays the most. Data Architect follows immediately, with an average yearly salary of $200,000. Contrary to our expectation, Data Modeler pays comparatively little and sits near the bottom of the ranking.
In this part of the project, we analyze which sectors have the highest salaries. We think this is useful because it highlights that, even within data science, salaries can differ significantly depending on the sector an individual chooses to work in.
The above graph shows the average salary for data-related job postings across different sectors. The highest-paying sectors are Media and Retail. However, the total numbers of openings from those sectors in the dataset are only 5 and 7, respectively. The sectors with the greatest opportunity for data-related jobs seem to be Business Services and Information Technology: although their salaries are in the average range across sectors, they have the highest number of openings.
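The sector comparison above rests on two numbers per sector: the average salary and the number of openings. The following Python sketch shows one way such a summary could be computed. It is not the code behind the graph, and the rows in `postings` are hypothetical values chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (sector, salary in thousands of USD).
# These values are illustrative only and are NOT taken from the Glassdoor data.
postings = [
    ("Media", 150.0),
    ("Media", 170.0),
    ("Information Technology", 120.0),
    ("Information Technology", 110.0),
    ("Information Technology", 130.0),
    ("Business Services", 115.0),
    ("Business Services", 105.0),
]


def summarize_by_sector(rows):
    """Return {sector: (average_salary, number_of_openings)}."""
    salaries = defaultdict(list)
    for sector, salary in rows:
        salaries[sector].append(salary)
    return {sector: (mean(values), len(values)) for sector, values in salaries.items()}


summary = summarize_by_sector(postings)

# Print sectors from highest to lowest average salary, with the count alongside,
# so that a high average based on very few openings is easy to spot.
for sector, (avg, n) in sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{sector:<25} avg={avg:6.1f}k  openings={n}")
```

Reporting the count next to the average matters here because an average based on 5 or 7 postings is much less stable than one based on many postings.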
After exploring the relationship between sectors and salaries, we next assess the popularity of data science roles and how it differs by state. We assume that Google search trends for data science related search terms (such as data modeler, data architect and data engineer) are an effective proxy for interest. We then compare how interest in data science roles, as measured by Google Trends, relates to the actual number of job openings. For this purpose, we pulled Google search trends for the following job titles across US states. Google Trends provides a relative measure of search popularity, not the absolute number of searches.
For the most part, we see a strong correlation between interest in job titles (as measured by Google search numbers) and the available jobs per state. California, unsurprisingly, has both the highest number of job openings and the greatest interest. Interestingly, Virginia, rather than New York, comes second in terms of openings and interest.
Another important insight is that there are many cases where the interest is not met by the available jobs. Texas, Illinois and Florida have very high Google search numbers for data science jobs, but they fall in the lowest range for the number of job postings.
The last insight from this graph concerns the interest in different job titles. Google searches for Data Analyst and Data Scientist are quite high, spanning a scale of 0-50, while for Machine Learning Engineer and Data Engineer the relative range is considerably narrower.
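The relationship between search interest and job openings can be quantified with a Pearson correlation coefficient. The sketch below is a minimal illustration under stated assumptions: the state labels and both columns of numbers are hypothetical placeholders, not values from Google Trends or from the Glassdoor dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-state values (placeholders, not real data):
# relative search interest (0-100 scale) and number of job openings.
states = ["State A", "State B", "State C", "State D"]
search_interest = [90, 60, 40, 20]
job_openings = [150, 80, 45, 30]

r = pearson(search_interest, job_openings)
print(f"Pearson r between interest and openings: {r:.3f}")
```

Because Google Trends values are relative rather than absolute, a correlation coefficient is a reasonable choice: it is unaffected by the scale of either variable.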
After analyzing the statewide breakdown of interest and availability of data science jobs, we thought it would be interesting to visualize on a map which cities and states have the most job openings and highest salaries.
Note: grey states have no openings
From the first map above we can see that the states with the most openings are California, Virginia, and Massachusetts. At the city level, the jobs are well distributed across the country. However, the cities with the most openings are San Francisco with 69 and New York with 50. Washington D.C. comes third with just 26 openings. It should be noted, though, that the openings in the dataset are recorded per community. This means that many openings in the areas surrounding the big cities are not counted as part of their metro areas. Zooming in on San Francisco gives a clearer idea: more than thirty areas with openings surround the San Francisco metropolitan area. Santa Clara has 9 openings, Redwood City has 7, San Jose 4, Cupertino 3, and there are many more.
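To illustrate how community-level counts understate a metro area, the following Python sketch rolls the openings quoted above into a single metro total. The opening counts come from the text; the community-to-metro mapping (`metro_of`) and the label "San Francisco Bay Area" are assumptions made for this example and are not part of the dataset.

```python
from collections import Counter

# Openings per community, as quoted in the text above.
openings = {
    "San Francisco": 69,
    "Santa Clara": 9,
    "Redwood City": 7,
    "San Jose": 4,
    "Cupertino": 3,
    "New York": 50,
}

# Hypothetical mapping from community to metro area (an assumption for
# illustration; the dataset itself does not contain this grouping).
metro_of = {
    "San Francisco": "San Francisco Bay Area",
    "Santa Clara": "San Francisco Bay Area",
    "Redwood City": "San Francisco Bay Area",
    "San Jose": "San Francisco Bay Area",
    "Cupertino": "San Francisco Bay Area",
}


def openings_by_metro(openings, metro_of):
    """Sum openings per metro area; unmapped communities keep their own name."""
    totals = Counter()
    for community, count in openings.items():
        totals[metro_of.get(community, community)] += count
    return totals


totals = openings_by_metro(openings, metro_of)
for area, count in totals.most_common():
    print(f"{area}: {count}")
```

With only the five communities listed here, the grouped total already exceeds the count for San Francisco alone, which is why the city-level map should be read with the surrounding areas in mind.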
## [1] "The median salary in the dataset is:"
## [1] 115.865
Note: grey states have no openings
Here, we wanted to focus more on the cities. Cities whose openings have an average salary higher than the national median are colored green, while cities with average salaries below the national median are colored red. Furthermore, the size of each circle is proportional to the city’s average salary. Among the better-known cities with a reasonable number of openings, Dallas (TX) and Sacramento (CA) have the highest average salary: $183,000. Three cities, one each in TX, CA and DE, show a salary of $271,000 but have only one job opening. Colorado Springs (CO) and Tulsa (OK) close the list with salaries of just 43k and 67k, respectively.
The states map, in turn, shows the average salary per state. The state with the highest average is Delaware (with $271,000), but this is because it has just one opening. North Carolina comes second with slightly under $150,000. South Carolina and Montana close the list with 95.5k and 93.75k, respectively.
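The coloring rule for the city map can be expressed as a simple comparison against the national median. In the sketch below, the median (115.865) and the four city averages are taken from the text; treating the median as thousands of US dollars and coloring a city exactly at the median red are assumptions, since the report does not state either explicitly.

```python
# National median salary as printed in the report output.
# Assumption: the value is expressed in thousands of USD.
NATIONAL_MEDIAN = 115.865

# City average salaries (thousands of USD) quoted in the text.
city_avg_salary = {
    "Dallas, TX": 183.0,
    "Sacramento, CA": 183.0,
    "Tulsa, OK": 67.0,
    "Colorado Springs, CO": 43.0,
}


def marker_color(avg_salary, median):
    """Green if the city's average is above the median, red otherwise.

    Assumption: a city exactly at the median is colored red; the report
    only describes the 'higher' and 'below' cases.
    """
    return "green" if avg_salary > median else "red"


colors = {city: marker_color(avg, NATIONAL_MEDIAN) for city, avg in city_avg_salary.items()}
for city, color in colors.items():
    print(f"{city:<22} {city_avg_salary[city]:6.1f}k -> {color}")
```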
Following our analysis of job openings and salaries, we wanted to analyze the descriptions of the job openings to better understand what skills and abilities are desired by potential employers. We achieve this goal by breaking down specific words that commonly show up in data science job descriptions, with a special focus on words that indicate a specific skill set of the applicants.
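A minimal way to break descriptions down into skill mentions is to count how many descriptions contain each term from a predefined list. The Python sketch below illustrates the idea. The skill list and the three example descriptions are hypothetical, and the approach is not necessarily the one used to produce the figures in this report.

```python
import re
from collections import Counter

# Hypothetical skill terms to look for (an assumption for illustration).
SKILLS = ["machine learning", "python", "sql", "communication", "tableau"]

# Hypothetical job descriptions, not taken from the Glassdoor data.
descriptions = [
    "Build machine learning models in Python and SQL.",
    "Strong communication skills; experience with SQL and Tableau dashboards.",
    "Python, machine learning, and clear written communication required.",
]


def count_skill_mentions(descriptions, skills):
    """Count the number of descriptions that mention each skill at least once."""
    counts = Counter()
    for text in descriptions:
        lowered = text.lower()
        for skill in skills:
            # Word boundaries prevent matching a term inside a longer word.
            if re.search(r"\b" + re.escape(skill) + r"\b", lowered):
                counts[skill] += 1
    return counts


counts = count_skill_mentions(descriptions, SKILLS)
for skill, n in counts.most_common():
    print(f"{skill:<18} {n}")
```

Counting each description at most once per skill keeps a single long posting from dominating the result; these counts are the kind of input a word cloud scales its words by.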
The word cloud above shows the most common skills mentioned in data science job descriptions. Some of the most common are obvious, like machine learning, modeling, knowing data languages and how to use certain applications. Some of the most common are a little more surprising, like having strong writing and communication skills, or being a good collaborator.
The comparison cloud above shows the most common skills and tools mentioned in data scientist versus data analyst job descriptions. The words in the data scientist job descriptions appear to be much more technical, such as machine learning, engineering, and modeling, while words in data analyst job descriptions are much more focused around the use of specific tools used for data analysis (such as Microsoft and Tableau) and communicating insights through dashboards and presentations.
The comparison cloud above shows the most common tools mentioned in data scientist versus data analyst job descriptions. Once again, we can see that data scientist roles appear to be much more technical and oriented toward coding for data analysis, with tools such as Python, SAS and Java topping the list, while the tools in data analyst descriptions (with the exception of SQL) are mostly general business intelligence tools used for less advanced analysis, such as Excel, Tableau, and Microsoft.
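One simple way to contrast two groups of job descriptions is to compare the share of descriptions in each group that mention a given tool. The sketch below is an illustration of that idea only: the descriptions are hypothetical, and this difference-of-shares measure is a stand-in, not a description of how the comparison cloud itself was computed.

```python
def mention_rate(descriptions, term):
    """Share of descriptions containing the term (case-insensitive substring match)."""
    if not descriptions:
        return 0.0
    term = term.lower()
    hits = sum(1 for text in descriptions if term in text.lower())
    return hits / len(descriptions)


def rate_differences(group_a, group_b, terms):
    """Positive values: term is more common in group_a; negative: in group_b."""
    return {t: mention_rate(group_a, t) - mention_rate(group_b, t) for t in terms}


# Hypothetical descriptions for each role (placeholders, not real postings).
scientist_descriptions = [
    "Python and machine learning experience required",
    "Python, SAS, and statistics background",
]
analyst_descriptions = [
    "Excel and Tableau dashboards",
    "SQL and Excel reporting",
]

diffs = rate_differences(scientist_descriptions, analyst_descriptions, ["python", "excel", "sql"])
for term, diff in sorted(diffs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:<8} {diff:+.2f}")
```

Substring matching is kept deliberately simple here; short terms such as "sas" or "r" would need word-boundary matching to avoid false hits.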
In this part of the project, we focus on the gender pay gap among data scientists in the US. Unfortunately, studies have found that male data scientists in the US earn more, and have a wider earnings range, than their female counterparts. Furthermore, it has been reported that in the United States more than 40% of women with full-time jobs in science leave the sector or go part time after having their first child. By contrast, only 23% of new fathers leave or cut their working hours. We would like to see whether these findings are also supported by the data that we have. We use the 2018 Kaggle Machine Learning & Data Science Survey, the most comprehensive dataset available on the state of ML and data science.
The above graph shows the average data scientist salary by educational level and gender. First of all, it can be seen that holding a Doctoral Degree instead of a Master’s Degree allows people to earn more, regardless of gender. However, women earn less than men at both the Doctoral and the Master’s level. Also, the earnings for men are more spread out and can reach $500,000 per year, while this does not happen for women. This graph therefore supports the conclusion that a gender pay gap exists among data scientists in the US.
The above graph shows the average data scientist salary by age and gender. At the beginning of their working careers, men and women earn almost the same. However, from about 30 years of age onward, the difference between men’s and women’s salaries becomes evident. Around thirty is the age at which many women become mothers and, precisely in this age group, the differences between male and female wages are clear: men earn almost twice as much as women. As the Nature article suggests, “parenthood is an important driver of gender imbalance in STEM employment.” In conclusion, we hope that the gap will close in the future.
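The comparison by education and gender amounts to a grouped average followed by a ratio between the two genders within each degree. The Python sketch below shows the calculation under stated assumptions: the rows are hypothetical and are not drawn from the Kaggle survey, and the labels are simplified for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey rows: (degree, gender, yearly salary in thousands of USD).
# Values are illustrative only and do NOT come from the Kaggle survey.
rows = [
    ("Master's", "Male", 120.0),
    ("Master's", "Male", 140.0),
    ("Master's", "Female", 110.0),
    ("Master's", "Female", 120.0),
    ("Doctoral", "Male", 150.0),
    ("Doctoral", "Male", 170.0),
    ("Doctoral", "Female", 140.0),
    ("Doctoral", "Female", 150.0),
]


def mean_by_group(rows):
    """Return {(degree, gender): average salary}."""
    groups = defaultdict(list)
    for degree, gender, salary in rows:
        groups[(degree, gender)].append(salary)
    return {key: mean(values) for key, values in groups.items()}


def female_to_male_ratio(means, degree):
    """Average female salary divided by average male salary for one degree."""
    return means[(degree, "Female")] / means[(degree, "Male")]


means = mean_by_group(rows)
for degree in ("Master's", "Doctoral"):
    ratio = female_to_male_ratio(means, degree)
    print(f"{degree}: female/male salary ratio = {ratio:.2f}")
```

A ratio below 1 indicates that women in that group earn less on average than men. Because averages hide the spread noted above, medians or full distributions would be a useful complement.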
In the same repository, you can find a process book with a description of the steps taken to clean the data and create our graphs.