1 A data-driven analysis of how to find a job.

In this post, I hope to get some insight into how to find a job as a data scientist, using the Kaggle 2017 survey data.

Load the data and some libraries.

library(data.table)
library(magrittr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
kaggle_data_science=fread("multipleChoiceResponses.csv",fill=TRUE,sep=",",na.strings = " ")
kaggle_data_science%>%class()
## [1] "data.table" "data.frame"

1.1 Gender disparity

Let us first look at the distribution of gender.

To keep the plot readable, I shorten the long response “Non-binary, genderqueer, or gender non-conforming” to “Non_binary”.

library(ggplot2)
library(magrittr)
kaggle_data_science[kaggle_data_science$GenderSelect=="Non-binary, genderqueer, or gender non-conforming","GenderSelect"]="Non_binary" 
kaggle_data_science%>%filter(!GenderSelect=="")%>%ggplot()+geom_bar(aes(x=GenderSelect),stat = "count",fill="orange")

There is clearly a gender disparity in the data science field.

table(kaggle_data_science$EmploymentStatus)
## 
##                                   Employed full-time 
##                                                10897 
##                                   Employed part-time 
##                                                  917 
## Independent contractor, freelancer, or self-employed 
##                                                 1330 
##                                  I prefer not to say 
##                                                  420 
##               Not employed, and not looking for work 
##                                                  924 
##                   Not employed, but looking for work 
##                                                 2110 
##                                              Retired 
##                                                  118

Most respondents are employed full-time.

1.2 Compensation in US and China

It does not make much sense to earn over a million dollars or RMB doing data analysis, so I will treat such values as human error here. I will examine those outliers later.
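One way to make this cutoff explicit is to parse the amounts first and then flag the implausible ones rather than dropping them silently. The following is only a minimal sketch in base R on toy values. The helper name parse_amount and the example strings are hypothetical, and the assumption that some answers contain thousands separators such as "250,000" has not been checked against the actual file.

```r
# Toy values standing in for the CompensationAmount column (hypothetical examples).
raw_amount <- c("120000", "250,000", "-", "", "1500000", "0")

# Remove thousands separators before converting; anything still non-numeric becomes NA.
parse_amount <- function(x) {
  cleaned <- gsub(",", "", x, fixed = TRUE)
  suppressWarnings(as.numeric(cleaned))
}

amount <- parse_amount(raw_amount)

# Flag values of one million or more as implausible instead of silently dropping them.
implausible <- !is.na(amount) & amount >= 1e6

print(data.frame(raw_amount, amount, implausible))
```

Keeping the flag as its own column makes it easy to come back to the outliers later, which is what the analysis intends to do.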

Next, what is the compensation difference between the U.S. and China?

kaggle_data_science[,CompensationAmount:=as.numeric(CompensationAmount)]
## Warning in eval(jsub, SDenv, parent.frame()): NAs introduced by coercion
kaggle_data_science%>%filter(CompensationAmount<1e6)%>%filter(Country %in% c("People 's Republic of China","United States"))%>%ggplot()+geom_histogram(aes(x=CompensationAmount,fill=Country),position = "dodge",stat = "density")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Most U.S. jobs pay around 100K dollars, which is about the expected compensation. The compensation distribution in China is quite different: a larger proportion of respondents report high amounts than in the U.S. Keep in mind that the amounts are in each respondent's own currency and are not converted to a common one.

Many jobs in China pay around 50K RMB, and many others pay around 200K to 400K RMB. Let us look more closely at the people who are paid around 50K and 200K RMB in China.

1.2.1 China compensation.

kaggle_data_science%>%filter(CompensationAmount<1e6)%>%filter(Country=="People 's Republic of China")%>%ggplot()+geom_histogram(aes(x=CompensationAmount,fill=EmploymentStatus),position = "dodge",binwidth=2e4)

Many people entered a compensation of 0 while employed full-time, which cannot be right. Some also report a compensation of 20K, which is not a reasonable amount either, since you would hardly be doing data analysis while earning less than the roughly 36K you could make washing dishes.

kaggle_data_science%>%filter(CompensationAmount<1e6)%>%filter(Country=="People 's Republic of China")%>%filter( CompensationAmount >3e4)%>%ggplot()+geom_histogram(aes(x=CompensationAmount,fill=EmploymentStatus),position = "dodge",binwidth=2e4)

This would be a more reasonable graph for the compensation in China.

Do all those people write code to analyze the data?

kaggle_data_science %>%filter(!CompensationCurrency == "")%>%filter(CompensationAmount<1e6)%>%filter(Country=="People 's Republic of China")%>%ggplot()+geom_histogram(aes(x=CompensationAmount,fill=CodeWriter),position = "dodge",binwidth=2e4)


Yes. They all write code to analyze data. It is hard to understand why some of them are paid much less while others are paid over 600K. Is it because of the job title (CurrentJobTitleSelect)?

kaggle_data_science%>%filter(CompensationAmount<1e6)%>%filter(Country=="People 's Republic of China")%>%filter( CompensationAmount >3e4)%>%ggplot()+geom_histogram(aes(x=CompensationAmount,fill=CurrentJobTitleSelect),position = "dodge",binwidth=5e4)

Yes. It has to do with the title of the job.

What are the most common job titles?

kaggle_data_science%>%filter(Country=="People 's Republic of China")%>%group_by(CurrentJobTitleSelect)%>%summarise(n=n())%>%arrange(desc(n))

Let us look at the compensation for Machine Learning Engineer, Data Analyst, Data Scientist.

kaggle_data_science%>%filter(Country=="People 's Republic of China")%>%filter(CurrentJobTitleSelect %in% c("Machine Learning Engineer", "Data Analyst","Data Scientist"))%>%filter(CompensationAmount>3e4)%>%ggplot()+geom_histogram(aes(x=CompensationAmount,fill=CurrentJobTitleSelect),position = "dodge",binwidth = 1e5)

TL;DR

The take-home message from the analysis of job compensation in China is that a data analyst's compensation is around 100K, with some reaching 200K, while a machine learning engineer's compensation is around 300K-400K, with some reaching over 650K. There are not enough data scientists in the sample to support any trend.
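A claim like "not enough data" is easier to judge when the group sizes sit next to the summary statistic. Here is a rough sketch in base R on made-up numbers. The toy data frame, the column names title and amount, and the threshold min_n are all illustrative assumptions, not values from the survey.

```r
# Hypothetical toy data; the real survey columns are CurrentJobTitleSelect and CompensationAmount.
toy <- data.frame(
  title  = c("Data Analyst", "Data Analyst", "Data Analyst",
             "Machine Learning Engineer", "Machine Learning Engineer",
             "Machine Learning Engineer", "Data Scientist"),
  amount = c(90000, 100000, 200000, 300000, 400000, 650000, 150000)
)

# Number of respondents and median compensation per title.
n_per_title      <- tapply(toy$amount, toy$title, length)
median_per_title <- tapply(toy$amount, toy$title, median)

summary_table <- data.frame(
  title  = names(n_per_title),
  n      = as.vector(n_per_title),
  median = as.vector(median_per_title)
)

# min_n is an arbitrary illustrative threshold, not something taken from the survey.
min_n <- 3
summary_table$enough_data <- summary_table$n >= min_n

print(summary_table)
```

Reporting n alongside the median shows at a glance which titles are backed by only a handful of answers.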

1.2.2 US

Let us do a similar analysis of compensation for US jobs.

First, select the 10 most common job titles in the US.

job_titles_us=kaggle_data_science%>%filter(Country=="United States")%>%group_by(CurrentJobTitleSelect)%>%summarise(n=n())%>%arrange(desc(n))%>%top_n(10)%>%pull(CurrentJobTitleSelect) # pull() returns a plain character vector
## Selecting by n
job_titles_us
kaggle_data_science%>%filter(Country=="United States")%>%filter(CurrentJobTitleSelect %in% c("Machine Learning Engineer", "Data Analyst","Data Scientist"))%>%filter(CompensationAmount>3e4& CompensationAmount<5e5)%>%ggplot()+geom_histogram(aes(x=CompensationAmount,fill=CurrentJobTitleSelect),position = "dodge",binwidth = 1e4)
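A list of the most common titles can also be fed straight into a filter instead of typing the titles by hand. The sketch below shows the idea in base R on toy data. The vector titles and the value top_k are hypothetical and chosen only for illustration.

```r
# Hypothetical toy titles; in the post these would come from CurrentJobTitleSelect.
titles <- c("Data Scientist", "Data Scientist", "Data Scientist",
            "Data Analyst", "Data Analyst",
            "Software Developer", "Researcher", "")

# Count non-empty titles and sort from most to least common.
title_counts <- sort(table(titles[titles != ""]), decreasing = TRUE)

# Keep the names of the top_k most common titles as a plain character vector.
top_k <- 2
top_titles <- names(title_counts)[seq_len(min(top_k, length(title_counts)))]

# The vector can then be used directly with %in% when filtering rows.
keep <- titles %in% top_titles

print(top_titles)
print(sum(keep))
```

Empty answers are removed before counting so that a blank title cannot end up in the top list.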

TL;DR

A data analyst's compensation is around 50K, while a data scientist's compensation is around 100K. There are far fewer machine learning engineers, but their compensation can reach as high as 400K.

How about other job titles?

kaggle_data_science%>%filter(Country=="United States")%>%filter(CurrentJobTitleSelect %in% c("Scientist/Researcher", "Researcher","Business Analyst"))%>%filter(CompensationAmount>3e4& CompensationAmount<5e5)%>%ggplot()+geom_histogram(aes(x=CompensationAmount,fill=CurrentJobTitleSelect),position = "dodge",binwidth = 1e4)

Most researchers have a compensation of around 50K, while business analysts have a compensation of around 60-70K. Surprisingly, some researchers have a compensation of 400K.

Let us look at where those high earning people are working. (CurrentEmployerType)

kaggle_data_science%>%filter(Country=="United States")%>%filter(CompensationAmount>3e5)%>%group_by(CurrentEmployerType,CurrentJobTitleSelect)%>%summarise(n=n())%>%arrange(desc(n))

Most of them work at a company that performs advanced analytics or at a professional services/consulting firm. Only one of them works at a university, and that person also works for a consulting firm and a company that performs advanced analytics.

Search for jobs like Machine Learning Engineer, Data Analyst, Data Scientist, Business analyst to get jobs doing data analysis in industry. But also pay attention to Software Developer/Software Engineer, Researcher, and Engineer titles to see if they are actually looking for data related jobs.

After such discussion, have you got enough motivation to learn about machine learning/data science? If you do decide to learn, what kind of programming language should you start with?

1.3 Which language to learn.

Let us look at which tools data scientists, data analysts and machine learning engineers recommend learning first (LanguageRecommendationSelect).

kaggle_data_science%>%filter(!LanguageRecommendationSelect=="")%>%filter(CurrentJobTitleSelect %in% c("Machine Learning Engineer", "Data Analyst","Data Scientist"))%>%ggplot()+geom_bar(aes(x=LanguageRecommendationSelect,fill=CurrentJobTitleSelect))

Let us look at which tools those titled Scientist/Researcher, Researcher or Business Analyst recommend learning first (LanguageRecommendationSelect).

kaggle_data_science%>%filter(!LanguageRecommendationSelect=="")%>%filter(CurrentJobTitleSelect %in% c("Scientist/Researcher", "Researcher","Business Analyst") )%>%ggplot()+geom_bar(aes(x=LanguageRecommendationSelect,fill=CurrentJobTitleSelect))

To make the employer types easier to plot, I collapse CurrentEmployerType into a shorter label, CurrentEmployerType2.

# CurrentEmployerType can list several employers; when more than one pattern matches, the last assignment wins.
kaggle_data_science[ grepl(pattern="Employed by a company that doesn't perform advanced analytics",x=CurrentEmployerType),CurrentEmployerType2:= "non_adv_ana" ]
kaggle_data_science[ grepl(pattern="Employed by a company that performs advanced analytics",x=CurrentEmployerType),CurrentEmployerType2:= "adv_ana"]
kaggle_data_science[ grepl(pattern="Employed by college or university",x=CurrentEmployerType),CurrentEmployerType2:= "school" ]
kaggle_data_science[ grepl(pattern="Employed by government",x=CurrentEmployerType),CurrentEmployerType2:= "gov" ]
kaggle_data_science[ grepl(pattern="Employed by company that makes advanced analytic software",x=CurrentEmployerType),CurrentEmployerType2:= "software" ]
kaggle_data_science[grepl(pattern="Employed by non-profit or NGO",x=CurrentEmployerType) ,CurrentEmployerType2:= "NGO" ]
kaggle_data_science[grepl(pattern="Employed by professional services/consulting firm",x=CurrentEmployerType) ,CurrentEmployerType2:= "consulting" ]
# No trailing space in the label, so it prints cleanly in the plot legend.
kaggle_data_science[ grepl(pattern="Self-employed",x=CurrentEmployerType),CurrentEmployerType2:="Self-employed" ]
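The same recoding can be written as a lookup table applied in a loop, which keeps the rule order visible in one place. This is only a sketch in base R on toy answers. The helper recode_employer and the shortened rule list are hypothetical, and the patterns simply mirror a few of those used above.

```r
# Pattern -> short label, in the order the rules should be applied.
employer_rules <- c(
  "Employed by a company that doesn't perform advanced analytics" = "non_adv_ana",
  "Employed by a company that performs advanced analytics"        = "adv_ana",
  "Employed by college or university"                             = "school",
  "Employed by professional services/consulting firm"             = "consulting",
  "Self-employed"                                                 = "Self-employed"
)

recode_employer <- function(x, rules) {
  out <- rep(NA_character_, length(x))
  for (pattern in names(rules)) {
    # fixed = TRUE treats the pattern as literal text rather than a regular expression.
    hit <- grepl(pattern, x, fixed = TRUE)
    out[hit] <- rules[[pattern]]  # later rules overwrite earlier ones
  }
  out
}

# Toy answers, including one respondent who selected two employer types.
answers <- c(
  "Employed by college or university",
  "Employed by a company that performs advanced analytics,Employed by professional services/consulting firm",
  "Self-employed",
  ""
)

recoded <- recode_employer(answers, employer_rules)
print(recoded)
```

The second toy answer ends up as "consulting" because that rule is applied last, which illustrates how multi-select answers are resolved by rule order.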

1.3.1 Does the Python/R preference change by company type?

For those titled “Machine Learning Engineer”, “Data Analyst”,“Data Scientist”

kaggle_data_science%>%filter(!LanguageRecommendationSelect=="")%>%filter(!CurrentEmployerType=="")%>%filter(CurrentJobTitleSelect %in% c("Machine Learning Engineer", "Data Analyst","Data Scientist"))%>%ggplot()+geom_bar(aes(x=LanguageRecommendationSelect,fill=CurrentEmployerType2))

For those titled “Scientist/Researcher”, “Researcher”,“Business Analyst”

kaggle_data_science%>%filter(!LanguageRecommendationSelect=="")%>%filter(!CurrentEmployerType=="")%>%filter(CurrentJobTitleSelect %in% c("Scientist/Researcher", "Researcher","Business Analyst"))%>%ggplot()+geom_bar(aes(x=LanguageRecommendationSelect,fill=CurrentEmployerType2))

Python is the clear winner as a first language, with R second and SQL third.
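Stacked counts make it hard to compare employer types of very different sizes, so shares within each employer type can be a useful complement. Below is a minimal sketch in base R using invented rows. The toy data frame and its column names language and employer are assumptions made for illustration.

```r
# Hypothetical toy data standing in for LanguageRecommendationSelect and CurrentEmployerType2.
toy <- data.frame(
  language = c("Python", "Python", "R", "Python", "R", "R", "SQL", "Python"),
  employer = c("adv_ana", "adv_ana", "adv_ana", "school", "school", "school",
               "school", "consulting")
)

# Cross-tabulate, then convert each employer column to shares that sum to 1.
counts <- table(toy$language, toy$employer)
shares <- prop.table(counts, margin = 2)

print(round(shares, 2))
```

With margin = 2 each column is normalized on its own, so a small employer group is read on the same scale as a large one.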

1.3.2 Do different job titles require different skill sets?

Let us look at what those data scientists, data analysts and machine learning engineers say about the tools they use. (WorkToolsSelect)

kaggle_data_science%>%filter(CurrentJobTitleSelect %in% c("Machine Learning Engineer", "Data Analyst","Data Scientist"))%>%filter(!WorkToolsSelect=="")%>%group_by(WorkToolsSelect)%>%summarise(n=n())%>%arrange(desc(n))%>%head(10)

Python, R, SQL and TensorFlow dominate the work tools of data scientists, data analysts and machine learning engineers.
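Grouping by the raw WorkToolsSelect answer counts combinations of tools rather than individual tools. If the answers are comma-separated lists, which is an assumption about the file format here, a per-tool count can be sketched in base R as follows. The toy vector work_tools is made up.

```r
# Toy multi-select answers; the comma separator is an assumption about the file format.
work_tools <- c("Python,R,SQL", "Python,TensorFlow", "R", "", "Python,SQL")

# Drop empty answers, split on commas, and trim stray whitespace.
answered  <- work_tools[work_tools != ""]
tool_list <- strsplit(answered, ",", fixed = TRUE)
tools     <- trimws(unlist(tool_list))

# Count how many times each individual tool was mentioned.
tool_counts <- sort(table(tools), decreasing = TRUE)

print(tool_counts)
```

Counting this way answers "how many people use Python at all", whereas the grouped version answers "which exact tool combination is most common".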

How about those titled “Scientist/Researcher”, “Researcher”,“Business Analyst”?

kaggle_data_science%>%filter(CurrentJobTitleSelect %in% c("Scientist/Researcher", "Researcher","Business Analyst"))%>%filter(!WorkToolsSelect=="")%>%group_by(WorkToolsSelect)%>%summarise(n=n())%>%arrange(desc(n))%>%head(10)

Python and R still dominate the battlefield, but with more variety, such as MATLAB and C/C++.

kaggle_data_science%>%filter(CurrentJobTitleSelect %in% c("Scientist/Researcher", "Researcher","Business Analyst"))%>%filter(grepl("MATLAB/Octave",WorkToolsSelect))%>%group_by(CurrentEmployerType2)%>%summarise(n=n())%>%arrange(desc(n))

This confirms that most of the MATLAB users are people employed by schools.

kaggle_data_science%>%filter(CurrentJobTitleSelect %in% c("Scientist/Researcher", "Researcher","Business Analyst"))%>%filter(grepl("C/C++",WorkToolsSelect,fixed=TRUE))%>%group_by(CurrentEmployerType2)%>%summarise(n=n())%>%arrange(desc(n))

C/C++ is also used mostly by people employed by schools.
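When searching for a literal tool name such as C/C++, it helps to remember that grepl() treats its pattern as a regular expression by default, where + is a quantifier. The short sketch below compares a regex pattern like "C/+" with a literal match in base R. The third string is a hypothetical entry invented to show an unintended match and is not taken from the survey.

```r
tools <- c("Python,C/C++,R", "MATLAB/Octave", "Objective-C/Swift (hypothetical entry)")

# As a regular expression, "C/+" means "C followed by one or more slashes".
regex_hit <- grepl("C/+", tools)

# With fixed = TRUE the pattern is matched as literal text.
literal_hit <- grepl("C/C++", tools, fixed = TRUE)

print(data.frame(tools, regex_hit, literal_hit))
```

The regex version also matches the hypothetical third entry, while the literal version only matches answers that actually contain C/C++.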

Even after considering what share of those titled “Scientist/Researcher”, “Researcher” or “Business Analyst” are employed by schools, the above conclusion still holds.

kaggle_data_science%>%filter(CurrentJobTitleSelect %in% c("Scientist/Researcher", "Researcher","Business Analyst"))%>%group_by(CurrentEmployerType2)%>%summarise(n=n())%>%arrange(desc(n))
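The check described above amounts to dividing the number of users in each employer type by the size of that employer type. A rough sketch in base R on invented rows is shown here. The toy data frame and its columns employer and tools are assumptions, and the numbers say nothing about the real survey.

```r
# Hypothetical toy data: employer label and the tools answer of each respondent.
toy <- data.frame(
  employer = c("school", "school", "school", "school", "adv_ana", "adv_ana", "gov"),
  tools    = c("MATLAB/Octave,Python", "MATLAB/Octave", "R", "Python",
               "Python,R", "MATLAB/Octave,R", "R")
)

uses_matlab <- grepl("MATLAB/Octave", toy$tools, fixed = TRUE)

# Users per employer type divided by all respondents of that employer type.
users_per_type <- tapply(uses_matlab, toy$employer, sum)
total_per_type <- tapply(uses_matlab, toy$employer, length)
matlab_rate    <- users_per_type / total_per_type

print(round(matlab_rate, 2))
```

In this toy example schools have the most MATLAB users in absolute terms, yet the rate is no higher than in the adv_ana group, which is the kind of situation the normalization is meant to reveal.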

Where do those titled “Machine Learning Engineer”, “Data Analyst”,“Data Scientist” work?

kaggle_data_science%>%filter(CurrentJobTitleSelect %in% c("Machine Learning Engineer", "Data Analyst","Data Scientist"))%>%group_by(CurrentEmployerType2)%>%summarise(n=n())%>%arrange(desc(n))

Most of them are employed by companies that perform advanced analytics or by consulting firms.

To understand why SQL is important, let us look at the data type in the daily work of kagglers.

kaggle_data_science%>%filter(!WorkDataTypeSelect=="")%>%group_by(WorkDataTypeSelect)%>%summarise(n=n())%>%arrange(desc(n))%>%head(10)

Relational data dominates the battlefield, with text data a close second. Image and video data also appear but are less common.

That is why SQL is so important: it is used to handle relational data. It is also a reason why TensorFlow is important, since you can use it to handle image data and sometimes text data.

TL;DR

Python plus R is the most common combination, but beginners are best off starting with Python. SQL and TensorFlow are also very important skills to master. Many people employed by schools use MATLAB and C/C++, but that is not the case in industry.

1.3.3 What do kagglers want to learn next year?

kaggle_data_science%>%filter(!MLToolNextYearSelect=="")%>%group_by(MLToolNextYearSelect)%>%summarise(n=n())%>%arrange(desc(n))%>%head(10)

TensorFlow is the clear winner among technologies to learn next year.

For those titled “Machine Learning Engineer”, “Data Analyst” or “Data Scientist”, what do they want to learn next year?

kaggle_data_science%>%filter(CurrentJobTitleSelect %in% c("Machine Learning Engineer", "Data Analyst","Data Scientist"))%>%filter(!MLToolNextYearSelect=="")%>%group_by(MLToolNextYearSelect)%>%summarise(n=n())%>%arrange(desc(n))%>%head(10)

For those titled “Scientist/Researcher”, “Researcher” or “Business Analyst”, what do they want to learn next year?

kaggle_data_science%>%filter(CurrentJobTitleSelect %in% c("Scientist/Researcher", "Researcher","Business Analyst"))%>%filter(!MLToolNextYearSelect=="")%>%group_by(MLToolNextYearSelect)%>%summarise(n=n())%>%arrange(desc(n))%>%head(10)

It seems that those titled “Machine Learning Engineer”, “Data Analyst” or “Data Scientist” are more interested in Spark/MLlib than in R, while those titled “Scientist/Researcher”, “Researcher” or “Business Analyst” are more interested in R, or in nothing at all, for the next year.

1.4 How about machine learning tools?

Which data science methods are used most?

kaggle_data_science%>%filter(!WorkMethodsSelect=="")%>%group_by(WorkMethodsSelect)%>%summarise(n=n())%>%arrange(desc(n))%>%head(10)

Data visualization is the most widely used skill. Logistic regression and time series analysis are also important.

What do those titled “Machine Learning Engineer”, “Data Analyst”,“Data Scientist” use?

kaggle_data_science%>%filter(CurrentJobTitleSelect %in% c("Machine Learning Engineer", "Data Analyst","Data Scientist"))%>%filter(!WorkMethodsSelect=="")%>%group_by(WorkMethodsSelect)%>%summarise(n=n())%>%arrange(desc(n))%>%head(10)

The results are about the same.

What do those titled “Machine Learning Engineer”, “Data Analyst” or “Data Scientist” feel competent in?

kaggle_data_science%>%filter(CurrentJobTitleSelect %in% c("Machine Learning Engineer", "Data Analyst","Data Scientist"))%>%filter(!MLSkillsSelect=="")%>%group_by(MLSkillsSelect)%>%summarise(n=n())%>%arrange(desc(n))%>%head(10)

Apart from the basic supervised learning, unsupervised learning, and Time series, Natural Language Processing, Outlier detection (e.g. Fraud detection) and Computer Vision are the important skills to master.

What do those titled “Scientist/Researcher”, “Researcher” or “Business Analyst” feel competent in?

kaggle_data_science%>%filter(CurrentJobTitleSelect %in% c("Scientist/Researcher", "Researcher","Business Analyst"))%>%filter(!MLSkillsSelect=="")%>%group_by(MLSkillsSelect)%>%summarise(n=n())%>%arrange(desc(n))%>%head(10)

They are competent in about the same things with more emphasis on computer vision and time series.

1.5 Where do you learn those skills?

By now, you have a better understanding of the languages and machine learning skills to master.

kaggle_data_science%>%filter(!LearningPlatformSelect=="")%>%group_by(LearningPlatformSelect)%>%summarise(n=n())%>%arrange(desc(n))%>%head(10)

The result of this survey is certainly biased towards Kaggle, since it is a Kaggle survey. Still, you can see that online learning, combined with some personal projects, is the way to go.

From a personal perspective, Kaggle is a good playground for honing and testing the skills you have learned, and it is more beginner-friendly than Stack Overflow. Plus, you can find a lot of datasets here for practice, competitions and personal projects.

How do those titled “Machine Learning Engineer”, “Data Analyst”,“Data Scientist” learn about data science?

kaggle_data_science%>%filter(CurrentJobTitleSelect %in% c("Machine Learning Engineer", "Data Analyst","Data Scientist"))%>%filter(!LearningDataScienceTime=="")%>%group_by(LearningDataScienceTime)%>%summarise(n=n())%>%arrange(desc(n))%>%head(10)

Most people have been learning data science for less than two years. This is a rather new field to explore.
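A statement such as "less than two years" is easiest to check with a cumulative share over the duration categories in their natural order. The sketch below does this in base R. The category labels and the toy answers are hypothetical, and the real LearningDataScienceTime categories may be worded differently.

```r
# Hypothetical answer labels, listed from shortest to longest duration.
levels_in_order <- c("< 1 year", "1-2 years", "3-5 years", "5-10 years")
answers <- c("< 1 year", "1-2 years", "1-2 years", "< 1 year", "3-5 years",
             "1-2 years", "5-10 years", "< 1 year")

# An ordered factor keeps the categories in duration order instead of alphabetical order.
learning_time <- factor(answers, levels = levels_in_order, ordered = TRUE)

# Share of respondents in each category and the running total.
share     <- prop.table(table(learning_time))
cum_share <- cumsum(share)

print(data.frame(
  category  = names(share),
  share     = round(as.vector(share), 3),
  cum_share = round(as.vector(cum_share), 3)
))
```

Reading the running total at the "1-2 years" row gives the share of respondents who have been learning for at most two years.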

There are more questions to explore on this topic. Where should you learn those skills? Do you need a degree in computer science? Is a PhD really necessary? Do you need a master's degree in data science? Does your undergraduate degree matter?

1.6 Where do you find those jobs?

kaggle_data_science%>%filter(!EmployerSearchMethod=="")%>%group_by(EmployerSearchMethod)%>%summarise(n=n())%>%arrange(desc(n))%>%head(10)

Most jobs are found through recommendations from family, friends or former colleagues. This is surprising. After learning for a while, it would be a great time to get out and get social.