As part of our project, we are tasked to answer the question “What are the most valued data science skills?” by working as a team, deciding what data to collect and how to collect it, use relational database and set of normalized tables and data exploration and analysis. Our team members are as follows;
Anil Akyildirim
Nicholas Chung
Jai Jeffryes
Tamiko Jenkins
Joe Rovalino
Sie Siong Wong
As part of project management tools, we have used Slack Private channel and Skype for Project Communication, Github for Project tracking, documentation and code collaboration, and Amazon Relational Database Service for data integration. All of our supporting code and data are on the GitHub repo, which documents branches and commits from our team.
We have reviewed and discussed different data types such as current job requirements around data scientists from job postings such as indeed.com or monster.com and articles around top data scientists skills in websites such as towardsdatascience and knuggets. Our approach built on the assumption that data scientists with jobs have the skills most valued by employers. We collected skills from employed data scientists.
we were inspired by the research of Jeff Hale whose article on data science skills appeared on the website, Medium.
We discussed different methods of collecting the data and further how we can store it. As a result, we decided to work with usefull data within linkedin.com. We compared our findings from LinkedIn data to Mr. Hale’s 2018 findings.
# load all JSON
filenames <- list.files("data/profiles", pattern="*.json", full.names=TRUE) # this should give you a character vector, with each file name represented by an entry
example_file <- lapply(filenames[1], function(x) jsonlite::fromJSON(txt = x)) # a list in which each element is one of your original JSON files
example_file## [[1]]
## [[1]]$profileAlternative
## [[1]]$profileAlternative$name
## [1] "Aditi Sharma"
##
## [[1]]$profileAlternative$headline
## [1] "Data Scientist at Square"
##
## [[1]]$profileAlternative$location
## [1] "San Francisco Bay Area"
##
## [[1]]$profileAlternative$connections
## [1] "500+"
##
## [[1]]$profileAlternative$summary
## [1] "Data Scientist with 3+ years of experience leveraging data to solve business problems in a results-driven environment. I thrive in a challenging environment and desire to continuously improve.\n\nCurrently working on recommender systems and pricing prediction using NLP/deep learning/RNN-LSTM."
##
##
## [[1]]$aboutAlternative
## [[1]]$aboutAlternative$text
## [1] "Data Scientist with 3+ years of experience leveraging data to solve business problems in a results-driven environment. I thrive in a challenging environment and desire to continuously improve.\n\nCurrently working on recommender systems and pricing prediction using NLP/deep learning/RNN-LSTM."
##
##
## [[1]]$positions
## list()
##
## [[1]]$educations
## title degree
## 1 University of San Francisco Master of Science - MS
## 2 National Institute of Technology Warangal Bachelor of Technology (BTech)
## date1 date2
## 1 2018 2019
## 2 2011 2015
##
## [[1]]$skills
## title count
## 1 Python 11
## 2 R 9
## 3 C++ 7
## 4 Data Structures 5
## 5 Statistics 2
## 6 Machine Learning 2
## 7 Research 1
## 8 Optimization 1
## 9 Data Analysis 1
## 10 Distributed Systems 1
## 11 Algorithms <NA>
## 12 Artificial Intelligence (AI) <NA>
## 13 Computer Vision <NA>
## 14 Financial Analysis <NA>
## 15 Regression Testing <NA>
## 16 Neuro-Linguistic Programming (NLP) <NA>
## 17 SQL 2
## 18 python 1
## 19 Amazon Web Services (AWS) 1
## 20 MongoDB 1
## 21 selenium <NA>
## 22 Matlab <NA>
## 23 PyTorch <NA>
## 24 pytorch <NA>
## 25 AWS <NA>
## 26 PySpark <NA>
## 27 NoSQL <NA>
## 28 MySQL <NA>
## 29 Deep Learning 2
## 30 Recommender Systems 1
## 31 Predictive Modeling 1
## 32 Linear Regression 1
## 33 Data Acquisition <NA>
## 34 Exploratory Data Analysis <NA>
## 35 Scikit-Learn <NA>
## 36 Data Analytics <NA>
## 37 Chatbot Development <NA>
## 38 Random Forest <NA>
## 39 CPLEX <NA>
## 40 Vehicle Routing <NA>
## 41 EDA <NA>
## 42 A/B Testing <NA>
## 43 NLP <NA>
## 44 Natural Language Processing (NLP) <NA>
##
## [[1]]$recommendations
## [[1]]$recommendations$givenCount
## [1] "0"
##
## [[1]]$recommendations$receivedCount
## [1] "0"
##
## [[1]]$recommendations$given
## list()
##
## [[1]]$recommendations$received
## list()
##
##
## [[1]]$accomplishments
## count
## 1 1
## items
## 1 Scalable Motor Movement Recognition from Electroencephalography using Machine Learning
##
## [[1]]$peopleAlsoViewed
## list()
##
## [[1]]$volunteerExperience
## list()
##
## [[1]]$profile
## [[1]]$profile$name
## [1] "Aditi Sharma"
##
## [[1]]$profile$headline
## [1] "Data Scientist at Square"
##
## [[1]]$profile$location
## [1] "San Francisco Bay Area"
##
## [[1]]$profile$connections
## [1] "500+"
##
## [[1]]$profile$summary
## [1] "Data Scientist with 3+ years of experience leveraging data to solve business problems in a results-driven environment. I thrive in a challenging environment and desire to continuously improve.\n\nCurrently working on recommender systems and pricing prediction using NLP/deep learning/RNN-LSTM."
# apply fromJSON to read in all of the json files
# create the column (variable) title, headline, which will be populated with json file identifying information
# extract the skills data which contains the variables title and counts
# bind the results together as a data frame named r_df
r_df <- dplyr::bind_rows(sapply(filenames, function(x) fromJSON(x, flatten=TRUE)$skills), .id="headline")df <- r_df
# TODO: conform names to db
# TODO: Fix naming call
# TODO: find out which naming convention we're using
# display data frame r_df
head(df)## headline title count
## 1 Data Scientist at Square Python 11
## 2 Data Scientist at Square R 9
## 3 Data Scientist at Square C++ 7
## 4 Data Scientist at Square Data Structures 5
## 5 Data Scientist at Square Statistics 2
## 6 Data Scientist at Square Machine Learning 2
# name r_df
names(df) <- c("title", "skills", "count")
#names(v) <- c("headline", "skills", "linkedin")
class(df)## [1] "data.frame"
## title skills count
## "character" "character" "character"
## title skills count
## "character" "character" "numeric"
## [1] 4822
## title skills count
## 11 Data Scientist at Square Algorithms NA
## 12 Data Scientist at Square Artificial Intelligence (AI) NA
## 13 Data Scientist at Square Computer Vision NA
## 14 Data Scientist at Square Financial Analysis NA
## 15 Data Scientist at Square Regression Testing NA
## 16 Data Scientist at Square Neuro-Linguistic Programming (NLP) NA
## 17 Data Scientist at Square SQL 2
## 18 Data Scientist at Square python 1
## 19 Data Scientist at Square Amazon Web Services (AWS) 1
## 20 Data Scientist at Square MongoDB 1
# filter for any rows with na
# count all rows with na's
# 942
# view a subset of rows with only na's
df_na <- df %>% filter_all(any_vars(is.na(.)))
nrow(df_na)## [1] 942
## title skills count
## 1 Data Scientist at Square Python 11
## 2 Data Scientist at Square R 9
## 3 Data Scientist at Square C++ 7
## 4 Data Scientist at Square Data Structures 5
## 5 Data Scientist at Square Statistics 2
## 6 Data Scientist at Square Machine Learning 2
## 7 Data Scientist at Square Research 1
## 8 Data Scientist at Square Optimization 1
## 9 Data Scientist at Square Data Analysis 1
## 10 Data Scientist at Square Distributed Systems 1
## 11 Data Scientist at Square Algorithms NA
## 12 Data Scientist at Square Artificial Intelligence (AI) NA
## 13 Data Scientist at Square Computer Vision NA
## 14 Data Scientist at Square Financial Analysis NA
## 15 Data Scientist at Square Regression Testing NA
## 16 Data Scientist at Square Neuro-Linguistic Programming (NLP) NA
## 17 Data Scientist at Square SQL 2
## 18 Data Scientist at Square python 1
## 19 Data Scientist at Square Amazon Web Services (AWS) 1
## 20 Data Scientist at Square MongoDB 1
# omit any rows with na's
# save rows without na's as a data frame names df
# count the data frame
# 4822-942 = 3880
df <- na.omit(df)
head(df)## title skills count
## 1 Data Scientist at Square Python 11
## 2 Data Scientist at Square R 9
## 3 Data Scientist at Square C++ 7
## 4 Data Scientist at Square Data Structures 5
## 5 Data Scientist at Square Statistics 2
## 6 Data Scientist at Square Machine Learning 2
## [1] 3880
# TODO: 1) I think we just need a names () on the column missing the column name
# Done
# TODO: 2) I have to research how to remove unicode
# suggest
# rs <- dbSendQuery(con, 'SET NAMES utf8')
# ALTER TABLE mytable CONVERT TO CHARACTER SET utf8mb4;
# TODO: 3)remove termination of comma
# TODO: can we try this: texts(df) <- iconv(texts(df, from = "UTF-8", to = "ASCII", sub = "")
# https://cran.r-project.org/web/packages/quanteda/quanteda.pdf
# stripped all the Unicode characters
# stripped all the commas from the fields for skill and title
# Example of dirty data obtained with \d,"[^"]+,
# line 256: "396","Python",31,"Data Scientist at Conde Nast • MS in Data Science, Columbia University • IIIT-H Alumnus • Marathoner"
#x <- c("396","Python",31,"Data Scientist at Conde Nast • MS in Data Science, Columbia University • IIIT-H Alumnus • Marathoner")[4]
#validUTF8(x)
#validUTF8(xx)
#xi <- stringi::stri_enc_toascii(x)
#xi <- iconv(xi, "latin1", "ASCII", sub='')
# Joe
#library("quanteda")
#dfs <-df
#texts(dfs) <- iconv(texts(dfs, from = "UTF-8", to = "ASCII", sub = ""))
#Convert Latin characters to UTF-8
#convert_for_sql <- function(x) {
# Encoding(x) <- "latin1"
# x <- iconv(x, "latin1", "UTF-8", sub='')
# x <- stringr::str_replace(x,",","")
# Encoding(x) <- "UTF-8"
# return(x)
#}
df_f <- df
# Remove non-ASCII character codes
test <- df_f[256,]
df_f$skills <- sapply(df_f$skills, function(x) gsub('[^\x20-\x7E]', '', x))
df_f$title <- sapply(df_f$title, function(x) gsub('[^\x20-\x7E]', '', x))
df_f$skills <- sapply(df_f$skills, function(x) gsub('[@]', 'at', x))
df_f$title <- sapply(df_f$title, function(x) gsub('[@]', 'at', x))
df_f$skills <- sapply(df_f$skills, function(x) gsub('[\\|\\(\\),]', '', x))
df_f$title <- sapply(df_f$title, function(x) gsub('[\\|\\(\\),]', '', x))
Encoding(df_f$skills) <- "UTF-8"
Encoding(df_f$title) <- "UTF-8"
head(df_f)## title skills count
## 1 Data Scientist at Square Python 11
## 2 Data Scientist at Square R 9
## 3 Data Scientist at Square C++ 7
## 4 Data Scientist at Square Data Structures 5
## 5 Data Scientist at Square Statistics 2
## 6 Data Scientist at Square Machine Learning 2
## title
## 397 Data Scientist at Conde Nast • MS in Data Science, Columbia University • IIIT-H Alumnus • Marathoner
## skills count
## 397 MySQL 20
## title
## 397 Data Scientist at Conde Nast MS in Data Science Columbia University IIIT-H Alumnus Marathoner
## skills count
## 397 MySQL 20
# TODO: follow df.csv convention
# Add rownames (indices) as a skill id
# to final dataframe to prepare for
# SQL-based storage and to provide option to
# remove automatic row names from write csv
# Remove depr function
#df_csv <- add_rownames(df, var = "skill_id")
# NB: these are the original row id's based on R records
# to generate skill_ids without skips for na's removed
# use a seq
df_csv <- tibble::rownames_to_column(df_f, var = "skill_id")
df_csv$skill_id <- as.numeric(df_csv$skill_id)
head(df_csv)## skill_id title skills count
## 1 1 Data Scientist at Square Python 11
## 2 2 Data Scientist at Square R 9
## 3 3 Data Scientist at Square C++ 7
## 4 4 Data Scientist at Square Data Structures 5
## 5 5 Data Scientist at Square Statistics 2
## 6 6 Data Scientist at Square Machine Learning 2
# TODO: follow df.csv convention
# Rearrange column order with dplyr select
df_csv <- dplyr::select(df_csv, skill_id, skills, count, title)
head(df_csv)## skill_id skills count title
## 1 1 Python 11 Data Scientist at Square
## 2 2 R 9 Data Scientist at Square
## 3 3 C++ 7 Data Scientist at Square
## 4 4 Data Structures 5 Data Scientist at Square
## 5 5 Statistics 2 Data Scientist at Square
## 6 6 Machine Learning 2 Data Scientist at Square
# load the data in the database and look at 2018 Linkedin Data
user_name <- 'anil'
user_password <- "redy2rok"
database <- 'prj3'
host_name <- 'msds607.ckxhi71v1dqf.us-east-1.rds.amazonaws.com'
#connecting to the MySQL database
myDb <- dbConnect(RMariaDB::MariaDB(), user=user_name, password=user_password, dbname=database, host=host_name)
myDb## <MariaDBConnection>
## Host: msds607.ckxhi71v1dqf.us-east-1.rds.amazonaws.com
## Server: 5.7.22-log
## Client: 5.5.1
## skill_id skills count title
## 1 1 Python 11 Data Scientist at Square
## 2 2 R 9 Data Scientist at Square
## 3 3 C++ 7 Data Scientist at Square
## 4 4 Data Structures 5 Data Scientist at Square
## 5 5 Statistics 2 Data Scientist at Square
## 6 6 Machine Learning 2 Data Scientist at Square
# There are more than 145 skills, clean to data similar to 2018 data
df <- subset(df, select = c(skills, count))
colnames(df) <- c("Skills", "Linkedin")
head(df)## Skills Linkedin
## 1 Python 11
## 2 R 9
## 3 C++ 7
## 4 Data Structures 5
## 5 Statistics 2
## 6 Machine Learning 2
# there are skills that is listed more than once. finding those
n_occur <- data.frame(table(df$Skills))
head(n_occur[n_occur$Freq > 1,])## Var1 Freq
## 4 A/B Testing 3
## 8 Access 4
## 9 Accounting 2
## 11 Actuarial Science 5
## 14 Adobe Photoshop 4
## 15 Advertising 3
# we need to add the count of the duplicate skills rows
df <- aggregate(Linkedin ~ Skills, dat=df, FUN=sum)
head(df)## Skills Linkedin
## 1 .NET 9
## 2 3D Modeling 1
## 3 8051 Assembly 3
## 4 A/B Testing 4
## 5 Abaqus 11
## 6 Ableton Live 1
## Skills Linkedin
## Length:929 Min. : 1
## Class :character 1st Qu.: 2
## Mode :character Median : 5
## Mean : 33
## 3rd Qu.: 17
## Max. :2196
## 'data.frame': 929 obs. of 2 variables:
## $ Skills : chr ".NET" "3D Modeling" "8051 Assembly" "A/B Testing" ...
## $ Linkedin:integer64 9 1 3 4 11 1 1 23 ...
The first exploratory pass is crowded. We’ll filter the data in the next pass.
#We have 1157 observations (skills) that data science roles use in linkedin
#let's see the distribution
theme_set(theme_classic())
ggplot(df, aes(x=Skills, y=Linkedin))+
geom_bar(stat="identity", width = 0.5, fill=("tomato2"))+
theme(axis.text.x = element_text(angle = 65, vjust=0.6))# we have way too many skills so let's only focus on the ones that has significant count.
df <- filter(df, Linkedin >100)
head(df)## Skills Linkedin
## 1 Algorithms 317
## 2 Analysis 148
## 3 Analytics 560
## 4 Big Data 214
## 5 Bioinformatics 106
## 6 Bloomberg 107
# we narrowed it down to 57 skills. Let's see how distribution looks like.
theme_set(theme_classic())
ggplot(df, aes(x=Skills, y=Linkedin))+
geom_bar(stat="identity", width = 0.5, fill=("tomato2"))+
theme(axis.text.x = element_text(angle = 65, vjust=0.6))## Skills Linkedin
## 1 Algorithms 317
## 2 Analytics 560
## 3 Big Data 214
## 4 Business Analysis 207
## 5 C 456
## 6 C++ 498
theme_set(theme_classic())
ggplot(df, aes(x=reorder(Skills, Linkedin, fun=max), y=Linkedin))+
geom_bar(stat="identity", width = 0.5, fill=("tomato2"))+
labs(title="2019 Data Science Skills Distribution",
x="Data Science Skills",
y="Count in Linkedin")+
theme(axis.text.x = element_text(angle = 65, vjust=0.6))The Data Science skills Distribution chart for 2019 shows us the most frequent data science skills that people use for their Linkedin Profiles. The results show us that, Data Analysis; as part of General Data Skills, is the most commonly used skill within Data Scientists in Linkedin. The top three programming languages used within the profiles are R, Python and SQL. Statistics and Machine Learning are in 5th and 6th place in that order. If we consider Machine Learning and Statistics, as part of General Data Science Skills and Programming Languages as part of Technical Data Science Skills, we can conclude that
** Top three General Data Science Skills are Data Analysis, Statistics and Machine Learning.**
** Top three Technical Data Science Skills are R, Python and SQL.
# count for top three General and Technical Data Science Skills
data_analysis <- filter(df, df$Skills=="Data Analysis")
machine_learning <- filter(df, df$Skills=="Machine Learning")
statistics <- filter(df, df$Skills=="Statistics")
python <- filter(df, df$Skills=="Python")
r <- filter(df, df$Skills=="R")
sql <- filter(df, df$Skills=="SQL")
data_analysis## Skills Linkedin
## 1 Data Analysis 2196
## Skills Linkedin
## 1 Machine Learning 979
## Skills Linkedin
## 1 Statistics 1036
## Skills Linkedin
## 1 Python 1539
## Skills Linkedin
## 1 R 1864
## Skills Linkedin
## 1 SQL 1237
We can also look at the Linkedin Data Set from Jeff Hale and see if they follow the same pattern.
# lets load 2018 Linkedin Data from Jeff Hale.
skills_2018 <- dbGetQuery(myDb, "select * from ds_general_skills_clean")
skills_2018$LinkedIn <- as.numeric(skills_2018$LinkedIn) # little cleanup
head(skills_2018)## Keyword LinkedIn Indeed SimplyHired Monster
## 1 machine learning 5701 3439 2561 2340
## 2 analysis 5168 3500 2668 3306
## 3 statistics 4893 2992 2308 2399
## 4 computer science 4517 2739 2093 1900
## 5 communication 3404 2344 1791 2053
## 6 mathematics 2605 1961 1497 1815
# analyze briefly to see if there are differences
theme_set(theme_classic())
ggplot(skills_2018, aes(x=reorder(Keyword, LinkedIn, fun=max),y=LinkedIn))+
geom_bar(stat="identity", width = 0.5, fill=("tomato2"))+
labs(title="2018 Data Science Skills Distribution",
x="Data Science Skills",
y="Count in Linkedin",
caption = "Source: Jeff Hale 2018 Data Skills Analysis")+
theme(axis.text.x = element_text(angle = 65, vjust=0.6))With the assumption of computer science covering the Programming Lanaguages, we can see that the data science skills distribution for 2018 is similar to our Data Science Skills Distribution for 2018. Machine Learning , Data Analysis and Statistics leading the top General Data Science Skills. The only difference we see is that Machine Learning is slightly more used Data Science Skill than Data Analysis.
The six data science skills most valued by employers in 2019 appear to be the following.
General Data Science Skills:
1- Data Analysis => 2196
2- Machine Learning => 979
3- Statistics => 1036
Technical Data Science Skills
1- R => 1864
2- Python => 1539
3- SQL => 1237
Our approach differed from Mr. Hale’s. He investigated programming languages as a separate research question. Our approach commingles them. Therefore, though our high-ranking skills list includes the languages R, Python, and SQL, nothing is to be concluded from their absence from Hale’s list. What we see in common are the skills of analysis, statistics, and machine learning. We believe the data tell a compelling story about investment in these disciplines.