Shutterstock.com
Now that there are correlation scores for the data jobs and their skillsets (both hard and soft), let’s pull the data from MySQL, tidy up the data and perform some data analysis.
Load libraries.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(RMySQL)
## Loading required package: DBI
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(stringr)
Let’s create a connection with the MySQL online database, and download the files into R.
# Creating a connection between R and MySQL
con <- dbConnect(RMySQL::MySQL(), dbname = "DataSkills", user = "admin", password = "CUNYsps2017", host = "cuny-607-project-db.ce9yg7qxcc7l.us-east-2.rds.amazonaws.com")
dbListTables(con)
## [1] "Correlation Scores" "Correlation Scores 2" "DataSkillset"
## [4] "Scores" "indeedjob"
There are three tables in the database “DataSkills”. The tables, “DataSkillset” and “indeedjob” were both scraped data from Monster, Kaggle, Kdnuggets (DataSkills) and Indeed (indeedjob). This information was then taken and used in Microsoft Cognitive Services. The job titles/descriptions from indeedjob
were then correlated with the data skills sets from DataSkillset
to see if there was any negative or positive correlation. A (-1) score means a complete negative correlation where as a (+1) score means a complete positive correlation. The score can range from -1 to +1.
Let’s pull up the DataSkillset
and Correlation Scores
table.
# Using queries to download the data sets
correlate.query <- "SELECT * FROM `Correlation Scores`;"
correlate.table <- dbGetQuery(con, correlate.query)
dataskill.query <- "SELECT * FROM DataSkillset;"
dataskill <- dbGetQuery(con, dataskill.query)
# Show the first 10 listings in the data table.
head(correlate.table, 10)
## ID job skill score
## 1 1 data scientist, business strategy - seattle, wa python 0.14926
## 2 2 data integration and reporting analyst agile 0.34817
## 3 3 data scientist, business strategy - seattle, wa leadership 0.40843
## 4 4 data integration and reporting analyst ruby -0.09936
## 5 5 data integration and reporting analyst statistics 0.43853
## 6 6 data scientist, business strategy - seattle, wa ruby 0.04905
## 7 7 data integration and reporting analyst performance 0.34586
## 8 8 data analyst/reporting writer group 0.08052
## 9 9 data integration and reporting analyst modeling 0.31776
## 10 10 data integration and reporting analyst matlab 0.15621
paste("There are", length(correlate.table$ID), "rows in this table.")
## [1] "There are 4372 rows in this table."
In this data, this data was scraped to see which soft and/or hard skills came up most frequently during my search for job listings in Monster.com, Kaggle.com, and Kdnuggets.com. What it does is tally up how many times each skill set was mentioned during the scrape.
The data is nearly tidy. However, what I like to do is tidy the data up even more, so that way there are no duplicates in the data set.
dataskill1 <- dataskill %>% group_by(d) %>% summarise(Count = sum(Sum)) %>% arrange(desc(Count))
colnames(dataskill1) <- c("Skill", "Count")
head(dataskill1, 10)
## # A tibble: 10 × 2
## Skill Count
## <chr> <dbl>
## 1 experience 632
## 2 business 292
## 3 team 232
## 4 analytical 124
## 5 skills 89
## 6 group 60
## 7 python 56
## 8 r 56
## 9 understanding 53
## 10 people 49
Both soft skills and hard skills are very important to the data scientist. However, we are going to take a closer look at both soft and hard skills. We will subset the data and then analyze the data.
soft <- c("experience", "business", "team", "analytical", "skills", "group", "understanding", "people", "communication skills", "teams", "solving", "leadership", "interpersonal", "problem solving", "ideas")
hard <- c("python", "r", "sql", "algorithms", "hadoop", "computer science", "bigdata", "statistics", "datamining", "java", "spark", "agile", "mathematics", "apache", "hive", "matlab", "oracle", "ruby", "sas", "scala", "nosql")
# Initially, I had tried to subset the data with the soft vector in the dplyr package, but was returning no data.
# However, I had realized that, for some reason, there was a space in front of every word in dataskill1$Skill.
# This next step, by using regular expressions, will remove the space.
skills <- dataskill1$Skill
skills <- gsub("\\s", "", skills)
dataskill1$Skill <- skills
# Now, we can subset the soft and hard skill sets
soft.skills <- dataskill1 %>% filter(Skill %in% soft)
hard.skills <- dataskill1 %>% filter(Skill %in% hard)
soft.skills
## # A tibble: 14 × 2
## Skill Count
## <chr> <dbl>
## 1 experience 632
## 2 business 292
## 3 team 232
## 4 analytical 124
## 5 skills 89
## 6 group 60
## 7 understanding 53
## 8 people 49
## 9 teams 28
## 10 solving 20
## 11 leadership 16
## 12 experience 4
## 13 interpersonal 4
## 14 ideas 2
hard.skills
## # A tibble: 20 × 2
## Skill Count
## <chr> <dbl>
## 1 python 56
## 2 r 56
## 3 sql 48
## 4 algorithms 36
## 5 hadoop 36
## 6 bigdata 20
## 7 statistics 17
## 8 datamining 16
## 9 java 16
## 10 spark 13
## 11 agile 12
## 12 mathematics 12
## 13 apache 4
## 14 hive 4
## 15 matlab 4
## 16 oracle 4
## 17 ruby 4
## 18 sas 4
## 19 scala 4
## 20 nosql 1
ss <- ggplot(soft.skills, aes(x=Skill, y=Count)) + geom_point(aes(color=Skill, size=Count)) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(main = "Soft Skills")
ss
hs <- ggplot(hard.skills, aes(x=Skill, y=Count)) + geom_point(aes(color=Skill, size=Count)) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(main = "Hard Skills")
hs
Looking at the data, it appears that experience is by far one of the most important qualifications that an employer is looking for. Regarding hard skill sets, R and Python remain very popular.
At first glance, the data looks like there is a lot of interesting information. (For example, there’s a very slight negative correlation with Ruby and “Data integration and reporting analyst”.) However, to make more sense of this data, let’s start by tidying up the data.
# Because there is so much variation in the 'jobs' column, we will group by 'skill'.
correlate.skills <- correlate.table %>% group_by(skill) %>% arrange(skill)
head(correlate.skills, 10)
## Source: local data frame [10 x 4]
## Groups: skill [1]
##
## ID job skill score
## <int> <chr> <chr> <dbl>
## 1 2 data integration and reporting analyst agile 0.34817
## 2 18 data scientist, business strategy - seattle, wa agile 0.27510
## 3 70 data reporting analyst agile 0.24886
## 4 93 analytics & insights coordinator agile 0.37262
## 5 119 statistical analyst - data analytics and reporting agile 0.33180
## 6 132 data analyst/reporting writer agile 0.23017
## 7 160 data analyst, performance, analysis & reporting agile 0.27345
## 8 256 data and reporting analyst-magellan mrx agile 0.07033
## 9 282 aco reporting & data analyst agile 0.31328
## 10 311 data scientist - staff agile 0.24106
Now as you see, all the jobs are now listed together, grouped by skill. We notice for agile, that there tends to be an overall positive correlation, meaning that the computer software agile was likely to be used for data integration, reporting analyst, statistical analyst, etc.
Let’s take this data frame and isolate just the agile group.
correlate.agile <- correlate.table %>% group_by(skill) %>% arrange(skill) %>% filter(skill == "agile")
correlate.agile
## Source: local data frame [111 x 4]
## Groups: skill [1]
##
## ID job skill score
## <int> <chr> <chr> <dbl>
## 1 2 data integration and reporting analyst agile 0.34817
## 2 18 data scientist, business strategy - seattle, wa agile 0.27510
## 3 70 data reporting analyst agile 0.24886
## 4 93 analytics & insights coordinator agile 0.37262
## 5 119 statistical analyst - data analytics and reporting agile 0.33180
## 6 132 data analyst/reporting writer agile 0.23017
## 7 160 data analyst, performance, analysis & reporting agile 0.27345
## 8 256 data and reporting analyst-magellan mrx agile 0.07033
## 9 282 aco reporting & data analyst agile 0.31328
## 10 311 data scientist - staff agile 0.24106
## # ... with 101 more rows
There are 13 data points after tidying and filtering the data. With the psych package, we can take this opportunity to find basic statistics about this subset of data i.e. mean, median, standard deviation, etc. We’ll also take this opportunity to graphically demonstrate
describe(correlate.agile$score)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 111 0.31 0.11 0.32 0.32 0.12 0.05 0.5 0.45 -0.39 -0.77
## se
## X1 0.01
boxplot(correlate.agile$score, xlab = "Agile", ylab = "Correlation Score", main = "Correlation Score for Agile Skillset", col = "lightgreen")
So the mean and median of this subset is 0.25 with data points ranging from 0.07 to 0.37. And according to the boxplot, there is one outlier in this dataset. While not the strongest positive correlation, there is some positive correlation noted here.
Now, let’s compare multiple skills against each other and see how well they correlate. What I like to do is take all of the different skill sets and compare them to each other graphically to get a better sense how important these skill sets are.
mult.skills <- ggplot(data = correlate.skills, aes(x=skill, y=score)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Correlation Score for Skill Sets for Data Scientists/Analysts", x = "Skill Sets", y = "Correlation Score")
mult.skills
# Now with some color!
mult.skills1 <- ggplot(data = correlate.skills, aes(x=skill, y=score)) + geom_boxplot(aes(fill = skill)) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Correlation Score for Skill Sets for Data Scientists/Analysts", x = "Skill Sets", y = "Correlation Score")
mult.skills1
# This time, we will use a different visualization to look at the data.
mult.skills2 <- ggplot(data = correlate.skills, aes(x=skill, y=score)) + geom_bin2d() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Correlation Score for Skill Sets for Data Scientists/Analysts", x = "Skill Sets", y = "Correlation Score")
mult.skills2
Interestingly, R and Ruby had negative correlation. I am not sure how to make sense of this time, but it may be perhaps the way we had collected the information and performed the correlation. In the first data set analysis, R and Python were the two most popular tools for data scientists.
Most of the other data skills, such as understanding, team, teams, people were either moderately positive or highly positive on the correlation score.