Shutterstock.com

Shutterstock.com

Now that there are correlation scores for the data jobs and their skillsets (both hard and soft), let’s pull the data from MySQL, tidy up the data and perform some data analysis.

Load libraries.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(RMySQL)
## Loading required package: DBI
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(stringr)

Let’s create a connection with the MySQL online database, and download the files into R.

# Creating a connection between R and MySQL
con <- dbConnect(RMySQL::MySQL(), dbname = "DataSkills", user = "admin", password = "CUNYsps2017", host = "cuny-607-project-db.ce9yg7qxcc7l.us-east-2.rds.amazonaws.com")
dbListTables(con)
## [1] "Correlation Scores"   "Correlation Scores 2" "DataSkillset"        
## [4] "Scores"               "indeedjob"

There are three tables in the database “DataSkills”. The tables, “DataSkillset” and “indeedjob” were both scraped data from Monster, Kaggle, Kdnuggets (DataSkills) and Indeed (indeedjob). This information was then taken and used in Microsoft Cognitive Services. The job titles/descriptions from indeedjob were then correlated with the data skills sets from DataSkillset to see if there was any negative or positive correlation. A (-1) score means a complete negative correlation where as a (+1) score means a complete positive correlation. The score can range from -1 to +1.

Let’s pull up the DataSkillset and Correlation Scores table.

# Using queries to download the data sets
correlate.query <- "SELECT * FROM `Correlation Scores`;"
correlate.table <- dbGetQuery(con, correlate.query)

dataskill.query <- "SELECT * FROM DataSkillset;"
dataskill <- dbGetQuery(con, dataskill.query)

# Show the first 10 listings in the data table.
head(correlate.table, 10)
##    ID                                             job       skill    score
## 1   1 data scientist, business strategy - seattle, wa      python  0.14926
## 2   2          data integration and reporting analyst       agile  0.34817
## 3   3 data scientist, business strategy - seattle, wa  leadership  0.40843
## 4   4          data integration and reporting analyst        ruby -0.09936
## 5   5          data integration and reporting analyst  statistics  0.43853
## 6   6 data scientist, business strategy - seattle, wa        ruby  0.04905
## 7   7          data integration and reporting analyst performance  0.34586
## 8   8                   data analyst/reporting writer       group  0.08052
## 9   9          data integration and reporting analyst    modeling  0.31776
## 10 10          data integration and reporting analyst      matlab  0.15621
paste("There are", length(correlate.table$ID), "rows in this table.")
## [1] "There are 4372 rows in this table."

Data Skill Sets

In this data, this data was scraped to see which soft and/or hard skills came up most frequently during my search for job listings in Monster.com, Kaggle.com, and Kdnuggets.com. What it does is tally up how many times each skill set was mentioned during the scrape.

The data is nearly tidy. However, what I like to do is tidy the data up even more, so that way there are no duplicates in the data set.

dataskill1 <- dataskill %>% group_by(d) %>% summarise(Count = sum(Sum)) %>% arrange(desc(Count))
colnames(dataskill1) <- c("Skill", "Count")
head(dataskill1, 10)
## # A tibble: 10 × 2
##              Skill Count
##              <chr> <dbl>
## 1      experience    632
## 2        business    292
## 3            team    232
## 4      analytical    124
## 5          skills     89
## 6           group     60
## 7          python     56
## 8               r     56
## 9   understanding     53
## 10         people     49

Both soft skills and hard skills are very important to the data scientist. However, we are going to take a closer look at both soft and hard skills. We will subset the data and then analyze the data.

soft <- c("experience", "business", "team", "analytical", "skills", "group", "understanding", "people", "communication skills", "teams", "solving", "leadership", "interpersonal", "problem solving", "ideas")
hard <- c("python", "r", "sql", "algorithms", "hadoop", "computer science", "bigdata", "statistics", "datamining", "java", "spark", "agile", "mathematics", "apache", "hive", "matlab", "oracle", "ruby", "sas", "scala", "nosql")

# Initially, I had tried to subset the data with the soft vector in the dplyr package, but was returning no data.
# However, I had realized that, for some reason, there was a space in front of every word in dataskill1$Skill.
# This next step, by using regular expressions, will remove the space.
skills <- dataskill1$Skill
skills <- gsub("\\s", "", skills)
dataskill1$Skill <- skills

# Now, we can subset the soft and hard skill sets
soft.skills <- dataskill1 %>% filter(Skill %in% soft)
hard.skills <- dataskill1 %>% filter(Skill %in% hard)

soft.skills
## # A tibble: 14 × 2
##            Skill Count
##            <chr> <dbl>
## 1     experience   632
## 2       business   292
## 3           team   232
## 4     analytical   124
## 5         skills    89
## 6          group    60
## 7  understanding    53
## 8         people    49
## 9          teams    28
## 10       solving    20
## 11    leadership    16
## 12    experience     4
## 13 interpersonal     4
## 14         ideas     2
hard.skills
## # A tibble: 20 × 2
##          Skill Count
##          <chr> <dbl>
## 1       python    56
## 2            r    56
## 3          sql    48
## 4   algorithms    36
## 5       hadoop    36
## 6      bigdata    20
## 7   statistics    17
## 8   datamining    16
## 9         java    16
## 10       spark    13
## 11       agile    12
## 12 mathematics    12
## 13      apache     4
## 14        hive     4
## 15      matlab     4
## 16      oracle     4
## 17        ruby     4
## 18         sas     4
## 19       scala     4
## 20       nosql     1
ss <- ggplot(soft.skills, aes(x=Skill, y=Count)) + geom_point(aes(color=Skill, size=Count)) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(main = "Soft Skills")
ss

hs <- ggplot(hard.skills, aes(x=Skill, y=Count)) + geom_point(aes(color=Skill, size=Count)) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(main = "Hard Skills")
hs

Looking at the data, it appears that experience is by far one of the most important qualifications that an employer is looking for. Regarding hard skill sets, R and Python remain very popular.

Correlation Score

At first glance, the data looks like there is a lot of interesting information. (For example, there’s a very slight negative correlation with Ruby and “Data integration and reporting analyst”.) However, to make more sense of this data, let’s start by tidying up the data.

# Because there is so much variation in the 'jobs' column, we will group by 'skill'.
correlate.skills <- correlate.table %>% group_by(skill) %>% arrange(skill)
head(correlate.skills, 10)
## Source: local data frame [10 x 4]
## Groups: skill [1]
## 
##       ID                                                job skill   score
##    <int>                                              <chr> <chr>   <dbl>
## 1      2             data integration and reporting analyst agile 0.34817
## 2     18    data scientist, business strategy - seattle, wa agile 0.27510
## 3     70                             data reporting analyst agile 0.24886
## 4     93                   analytics & insights coordinator agile 0.37262
## 5    119 statistical analyst - data analytics and reporting agile 0.33180
## 6    132                      data analyst/reporting writer agile 0.23017
## 7    160    data analyst, performance, analysis & reporting agile 0.27345
## 8    256            data and reporting analyst-magellan mrx agile 0.07033
## 9    282                       aco reporting & data analyst agile 0.31328
## 10   311                             data scientist - staff agile 0.24106

Now as you see, all the jobs are now listed together, grouped by skill. We notice for agile, that there tends to be an overall positive correlation, meaning that the computer software agile was likely to be used for data integration, reporting analyst, statistical analyst, etc.

Let’s take this data frame and isolate just the agile group.

correlate.agile <- correlate.table %>% group_by(skill) %>% arrange(skill) %>% filter(skill == "agile")
correlate.agile
## Source: local data frame [111 x 4]
## Groups: skill [1]
## 
##       ID                                                job skill   score
##    <int>                                              <chr> <chr>   <dbl>
## 1      2             data integration and reporting analyst agile 0.34817
## 2     18    data scientist, business strategy - seattle, wa agile 0.27510
## 3     70                             data reporting analyst agile 0.24886
## 4     93                   analytics & insights coordinator agile 0.37262
## 5    119 statistical analyst - data analytics and reporting agile 0.33180
## 6    132                      data analyst/reporting writer agile 0.23017
## 7    160    data analyst, performance, analysis & reporting agile 0.27345
## 8    256            data and reporting analyst-magellan mrx agile 0.07033
## 9    282                       aco reporting & data analyst agile 0.31328
## 10   311                             data scientist - staff agile 0.24106
## # ... with 101 more rows

There are 13 data points after tidying and filtering the data. With the psych package, we can take this opportunity to find basic statistics about this subset of data i.e. mean, median, standard deviation, etc. We’ll also take this opportunity to graphically demonstrate

describe(correlate.agile$score)
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis
## X1    1 111 0.31 0.11   0.32    0.32 0.12 0.05 0.5  0.45 -0.39    -0.77
##      se
## X1 0.01
boxplot(correlate.agile$score, xlab = "Agile", ylab = "Correlation Score", main = "Correlation Score for Agile Skillset", col = "lightgreen")

So the mean and median of this subset is 0.25 with data points ranging from 0.07 to 0.37. And according to the boxplot, there is one outlier in this dataset. While not the strongest positive correlation, there is some positive correlation noted here.

Now, let’s compare multiple skills against each other and see how well they correlate. What I like to do is take all of the different skill sets and compare them to each other graphically to get a better sense how important these skill sets are.

mult.skills <- ggplot(data = correlate.skills, aes(x=skill, y=score)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Correlation Score for Skill Sets for Data Scientists/Analysts", x = "Skill Sets", y = "Correlation Score")
mult.skills

# Now with some color!
mult.skills1 <- ggplot(data = correlate.skills, aes(x=skill, y=score)) + geom_boxplot(aes(fill = skill)) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Correlation Score for Skill Sets for Data Scientists/Analysts", x = "Skill Sets", y = "Correlation Score")
mult.skills1

# This time, we will use a different visualization to look at the data.
mult.skills2 <- ggplot(data = correlate.skills, aes(x=skill, y=score)) + geom_bin2d() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Correlation Score for Skill Sets for Data Scientists/Analysts", x = "Skill Sets", y = "Correlation Score")
mult.skills2

Interestingly, R and Ruby had negative correlation. I am not sure how to make sense of this time, but it may be perhaps the way we had collected the information and performed the correlation. In the first data set analysis, R and Python were the two most popular tools for data scientists.

Most of the other data skills, such as understanding, team, teams, people were either moderately positive or highly positive on the correlation score.