Overview:

Soft Skills

We create a data frame containing only the total number of soft skills per job, the job type, and the salary band.

We then look at the two extreme cases together: high-salary data engineers and low-salary data analysts.
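A minimal sketch of how such a data frame could be assembled, assuming a hypothetical input with one 0/1 indicator column per soft skill. The names `postings`, `job_type`, `salary_band`, `total_softskills` and the toy rows are illustrative assumptions, not the data used in this report.

```r
# Hypothetical toy input: one row per posting, one 0/1 column per soft skill.
postings <- data.frame(
  job_type      = c("data_engineer", "data_analyst", "data_analyst"),
  salary_band   = c(">160000", "<80000", "<80000"),
  team          = c(1, 1, 0),
  management    = c(0, 1, 1),
  communication = c(1, 1, 0),
  stringsAsFactors = FALSE
)

# Columns holding soft-skill indicators (everything except the two labels).
skill.cols <- setdiff(names(postings), c("job_type", "salary_band"))

# Keep only the job type, the salary band, and the per-posting soft-skill total.
softskills.df <- data.frame(
  job_type         = postings$job_type,
  salary_band      = postings$salary_band,
  total_softskills = unname(rowSums(postings[, skill.cols])),
  stringsAsFactors = FALSE
)

print(softskills.df)
```

Keeping the indicator columns out of the result leaves one number per posting, which is enough to compare groups by job type and salary band.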

Top soft skills (engineers, high salary)
skill                occurrence
Team                 0.7885
Management           0.3846
Problem              0.3654
communication        0.3077
Result               0.2596
Hands-on             0.2404
Insight              0.2212
detail               0.2019
Organization         0.1923
Collaborate          0.1827
Competitive          0.1731
Driven               0.1731
Responsible          0.1635
Collaboration        0.1442
Passionate           0.1346
Problem solving      0.1250
Innovation           0.1154
Influence            0.1058
Leadership           0.1058
Managing             0.0962
Team player          0.0962
Analyzing            0.0865

Top soft skills (analysts, low salary)
skill                occurrence
Team                 0.6227
Management           0.5386
communication        0.4656
Problem              0.4109
Organization         0.3829
detail               0.3240
Result               0.3128
Research             0.2917
Train                0.2693
Presentation         0.2567
Training             0.2216
Responsible          0.2188
Independent          0.2146
Insight              0.2020
Analyzing            0.1893
Attention to detail  0.1795
Collaborate          0.1697
Writing              0.1697
Leadership           0.1683
Interpersonal        0.1529
Strategic            0.1515
Integrity            0.1487
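
The occurrence values in these tables lie between 0 and 1, which is consistent with the share of postings in a group that mention each skill. That reading is an assumption, and the sketch below only illustrates it on made-up indicator data; the names `group`, `share`, and `top.skills` are hypothetical.

```r
# Hypothetical toy input: 0/1 soft-skill indicators for four postings in one group.
group <- data.frame(
  Team          = c(1, 1, 1, 0),
  Management    = c(1, 0, 0, 0),
  communication = c(1, 1, 0, 0)
)

# Share of postings that mention each skill, sorted from most to least common.
share <- sort(colMeans(group), decreasing = TRUE)

top.skills <- data.frame(
  skill      = names(share),
  occurrence = round(unname(share), 4),
  stringsAsFactors = FALSE
)

print(top.skills)
```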

Job Title

Here we look at the different types of jobs and industries and compare the skills mentioned.

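library(dplyr)  # provides %>% and select()

df <- df_raw
# Keep only the posting id (column X in the raw data) and the job title.
job.indeed.df <- df %>%
  select(Id = X, Job_Title)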

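# Pattern building: keywords that assign a title to a role or a seniority level.
# Each vector is later collapsed with "|" into one regular expression, so the "."
# in 'dr.', 'jr.' and 'sr.' matches any character, not only a literal dot.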
pattern.analyst <- c('analyst','statistician','analysis','analytics')
pattern.engineer <- c('engineer', 'engg', 'technician','technologist','designer','architect')
pattern.scientist <- c('scientist','doctor','dr.')
pattern.junior <- c('junior','jr', 'entry','internship','jr.')
pattern.senior <- c('senior', 'sr','experienced','sr.')

# Intermediate data frame for titles: one row per posting, with a 0/1 flag
# column for each role and seniority level.
n.jobs <- nrow(job.indeed.df)
final.data.df <- data.frame(Id = integer(n.jobs), Job_Title = character(n.jobs),
                            analyst = integer(n.jobs), engineer = integer(n.jobs), scientist = integer(n.jobs),
                            junior = integer(n.jobs), senior = integer(n.jobs))
final.data.df$Id <- job.indeed.df$Id
final.data.df$Job_Title <- as.character(job.indeed.df$Job_Title)

# Working on the counts: flag each title with 1 if it matches any keyword of a
# group (case-insensitive) and 0 otherwise. grepl() is vectorized over the
# titles, so no explicit loop over rows is needed.
flag.titles <- function(patterns) {
  as.integer(grepl(paste(patterns, collapse = "|"), job.indeed.df$Job_Title, ignore.case = TRUE))
}
final.data.df$analyst   <- flag.titles(pattern.analyst)
final.data.df$engineer  <- flag.titles(pattern.engineer)
final.data.df$scientist <- flag.titles(pattern.scientist)
final.data.df$junior    <- flag.titles(pattern.junior)
final.data.df$senior    <- flag.titles(pattern.senior)
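
The keyword matching above can be tried on a few made-up titles. The sketch below is self-contained; the helper `match.any` and the sample titles are hypothetical. It also shows a side effect of treating the keywords as regular expressions: the unescaped `.` in `'dr.'` matches any character, so a title such as "Android Developer" is flagged as a scientist unless the dot is escaped.

```r
# Hypothetical sample titles, for illustration only.
titles <- c("Senior Data Engineer", "Jr. Data Analyst",
            "Research Scientist", "Android Developer")

# Return 1 where a title matches any keyword (case-insensitive), else 0.
match.any <- function(patterns, x) {
  as.integer(grepl(paste(patterns, collapse = "|"), x, ignore.case = TRUE))
}

pattern.senior    <- c('senior', 'sr', 'experienced', 'sr.')
pattern.scientist <- c('scientist', 'doctor', 'dr.')

senior.flag    <- match.any(pattern.senior, titles)
scientist.flag <- match.any(pattern.scientist, titles)  # "Android" matches 'dr.'

# Escaping the dot restricts the match to a literal "dr."
scientist.strict <- match.any(c('scientist', 'doctor', 'dr\\.'), titles)

print(data.frame(titles, senior.flag, scientist.flag, scientist.strict))
```

Short keywords such as `'sr'` and `'jr'` can likewise match inside longer words; wrapping them in word boundaries (`'\\bsr\\b'`) is one way to tighten the match, at the cost of changing the counts reported here.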

Summary Visualizations

Word Cloud

[Word cloud of skills in df.stars.5.engg, sized by total_count. Skills that could not be fit on the page and were not plotted: Shell Scripting, Business Intelligence, Google Cloud Platform, Natural Language Processing, Oral communication skills, Responsibility, Work-Life Balance.]

[Word cloud of skills in df.stars.5.ana.sci, sized by total_count. Skills that could not be fit on the page and were not plotted: Data-driven decision-making, Survey Design, Written communication skills.]

Queried_Salary   total_count  ana_total  eng_totals  sci_total  jr_total  sr_total
<80000                    37         37           0          0         0         0
>160000                   47         17           0         30         0         0
100000-119999            111         33          54         24         0         0
120000-139999            136         55          34         77         0         9
140000-159999             59         17          28         14         0         8
80000-99999               61         25          36          0         0        13
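
A table of this shape could be produced by summing 0/1 title flags within each salary band. The following is a minimal sketch using base R's `aggregate()`, assuming a hypothetical data frame `flags` with a `Queried_Salary` column and one flag column per role; the toy values are not the report's data.

```r
# Hypothetical toy input: one row per posting with its salary band and role flags.
flags <- data.frame(
  Queried_Salary = c("<80000", "<80000", ">160000", ">160000", ">160000"),
  analyst   = c(1, 1, 0, 1, 0),
  engineer  = c(0, 0, 1, 0, 0),
  scientist = c(0, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# A constant column of ones turns the per-band sum into a posting count.
flags$total_count <- 1

# Sum the count and every flag column within each salary band.
by.salary <- aggregate(
  cbind(total_count, analyst, engineer, scientist) ~ Queried_Salary,
  data = flags, FUN = sum
)

print(by.salary)
```

Because `Queried_Salary` is plain text here, the bands sort alphabetically, which is why `>160000` appears before `100000-119999` in the table above.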

The Question on Everybody’s Mind: What Technical Skills will Get Me a High Paying Job?

First, we select and arrange the data we want to use: the number of mentions of each technical skill in each salary range.

##                              
##                               <80000 80000-99999 100000-119999 120000-139999
##   AI                               5          50           105           131
##   AWS                              1          37            80            96
##   Big Data                         2          17            97           161
##   C/C++                            6          36            98           141
##   Data Mining                     13          55           238           225
##   Data Science                     6          17            65            75
##   Hadoop                           3          27           163           297
##   Hive                             1          11            91           158
##   Java                             2          40           149           224
##   Linux                            2          44            55            66
##   Machine Learning                30         145           467           577
##   MATLAB                           3          30            68           131
##   Natural Language Processing      5          44           114           153
##   Python                          23         157           506           598
##   R                               19         127           460           502
##   SAS                              7          59           192           162
##   Scala                            1           7            67           119
##   Spark                            3          20           168           269
##   SQL                             13         134           354           371
##   Tableau                          5          38           157           164
##   TensorFlow                       6          46            86           128
##                              
##                               140000-159999 >160000
##   AI                                    114      46
##   AWS                                    65      37
##   Big Data                              111      29
##   C/C++                                  99      51
##   Data Mining                           162      40
##   Data Science                           41      17
##   Hadoop                                244      93
##   Hive                                  132      49
##   Java                                  190      70
##   Linux                                  48      10
##   Machine Learning                      440     183
##   MATLAB                                 98      34
##   Natural Language Processing           157      55
##   Python                                456     172
##   R                                     328     114
##   SAS                                   113      31
##   Scala                                 100      44
##   Spark                                 201      84
##   SQL                                   232      89
##   Tableau                                94      26
##   TensorFlow                            100      54
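
A cross-tabulation like the one above can be built with `table()` from long-format data holding one row per skill mention. The sketch below assumes a hypothetical data frame `mentions` with made-up rows; the salary labels are the ones shown in the output. Converting the salary column to a factor with explicit levels keeps the columns in ascending salary order instead of alphabetical order.

```r
# Salary ranges in ascending order, as labelled in the output above.
salary.levels <- c("<80000", "80000-99999", "100000-119999",
                   "120000-139999", "140000-159999", ">160000")

# Hypothetical toy data: one row per (posting, skill) mention.
mentions <- data.frame(
  skill  = c("Python", "Python", "SQL", "R", "Python", "SQL"),
  salary = c("<80000", ">160000", "80000-99999", ">160000", ">160000", "<80000"),
  stringsAsFactors = FALSE
)

# Explicit levels fix the column order and keep empty ranges in the table.
mentions$salary <- factor(mentions$salary, levels = salary.levels)

skill.by.salary <- table(mentions$skill, mentions$salary)
print(skill.by.salary)
```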

Proportion plots for skills vs salary

The plots below show the proportion of posts that mention a skill in each salary range. The horizontal bar marks the proportion that would be expected if mentions did not differ by salary range. Only skills with more than 200 mentions and p < 0.05 (as determined by chi-squared tests) are plotted. There are several interesting findings below. In particular, the skills that seem to correlate with the highest-paying jobs are C/C++, Hadoop, Hive, Java, NLP, Scala, Spark, and TensorFlow.
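
The quantities behind one such plot can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the report: the Hadoop mention counts are copied from the table above, but the number of postings per salary range (`posts`) is hypothetical, since those totals are not shown here.

```r
salary.levels <- c("<80000", "80000-99999", "100000-119999",
                   "120000-139999", "140000-159999", ">160000")

# Hadoop mentions per salary range, copied from the table above.
mentions <- c(3, 27, 163, 297, 244, 93)

# ASSUMPTION: hypothetical number of postings per salary range (not from the data).
posts <- c(100, 400, 1200, 1500, 1100, 400)
names(mentions) <- names(posts) <- salary.levels

# Proportion of postings mentioning the skill in each range, and the pooled
# proportion expected if the salary range made no difference.
prop.by.range <- mentions / posts
expected.prop <- sum(mentions) / sum(posts)

# Chi-squared test of independence on the 2 x 6 table (mentioned vs. not mentioned).
res <- chisq.test(rbind(mentioned = mentions, not.mentioned = posts - mentions))

print(round(prop.by.range, 3))
print(expected.prop)
print(res$p.value)
```

With these values, `barplot(prop.by.range)` followed by `abline(h = expected.prop)` would draw the bars and the horizontal reference line described above.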