Part 1 - Introduction:

With the growth in data collection and processing, companies, governments and agencies need to make informed decisions based on factual data. Extracting that kind of information and knowledge from large, heterogeneous, and noisy data sets requires not only powerful computing resources but also the programming abstractions to use them effectively.

In that context, data scientists need the skills to overcome the challenges of working with diverse data structures in a given data science project.

The relationship between data processing and the data scientist brings in a third component, a numerical variable: salary.

With that in mind, I will explore and try to answer a very important research question:

Are data science skills predictive of salary?

Hypothesis

From our research question, we can define our hypotheses as follows:

\(H_0:\) Data Science skills are not predictive of salary; that is, the mean Skill Value is the same for every skill.

\(H_1:\) Data Science skills are predictive of salary; that is, at least one skill has a mean Skill Value that differs from the others.
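
Stated symbolically, writing \(\mu_i\) for the mean Skill Value of skill \(i\) and \(k\) for the number of distinct skills (this notation is an assumption introduced here for illustration, not the author's), the two hypotheses read:

\[ H_0: \mu_1 = \mu_2 = \dots = \mu_k \qquad \text{versus} \qquad H_1: \mu_i \neq \mu_j \ \text{for at least one pair } i \neq j \]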

Part 2 - Data:

Data Source

The data that I will be working with is collected by Paysa and is available online at http://paysa.com

For this project, the data was extracted by copying the results of a job search for “Data Science” on March 16, 2017 and pasting them into a text file; the file was then cleaned and loaded into a table on a local MySQL server.

Paysa collects this data through its integrated job posting website, where employers submit listings daily.

Raw Data

The table below displays all job listings compiled from Paysa.

Cases

Each case represents a job posting in the United States. There are 390 observations in the given data set.

Explanatory variable

The explanatory variable is Data Science skills and is categorical.

Response variable

The response variable is Base Salary and is numerical.

Curated Data

From the above table, I will focus on the Base Salary and the combination of Skills as follows:

Skill value per job listing

Each case lists multiple skills for a single base salary. For the purposes of this study, I will assign a “Skill Value” to each skill in a listing by dividing the listing's base salary by the number of skills listed for that case.

For example, in the first case there is a base salary of $253000 with 6 skills listed (Distributed Systems, Big Data, Algorithms, Data Science, Strategy, Databases). Dividing $253000 by 6 gives approximately $42167, so each skill in the first case is assigned a Skill Value of $42167. The same process is applied to the rest of the cases.
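
The calculation above can be sketched in a few lines of Python. The original analysis appears to have been run in R; Python is used here only for illustration, and the helper name `skill_values` and its list-of-pairs output are assumptions, not the author's code. The sketch divides a posting's base salary evenly across its listed skills and returns one row per skill.

```python
def skill_values(base_salary, skills):
    """Split one posting's base salary evenly across its listed skills.

    Returns a list of (skill, skill_value) pairs, one per skill.
    Hypothetical helper for illustration only.
    """
    if not skills:
        # Assumption: the source does not say how postings without
        # skills were handled, so this sketch simply rejects them.
        raise ValueError("posting lists no skills")
    value = round(base_salary / len(skills))
    return [(skill, value) for skill in skills]


# First case from the text: $253000 base salary, 6 skills.
first_case = skill_values(
    253000,
    ["Distributed Systems", "Big Data", "Algorithms",
     "Data Science", "Strategy", "Databases"],
)
for skill, value in first_case:
    print(f"{skill}: ${value}")
```

Each of the six skills in the first case receives the same Skill Value, which is why skills that tend to appear in short, highly paid listings end up with higher values.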

The table below shows the number of skills per job listing and the “average” base salary per skill in that listing.

Part 3 - Exploratory data analysis:

From the above table, we have defined a Skill Value for each skill listed on each job posting.

From the raw data, we have a total of 2220 skills listed in the 390 job postings.

Summary

Below is a summary of the individual skills data.

##     Skills           Skill Value    
##  Length:2220        Min.   : 15833  
##  Class :character   1st Qu.: 20500  
##  Mode  :character   Median : 22500  
##                     Mean   : 24255  
##                     3rd Qu.: 26617  
##                     Max.   :161000

From the above summary table, the minimum skill value is $15833 and the maximum is $161000, with a median of $22500 per skill. The mean ($24255) exceeds the median, which points to a right-skewed distribution.

Count, Mean and Standard deviation

From the above plot, an initial inspection of the data suggests that the medians differ between skills, although the differences are not clear at this point.

Outliers

From the above set of box plots, we can quickly see that there are some outliers in the remaining data set, while the medians seem to vary depending on the skill; this could be taken as an indication that skills are predictive of salary, as proposed in our introduction.
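
Box plots conventionally flag points that lie more than 1.5 times the interquartile range beyond the quartiles. As a rough check, assuming that rule and using the quartiles from the summary of Skill Value shown earlier, the fences can be sketched in Python. The helper name `iqr_fences` is illustrative, and the quartiles a box plot uses can differ slightly from those a summary table prints.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and extremes taken from the summary table of Skill Value.
q1, q3 = 20500, 26617
minimum, maximum = 15833, 161000

lower, upper = iqr_fences(q1, q3)
print(f"fences: {lower:.1f} to {upper:.1f}")
print("maximum is an outlier:", maximum > upper)
print("minimum is an outlier:", minimum < lower)
```

Under this rule the $161000 maximum lies far above the upper fence, while the $15833 minimum stays inside the lower one, which fits outliers appearing only on the high side.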

The above histogram is roughly bell-shaped but skewed to the right, and it confirms the presence of outliers.

For visualization purposes, I will include a new density histogram with a limited domain as follows:

Medians

From the calculated medians, we obtain the following histograms:

The above histogram of the medians is still roughly bell-shaped with a right skew, and it again shows outliers, which act as high-leverage points.

For visualization purposes, I will include a new density histogram with a limited domain as follows:

Based on our Q-Q plot, the medians follow the reference line (qqline) for most of its length; a couple of high-leverage points fall away from it.
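
A normal Q-Q plot pairs each sorted observation with the quantile that a standard normal distribution would produce at the same rank. Below is a minimal sketch in Python using made-up values (the data are hypothetical, not the Paysa medians) and the plotting positions (i - 0.5)/n, which is one common convention; the function name `normal_qq_points` is an assumption for illustration.

```python
from statistics import NormalDist


def normal_qq_points(values):
    """Pair sorted values with standard-normal theoretical quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting position (i + 0.5) / n for zero-based rank i.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Hypothetical skill medians, with one large value acting as an outlier.
medians = [21000, 22500, 23000, 22000, 24000, 23500, 41000]
for theo, observed in normal_qq_points(medians):
    print(f"{theo:6.2f}  {observed}")
```

If the points are plotted, values that track a straight line are consistent with normality, while a point such as the hypothetical 41000 would sit well above the line, much like the high-leverage medians described above.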

Part 4 - Inference:

Satisfying conditions for inference:

Conditions:

  • The sample size is greater than 30.

  • The data follow a unimodal, approximately normal distribution.

  • The samples are random.

Hence, the conditions for inference seem to be satisfied.

ANOVA

Summary

## 
## Call:
## lm(formula = `Skill Value` ~ Skills, data = my.skills.data_long)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17991  -3594  -1297   1854 134444 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          18333       7298   2.512  0.01208 *  
## SkillsAlgorithms                      5128       7318   0.701  0.48357    
## SkillsAnalytics                       7499       7341   1.021  0.30715    
## SkillsAndroid                         3800       7995   0.475  0.63463    
## SkillsApache Spark                    4750       7741   0.614  0.53952    
## SkillsArchitecture                    5695       7883   0.722  0.47013    
## SkillsArchitectures                   5387       7462   0.722  0.47048    
## SkillsAutomation                      5439       7883   0.690  0.49028    
## SkillsAWS                             3873       7510   0.516  0.60613    
## SkillsBig Data                        4928       7319   0.673  0.50085    
## SkillsBusiness Intelligence          10056       8428   1.193  0.23291    
## SkillsC                               3977       7802   0.510  0.61034    
## SkillsC++                             4721       7350   0.642  0.52072    
## SkillsCassandra                       9900       8428   1.175  0.24022    
## SkillsComputer Vision                 2334       7655   0.305  0.76051    
## SkillsData Mining                     4729       7329   0.645  0.51881    
## SkillsData Science                    8223       7316   1.124  0.26113    
## SkillsDatabases                       7521       7741   0.972  0.33136    
## SkillsDeep Learning                   4380       7498   0.584  0.55920    
## SkillsDistributed Systems             6723       7498   0.897  0.37007    
## SkillsEMPTY                         113667      10322  11.013  < 2e-16 ***
## SkillsEngineering Management         10167      10322   0.985  0.32473    
## SkillsEnterprise Software             3250       8939   0.364  0.71620    
## SkillsETL                             7509       7523   0.998  0.31836    
## SkillsFirewalls                      12167      10322   1.179  0.23861    
## SkillsFunctional Programming          7178       8428   0.852  0.39446    
## SkillsGame Development                1467       7995   0.183  0.85445    
## SkillsGo                             13028       8428   1.546  0.12228    
## SkillsHadoop                          4650       7318   0.635  0.52522    
## SkillsHTTP                           14334      10322   1.389  0.16506    
## SkillsImage Processing                3334       8939   0.373  0.70920    
## SkillsInformation Retrieval           5858       7498   0.781  0.43478    
## SkillsJava                            6439       7574   0.850  0.39534    
## SkillsLAMP                            1834       8939   0.205  0.83746    
## SkillsLeadership                      8739       7655   1.142  0.25374    
## SkillsMachine Learning                5584       7312   0.764  0.44514    
## SkillsManagement                      5165       7358   0.702  0.48278    
## SkillsMapReduce                       4027       7415   0.543  0.58712    
## SkillsMathematical Modeling           3667      10322   0.355  0.72242    
## SkillsMathematics                    11950       8160   1.464  0.14320    
## SkillsMatlab                          4940       7389   0.669  0.50383    
## SkillsMySQL                           4367       8428   0.518  0.60439    
## SkillsNatural Language Processing     6378       7883   0.809  0.41855    
## SkillsNetwork Architecture            4334      10322   0.420  0.67460    
## SkillsOptimization                    5755       7336   0.785  0.43278    
## SkillsOS X                            4600       7995   0.575  0.56510    
## SkillsPHP                             5473       7883   0.694  0.48761    
## SkillsProduct Design                 48167      10322   4.667 3.25e-06 ***
## SkillsProduct Management              5758       7408   0.777  0.43712    
## SkillsPython                          4771       7328   0.651  0.51510    
## SkillsRecommender Systems             1834       8939   0.205  0.83746    
## SkillsRelational Databases            4223       7538   0.560  0.57541    
## SkillsREST                            2122       7623   0.278  0.78079    
## SkillsRuby                            5462       7574   0.721  0.47091    
## SkillsScala                           5207       7394   0.704  0.48140    
## SkillsScalability                     2205       7574   0.291  0.77094    
## SkillsScripting                      17700       7995   2.214  0.02694 *  
## SkillsSearch                         23158       7802   2.968  0.00303 ** 
## SkillsSignal Processing               5667       8939   0.634  0.52616    
## SkillsSoftware Design                 4400       7655   0.575  0.56545    
## SkillsSQL                             6801       7363   0.924  0.35581    
## SkillsStatistics                      8266       7367   1.122  0.26195    
## SkillsStrategy                       16047       7432   2.159  0.03096 *  
## SkillsTechnical Leadership            6123       7488   0.818  0.41362    
## SkillsTest Driven Development         9667      10322   0.937  0.34908    
## SkillsTime Series Analysis            4667       7802   0.598  0.54979    
## SkillsTomcat                          2334       8939   0.261  0.79403    
## SkillsUser Experience                 7855       7555   1.040  0.29856    
## SkillsWeb Services                    1500      10322   0.145  0.88447    
## SkillsWindows                         3941       7555   0.522  0.60199    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7298 on 2150 degrees of freedom
## Multiple R-squared:  0.1714, Adjusted R-squared:  0.1448 
## F-statistic: 6.444 on 69 and 2150 DF,  p-value: < 2.2e-16

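From the above results, the overall F-test (F = 6.444 on 69 and 2150 degrees of freedom, p-value < 2.2e-16) provides evidence that the mean Skill Value differs for at least one skill. The effect is modest, however: skills explain about 17% of the variance (multiple R-squared of 0.1714), and only five levels (EMPTY, Product Design, Search, Scripting and Strategy) differ from the baseline skill at the 0.05 level.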
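
With a single categorical predictor, and assuming R's default treatment coding, the intercept is the mean Skill Value of the baseline skill and every other estimate is the difference between that skill's mean and the baseline. The Python sketch below takes a few rows copied from the summary above, recomputes each t value as the estimate divided by its standard error, and recovers the implied group means. The dictionary layout and variable names are illustrative assumptions.

```python
# (estimate, standard error) copied from the lm() summary.
intercept = (18333, 7298)
coefficients = {
    "Algorithms": (5128, 7318),
    "Product Design": (48167, 10322),
    "Search": (23158, 7802),
    "Strategy": (16047, 7432),
}

results = {}
for skill, (estimate, std_error) in coefficients.items():
    t_value = estimate / std_error          # t statistic reported by R
    group_mean = intercept[0] + estimate    # implied mean Skill Value
    results[skill] = (round(t_value, 3), group_mean)
    print(f"{skill:15s} t = {t_value:6.3f}  mean = {group_mean}")
```

The recomputed t values agree with the printed ones up to the rounding of the printed estimates and standard errors.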

Results

## Analysis of Variance Table
## 
## Response: Skill Value
##             Df     Sum Sq   Mean Sq F value    Pr(>F)    
## Skills      69 2.3685e+10 343255703   6.444 < 2.2e-16 ***
## Residuals 2150 1.1453e+11  53267887                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The above table confirms that there are differences between the skills, as highlighted in the model summary.
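
The entries of the ANOVA table are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. Below is a short Python check using the printed values, so the results match only up to the rounding in the table; the variable names are illustrative.

```python
import math

# Values copied from the ANOVA table.
ss_skills, df_skills, ms_skills = 2.3685e10, 69, 343255703
ss_resid, df_resid, ms_resid = 1.1453e11, 2150, 53267887

# Mean square = sum of squares / degrees of freedom.
# The sums of squares are printed to five digits, so these are approximate.
ms_skills_check = ss_skills / df_skills
ms_resid_check = ss_resid / df_resid

f_stat = ms_skills / ms_resid                    # F statistic
r_squared = ss_skills / (ss_skills + ss_resid)   # share of variance explained
resid_se = math.sqrt(ms_resid)                   # residual standard error
n_obs = df_skills + df_resid + 1                 # total number of skill rows

print(f"F = {f_stat:.3f}")
print(f"R-squared = {r_squared:.4f}")
print(f"Residual standard error = {resid_se:.0f}")
print(f"Observations = {n_obs}")
```

The recovered observation count equals the 2220 skill rows reported in the exploratory analysis, and the R-squared and residual standard error match the values in the model summary.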

Part 5 - Conclusion:

Returning to our initial question, “Are data science skills predictive of salary?”, we can conclude as follows:

Taken together, the above plots, the linear model and the analysis of variance indicate that data science skills and salary are associated.

The analysis of variance returned an extremely small p-value (less than 2.2e-16), well below the 0.05 significance level, so we reject the null hypothesis \(H_0\) in favor of the alternative hypothesis \(H_1\): at least one skill has a different mean Skill Value. This conclusion is reinforced by the histograms and Q-Q plot of the medians.

References:

  • OpenIntro Statistics, Third Edition. Diez, D. et al. 2015