Introduction


IT specialists are employed across a variety of sectors, such as finance, healthcare, education, and government. These roles frequently demand strong analytical and problem-solving abilities, as well as familiarity with programming languages, operating systems, and other technological tools. To stay current with emerging technologies and industry trends, many IT professions also call for continual training and certification.

Depending on the individual role, degree of expertise, location, and other considerations, IT jobs can offer a wide range of pay. Generally speaking, however, IT occupations tend to be well paid compared to many other professions, which is why they are such a popular topic in discussions about salaries.

It’s important to remember that wages can vary widely depending on region, sector, level of education, and experience. Also, just as in any other profession, the remuneration for individual job categories can vary significantly depending on the company and duties. In this study, the most important factors that influence IT professionals’ income will be identified in order to predict it accurately based on individual characteristics such as years of experience, education, etc.

Data

Around 70,000 engineers shared their learning and development strategies, tool preferences, and goals with StackOverflow in May 2022. Topics covered in the survey included the developer profile, technology, work, the community, and professional developers.

The dataset on which the model was built was made available on the Kaggle portal and can be accessed at the web link. Initially, the data set included the following columns:

  1. Employment - full-time/part-time/etc…
  2. RemoteWork - remote/on-site/hybrid
  3. CodingActivities - Hobby/Other projects/Don’t code outside of work
  4. EdLevel - Bachelor’s/Master’s/High school/etc…
  5. YearsCode - years of coding experience
  6. YearsCodePro - years of coding experience (professionally)
  7. DevType - job position
  8. OrgSize - size of the organization
  9. Country
  10. CompTotal - salary
  11. CompFreq - frequency of salary
  12. LanguageHaveWorkedWith - tech stack (programming languages)
  13. DatabaseHaveWorkedWith - tech stack (Databases)
  14. PlatformsHaveWorkedWith - tech stack (Platforms)
  15. OpSysProfessionaluse - tech stack (OS)
  16. Age - age group
  17. Gender
  18. Ethnicity
  19. WorkExp - work experience in years
  20. ConvertedCompYearly - total salary yearly
Employment RemoteWork CodingActivities EdLevel YearsCode YearsCodePro DevType OrgSize Country CompTotal CompFreq LanguageHaveWorkedWith DatabaseHaveWorkedWith PlatformHaveWorkedWith OpSysProfessional use Age Gender Ethnicity WorkExp ConvertedCompYearly
Employed, full-time;Independent contractor, freelancer, or self-employed Fully remote Hobby;Contribute to open-source projects;Freelance/contract work Bachelor’s degree (B.A., B.S., B.Eng., etc.) 12 10 Engineering manager 20 to 99 employees United States of America 194400 Yearly C#;HTML/CSS;JavaScript;PowerShell;Python;Rust;SQL Couchbase;CouchDB;Microsoft SQL Server;MongoDB;MySQL;PostgreSQL;Redis AWS;Microsoft Azure Linux-based;macOS;Windows 35-44 years old Man White 14 194400
Employed, full-time Hybrid (some remote, some in-person) Hobby Bachelor’s degree (B.A., B.S., B.Eng., etc.) 12 5 Developer, full-stack 2 to 9 employees United States of America 65000 Yearly C;HTML/CSS;Rust;SQL;Swift;TypeScript PostgreSQL AWS macOS 25-34 years old Man White 5 65000
Employed, full-time;Independent contractor, freelancer, or self-employed Fully remote Hobby;Freelance/contract work Master’s degree (M.A., M.S., M.Eng., MBA, etc.) 11 5 Developer, full-stack;Academic researcher;DevOps specialist 5,000 to 9,999 employees United States of America 110000 Yearly HTML/CSS;JavaScript;PHP;Python;R;Ruby;Scala Elasticsearch;MongoDB;Neo4j;PostgreSQL AWS;DigitalOcean;Heroku macOS 25-34 years old Man White 5 110000
Employed, full-time Hybrid (some remote, some in-person) I don’t code outside of work Master’s degree (M.A., M.S., M.Eng., MBA, etc.) 5 4 Developer, full-stack 20 to 99 employees Italy 32000 Yearly Python;SQL;TypeScript MySQL;Redis;SQLite Google Cloud;OVH;VMware Windows Subsystem for Linux (WSL) 25-34 years old Man European 4 34126
Employed, full-time Fully remote Hobby;Contribute to open-source projects Something else 25 20 Developer, back-end 100 to 499 employees Canada 125000 Yearly C#;SQL;TypeScript Microsoft SQL Server;PostgreSQL;Redis Google Cloud;Microsoft Azure Windows 35-44 years old Man White;North American 23 97605

Since the vast majority of the data is qualitative, many transformations were applied during the study. For starters, the variables CodingActivities, DevType, CompTotal and CompFreq were removed from the dataset. The first two were very difficult to structure in a way that could benefit the study; for instance, DevType would often list many different job positions for one person, and CodingActivities behaved similarly. The other two were essentially the same (just in a slightly different form) as ConvertedCompYearly, which is our dependent variable. Then, qualitative variables were transformed into dummy variables (Employment, RemoteWork, Gender, Ethnicity, etc.) or ordered categorical variables (EdLevel, Age, OrgSize - the higher the level, the higher the education, the older the person, or the larger the organization). The Country variable was grouped into regions (Western Europe, North America, etc.). Finally, the data was converted to numerical type so that regression algorithms could be used.
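A rough sketch of these cleaning steps in R is shown below. It is a simplification under assumed column values; the file name, the education mapping and the region mapping are illustrative, not the exact code used in the study.

library(dplyr)

survey <- read.csv("survey_results_public.csv")   # raw survey export (assumed file name)

survey <- survey %>%
  # Drop the variables that are hard to structure or that duplicate the target
  select(-CodingActivities, -DevType, -CompTotal, -CompFreq) %>%
  mutate(
    # Simple dummy variables
    RemoteWork = as.integer(RemoteWork == "Fully remote"),
    Gender     = as.integer(Gender == "Man"),
    Ethnicity  = as.integer(grepl("White", Ethnicity)),
    # Ordered category: the higher the number, the higher the education level
    EdLevel = case_when(
      grepl("doctoral|Professional", EdLevel) ~ 5,
      grepl("Master",    EdLevel) ~ 4,
      grepl("Bachelor",  EdLevel) ~ 3,
      grepl("Associate", EdLevel) ~ 2,
      TRUE ~ 1
    ),
    # Group countries into broader regions (mapping heavily abbreviated here)
    Region = case_when(
      Country == "United States of America" ~ "USA",
      Country %in% c("Germany", "France", "Netherlands") ~ "Western Europe",
      Country %in% c("Poland", "Czech Republic", "Hungary") ~ "Central and Eastern Europe",
      TRUE ~ "Other"
    )
  )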

The final dataset used for analysis:

Employment RemoteWork EdLevel YearsCode YearsCodePro OrgSize LanguageHaveWorkedWith DatabaseHaveWorkedWith PlatformHaveWorkedWith Age Gender Ethnicity WorkExp ConvertedCompYearly RegionAustralia.and.New.Zealand RegionCentral.and.Eastern.Europe RegionEastern.Asia RegionLatin.America.and.Caribbean RegionMiddle.East.and.Northern.Africa RegionNorth.America RegionRussia RegionSoutheastern.Asia RegionSouthern.Asia RegionSub.Saharan.Africa RegionUSA RegionWestern.Europe X.OpSysProfessional.use.Linux.based X.OpSysProfessional.use.macOS X.OpSysProfessional.use.Other X.OpSysProfessional.use.Windows
0 0 3 16 12 2 6 4 3 3 1 0 15 8400 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
0 1 2 1 0 1 10 4 2 2 1 1 1 9084 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
0 0 4 22 18 1 8 8 6 3 1 1 18 60588 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
0 0 3 38 30 1 4 2 3 4 1 0 32 20196 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
0 0 4 17 10 1 7 2 4 3 1 0 17 20196 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0

Exploratory Data Analysis

Once the data was ready for modelling, it was plotted and summarized to see which variables should be taken into consideration and whether any further transformations were needed. First, let’s take a look at the salary distribution:
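Below is a minimal sketch of the two panels, assuming the cleaned data frame is called dfm and the yearly salary column is ConvertedCompYearly:

par(mfrow = c(1, 2))
hist(dfm$ConvertedCompYearly, breaks = 50,
     main = "All salaries", xlab = "Yearly salary (USD)")

# Keep only salaries below 300,000 USD to remove the extreme outliers
dfm <- dfm[dfm$ConvertedCompYearly < 300000, ]
hist(dfm$ConvertedCompYearly, breaks = 50,
     main = "Salaries below 300,000 USD", xlab = "Yearly salary (USD)")
par(mfrow = c(1, 1))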

On the left side, we can see that the distribution is not suitable for modelling. A model built on this data would most likely perform worse than a naive guess. The largest salary included is 4M USD, which is a huge outlier that would disrupt the whole modelling process. The data was therefore limited to rows with salary < 300,000 USD. As we can see on the right side, the distribution looks much better. While the variable is still not normally distributed, its shape shouldn’t be a roadblock either. Let’s now see whether there are any clear relationships between this variable and the remaining ones.

Unsurprisingly, there is a positive correlation between salary and both years of coding and working experience. This indicates that both variables should be included in the model, which is in line with reality.

Both age and education seem to be significant here, as, in general, the higher the category, the higher the earnings. This means that older and/or better-educated IT professionals tend to make more money.

While organization size seems to influence salary, gender does not appear to be statistically significant here.

Unfortunately, the Ethnicity box plot (1 for white respondents, 0 for other groups) indicates that white respondents tend to make much more money than others. Apart from discrimination, this can be caused by the fact that regions with the largest share of white respondents tend to have stronger economies and, therefore, higher salaries.

As we can see, region is a feature that makes a big difference when it comes to salaries. IT professionals in the USA tend to make much more money than those in other regions, while those in Southeastern and Southern Asia make much less.

Let’s now look at the correlation plot to see how the variables relate to each other:
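The plot can be reproduced roughly as follows (the corrplot package is an assumption; any correlation heatmap would do):

library(corrplot)
corrplot(cor(dfm), method = "color", tl.cex = 0.6)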

As we can see, many variables in the data set are moderately correlated with salary. Before modelling, outliers were removed based on their z-scores so that they would not affect the predictions.
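A sketch of that outlier removal, assuming a standard cut-off of |z| < 3 on the salary variable (the exact threshold is not stated):

z   <- as.numeric(scale(dfm$ConvertedCompYearly))   # z-scores of the salary
dfm <- dfm[abs(z) < 3, ]                            # drop observations more than 3 SD from the mean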

Modelling

In order to predict the salary accurately, a classic linear regression was fitted first to see which variables are statistically significant and how they affect the salary:
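The output below comes from a call of roughly this form; the table of values printed after the coefficients suggests car::vif() was used to check for collinearity (a sketch, not the exact code):

library(car)

formula    <- ConvertedCompYearly ~ .   # all remaining predictors
model_full <- lm(formula, data = dfm)
summary(model_full)                     # coefficient table
vif(model_full)                         # variance inflation factors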

## 
## Call:
## lm(formula = formula, data = dfm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -177388  -22289   -6027   15059  283453 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            67315.15    4334.36  15.531  < 2e-16 ***
## Employment                              7350.49     933.37   7.875 3.62e-15 ***
## RemoteWork                              7319.17     648.76  11.282  < 2e-16 ***
## EdLevel                                 4350.64     349.39  12.452  < 2e-16 ***
## YearsCode                                284.26      84.39   3.368 0.000758 ***
## YearsCodePro                            1706.47     123.51  13.817  < 2e-16 ***
## OrgSize                                 6716.18     340.46  19.727  < 2e-16 ***
## LanguageHaveWorkedWith                  -295.58     127.08  -2.326 0.020036 *  
## DatabaseHaveWorkedWith                   429.29     199.87   2.148 0.031735 *  
## PlatformHaveWorkedWith                   229.95     293.17   0.784 0.432850    
## Age                                    -1796.57     628.55  -2.858 0.004265 ** 
## Gender                                  5540.01    1210.37   4.577 4.75e-06 ***
## Ethnicity                               5371.57     706.11   7.607 2.95e-14 ***
## WorkExp                                   14.68      98.07   0.150 0.881035    
## RegionAustralia.and.New.Zealand       -40848.91    1905.27 -21.440  < 2e-16 ***
## RegionCentral.and.Eastern.Europe      -80959.32    1303.30 -62.119  < 2e-16 ***
## RegionEastern.Asia                    -71930.72    2930.49 -24.546  < 2e-16 ***
## RegionLatin.America.and.Caribbean     -94381.42    1333.95 -70.753  < 2e-16 ***
## RegionMiddle.East.and.Northern.Africa -67470.57    1806.18 -37.355  < 2e-16 ***
## RegionNorth.America                   -36679.61    1636.02 -22.420  < 2e-16 ***
## RegionRussia                          -85236.29    2886.81 -29.526  < 2e-16 ***
## RegionSoutheastern.Asia               -94058.84    2346.37 -40.087  < 2e-16 ***
## RegionSouthern.Asia                   -98101.98    1345.49 -72.912  < 2e-16 ***
## RegionSub.Saharan.Africa              -86856.06    2398.21 -36.217  < 2e-16 ***
## RegionWestern.Europe                  -61746.04     900.91 -68.537  < 2e-16 ***
## X.OpSysProfessional.use.Linux.based     9883.40    3588.57   2.754 0.005891 ** 
## X.OpSysProfessional.use.macOS          17348.75    3611.11   4.804 1.57e-06 ***
## X.OpSysProfessional.use.Windows         -228.95    3617.09  -0.063 0.949532    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38730 on 16157 degrees of freedom
## Multiple R-squared:  0.5566, Adjusted R-squared:  0.5559 
## F-statistic: 751.3 on 27 and 16157 DF,  p-value: < 2.2e-16
##                            Employment                            RemoteWork 
##                              1.110820                              1.135345 
##                               EdLevel                             YearsCode 
##                              1.074805                              6.087117 
##                          YearsCodePro                               OrgSize 
##                              9.538270                              1.050031 
##                LanguageHaveWorkedWith                DatabaseHaveWorkedWith 
##                              1.355443                              1.487622 
##                PlatformHaveWorkedWith                                   Age 
##                              1.341481                              3.360013 
##                                Gender                             Ethnicity 
##                              1.014814                              1.341081 
##                               WorkExp       RegionAustralia.and.New.Zealand 
##                              7.090206                              1.120345 
##      RegionCentral.and.Eastern.Europe                    RegionEastern.Asia 
##                              1.385183                              1.075025 
##     RegionLatin.America.and.Caribbean RegionMiddle.East.and.Northern.Africa 
##                              1.402368                              1.200069 
##                   RegionNorth.America                          RegionRussia 
##                              1.147695                              1.059490 
##               RegionSoutheastern.Asia                   RegionSouthern.Asia 
##                              1.158338                              1.773945 
##              RegionSub.Saharan.Africa                  RegionWestern.Europe 
##                              1.103116                              1.949226 
##   X.OpSysProfessional.use.Linux.based         X.OpSysProfessional.use.macOS 
##                             33.545523                             30.252799 
##       X.OpSysProfessional.use.Windows 
##                             27.972604

As we can see, the PlatformHaveWorkedWith, X.OpSysProfessional.use.Windows and WorkExp variables turned out to be insignificant - probably because of the existence of other, similar variables in the data set rather than because they are irrelevant in general. As a next step, the data was split into training and testing sets and a final formula that included only the significant variables was used:
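A sketch of this step; the reduced formula simply drops the insignificant predictors, and the 70/30 split proportion is an assumption (it is roughly consistent with the training set of 11,331 rows reported later):

set.seed(1234)
train_idx  <- sample(nrow(dfm), size = round(0.7 * nrow(dfm)))
data.train <- dfm[train_idx, ]
data.test  <- dfm[-train_idx, ]

# Keep only the variables that were significant in the full model
formula <- ConvertedCompYearly ~ . - YearsCode - WorkExp -
  PlatformHaveWorkedWith - X.OpSysProfessional.use.Windows

model_lm <- lm(formula, data = dfm)   # the summary below refers to this reduced model
summary(model_lm)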

## 
## Call:
## lm(formula = formula, data = dfm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -177344  -22214   -6012   14913  284655 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            67439.03    2430.25  27.750  < 2e-16 ***
## Employment                              7368.77     928.49   7.936 2.22e-15 ***
## RemoteWork                              7317.75     648.74  11.280  < 2e-16 ***
## EdLevel                                 4457.78     346.48  12.866  < 2e-16 ***
## YearsCodePro                            1990.14      66.58  29.890  < 2e-16 ***
## OrgSize                                 6706.94     340.12  19.719  < 2e-16 ***
## LanguageHaveWorkedWith                  -236.21     124.50  -1.897  0.05781 .  
## DatabaseHaveWorkedWith                   458.98     187.11   2.453  0.01418 *  
## Age                                    -1520.34     574.81  -2.645  0.00818 ** 
## Gender                                  5595.21    1210.28   4.623 3.81e-06 ***
## Ethnicity                               5468.52     705.44   7.752 9.59e-15 ***
## RegionAustralia.and.New.Zealand       -40716.98    1905.09 -21.373  < 2e-16 ***
## RegionCentral.and.Eastern.Europe      -81000.92    1301.16 -62.253  < 2e-16 ***
## RegionEastern.Asia                    -72007.17    2929.61 -24.579  < 2e-16 ***
## RegionLatin.America.and.Caribbean     -94582.13    1332.30 -70.992  < 2e-16 ***
## RegionMiddle.East.and.Northern.Africa -67673.27    1804.61 -37.500  < 2e-16 ***
## RegionNorth.America                   -36523.07    1633.97 -22.352  < 2e-16 ***
## RegionRussia                          -85350.88    2886.22 -29.572  < 2e-16 ***
## RegionSoutheastern.Asia               -94108.59    2344.18 -40.146  < 2e-16 ***
## RegionSouthern.Asia                   -98305.81    1341.48 -73.282  < 2e-16 ***
## RegionSub.Saharan.Africa              -86964.35    2397.64 -36.271  < 2e-16 ***
## RegionWestern.Europe                  -61573.21     896.47 -68.684  < 2e-16 ***
## X.OpSysProfessional.use.Linux.based    10154.57     762.07  13.325  < 2e-16 ***
## X.OpSysProfessional.use.macOS          17532.33     803.41  21.822  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38740 on 16161 degrees of freedom
## Multiple R-squared:  0.5563, Adjusted R-squared:  0.5557 
## F-statistic:   881 on 23 and 16161 DF,  p-value: < 2.2e-16
##                            Employment                            RemoteWork 
##                              1.098690                              1.134711 
##                               EdLevel                          YearsCodePro 
##                              1.056460                              2.770749 
##                               OrgSize                LanguageHaveWorkedWith 
##                              1.047422                              1.300286 
##                DatabaseHaveWorkedWith                                   Age 
##                              1.303091                              2.808662 
##                                Gender                             Ethnicity 
##                              1.014153                              1.337880 
##       RegionAustralia.and.New.Zealand      RegionCentral.and.Eastern.Europe 
##                              1.119577                              1.379942 
##                    RegionEastern.Asia     RegionLatin.America.and.Caribbean 
##                              1.073843                              1.398196 
## RegionMiddle.East.and.Northern.Africa                   RegionNorth.America 
##                              1.197382                              1.144233 
##                          RegionRussia               RegionSoutheastern.Asia 
##                              1.058527                              1.155596 
##                   RegionSouthern.Asia              RegionSub.Saharan.Africa 
##                              1.762513                              1.102040 
##                  RegionWestern.Europe   X.OpSysProfessional.use.Linux.based 
##                              1.929084                              1.512057 
##         X.OpSysProfessional.use.macOS 
##                              1.496729
##          MSE    RMSE      MAE    MedAE        R2
## 1 1477287889 38435.5 27316.81 19994.25 0.5516044
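The MSE/RMSE/MAE/MedAE/R2 rows reported here and in the following sections come from a small helper of roughly this form (the original function is not shown, so this is an assumed reimplementation) applied to the training and testing sets:

regression_metrics <- function(real, predicted) {
  data.frame(
    MSE   = mean((real - predicted)^2),
    RMSE  = sqrt(mean((real - predicted)^2)),
    MAE   = mean(abs(real - predicted)),
    MedAE = median(abs(real - predicted)),
    R2    = 1 - sum((real - predicted)^2) / sum((real - mean(real))^2)
  )
}

regression_metrics(data.test$ConvertedCompYearly,
                   predict(model_lm, newdata = data.test))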

The basic linear regression model reveals that the variables the study is based on are good predictors of salary. We can move forward with the analysis, as the variables are significant and there is no collinearity issue here (all VIF values are low).

Regression tree

First, a regression tree with default hyperparameters was used to predict the dependent variable:
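A sketch of the fit and the tree plot (the rpart and rpart.plot packages are assumptions):

library(rpart)
library(rpart.plot)

tree_default <- rpart(formula, data = data.train)   # default hyperparameters
rpart.plot(tree_default)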

It is noticeable that the algorithm takes into consideration only Ethnicity, years of coding and region. This suggests that these variables are the most important ones when it comes to salary formation. There appears to be a breaking point around 6 years of coding, as this number serves as the split that creates homogeneous groups for both ethnicity categories. In general, the overall shape of the tree and the information it conveys seem to be in line with logic and reality. Let’s now take a look at the variable importance plot.

Apart from the variables mentioned above, it seems that Age also matters where salaries are concerned, which is a reasonable conclusion.
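The importance plot can be obtained directly from the fitted tree, for example:

imp <- tree_default$variable.importance
barplot(sort(imp), horiz = TRUE, las = 1, cex.names = 0.6,
        main = "Variable importance (regression tree)")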

##          MSE     RMSE      MAE   MedAE        R2
## 1 2099034616 45815.22 33554.61 25169.2 0.3849208
##          MSE     RMSE      MAE    MedAE        R2
## 1 2108073393 45913.76 33661.84 25431.98 0.3601444

The fit of the predictions to the actual data is very poor, with an R^2 of only 36% on the testing set. Let’s try to improve it by generating a very deep tree and pruning it afterwards.
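A sketch of the deep tree, grown by relaxing the default complexity stopping rule (the exact control values used are an assumption):

tree_deep <- rpart(formula, data = data.train,
                   control = rpart.control(cp = 1e-8))
rpart.plot(tree_deep)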

This tree is huge and difficult to read or interpret. It’ll be pruned now in order to keep it simple.
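The pruning follows the standard rpart workflow: pick the cp with the lowest cross-validated error from the complexity table and prune to it (a sketch):

opt_cp <- tree_deep$cptable[which.min(tree_deep$cptable[, "xerror"]), "CP"]
opt_cp                                         # printed below
tree_pruned <- prune(tree_deep, cp = opt_cp)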

## [1] 1.00074e-08

The optimal complexity parameter here is close to 0, which in general is not safe: it allows the tree to grow almost as much as it wants, which can be detrimental to obtaining accurate predictions on new data. In our case, the pruned tree is the same as the original one, so it is very likely that the predictions it generates will not fit the data well.

##          MSE     RMSE      MAE    MedAE        R2
## 1 4778810788 69128.94 54119.81 43923.64 -2.385981
##          MSE     RMSE      MAE    MedAE        R2
## 1 4778810788 69128.94 54119.81 43923.64 -2.385981

As expected, the model generates huge errors and is not usable at all. Let’s now see how cross-validation handles this data.
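A sketch of the cross-validated tree; the cp grid and the use of MAE as the selection metric follow the output shown below (caret is assumed):

library(caret)

set.seed(1234)
ctrl_cv <- trainControl(method = "cv", number = 10)

tree_cv <- train(formula, data = data.train,
                 method    = "rpart",
                 metric    = "MAE",
                 trControl = ctrl_cv,
                 tuneGrid  = expand.grid(cp = seq(0, 0.03, by = 0.001)))
tree_cv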

## CART 
## 
## 11331 samples
##    23 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10197, 10199, 10197, 10198, 10199, 10198, ... 
## Resampling results across tuning parameters:
## 
##   cp     RMSE      Rsquared   MAE     
##   0.000  42550.81  0.4856900  30195.67
##   0.001  41006.80  0.5078980  29378.03
##   0.002  42355.76  0.4747963  30584.09
##   0.003  43011.86  0.4582194  31050.44
##   0.004  43715.29  0.4402383  31528.23
##   0.005  44196.55  0.4280036  31959.52
##   0.006  44367.95  0.4235386  32250.43
##   0.007  44972.68  0.4078068  32732.25
##   0.008  45238.41  0.4004826  32997.52
##   0.009  45497.98  0.3938246  33226.52
##   0.010  45809.66  0.3854272  33502.50
##   0.011  46285.01  0.3727902  34117.36
##   0.012  46541.76  0.3658099  34631.49
##   0.013  46848.16  0.3574761  34934.35
##   0.014  47623.84  0.3359541  35799.19
##   0.015  47703.24  0.3336658  35924.23
##   0.016  48082.34  0.3230754  36352.44
##   0.017  48272.26  0.3177439  36554.14
##   0.018  48272.26  0.3177439  36554.14
##   0.019  48272.26  0.3177439  36554.14
##   0.020  48272.26  0.3177439  36554.14
##   0.021  48272.26  0.3177439  36554.14
##   0.022  48272.26  0.3177439  36554.14
##   0.023  48272.26  0.3177439  36554.14
##   0.024  48272.26  0.3177439  36554.14
##   0.025  48542.44  0.3101445  36786.55
##   0.026  49108.68  0.2938699  37264.86
##   0.027  49530.62  0.2818053  37682.12
##   0.028  49606.87  0.2789161  37736.67
##   0.029  49901.40  0.2704573  37989.99
##   0.030  50857.29  0.2425214  39119.95
## 
## MAE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.001.
##          MSE     RMSE      MAE    MedAE        R2
## 1 1593213395 39915.08 28581.81 20310.16 0.5331414
##          MSE     RMSE      MAE    MedAE        R2
## 1 1650434650 40625.54 29199.44 20684.91 0.4990498

The optimal value of the complexity parameter is now 0.001. Even though the fit to the underlying data is much better, it is still far from perfect and worse than that of the simple linear regression.

Random Forest

The regression tree did not perform well enough on this data set. Now, the random forest algorithm will be applied and its fit to the data will be assessed. Several models with different numbers of trees and numbers of features sampled at each split were created; the results are available below:
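The calls below are reconstructed from the printed output (the randomForest package; the seed is an assumption):

library(randomForest)

set.seed(1234)
rf1 <- randomForest(formula, data = data.train, ntree = 400)
rf2 <- randomForest(formula, data = data.train, ntree = 800)
rf3 <- randomForest(formula, data = data.train, ntree = 800, mtry = 3)
rf4 <- randomForest(formula, data = data.train, ntree = 800, mtry = 6)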

## 
## Call:
##  randomForest(formula = formula, data = data.train, ntree = 400) 
##                Type of random forest: regression
##                      Number of trees: 400
## No. of variables tried at each split: 7
## 
##           Mean of squared residuals: 1537513219
##                     % Var explained: 54.95
## 
## Call:
##  randomForest(formula = formula, data = data.train, ntree = 800) 
##                Type of random forest: regression
##                      Number of trees: 800
## No. of variables tried at each split: 7
## 
##           Mean of squared residuals: 1538517871
##                     % Var explained: 54.92
## 
## Call:
##  randomForest(formula = formula, data = data.train, ntree = 800,      mtry = 3) 
##                Type of random forest: regression
##                      Number of trees: 800
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 1612280129
##                     % Var explained: 52.76
## 
## Call:
##  randomForest(formula = formula, data = data.train, ntree = 800,      mtry = 6) 
##                Type of random forest: regression
##                      Number of trees: 800
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 1534371203
##                     % Var explained: 55.04

The best model turned out to be the one with 800 trees and 6 features sampled at each split. It generates the following predictions:

##         MSE     RMSE      MAE    MedAE        R2
## 1 624606435 24992.13 17815.71 13060.63 0.8169718
##          MSE     RMSE      MAE    MedAE        R2
## 1 1507392686 38825.16 27827.64 20624.59 0.5424668

There is a clear overfitting problem here: the predictions are really good on the training set and much worse on the testing set. Even if we wanted to use this model for predictions, which is not advised, it only matches what the simple linear regression was able to achieve. As a next step, cross-validation will be used to obtain the optimal parameters.
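A sketch of the cross-validated forest; with 23 predictors, caret’s default grid of three mtry values gives 2, 12 and 23, which matches the output below:

set.seed(1234)
rf_cv <- train(formula, data = data.train,
               method     = "rf",
               trControl  = ctrl_cv,
               tuneLength = 3)
rf_cv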

## Random Forest 
## 
## 11331 samples
##    23 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10197, 10199, 10196, 10199, 10198, 10198, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    2    42574.47  0.5207408  31718.05
##   12    39428.86  0.5455047  28229.64
##   23    39772.74  0.5396773  28478.89
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 12.
##         MSE     RMSE      MAE    MedAE       R2
## 1 421871975 20539.52 14422.14 10349.91 0.876379
##          MSE     RMSE      MAE    MedAE       R2
## 1 1524982018 39051.02 27916.81 20601.51 0.537128

The optimal number of features here is 12. The model generates predictions that are slightly worse than those of the forest with 800 trees and 6 features. Unfortunately, there is an overfitting problem here as well. All in all, the random forest algorithm does really well on the training set, but it learns patterns specific to this dataset rather than salary formation in general.

Extreme Gradient Boosting

The last algorithm used here is XGBoost, which has been quite popular lately. It combines multiple weak models to create a strong predictive model. In order to build an optimal one, the parameters are first set to starting values and then optimized one by one, which results in the following logic:

parameters_xgb <- expand.grid(nrounds = seq(20, 300, 10),
                              max_depth = c(6),
                              eta = c(0.25), 
                              gamma = 1,
                              colsample_bytree = c(0.2),
                              min_child_weight = c(85),
                              subsample = 0.8)

set.seed(1234)
ctrl_cv3 <- trainControl(method = "cv")

#data.xgb <- train(formula,
#                     data = data.train,
#                     method = "xgbTree",
#                     trControl = ctrl_cv3,
#                     tuneGrid  = parameters_xgb)
#
#saveRDS(data.xgb, file = "data.xgb.rds")
data.xgb <- readRDS("data.xgb.rds")
data.xgb
## eXtreme Gradient Boosting 
## 
## 11331 samples
##    23 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10197, 10199, 10197, 10198, 10199, 10198, ... 
## Resampling results across tuning parameters:
## 
##   nrounds  RMSE      Rsquared   MAE     
##    20      44670.21  0.4384267  33258.89
##    30      42995.25  0.4701945  31641.21
##    40      41879.79  0.4935689  30528.49
##    50      41131.25  0.5094104  29807.23
##    60      40537.61  0.5224691  29255.32
##    70      39989.56  0.5343628  28758.74
##    80      39650.95  0.5415643  28413.22
##    90      39383.64  0.5474202  28183.25
##   100      39165.86  0.5523496  27958.57
##   110      38990.48  0.5561393  27814.42
##   120      38806.35  0.5599565  27624.87
##   130      38704.80  0.5620959  27526.79
##   140      38605.63  0.5639837  27438.09
##   150      38524.89  0.5656758  27350.91
##   160      38457.62  0.5670419  27280.39
##   170      38406.40  0.5680670  27228.96
##   180      38364.92  0.5689036  27178.55
##   190      38333.20  0.5696576  27158.55
##   200      38324.66  0.5697760  27136.49
##   210      38287.86  0.5705445  27113.48
##   220      38279.26  0.5706567  27089.13
##   230      38257.77  0.5711807  27060.03
##   240      38242.68  0.5714808  27055.56
##   250      38234.74  0.5716138  27049.56
##   260      38214.89  0.5720852  27030.72
##   270      38225.47  0.5718665  27032.03
##   280      38205.66  0.5722876  26995.87
##   290      38206.41  0.5722318  27012.95
##   300      38202.48  0.5723114  27021.04
## 
## Tuning parameter 'max_depth' was held constant at a value of 6
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 85
## 
## Tuning parameter 'subsample' was held constant at a value of 0.8
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 300, max_depth = 6, eta
##  = 0.25, gamma = 1, colsample_bytree = 0.2, min_child_weight = 85 and
##  subsample = 0.8.
parameters_xgb2 <- expand.grid(nrounds = 300,
                               max_depth = seq(2, 6, 2),
                               eta = c(0.25), 
                               gamma = 1,
                               colsample_bytree = c(0.2),
                               min_child_weight = seq(20, 220, 100),
                               subsample = 0.8)

set.seed(123456789)
#datausa.xgb2 <- train(formula,
#                      data = data.train,
#                      method = "xgbTree",
#                      trControl = ctrl_cv3,
#                      tuneGrid  = parameters_xgb2)
#saveRDS(datausa.xgb2, file = "datausa.xgb2.rds")
datausa.xgb2 <- readRDS("datausa.xgb2.rds")
datausa.xgb2
## eXtreme Gradient Boosting 
## 
## 11331 samples
##    23 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10198, 10198, 10197, 10198, 10198, 10197, ... 
## Resampling results across tuning parameters:
## 
##   max_depth  min_child_weight  RMSE      Rsquared   MAE     
##   2           20               37968.12  0.5778751  26819.42
##   2          120               39476.06  0.5436061  27852.80
##   2          220               41447.24  0.4968708  29664.20
##   4           20               38092.10  0.5749326  26902.38
##   4          120               39439.74  0.5443360  27763.27
##   4          220               41337.35  0.4995594  29519.98
##   6           20               38534.52  0.5653002  27256.59
##   6          120               39694.50  0.5384698  28030.83
##   6          220               41453.17  0.4967633  29673.96
## 
## Tuning parameter 'nrounds' was held constant at a value of 300
## Tuning
##  'colsample_bytree' was held constant at a value of 0.2
## Tuning
##  parameter 'subsample' was held constant at a value of 0.8
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 300, max_depth = 2, eta
##  = 0.25, gamma = 1, colsample_bytree = 0.2, min_child_weight = 20 and
##  subsample = 0.8.
parameters_xgb3 <- expand.grid(nrounds = 300,
                               max_depth = 2,
                               eta = c(0.25), 
                               gamma = 1,
                               colsample_bytree = seq(0.6, 0.9, 0.1),
                               min_child_weight = 20,
                               subsample = 0.8)
set.seed(123456789)
#datausa.xgb3 <- train(formula,
#                      data = data.train,
#                      method = "xgbTree",
#                      trControl = ctrl_cv3,
#                      tuneGrid  = parameters_xgb3)
#saveRDS(datausa.xgb3, file = "datausa.xgb3.rds")
datausa.xgb3 <- readRDS("datausa.xgb3.rds")
datausa.xgb3
## eXtreme Gradient Boosting 
## 
## 11331 samples
##    23 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10198, 10197, 10198, 10199, 10199, 10197, ... 
## Resampling results across tuning parameters:
## 
##   colsample_bytree  RMSE      Rsquared   MAE     
##   0.6               38000.32  0.5770720  26755.70
##   0.7               37927.65  0.5786895  26712.55
##   0.8               37988.11  0.5773240  26750.53
##   0.9               37980.82  0.5775382  26740.47
## 
## Tuning parameter 'nrounds' was held constant at a value of 300
## Tuning
##  held constant at a value of 20
## Tuning parameter 'subsample' was held
##  constant at a value of 0.8
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 300, max_depth = 2, eta
##  = 0.25, gamma = 1, colsample_bytree = 0.7, min_child_weight = 20 and
##  subsample = 0.8.
parameters_xgb4 <- expand.grid(nrounds = 300,
                               max_depth = 2,
                               eta = c(0.25), 
                               gamma = 1,
                               colsample_bytree = 0.7,
                               min_child_weight = 20,
                               subsample = c(0.6, 0.7, 0.75, 0.8, 0.85, 0.9))

set.seed(123456789)
#datausa.xgb4 <- train(formula,
#                      data = data.train,
#                      method = "xgbTree",
#                      trControl = ctrl_cv3,
#                      tuneGrid  = parameters_xgb4)
#saveRDS(datausa.xgb4, file = "datausa.xgb4.rds")
datausa.xgb4 <- readRDS("datausa.xgb4.rds")
datausa.xgb4
## eXtreme Gradient Boosting 
## 
## 11331 samples
##    23 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10198, 10197, 10198, 10199, 10199, 10197, ... 
## Resampling results across tuning parameters:
## 
##   subsample  RMSE      Rsquared   MAE     
##   0.60       38044.07  0.5760558  26790.66
##   0.70       37981.35  0.5774379  26725.84
##   0.75       37980.29  0.5775076  26702.90
##   0.80       37994.11  0.5771910  26745.48
##   0.85       37960.53  0.5778642  26750.90
##   0.90       37979.37  0.5774959  26728.88
## 
## Tuning parameter 'nrounds' was held constant at a value of 300
## Tuning
##  held constant at a value of 0.7
## Tuning parameter 'min_child_weight' was
##  held constant at a value of 20
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 300, max_depth = 2, eta
##  = 0.25, gamma = 1, colsample_bytree = 0.7, min_child_weight = 20 and
##  subsample = 0.85.
parameters_xgb5 <- expand.grid(nrounds = 600,
                               max_depth = 2,
                               eta = 0.12, 
                               gamma = 1,
                               colsample_bytree = 0.7,
                               min_child_weight = 20,
                               subsample = 0.75)

set.seed(123456789)
#datausa.xgb5 <- train(formula,
#                      data = data.train,
#                      method = "xgbTree",
#                      trControl = ctrl_cv3,
#                      tuneGrid  = parameters_xgb5)
#saveRDS(datausa.xgb5, file = "datausa.xgb5.rds")
datausa.xgb5 <- readRDS("datausa.xgb5.rds")
datausa.xgb5
## eXtreme Gradient Boosting 
## 
## 11331 samples
##    23 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10198, 10197, 10198, 10199, 10199, 10197, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   37906.45  0.5790887  26656.33
## 
## Tuning parameter 'nrounds' was held constant at a value of 600
## Tuning
##  held constant at a value of 20
## Tuning parameter 'subsample' was held
##  constant at a value of 0.75
parameters_xgb6 <- expand.grid(nrounds = 1200,
                               max_depth = 2,
                               eta = 0.06, 
                               gamma = 1,
                               colsample_bytree = 0.7,
                               min_child_weight = 20,
                               subsample = 0.75)

set.seed(123456789)
#datausa.xgb6 <- train(formula,
#                      data = data.train,
#                      method = "xgbTree",
#                      trControl = ctrl_cv3,
#                      tuneGrid  = parameters_xgb6)
#saveRDS(datausa.xgb6, file = "datausa.xgb6.rds")
datausa.xgb6 <- readRDS("datausa.xgb6.rds")
datausa.xgb6
## eXtreme Gradient Boosting 
## 
## 11331 samples
##    23 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10198, 10197, 10198, 10199, 10199, 10197, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   37850.91  0.58035   26610.59
## 
## Tuning parameter 'nrounds' was held constant at a value of 1200
## 
## Tuning parameter 'min_child_weight' was held constant at a value of 20
## 
## Tuning parameter 'subsample' was held constant at a value of 0.75
datausa.xgb6_pred <- predict(datausa.xgb6,data.test)
datausa.xgb6_predtrain <- predict(datausa.xgb6,data.train)
##          MSE     RMSE      MAE   MedAE        R2
## 1 1358057734 36851.83 25939.84 18424.8 0.6020489
##          MSE     RMSE      MAE    MedAE        R2
## 1 1391691006 37305.38 26315.03 18905.65 0.5775853

Eventually, the best model created over the course of this analysis was able to predict salaries more efficiently than the others, with ~58% R^2 and a 26,315 USD mean absolute error on the testing data set. While these results are the best obtained so far, they are far from perfect and only slightly better than those of the linear regression. On a more positive note, there is no overfitting problem here, as the fit is similar on both the training and testing sets. Let’s see how the real and predicted values behave on a plot:
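A sketch of the diagnostic plot discussed below: test-set predictions of the final XGBoost model against the actual salaries.

plot(data.test$ConvertedCompYearly, datausa.xgb6_pred,
     xlab = "Actual yearly salary (USD)", ylab = "Predicted yearly salary (USD)",
     pch = 16, col = rgb(0, 0, 0, 0.2))
abline(a = 0, b = 1, col = "red", lwd = 2)   # perfect-prediction line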

The real and predicted values are mainly concentrated along the diagonal, which is good. However, there are clearly many points that are very far from where they should be. Let’s now look at the final predictions and their fit to the data: in order to see whether we can further improve them, an ensemble technique will be used and the two best models will be combined to predict the data:
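The ensemble here is assumed to be a simple (equal-weight) average of the random-forest and XGBoost test-set predictions; regression_metrics is the helper sketched earlier:

rf4_pred   <- predict(rf4, newdata = data.test)
final.pred <- (rf4_pred + datausa.xgb6_pred) / 2

regression_metrics(data.test$ConvertedCompYearly, final.pred)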

##         final.pred     rf4_pred datausa.xgb6_pred
## MSE   1.410216e+09 1.507393e+09      1.391691e+09
## RMSE  3.755284e+04 3.882516e+04      3.730538e+04
## MAE   2.661963e+04 2.782764e+04      2.631503e+04
## MedAE 1.925054e+04 2.062459e+04      1.890565e+04
## R2    5.719624e-01 5.424668e-01      5.775853e-01

It is clear that the ensemble technique will not improve anything here. With that said, we can conclude that the best predictions are generated by the XGBoost algorithm alone.

Evaluation


By using three different algorithms on the StackOverflow survey that examined IT professionals’ salaries, several models predicting income were built and compared. Out of all of them, XGBoost turned out to capture the data to the greatest extent and was able to generate the most accurate predictions. Still, the overall fit is not very satisfying: the model’s MAE is 26,315, which means that an average prediction is more than 25k USD away from its true value. There may be many reasons for this state of affairs. Some are independent of the modelling, since there are many under- and overpaid employees and a salary does not always reflect someone’s skills. On top of that, pay also depends on character traits that are not taken into consideration here, as well as on many external, economy-related factors. Additionally, there is a group of developers who live in one region but work remotely for an employer in another, which also affects the study. On the other hand, there are more sophisticated techniques that could be used here and might better approximate salary formation.


Conclusions

Based on the evidence presented above, the analysis can be concluded with the following findings:

  1. Ethnicity, years of coding experience, age and region are the factors that influence an IT professional’s salary most significantly.
  2. Other variables also matter when salary formation is considered: remote/hybrid/on-site work, education level, company size, tech stack, and the operating system used.
  3. The models generated over the course of this analysis cannot predict the salary very efficiently or accurately.
  4. Modelling this phenomenon is quite a challenge since there are many factors that are difficult to include in a data set and are relevant to the study.

Possible improvements:

  1. Taking into consideration countries instead of regions.
  2. Further transformation of salary variable.
  3. Using more advanced techniques.
  4. Adding more features to the dataset.
  5. Including variables related to soft skills in the data set.
  6. Including information on job position.