IT specialists are employed across a variety of sectors, such as finance, healthcare, education, and government. These roles typically demand strong analytical and problem-solving abilities, as well as familiarity with programming languages, operating systems, and other technological tools. Many IT professions also call for continual training and certification in order to stay current with emerging technologies and industry trends.
IT pay varies with the specific role, level of expertise, location, and other factors. Generally speaking, however, IT occupations tend to be well paid compared to many other professions, which is why IT salaries are such a popular topic of discussion.
It’s important to remember that wages can vary widely depending on region, sector, level of education, and experience. As in any other profession, remuneration for a given job category can also differ significantly between companies and sets of duties. This study aims to uncover the most important factors that influence IT professionals’ income and to predict it based on individual characteristics such as years of experience, education, and so on.
Around 70,000 engineers shared their learning and development strategies, tool preferences, and goals with StackOverflow in May 2022. Topics covered in the survey included developer profile, technology, work, community, and professional developers.
The dataset on which the model was built was made available on the Kaggle portal and can be accessed at the web link. Initially, the dataset included the following columns:
| Employment | RemoteWork | CodingActivities | EdLevel | YearsCode | YearsCodePro | DevType | OrgSize | Country | CompTotal | CompFreq | LanguageHaveWorkedWith | DatabaseHaveWorkedWith | PlatformHaveWorkedWith | OpSysProfessional use | Age | Gender | Ethnicity | WorkExp | ConvertedCompYearly |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Employed, full-time;Independent contractor, freelancer, or self-employed | Fully remote | Hobby;Contribute to open-source projects;Freelance/contract work | Bachelor’s degree (B.A., B.S., B.Eng., etc.) | 12 | 10 | Engineering manager | 20 to 99 employees | United States of America | 194400 | Yearly | C#;HTML/CSS;JavaScript;PowerShell;Python;Rust;SQL | Couchbase;CouchDB;Microsoft SQL Server;MongoDB;MySQL;PostgreSQL;Redis | AWS;Microsoft Azure | Linux-based;macOS;Windows | 35-44 years old | Man | White | 14 | 194400 |
| Employed, full-time | Hybrid (some remote, some in-person) | Hobby | Bachelor’s degree (B.A., B.S., B.Eng., etc.) | 12 | 5 | Developer, full-stack | 2 to 9 employees | United States of America | 65000 | Yearly | C;HTML/CSS;Rust;SQL;Swift;TypeScript | PostgreSQL | AWS | macOS | 25-34 years old | Man | White | 5 | 65000 |
| Employed, full-time;Independent contractor, freelancer, or self-employed | Fully remote | Hobby;Freelance/contract work | Master’s degree (M.A., M.S., M.Eng., MBA, etc.) | 11 | 5 | Developer, full-stack;Academic researcher;DevOps specialist | 5,000 to 9,999 employees | United States of America | 110000 | Yearly | HTML/CSS;JavaScript;PHP;Python;R;Ruby;Scala | Elasticsearch;MongoDB;Neo4j;PostgreSQL | AWS;DigitalOcean;Heroku | macOS | 25-34 years old | Man | White | 5 | 110000 |
| Employed, full-time | Hybrid (some remote, some in-person) | I don’t code outside of work | Master’s degree (M.A., M.S., M.Eng., MBA, etc.) | 5 | 4 | Developer, full-stack | 20 to 99 employees | Italy | 32000 | Yearly | Python;SQL;TypeScript | MySQL;Redis;SQLite | Google Cloud;OVH;VMware | Windows Subsystem for Linux (WSL) | 25-34 years old | Man | European | 4 | 34126 |
| Employed, full-time | Fully remote | Hobby;Contribute to open-source projects | Something else | 25 | 20 | Developer, back-end | 100 to 499 employees | Canada | 125000 | Yearly | C#;SQL;TypeScript | Microsoft SQL Server;PostgreSQL;Redis | Google Cloud;Microsoft Azure | Windows | 35-44 years old | Man | White;North American | 23 | 97605 |
Since the vast majority of the data is qualitative, many transformations were applied during the study. First, the variables CodingActivities, DevType, CompTotal and CompFreq were removed from the dataset. The first two were very difficult to structure in a way that would benefit the study; for instance, DevType often lists several different job positions for a single person, and CodingActivities behaves similarly. The other two carry essentially the same information (just in a slightly different form) as ConvertedCompYearly, which is our dependent variable. Then, qualitative variables were transformed either into dummy variables (Employment, RemoteWork, Gender, Ethnicity, etc.) or into ordered categorical variables (EdLevel, Age, OrgSize, where a higher value corresponds to a higher education level, an older age, or a larger organization). The Country variable was grouped into regions (Western Europe, North America, etc.). Finally, the data was converted to numerical type so that regression algorithms could be used. A sketch of these transformations is shown below.
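The exact preprocessing code is not part of this report; the following is only a minimal sketch of how these steps could look in R. The data frame name df, the specific recoding rules, and the region_lookup table are assumptions.

library(dplyr)

# Minimal sketch of the preprocessing described above (assumed, not the original code);
# region_lookup is a hypothetical Country -> Region mapping table
df <- df %>%
  select(-CodingActivities, -DevType, -CompTotal, -CompFreq) %>%
  mutate(
    # dummy recodings of binary-style variables (assumed coding)
    Employment = ifelse(Employment == "Employed, full-time", 0, 1),
    RemoteWork = ifelse(RemoteWork == "Fully remote", 0, 1),
    Gender     = ifelse(Gender == "Man", 1, 0),
    Ethnicity  = ifelse(grepl("White", Ethnicity), 1, 0),
    # ordered categories: higher value = higher education level (assumed mapping)
    EdLevel = case_when(
      grepl("Master", EdLevel)   ~ 4,
      grepl("Bachelor", EdLevel) ~ 3,
      TRUE                       ~ 2
    ),
    # multi-select columns reduced to counts of technologies used
    LanguageHaveWorkedWith = lengths(strsplit(LanguageHaveWorkedWith, ";")),
    DatabaseHaveWorkedWith = lengths(strsplit(DatabaseHaveWorkedWith, ";")),
    PlatformHaveWorkedWith = lengths(strsplit(PlatformHaveWorkedWith, ";"))
  ) %>%
  left_join(region_lookup, by = "Country") %>%
  select(-Country)

# one-hot encode the region so that everything in the data frame is numeric
df <- cbind(select(df, -Region), model.matrix(~ Region - 1, data = df))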
The final dataset used for analysis:
| Employment | RemoteWork | EdLevel | YearsCode | YearsCodePro | OrgSize | LanguageHaveWorkedWith | DatabaseHaveWorkedWith | PlatformHaveWorkedWith | Age | Gender | Ethnicity | WorkExp | ConvertedCompYearly | RegionAustralia.and.New.Zealand | RegionCentral.and.Eastern.Europe | RegionEastern.Asia | RegionLatin.America.and.Caribbean | RegionMiddle.East.and.Northern.Africa | RegionNorth.America | RegionRussia | RegionSoutheastern.Asia | RegionSouthern.Asia | RegionSub.Saharan.Africa | RegionUSA | RegionWestern.Europe | X.OpSysProfessional.use.Linux.based | X.OpSysProfessional.use.macOS | X.OpSysProfessional.use.Other | X.OpSysProfessional.use.Windows |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 16 | 12 | 2 | 6 | 4 | 3 | 3 | 1 | 0 | 15 | 8400 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 0 | 1 | 2 | 1 | 0 | 1 | 10 | 4 | 2 | 2 | 1 | 1 | 1 | 9084 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 4 | 22 | 18 | 1 | 8 | 8 | 6 | 3 | 1 | 1 | 18 | 60588 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 3 | 38 | 30 | 1 | 4 | 2 | 3 | 4 | 1 | 0 | 32 | 20196 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 4 | 17 | 10 | 1 | 7 | 2 | 4 | 3 | 1 | 0 | 17 | 20196 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Once the data was ready for modelling, it was plotted and summarized to determine which variables should be taken into consideration and whether any further transformations were needed. First, let’s take a look at the salary distribution:
On the left side, we can see that the distribution is not suitable for modelling. A model built on this data would most likely perform worse than a naive guess. The largest salary included is 4M USD, a huge outlier that would disrupt the whole modelling process. The data was therefore limited to rows with salary < 300,000. As we can see on the right side, the distribution looks much better. While the variable is still not normally distributed, its shape shouldn’t be a roadblock either. Let’s now see whether there are any clear relationships between this variable and the remaining ones.
Unsurprisingly, there is a positive correlation between salary and both years of coding and working experience. This indicates that both variables should be included in the model, which is in line with reality.
Both age and education appear to matter here: in general, the higher the category, the higher the earnings. This means that older and/or more educated IT professionals tend to make more money.
While organization size seems to influence salary, gender does not appear to be statistically significant here.
Unfortunately, the Ethnicity box plot (1 for white respondents, 0 for other ethnicities) indicates that white respondents tend to make considerably more money than others. Beyond possible discrimination, this can be explained by the fact that regions with predominantly white populations tend to have stronger economies and, consequently, higher salaries.
As we can see, region makes a big difference when it comes to salaries. IT professionals in the USA tend to earn much more than those in other regions, while those in Southeastern and Southern Asia earn much less.
Let’s now look at the correlation plot to see whether the variables are correlated with one another:
As we can see, many variables in the dataset are moderately correlated with salary. Before modelling, outliers were removed based on their z-scores so that they would not affect the predictions, as sketched below.
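A minimal sketch of this outlier removal, assuming a z-score threshold of 3 (the threshold is not stated in the report):

# keep only salaries below 300,000 USD, then drop z-score outliers of the dependent variable
dfm <- df[df$ConvertedCompYearly < 300000, ]
z   <- as.vector(scale(dfm$ConvertedCompYearly))   # z-scores of the dependent variable
dfm <- dfm[abs(z) < 3, ]                           # keep observations within 3 standard deviations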
In order to predict salary accurately, a classic linear regression was fitted first to see which variables are statistically significant and how they affect the salary:
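The summary and variance inflation factors printed below were presumably produced along these lines; the formula object itself is not shown in the report, so the form below (all predictors except the baseline dummies) is an assumption.

library(car)   # vif()

# assumed: all predictors except the baseline dummies RegionUSA and OpSys "Other"
formula  <- ConvertedCompYearly ~ . - RegionUSA - X.OpSysProfessional.use.Other
model_lm <- lm(formula, data = dfm)

summary(model_lm)   # coefficient estimates and significance
vif(model_lm)       # variance inflation factors, to check for collinearity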
##
## Call:
## lm(formula = formula, data = dfm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -177388 -22289 -6027 15059 283453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67315.15 4334.36 15.531 < 2e-16 ***
## Employment 7350.49 933.37 7.875 3.62e-15 ***
## RemoteWork 7319.17 648.76 11.282 < 2e-16 ***
## EdLevel 4350.64 349.39 12.452 < 2e-16 ***
## YearsCode 284.26 84.39 3.368 0.000758 ***
## YearsCodePro 1706.47 123.51 13.817 < 2e-16 ***
## OrgSize 6716.18 340.46 19.727 < 2e-16 ***
## LanguageHaveWorkedWith -295.58 127.08 -2.326 0.020036 *
## DatabaseHaveWorkedWith 429.29 199.87 2.148 0.031735 *
## PlatformHaveWorkedWith 229.95 293.17 0.784 0.432850
## Age -1796.57 628.55 -2.858 0.004265 **
## Gender 5540.01 1210.37 4.577 4.75e-06 ***
## Ethnicity 5371.57 706.11 7.607 2.95e-14 ***
## WorkExp 14.68 98.07 0.150 0.881035
## RegionAustralia.and.New.Zealand -40848.91 1905.27 -21.440 < 2e-16 ***
## RegionCentral.and.Eastern.Europe -80959.32 1303.30 -62.119 < 2e-16 ***
## RegionEastern.Asia -71930.72 2930.49 -24.546 < 2e-16 ***
## RegionLatin.America.and.Caribbean -94381.42 1333.95 -70.753 < 2e-16 ***
## RegionMiddle.East.and.Northern.Africa -67470.57 1806.18 -37.355 < 2e-16 ***
## RegionNorth.America -36679.61 1636.02 -22.420 < 2e-16 ***
## RegionRussia -85236.29 2886.81 -29.526 < 2e-16 ***
## RegionSoutheastern.Asia -94058.84 2346.37 -40.087 < 2e-16 ***
## RegionSouthern.Asia -98101.98 1345.49 -72.912 < 2e-16 ***
## RegionSub.Saharan.Africa -86856.06 2398.21 -36.217 < 2e-16 ***
## RegionWestern.Europe -61746.04 900.91 -68.537 < 2e-16 ***
## X.OpSysProfessional.use.Linux.based 9883.40 3588.57 2.754 0.005891 **
## X.OpSysProfessional.use.macOS 17348.75 3611.11 4.804 1.57e-06 ***
## X.OpSysProfessional.use.Windows -228.95 3617.09 -0.063 0.949532
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38730 on 16157 degrees of freedom
## Multiple R-squared: 0.5566, Adjusted R-squared: 0.5559
## F-statistic: 751.3 on 27 and 16157 DF, p-value: < 2.2e-16
## Employment RemoteWork
## 1.110820 1.135345
## EdLevel YearsCode
## 1.074805 6.087117
## YearsCodePro OrgSize
## 9.538270 1.050031
## LanguageHaveWorkedWith DatabaseHaveWorkedWith
## 1.355443 1.487622
## PlatformHaveWorkedWith Age
## 1.341481 3.360013
## Gender Ethnicity
## 1.014814 1.341081
## WorkExp RegionAustralia.and.New.Zealand
## 7.090206 1.120345
## RegionCentral.and.Eastern.Europe RegionEastern.Asia
## 1.385183 1.075025
## RegionLatin.America.and.Caribbean RegionMiddle.East.and.Northern.Africa
## 1.402368 1.200069
## RegionNorth.America RegionRussia
## 1.147695 1.059490
## RegionSoutheastern.Asia RegionSouthern.Asia
## 1.158338 1.773945
## RegionSub.Saharan.Africa RegionWestern.Europe
## 1.103116 1.949226
## X.OpSysProfessional.use.Linux.based X.OpSysProfessional.use.macOS
## 33.545523 30.252799
## X.OpSysProfessional.use.Windows
## 27.972604
As we can see, the PlatformHaveWorkedWith, X.OpSysProfessional.use.Windows and WorkExp variables turned out to be insignificant, most likely because of the presence of other, similar variables in the dataset rather than because they are irrelevant in general. As a next step, the data was split into training and testing sets and a final formula that included only the significant variables was used:
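Neither the split nor the reduced formula is shown explicitly; the sketch below illustrates one way they could look, together with a helper for the error metrics reported throughout. The 70/30 split ratio, the seed, and the helper’s name are assumptions.

set.seed(1234)
train_idx  <- sample(seq_len(nrow(dfm)), size = round(0.7 * nrow(dfm)))  # assumed 70/30 split
data.train <- dfm[train_idx, ]
data.test  <- dfm[-train_idx, ]

# reduced formula matching the predictors kept in the summary below
formula <- ConvertedCompYearly ~ . - RegionUSA - X.OpSysProfessional.use.Other -
  PlatformHaveWorkedWith - WorkExp - YearsCode - X.OpSysProfessional.use.Windows

# helper returning the MSE, RMSE, MAE, MedAE and R2 of a set of predictions
regression_metrics <- function(actual, predicted) {
  data.frame(
    MSE   = mean((actual - predicted)^2),
    RMSE  = sqrt(mean((actual - predicted)^2)),
    MAE   = mean(abs(actual - predicted)),
    MedAE = median(abs(actual - predicted)),
    R2    = 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
  )
}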
##
## Call:
## lm(formula = formula, data = dfm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -177344 -22214 -6012 14913 284655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67439.03 2430.25 27.750 < 2e-16 ***
## Employment 7368.77 928.49 7.936 2.22e-15 ***
## RemoteWork 7317.75 648.74 11.280 < 2e-16 ***
## EdLevel 4457.78 346.48 12.866 < 2e-16 ***
## YearsCodePro 1990.14 66.58 29.890 < 2e-16 ***
## OrgSize 6706.94 340.12 19.719 < 2e-16 ***
## LanguageHaveWorkedWith -236.21 124.50 -1.897 0.05781 .
## DatabaseHaveWorkedWith 458.98 187.11 2.453 0.01418 *
## Age -1520.34 574.81 -2.645 0.00818 **
## Gender 5595.21 1210.28 4.623 3.81e-06 ***
## Ethnicity 5468.52 705.44 7.752 9.59e-15 ***
## RegionAustralia.and.New.Zealand -40716.98 1905.09 -21.373 < 2e-16 ***
## RegionCentral.and.Eastern.Europe -81000.92 1301.16 -62.253 < 2e-16 ***
## RegionEastern.Asia -72007.17 2929.61 -24.579 < 2e-16 ***
## RegionLatin.America.and.Caribbean -94582.13 1332.30 -70.992 < 2e-16 ***
## RegionMiddle.East.and.Northern.Africa -67673.27 1804.61 -37.500 < 2e-16 ***
## RegionNorth.America -36523.07 1633.97 -22.352 < 2e-16 ***
## RegionRussia -85350.88 2886.22 -29.572 < 2e-16 ***
## RegionSoutheastern.Asia -94108.59 2344.18 -40.146 < 2e-16 ***
## RegionSouthern.Asia -98305.81 1341.48 -73.282 < 2e-16 ***
## RegionSub.Saharan.Africa -86964.35 2397.64 -36.271 < 2e-16 ***
## RegionWestern.Europe -61573.21 896.47 -68.684 < 2e-16 ***
## X.OpSysProfessional.use.Linux.based 10154.57 762.07 13.325 < 2e-16 ***
## X.OpSysProfessional.use.macOS 17532.33 803.41 21.822 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38740 on 16161 degrees of freedom
## Multiple R-squared: 0.5563, Adjusted R-squared: 0.5557
## F-statistic: 881 on 23 and 16161 DF, p-value: < 2.2e-16
## Employment RemoteWork
## 1.098690 1.134711
## EdLevel YearsCodePro
## 1.056460 2.770749
## OrgSize LanguageHaveWorkedWith
## 1.047422 1.300286
## DatabaseHaveWorkedWith Age
## 1.303091 2.808662
## Gender Ethnicity
## 1.014153 1.337880
## RegionAustralia.and.New.Zealand RegionCentral.and.Eastern.Europe
## 1.119577 1.379942
## RegionEastern.Asia RegionLatin.America.and.Caribbean
## 1.073843 1.398196
## RegionMiddle.East.and.Northern.Africa RegionNorth.America
## 1.197382 1.144233
## RegionRussia RegionSoutheastern.Asia
## 1.058527 1.155596
## RegionSouthern.Asia RegionSub.Saharan.Africa
## 1.762513 1.102040
## RegionWestern.Europe X.OpSysProfessional.use.Linux.based
## 1.929084 1.512057
## X.OpSysProfessional.use.macOS
## 1.496729
## MSE RMSE MAE MedAE R2
## 1 1477287889 38435.5 27316.81 19994.25 0.5516044
The basic linear regression model shows that the variables on which the study is based are good predictors of salary, and we can move forward with the analysis: the variables are significant and there is no collinearity issue here.
First, a regression tree algorithm with default hyperparameters was used to predict the dependent variable:
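A sketch of how the default tree could be fitted and visualized, assuming the rpart and rpart.plot packages were used (the report does not name them explicitly):

library(rpart)
library(rpart.plot)

# regression tree with default hyperparameters
tree_default <- rpart(formula, data = data.train, method = "anova")

rpart.plot(tree_default)                          # tree structure
barplot(sort(tree_default$variable.importance))   # variable importance plot, discussed below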
It is noticeable that the algorithm takes only Ethnicity, years of coding, and region into consideration. This suggests that these variables are the most important ones for salary formation. There appears to be a breaking point around 6 years of coding, as this value serves as the split that creates homogeneous groups for both ethnicities. In general, the overall shape of the tree and the information it provides seem to be in line with logic and reality. Let’s now take a look at the variable importance plot.
Apart from the variables mentioned above, it seems that Age also matters where salaries are concerned, which is also a reasonable conclusion.
## MSE RMSE MAE MedAE R2
## 1 2099034616 45815.22 33554.61 25169.2 0.3849208
## MSE RMSE MAE MedAE R2
## 1 2108073393 45913.76 33661.84 25431.98 0.3601444
The predictions fit the actual data very poorly, with an R² of about 36% on the testing set. Let’s try to improve this by growing a very deep tree and pruning it afterwards.
This tree is huge and difficult to read or interpret, so it will now be pruned in order to keep it simple.
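The pruning step could be carried out roughly as follows; this is only a sketch, assuming the deep tree was grown with a near-zero complexity parameter and pruned back to the cp that minimizes the cross-validated error.

# grow a very deep tree by relaxing the complexity constraint (assumed control values)
tree_deep <- rpart(formula, data = data.train, method = "anova",
                   control = rpart.control(cp = 0, minsplit = 2))

# cp minimizing the cross-validated error, printed below
best_cp <- tree_deep$cptable[which.min(tree_deep$cptable[, "xerror"]), "CP"]
best_cp
tree_pruned <- prune(tree_deep, cp = best_cp)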
## [1] 1.00074e-08
The optimal complexity parameter here is close to 0, which is generally not safe: it allows the tree to grow almost as much as it wants, which can be detrimental to obtaining accurate predictions. In our case, the pruned tree is the same as the original one, so it is very likely that its predictions will not fit the data well.
## MSE RMSE MAE MedAE R2
## 1 4778810788 69128.94 54119.81 43923.64 -2.385981
## MSE RMSE MAE MedAE R2
## 1 4778810788 69128.94 54119.81 43923.64 -2.385981
As expected, the model generates huge errors; it is not usable at all. Let’s now see how cross-validation handles this data.
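A sketch of the cross-validated tuning of the complexity parameter; the cp grid is inferred from the output below, and the seed is an assumption.

library(caret)

set.seed(1234)
ctrl_cv <- trainControl(method = "cv", number = 10)

tree_cv <- train(formula,
                 data      = data.train,
                 method    = "rpart",
                 metric    = "MAE",        # MAE used for model selection, as reported below
                 trControl = ctrl_cv,
                 tuneGrid  = expand.grid(cp = seq(0, 0.03, 0.001)))
tree_cv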
## CART
##
## 11331 samples
## 23 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10197, 10199, 10197, 10198, 10199, 10198, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.000 42550.81 0.4856900 30195.67
## 0.001 41006.80 0.5078980 29378.03
## 0.002 42355.76 0.4747963 30584.09
## 0.003 43011.86 0.4582194 31050.44
## 0.004 43715.29 0.4402383 31528.23
## 0.005 44196.55 0.4280036 31959.52
## 0.006 44367.95 0.4235386 32250.43
## 0.007 44972.68 0.4078068 32732.25
## 0.008 45238.41 0.4004826 32997.52
## 0.009 45497.98 0.3938246 33226.52
## 0.010 45809.66 0.3854272 33502.50
## 0.011 46285.01 0.3727902 34117.36
## 0.012 46541.76 0.3658099 34631.49
## 0.013 46848.16 0.3574761 34934.35
## 0.014 47623.84 0.3359541 35799.19
## 0.015 47703.24 0.3336658 35924.23
## 0.016 48082.34 0.3230754 36352.44
## 0.017 48272.26 0.3177439 36554.14
## 0.018 48272.26 0.3177439 36554.14
## 0.019 48272.26 0.3177439 36554.14
## 0.020 48272.26 0.3177439 36554.14
## 0.021 48272.26 0.3177439 36554.14
## 0.022 48272.26 0.3177439 36554.14
## 0.023 48272.26 0.3177439 36554.14
## 0.024 48272.26 0.3177439 36554.14
## 0.025 48542.44 0.3101445 36786.55
## 0.026 49108.68 0.2938699 37264.86
## 0.027 49530.62 0.2818053 37682.12
## 0.028 49606.87 0.2789161 37736.67
## 0.029 49901.40 0.2704573 37989.99
## 0.030 50857.29 0.2425214 39119.95
##
## MAE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.001.
## MSE RMSE MAE MedAE R2
## 1 1593213395 39915.08 28581.81 20310.16 0.5331414
## MSE RMSE MAE MedAE R2
## 1 1650434650 40625.54 29199.44 20684.91 0.4990498
The optimal value of the complexity parameter is now 0.001. Even though the fit to the underlying data is much better now, it is still far from perfect and worse than simple linear regression.
The regression tree did not perform well enough on this dataset. Next, the random forest algorithm will be applied and its fit to the data assessed. Several models with different numbers of trees and numbers of features selected at each split were created; the results are shown below:
##
## Call:
## randomForest(formula = formula, data = data.train, ntree = 400)
## Type of random forest: regression
## Number of trees: 400
## No. of variables tried at each split: 7
##
## Mean of squared residuals: 1537513219
## % Var explained: 54.95
##
## Call:
## randomForest(formula = formula, data = data.train, ntree = 800)
## Type of random forest: regression
## Number of trees: 800
## No. of variables tried at each split: 7
##
## Mean of squared residuals: 1538517871
## % Var explained: 54.92
##
## Call:
## randomForest(formula = formula, data = data.train, ntree = 800, mtry = 3)
## Type of random forest: regression
## Number of trees: 800
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 1612280129
## % Var explained: 52.76
##
## Call:
## randomForest(formula = formula, data = data.train, ntree = 800, mtry = 6)
## Type of random forest: regression
## Number of trees: 800
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 1534371203
## % Var explained: 55.04
The best model turned out to be the one with 800 trees and 6 features selected at each split. It generates the following predictions:
## MSE RMSE MAE MedAE R2
## 1 624606435 24992.13 17815.71 13060.63 0.8169718
## MSE RMSE MAE MedAE R2
## 1 1507392686 38825.16 27827.64 20624.59 0.5424668
There is a clear overfitting problem here: the predictions are really good on the training set and much worse on the testing set. Even if we wanted to use this model for predictions, which is not advised, it only performs about as well as simple linear regression did. As a next step, cross-validation will be used to find the optimal parameters.
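A sketch of the cross-validated random forest tuning; the mtry grid of 2, 12 and 23 matches caret’s default three-point grid for 23 predictors, and the seed is an assumption.

set.seed(1234)
rf_cv <- train(formula,
               data       = data.train,
               method     = "rf",
               trControl  = trainControl(method = "cv", number = 10),
               tuneLength = 3)    # evaluates mtry = 2, 12 and 23, as in the output below
rf_cv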
## Random Forest
##
## 11331 samples
## 23 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10197, 10199, 10196, 10199, 10198, 10198, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 42574.47 0.5207408 31718.05
## 12 39428.86 0.5455047 28229.64
## 23 39772.74 0.5396773 28478.89
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 12.
## MSE RMSE MAE MedAE R2
## 1 421871975 20539.52 14422.14 10349.91 0.876379
## MSE RMSE MAE MedAE R2
## 1 1524982018 39051.02 27916.81 20601.51 0.537128
The optimal number of features here is 12. The model generates predictions that are slightly worse than the forest with 800 trees and 6 features, and unfortunately it overfits as well. All in all, the random forest algorithm does really well on the training set, but it learns patterns specific to this dataset rather than salary formation in general.
The last algorithm used here is XGBoost, which has become quite popular recently. It combines multiple weak models into a strong predictive model. To build an optimal one, the parameters are first set to default values and then optimized one after another, which results in the following logic:
# Step 1: tune the number of boosting rounds, holding the other parameters fixed
parameters_xgb <- expand.grid(nrounds = seq(20, 300, 10),
max_depth = c(6),
eta = c(0.25),
gamma = 1,
colsample_bytree = c(0.2),
min_child_weight = c(85),
subsample = 0.8)
set.seed(1234)
ctrl_cv3 <- trainControl(method = "cv")   # 10-fold cross-validation (caret default)
# the tuned model is trained once and cached with saveRDS(); later runs load it from disk
#data.xgb <- train(formula,
# data = data.train,
# method = "xgbTree",
# trControl = ctrl_cv3,
# tuneGrid = parameters_xgb)
#
#saveRDS(data.xgb, file = "data.xgb.rds")
data.xgb <- readRDS("data.xgb.rds")
data.xgb
## eXtreme Gradient Boosting
##
## 11331 samples
## 23 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10197, 10199, 10197, 10198, 10199, 10198, ...
## Resampling results across tuning parameters:
##
## nrounds RMSE Rsquared MAE
## 20 44670.21 0.4384267 33258.89
## 30 42995.25 0.4701945 31641.21
## 40 41879.79 0.4935689 30528.49
## 50 41131.25 0.5094104 29807.23
## 60 40537.61 0.5224691 29255.32
## 70 39989.56 0.5343628 28758.74
## 80 39650.95 0.5415643 28413.22
## 90 39383.64 0.5474202 28183.25
## 100 39165.86 0.5523496 27958.57
## 110 38990.48 0.5561393 27814.42
## 120 38806.35 0.5599565 27624.87
## 130 38704.80 0.5620959 27526.79
## 140 38605.63 0.5639837 27438.09
## 150 38524.89 0.5656758 27350.91
## 160 38457.62 0.5670419 27280.39
## 170 38406.40 0.5680670 27228.96
## 180 38364.92 0.5689036 27178.55
## 190 38333.20 0.5696576 27158.55
## 200 38324.66 0.5697760 27136.49
## 210 38287.86 0.5705445 27113.48
## 220 38279.26 0.5706567 27089.13
## 230 38257.77 0.5711807 27060.03
## 240 38242.68 0.5714808 27055.56
## 250 38234.74 0.5716138 27049.56
## 260 38214.89 0.5720852 27030.72
## 270 38225.47 0.5718665 27032.03
## 280 38205.66 0.5722876 26995.87
## 290 38206.41 0.5722318 27012.95
## 300 38202.48 0.5723114 27021.04
##
## Tuning parameter 'max_depth' was held constant at a value of 6
## Tuning
## parameter 'min_child_weight' was held constant at a value of 85
##
## Tuning parameter 'subsample' was held constant at a value of 0.8
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 300, max_depth = 6, eta
## = 0.25, gamma = 1, colsample_bytree = 0.2, min_child_weight = 85 and
## subsample = 0.8.
# Step 2: tune tree depth and min_child_weight with nrounds fixed at 300
parameters_xgb2 <- expand.grid(nrounds = 300,
max_depth = seq(2, 6, 2),
eta = c(0.25),
gamma = 1,
colsample_bytree = c(0.2),
min_child_weight = seq(20, 220, 100),
subsample = 0.8)
set.seed(123456789)
#datausa.xgb2 <- train(formula,
# data = data.train,
# method = "xgbTree",
# trControl = ctrl_cv3,
# tuneGrid = parameters_xgb2)
#saveRDS(datausa.xgb2, file = "datausa.xgb2.rds")
datausa.xgb2 <- readRDS("datausa.xgb2.rds")
datausa.xgb2
## eXtreme Gradient Boosting
##
## 11331 samples
## 23 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10198, 10198, 10197, 10198, 10198, 10197, ...
## Resampling results across tuning parameters:
##
## max_depth min_child_weight RMSE Rsquared MAE
## 2 20 37968.12 0.5778751 26819.42
## 2 120 39476.06 0.5436061 27852.80
## 2 220 41447.24 0.4968708 29664.20
## 4 20 38092.10 0.5749326 26902.38
## 4 120 39439.74 0.5443360 27763.27
## 4 220 41337.35 0.4995594 29519.98
## 6 20 38534.52 0.5653002 27256.59
## 6 120 39694.50 0.5384698 28030.83
## 6 220 41453.17 0.4967633 29673.96
##
## Tuning parameter 'nrounds' was held constant at a value of 300
## Tuning
## 'colsample_bytree' was held constant at a value of 0.2
## Tuning
## parameter 'subsample' was held constant at a value of 0.8
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 300, max_depth = 2, eta
## = 0.25, gamma = 1, colsample_bytree = 0.2, min_child_weight = 20 and
## subsample = 0.8.
# Step 3: tune colsample_bytree with the previously selected values held fixed
parameters_xgb3 <- expand.grid(nrounds = 300,
max_depth = 2,
eta = c(0.25),
gamma = 1,
colsample_bytree = seq(0.6, 0.9, 0.1),
min_child_weight = 20,
subsample = 0.8)
set.seed(123456789)
#datausa.xgb3 <- train(formula,
# data = data.train,
# method = "xgbTree",
# trControl = ctrl_cv3,
# tuneGrid = parameters_xgb3)
#saveRDS(datausa.xgb3, file = "datausa.xgb3.rds")
datausa.xgb3 <- readRDS("datausa.xgb3.rds")
datausa.xgb3
## eXtreme Gradient Boosting
##
## 11331 samples
## 23 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10198, 10197, 10198, 10199, 10199, 10197, ...
## Resampling results across tuning parameters:
##
## colsample_bytree RMSE Rsquared MAE
## 0.6 38000.32 0.5770720 26755.70
## 0.7 37927.65 0.5786895 26712.55
## 0.8 37988.11 0.5773240 26750.53
## 0.9 37980.82 0.5775382 26740.47
##
## Tuning parameter 'nrounds' was held constant at a value of 300
## Tuning
## held constant at a value of 20
## Tuning parameter 'subsample' was held
## constant at a value of 0.8
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 300, max_depth = 2, eta
## = 0.25, gamma = 1, colsample_bytree = 0.7, min_child_weight = 20 and
## subsample = 0.8.
# Step 4: tune the subsample ratio
parameters_xgb4 <- expand.grid(nrounds = 300,
max_depth = 2,
eta = c(0.25),
gamma = 1,
colsample_bytree = 0.7,
min_child_weight = 20,
subsample = c(0.6, 0.7, 0.75, 0.8, 0.85, 0.9))
set.seed(123456789)
#datausa.xgb4 <- train(formula,
# data = data.train,
# method = "xgbTree",
# trControl = ctrl_cv3,
# tuneGrid = parameters_xgb4)
#saveRDS(datausa.xgb4, file = "datausa.xgb4.rds")
datausa.xgb4 <- readRDS("datausa.xgb4.rds")
datausa.xgb4
## eXtreme Gradient Boosting
##
## 11331 samples
## 23 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10198, 10197, 10198, 10199, 10199, 10197, ...
## Resampling results across tuning parameters:
##
## subsample RMSE Rsquared MAE
## 0.60 38044.07 0.5760558 26790.66
## 0.70 37981.35 0.5774379 26725.84
## 0.75 37980.29 0.5775076 26702.90
## 0.80 37994.11 0.5771910 26745.48
## 0.85 37960.53 0.5778642 26750.90
## 0.90 37979.37 0.5774959 26728.88
##
## Tuning parameter 'nrounds' was held constant at a value of 300
## Tuning
## held constant at a value of 0.7
## Tuning parameter 'min_child_weight' was
## held constant at a value of 20
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 300, max_depth = 2, eta
## = 0.25, gamma = 1, colsample_bytree = 0.7, min_child_weight = 20 and
## subsample = 0.85.
# Step 5: lower the learning rate and increase the number of rounds
parameters_xgb5 <- expand.grid(nrounds = 600,
max_depth = 2,
eta = 0.12,
gamma = 1,
colsample_bytree = 0.7,
min_child_weight = 20,
subsample = 0.75)
set.seed(123456789)
#datausa.xgb5 <- train(formula,
# data = data.train,
# method = "xgbTree",
# trControl = ctrl_cv3,
# tuneGrid = parameters_xgb5)
#saveRDS(datausa.xgb5, file = "datausa.xgb5.rds")
datausa.xgb5 <- readRDS("datausa.xgb5.rds")
datausa.xgb5
## eXtreme Gradient Boosting
##
## 11331 samples
## 23 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10198, 10197, 10198, 10199, 10199, 10197, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 37906.45 0.5790887 26656.33
##
## Tuning parameter 'nrounds' was held constant at a value of 600
## Tuning
## held constant at a value of 20
## Tuning parameter 'subsample' was held
## constant at a value of 0.75
# Step 6: lower the learning rate further and increase the rounds again
parameters_xgb6 <- expand.grid(nrounds = 1200,
max_depth = 2,
eta = 0.06,
gamma = 1,
colsample_bytree = 0.7,
min_child_weight = 20,
subsample = 0.75)
set.seed(123456789)
#datausa.xgb6 <- train(formula,
# data = data.train,
# method = "xgbTree",
# trControl = ctrl_cv3,
# tuneGrid = parameters_xgb6)
#saveRDS(datausa.xgb6, file = "datausa.xgb6.rds")
datausa.xgb6 <- readRDS("datausa.xgb6.rds")
datausa.xgb6
## eXtreme Gradient Boosting
##
## 11331 samples
## 23 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10198, 10197, 10198, 10199, 10199, 10197, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 37850.91 0.58035 26610.59
##
## Tuning parameter 'nrounds' was held constant at a value of 1200
##
## Tuning parameter 'min_child_weight' was held constant at a value of 20
##
## Tuning parameter 'subsample' was held constant at a value of 0.75
# predictions of the final XGBoost model on the testing and training sets
datausa.xgb6_pred <- predict(datausa.xgb6, data.test)
datausa.xgb6_predtrain <- predict(datausa.xgb6, data.train)
## MSE RMSE MAE MedAE R2
## 1 1358057734 36851.83 25939.84 18424.8 0.6020489
## MSE RMSE MAE MedAE R2
## 1 1391691006 37305.38 26315.03 18905.65 0.5775853
Ultimately, the best model created over the course of this analysis predicts salaries more efficiently than the others, with an R² of roughly 58% and a mean absolute error of 26,315 USD on the testing set. While these results are the best obtained so far, they are far from perfect and only slightly better than linear regression. On a more positive note, there is no overfitting problem here, as the fit is similar on both the training and testing sets. Let’s see how the real and predicted values behave on a plot:
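The plot referred to here could be produced along these lines; this is only a sketch using ggplot2, as the original plotting code is not shown.

library(ggplot2)

ggplot(data.frame(actual    = data.test$ConvertedCompYearly,
                  predicted = datausa.xgb6_pred),
       aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, colour = "red") +   # perfect-prediction diagonal
  labs(x = "Actual salary (USD)", y = "Predicted salary (USD)")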
The real and predicted values are mostly concentrated along the diagonal, which is good. However, there are clearly many points that lie far from where they should be. In order to see whether the predictions can be improved further, an ensemble technique will be used and the two best models will be combined to predict the data:
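The ensemble itself is not shown in the report. Assuming it is a simple average of the random forest predictions (rf4_pred, taken here to be the 800-tree, mtry = 6 model) and the XGBoost predictions, it could look like the sketch below, reusing the hypothetical regression_metrics helper from earlier.

# simple averaging ensemble of the two best models (assumed combination rule)
final.pred <- (rf4_pred + datausa.xgb6_pred) / 2

# metrics of the ensemble and of each individual model on the testing set
sapply(list(final.pred        = final.pred,
            rf4_pred          = rf4_pred,
            datausa.xgb6_pred = datausa.xgb6_pred),
       function(p) unlist(regression_metrics(data.test$ConvertedCompYearly, p)))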
## final.pred rf4_pred datausa.xgb6_pred
## MSE 1.410216e+09 1.507393e+09 1.391691e+09
## RMSE 3.755284e+04 3.882516e+04 3.730538e+04
## MAE 2.661963e+04 2.782764e+04 2.631503e+04
## MedAE 1.925054e+04 2.062459e+04 1.890565e+04
## R2 5.719624e-01 5.424668e-01 5.775853e-01
It is clear that the ensemble technique will not improve anything here. With that said, we can conclude that the best predictions are generated by the XGBoost algorithm alone.
## Evaluation
Three different algorithms were applied to the StackOverflow survey data on IT professionals’ salaries. Out of all of them, XGBoost turned out to understand the data best and was able to generate the most accurate predictions. The overall fit is still not very satisfying: the model’s MAE is 26,315, which means that an average prediction is more than 25,000 USD away from its true value. There may be many reasons for this state of affairs. Some are independent of the modelling, since there are many under- and overpaid employees, and salary does not always reflect someone’s skills. On top of that, salary also depends on character traits that are not taken into consideration here, as well as on many external, economy-related factors. Additionally, there is a group of developers who live in one region but work remotely for another, which also affects the study. On the other hand, there are more sophisticated techniques that could be used here and might approximate salary formation better.
Based on the evidence presented above, the analysis can be concluded with the following findings: