Stack Overflow is a Q&A website for programmers. Every year for the past decade, Stack Overflow has run a survey that gauges the state of the tech industry and technology trends. The survey collects data on basic information; education, work, and career; technology and tech culture; Stack Overflow usage and community; demographics; and some additional topics. Most questions are multiple choice, and the anonymized results are released publicly. Insights from such surveys can inform employer recruitment plans and individual study plans, and they provide a clear picture of technology use, age and work experience, pay, and gender bias.
This project aims to predict a developer's salary from up to 60 other factors. The dataset has dimensions 64461 x 61. The factors considered include gender, organization size, education level, years of coding, level of job satisfaction, whether coding is a hobby, age, how often respondents work overtime, when respondents started coding, and how frequently respondents are compensated. The model uses regression techniques from Machine Learning. The steps to achieve the desired results include exploratory data analysis, data cleaning, and feature selection.
The dataset used in this analysis contains the Stack Overflow Annual Developer Survey 2020 results, with nearly 65,000 responses fielded from over 180 countries and dependent territories. The survey examines all aspects of the developer experience, from career satisfaction and job search to education and opinions on open-source software. The dataset can be found on the Stack Overflow webpage.
Loading all the necessary packages and dataset.
library(dplyr)          # data manipulation verbs (select, filter, ...)
library(DataExplorer)   # plot_missing()
library(tidyverse)      # ggplot2, tidyr, purrr, ...
library(hrbrthemes)     # ggplot2 themes
library(viridis)        # colour scales
library(forcats)        # factor helpers
library(reshape)        # melt()
library(e1071)          # skewness()
library(caret)          # createDataPartition(), nearZeroVar(), findLinearCombos()
library(corrplot)       # corrplot.mixed()
library(olsrr)          # ols_step_backward_p()
library(glmnet)         # ridge / LASSO / elastic net regression
library(coefplot)       # coefplot(), coefpath()
library(skimr)          # skim()
df<-read.csv("survey_results_public.csv")
skim(df)
| Name | df |
|---|---|
| Number of rows | 64461 |
| Number of columns | 61 |
| Column type frequency: | |
| character | 56 |
| numeric | 5 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| MainBranch | 299 | 1.00 | 27 | 77 | 0 | 5 | 0 |
| Hobbyist | 45 | 1.00 | 2 | 3 | 0 | 2 | 0 |
| Age1stCode | 6561 | 0.90 | 1 | 20 | 0 | 63 | 0 |
| CompFreq | 24392 | 0.62 | 6 | 7 | 0 | 3 | 0 |
| Country | 389 | 0.99 | 4 | 41 | 0 | 183 | 0 |
| CurrencyDesc | 18989 | 0.71 | 7 | 39 | 0 | 142 | 0 |
| CurrencySymbol | 18989 | 0.71 | 3 | 3 | 0 | 141 | 0 |
| DatabaseDesireNextYear | 20391 | 0.68 | 5 | 133 | 0 | 3193 | 0 |
| DatabaseWorkedWith | 14924 | 0.77 | 5 | 133 | 0 | 2808 | 0 |
| DevType | 15091 | 0.77 | 8 | 531 | 0 | 8269 | 0 |
| EdLevel | 7030 | 0.89 | 25 | 82 | 0 | 9 | 0 |
| Employment | 607 | 0.99 | 7 | 52 | 0 | 7 | 0 |
| Ethnicity | 18513 | 0.71 | 8 | 235 | 0 | 208 | 0 |
| Gender | 13904 | 0.78 | 3 | 59 | 0 | 7 | 0 |
| JobFactors | 15112 | 0.77 | 19 | 189 | 0 | 230 | 0 |
| JobSat | 19267 | 0.70 | 14 | 34 | 0 | 5 | 0 |
| JobSeek | 12734 | 0.80 | 31 | 60 | 0 | 3 | 0 |
| LanguageDesireNextYear | 10348 | 0.84 | 1 | 164 | 0 | 16243 | 0 |
| LanguageWorkedWith | 7083 | 0.89 | 1 | 164 | 0 | 14256 | 0 |
| MiscTechDesireNextYear | 22082 | 0.66 | 4 | 169 | 0 | 5216 | 0 |
| MiscTechWorkedWith | 24147 | 0.63 | 4 | 169 | 0 | 2730 | 0 |
| NEWCollabToolsDesireNextYear | 17174 | 0.73 | 4 | 149 | 0 | 1277 | 0 |
| NEWCollabToolsWorkedWith | 11578 | 0.82 | 4 | 149 | 0 | 1153 | 0 |
| NEWDevOps | 21775 | 0.66 | 2 | 8 | 0 | 3 | 0 |
| NEWDevOpsImpt | 22729 | 0.65 | 7 | 20 | 0 | 5 | 0 |
| NEWEdImpt | 15996 | 0.75 | 14 | 34 | 0 | 5 | 0 |
| NEWJobHunt | 22175 | 0.66 | 12 | 377 | 0 | 2172 | 0 |
| NEWJobHuntResearch | 23439 | 0.64 | 36 | 362 | 0 | 63 | 0 |
| NEWLearn | 8305 | 0.87 | 11 | 20 | 0 | 4 | 0 |
| NEWOffTopic | 13657 | 0.79 | 2 | 8 | 0 | 3 | 0 |
| NEWOnboardGood | 21838 | 0.66 | 2 | 28 | 0 | 3 | 0 |
| NEWOtherComms | 7256 | 0.89 | 2 | 3 | 0 | 2 | 0 |
| NEWOvertime | 21230 | 0.67 | 5 | 56 | 0 | 5 | 0 |
| NEWPurchaseResearch | 27140 | 0.58 | 18 | 253 | 0 | 63 | 0 |
| NEWPurpleLink | 9658 | 0.85 | 6 | 17 | 0 | 4 | 0 |
| NEWSOSites | 6186 | 0.90 | 37 | 305 | 0 | 61 | 0 |
| NEWStuck | 9478 | 0.85 | 5 | 225 | 0 | 444 | 0 |
| OpSys | 8233 | 0.87 | 3 | 11 | 0 | 4 | 0 |
| OrgSize | 20127 | 0.69 | 16 | 50 | 0 | 9 | 0 |
| PlatformDesireNextYear | 13856 | 0.79 | 3 | 177 | 0 | 7471 | 0 |
| PlatformWorkedWith | 10618 | 0.84 | 3 | 177 | 0 | 6287 | 0 |
| PurchaseWhat | 25097 | 0.61 | 21 | 32 | 0 | 3 | 0 |
| Sexuality | 20469 | 0.68 | 5 | 53 | 0 | 14 | 0 |
| SOAccount | 7656 | 0.88 | 2 | 23 | 0 | 3 | 0 |
| SOComm | 7985 | 0.88 | 7 | 15 | 0 | 6 | 0 |
| SOPartFreq | 17669 | 0.73 | 20 | 50 | 0 | 6 | 0 |
| SOVisitFreq | 7491 | 0.88 | 20 | 50 | 0 | 6 | 0 |
| SurveyEase | 12659 | 0.80 | 4 | 26 | 0 | 3 | 0 |
| SurveyLength | 12760 | 0.80 | 8 | 21 | 0 | 3 | 0 |
| Trans | 15116 | 0.77 | 2 | 3 | 0 | 2 | 0 |
| UndergradMajor | 13466 | 0.79 | 24 | 78 | 0 | 12 | 0 |
| WebframeDesireNextYear | 24437 | 0.62 | 5 | 134 | 0 | 3986 | 0 |
| WebframeWorkedWith | 22182 | 0.66 | 5 | 134 | 0 | 3789 | 0 |
| WelcomeChange | 11778 | 0.82 | 37 | 55 | 0 | 6 | 0 |
| YearsCode | 6777 | 0.89 | 1 | 18 | 0 | 52 | 0 |
| YearsCodePro | 18112 | 0.72 | 1 | 18 | 0 | 52 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Respondent | 0 | 1.00 | 3.255408e+04 | 18967.44 | 1 | 16116 | 32231 | 49142 | 6.563900e+04 | ▇▇▇▇▇ |
| Age | 19015 | 0.71 | 3.083000e+01 | 9.59 | 1 | 24 | 29 | 35 | 2.790000e+02 | ▇▁▁▁▁ |
| CompTotal | 29635 | 0.54 | 3.190464e+242 | Inf | 0 | 20000 | 63000 | 125000 | 1.111111e+247 | ▇▁▁▁▁ |
| ConvertedComp | 29705 | 0.54 | 1.037561e+05 | 226885.30 | 0 | 24648 | 54049 | 95000 | 2.000000e+06 | ▇▁▁▁▁ |
| WorkWeekHrs | 23310 | 0.64 | 4.078000e+01 | 17.82 | 1 | 40 | 40 | 44 | 4.750000e+02 | ▇▁▁▁▁ |
As the dataset contains many features that will not necessarily be used in further analysis, we decided to clean the data a bit at the beginning. We started this process by asking ourselves a basic question:
Which variables affect the annual salary (USD) of a given respondent?
Thus, we chose the following variables:
df <- df %>%
dplyr::select(Gender, EdLevel,
OrgSize, YearsCode, JobSat,
ConvertedComp, Hobbyist, Age, NEWOvertime, Age1stCode, CompFreq)
Gender - respondent’s gender
EdLevel - highest level of formal education the respondent completed
OrgSize - number of people employed by the company or organization the respondent currently works for
YearsCode - how many years the respondent has been coding in total
JobSat - how satisfied the respondent is with his/her current job
ConvertedComp - salary converted to annual USD using the exchange rate on 2020-02-19, assuming 12 working months and 50 working weeks
Hobbyist - whether the respondent codes as a hobby
Age - respondent’s age in years
NEWOvertime - how often the respondent works overtime or beyond the formal time expectations of his/her job
Age1stCode - respondent’s age when he/she wrote the first line of code or program (e.g., webpage, Hello World, Scratch project)
CompFreq - frequency of compensation payment: weekly, monthly, or yearly
We will start our analysis by checking whether our dataset contains missing values. As you can see below, there are quite a lot of them, so the process of handling them will be considered separately for each variable, both numerical and categorical.
plot_missing(df)
Initially we only have two numeric variables: ConvertedComp and Age.
sapply(df, is.numeric) %>%
which() %>%
names()
## [1] "ConvertedComp" "Age"
df %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
As we can see above, the variable Age looks right-skewed (long tail on the right), hence we will use the median to replace its missing values. In the case of the variable ConvertedComp, we will remove all missing values, as we want to include in our analysis only employed programmers who receive a salary.
df<- df %>%
filter(ConvertedComp!=0) %>%
na.omit(df)
any(df$ConvertedComp <= 0) %>%
knitr::knit_print()
## [1] FALSE
df$Age[is.na(df$Age)] <- median(df$Age, na.rm=TRUE)
any(df$Age <= 0) %>%
knitr::knit_print()
## [1] FALSE
The vast majority of our variables are categorical:
categorical_vars <-
sapply(df, is.character) %>%
which() %>%
names()
categorical_vars
## [1] "Gender" "EdLevel" "OrgSize" "YearsCode" "JobSat"
## [6] "Hobbyist" "NEWOvertime" "Age1stCode" "CompFreq"
Let’s initially explore our variables to see what levels they have:
df[categorical_vars] <- lapply(df[categorical_vars],factor)
lapply(df[categorical_vars],levels)
## $Gender
## [1] "Man"
## [2] "Man;Non-binary, genderqueer, or gender non-conforming"
## [3] "Non-binary, genderqueer, or gender non-conforming"
## [4] "Woman"
## [5] "Woman;Man"
## [6] "Woman;Man;Non-binary, genderqueer, or gender non-conforming"
## [7] "Woman;Non-binary, genderqueer, or gender non-conforming"
##
## $EdLevel
## [1] "Associate degree (A.A., A.S., etc.)"
## [2] "Bachelor’s degree (B.A., B.S., B.Eng., etc.)"
## [3] "I never completed any formal education"
## [4] "Master’s degree (M.A., M.S., M.Eng., MBA, etc.)"
## [5] "Other doctoral degree (Ph.D., Ed.D., etc.)"
## [6] "Primary/elementary school"
## [7] "Professional degree (JD, MD, etc.)"
## [8] "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)"
## [9] "Some college/university study without earning a degree"
##
## $OrgSize
## [1] "1,000 to 4,999 employees"
## [2] "10 to 19 employees"
## [3] "10,000 or more employees"
## [4] "100 to 499 employees"
## [5] "2 to 9 employees"
## [6] "20 to 99 employees"
## [7] "5,000 to 9,999 employees"
## [8] "500 to 999 employees"
## [9] "Just me - I am a freelancer, sole proprietor, etc."
##
## $YearsCode
## [1] "1" "10" "11"
## [4] "12" "13" "14"
## [7] "15" "16" "17"
## [10] "18" "19" "2"
## [13] "20" "21" "22"
## [16] "23" "24" "25"
## [19] "26" "27" "28"
## [22] "29" "3" "30"
## [25] "31" "32" "33"
## [28] "34" "35" "36"
## [31] "37" "38" "39"
## [34] "4" "40" "41"
## [37] "42" "43" "44"
## [40] "45" "46" "47"
## [43] "48" "49" "5"
## [46] "50" "6" "7"
## [49] "8" "9" "Less than 1 year"
## [52] "More than 50 years"
##
## $JobSat
## [1] "Neither satisfied nor dissatisfied" "Slightly dissatisfied"
## [3] "Slightly satisfied" "Very dissatisfied"
## [5] "Very satisfied"
##
## $Hobbyist
## [1] "No" "Yes"
##
## $NEWOvertime
## [1] "Never"
## [2] "Occasionally: 1-2 days per quarter but less than monthly"
## [3] "Often: 1-2 days per week or more"
## [4] "Rarely: 1-2 days per year or less"
## [5] "Sometimes: 1-2 days per month but less than weekly"
##
## $Age1stCode
## [1] "10" "11" "12"
## [4] "13" "14" "15"
## [7] "16" "17" "18"
## [10] "19" "20" "21"
## [13] "22" "23" "24"
## [16] "25" "26" "27"
## [19] "28" "29" "30"
## [22] "31" "32" "33"
## [25] "34" "35" "36"
## [28] "37" "38" "39"
## [31] "40" "41" "42"
## [34] "43" "44" "45"
## [37] "46" "47" "48"
## [40] "5" "50" "6"
## [43] "7" "8" "9"
## [46] "Older than 85" "Younger than 5 years"
##
## $CompFreq
## [1] "Monthly" "Weekly" "Yearly"
options(contrasts = c("contr.treatment", # for non-ordinal factors
"contr.treatment")) # for ordinal factors
Below we have encoded the ordinal features and handled missing values. As with the numeric variables, we are dealing with skewed data here as well, hence the missing values have been replaced with the medians.
As the questionnaire offered respondents many options for indicating gender, we distinguished two categories for the purposes of our analysis. We therefore aggregated "Man" and "Man;Non-binary, genderqueer, or gender non-conforming" into a new category, Man (1), and merged the remaining options into Woman (0).
df$Gender <- ifelse(df$Gender == "Man"| df$Gender =="Man;Non-binary, genderqueer, or gender non-conforming", 1, 0)
df$Gender[is.na(df$Gender)] <- median(df$Gender, na.rm=TRUE)
df$EdLevel <-
plyr::revalue(as.character(df$EdLevel),
c("I never completed any formal education" = "1",
"Primary/elementary school" = "2",
"Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)" = "3",
"Some college/university study without earning a degree" = "4",
"Bachelor’s degree (B.A., B.S., B.Eng., etc.)" = "6",
"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)" = "7",
"Professional degree (JD, MD, etc.)" = "8",
"Other doctoral degree (Ph.D., Ed.D., etc.)" = "8",
"Associate degree (A.A., A.S., etc.)" = "5")) %>%
as.numeric()
df$EdLevel[is.na(df$EdLevel)] <- median(df$EdLevel, na.rm=TRUE)
df$OrgSize <-
plyr::revalue(as.character(df$OrgSize),
c("Just me - I am a freelancer, sole proprietor, etc." = "1",
"2 to 9 employees" = "2",
"10 to 19 employees" = "3",
"20 to 99 employees" = "4",
"100 to 499 employees" = "5",
"500 to 999 employees" = "6",
"1,000 to 4,999 employees" = "7",
"5,000 to 9,999 employees" = "8",
"10,000 or more employees" = "9")) %>%
as.numeric()
df$OrgSize[is.na(df$OrgSize)] <- median(df$OrgSize, na.rm=TRUE)
df$YearsCode <-
plyr::revalue(as.character(df$YearsCode),
c("Less than 1 year" = "1",
"More than 50 years" = "50")) %>%
as.numeric()
df$YearsCode[is.na(df$YearsCode)] <- median(df$YearsCode, na.rm=TRUE)
df$JobSat <-
plyr::revalue(as.character(df$JobSat),
c("Very dissatisfied" = "1",
"Slightly dissatisfied" = "2",
"Neither satisfied nor dissatisfied" = "3",
"Slightly satisfied" = "4",
"Very satisfied" ="5")) %>%
as.numeric()
df$JobSat[is.na(df$JobSat)] <- median(df$JobSat, na.rm=TRUE)
df$Hobbyist <- ifelse(df$Hobbyist == "Yes", 1, 0)
df$Hobbyist[is.na(df$Hobbyist)] <- median(df$Hobbyist, na.rm=TRUE)
df$NEWOvertime <-
plyr::revalue(as.character(df$NEWOvertime),
c("Never" = "1",
"Rarely: 1-2 days per year or less" = "2",
"Occasionally: 1-2 days per quarter but less than monthly" = "3",
"Sometimes: 1-2 days per month but less than weekly" = "4",
"Often: 1-2 days per week or more" ="5")) %>%
as.numeric()
df$NEWOvertime [is.na(df$NEWOvertime )] <- median(df$NEWOvertime , na.rm=TRUE)
df$Age1stCode <-
plyr::revalue(as.character(df$Age1stCode),
c("Younger than 5 years" = "4",
"Older than 85" = "86")) %>%
as.numeric()
df$Age1stCode[is.na(df$Age1stCode)] <- median(df$Age1stCode, na.rm=TRUE)
df$CompFreq <-
plyr::revalue(as.character(df$CompFreq),
c("Weekly" = "1",
"Monthly" = "2",
"Yearly" = "3")) %>%
as.numeric()
df$CompFreq[is.na(df$CompFreq)] <- median(df$CompFreq, na.rm=TRUE)
Let’s see if we still have any missing values.
sum(is.na(df))
## [1] 0
meltData <- melt(df)
## Using as id variables
p <- ggplot(meltData, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scales = "free")
As we can see, the variables Age, Age1stCode, YearsCode and ConvertedComp have outliers. We will remove the observations with those values in the next stage.
outlier_Age <- boxplot(df$Age)$out
outlier_Age1stCode <- boxplot(df$Age1stCode)$out
outlier_YearsCode <- boxplot(df$YearsCode)$out
outlier_ConvertedComp <- boxplot(df$ConvertedComp)$out
Removing outliers:
df <- df[!((df$Age %in% outlier_Age) | (df$Age1stCode %in% outlier_Age1stCode) | (df$ConvertedComp %in% outlier_ConvertedComp) | (df$YearsCode %in% outlier_YearsCode)), ]
After cleaning the data, we have to take a closer look at our dependent variable ConvertedComp. As we can see, there are still some outliers. All values greater than the cut-off point of 250,000 USD can be eliminated.
df %>%
filter(df$ConvertedComp > 250000) %>%
nrow() %>%
knitr::knit_print()
## [1] 61
df <- df %>%
filter(df$ConvertedComp < 250000)
Now let’s see the distribution of our dependent variable. It is clearly visible that it is not symmetrical but right-skewed. Therefore we should apply some transformation to our variable.
skewness(df$ConvertedComp)
## [1] 0.9421419
par(mfrow=c(1,3))
hist(df$ConvertedComp, freq=FALSE, col="chocolate2", xlab="ConvertedComp regular")
hist(log(df$ConvertedComp), freq=FALSE, col="chocolate2", xlab="ConvertedComp log")
hist(sqrt(df$ConvertedComp), freq=FALSE, col="chocolate2", xlab="ConvertedComp square root")
As we can see, the square root transformation is closer to a normal distribution.
df$ConvertedComp<-sqrt(df$ConvertedComp)
set.seed(987654321)
df_training <- createDataPartition(df$ConvertedComp, p = 0.7, list = FALSE)
df_train <- df[df_training,]
df_test <- df[-df_training,]
summary(df_train$ConvertedComp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 155.7 224.6 225.4 289.9 498.6
summary(df_test$ConvertedComp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 155.8 224.6 226.0 289.9 495.9
Our test and training sets have very similar distributions and key characteristics of the dependent variable, so we’ll stick with these datasets.
numeric_vars<-sapply(df_train, is.numeric) %>%
which() %>%
names()
numeric_vars
## [1] "Gender" "EdLevel" "OrgSize" "YearsCode"
## [5] "JobSat" "ConvertedComp" "Hobbyist" "Age"
## [9] "NEWOvertime" "Age1stCode" "CompFreq"
It is advisable to remove variables with a single level (zero variance). As no variable has only one unique value, we don’t need to eliminate anything.
sapply(df_train[, numeric_vars],
function(x)
unique(x) %>%
length()) %>%
sort()
## Gender Hobbyist CompFreq JobSat NEWOvertime
## 2 2 3 5 5
## EdLevel OrgSize Age1stCode YearsCode Age
## 8 9 28 50 64
## ConvertedComp
## 4649
There are also no variables with near-zero variance.
df_nzv <- nearZeroVar(df_train[, numeric_vars],
saveMetrics = TRUE)
df_nzv
## freqRatio percentUnique zeroVar nzv
## Gender 11.639918 0.01085246 FALSE FALSE
## EdLevel 1.956885 0.04340984 FALSE FALSE
## OrgSize 1.121892 0.04883607 FALSE FALSE
## YearsCode 1.487522 0.27131152 FALSE FALSE
## JobSat 1.020840 0.02713115 FALSE FALSE
## ConvertedComp 1.011834 25.22654512 FALSE FALSE
## Hobbyist 3.436447 0.01085246 FALSE FALSE
## Age 1.016708 0.34727875 FALSE FALSE
## NEWOvertime 1.118553 0.02713115 FALSE FALSE
## Age1stCode 1.011446 0.15193445 FALSE FALSE
## CompFreq 1.384332 0.01627869 FALSE FALSE
data_linear_combinations <- findLinearCombos(df_train[, numeric_vars])
data_linear_combinations
## $linearCombos
## list()
##
## $remove
## NULL
There are no linear combinations in our data (no sets of variables that are linear combinations of one another).
correlations <- cor(df_train[,numeric_vars],
use = "pairwise.complete.obs")
corrplot.mixed(correlations,
upper = "number",
lower = "circle",
tl.col = "black",
tl.pos = "lt")
As we can see, no variable is highly correlated with the dependent variable ConvertedComp. However, a strong correlation between Age and YearsCode can be seen, which is understandable - the older the programmer, the more years spent programming. From the perspective of our analysis of the impact on annual salary, the variable indicating programming experience in years seems much more relevant than the programmer’s age itself. Therefore, we will drop the variable Age.
df_train$Age<-NULL
df_test$Age<-NULL
Let’s transform all categorical variables into factors for a qualitative approach.
df$Gender<-as.factor(df$Gender)
df$EdLevel<-as.factor(df$EdLevel)
df$OrgSize<-as.factor(df$OrgSize)
df$JobSat<-as.factor(df$JobSat)
df$Hobbyist<-as.factor(df$Hobbyist)
df$NEWOvertime<-as.factor(df$NEWOvertime)
df$CompFreq<-as.factor(df$CompFreq)
categorical_vars<-c("Gender","EdLevel","OrgSize","JobSat","Hobbyist","NEWOvertime","CompFreq")
The selection of categorical variables can be made using backward elimination, a general-to-specific approach in which the model starts with all variables and is re-estimated repeatedly, removing the least significant variable at each step. The process stops when a model containing only relevant variables is reached.
# Model with all variables.
model <- lm(df_train$ConvertedComp ~ .,
data = df_train[categorical_vars])
ols_step_backward_p(model,
prem = 0.05,
# show progress
progress = FALSE) -> backward_selector
backward_selector$removed
## [1] "Gender"
As we can see, the variable Gender was selected for removal.
We can check the relationship between our categorical variables with the target variable using ANOVA.
anova_Gender <- aov(df_train$ConvertedComp ~ df_train$Gender)
summary(anova_Gender)
## Df Sum Sq Mean Sq F value Pr(>F)
## df_train$Gender 1 25691 25691 2.935 0.0867 .
## Residuals 18427 161272234 8752
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_EdLevel <- aov(df_train$ConvertedComp ~ df_train$EdLevel)
summary(anova_EdLevel)
## Df Sum Sq Mean Sq F value Pr(>F)
## df_train$EdLevel 1 917557 917557 105.4 <2e-16 ***
## Residuals 18427 160380368 8704
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_JobSat <- aov(df_train$ConvertedComp ~ df_train$JobSat)
summary(anova_JobSat)
## Df Sum Sq Mean Sq F value Pr(>F)
## df_train$JobSat 1 3768249 3768249 440.8 <2e-16 ***
## Residuals 18427 157529676 8549
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_Hobbyist <- aov(df_train$ConvertedComp ~ df_train$Hobbyist)
summary(anova_Hobbyist)
## Df Sum Sq Mean Sq F value Pr(>F)
## df_train$Hobbyist 1 48391 48391 5.53 0.0187 *
## Residuals 18427 161249534 8751
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_NEWOvertime <- aov(df_train$ConvertedComp ~ df_train$NEWOvertime)
summary(anova_NEWOvertime)
## Df Sum Sq Mean Sq F value Pr(>F)
## df_train$NEWOvertime 1 205176 205176 23.47 1.28e-06 ***
## Residuals 18427 161092749 8742
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_CompFreq <- aov(df_train$ConvertedComp ~ df_train$CompFreq)
summary(anova_CompFreq)
## Df Sum Sq Mean Sq F value Pr(>F)
## df_train$CompFreq 1 3.53e+07 35302506 5163 <2e-16 ***
## Residuals 18427 1.26e+08 6838
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s sort the F-statistics in decreasing order to identify the most significant relationships.
fstat_anova <- function(categorical_var) {
  anova_ <- aov(df_train$ConvertedComp ~
                  df_train[[categorical_var]])
  # column 4 of the ANOVA summary table holds the F statistic
  return(summary(anova_)[[1]][1, 4])
}
anova_categorical <- sapply(categorical_vars,
                            fstat_anova) %>%
  sort(decreasing = TRUE)
anova_categorical
## CompFreq OrgSize JobSat EdLevel NEWOvertime Hobbyist
## 5163.039231 879.755349 440.790071 105.423259 23.469545 5.529898
## Gender
## 2.935405
Both methods, backward elimination and ANOVA, indicate that the variable Gender is the least significant one, so we should get rid of it.
df_train$Gender<-NULL
df_test$Gender<-NULL
vars_selected<- c("Age1stCode","YearsCode","ConvertedComp","EdLevel","OrgSize","JobSat","Hobbyist","NEWOvertime","CompFreq")
Multiple Linear Regression is a linear model that captures the relationship between the dependent variable, in our case ConvertedComp, and the independent, explanatory variables.
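In general form (recalling that ConvertedComp has already been square-root transformed), the model estimated below is

$$
\text{ConvertedComp}_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i ,
$$

with the coefficients chosen by ordinary least squares, i.e. by minimizing the residual sum of squares.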
Model_MultipleRegr <- lm(ConvertedComp ~ .,
data = df_train %>%
dplyr::select(all_of(vars_selected)))
summary(Model_MultipleRegr)
##
## Call:
## lm(formula = ConvertedComp ~ ., data = df_train %>% dplyr::select(all_of(vars_selected)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -349.72 -49.66 -4.78 45.60 350.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.05143 5.02095 -4.989 6.11e-07 ***
## Age1stCode -1.26570 0.14666 -8.630 < 2e-16 ***
## YearsCode 3.37534 0.06721 50.223 < 2e-16 ***
## EdLevel 0.81323 0.45621 1.783 0.0747 .
## OrgSize 4.97484 0.25254 19.699 < 2e-16 ***
## JobSat 7.72234 0.42266 18.271 < 2e-16 ***
## Hobbyist 1.26986 1.32196 0.961 0.3368
## NEWOvertime -2.57258 0.42509 -6.052 1.46e-09 ***
## CompFreq 66.94391 1.07023 62.551 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 74.28 on 18420 degrees of freedom
## Multiple R-squared: 0.37, Adjusted R-squared: 0.3697
## F-statistic: 1352 on 8 and 18420 DF, p-value: < 2.2e-16
Somewhat surprisingly, in our summary the p-values for the variables Hobbyist and EdLevel are higher than 0.05, indicating that these features are not significant - we therefore cannot conclude that coding as a hobby or education level affects the annual salary in this model.
For the next two models, Ridge Regression and the LASSO, we’ll use the glmnet package. Both are simple techniques that reduce model complexity and prevent the over-fitting which may result from Multiple Linear Regression.
Ridge Regression puts a constraint on the coefficients. The penalty term lambda regularizes the coefficients, so that large coefficient values are penalized in the optimization function. Thus, Ridge Regression reduces model complexity and multicollinearity.
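Formally, for a penalty parameter $\lambda \ge 0$, Ridge Regression estimates the coefficients by minimizing the penalized residual sum of squares

$$
\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 ,
$$

so the larger the $\lambda$, the more strongly the coefficients are shrunk towards zero (though never exactly to zero).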
As the glmnet() function does not use the same formula interface as the modeling function for Multiple Linear Regression, we have to start by creating the matrix of predictors x_train as well as the vector y_train with the target variable.
#Matrix of predictors
x_train = model.matrix(ConvertedComp~., df_train)[,-1]
#Creating vector with only target variable.
y_train = df_train$ConvertedComp%>%
unlist() %>%
as.numeric()
#Create grid of lambda values (ranging from λ=10^10 to λ=10^(−2) - covers the full range of scenarios from the null model containing only the intercept, to the least squares fit)
grid = 10^seq(10, -2, length = 100)
#Training the Ridge Regression model
Model_Ridge <- glmnet(x_train, y_train, alpha = 0, lambda = grid)
#Alpha argument determines what type of model is fit: 0 - Ridge Regression, 1 - LASSO Regression
summary(Model_Ridge)
## Length Class Mode
## a0 100 -none- numeric
## beta 800 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
#Lambda value minimizing the cross-validation error (best-fitting model)
CV_Ridge=cv.glmnet(x_train, y_train, alpha=0, type="deviance", family="gaussian", nfolds=10)
CV_Ridge$lambda.min
## [1] 4.376751
coef(CV_Ridge, s="lambda.min")
## 9 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -14.7000591
## EdLevel 0.9711423
## OrgSize 4.8843414
## YearsCode 3.2271837
## JobSat 7.4755985
## Hobbyist 1.1724929
## NEWOvertime -2.4432230
## Age1stCode -1.3714879
## CompFreq 64.3447077
Below is a visualization of the Ridge coefficients versus the λ values. The optimal value of lambda is marked by the blue dashed line.
plot(Model_Ridge, xvar="lambda", lwd=2, label=T)
abline(v=log(CV_Ridge$lambda.min), col="blue", lty=2)
Let’s see how the coefficients change as the log of lambda changes.
coefpath(Model_Ridge)
Plot of the coefficients sorted by magnitude at the optimal lambda value.
coefplot(Model_Ridge, lambda=CV_Ridge$lambda.min, sort="magnitude",color = "chocolate2")
LASSO Regression is similar to Ridge Regression, but the penalty term is slightly different - instead of penalizing the sum of squared coefficients, LASSO penalizes the sum of their absolute values.
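In the same notation, LASSO minimizes

$$
\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p}|\beta_j| ,
$$

and because of the absolute-value penalty some coefficients can be shrunk exactly to zero, so LASSO also performs variable selection.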
Model_LASSO <- glmnet(x_train,
y_train,
alpha = 1,
lambda = grid)
summary(Model_LASSO)
## Length Class Mode
## a0 100 -none- numeric
## beta 800 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
# The optimal lambda
CV_LASSO=cv.glmnet(x_train, y_train, alpha=1, type="deviance", family="gaussian", nfolds=10)
CV_LASSO$lambda.min
## [1] 0.1501431
# Extract coefficients at optimal lambda
coef(Model_LASSO,s=CV_LASSO$lambda.min)
## 9 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -23.6679780
## EdLevel 0.7007534
## OrgSize 4.9286993
## YearsCode 3.3659317
## JobSat 7.6195264
## Hobbyist 0.9067515
## NEWOvertime -2.4529689
## Age1stCode -1.2448488
## CompFreq 66.7836619
plot(Model_LASSO, xvar="lambda", lwd=2, label=T)
abline(v=log(CV_LASSO$lambda.min), col="blue", lty=2)
coefpath(Model_LASSO)
coefplot(Model_LASSO, lambda=CV_LASSO$lambda.min, sort="magnitude", color = "chocolate2")
Elastic Net Regression combines the penalties of Ridge and LASSO.
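In glmnet’s parameterization, the Elastic Net penalty added to the least-squares term combines both penalties:

$$
\lambda \Big[ \frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^{2} + \alpha \sum_{j=1}^{p}|\beta_j| \Big] ,
$$

so $\alpha = 0$ gives Ridge, $\alpha = 1$ gives LASSO, and the $\alpha = 0.5$ used below weights the two penalties equally.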
Model_ENR = glmnet(x_train,y_train,alpha=0.5,nlambda=100)
summary(Model_ENR)
## Length Class Mode
## a0 63 -none- numeric
## beta 504 dgCMatrix S4
## df 63 -none- numeric
## dim 2 -none- numeric
## lambda 63 -none- numeric
## dev.ratio 63 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
# 10-fold CV to find the optimal lambda
CV_ENR=cv.glmnet(x_train,y_train,alpha=0.5, type="deviance", family="gaussian", nfolds=10)
CV_ENR$lambda.min
## [1] 0.2736096
plot(Model_ENR, xvar="lambda", lwd=2, label=T)
abline(v=log(CV_ENR$lambda.min), col="blue", lty=2)
coefplot(Model_ENR, lambda=CV_ENR$lambda.min, sort="magnitude",color = "chocolate2")
coefpath(Model_ENR)
Preparing the test data.
x_test = model.matrix(ConvertedComp~., df_test)[,-1]
y_test = df_test$ConvertedComp%>%
unlist() %>%
as.numeric()
Generating predictions from each model on the test data.
Model_MultipleRegr_fit <- predict(Model_MultipleRegr,df_test)
Model_Ridge_fit<- predict(Model_Ridge, s = CV_Ridge$lambda.min, newx = x_test)
Model_LASSO_fit <- predict(Model_LASSO, s = CV_LASSO$lambda.min, newx = x_test)
Model_ENR_fit <- predict(Model_ENR,s = CV_ENR$lambda.min, x_test)
Let’s apply some statistical metrics to evaluate the performance of our models.
Unfortunately, we cannot see strong correlations between the predictions and the real values.
DF <- data.frame(real = y_test,
                 predicted_1 = Model_MultipleRegr_fit,
                 predicted_2 = Model_Ridge_fit,
                 predicted_3 = Model_LASSO_fit,
                 predicted_4 = Model_ENR_fit,
                 time = 1:length(y_test))
colnames(DF) <- c("real", "predicted_1", "predicted_2", "predicted_3", "predicted_4", "time")
cor(DF$real,DF[,-c(1,6)])
## predicted_1 predicted_2 predicted_3 predicted_4
## [1,] 0.6102647 0.6100434 0.6103097 0.6103005
The regressionMetrics function defined below will allow us to compare the statistics of our models:
MSE - Mean Squared Error - the mean of the squared differences between predicted and actual target values.
RMSE - Root Mean Squared Error - the square root of the MSE, i.e. the quadratic mean of the prediction errors.
MAE - Mean Absolute Error - the average of the absolute errors.
MAPE - Mean Absolute Percentage Error - the average absolute error expressed as a fraction of the actual values.
MedAE - Median Absolute Error - the median of all absolute differences between the target and the prediction.
MSLE - Mean Squared Logarithmic Error - the mean squared difference between the log of the predictions and the log of the actual values.
R2 - R Squared - the proportion of the variance of the dependent variable that is explained by the model.
regressionMetrics <- function(real, predicted) {
MSE <- mean((real - predicted)^2)
RMSE <- sqrt(MSE)
MAE <- mean(abs(real - predicted))
MAPE <- mean(abs(real - predicted)/real)
MedAE <- median(abs(real - predicted))
MSLE <- mean((log(1 + real) - log(1 + predicted))^2)
TSS <- sum((real - mean(real))^2)
RSS <- sum((predicted - real)^2)
R2 <- 1 - RSS/TSS
result <- data.frame(MSE, RMSE, MAE, MAPE, MedAE, MSLE, R2)
return(result)
}
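The comparison table below appears to have been assembled by applying regressionMetrics() to each model’s test-set predictions; a minimal sketch of that step (the object name results is ours):
# evaluate every model on the test set and collect the metrics in one table
results <- rbind(
  regressionMetrics(y_test, Model_MultipleRegr_fit),
  regressionMetrics(y_test, Model_Ridge_fit),
  regressionMetrics(y_test, Model_LASSO_fit),
  regressionMetrics(y_test, Model_ENR_fit)
)
rownames(results) <- c("Model_MultipleRegr", "Model_Ridge", "Model_LASSO", "Model_ENR")
results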
## MSE RMSE MAE MAPE MedAE MSLE
## Model_MultipleRegr 5563.126 74.58637 58.54117 0.4955225 47.93930 0.1962692
## Model_Ridge 5573.147 74.65352 58.68107 0.4982790 48.23868 0.1972209
## Model_LASSO 5563.250 74.58720 58.55713 0.4960227 48.02357 0.1964102
## Model_ENR 5563.456 74.58858 58.55987 0.4960679 48.02811 0.1964263
## R2
## Model_MultipleRegr 0.3722421
## Model_Ridge 0.3711113
## Model_LASSO 0.3722282
## Model_ENR 0.3722049
Unfortunately, none of the models fits well. The statistics for all models are very similar, but the best results were obtained with the Multiple Linear Regression model.
Let’s check whether our errors (residuals) are normally distributed - as we can see below, this assumption has been met. If we noticed patterns in the residuals, or if the errors were not normally distributed, it would suggest that we are dealing with a non-linear (e.g. polynomial) relationship, or that some outliers affect our model more than others.
The histograms of residuals for all models look almost identical.
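The code producing these residual histograms is not shown above; below is a minimal sketch of how they could be generated on the test set (the panel titles are ours):
# residual histograms for the four models on the test set
par(mfrow = c(2, 2))
hist(as.numeric(y_test - Model_MultipleRegr_fit), freq = FALSE, col = "chocolate2",
     main = "Multiple Regression", xlab = "Residuals")
hist(as.numeric(y_test - Model_Ridge_fit), freq = FALSE, col = "chocolate2",
     main = "Ridge", xlab = "Residuals")
hist(as.numeric(y_test - Model_LASSO_fit), freq = FALSE, col = "chocolate2",
     main = "LASSO", xlab = "Residuals")
hist(as.numeric(y_test - Model_ENR_fit), freq = FALSE, col = "chocolate2",
     main = "Elastic Net", xlab = "Residuals")
par(mfrow = c(1, 1))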
The Stack Overflow dataset is not of the highest quality. Using ANOVA and backward elimination, we concluded that Gender does not impact the amount of the salary. ConvertedComp is a complex variable to predict, and an R^2 at the level of 37% may be sufficient.