Stack Overflow is a Q&A website for programmers. Every year for the past decade, Stack Overflow has run a survey that gauges the state of the tech industry and technology trends. The survey collects data on basic information; education, work, and career; technology and tech culture; Stack Overflow usage and community; demographics; and some additional topics. Most questions are multiple choice, and the anonymized results are released publicly. Insights from such surveys can inform employer recruitment plans and individual study plans, and they provide a clear picture of technology use, age and work experience, pay, and gender bias.
This project aims to predict a developer's salary from up to 60 other factors. The dataset has dimensions 64461 x 61. The factors considered include gender, organization size, education level, years of coding, level of job satisfaction, whether coding is a hobby, age, how often respondents work overtime, when respondents started coding, and how frequently respondents are compensated. The model uses regression techniques from Machine Learning. The steps to achieve the desired results include exploratory data analysis, data cleaning, and feature selection.
The dataset used in this analysis contains the Stack Overflow Annual Developer Survey 2020 results, with nearly 65,000 responses fielded from over 180 countries and dependent territories. The survey examines all aspects of the developer experience, from career satisfaction and job search to education and opinions on open-source software. The dataset can be found on the Stack Overflow webpage.
Loading all the necessary packages and dataset.
library(dplyr)          # data manipulation verbs (select, filter, ...)
library(DataExplorer)   # plot_missing()
library(tidyverse)      # ggplot2, tidyr, purrr, ...
library(hrbrthemes)     # ggplot2 themes
library(viridis)        # colour scales
library(forcats)        # factor helpers
library(reshape)        # melt()
library(e1071)          # skewness()
library(caret)          # createDataPartition(), nearZeroVar(), findLinearCombos()
library(corrplot)       # corrplot.mixed()
library(olsrr)          # ols_step_backward_p()
library(glmnet)         # ridge / LASSO / elastic net regression
library(coefplot)       # coefplot(), coefpath()
library(skimr)          # skim()
df<-read.csv("survey_results_public.csv")
skim(df)
| Name | df |
|---|---|
| Number of rows | 64461 |
| Number of columns | 61 |
| Column type frequency: | |
| character | 56 |
| numeric | 5 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| MainBranch | 299 | 1.00 | 27 | 77 | 0 | 5 | 0 |
| Hobbyist | 45 | 1.00 | 2 | 3 | 0 | 2 | 0 |
| Age1stCode | 6561 | 0.90 | 1 | 20 | 0 | 63 | 0 |
| CompFreq | 24392 | 0.62 | 6 | 7 | 0 | 3 | 0 |
| Country | 389 | 0.99 | 4 | 41 | 0 | 183 | 0 |
| CurrencyDesc | 18989 | 0.71 | 7 | 39 | 0 | 142 | 0 |
| CurrencySymbol | 18989 | 0.71 | 3 | 3 | 0 | 141 | 0 |
| DatabaseDesireNextYear | 20391 | 0.68 | 5 | 133 | 0 | 3193 | 0 |
| DatabaseWorkedWith | 14924 | 0.77 | 5 | 133 | 0 | 2808 | 0 |
| DevType | 15091 | 0.77 | 8 | 531 | 0 | 8269 | 0 |
| EdLevel | 7030 | 0.89 | 25 | 82 | 0 | 9 | 0 |
| Employment | 607 | 0.99 | 7 | 52 | 0 | 7 | 0 |
| Ethnicity | 18513 | 0.71 | 8 | 235 | 0 | 208 | 0 |
| Gender | 13904 | 0.78 | 3 | 59 | 0 | 7 | 0 |
| JobFactors | 15112 | 0.77 | 19 | 189 | 0 | 230 | 0 |
| JobSat | 19267 | 0.70 | 14 | 34 | 0 | 5 | 0 |
| JobSeek | 12734 | 0.80 | 31 | 60 | 0 | 3 | 0 |
| LanguageDesireNextYear | 10348 | 0.84 | 1 | 164 | 0 | 16243 | 0 |
| LanguageWorkedWith | 7083 | 0.89 | 1 | 164 | 0 | 14256 | 0 |
| MiscTechDesireNextYear | 22082 | 0.66 | 4 | 169 | 0 | 5216 | 0 |
| MiscTechWorkedWith | 24147 | 0.63 | 4 | 169 | 0 | 2730 | 0 |
| NEWCollabToolsDesireNextYear | 17174 | 0.73 | 4 | 149 | 0 | 1277 | 0 |
| NEWCollabToolsWorkedWith | 11578 | 0.82 | 4 | 149 | 0 | 1153 | 0 |
| NEWDevOps | 21775 | 0.66 | 2 | 8 | 0 | 3 | 0 |
| NEWDevOpsImpt | 22729 | 0.65 | 7 | 20 | 0 | 5 | 0 |
| NEWEdImpt | 15996 | 0.75 | 14 | 34 | 0 | 5 | 0 |
| NEWJobHunt | 22175 | 0.66 | 12 | 377 | 0 | 2172 | 0 |
| NEWJobHuntResearch | 23439 | 0.64 | 36 | 362 | 0 | 63 | 0 |
| NEWLearn | 8305 | 0.87 | 11 | 20 | 0 | 4 | 0 |
| NEWOffTopic | 13657 | 0.79 | 2 | 8 | 0 | 3 | 0 |
| NEWOnboardGood | 21838 | 0.66 | 2 | 28 | 0 | 3 | 0 |
| NEWOtherComms | 7256 | 0.89 | 2 | 3 | 0 | 2 | 0 |
| NEWOvertime | 21230 | 0.67 | 5 | 56 | 0 | 5 | 0 |
| NEWPurchaseResearch | 27140 | 0.58 | 18 | 253 | 0 | 63 | 0 |
| NEWPurpleLink | 9658 | 0.85 | 6 | 17 | 0 | 4 | 0 |
| NEWSOSites | 6186 | 0.90 | 37 | 305 | 0 | 61 | 0 |
| NEWStuck | 9478 | 0.85 | 5 | 225 | 0 | 444 | 0 |
| OpSys | 8233 | 0.87 | 3 | 11 | 0 | 4 | 0 |
| OrgSize | 20127 | 0.69 | 16 | 50 | 0 | 9 | 0 |
| PlatformDesireNextYear | 13856 | 0.79 | 3 | 177 | 0 | 7471 | 0 |
| PlatformWorkedWith | 10618 | 0.84 | 3 | 177 | 0 | 6287 | 0 |
| PurchaseWhat | 25097 | 0.61 | 21 | 32 | 0 | 3 | 0 |
| Sexuality | 20469 | 0.68 | 5 | 53 | 0 | 14 | 0 |
| SOAccount | 7656 | 0.88 | 2 | 23 | 0 | 3 | 0 |
| SOComm | 7985 | 0.88 | 7 | 15 | 0 | 6 | 0 |
| SOPartFreq | 17669 | 0.73 | 20 | 50 | 0 | 6 | 0 |
| SOVisitFreq | 7491 | 0.88 | 20 | 50 | 0 | 6 | 0 |
| SurveyEase | 12659 | 0.80 | 4 | 26 | 0 | 3 | 0 |
| SurveyLength | 12760 | 0.80 | 8 | 21 | 0 | 3 | 0 |
| Trans | 15116 | 0.77 | 2 | 3 | 0 | 2 | 0 |
| UndergradMajor | 13466 | 0.79 | 24 | 78 | 0 | 12 | 0 |
| WebframeDesireNextYear | 24437 | 0.62 | 5 | 134 | 0 | 3986 | 0 |
| WebframeWorkedWith | 22182 | 0.66 | 5 | 134 | 0 | 3789 | 0 |
| WelcomeChange | 11778 | 0.82 | 37 | 55 | 0 | 6 | 0 |
| YearsCode | 6777 | 0.89 | 1 | 18 | 0 | 52 | 0 |
| YearsCodePro | 18112 | 0.72 | 1 | 18 | 0 | 52 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Respondent | 0 | 1.00 | 3.255408e+04 | 18967.44 | 1 | 16116 | 32231 | 49142 | 6.563900e+04 | ▇▇▇▇▇ |
| Age | 19015 | 0.71 | 3.083000e+01 | 9.59 | 1 | 24 | 29 | 35 | 2.790000e+02 | ▇▁▁▁▁ |
| CompTotal | 29635 | 0.54 | 3.190464e+242 | Inf | 0 | 20000 | 63000 | 125000 | 1.111111e+247 | ▇▁▁▁▁ |
| ConvertedComp | 29705 | 0.54 | 1.037561e+05 | 226885.30 | 0 | 24648 | 54049 | 95000 | 2.000000e+06 | ▇▁▁▁▁ |
| WorkWeekHrs | 23310 | 0.64 | 4.078000e+01 | 17.82 | 1 | 40 | 40 | 44 | 4.750000e+02 | ▇▁▁▁▁ |
As the dataset contains many features that will not necessarily be used in further analysis, we decided to clean the data a bit at the beginning. We started this process by asking ourselves a basic question:
Which variables affect the annual salary (USD) of a given respondent?
Thus, we chose the following variables:
df <- df %>%
dplyr::select(Gender, EdLevel,
OrgSize, YearsCode, JobSat,
ConvertedComp, Hobbyist, Age, NEWOvertime, Age1stCode, CompFreq)
Gender - respondent’s gender
EdLevel - highest level of formal education the respondent completed
OrgSize - number of people employed by the company or organization the respondent currently works for
YearsCode - how many years the respondent has been coding in total
JobSat - how satisfied the respondent is with his/her current job
ConvertedComp - salary converted to annual USD using the exchange rate on 2020-02-19, assuming 12 working months and 50 working weeks
Hobbyist - whether the respondent codes as a hobby
Age - respondent’s age in years
NEWOvertime - how often the respondent works overtime or beyond the formal time expectations of his/her job
Age1stCode - respondent’s age when he/she wrote the first line of code or program (e.g., webpage, Hello World, Scratch project)
CompFreq - frequency of compensation payment: weekly, monthly, or yearly
We will start our analysis by checking whether our dataset contains missing values. As you can see below, there are quite a lot of them, so the process of handling them will be considered separately for each variable, both numerical and categorical.
plot_missing(df)
Initially we only have two numeric variables: ConvertedComp and Age.
sapply(df, is.numeric) %>%
which() %>%
names()
## [1] "ConvertedComp" "Age"
df %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
As we can see above, the variable Age looks right-skewed (long tail on the right), hence we will use the median to replace its missing values. In the case of the variable ConvertedComp, we will remove all missing values, as we want to include in our analysis only employed programmers who receive a salary.
df<- df %>%
filter(ConvertedComp!=0) %>%
na.omit(df)
any(df$ConvertedComp <= 0) %>%
knitr::knit_print()
## [1] FALSE
df$Age[is.na(df$Age)] <- median(df$Age, na.rm=TRUE)
any(df$Age <= 0) %>%
knitr::knit_print()
## [1] FALSE
The vast majority of our variables are categorical:
categorical_vars <-
sapply(df, is.character) %>%
which() %>%
names()
categorical_vars
## [1] "Gender" "EdLevel" "OrgSize" "YearsCode" "JobSat"
## [6] "Hobbyist" "NEWOvertime" "Age1stCode" "CompFreq"
Let’s initially explore our variables to see what levels they have:
df[categorical_vars] <- lapply(df[categorical_vars],factor)
lapply(df[categorical_vars],levels)
## $Gender
## [1] "Man"
## [2] "Man;Non-binary, genderqueer, or gender non-conforming"
## [3] "Non-binary, genderqueer, or gender non-conforming"
## [4] "Woman"
## [5] "Woman;Man"
## [6] "Woman;Man;Non-binary, genderqueer, or gender non-conforming"
## [7] "Woman;Non-binary, genderqueer, or gender non-conforming"
##
## $EdLevel
## [1] "Associate degree (A.A., A.S., etc.)"
## [2] "Bachelor’s degree (B.A., B.S., B.Eng., etc.)"
## [3] "I never completed any formal education"
## [4] "Master’s degree (M.A., M.S., M.Eng., MBA, etc.)"
## [5] "Other doctoral degree (Ph.D., Ed.D., etc.)"
## [6] "Primary/elementary school"
## [7] "Professional degree (JD, MD, etc.)"
## [8] "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)"
## [9] "Some college/university study without earning a degree"
##
## $OrgSize
## [1] "1,000 to 4,999 employees"
## [2] "10 to 19 employees"
## [3] "10,000 or more employees"
## [4] "100 to 499 employees"
## [5] "2 to 9 employees"
## [6] "20 to 99 employees"
## [7] "5,000 to 9,999 employees"
## [8] "500 to 999 employees"
## [9] "Just me - I am a freelancer, sole proprietor, etc."
##
## $YearsCode
## [1] "1" "10" "11"
## [4] "12" "13" "14"
## [7] "15" "16" "17"
## [10] "18" "19" "2"
## [13] "20" "21" "22"
## [16] "23" "24" "25"
## [19] "26" "27" "28"
## [22] "29" "3" "30"
## [25] "31" "32" "33"
## [28] "34" "35" "36"
## [31] "37" "38" "39"
## [34] "4" "40" "41"
## [37] "42" "43" "44"
## [40] "45" "46" "47"
## [43] "48" "49" "5"
## [46] "50" "6" "7"
## [49] "8" "9" "Less than 1 year"
## [52] "More than 50 years"
##
## $JobSat
## [1] "Neither satisfied nor dissatisfied" "Slightly dissatisfied"
## [3] "Slightly satisfied" "Very dissatisfied"
## [5] "Very satisfied"
##
## $Hobbyist
## [1] "No" "Yes"
##
## $NEWOvertime
## [1] "Never"
## [2] "Occasionally: 1-2 days per quarter but less than monthly"
## [3] "Often: 1-2 days per week or more"
## [4] "Rarely: 1-2 days per year or less"
## [5] "Sometimes: 1-2 days per month but less than weekly"
##
## $Age1stCode
## [1] "10" "11" "12"
## [4] "13" "14" "15"
## [7] "16" "17" "18"
## [10] "19" "20" "21"
## [13] "22" "23" "24"
## [16] "25" "26" "27"
## [19] "28" "29" "30"
## [22] "31" "32" "33"
## [25] "34" "35" "36"
## [28] "37" "38" "39"
## [31] "40" "41" "42"
## [34] "43" "44" "45"
## [37] "46" "47" "48"
## [40] "5" "50" "6"
## [43] "7" "8" "9"
## [46] "Older than 85" "Younger than 5 years"
##
## $CompFreq
## [1] "Monthly" "Weekly" "Yearly"
options(contrasts = c("contr.treatment", # for non-ordinal factors
"contr.treatment")) # for ordinal factors
Below we have encoded the ordinal features and handled missing values. As with the numeric variables, we are dealing with skewed data here as well, hence the missing values have been replaced with the medians.
As the questionnaire offered respondents many options for indicating gender, we distinguished two categories for the purposes of our analysis. We therefore aggregated "Man" and "Man;Non-binary, genderqueer, or gender non-conforming" into a new category, Man (1), and merged the remaining options into Woman (0).
df$Gender <- ifelse(df$Gender == "Man"| df$Gender =="Man;Non-binary, genderqueer, or gender non-conforming", 1, 0)
df$Gender[is.na(df$Gender)] <- median(df$Gender, na.rm=TRUE)
df$EdLevel <-
plyr::revalue(as.character(df$EdLevel),
c("I never completed any formal education" = "1",
"Primary/elementary school" = "2",
"Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)" = "3",
"Some college/university study without earning a degree" = "4",
"Bachelor’s degree (B.A., B.S., B.Eng., etc.)" = "6",
"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)" = "7",
"Professional degree (JD, MD, etc.)" = "8",
"Other doctoral degree (Ph.D., Ed.D., etc.)" = "8",
"Associate degree (A.A., A.S., etc.)" = "5")) %>%
as.numeric()
df$EdLevel[is.na(df$EdLevel)] <- median(df$EdLevel, na.rm=TRUE)
df$OrgSize <-
plyr::revalue(as.character(df$OrgSize),
c("Just me - I am a freelancer, sole proprietor, etc." = "1",
"2 to 9 employees" = "2",
"10 to 19 employees" = "3",
"20 to 99 employees" = "4",
"100 to 499 employees" = "5",
"500 to 999 employees" = "6",
"1,000 to 4,999 employees" = "7",
"5,000 to 9,999 employees" = "8",
"10,000 or more employees" = "9")) %>%
as.numeric()
df$OrgSize[is.na(df$OrgSize)] <- median(df$OrgSize, na.rm=TRUE)
df$YearsCode <-
plyr::revalue(as.character(df$YearsCode),
c("Less than 1 year" = "1",
"More than 50 years" = "50")) %>%
as.numeric()
df$YearsCode[is.na(df$YearsCode)] <- median(df$YearsCode, na.rm=TRUE)
df$JobSat <-
plyr::revalue(as.character(df$JobSat),
c("Very dissatisfied" = "1",
"Slightly dissatisfied" = "2",
"Neither satisfied nor dissatisfied" = "3",
"Slightly satisfied" = "4",
"Very satisfied" ="5")) %>%
as.numeric()
df$JobSat[is.na(df$JobSat)] <- median(df$JobSat, na.rm=TRUE)
df$Hobbyist <- ifelse(df$Hobbyist == "Yes", 1, 0)
df$Hobbyist[is.na(df$Hobbyist)] <- median(df$Hobbyist, na.rm=TRUE)
df$NEWOvertime <-
plyr::revalue(as.character(df$NEWOvertime),
c("Never" = "1",
"Rarely: 1-2 days per year or less" = "2",
"Occasionally: 1-2 days per quarter but less than monthly" = "3",
"Sometimes: 1-2 days per month but less than weekly" = "4",
"Often: 1-2 days per week or more" ="5")) %>%
as.numeric()
df$NEWOvertime [is.na(df$NEWOvertime )] <- median(df$NEWOvertime , na.rm=TRUE)
df$Age1stCode <-
plyr::revalue(as.character(df$Age1stCode),
c("Younger than 5 years" = "4",
"Older than 85" = "86")) %>%
as.numeric()
df$Age1stCode[is.na(df$Age1stCode)] <- median(df$Age1stCode, na.rm=TRUE)
df$CompFreq <-
plyr::revalue(as.character(df$CompFreq),
c("Weekly" = "1",
"Monthly" = "2",
"Yearly" = "3")) %>%
as.numeric()
df$CompFreq[is.na(df$CompFreq)] <- median(df$CompFreq, na.rm=TRUE)
Let’s see if we still have any missing values.
sum(is.na(df))
## [1] 0
meltData <- melt(df)
## Using as id variables
p <- ggplot(meltData, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scales = "free")
As we can see, the variables Age, Age1stCode, YearsCode and ConvertedComp have outliers. We will remove the observations with those values in the next stage.
outlier_Age <- boxplot(df$Age)$out
outlier_Age1stCode <- boxplot(df$Age1stCode)$out
outlier_YearsCode <- boxplot(df$YearsCode)$out
outlier_ConvertedComp <- boxplot(df$ConvertedComp)$out
Removing outliers:
df <- df[!((df$Age %in% outlier_Age) | (df$Age1stCode %in% outlier_Age1stCode) | (df$ConvertedComp %in% outlier_ConvertedComp) | (df$YearsCode %in% outlier_YearsCode)), ]
After cleaning the data, we have to take a closer look at our dependent variable ConvertedComp. As we can see, there are still some outliers. All values greater than the cut-off point of 250,000 USD can be eliminated.
df %>%
filter(df$ConvertedComp > 250000) %>%
nrow() %>%
knitr::knit_print()
## [1] 61
df <- df %>%
filter(df$ConvertedComp < 250000)
Now let’s see the distribution of our dependent variable. It is clearly visible that it is not symmetrical but right-skewed. Therefore we should apply some transformation to our variable.
skewness(df$ConvertedComp)
## [1] 0.9421419
par(mfrow=c(1,3))
hist(df$ConvertedComp, freq=FALSE, col="chocolate2", xlab="ConvertedComp regular")
hist(log(df$ConvertedComp), freq=FALSE, col="chocolate2", xlab="ConvertedComp log")
hist(sqrt(df$ConvertedComp), freq=FALSE, col="chocolate2", xlab="ConvertedComp square root")
As we can see, the square root transformation is closer to a normal distribution.
df$ConvertedComp<-sqrt(df$ConvertedComp)
set.seed(987654321)
df_training <- createDataPartition(df$ConvertedComp, p = 0.7, list = FALSE)
df_train <- df[df_training,]
df_test <- df[-df_training,]
summary(df_train$ConvertedComp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 155.7 224.6 225.4 289.9 498.6
summary(df_test$ConvertedComp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 155.8 224.6 226.0 289.9 495.9
Our test and training sets have very similar distributions and key characteristics of the dependent variable, so we’ll stick with these datasets.
numeric_vars<-sapply(df_train, is.numeric) %>%
which() %>%
names()
numeric_vars
## [1] "Gender" "EdLevel" "OrgSize" "YearsCode"
## [5] "JobSat" "ConvertedComp" "Hobbyist" "Age"
## [9] "NEWOvertime" "Age1stCode" "CompFreq"
It is advisable to remove variables with a single level (zero variance). As no variable has only one unique value, we don’t need to eliminate anything.
sapply(df_train[, numeric_vars],
function(x)
unique(x) %>%
length()) %>%
sort()
## Gender Hobbyist CompFreq JobSat NEWOvertime
## 2 2 3 5 5
## EdLevel OrgSize Age1stCode YearsCode Age
## 8 9 28 50 64
## ConvertedComp
## 4649
There are also no variables with near-zero variance.
df_nzv <- nearZeroVar(df_train[, numeric_vars],
saveMetrics = TRUE)
df_nzv
## freqRatio percentUnique zeroVar nzv
## Gender 11.639918 0.01085246 FALSE FALSE
## EdLevel 1.956885 0.04340984 FALSE FALSE
## OrgSize 1.121892 0.04883607 FALSE FALSE
## YearsCode 1.487522 0.27131152 FALSE FALSE
## JobSat 1.020840 0.02713115 FALSE FALSE
## ConvertedComp 1.011834 25.22654512 FALSE FALSE
## Hobbyist 3.436447 0.01085246 FALSE FALSE
## Age 1.016708 0.34727875 FALSE FALSE
## NEWOvertime 1.118553 0.02713115 FALSE FALSE
## Age1stCode 1.011446 0.15193445 FALSE FALSE
## CompFreq 1.384332 0.01627869 FALSE FALSE
data_linear_combinations <- findLinearCombos(df_train[, numeric_vars])
data_linear_combinations
## $linearCombos
## list()
##
## $remove
## NULL
There are no linear combinations in our data (no sets of variables that are linear combinations of one another).
correlations <- cor(df_train[,numeric_vars],
use = "pairwise.complete.obs")
corrplot.mixed(correlations,
upper = "number",
lower = "circle",
tl.col = "black",
tl.pos = "lt")
As we can see, no variable is highly correlated with the dependent variable ConvertedComp. However, a strong correlation between Age and YearsCode can be seen, which is understandable - the older the programmer, the more years spent programming. From the perspective of our analysis of the impact on annual salary, the variable indicating programming experience in years seems much more relevant than the programmer’s age itself. Therefore, we will drop the variable Age.
df_train$Age<-NULL
df_test$Age<-NULL
Let’s transform all categorical variables into factors for a qualitative approach.
df$Gender<-as.factor(df$Gender)
df$EdLevel<-as.factor(df$EdLevel)
df$OrgSize<-as.factor(df$OrgSize)
df$JobSat<-as.factor(df$JobSat)
df$Hobbyist<-as.factor(df$Hobbyist)
df$NEWOvertime<-as.factor(df$NEWOvertime)
df$CompFreq<-as.factor(df$CompFreq)
categorical_vars<-c("Gender","EdLevel","OrgSize","JobSat","Hobbyist","NEWOvertime","CompFreq")
The selection of categorical variables can be made using backward elimination, a general-to-specific approach in which the model starts with all variables and is re-estimated repeatedly, removing the least significant variable at each step. The process stops when a model containing only relevant variables is reached.
# Model with all variables.
model <- lm(df_train$ConvertedComp ~ .,
data = df_train[categorical_vars])
ols_step_backward_p(model,
prem = 0.05,
# show progress
progress = FALSE) -> backward_selector
backward_selector$removed
## [1] "Gender"
As we can see, the variable Gender was selected for removal.
We can check the relationship between our categorical variables with the target variable using ANOVA.
anova_Gender <- aov(df_train$ConvertedComp ~ df_train$Gender)
summary(anova_Gender)
## Df Sum Sq Mean Sq F value Pr(>F)
## df_train$Gender 1 25691 25691 2.935 0.0867 .
## Residuals 18427 161272234 8752
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_EdLevel <- aov(df_train$ConvertedComp ~ df_train$EdLevel)
summary(anova_EdLevel)
## Df Sum Sq Mean Sq F value Pr(>F)
## df_train$EdLevel 1 917557 917557 105.4 <2e-16 ***
## Residuals 18427 160380368 8704
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_JobSat <- aov(df_train$ConvertedComp ~ df_train$JobSat)
summary(anova_JobSat)
## Df Sum Sq Mean Sq F value Pr(>F)
## df_train$JobSat 1 3768249 3768249 440.8 <2e-16 ***
## Residuals 18427 157529676 8549
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_Hobbyist <- aov(df_train$ConvertedComp ~ df_train$Hobbyist)
summary(anova_Hobbyist)
## Df Sum Sq Mean Sq F value Pr(>F)
## df_train$Hobbyist 1 48391 48391 5.53 0.0187 *
## Residuals 18427 161249534 8751
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_NEWOvertime <- aov(df_train$ConvertedComp ~ df_train$NEWOvertime)
summary(anova_NEWOvertime)
## Df Sum Sq Mean Sq F value Pr(>F)
## df_train$NEWOvertime 1 205176 205176 23.47 1.28e-06 ***
## Residuals 18427 161092749 8742
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_CompFreq <- aov(df_train$ConvertedComp ~ df_train$CompFreq)
summary(anova_CompFreq)
## Df Sum Sq Mean Sq F value Pr(>F)
## df_train$CompFreq 1 3.53e+07 35302506 5163 <2e-16 ***
## Residuals 18427 1.26e+08 6838
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s sort the F-statistics in decreasing order to identify the most significant relationships.
fstat_anova <- function(categorical_var) {
  anova_ <- aov(df_train$ConvertedComp ~
                  df_train[[categorical_var]])
  # column 4 of the ANOVA summary table holds the F statistic
  return(summary(anova_)[[1]][1, 4])
}
anova_categorical <- sapply(categorical_vars,
                            fstat_anova) %>%
  sort(decreasing = TRUE)
anova_categorical
## CompFreq OrgSize JobSat EdLevel NEWOvertime Hobbyist
## 5163.039231 879.755349 440.790071 105.423259 23.469545 5.529898
## Gender
## 2.935405
Both methods, backward elimination and ANOVA, indicate that the variable Gender is the least significant one, so we should get rid of it.
df_train$Gender<-NULL
df_test$Gender<-NULL
vars_selected<- c("Age1stCode","YearsCode","ConvertedComp","EdLevel","OrgSize","JobSat","Hobbyist","NEWOvertime","CompFreq")
Multiple Linear Regression is a linear model that captures the relationship between the dependent variable, in our case ConvertedComp, and the independent, explanatory variables.
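In general form (recalling that ConvertedComp has already been square-root transformed), the model estimated below is

$$
\text{ConvertedComp}_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i ,
$$

with the coefficients chosen by ordinary least squares, i.e. by minimizing the residual sum of squares.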
Model_MultipleRegr <- lm(ConvertedComp ~ .,
data = df_train %>%
dplyr::select(all_of(vars_selected)))
summary(Model_MultipleRegr)
##
## Call:
## lm(formula = ConvertedComp ~ ., data = df_train %>% dplyr::select(all_of(vars_selected)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -349.72 -49.66 -4.78 45.60 350.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.05143 5.02095 -4.989 6.11e-07 ***
## Age1stCode -1.26570 0.14666 -8.630 < 2e-16 ***
## YearsCode 3.37534 0.06721 50.223 < 2e-16 ***
## EdLevel 0.81323 0.45621 1.783 0.0747 .
## OrgSize 4.97484 0.25254 19.699 < 2e-16 ***
## JobSat 7.72234 0.42266 18.271 < 2e-16 ***
## Hobbyist 1.26986 1.32196 0.961 0.3368
## NEWOvertime -2.57258 0.42509 -6.052 1.46e-09 ***
## CompFreq 66.94391 1.07023 62.551 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 74.28 on 18420 degrees of freedom
## Multiple R-squared: 0.37, Adjusted R-squared: 0.3697
## F-statistic: 1352 on 8 and 18420 DF, p-value: < 2.2e-16
Somewhat surprisingly, in our summary the p-values for the variables Hobbyist and EdLevel are higher than 0.05, indicating that these features are not significant - we therefore cannot conclude that coding as a hobby or education level affects the annual salary in this model.
For the next two models, Ridge Regression and the LASSO, we’ll use the glmnet package. Both are simple techniques that reduce model complexity and prevent the over-fitting which may result from Multiple Linear Regression.
Ridge Regression puts a constraint on the coefficients. The penalty term lambda regularizes the coefficients, so that large coefficient values are penalized in the optimization function. Thus, Ridge Regression reduces model complexity and multicollinearity.
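Formally, for a penalty parameter $\lambda \ge 0$, Ridge Regression estimates the coefficients by minimizing the penalized residual sum of squares

$$
\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 ,
$$

so the larger the $\lambda$, the more strongly the coefficients are shrunk towards zero (though never exactly to zero).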
As the glmnet() function does not use the same formula interface as the modeling function for Multiple Linear Regression, we have to start by creating the matrix of predictors x_train as well as the vector y_train with the target variable.
#Matrix of predictors
x_train = model.matrix(ConvertedComp~., df_train)[,-1]
#Creating vector with only target variable.
y_train = df_train$ConvertedComp%>%
unlist() %>%
as.numeric()
#Create grid of lambda values (ranging from λ=10^10 to λ=10^(−2) - covers the full range of scenarios from the null model containing only the intercept, to the least squares fit)
grid = 10^seq(10, -2, length = 100)
#Training the Ridge Regression model
Model_Ridge <- glmnet(x_train, y_train, alpha = 0, lambda = grid)
#Alpha argument determines what type of model is fit: 0 - Ridge Regression, 1 - LASSO Regression
summary(Model_Ridge)
## Length Class Mode
## a0 100 -none- numeric
## beta 800 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
#Lambda value minimizing the cross-validation error (best-fitting model)
CV_Ridge=cv.glmnet(x_train, y_train, alpha=0, type="deviance", family="gaussian", nfolds=10)
CV_Ridge$lambda.min
## [1] 4.376751
coef(CV_Ridge, s="lambda.min")
## 9 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -14.7000591
## EdLevel 0.9711423
## OrgSize 4.8843414
## YearsCode 3.2271837
## JobSat 7.4755985
## Hobbyist 1.1724929
## NEWOvertime -2.4432230
## Age1stCode -1.3714879
## CompFreq 64.3447077
Below is a visualization of the Ridge coefficients versus the λ values. The optimal value of lambda is marked by the blue dashed line.
plot(Model_Ridge, xvar="lambda", lwd=2, label=T)
abline(v=log(CV_Ridge$lambda.min), col="blue", lty=2)
Let’s see how the coefficients change as the log of lambda changes.
coefpath(Model_Ridge)
Plot of the coefficients sorted by magnitude at the optimal lambda value.
coefplot(Model_Ridge, lambda=CV_Ridge$lambda.min, sort="magnitude",color = "chocolate2")
LASSO Regression is similar to Ridge Regression, but the penalty term is slightly different - instead of penalizing the sum of squared coefficients, LASSO penalizes the sum of their absolute values.
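In the same notation, LASSO minimizes

$$
\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p}|\beta_j| ,
$$

and because of the absolute-value penalty some coefficients can be shrunk exactly to zero, so LASSO also performs variable selection.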
Model_LASSO <- glmnet(x_train,
y_train,
alpha = 1,
lambda = grid)
summary(Model_LASSO)
## Length Class Mode
## a0 100 -none- numeric
## beta 800 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
# The optimal lambda
CV_LASSO=cv.glmnet(x_train, y_train, alpha=1, type="deviance", family="gaussian", nfolds=10)
CV_LASSO$lambda.min
## [1] 0.1501431
# Extract coefficients at optimal lambda
coef(Model_LASSO,s=CV_LASSO$lambda.min)
## 9 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -23.6679780
## EdLevel 0.7007534
## OrgSize 4.9286993
## YearsCode 3.3659317
## JobSat 7.6195264
## Hobbyist 0.9067515
## NEWOvertime -2.4529689
## Age1stCode -1.2448488
## CompFreq 66.7836619
plot(Model_LASSO, xvar="lambda", lwd=2, label=T)
abline(v=log(CV_LASSO$lambda.min), col="blue", lty=2)
coefpath(Model_LASSO)
coefplot(Model_LASSO, lambda=CV_LASSO$lambda.min, sort="magnitude", color = "chocolate2")
Elastic Net Regression combines the penalties of Ridge and LASSO.
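In glmnet’s parameterization, the Elastic Net penalty added to the least-squares term combines both penalties:

$$
\lambda \Big[ \frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^{2} + \alpha \sum_{j=1}^{p}|\beta_j| \Big] ,
$$

so $\alpha = 0$ gives Ridge, $\alpha = 1$ gives LASSO, and the $\alpha = 0.5$ used below weights the two penalties equally.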
Model_ENR = glmnet(x_train,y_train,alpha=0.5,nlambda=100)
summary(Model_ENR)
## Length Class Mode
## a0 63 -none- numeric
## beta 504 dgCMatrix S4
## df 63 -none- numeric
## dim 2 -none- numeric
## lambda 63 -none- numeric
## dev.ratio 63 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
# 10-fold CV to find the optimal lambda
CV_ENR=cv.glmnet(x_train,y_train,alpha=0.5, type="deviance", family="gaussian", nfolds=10)
CV_ENR$lambda.min
## [1] 0.2736096
plot(Model_ENR, xvar="lambda", lwd=2, label=T)
abline(v=log(CV_ENR$lambda.min), col="blue", lty=2)
coefplot(Model_ENR, lambda=CV_ENR$lambda.min, sort="magnitude",color = "chocolate2")
coefpath(Model_ENR)
Preparing the test data.
x_test = model.matrix(ConvertedComp~., df_test)[,-1]
y_test = df_test$ConvertedComp%>%
unlist() %>%
as.numeric()
Generating predictions from each model on the test data.
Model_MultipleRegr_fit <- predict(Model_MultipleRegr,df_test)
Model_Ridge_fit<- predict(Model_Ridge, s = CV_Ridge$lambda.min, newx = x_test)
Model_LASSO_fit <- predict(Model_LASSO, s = CV_LASSO$lambda.min, newx = x_test)
Model_ENR_fit <- predict(Model_ENR,s = CV_ENR$lambda.min, x_test)
Let’s apply some statistical metrics to evaluate the performance of our models.
Unfortunately, we cannot see strong correlations between the predictions and the real values.
DF <- data.frame(real = y_test,
                 predicted_1 = Model_MultipleRegr_fit,
                 predicted_2 = Model_Ridge_fit,
                 predicted_3 = Model_LASSO_fit,
                 predicted_4 = Model_ENR_fit,
                 time = 1:length(y_test))
colnames(DF) <- c("real", "predicted_1", "predicted_2", "predicted_3", "predicted_4", "time")
cor(DF$real,DF[,-c(1,6)])
## predicted_1 predicted_2 predicted_3 predicted_4
## [1,] 0.6102647 0.6100434 0.6103097 0.6103005
The regressionMetrics function defined below will allow us to compare the statistics of our models:
MSE - Mean Squared Error - the mean of the squared differences between predicted and actual target values.
RMSE - Root Mean Squared Error - the square root of the MSE, i.e. the quadratic mean of the prediction errors.
MAE - Mean Absolute Error - the average of the absolute errors.
MAPE - Mean Absolute Percentage Error - the average absolute error expressed as a fraction of the actual values.
MedAE - Median Absolute Error - the median of all absolute differences between the target and the prediction.
MSLE - Mean Squared Logarithmic Error - the mean squared difference between the log of the predictions and the log of the actual values.
R2 - R Squared - the proportion of the variance of the dependent variable that is explained by the model.
regressionMetrics <- function(real, predicted) {
MSE <- mean((real - predicted)^2)
RMSE <- sqrt(MSE)
MAE <- mean(abs(real - predicted))
MAPE <- mean(abs(real - predicted)/real)
MedAE <- median(abs(real - predicted))
MSLE <- mean((log(1 + real) - log(1 + predicted))^2)
TSS <- sum((real - mean(real))^2)
RSS <- sum((predicted - real)^2)
R2 <- 1 - RSS/TSS
result <- data.frame(MSE, RMSE, MAE, MAPE, MedAE, MSLE, R2)
return(result)
}
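The comparison table below appears to have been assembled by applying regressionMetrics() to each model’s test-set predictions; a minimal sketch of that step (the object name results is ours):
# evaluate every model on the test set and collect the metrics in one table
results <- rbind(
  regressionMetrics(y_test, Model_MultipleRegr_fit),
  regressionMetrics(y_test, Model_Ridge_fit),
  regressionMetrics(y_test, Model_LASSO_fit),
  regressionMetrics(y_test, Model_ENR_fit)
)
rownames(results) <- c("Model_MultipleRegr", "Model_Ridge", "Model_LASSO", "Model_ENR")
results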
## MSE RMSE MAE MAPE MedAE MSLE
## Model_MultipleRegr 5563.126 74.58637 58.54117 0.4955225 47.93930 0.1962692
## Model_Ridge 5573.147 74.65352 58.68107 0.4982790 48.23868 0.1972209
## Model_LASSO 5563.250 74.58720 58.55713 0.4960227 48.02357 0.1964102
## Model_ENR 5563.456 74.58858 58.55987 0.4960679 48.02811 0.1964263
## R2
## Model_MultipleRegr 0.3722421
## Model_Ridge 0.3711113
## Model_LASSO 0.3722282
## Model_ENR 0.3722049
Unfortunately, none of the models fits well. The statistics for all models are very similar, but the best results were obtained with the Multiple Linear Regression model.
Let’s check whether our errors (residuals) are normally distributed - as we can see below, this assumption has been met. If we noticed patterns in the residuals, or if the errors were not normally distributed, it would suggest that we are dealing with a non-linear (e.g. polynomial) relationship, or that some outliers affect our model more than others.
The histograms of residuals for all models look almost identical.
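The code producing these residual histograms is not shown above; below is a minimal sketch of how they could be generated on the test set (the panel titles are ours):
# residual histograms for the four models on the test set
par(mfrow = c(2, 2))
hist(as.numeric(y_test - Model_MultipleRegr_fit), freq = FALSE, col = "chocolate2",
     main = "Multiple Regression", xlab = "Residuals")
hist(as.numeric(y_test - Model_Ridge_fit), freq = FALSE, col = "chocolate2",
     main = "Ridge", xlab = "Residuals")
hist(as.numeric(y_test - Model_LASSO_fit), freq = FALSE, col = "chocolate2",
     main = "LASSO", xlab = "Residuals")
hist(as.numeric(y_test - Model_ENR_fit), freq = FALSE, col = "chocolate2",
     main = "Elastic Net", xlab = "Residuals")
par(mfrow = c(1, 1))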
The Stack Overflow dataset is not of the highest quality. Using ANOVA and backward elimination, we concluded that Gender does not impact the amount of the salary. ConvertedComp is a complex variable to predict, and an R^2 at the level of 37% may be sufficient.