Exploring the Stack Overflow survey

Stack Overflow is the world’s largest online community for developers, and you have probably used it to find an answer to a programming question. The second chapter uses data from the annual Stack Overflow Developer Survey to practice predictive modeling and find which developers are more likely to work remotely.

Whenever you plan to do modeling, it is a good idea to explore your dataset first. Start off this modeling analysis by checking out how many remote and non-remote developers you have to work with, where they live, and how much experience they have.

library(tidyverse)
## -- Attaching packages -------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.3.0
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ----------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
stackoverflow <- read.csv("D:/VIT/DATA SCIENCE/stackoverflow.csv")

# First count for Remote
stackoverflow %>% 
    count(Remote, sort = TRUE)
## # A tibble: 2 x 2
##   Remote         n
##   <fct>      <int>
## 1 Not remote  6273
## 2 Remote       718
# Then count for Country
stackoverflow %>% 
    count(Country, sort = TRUE)
## # A tibble: 5 x 2
##   Country            n
##   <fct>          <int>
## 1 United States   3486
## 2 United Kingdom  1270
## 3 Germany          950
## 4 India            666
## 5 Canada           619
ggplot(stackoverflow, aes(Remote, YearsCodedJob)) +
    geom_boxplot() +
    labs(x = NULL,
         y = "Years of professional coding experience") 

Notice the distribution of countries, the effect of experience, and most importantly, the imbalance between remote and non-remote workers in this dataset.
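
To put a number on that imbalance, you can compute the class proportions directly (a quick sketch using the same stackoverflow data loaded above):

# Proportion of remote vs. non-remote developers
stackoverflow %>% 
    count(Remote) %>% 
    mutate(prop = n / sum(n))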

Start with a simple model

Before diving into the process of building machine learning models, let's build an extremely simple model to get our bearings. This is not a model you would want to use to make predictions on new data, but it can give you an idea of how successful you may eventually be and which predictors are most important.

Recall that when you use the pipe operator %>% with a function like glm() (whose first argument is not data), you must specify data = . to indicate that you are piping in the modeling data set.

# Build a simple logistic regression model
simple_glm <- stackoverflow %>%
        select(-Respondent) %>%
        glm(Remote ~ .,
            family = "binomial",
            data = .)

# Print the summary of the model
summary(simple_glm)
## 
## Call:
## glm(formula = Remote ~ ., family = "binomial", data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1942  -0.4971  -0.3824  -0.2867   2.9118  
## 
## Coefficients:
##                                            Estimate Std. Error z value
## (Intercept)                              -4.156e+00  2.929e-01 -14.187
## CountryGermany                           -2.034e-01  2.196e-01  -0.927
## CountryIndia                              9.574e-01  2.220e-01   4.312
## CountryUnited Kingdom                     5.599e-02  1.974e-01   0.284
## CountryUnited States                      5.990e-01  1.799e-01   3.330
## Salary                                    4.076e-06  1.589e-06   2.565
## YearsCodedJob                             7.133e-02  7.556e-03   9.440
## OpenSourceTRUE                            4.207e-01  8.555e-02   4.917
## HobbyTRUE                                 8.330e-03  9.827e-02   0.085
## CompanySizeNumber                        -6.104e-05  1.223e-05  -4.990
## CareerSatisfaction                        6.748e-02  2.664e-02   2.533
## Data.scientistTRUE                       -1.186e-01  1.838e-01  -0.645
## Database.administratorTRUE                2.763e-01  1.267e-01   2.181
## Desktop.applications.developerTRUE       -2.903e-01  9.842e-02  -2.950
## Developer.with.stats.math.backgroundTRUE  2.840e-02  1.359e-01   0.209
## DevOpsTRUE                               -1.532e-01  1.292e-01  -1.185
## Embedded.developerTRUE                   -2.777e-01  1.653e-01  -1.680
## Graphic.designerTRUE                     -1.904e-01  2.725e-01  -0.699
## Graphics.programmingTRUE                  1.078e-01  2.312e-01   0.466
## Machine.learning.specialistTRUE          -2.289e-01  2.769e-01  -0.827
## Mobile.developerTRUE                      2.170e-01  1.019e-01   2.130
## Quality.assurance.engineerTRUE           -2.826e-01  2.437e-01  -1.160
## Systems.administratorTRUE                 1.462e-01  1.421e-01   1.029
## Web.developerTRUE                         1.158e-01  9.993e-02   1.159
##                                          Pr(>|z|)    
## (Intercept)                               < 2e-16 ***
## CountryGermany                           0.354161    
## CountryIndia                             1.62e-05 ***
## CountryUnited Kingdom                    0.776710    
## CountryUnited States                     0.000868 ***
## Salary                                   0.010314 *  
## YearsCodedJob                             < 2e-16 ***
## OpenSourceTRUE                           8.78e-07 ***
## HobbyTRUE                                0.932444    
## CompanySizeNumber                        6.04e-07 ***
## CareerSatisfaction                       0.011323 *  
## Data.scientistTRUE                       0.518709    
## Database.administratorTRUE               0.029184 *  
## Desktop.applications.developerTRUE       0.003178 ** 
## Developer.with.stats.math.backgroundTRUE 0.834400    
## DevOpsTRUE                               0.235833    
## Embedded.developerTRUE                   0.093039 .  
## Graphic.designerTRUE                     0.484596    
## Graphics.programmingTRUE                 0.641060    
## Machine.learning.specialistTRUE          0.408484    
## Mobile.developerTRUE                     0.033194 *  
## Quality.assurance.engineerTRUE           0.246098    
## Systems.administratorTRUE                0.303507    
## Web.developerTRUE                        0.246655    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4627.8  on 6990  degrees of freedom
## Residual deviance: 4268.8  on 6967  degrees of freedom
## AIC: 4316.8
## 
## Number of Fisher Scoring iterations: 5

While this model isn’t the best for predicting on new data, you can get an idea of which predictors have larger effect sizes and which are significant or not significant.
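
One convenient way to scan those coefficients is broom's tidy(), which turns the summary table into a data frame you can sort and filter (a sketch; broom is not part of the original exercise, but it is loaded later as a dependency of yardstick):

# Turn the coefficient table into a data frame and sort by p-value
library(broom)
tidy(simple_glm) %>% 
    arrange(p.value)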

Training and testing data

Before you deal with the imbalance in the remote/not remote classes, first split your data into training and testing sets. You create subsets of your data for training and testing your model for the same reasons as before: to reduce overfitting and to obtain a more accurate estimate of how your model will perform on new data.

# Load caret
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
stack_select <- stackoverflow %>%
    select(-Respondent)

# Split the data into training and testing sets
set.seed(1234)
in_train <- createDataPartition(stack_select$Remote, p = 0.8, list = FALSE)
training <- stack_select[in_train,]
testing <- stack_select[-in_train,]
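
Because createDataPartition() samples within each class, the remote/not-remote proportions should be roughly the same in both sets. A quick sanity check (not part of the original exercise):

# Check that the class proportions are preserved in both sets
training %>% count(Remote) %>% mutate(prop = n / sum(n))
testing %>% count(Remote) %>% mutate(prop = n / sum(n))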

Upsampling

There are multiple possible approaches to dealing with class imbalance. Here, you will implement upsampling using caret’s upSample() function.

up_train <- upSample(x = select(training, -Remote),
                     y = training$Remote,
                     yname = "Remote") %>%
    as_tibble()

up_train %>%
    count(Remote)
## # A tibble: 2 x 2
##   Remote         n
##   <fct>      <int>
## 1 Not remote  5019
## 2 Remote      5019

Upsampling samples with replacement until the class distributions are equal, so there are the same number of remote and non-remote developers after upsampling. Adjusting for class imbalance like this helps you train a model that performs better.
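
To make what upSample() is doing concrete, here is a rough manual equivalent, sampling the minority class with replacement until it matches the majority class (illustration only; in practice, stick with upSample() or the sampling argument of trainControl()):

# Manual upsampling sketch: sample the "Remote" class with replacement
minority <- training %>% filter(Remote == "Remote")
majority <- training %>% filter(Remote == "Not remote")
manual_up <- bind_rows(majority,
                       sample_n(minority, size = nrow(majority), replace = TRUE))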

Training models

Finally! It’s time to train predictive models for this data set of Stack Overflow Developer Survey responses. We will continue to use the powerful, flexible train() function from caret to specify our machine learning models.

In the original exercise, the data sets are reduced to 1% of their original size so the code evaluates quickly (which can produce warnings due to the subsampling). The output shown below, however, comes from training on the full training set.

# Build a logistic regression model
stack_glm <- train(Remote ~ ., method = "glm", family = "binomial",
                   data = training,
                   trControl = trainControl(method = "boot",
                                            sampling = "up"))

# Print the model object 
stack_glm
## Generalized Linear Model 
## 
## 5594 samples
##   20 predictor
##    2 classes: 'Not remote', 'Remote' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 5594, 5594, 5594, 5594, 5594, 5594, ... 
## Addtional sampling using up-sampling
## 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6537111  0.1282623
# Build a random forest model
stack_rf <- train(Remote ~ ., method = "rf",
                  data = training,
                  trControl = trainControl(method = "boot",
                                           sampling = "up"))

# Print the model object
stack_rf
## Random Forest 
## 
## 5594 samples
##   20 predictor
##    2 classes: 'Not remote', 'Remote' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 5594, 5594, 5594, 5594, 5594, 5594, ... 
## Addtional sampling using up-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa     
##    2    0.7558523  0.16301035
##   12    0.8771749  0.06805856
##   23    0.8659782  0.07804283
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 12.

Confusion matrix

A confusion matrix describes how well a classification model (like the ones you just trained!) performs by tabulating how many examples in each class were correctly classified. In your case, it will show you how many remote developers were classified as remote and how many non-remote developers were classified as non-remote; it also shows you how many were classified into the wrong categories.
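
You can build the same kind of tabulation yourself with base R's table(), which makes clear what is being counted (a sketch, using the stack_glm model and the testing set from above):

# Cross-tabulate predicted vs. actual remote status by hand
table(Predicted = predict(stack_glm, testing),
      Actual = testing$Remote)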

Here you will use the confusionMatrix() function from caret to evaluate the performance of the two models you trained, stack_glm and stack_rf. The models available in your environment were trained on all the training data, not only 1%.

# Confusion matrix for logistic regression model
confusionMatrix(predict(stack_glm, testing),
                testing$Remote)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Not remote Remote
##   Not remote        837     56
##   Remote            417     87
##                                           
##                Accuracy : 0.6614          
##                  95% CI : (0.6359, 0.6862)
##     No Information Rate : 0.8976          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1302          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.6675          
##             Specificity : 0.6084          
##          Pos Pred Value : 0.9373          
##          Neg Pred Value : 0.1726          
##              Prevalence : 0.8976          
##          Detection Rate : 0.5991          
##    Detection Prevalence : 0.6392          
##       Balanced Accuracy : 0.6379          
##                                           
##        'Positive' Class : Not remote      
## 
# Confusion matrix for random forest model
confusionMatrix(predict(stack_rf, testing),
                testing$Remote)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Not remote Remote
##   Not remote       1211    122
##   Remote             43     21
##                                           
##                Accuracy : 0.8819          
##                  95% CI : (0.8638, 0.8984)
##     No Information Rate : 0.8976          
##     P-Value [Acc > NIR] : 0.9747          
##                                           
##                   Kappa : 0.149           
##  Mcnemar's Test P-Value : 1.261e-09       
##                                           
##             Sensitivity : 0.9657          
##             Specificity : 0.1469          
##          Pos Pred Value : 0.9085          
##          Neg Pred Value : 0.3281          
##              Prevalence : 0.8976          
##          Detection Rate : 0.8669          
##    Detection Prevalence : 0.9542          
##       Balanced Accuracy : 0.5563          
##                                           
##        'Positive' Class : Not remote      
## 

A confusion matrix is used to evaluate how a model performs on new data, so you should compute it on the testing set, not the training set.
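
For contrast, evaluating on the training data would typically look far too optimistic, especially for the random forest (a sketch, for illustration only):

# Evaluating on the training set usually overstates performance
confusionMatrix(predict(stack_rf, training),
                training$Remote)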

Classification model metrics

The confusionMatrix() function is helpful, but often you want to store specific performance estimates for later, perhaps in a dataframe-friendly form. The yardstick package is built to handle such needs. For this kind of classifier model, you might look at the positive or negative predictive value or perhaps overall accuracy.

The models available in your environment, stack_glm and stack_rf, were trained on all the training data, not only 1%.

# Load yardstick
library(yardstick)
## Loading required package: broom
## 
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
## 
##     mnLogLoss, precision, recall
## The following object is masked from 'package:readr':
## 
##     spec
# Predict values
testing_results <- testing %>%
    mutate(`Logistic regression` = predict(stack_glm, testing),
           `Random forest` = predict(stack_rf, testing))

## Calculate accuracy
accuracy(testing_results, truth = Remote, estimate = `Logistic regression`)
## [1] 0.6614173
accuracy(testing_results, truth = Remote, estimate = `Random forest`)
## [1] 0.8818898
## Calculate positive predictive value
ppv(testing_results, truth = Remote, estimate = `Logistic regression`)
## [1] 0.93729
ppv(testing_results, truth = Remote, estimate = `Random forest`)
## [1] 0.9084771

In terms of overall accuracy and positive predictive value, the random forest model outperforms the logistic regression model. We can predict the remote status of a developer more accurately with a random forest model.
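
If you want to build out the comparison further, yardstick has matching functions for other metrics, such as negative predictive value and sensitivity (a sketch using the same testing_results; the exact form of the output depends on your yardstick version):

# Negative predictive value and sensitivity for both models
npv(testing_results, truth = Remote, estimate = `Logistic regression`)
npv(testing_results, truth = Remote, estimate = `Random forest`)
sens(testing_results, truth = Remote, estimate = `Logistic regression`)
sens(testing_results, truth = Remote, estimate = `Random forest`)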