Stack Overflow is the world’s largest online community for developers, and you have probably used it to find an answer to a programming question. The second chapter uses data from the annual Stack Overflow Developer Survey to practice predictive modeling and find which developers are more likely to work remotely.
Anytime you plan to build a model, it is a good idea to explore your dataset first. Start off this modeling analysis by checking how many remote and non-remote developers you have to work with, where they live, and how much experience they have.
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.0
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ----------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
stackoverflow <- read.csv("D:/VIT/DATA SCIENCE/stackoverflow.csv")
# First, count for Remote
stackoverflow %>%
    count(Remote, sort = TRUE)
## # A tibble: 2 x 2
## Remote n
## <fct> <int>
## 1 Not remote 6273
## 2 Remote 718
# Then, count for Country
stackoverflow %>%
    count(Country, sort = TRUE)
## # A tibble: 5 x 2
## Country n
## <fct> <int>
## 1 United States 3486
## 2 United Kingdom 1270
## 3 Germany 950
## 4 India 666
## 5 Canada 619
ggplot(stackoverflow, aes(Remote, YearsCodedJob)) +
    geom_boxplot() +
    labs(x = NULL,
         y = "Years of professional coding experience")
Notice the distribution of countries, the effect of experience, and most importantly, the imbalance between remote and non-remote workers in this dataset.
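To put a number on that class imbalance, you can compute the proportion of remote developers directly. Here is a quick sketch using the same count() approach as above.
# Proportion of remote vs. non-remote developers
stackoverflow %>%
    count(Remote) %>%
    mutate(prop = n / sum(n))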
Before diving into the machine learning workflow, let's build an extremely simple model to get our bearings. This is not a model you would want to use to make predictions on new data, but it can give you an idea of how successful you may eventually be and which predictors are most important.
Recall that when you use the pipe operator %>% with a function like glm() (whose first argument is not data), you must specify data = . to indicate that you are piping in the modeling data set.
# Build a simple logistic regression model
simple_glm <- stackoverflow %>%
    select(-Respondent) %>%
    glm(Remote ~ .,
        family = "binomial",
        data = .)
# Print the summary of the model
summary(simple_glm)
##
## Call:
## glm(formula = Remote ~ ., family = "binomial", data = .)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1942 -0.4971 -0.3824 -0.2867 2.9118
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -4.156e+00 2.929e-01 -14.187
## CountryGermany -2.034e-01 2.196e-01 -0.927
## CountryIndia 9.574e-01 2.220e-01 4.312
## CountryUnited Kingdom 5.599e-02 1.974e-01 0.284
## CountryUnited States 5.990e-01 1.799e-01 3.330
## Salary 4.076e-06 1.589e-06 2.565
## YearsCodedJob 7.133e-02 7.556e-03 9.440
## OpenSourceTRUE 4.207e-01 8.555e-02 4.917
## HobbyTRUE 8.330e-03 9.827e-02 0.085
## CompanySizeNumber -6.104e-05 1.223e-05 -4.990
## CareerSatisfaction 6.748e-02 2.664e-02 2.533
## Data.scientistTRUE -1.186e-01 1.838e-01 -0.645
## Database.administratorTRUE 2.763e-01 1.267e-01 2.181
## Desktop.applications.developerTRUE -2.903e-01 9.842e-02 -2.950
## Developer.with.stats.math.backgroundTRUE 2.840e-02 1.359e-01 0.209
## DevOpsTRUE -1.532e-01 1.292e-01 -1.185
## Embedded.developerTRUE -2.777e-01 1.653e-01 -1.680
## Graphic.designerTRUE -1.904e-01 2.725e-01 -0.699
## Graphics.programmingTRUE 1.078e-01 2.312e-01 0.466
## Machine.learning.specialistTRUE -2.289e-01 2.769e-01 -0.827
## Mobile.developerTRUE 2.170e-01 1.019e-01 2.130
## Quality.assurance.engineerTRUE -2.826e-01 2.437e-01 -1.160
## Systems.administratorTRUE 1.462e-01 1.421e-01 1.029
## Web.developerTRUE 1.158e-01 9.993e-02 1.159
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## CountryGermany 0.354161
## CountryIndia 1.62e-05 ***
## CountryUnited Kingdom 0.776710
## CountryUnited States 0.000868 ***
## Salary 0.010314 *
## YearsCodedJob < 2e-16 ***
## OpenSourceTRUE 8.78e-07 ***
## HobbyTRUE 0.932444
## CompanySizeNumber 6.04e-07 ***
## CareerSatisfaction 0.011323 *
## Data.scientistTRUE 0.518709
## Database.administratorTRUE 0.029184 *
## Desktop.applications.developerTRUE 0.003178 **
## Developer.with.stats.math.backgroundTRUE 0.834400
## DevOpsTRUE 0.235833
## Embedded.developerTRUE 0.093039 .
## Graphic.designerTRUE 0.484596
## Graphics.programmingTRUE 0.641060
## Machine.learning.specialistTRUE 0.408484
## Mobile.developerTRUE 0.033194 *
## Quality.assurance.engineerTRUE 0.246098
## Systems.administratorTRUE 0.303507
## Web.developerTRUE 0.246655
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4627.8 on 6990 degrees of freedom
## Residual deviance: 4268.8 on 6967 degrees of freedom
## AIC: 4316.8
##
## Number of Fisher Scoring iterations: 5
While this model isn’t one you would use for predicting on new data, it gives you an idea of which predictors have larger effect sizes and which are statistically significant.
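Because this is a logistic regression, the coefficients above are on the log-odds scale. To make the effect sizes easier to read, you could exponentiate them into odds ratios; here is a minimal sketch using the broom package (assuming it is installed), which tidies the model summary into a data frame.
# Load broom for tidying model output
library(broom)
# Exponentiate the coefficients to interpret them as odds ratios
tidy(simple_glm, exponentiate = TRUE) %>%
    arrange(desc(estimate))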
Before you deal with the imbalance in the remote/not remote classes, first split your data into training and testing sets. You create subsets of your data for training and testing your model for the same reasons you did before, to reduce overfitting and obtain a more accurate estimate for how your model will perform on new data.
# Load caret
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
stack_select <- stackoverflow %>%
    select(-Respondent)
# Split the data into training and testing sets
set.seed(1234)
in_train <- createDataPartition(stack_select$Remote, p = 0.8, list = FALSE)
training <- stack_select[in_train,]
testing <- stack_select[-in_train,]
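Because createDataPartition() samples within each class, the split should preserve the proportion of remote developers in both subsets. A quick sanity check (not part of the original output) could look like this.
# Check that the stratified split kept similar class proportions
training %>% count(Remote) %>% mutate(prop = n / sum(n))
testing %>% count(Remote) %>% mutate(prop = n / sum(n))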
There are multiple possible approaches to dealing with class imbalance. Here, you will implement upsampling using caret’s upSample() function.
up_train <- upSample(x = select(training, -Remote),
                     y = training$Remote,
                     yname = "Remote") %>%
    as_tibble()
up_train %>%
    count(Remote)
## # A tibble: 2 x 2
## Remote n
## <fct> <int>
## 1 Not remote 5019
## 2 Remote 5019
Upsampling samples with replacement until the class distributions are equal, so there are the same number of remote and non-remote developers after upsampling. Adjusting for the class imbalance like this helps you train a model that does a better job of identifying the minority class, the remote developers.
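If you would rather not duplicate minority-class rows, caret also provides downSample(), which instead removes majority-class rows until the classes are balanced. This sketch shows what that alternative would look like; it is not used in the rest of this analysis.
# Alternative: downsampling drops 'Not remote' rows until the classes match
down_train <- downSample(x = select(training, -Remote),
                         y = training$Remote,
                         yname = "Remote") %>%
    as_tibble()
down_train %>%
    count(Remote)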
Finally! It’s time to train predictive models for this data set of Stack Overflow Developer Survey responses. We will continue to use the powerful, flexible train() function from caret to specify our machine learning models.
To keep the code in this exercise evaluating quickly, the data sets in the exercise environment were reduced to 1% of their original size, so you may see some warnings due to the subsampling; the results printed below come from the full training set.
# Build a logistic regression model
stack_glm <- train(Remote ~ ., method = "glm", family = "binomial",
                   data = training,
                   trControl = trainControl(method = "boot",
                                            sampling = "up"))
# Print the model object
stack_glm
## Generalized Linear Model
##
## 5594 samples
## 20 predictor
## 2 classes: 'Not remote', 'Remote'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 5594, 5594, 5594, 5594, 5594, 5594, ...
## Addtional sampling using up-sampling
##
## Resampling results:
##
## Accuracy Kappa
## 0.6537111 0.1282623
# Build a random forest model
stack_rf <- train(Remote ~ ., method = "rf",
                  data = training,
                  trControl = trainControl(method = "boot",
                                           sampling = "up"))
# Print the model object
stack_rf
## Random Forest
##
## 5594 samples
## 20 predictor
## 2 classes: 'Not remote', 'Remote'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 5594, 5594, 5594, 5594, 5594, 5594, ...
## Addtional sampling using up-sampling
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7558523 0.16301035
## 12 0.8771749 0.06805856
## 23 0.8659782 0.07804283
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 12.
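Since both models were fit with the same bootstrap resampling scheme, you can also compare their resampling performance side by side with caret's resamples() function. A quick sketch, assuming both fitted model objects are still in your environment:
# Collect and summarize the bootstrap results for both models
model_comparison <- resamples(list(glm = stack_glm, rf = stack_rf))
summary(model_comparison)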
A confusion matrix describes how well a classification model (like the ones you just trained!) performs by tabulating how many examples in each class were correctly classified. In your case, it will show you how many remote developers were classified as remote and how many non-remote developers were classified as non-remote; it also shows how many were classified into the wrong categories.
Here you will use the confusionMatrix() function from caret to evaluate the performance of the two models you trained, stack_glm and stack_rf. The models available in your environment were trained on all the training data, not only 1%.
# Confusion matrix for logistic regression model
confusionMatrix(predict(stack_glm, testing),
                testing$Remote)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not remote Remote
## Not remote 837 56
## Remote 417 87
##
## Accuracy : 0.6614
## 95% CI : (0.6359, 0.6862)
## No Information Rate : 0.8976
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1302
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6675
## Specificity : 0.6084
## Pos Pred Value : 0.9373
## Neg Pred Value : 0.1726
## Prevalence : 0.8976
## Detection Rate : 0.5991
## Detection Prevalence : 0.6392
## Balanced Accuracy : 0.6379
##
## 'Positive' Class : Not remote
##
# Confusion matrix for random forest model
confusionMatrix(predict(stack_rf, testing),
                testing$Remote)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not remote Remote
## Not remote 1211 122
## Remote 43 21
##
## Accuracy : 0.8819
## 95% CI : (0.8638, 0.8984)
## No Information Rate : 0.8976
## P-Value [Acc > NIR] : 0.9747
##
## Kappa : 0.149
## Mcnemar's Test P-Value : 1.261e-09
##
## Sensitivity : 0.9657
## Specificity : 0.1469
## Pos Pred Value : 0.9085
## Neg Pred Value : 0.3281
## Prevalence : 0.8976
## Detection Rate : 0.8669
## Detection Prevalence : 0.9542
## Balanced Accuracy : 0.5563
##
## 'Positive' Class : Not remote
##
A confusion matrix is used to evaluate how a model performs on data it has not seen before, so you should compute it on the testing set rather than the training set.
The confusionMatrix() function is helpful but often you want to store specific performance estimates for later, perhaps in a dataframe-friendly form. The yardstick package is built to handle such needs. For this kind of classifier model, you might look at the positive or negative predictive value or perhaps overall accuracy.
The models available in your environment, stack_glm and stack_rf, were trained on all the training data, not only 1%.
# Load yardstick
library(yardstick)
## Loading required package: broom
##
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
##
## mnLogLoss, precision, recall
## The following object is masked from 'package:readr':
##
## spec
# Predict values
testing_results <- testing %>%
    mutate(`Logistic regression` = predict(stack_glm, testing),
           `Random forest` = predict(stack_rf, testing))
# Calculate accuracy
accuracy(testing_results, truth = Remote, estimate = `Logistic regression`)
## [1] 0.6614173
accuracy(testing_results, truth = Remote, estimate = `Random forest`)
## [1] 0.8818898
# Calculate positive predictive value
ppv(testing_results, truth = Remote, estimate = `Logistic regression`)
## [1] 0.93729
ppv(testing_results, truth = Remote, estimate = `Random forest`)
## [1] 0.9084771
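Since "Not remote" is the positive class here, specificity measures how well each model identifies the truly remote developers. As a short follow-up sketch (not part of the original output), you could compute it with yardstick's spec() in the same way:
# Calculate specificity: how well does each model find the remote developers?
spec(testing_results, truth = Remote, estimate = `Logistic regression`)
spec(testing_results, truth = Remote, estimate = `Random forest`)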
Comparing the two models on the testing set, the random forest has the higher overall accuracy (0.88 vs. 0.66), but the logistic regression model actually has the higher positive predictive value (0.937 vs. 0.908) and, as the confusion matrices show, identifies far more of the truly remote developers (specificity of 0.61 vs. 0.15). With classes as imbalanced as these, overall accuracy alone can be misleading, so consider which metric matters most for your question before choosing a model.