We’ll start out our Machine Learning tutorial using a familiar
dataset: the Opportunity Insights data on county-level inequality. We’ll
predict inequality levels using the data. We’ll begin our journey into
machine learning using a familiar method: linear regression. The ideas
are largely the same as the previous tutorial, but this time we’ll start
to think about regression from a machine learning perspective.
data <- read.csv("county_data.csv")
We’ll build a model predicting inequality at the county level, using
several co-variates to make our prediction (in ML, covariates are often
referred to as “features” in the model). This is considered “supervised
machine learning” because we already know the actual inequality level in
each county.
First, we’ll select our variables of interest. Then, we need to split
our data into training and testing sets. We’ll use our training set to
train the model, while our testing set will be used to validate our
model.
#select the variables we'll use
library(dplyr)
model_data <- data %>%
select(c("inequality", "racial_segregation", "crime_rate", "unemp_rate", "percent_college_degree")) %>%
mutate(college = percent_college_degree/100) %>%
select(-"percent_college_degree")
#Check the mean of our dependent variable for reference
mean(model_data$inequality, na.rm=T)
sd(model_data$inequality, na.rm=T)
#split into training and testing data
#setting the seed allows us to recreate this analysis later
set.seed(8675309)
#we need to draw random samples for our training and testing data - this code does this for us
#The training data will be 75% of the data, and the remaining 25% will be the testing/validation data
split <- 0.75
rows <- nrow(model_data)
train.entries <- sample(rows, rows*split)
model.train <- model_data[train.entries, ]
model.valid <- model_data[-train.entries, ]
model <- lm(inequality ~ racial_segregation, data=model.train)
summary(model)
Ok, now we have the model based on the test data. Let’s use this
model on the validation data, and assess its fit.
model.valid <- model.valid %>%
#use the model to predict inequality in the validation data
#save the predicted coefficients from this model
mutate(yhat = predict(model, newdata=model.valid)) %>%
#calculate the residual so we can plot it
mutate(residual = inequality - yhat)
#Save this for later, when we compare the two models
model.train <- model.train %>%
mutate(yhat = predict(model, newdata=model.train)) %>%
mutate(residual = inequality - yhat)
head(model.valid)
mean(model.valid$residual, na.rm=T)
Now, we’ll plot the residuals to take a look at the fit of the
model:
library(ggplot2)
ggplot(model.valid) +
geom_point(aes(x=yhat, y=residual)) +
geom_hline(aes(yintercept=0), linetype="dashed", color="red") +
xlab("Predicted Inequality") +
ylab("Residual Inequality")+
theme_minimal()+
labs(title = "Residual Plot for Validation Model")
This residual is not perfect, but the spread around zero is
relatively consistent, which is a good sign for us. The data does seem
to be clustered around lower predicted levels of inequality though,
rather than spread out across the distribution.
Is this an issue? Maybe not. This may indicate that one of our
variables is skewed, and we can try transforming (e.g. taking the
logarithm) the variable and see if that improves the model fit. This may
also indicate that we’re missing a variable.
Let’s take a closer look at the model error - what kind of error are
we looking at here?
ggplot(model.valid) +
geom_histogram(aes(x=residual), bins=50, color="black", fill = "lightblue") +
xlab("Residual (inequality)")+
theme_minimal()+
labs(title = "Distribution of Model Residuals")
There are some outliers, but most of the residual is relatively
normally distributed with a mean of 0. We generally like to see
residuals that are normally distributed. On the other hand, some of the
variables have an error as high as 0.25! For a variable with a mean of
0.38, that is some substantial error.
Finally, we’ll check for overfitting today. To do this, we’ll compare
the standard deviations of the training and test set residuals. If the
test residuals sd is much larger than the sd for the training data, then
we know that our model has been overfit.
#Check the training model
sd(model.train$residual, na.rm=T)
#now compare with the validation model
sd(model.valid$residual, na.rm=T)
These are very close! This tells me that my model is likely not
overfitted. If my SD were, for example, 2x the size of the training
model residuals, I’d be concerned.
Finally, the mean squared error (mse) and root mean squared error
(rmse) are additional measures of model accuracy that are common in
machine learning:
model.valid <- model.valid %>%
mutate(square_error = residual^2)
#mse
mean(model.valid$square_error, na.rm=T)
#calculate rmse - this is very similar to an SD for residuals. You'll often see this terminology used in ML instead of SD to assess model fit
sqrt(mean(model.valid$square_error, na.rm=T))
We’ll come back to this measure later!
LS0tCnRpdGxlOiAiVG9waWMgOC4gTWFjaGluZSBMZWFybmluZyBJLiBSZWdyZXNzaW9uIgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgpXZSdsbCBzdGFydCBvdXQgb3VyIE1hY2hpbmUgTGVhcm5pbmcgdHV0b3JpYWwgdXNpbmcgYSBmYW1pbGlhciBkYXRhc2V0OiB0aGUgT3Bwb3J0dW5pdHkgSW5zaWdodHMgZGF0YSBvbiBjb3VudHktbGV2ZWwgaW5lcXVhbGl0eS4gV2UnbGwgcHJlZGljdCBpbmVxdWFsaXR5IGxldmVscyB1c2luZyB0aGUgZGF0YS4gV2UnbGwgYmVnaW4gb3VyIGpvdXJuZXkgaW50byBtYWNoaW5lIGxlYXJuaW5nIHVzaW5nIGEgZmFtaWxpYXIgbWV0aG9kOiBsaW5lYXIgcmVncmVzc2lvbi4gVGhlIGlkZWFzIGFyZSBsYXJnZWx5IHRoZSBzYW1lIGFzIHRoZSBwcmV2aW91cyB0dXRvcmlhbCwgYnV0IHRoaXMgdGltZSB3ZSdsbCBzdGFydCB0byB0aGluayBhYm91dCByZWdyZXNzaW9uIGZyb20gYSBtYWNoaW5lIGxlYXJuaW5nIHBlcnNwZWN0aXZlLiAKCmBgYHtyfQpkYXRhIDwtIHJlYWQuY3N2KCJjb3VudHlfZGF0YS5jc3YiKQpgYGAKCldlJ2xsIGJ1aWxkIGEgbW9kZWwgcHJlZGljdGluZyBpbmVxdWFsaXR5IGF0IHRoZSBjb3VudHkgbGV2ZWwsIHVzaW5nIHNldmVyYWwgY28tdmFyaWF0ZXMgdG8gbWFrZSBvdXIgcHJlZGljdGlvbiAoaW4gTUwsIGNvdmFyaWF0ZXMgYXJlIG9mdGVuIHJlZmVycmVkIHRvIGFzICJmZWF0dXJlcyIgaW4gdGhlIG1vZGVsKS4gVGhpcyBpcyBjb25zaWRlcmVkICJzdXBlcnZpc2VkIG1hY2hpbmUgbGVhcm5pbmciIGJlY2F1c2Ugd2UgYWxyZWFkeSBrbm93IHRoZSBhY3R1YWwgaW5lcXVhbGl0eSBsZXZlbCBpbiBlYWNoIGNvdW50eS4KCkZpcnN0LCB3ZSdsbCBzZWxlY3Qgb3VyIHZhcmlhYmxlcyBvZiBpbnRlcmVzdC4gVGhlbiwgd2UgbmVlZCB0byBzcGxpdCBvdXIgZGF0YSBpbnRvIHRyYWluaW5nIGFuZCB0ZXN0aW5nIHNldHMuIFdlJ2xsIHVzZSBvdXIgdHJhaW5pbmcgc2V0IHRvIHRyYWluIHRoZSBtb2RlbCwgd2hpbGUgb3VyIHRlc3Rpbmcgc2V0IHdpbGwgYmUgdXNlZCB0byB2YWxpZGF0ZSBvdXIgbW9kZWwuIAoKYGBge3J9CiNzZWxlY3QgdGhlIHZhcmlhYmxlcyB3ZSdsbCB1c2UKbGlicmFyeShkcGx5cikKbW9kZWxfZGF0YSA8LSBkYXRhICU+JSAKICBzZWxlY3QoYygiaW5lcXVhbGl0eSIsICJyYWNpYWxfc2VncmVnYXRpb24iLCAiY3JpbWVfcmF0ZSIsICJ1bmVtcF9yYXRlIiwgInBlcmNlbnRfY29sbGVnZV9kZWdyZWUiKSkgJT4lIAogIG11dGF0ZShjb2xsZWdlID0gcGVyY2VudF9jb2xsZWdlX2RlZ3JlZS8xMDApICU+JSAKICBzZWxlY3QoLSJwZXJjZW50X2NvbGxlZ2VfZGVncmVlIikKCiNDaGVjayB0aGUgbWVhbiBvZiBvdXIgZGVwZW5kZW50IHZhcmlhYmxlIGZvciByZWZlcmVuY2UKbWVhbihtb2RlbF9kYXRhJGluZXF1YWxpdHksIG5hLnJtPVQpCnNkKG1vZGVsX2RhdGEkaW5lcXVhbGl0eSwgbmEucm09VCkKYGBgCgpgYGB7cn0KI3NwbGl0IGludG8gdHJhaW5pbmcgYW5kIHRlc3RpbmcgZGF0YQojc2V0dGluZyB0aGUgc2VlZCBhbGxvd3MgdXMgdG8gcmVjcmVhdGUgdGhpcyBhbmFseXNpcyBsYXRlcgpzZXQuc2VlZCg4Njc1MzA5KQoKI3dlIG5lZWQgdG8gZHJhdyByYW5kb20gc2FtcGxlcyBmb3Igb3VyIHRyYWluaW5nIGFuZCB0ZXN0aW5nIGRhdGEgLSB0aGlzIGNvZGUgZG9lcyB0aGlzIGZvciB1cwojVGhlIHRyYWluaW5nIGRhdGEgd2lsbCBiZSA3NSUgb2YgdGhlIGRhdGEsIGFuZCB0aGUgcmVtYWluaW5nIDI1JSB3aWxsIGJlIHRoZSB0ZXN0aW5nL3ZhbGlkYXRpb24gZGF0YQpzcGxpdCA8LSAwLjc1CnJvd3MgIDwtIG5yb3cobW9kZWxfZGF0YSkKCnRyYWluLmVudHJpZXMgPC0gc2FtcGxlKHJvd3MsIHJvd3Mqc3BsaXQpCgptb2RlbC50cmFpbiA8LSBtb2RlbF9kYXRhW3RyYWluLmVudHJpZXMsIF0KbW9kZWwudmFsaWQgIDwtIG1vZGVsX2RhdGFbLXRyYWluLmVudHJpZXMsICBdCgpgYGAKCmBgYHtyfQptb2RlbCA8LSBsbShpbmVxdWFsaXR5IH4gcmFjaWFsX3NlZ3JlZ2F0aW9uLCBkYXRhPW1vZGVsLnRyYWluKQoKc3VtbWFyeShtb2RlbCkKYGBgCk9rLCBub3cgd2UgaGF2ZSB0aGUgbW9kZWwgYmFzZWQgb24gdGhlIHRlc3QgZGF0YS4gTGV0J3MgdXNlIHRoaXMgbW9kZWwgb24gdGhlIHZhbGlkYXRpb24gZGF0YSwgYW5kIGFzc2VzcyBpdHMgZml0LiAKCmBgYHtyfQptb2RlbC52YWxpZCA8LSBtb2RlbC52YWxpZCAlPiUKICAjdXNlIHRoZSBtb2RlbCB0byBwcmVkaWN0IGluZXF1YWxpdHkgaW4gdGhlIHZhbGlkYXRpb24gZGF0YQogICNzYXZlIHRoZSBwcmVkaWN0ZWQgY29lZmZpY2llbnRzIGZyb20gdGhpcyBtb2RlbAogICAgbXV0YXRlKHloYXQgPSBwcmVkaWN0KG1vZGVsLCBuZXdkYXRhPW1vZGVsLnZhbGlkKSkgJT4lCiAgI2NhbGN1bGF0ZSB0aGUgcmVzaWR1YWwgc28gd2UgY2FuIHBsb3QgaXQKICAgIG11dGF0ZShyZXNpZHVhbCA9IGluZXF1YWxpdHkgLSB5aGF0KQoKI1NhdmUgdGhpcyBmb3IgbGF0ZXIsIHdoZW4gd2UgY29tcGFyZSB0aGUgdHdvIG1vZGVscwptb2RlbC50cmFpbiA8LSBtb2RlbC50cmFpbiAlPiUKICAgIG11dGF0ZSh5aGF0ID0gcHJlZGljdChtb2RlbCwgbmV3ZGF0YT1tb2RlbC50cmFpbikpICU+JQogICAgbXV0YXRlKHJlc2lkdWFsID0gaW5lcXVhbGl0eSAtIHloYXQpCgpoZWFkKG1vZGVsLnZhbGlkKQoKbWVhbihtb2RlbC52YWxpZCRyZXNpZHVhbCwgbmEucm09VCkKYGBgCk5vdywgd2UnbGwgcGxvdCB0aGUgcmVzaWR1YWxzIHRvIHRha2UgYSBsb29rIGF0IHRoZSBmaXQgb2YgdGhlIG1vZGVsOgpgYGB7cn0KbGlicmFyeShnZ3Bsb3QyKQpnZ3Bsb3QobW9kZWwudmFsaWQpICsKICBnZW9tX3BvaW50KGFlcyh4PXloYXQsIHk9cmVzaWR1YWwpKSArCiAgZ2VvbV9obGluZShhZXMoeWludGVyY2VwdD0wKSwgbGluZXR5cGU9ImRhc2hlZCIsIGNvbG9yPSJyZWQiKSArCiAgeGxhYigiUHJlZGljdGVkIEluZXF1YWxpdHkiKSArCiAgeWxhYigiUmVzaWR1YWwgSW5lcXVhbGl0eSIpKwogIHRoZW1lX21pbmltYWwoKSsKICBsYWJzKHRpdGxlID0gIlJlc2lkdWFsIFBsb3QgZm9yIFZhbGlkYXRpb24gTW9kZWwiKQpgYGAKVGhpcyByZXNpZHVhbCBpcyBub3QgcGVyZmVjdCwgYnV0IHRoZSBzcHJlYWQgYXJvdW5kIHplcm8gaXMgcmVsYXRpdmVseSBjb25zaXN0ZW50LCB3aGljaCBpcyBhIGdvb2Qgc2lnbiBmb3IgdXMuIFRoZSBkYXRhIGRvZXMgc2VlbSB0byBiZSBjbHVzdGVyZWQgYXJvdW5kIGxvd2VyIHByZWRpY3RlZCBsZXZlbHMgb2YgaW5lcXVhbGl0eSB0aG91Z2gsIHJhdGhlciB0aGFuIHNwcmVhZCBvdXQgYWNyb3NzIHRoZSBkaXN0cmlidXRpb24uIAoKSXMgdGhpcyBhbiBpc3N1ZT8gTWF5YmUgbm90LiBUaGlzIG1heSBpbmRpY2F0ZSB0aGF0IG9uZSBvZiBvdXIgdmFyaWFibGVzIGlzIHNrZXdlZCwgYW5kIHdlIGNhbiB0cnkgdHJhbnNmb3JtaW5nIChlLmcuIHRha2luZyB0aGUgbG9nYXJpdGhtKSB0aGUgdmFyaWFibGUgYW5kIHNlZSBpZiB0aGF0IGltcHJvdmVzIHRoZSBtb2RlbCBmaXQuIFRoaXMgbWF5IGFsc28gaW5kaWNhdGUgdGhhdCB3ZSdyZSBtaXNzaW5nIGEgdmFyaWFibGUuIAoKTGV0J3MgdGFrZSBhIGNsb3NlciBsb29rIGF0IHRoZSBtb2RlbCBlcnJvciAtIHdoYXQga2luZCBvZiBlcnJvciBhcmUgd2UgbG9va2luZyBhdCBoZXJlPwoKYGBge3J9CmdncGxvdChtb2RlbC52YWxpZCkgKwogIGdlb21faGlzdG9ncmFtKGFlcyh4PXJlc2lkdWFsKSwgYmlucz01MCwgY29sb3I9ImJsYWNrIiwgZmlsbCA9ICJsaWdodGJsdWUiKSArCiAgeGxhYigiUmVzaWR1YWwgKGluZXF1YWxpdHkpIikrCiAgdGhlbWVfbWluaW1hbCgpKwogIGxhYnModGl0bGUgPSAiRGlzdHJpYnV0aW9uIG9mIE1vZGVsIFJlc2lkdWFscyIpCmBgYApUaGVyZSBhcmUgc29tZSBvdXRsaWVycywgYnV0IG1vc3Qgb2YgdGhlIHJlc2lkdWFsIGlzIHJlbGF0aXZlbHkgbm9ybWFsbHkgZGlzdHJpYnV0ZWQgd2l0aCBhIG1lYW4gb2YgMC4gV2UgZ2VuZXJhbGx5IGxpa2UgdG8gc2VlIHJlc2lkdWFscyB0aGF0IGFyZSBub3JtYWxseSBkaXN0cmlidXRlZC4gT24gdGhlIG90aGVyIGhhbmQsIHNvbWUgb2YgdGhlIHZhcmlhYmxlcyBoYXZlIGFuIGVycm9yIGFzIGhpZ2ggYXMgMC4yNSEgRm9yIGEgdmFyaWFibGUgd2l0aCBhIG1lYW4gb2YgMC4zOCwgdGhhdCBpcyBzb21lIHN1YnN0YW50aWFsIGVycm9yLiAKCkZpbmFsbHksIHdlJ2xsIGNoZWNrIGZvciBvdmVyZml0dGluZyB0b2RheS4gVG8gZG8gdGhpcywgd2UnbGwgY29tcGFyZSB0aGUgc3RhbmRhcmQgZGV2aWF0aW9ucyBvZiB0aGUgdHJhaW5pbmcgYW5kIHRlc3Qgc2V0IHJlc2lkdWFscy4gSWYgdGhlIHRlc3QgcmVzaWR1YWxzIHNkIGlzIG11Y2ggbGFyZ2VyIHRoYW4gdGhlIHNkIGZvciB0aGUgdHJhaW5pbmcgZGF0YSwgdGhlbiB3ZSBrbm93IHRoYXQgb3VyIG1vZGVsIGhhcyBiZWVuIG92ZXJmaXQuIAoKYGBge3J9CiNDaGVjayB0aGUgdHJhaW5pbmcgbW9kZWwKc2QobW9kZWwudHJhaW4kcmVzaWR1YWwsIG5hLnJtPVQpCmBgYApgYGB7cn0KI25vdyBjb21wYXJlIHdpdGggdGhlIHZhbGlkYXRpb24gbW9kZWwKc2QobW9kZWwudmFsaWQkcmVzaWR1YWwsIG5hLnJtPVQpCmBgYApUaGVzZSBhcmUgdmVyeSBjbG9zZSEgVGhpcyB0ZWxscyBtZSB0aGF0IG15IG1vZGVsIGlzIGxpa2VseSBub3Qgb3ZlcmZpdHRlZC4gSWYgbXkgU0Qgd2VyZSwgZm9yIGV4YW1wbGUsIDJ4IHRoZSBzaXplIG9mIHRoZSB0cmFpbmluZyBtb2RlbCByZXNpZHVhbHMsIEknZCBiZSBjb25jZXJuZWQuIAoKRmluYWxseSwgdGhlIG1lYW4gc3F1YXJlZCBlcnJvciAobXNlKSBhbmQgcm9vdCBtZWFuIHNxdWFyZWQgZXJyb3IgKHJtc2UpIGFyZSBhZGRpdGlvbmFsIG1lYXN1cmVzIG9mIG1vZGVsIGFjY3VyYWN5IHRoYXQgYXJlIGNvbW1vbiBpbiBtYWNoaW5lIGxlYXJuaW5nOgpgYGB7cn0KbW9kZWwudmFsaWQgPC0gbW9kZWwudmFsaWQgJT4lIAogIG11dGF0ZShzcXVhcmVfZXJyb3IgPSByZXNpZHVhbF4yKQoKI21zZQptZWFuKG1vZGVsLnZhbGlkJHNxdWFyZV9lcnJvciwgbmEucm09VCkKCiNjYWxjdWxhdGUgcm1zZSAtIHRoaXMgaXMgdmVyeSBzaW1pbGFyIHRvIGFuIFNEIGZvciByZXNpZHVhbHMuIFlvdSdsbCBvZnRlbiBzZWUgdGhpcyB0ZXJtaW5vbG9neSB1c2VkIGluIE1MIGluc3RlYWQgb2YgU0QgdG8gYXNzZXNzIG1vZGVsIGZpdApzcXJ0KG1lYW4obW9kZWwudmFsaWQkc3F1YXJlX2Vycm9yLCBuYS5ybT1UKSkKYGBgCldlJ2xsIGNvbWUgYmFjayB0byB0aGlzIG1lYXN1cmUgbGF0ZXIhCg==