knitr::opts_chunk$set(echo = TRUE, eval = TRUE)
In this chapter we introduce the concept of regression from a machine learning point of view. We will present the fundamental regression method: linear regression. We will show how to fit a linear regression model and to make predictions from the model.
From a machine learning perspective, the term regression generally encompasses the prediction of continuous values. Statistically, these predictions are the expected value, or the average value one would observe for the given input values.
Which of the following tasks are regression problems?
Possible Answers
Predict (yes/no) whether a student will pass the final exam, given scores on midterms and homework.
Predict the student’s score (0 - 100) on the final exam, given scores on midterms and homework. [ans]
Predict the student’s final grade (A, B, C, D, F) in the class given scores on midterms and homework (before they’ve taken the final exam).
For the first coding exercise, you’ll create a formula to define a one-variable modeling task, and then fit a linear model to the data. You are given the rates of male and female unemployment in the United States over several years (Source).
The task is to predict the rate of female unemployment from the observed rate of male unemployment. The outcome is female_unemployment, and the input is male_unemployment.
The sign of the variable coefficient tells you whether the outcome increases (+) or decreases (-) as the variable increases.
Recall the calling interface for lm() is:
lm(formula, data = ___)
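As an illustration of this calling interface, here is a minimal sketch that fits a one-variable model and reads the sign of its coefficient. It uses R's built-in cars data set as a stand-in (an assumption made for illustration; cars is not part of this exercise), and the names fmla_cars, cars_model and speed_coef are hypothetical.

```r
# Illustration only: the built-in cars data set stands in for unemployment.
# Outcome: stopping distance (dist); input: speed.
fmla_cars <- dist ~ speed

# Same pattern as lm(formula, data = ___)
cars_model <- lm(fmla_cars, data = cars)

# Extract the coefficient of the input variable and inspect its sign
speed_coef <- coef(cars_model)["speed"]
sign(speed_coef)  # a positive sign means the outcome increases with the input
```

The same pattern applies to the exercise below once the formula names female_unemployment and male_unemployment.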
Instructions
The data frame unemployment is in your workspace.
Define a formula that expresses female_unemployment as a function of male_unemployment. Assign the formula to the variable fmla and print it.
Then use lm() and fmla to fit a linear model to predict female unemployment from male unemployment using the data set unemployment.
Print the model. Is the coefficient for male unemployment consistent with what you would expect? Does female unemployment increase as male unemployment does?
# Load the unemployment data into the workspace
unemployment <- readRDS("unemployment.rds")
summary(unemployment)
## male_unemployment female_unemployment
## Min. :2.900 Min. :4.000
## 1st Qu.:4.900 1st Qu.:4.400
## Median :6.000 Median :5.200
## Mean :5.954 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100
## Max. :9.800 Max. :7.900
# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- female_unemployment ~ male_unemployment
# Print it
fmla
## female_unemployment ~ male_unemployment
# Use the formula to fit a model: unemployment_model
unemployment_model <- lm(fmla, data = unemployment)
# Print it
unemployment_model
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Coefficients:
## (Intercept) male_unemployment
## 1.4341 0.6945
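For a one-variable model, the coefficients that lm() reports can be reproduced by hand: the least-squares slope is cov(x, y) / var(x), and the intercept is mean(y) - slope * mean(x). The sketch below checks this on the built-in cars data set, used here only as an illustrative stand-in because unemployment.rds may not be available to every reader; all variable names are hypothetical.

```r
# Illustration only: cars stands in for the unemployment data.
x <- cars$speed
y <- cars$dist

# Closed-form least-squares estimates for a single input
slope_by_hand <- cov(x, y) / var(x)
intercept_by_hand <- mean(y) - slope_by_hand * mean(x)

# Fit the same model with lm() and compare the two sets of coefficients
cars_model <- lm(dist ~ speed, data = cars)
all.equal(unname(coef(cars_model)), c(intercept_by_hand, slope_by_hand))
```

Applied to the unemployment data, the same two formulas should reproduce the intercept and slope printed above.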
Let’s look at the model unemployment_model that you have just created. There are a variety of different ways to examine a model; each way provides different information. We will use summary(), broom::glance(), and sigr::wrapFTest().
Instructions
The object unemployment_model is in your workspace.
Print unemployment_model again. What information does it report?
Call summary() on unemployment_model. In addition to the coefficient values, you get standard errors on the coefficient estimates, and some goodness-of-fit metrics like R-squared.
Call glance() on the model to see the performance metrics in an orderly data frame. Can you match the information from summary() to the columns of glance()?
Now call wrapFTest() on the model to see the R-squared again.
# Load broom and sigr (install them first with install.packages() if needed)
library(broom)
library(sigr)
# Print unemployment_model
unemployment_model
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Coefficients:
## (Intercept) male_unemployment
## 1.4341 0.6945
# Call summary() on unemployment_model to get more details
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Call glance() on unemployment_model to see the details in a tidier form
glance(unemployment_model)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## * <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.821 0.805 0.580 50.6 1.97e-5 2 -10.3 26.6 28.3
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
# Call wrapFTest() on unemployment_model to see the most relevant details
wrapFTest(unemployment_model)
## [1] "F Test summary: (R2=0.82, F(1,11)=51, p=2e-05)."
Remark: there are several different ways to get diagnostics for your model. Use the one that best suits your needs or preferences.
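To see where the R-squared reported by these functions comes from, it can be recomputed from the residuals as 1 - RSS / TSS, where RSS is the residual sum of squares and TSS is the total sum of squares around the mean outcome. This is a minimal sketch on the built-in cars data set (an illustrative stand-in, not the unemployment data), using base R only.

```r
# Illustration only: cars stands in for the unemployment data.
cars_model <- lm(dist ~ speed, data = cars)

# Residual sum of squares: variation the model leaves unexplained
rss <- sum(residuals(cars_model)^2)

# Total sum of squares: variation of the outcome around its mean
tss <- sum((cars$dist - mean(cars$dist))^2)

rsq_by_hand <- 1 - rss / tss

# Compare with the value stored in the summary() object
all.equal(rsq_by_hand, summary(cars_model)$r.squared)
```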
In this exercise, you will use your unemployment model unemployment_model to make predictions from the unemployment data, and compare predicted female unemployment rates to the actual observed female unemployment rates on the training data, unemployment. You will also use your model to predict on the new data in newrates, which consists of only one observation, where male unemployment is 5%.
The predict() interface for lm models takes the form
predict(model, newdata)
You will use the ggplot2 package to make the plots, so you will add the prediction column to the unemployment data frame. You will plot outcome versus prediction, and compare them to the line that represents perfect predictions (that is when the outcome is equal to the predicted value).
The ggplot2 command to plot a scatterplot of dframe$outcome versus dframe$pred (pred on the x axis, outcome on the y axis), along with a blue line where outcome == pred, is as follows:
ggplot(dframe, aes(x = pred, y = outcome)) + geom_point() +
geom_abline(color = "blue")
Instructions
The objects unemployment, unemployment_model and newrates are in your workspace.
Use predict() to predict female unemployment rates from the unemployment data. Assign it to a new column: prediction.
Use the library() command to load the ggplot2 package.
Use ggplot() to compare the predictions to actual unemployment rates. Put the predictions on the x axis. How close are the results to the line of perfect prediction?
Use the data frame newrates to predict expected female unemployment rate when male unemployment is 5%. Assign the answer to the variable pred and print it.
newrates <- read.csv("new_rates.csv")
# unemployment is in your workspace
summary(unemployment)
## male_unemployment female_unemployment
## Min. :2.900 Min. :4.000
## 1st Qu.:4.900 1st Qu.:4.400
## Median :6.000 Median :5.200
## Mean :5.954 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100
## Max. :9.800 Max. :7.900
# newrates is in your workspace
newrates
## X male_unemployment
## 1 1 5
# Predict female unemployment in the unemployment data set
unemployment$prediction <- predict(unemployment_model)
# load the ggplot2 package
library(ggplot2)
# Make a plot to compare predictions to actual (prediction on x axis)
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) +
geom_point() +
geom_abline(color = "blue")
# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newdata = newrates)
# Print it
pred
## 1
## 4.906757
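The prediction for the single row in newrates can be checked by hand from the coefficients printed by summary(unemployment_model): the model predicts intercept + slope * male_unemployment. The sketch below copies the rounded coefficients from that output, so it agrees with predict() only up to rounding; the variable names are hypothetical.

```r
# Coefficients copied (rounded) from the summary() output shown earlier
intercept <- 1.43411
slope <- 0.69453

# Male unemployment rate in newrates
male_rate <- 5

# Linear model prediction: intercept + slope * input
pred_by_hand <- intercept + slope * male_rate
pred_by_hand
```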
In this exercise, you will work with the blood pressure dataset (Source), and model blood_pressure as a function of weight and age.
Instructions
The data frame bloodpressure is in the workspace.
Define a formula that expresses blood_pressure explicitly as a function of age and weight. Assign the formula to the variable fmla and print it.
Use fmla to fit a linear model to predict blood_pressure from age and weight in the data set bloodpressure. Call the model bloodpressure_model.
Print the model and call summary() on it. Does blood pressure increase or decrease with age? With weight?
# bloodpressure is in the workspace
bloodpressure <- readRDS("bloodpressure.rds")
summary(bloodpressure)
## blood_pressure age weight
## Min. :128.0 Min. :46.00 Min. :167
## 1st Qu.:140.0 1st Qu.:56.50 1st Qu.:186
## Median :153.0 Median :64.00 Median :194
## Mean :150.1 Mean :62.45 Mean :195
## 3rd Qu.:160.5 3rd Qu.:69.50 3rd Qu.:209
## Max. :168.0 Max. :74.00 Max. :220
# Create the formula and print it
fmla <- blood_pressure ~ age + weight
fmla
## blood_pressure ~ age + weight
# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data = bloodpressure)
# Print bloodpressure_model and call summary()
bloodpressure_model
##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Coefficients:
## (Intercept) age weight
## 30.9941 0.8614 0.3349
summary(bloodpressure_model)
##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4640 -1.1949 -0.4078 1.8511 2.6981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.9941 11.9438 2.595 0.03186 *
## age 0.8614 0.2482 3.470 0.00844 **
## weight 0.3349 0.1307 2.563 0.03351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared: 0.9768, Adjusted R-squared: 0.9711
## F-statistic: 168.8 on 2 and 8 DF, p-value: 2.874e-07
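With two inputs, each coefficient describes the change in the predicted outcome when that input rises by one unit and the other input is held fixed. The sketch below illustrates this with the rounded coefficients printed above and a hypothetical person (age 60, weight 200, values chosen only for illustration and lying inside the ranges shown by summary(bloodpressure)); predict_bp is a hypothetical helper, not part of the exercise.

```r
# Coefficients copied (rounded) from the summary() output above
b0 <- 30.9941
b_age <- 0.8614
b_weight <- 0.3349

# Hypothetical helper: linear prediction from the three coefficients
predict_bp <- function(age, weight) {
  b0 + b_age * age + b_weight * weight
}

# Hypothetical person, then the same person one year older
bp_60 <- predict_bp(age = 60, weight = 200)
bp_61 <- predict_bp(age = 61, weight = 200)

# The difference equals the age coefficient, since weight is held fixed
bp_61 - bp_60
```

Both coefficients are positive, so the predicted blood pressure rises with age and with weight.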
Now you will make predictions using the blood pressure model bloodpressure_model that you fit in the previous exercise.
You will also compare the predictions to outcomes graphically. ggplot2 is already loaded in your workspace. Recall the plot command takes the form:
ggplot(dframe, aes(x = pred, y = outcome)) + geom_point() + geom_abline(color = "blue")
Instructions
The objects bloodpressure and bloodpressure_model are in the workspace.
Use predict() to predict blood pressure in the bloodpressure dataset. Assign the predictions to the column prediction.
Graphically compare the predictions to actual blood pressures. Put predictions on the x axis. How close are the results to the line of perfect prediction?
# bloodpressure is in your workspace
summary(bloodpressure)
## blood_pressure age weight
## Min. :128.0 Min. :46.00 Min. :167
## 1st Qu.:140.0 1st Qu.:56.50 1st Qu.:186
## Median :153.0 Median :64.00 Median :194
## Mean :150.1 Mean :62.45 Mean :195
## 3rd Qu.:160.5 3rd Qu.:69.50 3rd Qu.:209
## Max. :168.0 Max. :74.00 Max. :220
# bloodpressure_model is in your workspace
bloodpressure_model
##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Coefficients:
## (Intercept) age weight
## 30.9941 0.8614 0.3349
# Predict blood pressure using bloodpressure_model: prediction
bloodpressure$prediction <- predict(bloodpressure_model)
# Plot the results
ggplot(bloodpressure, aes(x = prediction, y = blood_pressure)) +
geom_point() +
geom_abline(color = "blue")
Now that we have learned how to fit basic linear regression models, we will learn how to evaluate how well our models perform. We will review evaluating a model graphically, and look at two basic metrics for regression models. We will also learn how to train a model that will perform well in the wild, not just on training data. Although we will demonstrate these techniques using linear regression, all these concepts apply to models fit with any regression algorithm.
In this exercise you will graphically evaluate the unemployment model, unemployment_model, that you fit to the unemployment data in the previous chapter. Recall that the model predicts female_unemployment from male_unemployment.
You will plot the model’s predictions against the actual female_unemployment; recall the command is of the form
ggplot(dframe, aes(x = pred, y = outcome)) + geom_point() +
geom_abline()
Then you will calculate the residuals:
residuals <- actual outcome - predicted outcome
and plot predictions against residuals. The residual graph will take a slightly different form: you compare the residuals to the horizontal line y = 0 (using geom_hline()) rather than to the line x = y. The command will be provided.
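The residual definition above can be sketched in base R as follows. The example uses the built-in cars data set as an illustrative stand-in for unemployment, and confirms that residuals computed by hand match the ones lm() stores and that they average to zero for a model with an intercept.

```r
# Illustration only: cars stands in for the unemployment data.
cars_model <- lm(dist ~ speed, data = cars)

# Work on a copy so the built-in data set is left untouched
cars_eval <- cars
cars_eval$predictions <- predict(cars_model)

# residuals <- actual outcome - predicted outcome
cars_eval$residuals <- cars_eval$dist - cars_eval$predictions

# Same values as the residuals stored in the model object
all.equal(unname(cars_eval$residuals), unname(residuals(cars_model)))

# For a least-squares fit with an intercept, the residuals average to zero
mean(cars_eval$residuals)
```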
Instructions
The data frame unemployment and model unemployment_model are available in the workspace.
Use predict() to get the model predictions and add them to unemployment as the column predictions.
Plot predictions (on the x-axis) versus actual female unemployment rates. Are the predictions near the x=y line?
Calculate the residuals between the predictions and actual unemployment rates. Add these residuals to unemployment as the column residuals.
Fill in the blanks to plot predictions (on the x-axis) versus residuals (on the y-axis). This gives you a different view of the model’s predictions as compared to ground truth.
# unemployment is in the workspace
summary(unemployment)
## male_unemployment female_unemployment prediction
## Min. :2.900 Min. :4.000 Min. :3.448
## 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837
## Median :6.000 Median :5.200 Median :5.601
## Mean :5.954 Mean :5.569 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087
## Max. :9.800 Max. :7.900 Max. :8.240
# unemployment_model is in the workspace
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Make predictions from the model
unemployment$predictions <- predict(unemployment_model)
# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment)) +
geom_point() +
geom_abline()
# Calculate residuals
unemployment$residuals <- unemployment$female_unemployment - unemployment$predictions
# Fill in the blanks to plot predictions (on x-axis) versus the residuals
ggplot(unemployment, aes(x = predictions, y = residuals)) +
geom_pointrange(aes(ymin = 0, ymax = residuals)) +
geom_hline(yintercept = 0, linetype = 3) +
ggtitle("residuals vs. linear model prediction")
In the previous exercise you made predictions about female_unemployment and visualized the predictions and the residuals. Now, you will also plot the gain curve of the unemployment_model’s predictions against actual female_unemployment using the WVPlots::GainCurvePlot() function.
For situations where order is more important than exact values, the gain curve helps you check if the model’s predictions sort in the same order as the true outcome.
Calls to the function GainCurvePlot() look like:
GainCurvePlot(frame, xvar, truthvar, title)
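where frame is a data frame, xvar and truthvar are strings naming the prediction and actual outcome columns of frame, and title is the title of the plot.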
When the predictions sort in exactly the same order, the relative Gini coefficient is 1. When the model sorts poorly, the relative Gini coefficient is close to zero, or even negative.
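To make the relative Gini coefficient more concrete, here is a rough base R sketch of one way to compute it: take the area between the model's gain curve and the diagonal, and divide it by the same area for a curve sorted by the true outcome. This is an illustration under stated assumptions (positive outcomes, trapezoid-rule areas), not the code WVPlots uses internally; gain_area, relative_gini and the toy vectors are hypothetical.

```r
# Area between a gain curve and the diagonal, using the trapezoid rule.
# score decides the sort order; truth is the (positive) actual outcome.
gain_area <- function(score, truth) {
  ord <- order(score, decreasing = TRUE)
  cum_gain <- c(0, cumsum(truth[ord]) / sum(truth))  # cumulative share of outcome
  frac <- seq(0, 1, length.out = length(cum_gain))   # share of rows considered
  area_under_curve <- sum(diff(frac) * (head(cum_gain, -1) + tail(cum_gain, -1)) / 2)
  area_under_curve - 0.5                             # subtract area under the diagonal
}

# Ratio of the model's area to the area obtained by sorting on the truth itself
relative_gini <- function(pred, truth) {
  gain_area(pred, truth) / gain_area(truth, truth)
}

# Toy data: pred_good sorts exactly like truth, pred_bad sorts in reverse
truth <- c(4, 5, 7, 6, 8)
pred_good <- c(1, 2, 4, 3, 5)
pred_bad <- -truth

relative_gini(pred_good, truth)
relative_gini(pred_bad, truth)
```

In this sketch, predictions that sort in the same order as the outcome give a ratio of 1, and predictions that sort in reverse give a negative ratio.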
Instructions
The data frame unemployment and the model unemployment_model are in the workspace.
Load the package WVPlots.
Plot the gain curve. Give the plot the title “Unemployment model”. Do the model’s predictions sort correctly?
# unemployment is in the workspace (with predictions)
summary(unemployment)
## male_unemployment female_unemployment prediction predictions
## Min. :2.900 Min. :4.000 Min. :3.448 Min. :3.448
## 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837 1st Qu.:4.837
## Median :6.000 Median :5.200 Median :5.601 Median :5.601
## Mean :5.954 Mean :5.569 Mean :5.569 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087 3rd Qu.:6.087
## Max. :9.800 Max. :7.900 Max. :8.240 Max. :8.240
## residuals
## Min. :-0.77621
## 1st Qu.:-0.34050
## Median :-0.09004
## Mean : 0.00000
## 3rd Qu.: 0.27911
## Max. : 1.31254
# unemployment_model is in the workspace
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Load the package WVPlots (install it first with install.packages("WVPlots") if needed)
library(WVPlots)
# Plot the Gain Curve
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")