Course overview
Introduction to modeling: theory and terminology
Regression: modeling where the outcome variable \(y\) is numerical
General modeling framework formula
\(y = f(\vec{x}) + \epsilon\)
Where:
- \(y\) is the outcome variable of interest
- \(\vec{x}\) is a vector of explanatory/predictor variables
- \(f()\) is the systematic relationship between \(\vec{x}\) and \(y\)
- \(\epsilon\) is the unsystematic error term
Two modeling scenarios
Modeling for either:
- Explanation: quantifying the relationship between the outcome variable \(y\) and the explanatory variables \(\vec{x}\)
- Prediction: predicting the outcome variable \(y\) as accurately as possible for new observations
Exploratory data analysis
Three basic steps to exploratory data analysis (EDA):
- Looking at the raw data values
- Computing summary statistics
- Creating visualizations
Exploratory visualization of age
Let’s perform an exploratory data analysis (EDA) of the numerical explanatory variable age. You should always perform an exploratory analysis of your variables before any formal modeling. This will give you a sense of your variable’s distribution, any outliers, and any patterns that might be useful when constructing your eventual model.
# Load packages
library(moderndive)
library(ggplot2)
# Plot the histogram
ggplot(evals, aes(x = age)) +
  geom_histogram(binwidth = 5) +
  labs(x = "age", y = "count")
Nice! The 463 instructors’ ages are centered roughly at age 50. You’ll now compute notions of center numerically!
Numerical summaries of age
Let’s continue our exploratory data analysis of the numerical explanatory variable age by computing summary statistics. Summary statistics take many values and summarize them with a single value. Let’s compute three such values using dplyr data wrangling: mean (AKA the average), the median (the middle value), and the standard deviation (a measure of spread/variation).
# Load packages
library(moderndive)
library(dplyr)
# Compute summary stats
evals %>%
  summarize(mean_age = mean(age),
            median_age = median(age),
            sd_age = sd(age))
## # A tibble: 1 x 3
## mean_age median_age sd_age
## <dbl> <int> <dbl>
## 1 48.4 48 9.80
Great! As suggested by the previous histogram of age, the center of the distribution, as quantified by the mean and median, is around 48 years old!
Comparing before and after log10-transformation
Exploratory visualization of house size
Let’s create an exploratory visualization of the predictor variable reflecting the size of houses: sqft_living, the square footage of living space (1 sq. foot ≈ 0.1 sq. meter).
After plotting the histogram, what can you say about the distribution of the variable sqft_living?
# Load packages
library(moderndive)
library(ggplot2)
# Plot the histogram
ggplot(house_prices, aes(x = sqft_living)) +
  geom_histogram() +
  labs(x = "Size (sq.feet)", y = "count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A variable is right-skewed if it has a long tail to the right.
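As a quick numerical companion to the histogram (a sketch, not part of the original exercise): for a right-skewed variable, the long tail typically pulls the mean above the median, which you can verify directly.
# Load packages
library(moderndive)
library(dplyr)
# For right-skewed data, the mean typically exceeds the median
house_prices %>%
  summarize(mean_size = mean(sqft_living), median_size = median(sqft_living))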
Log10 transformation of house size
You just saw that the predictor variable sqft_living is right-skewed and hence a log base 10 transformation is warranted to unskew it. Just as we transformed the outcome variable price to create log10_price in the video, let’s do the same for sqft_living.
# Load packages
library(moderndive)
library(dplyr)
library(ggplot2)
# Add log10_size
house_prices_2 <- house_prices %>%
  mutate(log10_size = log10(sqft_living))
# Plot the histogram
ggplot(house_prices_2, aes(x = log10_size)) +
  geom_histogram() +
  labs(x = "log10 size", y = "count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Huzzah! Notice how the distribution is much less skewed. Going forward, you’ll use this new transformed variable to represent the size of houses.
EDA of relationship of teaching & “beauty” scores
Researchers at UT Austin created a “beauty score” by asking a panel of 6 students to rate the “beauty” of all 463 instructors. They were interested in studying any possible impact of “beauty” on teaching evaluation scores. Let’s do an EDA of this variable and its relationship with teaching score.
From now on, assume that ggplot2, dplyr, and moderndive are all available in your workspace unless you’re told otherwise.
# Plot the histogram
ggplot(evals, aes(x = bty_avg)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Beauty score", y = "count")
# Scatterplot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point() +
  labs(x = "beauty score", y = "teaching score")
# Jitter plot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_jitter() +
  labs(x = "beauty score", y = "teaching score")
“Beauty”-ful! It seems the original scatterplot did suffer from overplotting, since the jittered scatterplot reveals many originally hidden points. Most bty_avg scores range from 2 to 8, with 5 being about the center.
Correlation between teaching and “beauty” scores
Let’s numerically summarize the relationship between teaching score and beauty score bty_avg using the correlation coefficient. Based on this, what can you say about the relationship between these two variables?
# Compute correlation
evals %>%
  summarize(correlation = cor(score, bty_avg))
## # A tibble: 1 x 1
## correlation
## <dbl>
## 1 0.187
While there seems to be a positive relationship, +0.187 is still a long way from +1, so the correlation is only weakly positive.
Difference between explanation and prediction
EDA of relationship of house price and waterfront
Let’s now perform an exploratory data analysis of the relationship between log10_price, the log base 10 house price, and the binary variable waterfront. Let’s look at the raw values of waterfront and then visualize their relationship.
The column log10_price has been added for you in the house_prices dataset.
house_prices_2 <- house_prices %>%
  mutate(log10_price = log10(price), log10_size = log10(sqft_living))
house_prices <- house_prices_2
# View the structure of log10_price and waterfront
house_prices %>%
  select(log10_price, waterfront) %>%
  glimpse()
## Rows: 21,613
## Columns: 2
## $ log10_price <dbl> 5.346157, 5.730782, 5.255273, 5.781037, 5.707570, 6.088...
## $ waterfront <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
# Plot
ggplot(house_prices, aes(x = waterfront, y = log10_price)) +
  geom_boxplot() +
  labs(x = "waterfront", y = "log10 price")
A+! Look at that boxplot! Houses that have a view of the waterfront tend to be MUCH more expensive as evidenced by the much higher log10 prices!
Predicting house price with waterfront
You just saw that houses with a view of the waterfront tend to be much more expensive. But by how much? Let’s compute group means of log10_price, convert them back to dollar units, and compare!
The variable log10_price has already been added to house_prices for you.
# Calculate stats
house_prices %>%
  group_by(waterfront) %>%
  summarize(mean_log10_price = mean(log10_price), n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
## waterfront mean_log10_price n
## <lgl> <dbl> <int>
## 1 FALSE 5.66 21450
## 2 TRUE 6.12 163
# Prediction of price for houses with view
10^(6.12)
## [1] 1318257
# Prediction of price for houses without view
10^(5.66)
## [1] 457088.2
100%! Most houses don’t have a view of the waterfront (n = 21,450), but those that do (n = 163) have a MUCH higher predicted price. Look at that difference! $457,088 versus $1,318,257! In the upcoming Chapter 2 on basic regression, we’ll build on such intuition and construct our first formal explanatory and predictive models using basic regression!
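As a side note, here is a sketch of how the same back-transformed group means could be computed in a single pipeline (the column name mean_price is my own):
# Compute group means and back-transform to dollars in one pipeline
house_prices %>%
  group_by(waterfront) %>%
  summarize(mean_log10_price = mean(log10_price)) %>%
  mutate(mean_price = 10^mean_log10_price)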
Plotting a “best-fitting” regression line
Previously you visualized the relationship of teaching score and “beauty score” via a scatterplot. Now let’s add the “best-fitting” regression line to provide a sense of any overall trends. Even though you know this plot suffers from overplotting, you’ll stick to the non-jitter version.
# Load packages
library(ggplot2)
library(dplyr)
library(moderndive)
# Plot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point() +
  labs(x = "beauty score", y = "score") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Fantastic! The overall trend seems to be positive: instructors with higher “beauty” scores also tend to have higher teaching scores.
Fitting a regression with a numerical x
Let’s now explicitly quantify the linear relationship between score and bty_avg using linear regression. You will do this by first “fitting” the model. Then you will get the regression table, a standard output in many statistical software packages. Finally, based on the output of get_regression_table(), which interpretation of the slope coefficient is correct?
# Load package
library(moderndive)
# Fit model
model_score_2 <- lm(score ~ bty_avg, data = evals)
# Output content
model_score_2
##
## Call:
## lm(formula = score ~ bty_avg, data = evals)
##
## Coefficients:
## (Intercept) bty_avg
## 3.88034 0.06664
# Output regression table
get_regression_table(model_score_2, print = TRUE)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 3.880 | 0.076 | 50.961 | 0 | 3.731 | 4.030 |
| bty_avg | 0.067 | 0.016 | 4.090 | 0 | 0.035 | 0.099 |
As suggested by your exploratory visualization, there is a positive relationship between “beauty” and teaching score.
Making predictions using “beauty score”
Say there is an instructor at UT Austin and you know nothing about them except that their beauty score is 5. What is your prediction \(\hat{y}\) of their teaching score \(y\)?
get_regression_table(model_score_2)
## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    3.88      0.076     51.0        0    3.73     4.03
## 2 bty_avg      0.067     0.016      4.09       0    0.035    0.099
# Use fitted intercept and slope to get a prediction
y_hat <- 3.88 + 5 * 0.0670
y_hat
## [1] 4.215
# Compute residual y - y_hat
4.7 - 4.215
## [1] 0.485
Awesome! Was your visual guess close to the predicted teaching score of 4.215? Also, note that this prediction is off by about 0.485 units in teaching score.
Computing fitted/predicted values & residuals
Now say you want to repeat this for all 463 instructors in evals. Doing this manually as you just did would be long and tedious, so as seen in the video, let’s automate this using the get_regression_points() function.
Furthermore, let’s unpack its output.
# Fit regression model
model_score_2 <- lm(score ~ bty_avg, data = evals)
# Get regression table
get_regression_table(model_score_2)
## # A tibble: 2 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 3.88 0.076 51.0 0 3.73 4.03
## 2 bty_avg 0.067 0.016 4.09 0 0.035 0.099
# Get all fitted/predicted values and residuals
get_regression_points(model_score_2)
## # A tibble: 463 x 5
## ID score bty_avg score_hat residual
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 4.7 5 4.21 0.486
## 2 2 4.1 5 4.21 -0.114
## 3 3 3.9 5 4.21 -0.314
## 4 4 4.8 5 4.21 0.586
## 5 5 4.6 3 4.08 0.52
## 6 6 4.3 3 4.08 0.22
## 7 7 2.8 3 4.08 -1.28
## 8 8 4.1 3.33 4.10 -0.002
## 9 9 3.4 3.33 4.10 -0.702
## 10 10 4.5 3.17 4.09 0.409
## # ... with 453 more rows
# Get all fitted/predicted values and residuals
get_regression_points(model_score_2) %>%
  mutate(score_hat_2 = 3.88 + 0.067 * bty_avg)
## # A tibble: 463 x 6
## ID score bty_avg score_hat residual score_hat_2
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 4.7 5 4.21 0.486 4.22
## 2 2 4.1 5 4.21 -0.114 4.22
## 3 3 3.9 5 4.21 -0.314 4.22
## 4 4 4.8 5 4.21 0.586 4.22
## 5 5 4.6 3 4.08 0.52 4.08
## 6 6 4.3 3 4.08 0.22 4.08
## 7 7 2.8 3 4.08 -1.28 4.08
## 8 8 4.1 3.33 4.10 -0.002 4.10
## 9 9 3.4 3.33 4.10 -0.702 4.10
## 10 10 4.5 3.17 4.09 0.409 4.09
## # ... with 453 more rows
# Get all fitted/predicted values and residuals
get_regression_points(model_score_2) %>%
  mutate(residual_2 = score - score_hat)
## # A tibble: 463 x 6
## ID score bty_avg score_hat residual residual_2
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 4.7 5 4.21 0.486 0.486
## 2 2 4.1 5 4.21 -0.114 -0.114
## 3 3 3.9 5 4.21 -0.314 -0.314
## 4 4 4.8 5 4.21 0.586 0.586
## 5 5 4.6 3 4.08 0.52 0.520
## 6 6 4.3 3 4.08 0.22 0.220
## 7 7 2.8 3 4.08 -1.28 -1.28
## 8 8 4.1 3.33 4.10 -0.002 -0.002
## 9 9 3.4 3.33 4.10 -0.702 -0.702
## 10 10 4.5 3.17 4.09 0.409 0.409
## # ... with 453 more rows
Bingo! You’ll see later that the residuals can provide useful information about the quality of your regression models. Stay tuned!
EDA of relationship of score and rank
Let’s perform an EDA of the relationship between an instructor’s score and their rank in the evals dataset. You’ll both visualize this relationship and compute summary statistics for each level of rank: teaching, tenure track, and tenured.
ggplot(evals, aes(x = rank, y = score)) +
  geom_boxplot() +
  labs(x = "rank", y = "score")
evals %>%
  group_by(rank) %>%
  summarize(n = n(), mean_score = mean(score), sd_score = sd(score))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 4
## rank n mean_score sd_score
## <fct> <int> <dbl> <dbl>
## 1 teaching 102 4.28 0.498
## 2 tenure track 108 4.15 0.561
## 3 tenured 253 4.14 0.550
Cool! The boxplot and summary statistics suggest that instructors with rank “teaching” get the highest scores, while tenured professors get the lowest. However, there is clearly variation around the respective means.
Fitting a regression with a categorical x
You’ll now fit a regression model with the categorical variable rank as the explanatory variable and interpret the values in the resulting regression table. Note that here the rank “teaching” is treated as the baseline group for comparison with the “tenure track” and “tenured” groups.
# Fit regression model
model_score_4 <- lm(score ~ rank, data = evals)
# Get regression table
get_regression_table(model_score_4)
## # A tibble: 3 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 4.28 0.054 79.9 0 4.18 4.39
## 2 ranktenure track -0.13 0.075 -1.73 0.084 -0.277 0.017
## 3 ranktenured -0.145 0.064 -2.28 0.023 -0.27 -0.02
# teaching mean
teaching_mean <- 4.28
# tenure track mean
tenure_track_mean <- 4.28 - 0.13
# tenured mean
tenured_mean <- 4.28 - 0.145
Kudos! Remember that regressions with a categorical variable return group means expressed relative to a baseline for comparison!
Regression tables for categorical explanatory variables show differences in means relative to a baseline.
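To tie this back to the earlier EDA: the baseline-plus-offset values should reproduce, up to rounding, the group means you computed with group_by() and summarize(). A quick sketch to check:
# Compare baseline + offsets against the EDA group means
# (the EDA gave mean scores of 4.28, 4.15, and 4.14, respectively)
c(teaching = teaching_mean,
  tenure_track = tenure_track_mean,
  tenured = tenured_mean)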
Visualizing the distribution of residuals
Let’s now compute both the predicted score \(\hat{y}\) and the residual \(y-\hat{y}\) for all \(n = 463\) instructors in the evals dataset. Furthermore, you’ll plot a histogram of the residuals and see if there are any patterns to the residuals, i.e. your predictive errors.
model_score_4 from the previous exercise is available in your workspace.
# Calculate predictions and residuals
model_score_4_points <- get_regression_points(model_score_4)
model_score_4_points
## # A tibble: 463 x 5
## ID score rank score_hat residual
## <int> <dbl> <fct> <dbl> <dbl>
## 1 1 4.7 tenure track 4.16 0.545
## 2 2 4.1 tenure track 4.16 -0.055
## 3 3 3.9 tenure track 4.16 -0.255
## 4 4 4.8 tenure track 4.16 0.645
## 5 5 4.6 tenured 4.14 0.461
## 6 6 4.3 tenured 4.14 0.161
## 7 7 2.8 tenured 4.14 -1.34
## 8 8 4.1 tenured 4.14 -0.039
## 9 9 3.4 tenured 4.14 -0.739
## 10 10 4.5 tenured 4.14 0.361
## # ... with 453 more rows
# Plot residuals
ggplot(model_score_4_points, aes(x = residual)) +
  geom_histogram() +
  labs(x = "residuals", title = "Residuals from score ~ rank model")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Congrats! Look at the distribution of the residuals. While there seem to be fewer negative residuals (i.e., overpredictions of score), their magnitudes tend to be larger, ranging all the way down to -2.
EDA of relationship of price and bedrooms
Unfortunately, making 3D scatterplots to perform an EDA is beyond the scope of this course. So instead let’s focus on making standard 2D scatterplots of the relationship between price and the number of bedrooms, keeping an eye out for outliers.
The log10 transformations have been made for you and are saved in house_prices.
# Create scatterplot with regression line
ggplot(house_prices, aes(x = bedrooms, y = log10_price)) +
  geom_point() +
  labs(x = "Number of bedrooms", y = "log10 price") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Remove outlier
house_prices_transform <- house_prices %>%
  filter(bedrooms < 20)
# Create scatterplot with regression line
ggplot(house_prices_transform, aes(x = bedrooms, y = log10_price)) +
  geom_point() +
  labs(x = "Number of bedrooms", y = "log10 price") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Excellent! Another important reason to perform EDA is to discard any potential outliers that are likely data entry errors. In our case, after removing an outlier, you can see a clear positive relationship between the number of bedrooms and price, as one would expect.
Fitting a regression
house_prices, which is available in your environment, has the log base 10 transformed variables included and the outlier house with 33 bedrooms removed. Let’s fit a multiple regression model of price as a function of size and the number of bedrooms and generate the regression table. You will first fit the model; then, based on the regression table, you will answer the following question:
Which of these interpretations of the slope coefficient for bedrooms is correct?
# Fit model
model_price_2 <- lm(log10_price ~ log10_size + bedrooms,
                    data = house_prices)
# Get regression table
get_regression_table(model_price_2)
## # A tibble: 3 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 2.72 0.023 118. 0 2.67 2.76
## 2 log10_size 0.931 0.008 118. 0 0.916 0.947
## 3 bedrooms -0.03 0.002 -19.3 0 -0.033 -0.027
Accounting for log10_size, every extra bedroom is associated with a decrease of, on average, 0.03 units in log10_price.
Splendid! In this multiple regression setting, the associated effect of any variable must be viewed in light of the other variables in the model. In our case, accounting for the size of the house reverses the relationship of the number of bedrooms and price from positive to negative!
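To see the sign flip concretely, here is a sketch (the model name model_price_bedrooms is my own) fitting bedrooms as the sole predictor; consistent with the upward-sloping scatterplot above, its slope should be positive:
# With bedrooms alone, the slope is positive,
# matching the earlier scatterplot's upward trend
model_price_bedrooms <- lm(log10_price ~ bedrooms, data = house_prices)
coef(model_price_bedrooms)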
Making predictions using size and bedrooms
Say you want to predict the price of a house using this model, and you know it has:
- 1000 square feet of living space (so log10_size = 3)
- 3 bedrooms
What is your prediction, both in log10 dollars and then in dollars?
The regression model from the previous exercise is available in your workspace as model_price_2.
get_regression_table(model_price_2)
## # A tibble: 3 x 7
##   term       estimate std_error statistic p_value lower_ci upper_ci
##   <chr>         <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept     2.69      0.023      116.       0    2.65     2.74
## 2 log10_size    0.941     0.008      118.       0    0.925    0.957
## 3 bedrooms     -0.033     0.002      -20.5      0   -0.036   -0.03
# Make prediction in log10 dollars
2.69 + 0.941 * log10(1000) - 0.033 * 3
## [1] 5.414
# Make prediction in dollars
10^5.414
## [1] 259417.9
Spot on! Using the values in the regression table you can make predictions of house prices! In this case, your prediction is about $260,000. Let’s now apply this procedure to all 21k houses!
Interpreting residuals
Let’s automate this process for all 21K rows in house_prices to obtain residuals, which you’ll use to compute the sum of squared residuals: a measure of the lack of fit of a model. After computing the sum of squared residuals, you will answer the following question:
Which of these statements about residuals is correct?
# Automate prediction and residual computation
get_regression_points(model_price_2)
## # A tibble: 21,613 x 6
## ID log10_price log10_size bedrooms log10_price_hat residual
## <int> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1 5.35 3.07 3 5.48 -0.139
## 2 2 5.73 3.41 3 5.8 -0.069
## 3 3 5.26 2.89 2 5.34 -0.087
## 4 4 5.78 3.29 4 5.66 0.121
## 5 5 5.71 3.22 3 5.63 0.08
## 6 6 6.09 3.73 4 6.07 0.017
## 7 7 5.41 3.23 3 5.64 -0.225
## 8 8 5.46 3.02 3 5.44 0.024
## 9 9 5.36 3.25 3 5.65 -0.290
## 10 10 5.51 3.28 3 5.68 -0.166
## # ... with 21,603 more rows
# Automate prediction and residual computation
get_regression_points(model_price_2) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(sum_sq_residuals = sum(sq_residuals))
## # A tibble: 1 x 1
## sum_sq_residuals
## <dbl>
## 1 605.
Residual does suggest ‘leftovers’, but not in the sense that they are leftover points.
Parallel slopes model
Let’s now fit a “parallel slopes” model with the numerical explanatory/predictor variable log10_size and the categorical, in this case binary, variable waterfront. The corresponding visualization shows two regression lines with different intercepts but the same slope; a sketch to recreate it follows.
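A minimal sketch to recreate that figure, assuming a recent version of moderndive that provides geom_parallel_slopes():
# Sketch of the parallel slopes visualization
library(ggplot2)
library(moderndive)
ggplot(house_prices, aes(x = log10_size, y = log10_price, color = waterfront)) +
  geom_point(alpha = 0.05) +
  geom_parallel_slopes(se = FALSE) +
  labs(x = "log10 size", y = "log10 price")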
# Fit model
model_price_4 <- lm(log10_price ~ log10_size + waterfront,
                    data = house_prices)
# Get regression table
get_regression_table(model_price_4)
## # A tibble: 3 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 2.96 0.02 146. 0 2.92 3.00
## 2 log10_size 0.825 0.006 134. 0 0.813 0.837
## 3 waterfrontTRUE 0.322 0.013 24.5 0 0.296 0.348
Success! Notice how the regression table has three rows: intercept, the slope for log10_size, and an offset for houses that do have a waterfront.
0.322 is the offset in the intercept for houses with a view of the waterfront relative to those that don’t.
Making predictions using size and waterfront
Using your model for log10_price as a function of log10_size and the binary variable waterfront, let’s make some predictions! Say you have the following two “new” houses; what would you predict their prices to be in dollars?
- House A: log10_size = 2.9, with a view of the waterfront
- House B: log10_size = 3.1, without a view of the waterfront
# Get regression table
get_regression_table(model_price_4)
## # A tibble: 3 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 2.96 0.02 146. 0 2.92 3.00
## 2 log10_size 0.825 0.006 134. 0 0.813 0.837
## 3 waterfrontTRUE 0.322 0.013 24.5 0 0.296 0.348
# Prediction for House A
10^(2.96 + 0.825 * 2.9 + 0.322)
## [1] 472606.8
# Prediction for House B
10^(2.96 + 0.825 * 3.1)
## [1] 329230.5
Yay! Your modeling toolbox is getting quite extensive! Let’s now automate this!
Automating predictions on “new” houses
Let’s now repeat what you did in the last exercise, but in an automated fashion, assuming the information on these “new” houses is saved in a data frame.
Your model for log10_price as a function of log10_size and the binary variable waterfront (model_price_4) is available in your workspace, and so is new_houses_2, a data frame with data on 2 new houses. While not so beneficial with only 2 “new” houses, this approach would save a lot of work if you had 2,000 of them.
new_houses_2 <- tibble(log10_size = c(2.9, 3.1), waterfront = c(TRUE, FALSE))
# View the "new" houses
new_houses_2
## # A tibble: 2 x 2
## log10_size waterfront
## <dbl> <lgl>
## 1 2.9 TRUE
## 2 3.1 FALSE
# Get predictions on "new" houses
get_regression_points(model_price_4, newdata = new_houses_2)
## # A tibble: 2 x 4
## ID log10_size waterfront log10_price_hat
## <int> <dbl> <lgl> <dbl>
## 1 1 2.9 TRUE 5.67
## 2 2 3.1 FALSE 5.52
# Get predictions price_hat in dollars on "new" houses
get_regression_points(model_price_4, newdata = new_houses_2) %>%
  mutate(price_hat = 10^log10_price_hat)
## # A tibble: 2 x 5
## ID log10_size waterfront log10_price_hat price_hat
## <int> <dbl> <lgl> <dbl> <dbl>
## 1 1 2.9 TRUE 5.67 472063.
## 2 2 3.1 FALSE 5.52 328095.
Predictions of $472,000 and $328,000! Exceptional! You’re done with the multiple regression chapter, and now you’re onto model assessment and selection!
Refresher: sum of squared residuals
Let’s review how to compute the sum of squared residuals. You’ll do this for two models.
# Model 2
model_price_2 <- lm(log10_price ~ log10_size + bedrooms,
                    data = house_prices)
# Calculate squared residuals
get_regression_points(model_price_2) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(sum_sq_residuals = sum(sq_residuals))
## # A tibble: 1 x 1
## sum_sq_residuals
## <dbl>
## 1 605.
# Model 4
model_price_4 <- lm(log10_price ~ log10_size + waterfront,
                    data = house_prices)
# Calculate squared residuals
get_regression_points(model_price_4) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(sum_sq_residuals = sum(sq_residuals))
## # A tibble: 1 x 1
## sum_sq_residuals
## <dbl>
## 1 599.
Wonderful! Let’s use these two measures of model assessment to choose between these two models, or in other words, perform model selection!
Which model to select?
Based on these two values of the sum of squared residuals, which of these two models do you think is “better”, and hence which would you select?
- model_price_2 that uses log10_size and bedrooms?
- model_price_4 that uses log10_size and waterfront?
Since model_price_4’s value was lower at 599, select this one.
Given the choice of just these two models, the evidence suggests that using size and waterfront yields a better fit, so you should choose this one!
Assessing model fit with R-squared
Computing the R-squared of a model
Let’s compute the \(R^2\) summary value for the model with two numerical explanatory/predictor variables that you fit in Chapter 3: price as a function of size and the number of bedrooms.
Recall that \(R^2\) can be calculated as:
\(R^2 = 1 - \frac{\text{Var}(\text{residuals})}{\text{Var}(y)}\)
# Fit model
model_price_2 <- lm(log10_price ~ log10_size + bedrooms,
                    data = house_prices)
# Get fitted values & residuals, compute R^2 using residuals
get_regression_points(model_price_2) %>%
  summarize(r_squared = 1 - var(residual) / var(log10_price))
## # A tibble: 1 x 1
## r_squared
## <dbl>
## 1 0.465
Nice job! You observed an R-squared value of 0.465, which means that 46.5% of the total variability of the outcome variable log base 10 price can be explained by this model.
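As a sanity check (a sketch, not part of the original exercise), base R reports the same quantity for models with an intercept:
# Cross-check via base R's summary()
summary(model_price_2)$r.squared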
Comparing the R-squared of two models
Let’s now compute \(R^2\) for the model with one numerical and one categorical explanatory/predictor variable that you fit in Chapter 3 (price as a function of size and whether the house has a view of the waterfront), and compare its \(R^2\) with the one you just computed.
# Fit model
model_price_4 <- lm(log10_price ~ log10_size + waterfront,
                    data = house_prices)
# Get fitted values & residuals, compute R^2 using residuals
get_regression_points(model_price_4) %>%
  summarize(r_squared = 1 - var(residual) / var(log10_price))
## # A tibble: 1 x 1
## r_squared
## <dbl>
## 1 0.470
Since model_price_4 had a higher \(R^2\) of 0.470, it “fit” the data better.
Since using waterfront explained a higher proportion of the total variance of the outcome variable than using the number of bedrooms, using waterfront in our model is preferred.
\(RMSE\): the typical prediction error the model will make, in the same units as the outcome variable \(y\).
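In formula form, for \(n\) observations:
\(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\)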
Computing the MSE & RMSE of a model
Just as you did earlier with \(R^2\), which is a measure of model fit, let’s now compute the root mean squared error (RMSE) of our models, a commonly used measure of predictive error. Let’s use the model of price as a function of size and the number of bedrooms.
The model is available in your workspace as model_price_2.
# Get all residuals, square them, and take mean
get_regression_points(model_price_2) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(mse = mean(sq_residuals))
## # A tibble: 1 x 1
## mse
## <dbl>
## 1 0.0280
# Get all residuals, square them, take the mean and square root
get_regression_points(model_price_2) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(mse = mean(sq_residuals)) %>%
  mutate(rmse = sqrt(mse))
## # A tibble: 1 x 2
## mse rmse
## <dbl> <dbl>
## 1 0.0280 0.167
Woo hoo! The RMSE is 0.167. You can think of this as the “typical” prediction error this model makes.
Comparing the RMSE of two models
As you did using the sum of squared residuals and \(R^2\), let’s once again assess and compare the quality of your two models using the root mean squared error (RMSE). Note that RMSE is more typically used in prediction settings than explanatory settings.
model_price_2 and model_price_4 are available in your workspace.
# MSE and RMSE for model_price_2
get_regression_points(model_price_2) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(mse = mean(sq_residuals), rmse = sqrt(mean(sq_residuals)))
## # A tibble: 1 x 2
## mse rmse
## <dbl> <dbl>
## 1 0.0280 0.167
# MSE and RMSE for model_price_4
get_regression_points(model_price_4) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(mse = mean(sq_residuals), rmse = sqrt(mean(sq_residuals)))
## # A tibble: 1 x 2
## mse rmse
## <dbl> <dbl>
## 1 0.0277 0.166
Since model_price_4 had a lower RMSE of 0.166, this suggests that it has better predictive power.
RMSE can be thought of as the ‘typical’ error a predictive model will make.
Validation set approach
Use two independent datasets to:
- Train/fit your model on a training set
- Evaluate your model’s predictive performance on a separate test set
Fitting model to training data
It’s time to split your data into a training set to fit a model and a separate test set to evaluate the predictive power of the model. Before making this split, however, we first sample 100% of the rows of house_prices without replacement and assign the result to house_prices_shuffled. This has the effect of “shuffling” the rows, thereby ensuring that the training and test sets are randomly sampled.
# Set random number generator seed value for reproducibility
set.seed(76)
# Randomly reorder the rows
house_prices_shuffled <- house_prices %>%
  sample_frac(size = 1, replace = FALSE)
# Train/test split
train <- house_prices_shuffled %>%
  slice(1:10000)
test <- house_prices_shuffled %>%
  slice(10001:21613)
# Fit model to training set
train_model_2 <- lm(log10_price ~ log10_size + bedrooms, data = train)
Fabulous! Since you’ve fit/trained the predictive model on the training set, let’s now apply it to the test set to make predictions!
Predicting on test data
Now that you’ve trained the model on the training set, let’s apply the model to the test data, make predictions, and evaluate them. Recall that having a separate test set here simulates gathering a “new” independent dataset on which to test our model’s predictive performance.
The datasets train and test, as well as the trained model train_model_2, are available in your workspace.
# Make predictions on test set
get_regression_points(train_model_2, newdata = test)
## # A tibble: 11,613 x 6
## ID log10_price log10_size bedrooms log10_price_hat residual
## <int> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1 5.81 3.24 3 5.64 0.171
## 2 2 5.45 3.25 3 5.65 -0.204
## 3 3 5.69 3.38 3 5.78 -0.089
## 4 4 5.50 2.98 3 5.40 0.106
## 5 5 5.93 3.55 4 5.90 0.031
## 6 6 5.39 3.25 3 5.65 -0.267
## 7 7 5.61 3.06 2 5.51 0.101
## 8 8 5.38 3.12 2 5.56 -0.182
## 9 9 6.28 3.49 4 5.84 0.44
## 10 10 5.85 3.53 3 5.91 -0.058
## # ... with 11,603 more rows
# Compute RMSE
get_regression_points(train_model_2, newdata = test) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(rmse = sqrt(mean(sq_residuals)))
## # A tibble: 1 x 1
## rmse
## <dbl>
## 1 0.167
Magnificent! Your test-set RMSE using size and the number of bedrooms as predictor variables is 0.167, essentially matching the RMSE you computed on the full dataset earlier. This suggests the model’s predictive performance holds up on new data!
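As a natural follow-up (a sketch, not part of the original exercise), the same train/test procedure could be applied to the size and waterfront model so the two test-set RMSEs can be compared:
# Fit the size + waterfront model on the same training set
train_model_4 <- lm(log10_price ~ log10_size + waterfront, data = train)
# Compute its RMSE on the test set for comparison
get_regression_points(train_model_4, newdata = test) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(rmse = sqrt(mean(sq_residuals)))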