Introduction

This is a brief tutorial on the use of linear regression for machine learning in R. What is linear regression? In essence, linear regression is a method for creating an equation that best predicts some continuous, numerical outcome variable from a set of given input variables. If this doesn’t make sense right off the bat, don’t worry! It should make more sense as we move through the example.

This tutorial will use the programming language “R”. If you follow along in R (preferably run through the RStudio IDE), you should get the same results!

Preparing our analysis

Loading tidyverse

For these analyses, we will only be using one package (well… family of packages) that isn’t in base R: the tidyverse. The tidyverse is a collection of packages that are excellent for manipulating data, visualizing data, and more. I load the tidyverse into nearly every R project I write; it’s that useful.

If you already have the tidyverse installed, you can simply run the following code chunk to load it. If you do not already have it installed, you should first run this code: install.packages("tidyverse"). (Note: the quotation marks are necessary for the installation function, but not the loading function)

library(tidyverse)

Now to load in our data

In this tutorial we will be using a dataset of home sales in King County, WA, USA during 2014 and 2015. We will be looking to predict the sale price of a home. This dataset was originally posted to Kaggle by user harlfoxem (https://www.kaggle.com/harlfoxem/housesalesprediction), but it was re-hosted by GitHub user kevintheduu. We will be loading the data from this GitHub repository and saving it into an R object called “dat”.

dat <- read_csv("https://github.com/kevintheduu/predicting-king-county-housing-prices/raw/master/kc_house_data.csv")

The data should now be in our R session and ready to use!

We can take a quick glimpse at the data.

glimpse(dat)
## Rows: 21,597
## Columns: 21
## $ id            <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005~
## $ date          <chr> "10/13/2014", "12/9/2014", "2/25/2015", "12/9/2014", "2/~
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1230000, 257500,~
## $ bedrooms      <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,~
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.~
## $ sqft_living   <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189~
## $ sqft_lot      <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,~
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1~
## $ waterfront    <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0, 0, 0, ~
## $ view          <dbl> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0~
## $ condition     <dbl> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,~
## $ grade         <dbl> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7~
## $ sqft_above    <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189~
## $ sqft_basement <chr> "0.0", "400.0", "0.0", "910.0", "0.0", "1530.0", "?", "0~
## $ yr_built      <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20~
## $ yr_renovated  <dbl> 0, 1991, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, ~
## $ zipcode       <dbl> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, ~
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47~
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0~
## $ sqft_living15 <dbl> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23~
## $ sqft_lot15    <dbl> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, ~

From the output of this code we can see the following information:

  • There are 21,597 home sales (the number of rows).

  • There are 21 variables included in the dataset (the number of columns)

  • The name of each variable

  • The type of each variable (e.g. <dbl>)

  • A small subset of the data in each variable

Our outcome variable

One variable of interest right now is the price variable. This is the value we want to predict. So let’s take a look at this variable in more detail.

First, let’s see if there are any missing values:

filter(dat, is.na(price))
## # A tibble: 0 x 21
## # ... with 21 variables: id <dbl>, date <chr>, price <dbl>, bedrooms <dbl>,
## #   bathrooms <dbl>, sqft_living <dbl>, sqft_lot <dbl>, floors <dbl>,
## #   waterfront <dbl>, view <dbl>, condition <dbl>, grade <dbl>,
## #   sqft_above <dbl>, sqft_basement <chr>, yr_built <dbl>, yr_renovated <dbl>,
## #   zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>

The code above filters our dataset “dat” so that it only returns rows where the “price” variable is NA. Fortunately this returns no results, meaning there are no missing values of our price variable. Good news!
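
As an alternative quick check (a minimal sketch, not part of the original output), we can count the missing values directly:

sum(is.na(dat$price)) # Counts the NA values in price; this should return 0, matching the empty filter result above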

One potential issue with our price variable, however, is that if we were to look for predictors of this variable in its current, unadjusted form, we’d be investigating an additive, rather than a multiplicative, relationship. That is to say, any added value in price would have the same weight across the entire distribution of houses.

As a practical example of what this means, the difference between a $200,000 house and $220,000 would be the same as the difference between a $2,000,000 house and a $2,020,000 house because both have differences of $20,000. Instead, we may be more interested in the multiplicative relationship. That is, we may be more interested in a percentage increase in a home price, such that the difference between a $200,000 house and $220,000 is akin to the difference between a $2,000,000 house and a $2,200,000 house (the latter being 10% more expensive than the former in both cases).
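
We can see this with a quick back-of-the-envelope calculation in R (a small illustrative sketch): on the log10 scale, both 10% increases are the same size.

log10(220000) - log10(200000)   # Difference on the log10 scale for the cheaper pair (~0.041)
log10(2200000) - log10(2000000) # The same difference for the pricier pair, because both are 10% increases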

We can correct for this issue by “log transforming” our variable. We can create a new variable which is the log10 transformation of the price variable. We can name this new variable log10price and create it as follows.

dat <- mutate(dat, log10price = log10(price))

In the original price variable, a 1-unit increase represented a $1 increase in value. In this new variable, a 1-unit increase represents a 10-fold increase in value.

Let’s take a look at the distribution of this new variable.

ggplot(dat, aes(x = log10price)) +
  geom_histogram(binwidth = 0.05)

We can see that the distribution is approximately normal, centered around 5.6 or so. This looks like a nice, clean variable to work with. Let’s get into trying to predict it.
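
Before we move on, here is a quick optional check of that center (a minimal sketch):

summary(dat$log10price) # The mean and median should both fall near 5.6, matching the histogram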

Simple linear regression

Now that we have our outcome variable ready to go, we want to think through what in our dataset may predict this variable well. It is very reasonable to think that home prices will be correlated with the square footage of living space, so this is a great place to start. Let’s take a look at our sqft_living variable and see if there are any missing values.

filter(dat, is.na(sqft_living))
## # A tibble: 0 x 22
## # ... with 22 variables: id <dbl>, date <chr>, price <dbl>, bedrooms <dbl>,
## #   bathrooms <dbl>, sqft_living <dbl>, sqft_lot <dbl>, floors <dbl>,
## #   waterfront <dbl>, view <dbl>, condition <dbl>, grade <dbl>,
## #   sqft_above <dbl>, sqft_basement <chr>, yr_built <dbl>, yr_renovated <dbl>,
## #   zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>,
## #   sqft_lot15 <dbl>, log10price <dbl>

No missing values! Good news.

Just like with price, we may be more interested in a multiplicative relationship than an additive one, so let’s take the log10 of our living space variable and put it in a new variable.

dat <- mutate(dat, log10sqftliving = log10(sqft_living))

And let’s take a look at the distribution of our new variable.

ggplot(dat, aes(x = log10sqftliving)) +
  geom_histogram(binwidth = 0.02)

Looks pretty good! Now we want to see how well this variable predicts log10price.

Train and test data

Let’s not get ahead of ourselves though. Before we get into running any statistical models, we want to split our data into “train” and “test” datasets. One potential concern with a statistical model is that it will “overfit” our data. That is, it might fit the data we train it on very well, but it might not perform very well on data it wasn’t trained on (and therefore, it won’t perform well in the real world).

To lessen this worry, we can try our model on “test” data it wasn’t trained on. If the model fits these test data well, it probably wasn’t overfit.

With that explanation out of the way, let’s make this split (See the comments marked by # in the code to understand what each step is doing.)

set.seed(94) #If you enter this seed, you should get the same results I get on the following random code
dat <- mutate(dat, pID = row_number()) #Gives each row in the original dataset a unique identifier, based on its row number

train <- slice_sample(dat, prop = .7, replace = FALSE) #Takes a randomly-selected 70% of the rows from the original dataset to serve as the training dataset. 
test <- filter(dat, !pID %in% train$pID) #Takes all rows from dat with a pID NOT in train to create the new test dataset. In other words, it takes the 30% of rows not included in the train dataset. 

Now we have our training and test datasets. All models we train will only be run on the train dataset for now.

Back to simple linear regression

We now have training data, an outcome variable, and a potential predictor variable. Let’s create our first regression model!

For now, we want to simply predict one outcome variable (log10price) from one input variable (log10sqftliving). This is called “simple linear regression.” We can model it and store the model results in a new object (model1) using the following syntax.

model1 <- lm(log10price ~ log10sqftliving, data = train)

In the above syntax, lm() indicates we want to run a linear model. log10price ~ log10sqftliving is a formula that can be translated as “log10price predicted by log10sqftliving.” And data = train simply says we’re using the training data for this model.

So now we’ve run our model! But we don’t yet have any usable output. To get a lot of useful information, we can simply run a summary() function on our model object. Like so:

summary(model1)
## 
## Call:
## lm(formula = log10price ~ log10sqftliving, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48212 -0.12698  0.00645  0.11162  0.57809 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.933542   0.024328   120.6   <2e-16 ***
## log10sqftliving 0.833931   0.007408   112.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1683 on 15115 degrees of freedom
## Multiple R-squared:  0.4561, Adjusted R-squared:  0.456 
## F-statistic: 1.267e+04 on 1 and 15115 DF,  p-value: < 2.2e-16

We get a lot of information here! But what does it all mean? A great first place to look is at the bottom of this output. The ‘Multiple R-squared’ of 0.4561 is a measure of how well this model fits the data. Specifically, we can say that about 46% of the variance in the outcome can be explained by the predictors (in this case, that’s just log10sqftliving). That’s not a great model fit, but it’s not bad for a single predictor!

We can also look at the p-value for the F-statistic on the bottom right. Because it is far below 0.05, we can conclude that this model fits the data much better than a simple horizontal line at the sample average would.
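
If you want to see that comparison explicitly (a minimal optional sketch), you can fit the intercept-only model yourself and compare the two with an F-test:

model0 <- lm(log10price ~ 1, data = train) # Intercept-only model: predicts the sample average for every home
anova(model0, model1)                      # F-test comparing the two models; the p-value matches the one in summary(model1)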

Let’s also look at the coefficients. A simple regression equation takes the following format: [predicted outcome] = intercept + coefficient*x

Specifically then, for this model, we can say that the predicted log10price is approximately equal to 2.93 + 0.83 * log10sqftliving.
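
As a quick sanity check of this equation (a small sketch using a hypothetical 2,000-square-foot home), we can pull the coefficients out of the model object and compute a prediction by hand:

b <- coef(model1)         # Named vector containing the intercept and the log10sqftliving slope
b[1] + b[2] * log10(2000) # Predicted log10price for a hypothetical 2,000 sq ft home
predict(model1, newdata = data.frame(log10sqftliving = log10(2000))) # Same prediction via predict()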

If you remember your algebra, you may notice that the above equation is simply the equation of a line. So we can plot it as a line!

Plotting a simple regression

Luckily, with R it is very simple to plot a simple linear regression. We can do so as follows:

ggplot(train, aes(log10sqftliving, log10price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

In the above plot, the black dots represent the actual log10price value for each home. The blue line represents the predicted log10price, as predicted by log10sqftliving. As you can see, the line predicts the values fairly well, but many of the points deviate pretty widely from the line (i.e. they have large “errors” or “residuals”). So let’s see if we can predict them any more closely.

Multiple regression

Multiple regression is much like simple regression, except we include more than one predictor in the model. We can consider other variables to include alongside log10sqftliving to predict price. One reasonable predictor is the number of bedrooms. It seems completely reasonable that a house with more bedrooms would be worth more. We can include that as a predictor in the model like so (and we’ll go ahead and summarize it too):

model2 <- lm(log10price ~ log10sqftliving + bedrooms, data = train)
summary(model2)
## 
## Call:
## lm(formula = log10price ~ log10sqftliving + bedrooms, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4990 -0.1249  0.0047  0.1108  1.0599 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.731740   0.027331   99.95   <2e-16 ***
## log10sqftliving  0.925211   0.009363   98.81   <2e-16 ***
## bedrooms        -0.028939   0.001840  -15.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.167 on 15114 degrees of freedom
## Multiple R-squared:  0.4648, Adjusted R-squared:  0.4648 
## F-statistic:  6564 on 2 and 15114 DF,  p-value: < 2.2e-16

Ok, so how do we interpret this? Well, let’s start by looking at the adjusted R-squared (the adjusted R-squared becomes more valuable than the regular R-squared as we add more predictors, because it partially adjusts for the risk of overfitting). This R-squared is slightly higher than the one from our simple regression. Good news! It means we’re fitting the data more closely and explaining more of the variance.
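
To put the two models’ adjusted R-squared values side by side (a quick optional sketch), we can pull them straight out of the summary objects:

summary(model1)$adj.r.squared # ~0.456 for the simple regression
summary(model2)$adj.r.squared # ~0.465 for the two-predictor model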

We can also look at our coefficients. Just like before, the log10sqftliving coefficient is significant and positive. So as the size of the house goes up, so does the value of the house, which makes sense. We also see that the bedrooms coefficient is significant, suggesting that this variable helps predict home prices. But wait! This coefficient is negative! How does that make sense? Are houses with more bedrooms worth less? This needs to be considered in more detail.

Interpreting multiple regression

One assumption of multiple regression is a lack of “multicollinearity”. This means that we assume the predictor variables are not strongly correlated with each other. That assumption is very likely violated in this model, but fortunately we can easily test for it.

cor.test(train$log10sqftliving, train$bedrooms)
## 
##  Pearson's product-moment correlation
## 
## data:  train$log10sqftliving and train$bedrooms
## t = 97.099, df = 15115, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6098803 0.6295171
## sample estimates:
##       cor 
## 0.6197957

As expected, there’s a pretty high correlation between our two predictor variables, and the assumption of no multicollinearity is violated. This simply means that we need to be careful about our interpretation of the model output. So looking back at our output, how should we interpret the negative coefficient? Well, rather than concluding that homes with more bedrooms are worth less, what we should conclude is that among houses of the same size, those with more bedrooms are worth less. In other words, when we hold predictor 1 (the size of the house) constant, the coefficient for predictor 2 (the number of bedrooms) is negative. In practical terms, this may demonstrate that if you have two houses of the same size, and an extra bedroom is squeezed into one of them, the latter house is less valuable. When considered in this light, the results make sense! So we just need to be careful about our interpretations.
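
To see this “holding size constant” interpretation in action (a small sketch with two hypothetical houses), we can predict values for two homes of identical size but different bedroom counts:

same_size <- data.frame(log10sqftliving = log10(2000), bedrooms = c(3, 4)) # Two hypothetical 2,000 sq ft homes
predict(model2, newdata = same_size) # The 4-bedroom home gets a slightly lower predicted log10price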

With this in mind, we can interpret the predictions as an equation, just like with our simple regression. The only difference is that there is an additional predictor variable and coefficient.

So the predicted outcome is now approximately 2.73 + 0.93 * log10sqftliving - .03 * bedrooms

For those who recall their algebra, you’ll now note that the “line” of best fit is no longer a line at all, but rather a plane. This makes the equation harder to plot, so we won’t do so, but understand that the principle is roughly the same as with our simple regression; we have simply extended it into another dimension.

Adding another predictor

We can expand further on this multiple regression by considering other predictors. One candidate is ZIP code. It is reasonable to expect that some areas of the county will be more desirable to live in than others, so zipcode will partially predict home price. We therefore may want to include the “zipcode” variable in the model.

We need to be careful here, however. At the moment, R treats zipcode as a continuous, numerical variable. Instead, we want to treat it as a categorical variable. Fortunately, R makes this easy with the factor() function.

model3 <- lm(log10price ~ log10sqftliving + bedrooms + factor(zipcode), data = train)

By wrapping zipcode in factor(), R implicitly does something called “dummy coding” (closely related to “one-hot encoding”). R essentially creates a separate indicator variable for each possible zip code (except one, which serves as a reference group, as we’ll see below). If a house in a particular row belongs to a particular zipcode, a 1 is placed in that cell and a 0 is placed in the cell for every other zipcode.
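
If you’d like to peek at the indicator variables R creates behind the scenes (a minimal optional sketch), model.matrix() will show them:

head(model.matrix(~ factor(zipcode), data = train)[, 1:5]) # First few dummy columns; the reference zip code gets no column of its own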

We can see in more detail what R did by summarizing this model result.

summary(model3)
## 
## Call:
## lm(formula = log10price ~ log10sqftliving + bedrooms + factor(zipcode), 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55724 -0.05556 -0.00168  0.05081  0.60834 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.897541   0.018461 156.951  < 2e-16 ***
## log10sqftliving       0.794184   0.005928 133.980  < 2e-16 ***
## bedrooms             -0.017521   0.001090 -16.071  < 2e-16 ***
## factor(zipcode)98002 -0.015111   0.010720  -1.410 0.158666    
## factor(zipcode)98003  0.015300   0.009451   1.619 0.105476    
## factor(zipcode)98004  0.531522   0.009095  58.443  < 2e-16 ***
## factor(zipcode)98005  0.364088   0.011026  33.022  < 2e-16 ***
## factor(zipcode)98006  0.339092   0.008320  40.755  < 2e-16 ***
## factor(zipcode)98007  0.313368   0.011620  26.967  < 2e-16 ***
## factor(zipcode)98008  0.322658   0.009340  34.547  < 2e-16 ***
## factor(zipcode)98010  0.148860   0.013554  10.983  < 2e-16 ***
## factor(zipcode)98011  0.200756   0.010510  19.102  < 2e-16 ***
## factor(zipcode)98014  0.150858   0.012487  12.081  < 2e-16 ***
## factor(zipcode)98019  0.141467   0.010369  13.644  < 2e-16 ***
## factor(zipcode)98022  0.059561   0.009901   6.016 1.83e-09 ***
## factor(zipcode)98023 -0.001096   0.008299  -0.132 0.894968    
## factor(zipcode)98024  0.231761   0.013894  16.680  < 2e-16 ***
## factor(zipcode)98027  0.249203   0.008519  29.254  < 2e-16 ***
## factor(zipcode)98028  0.187619   0.009639  19.465  < 2e-16 ***
## factor(zipcode)98029  0.280004   0.009205  30.419  < 2e-16 ***
## factor(zipcode)98030  0.022451   0.009864   2.276 0.022856 *  
## factor(zipcode)98031  0.036193   0.009334   3.877 0.000106 ***
## factor(zipcode)98032  0.000293   0.012014   0.024 0.980541    
## factor(zipcode)98033  0.370178   0.008547  43.313  < 2e-16 ***
## factor(zipcode)98034  0.251538   0.008151  30.861  < 2e-16 ***
## factor(zipcode)98038  0.071447   0.008027   8.901  < 2e-16 ***
## factor(zipcode)98039  0.645747   0.017671  36.543  < 2e-16 ***
## factor(zipcode)98040  0.466159   0.009438  49.389  < 2e-16 ***
## factor(zipcode)98042  0.040709   0.008137   5.003 5.71e-07 ***
## factor(zipcode)98045  0.159248   0.010070  15.814  < 2e-16 ***
## factor(zipcode)98052  0.297213   0.008006  37.124  < 2e-16 ***
## factor(zipcode)98053  0.270400   0.008666  31.203  < 2e-16 ***
## factor(zipcode)98055  0.059237   0.009520   6.222 5.03e-10 ***
## factor(zipcode)98056  0.150546   0.008526  17.658  < 2e-16 ***
## factor(zipcode)98058  0.081677   0.008364   9.766  < 2e-16 ***
## factor(zipcode)98059  0.166478   0.008236  20.214  < 2e-16 ***
## factor(zipcode)98065  0.177128   0.009164  19.329  < 2e-16 ***
## factor(zipcode)98070  0.235956   0.012027  19.618  < 2e-16 ***
## factor(zipcode)98072  0.235244   0.009521  24.709  < 2e-16 ***
## factor(zipcode)98074  0.278506   0.008370  33.273  < 2e-16 ***
## factor(zipcode)98075  0.303203   0.008797  34.465  < 2e-16 ***
## factor(zipcode)98077  0.257871   0.010313  25.005  < 2e-16 ***
## factor(zipcode)98092  0.035785   0.008854   4.042 5.33e-05 ***
## factor(zipcode)98102  0.446815   0.013036  34.275  < 2e-16 ***
## factor(zipcode)98103  0.358079   0.007913  45.252  < 2e-16 ***
## factor(zipcode)98105  0.442353   0.010030  44.104  < 2e-16 ***
## factor(zipcode)98106  0.151766   0.009121  16.640  < 2e-16 ***
## factor(zipcode)98107  0.375121   0.009570  39.196  < 2e-16 ***
## factor(zipcode)98108  0.145826   0.011074  13.168  < 2e-16 ***
## factor(zipcode)98109  0.453002   0.013250  34.190  < 2e-16 ***
## factor(zipcode)98112  0.480846   0.009441  50.933  < 2e-16 ***
## factor(zipcode)98115  0.358701   0.007957  45.078  < 2e-16 ***
## factor(zipcode)98116  0.351817   0.009100  38.661  < 2e-16 ***
## factor(zipcode)98117  0.355218   0.007981  44.508  < 2e-16 ***
## factor(zipcode)98118  0.204275   0.008180  24.973  < 2e-16 ***
## factor(zipcode)98119  0.452412   0.010343  43.742  < 2e-16 ***
## factor(zipcode)98122  0.364658   0.009097  40.086  < 2e-16 ***
## factor(zipcode)98125  0.256148   0.008582  29.849  < 2e-16 ***
## factor(zipcode)98126  0.248865   0.008877  28.035  < 2e-16 ***
## factor(zipcode)98133  0.202737   0.008170  24.815  < 2e-16 ***
## factor(zipcode)98136  0.325828   0.009571  34.045  < 2e-16 ***
## factor(zipcode)98144  0.302764   0.008911  33.976  < 2e-16 ***
## factor(zipcode)98146  0.136466   0.009328  14.630  < 2e-16 ***
## factor(zipcode)98148  0.070065   0.015688   4.466 8.02e-06 ***
## factor(zipcode)98155  0.199436   0.008382  23.794  < 2e-16 ***
## factor(zipcode)98166  0.168447   0.009668  17.423  < 2e-16 ***
## factor(zipcode)98168  0.021482   0.009493   2.263 0.023646 *  
## factor(zipcode)98177  0.288923   0.009599  30.100  < 2e-16 ***
## factor(zipcode)98178  0.066396   0.009466   7.014 2.42e-12 ***
## factor(zipcode)98188  0.031858   0.011786   2.703 0.006878 ** 
## factor(zipcode)98198  0.051714   0.009427   5.486 4.18e-08 ***
## factor(zipcode)98199  0.396659   0.009031  43.921  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09728 on 15045 degrees of freedom
## Multiple R-squared:  0.8192, Adjusted R-squared:  0.8183 
## F-statistic: 960.1 on 71 and 15045 DF,  p-value: < 2.2e-16

This output is unwieldy! But we can interpret it using the information about how R treats factors that I gave above. R implicitly takes one reference group (in this case zip code 98001), and the Intercept and other coefficients are considered with respect to that reference group. Thus, for a house in zip code 98001, the predicted log10price is approximately 2.90 + .79 * log10sqftliving - .02 * bedrooms. Then, for all other zipcodes, the coefficient represents an inflation or deflation of the intercept. For example, since zip code 98004 has a coefficient of about .53, the intercept for houses in zip code 98004 is about 2.90 + .53, or 3.43. The coefficients for log10sqftliving and bedrooms are the same as for the reference group. So the predicted outcome for 98004 is approximately 3.43 + .79 * log10sqftliving - .02 * bedrooms. Not too complicated!
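
As a quick check of that arithmetic (a small optional sketch), we can pull the relevant coefficients out of model3 and add them together:

coef(model3)["(Intercept)"]                                        # Intercept for the reference group, zip code 98001
coef(model3)["(Intercept)"] + coef(model3)["factor(zipcode)98004"] # Effective intercept for zip code 98004 (~3.43)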

That explains the coefficients, but what about the R-squared? Here, we have an adjusted R-squared of .82. That means we can explain over 80% of the variance in (log-transformed) home price by knowing nothing more than zipcode, home size, and number of bedrooms. That’s pretty good! We could almost certainly get an even better fit by considering even more predictors, but for the sake of keeping this tutorial simple, let’s go ahead and use this model for our predictions.

Model fit: train vs. test

So we have a model that seems to fit the data pretty well. But as I mentioned before, we always run the risk of overfitting the data. It could be the case that this model fits the data we gave it very well, but it won’t make good predictions about data that are “out of sample.” Fortunately, we reserved a test dataset that wasn’t used to train the model, so we can test for overfitting!

We want a metric that can be used to compare the fit on our train dataset vs. our test dataset. One good option is the “root mean squared error” or RMSE. The RMSE is essentially a measure of how far a typical real outcome is from the model-predicted one. We calculate it by subtracting each predicted outcome from the real outcome to get the error. Then we square each of these errors and take the mean of all of them. Finally, we take the square root of this mean. Fortunately, this is quite easy in R. Let’s start by calculating this for our train dataset.

errorTrain <- train$log10price - predict(model3)
rmseTrain <- sqrt(mean(errorTrain^2))

Now let’s do the same for our test dataset.

errorTest <- test$log10price - predict(model3, newdata = test)
rmseTest <- sqrt(mean(errorTest^2))

Now let’s look at both of these RMSEs and see how similar they are.

rmseTrain
## [1] 0.09705103
rmseTest
## [1] 0.0973942

Both RMSEs are very close. The model fits the test dataset only very slightly worse than the train one, so it looks like we don’t have a big problem with overfitting. That’s good news! It means that we can be reasonably confident we can use this model for predictions.

However, it’s also very important that we understand what this does not tell us. It is important to realize that both our train and test datasets were drawn from the same population. So we cannot be as confident that we can make predictions outside of this population. For example, if a house is outside of the zipcodes represented in this dataset or was sold outside of the 2014-2015 period represented here, we cannot be as confident about our predictions. With that disclaimer out of the way, let’s try out a prediction that we can reasonably make.

Making predictions

Let’s say there was a house for sale in zip code 98116 that was 2,000 square feet with 4 bedrooms. We want to use our model to predict what it will sell for. R makes this easy!

First we want to create a new dataframe that includes this information. We want to make sure we name the variables in accordance with the names used in our model. We can do this like so:

newhouse <- data.frame(log10sqftliving = log10(2000), bedrooms = 4, zipcode = 98116)

Now we can simply run the predict function on our model, calling a newdata argument on our new newhouse dataframe.

prediction <- predict(model3, newdata = newhouse)

Now to see the value of our prediction, we can simply run the prediction object.

prediction
##        1 
## 5.800897

This outcome may not be terribly useful in itself. Recall that this is a log transformed variable, so we’ll likely want to transform it back into a raw number for clarity’s sake. We can do this like so:

10^prediction
##        1 
## 632262.3

And there we have it! Our best guess for the price of this house is about $630,000. We can use this predict function to estimate the 2014/2015 value of any house in King County as long as we know its zipcode, square footage, and number of bedrooms. Linear regression is pretty simple, but it can be a very powerful tool!
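
As a final illustrative sketch (with entirely made-up houses), the same predict() call works just as well for several homes at once:

more_houses <- data.frame(
  log10sqftliving = log10(c(1500, 2500, 3500)), # Three hypothetical home sizes
  bedrooms        = c(2, 3, 5),                 # Hypothetical bedroom counts
  zipcode         = c(98103, 98042, 98004)      # Zip codes that appear in the training data
)
10^predict(model3, newdata = more_houses) # Predicted sale prices, back-transformed to dollars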