
#Trying it in R

Now that we know some of the common problems that can be solved using linear regression tools, we can try implementing it! Nearly every statistical software package available can do linear regression. We will outline how it can be done using R. For directions on how to install and use R, see the “Using R” tutorial on the Wookie.

This document contains R-code and output associated with the modeling process. If you have questions about how the models are built or how the R-code works, email paul@atrium.ai.

Consider the following scenario. You are trying to determine whether or not having a garage impacts the sales price of a home. You collect data on home sales in your city (which, in this case, happens to be Grand Junction, Colorado). Then, you plan to analyze it in R via the following steps:
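As a first step, the data can be loaded and its dimensions checked. A minimal sketch, assuming the sales data live in a CSV file; the file name `gj_homes.csv` and the data-frame name `homes` are placeholders for illustration:

```r
# Read the home-sales data (file name is a placeholder)
homes <- read.csv("gj_homes.csv")

# Number of rows (home sales) and columns (variables)
dim(homes)
```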

## [1] 145  39

Let’s take a look at the data:
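A couple of quick ways to do that, again using the placeholder name `homes`:

```r
# First few rows of the data
head(homes)

# Structure of every variable (type and example values)
str(homes)
```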

Every dataset contains variables that are not necessarily good predictors. In this case, we filter out the variables that are included in the dataset but aren’t needed for the analysis. Identifying these variables can sometimes be difficult. Talk to your domain experts (i.e., customers) when deciding which variables are unrelated to the process, such as comment fields or fields that don’t make sense in a model.

We can also treat the sale date in a more intuitive way using a package called lubridate (Grolemund and Wickham, 2011). In our model, we’re just going to use the month as a predictor instead of the full date of each house sale. R is full of packages like this that make regression easier.
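As a sketch of that step, assuming the raw data contain a sale-date column named `Sale_Date` stored as month/day/year text (both the column name and the format are assumptions here):

```r
library(lubridate)

# Parse the sale date (assumed month/day/year format) and
# keep only the month number as a predictor
homes$Month <- month(mdy(homes$Sale_Date))
```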

##Simple Linear Regression

We are now ready to try fitting a model. Let’s start with a simple linear regression that looks at whether or not the presence of a garage has an effect on house price. Here’s the model:

\[Price_i = \beta_0 + \beta_1*Garage_i + \epsilon_i \]

Here’s how to run the model in R. Thankfully, linear regression is easy in R and requires only one little bit of code:
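A minimal sketch of that call; the object name `fit_slr` and the data-frame name `homes` are placeholders:

```r
# Simple linear regression: Price as a function of Garage
fit_slr <- lm(Price ~ Garage, data = homes)

# Coefficient estimates, standard errors, t values, and p-values
summary(fit_slr)
```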

|             | Estimate | Std. Error | t value | Pr(>\|t\|) |
|-------------|---------:|-----------:|--------:|-----------:|
| (Intercept) | 92977    | 15359      | 6.054   | 1.184e-08  |
| Garage      | 81793    | 16097      | 5.081   | 1.154e-06  |

Fitting linear model: Price ~ Garage

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|--------------|--------------------:|--------:|-----------------:|
| 145          | 55377               | 0.1529  | 0.147            |

Based on the results above, we can see that a house with a garage is estimated by the model to sell for $81,793 more, on average, than a house that does not have a garage. The p-value associated with the Garage variable is very small, indicating strong statistical evidence that there is indeed a true difference in sales price between houses that have a garage and houses that do not.

##Assessing Diagnostics

It’s always a good idea to make sure that the models we use satisfy the assumptions of a linear regression model. Those assumptions are:

- The relationship between the response and the predictors is linear.
- The errors are independent.
- The errors have constant variance.
- The errors are normally distributed.

Thankfully, we can assess these assumptions with a few lines of code and a keen eye. R produces all four of the plots you can use when you call plot() on a fitted model (as shown below):

- Residuals vs. Fitted, for checking linearity and constant variance
- Normal Q-Q, for checking normality of the errors
- Scale-Location, for another look at constant variance
- Residuals vs. Leverage, for spotting influential observations

Here are the diagnostic plots that go with our simple linear regression model:
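A minimal sketch of how they can be generated, reusing the placeholder model object `fit_slr`:

```r
# Arrange all four diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(fit_slr)
```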

Based on the plots above, there might be an issue with non-constant variance. The Residuals vs. Fitted Plot indicates wide variability for observations with ‘large’ fitted values and much narrower variability for observations with ‘small’ fitted values. The QQ plot is not quite linear either, so it might be time to start thinking about tweaking our model by either transforming it or considering more variables.

##Multiple Regression: Models with More than One Variable

Maybe you decide that the presence/absence of a garage isn’t enough information to accurately predict a home’s sales price. You want to fit a model that contains more than just a single variable. That moves us into the world of multiple linear regression. Here, we can examine many predictor variables at a time. Here’s our new model with six predictor variables:

\[ Price_i = \beta_0 + \beta_1*Acres_i + \beta_2*Sqft_i + \beta_3*Bed_i + \beta_4*Bath_i + \beta_5*Garage_i + \beta_6*Month_i + \epsilon_i \]

Implementing MLR in R works the same way as a single-variable regression; you simply use the + sign to add variables to the model formula, as sketched below.
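A sketch of the call, using the formula reported in the model summary further down (`fit_mlr` and `homes` are placeholder names):

```r
# Multiple linear regression: '+' adds predictors to the formula
fit_mlr <- lm(Price ~ ACRES + Total_HeatedSqFtV + BEDROOM +
                Full.Bath + Garage + Month,
              data = homes)
```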

Remember, it’s always a good idea to check diagnostics! They are produced the same way as in the simple linear model case (see the sketch below). Thankfully, these look pretty good! The QQ-plot is nearly perfectly linear and the Residuals vs. Fitted plot shows even spread across all fitted values. This model appears to be quite reasonable.
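A minimal sketch, reusing the placeholder name `fit_mlr` from above:

```r
# The same four diagnostic plots, now for the multiple regression model
par(mfrow = c(2, 2))
plot(fit_mlr)
```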

##Model Interpretation

Now that you have a reasonably good model, it’s time to use it! Here are the results from the model. Which effects are ‘statistically significant’?

|                   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|-------------------|---------:|-----------:|--------:|-----------:|
| (Intercept)       | -62376   | 14982      | -4.163  | 5.516e-05  |
| ACRES             | 97851    | 47412      | 2.064   | 0.04092    |
| Total_HeatedSqFtV | 159.9    | 8.583      | 18.63   | 2.195e-39  |
| BEDROOM           | -14276   | 4557       | -3.133  | 0.002117   |
| Full.Bath         | 5637     | 4616       | 1.221   | 0.2241     |
| Garage            | 22041    | 8658       | 2.546   | 0.01201    |
| Month             | 1514     | 710.5      | 2.13    | 0.03492    |

Fitting linear model: Price ~ ACRES + Total_HeatedSqFtV + BEDROOM + Full.Bath + Garage + Month

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|--------------|--------------------:|--------:|-----------------:|
| 144          | 26602               | 0.8124  | 0.8042           |

If you said “darn-near everything”, you’re on the right track. To determine which features are ‘statistically significant’ (meaning that we have strong evidence that they play an important role in affecting sales price), look at the p-value (denoted Pr(>|t|)). Variables with p-values less than 0.05 are usually considered ‘significant’.

Let’s interpret a few of these variables:

- Total_HeatedSqFtV: holding the other predictors constant, each additional square foot of heated space is associated with an estimated increase of about $160 in sales price.
- Garage: holding the other predictors constant, a house with a garage is estimated to sell for about $22,041 more than a comparable house without one.
- BEDROOM: perhaps surprisingly, each additional bedroom is associated with an estimated decrease of about $14,276 in sales price once square footage and the other predictors are held fixed.

The metrics at the bottom of the summary table tell us our R-squared value, which can be thought of as the proportion of variation in sales price explained by the predictors in the model. Roughly, we are explaining 81% of the variation in house prices with our model.

##Visualizing Models

One nice tool for visualizing estimated coefficients can be found in the effects package in R (Fox, 2009). Effect displays show the estimated effect of each predictor on the response variable, and they can be very handy for explaining a model’s results to a client or customer. They are also handy for interpreting how each predictor affects the response.

Here’s an example:
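A minimal sketch, assuming the fitted model object `fit_mlr` from above; `allEffects()` computes the estimated effect of each predictor, and `plot()` draws one panel per predictor:

```r
library(effects)

# Effect displays: one panel per predictor, holding the others fixed
plot(allEffects(fit_mlr))
```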

## Loading required package: carData
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.

##Making Predictions

Finally, we’ve fit a model, checked diagnostics, and interpreted some of the most salient results. Now we are ready to make some predictions!

We have a set of new homes for which we want to predict the sales price.

Let’s try using the predict function, which takes our regression model and a new dataset (‘newdata’) and easily generates predicted prices. We can then examine a plot of predicted sales over time and compare them to the data that we used to fit the model.
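A sketch of that step; `new_homes` is a placeholder name for the new dataset, which should contain the same predictor columns used to fit the model:

```r
# Generate predicted sales prices for the new homes
preds <- predict(fit_mlr, newdata = new_homes)

head(preds)
```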
