Linear regression

Linear regression is a fundamental statistical method used to predict a target or outcome variable based on one or more input features. It’s called “linear” regression because it models the relationship between the target and the inputs as a linear function.

In the simplest case, where there is just one input feature, linear regression can be expressed mathematically as: y = β0 + β1 * x + ε

Here:

  • y is the target (outcome) variable we are trying to predict
  • x is the input feature
  • β0 is the intercept - the predicted value of y when x is zero
  • β1 is the coefficient (slope) for x - the change in y associated with a one-unit change in x
  • ε is the error term - the part of y the model does not explain

In multiple linear regression, where there is more than one input feature, the equation is extended to: y = β0 + β1 * x1 + β2 * x2 + ... + βn * xn + ε

Here, x1, x2, …, xn are the input features, and β1, β2, …, βn are the coefficients that the model learns during the training process.

The objective of linear regression is to find the values of the coefficients (β0, β1, …, βn) that minimize the sum of the squared residuals, a method known as least squares.
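To make the least-squares idea concrete, here is a minimal sketch (using simulated data, not the Charlotte data analyzed below) that computes the one-variable estimates by hand and checks them against R’s built-in lm() function, which we introduce later in this tutorial.

# Simulate data from y = 2 + 3x + noise, then recover the coefficients
set.seed(42)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)

# Closed-form least-squares estimates for the single-feature case
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 <- mean(y) - beta1 * mean(x)
c(beta0, beta1)

# lm() produces the same estimates
coef(lm(y ~ x))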

Linear regression makes several key assumptions:

  • Linearity: the relationship between the inputs and the target is linear
  • Independence: the errors are independent of one another
  • Homoscedasticity: the errors have constant variance across all levels of the inputs
  • Normality: the errors are approximately normally distributed

Violations of these assumptions can lead to problems in the estimation and interpretation of the model.
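In R, you can get a quick visual check of several of these assumptions from the built-in diagnostic plots for a fitted model. A minimal sketch, where fit stands in for any model estimated with lm() (the function we use below):

# plot() on an lm object draws diagnostic plots: residuals vs. fitted values
# (linearity, constant variance), a normal Q-Q plot of the residuals, and
# leverage plots for spotting influential observations
par(mfrow = c(2, 2))  # arrange the four plots in a 2x2 grid
plot(fit)             # 'fit' is a placeholder for an lm() model object
par(mfrow = c(1, 1))  # reset the plotting layout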

A linear model of home sales price in Charlotte

We will use data on home sales in Charlotte to build a model of home sales price. We have data on about 1,500 single-family home sales in Mecklenburg County from 2010 to the present. This includes the sale price of the home as well as a number of attributes of the home collected by the tax assessor.

First, we need to load libraries. For this exercise, we only need tidyverse. We also set options(scipen = 999) so that R prints full numbers rather than scientific notation.

library(tidyverse)
options(scipen = 999)

Next, we’ll load our data.

data <- read_csv("charlotte_home_sales.csv")
Rows: 1506 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (10): houseunit, stdir, stname, sttype, stsuffix, actype, vacantorim, s...
dbl   (8): amt_price, houseno, heatedarea, yearbuilt, fullbaths, halfbaths, ...
date  (1): dte_dateof

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The columns of the data are:

  • dte_dateof - date of sale
  • houseno, houseunit, stdir, stname, sttype, stsuffix - components of street address (number, unit number, street direction, street name, street type, and street suffix)
  • amt_price - Sale price in dollars
  • heatedarea - area of the home that is heated (i.e. living area), square feet
  • yearbuilt - year the home was built
  • fullbaths - number of full baths
  • halfbaths - number of half baths
  • bedrooms - number of bedrooms
  • actype - type of air conditioning/cooling
  • vacantorim - whether the lot is VACant or IMProved (i.e. there is a building on the lot)
  • totalac - total acreage of the lot
  • siding - type of siding (wall coverage)
  • area - the part of the Charlotte metropolitan area where the home is located (Public Use Microdata Area code)

Before we start doing any regressions, it’s always a good idea to get to know our data a bit first. Make a histogram of sale price:

ggplot(data, aes(x=amt_price)) +
  geom_histogram(bins=50)

Make a histogram of heated area.

ggplot(data, aes(x=heatedarea)) +
  geom_histogram(bins=50)

In linear regression, the input variables don’t need to be perfectly normal (following the bell curve), but they shouldn’t be heavily skewed either. The above histograms look good enough.
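If a variable does look skewed, one quick check (a sketch, not required for this exercise) is to redraw the histogram on a log scale; a shape that becomes roughly symmetric on the log scale points to a long right tail in the original units.

ggplot(data, aes(x=amt_price)) +
  geom_histogram(bins=50) +
  scale_x_log10()  # a log scale compresses the long right tail of prices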

Now, we can start thinking about building our model. One variable that is likely to be associated with home prices is the size of the home (heatedarea). We can make a scatterplot of these variables before we start thinking about a model.

ggplot(data, aes(x=heatedarea, y=amt_price)) +
  geom_point(size=0.1)

It’s hard to read because of the outliers (we’ll return to this issue shortly). For now, we can just set the axis limits to look at the bulk of the data.

ggplot(data, aes(x=heatedarea, y=amt_price)) +
  geom_point(size=0.1) +
  xlim(0, 5000) +
  ylim(0, 2000000)
Warning: Removed 32 rows containing missing values (`geom_point()`).

It does look like there’s a positive trend. We’re ready to build our first model! We’ll express price as a function of heated area. In R, you use the lm() function to estimate a linear regression. The first argument is a formula, which expresses the mathematical equation you want to estimate: the dependent variable is on the left, followed by a ~ (tilde), followed by the independent variable(s). We also specify the dataset we are using for estimation. We do not need to specify a constant/intercept; R will include that automatically.

single_variable_model <- lm(amt_price~heatedarea, data)

To see the results of our model, we can run summary()

summary(single_variable_model)

Call:
lm(formula = amt_price ~ heatedarea, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2358329  -106534   -23728    62963  3890211 

Coefficients:
               Estimate  Std. Error t value            Pr(>|t|)    
(Intercept) -117500.988   12423.706  -9.458 <0.0000000000000002 ***
heatedarea      206.183       4.644  44.398 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 261500 on 1504 degrees of freedom
Multiple R-squared:  0.5672,    Adjusted R-squared:  0.5669 
F-statistic:  1971 on 1 and 1504 DF,  p-value: < 0.00000000000000022

Interpretation of the summary

  • Intercept: The value of price when the heated area is zero
  • Slope: Increase in price for each unit increase in heated area
  • t-value: The t-value measures how many standard errors our coefficient estimate is from zero. Large t-values (either positive or negative) indicate that the null hypothesis (that the coefficient is zero, i.e. has no effect) is unlikely.
  • Pr(>|t|): This is the p-value associated with the null hypothesis. A small p-value (typically ≤ 0.05) indicates strong evidence that the coefficient is different from zero. In this case, the p-values for both the intercept and heatedarea are very small, indicating strong evidence against the null hypothesis.
  • Residuals: The residuals are the differences between the observed and predicted values of the dependent variable. The summary provides the minimum, 1st quartile (Q1), median, 3rd quartile (Q3), and maximum residuals. These can be used to understand the distribution of the residuals.
  • Residual standard error: This is the standard deviation of the residuals. It tells us the typical difference between the observed and predicted values of the dependent variable.
  • Multiple R-squared and Adjusted R-squared: These are measures of how well the model fits the data. In this case, the model explains about 56.72% (or 56.69% when adjusted for the number of predictors) of the variance in amt_price.
  • F-statistic and its p-value: The F-statistic tests the overall significance of the model. The null hypothesis is that all of the regression coefficients are equal to zero. A very small p-value (like the one we have here) indicates that at least one of the predictors is significantly related to the dependent variable.

The first section includes some information about the residuals - the difference between the predictions and the actual values in the data. The next section details the coefficients - what the estimate was, and how much random variation there might be in that estimate due to sampling error (we’ll cover this in a moment). The last interesting statistic is the R-squared, which we’ll discuss below.
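If you want to use these numbers in later code rather than reading them off the printout, you can pull them out of the model object directly. A small sketch using base R accessors:

coef(single_variable_model)               # named vector of estimated coefficients
confint(single_variable_model)            # 95% confidence intervals for the coefficients
summary(single_variable_model)$r.squared  # R-squared as a plain number
head(residuals(single_variable_model))    # first few residuals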

Seeing our model visually

Let’s plot our regression line on top of our scatterplot. We can use geom_abline for this, which accepts an intercept and slope. The intercept is the coefficient for the constant. In a regression with only one variable, the coefficient for that variable is the slope of the line that predicts the outcome. We will find the intercept and the slope in the regression output above, and enter them into the geom_abline below to plot the regression line over the scatterplot. Note: e+05 is scientific notation - 1.5e+05 would be 1.5 times 10 to the 5th, or 150,000.

ggplot(data, aes(x=heatedarea, y=amt_price)) +
  geom_point(size=0.1) +
  geom_abline(intercept=-1.175e5, slope=206.2, color="red") +
  xlim(0, 5000) +
  ylim(0, 2000000)
Warning: Removed 32 rows containing missing values (`geom_point()`).
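As an aside, ggplot can also fit and draw this line for you with geom_smooth(method="lm") instead of typing the coefficients into geom_abline(). One caveat: because xlim() and ylim() drop points outside the limits before anything is fitted, the line drawn this way may differ slightly from the model above, which used all of the data.

ggplot(data, aes(x=heatedarea, y=amt_price)) +
  geom_point(size=0.1) +
  geom_smooth(method="lm", se=FALSE, color="red") +  # fit a line to the plotted points
  xlim(0, 5000) +
  ylim(0, 2000000)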

Dealing with outliers

That line doesn’t look like it goes quite through the middle of the data. If we remove the limits on the plot, we can see why:

ggplot(data, aes(x=heatedarea, y=amt_price)) +
  geom_point(size=0.1) +
  geom_abline(intercept=-1.175e5, slope=206.2, color="red")

There are outliers well off to the right and top. Because regression is minimizing the sum of squared residuals, outliers can have a very large influence. Even though there are not very many of them, they are far from the regression line. We can remove the outliers and run the model again. Whether or not to remove outliers is up to you and what makes sense for your analysis. However, you should avoid “cherry-picking” data - filtering the data until you get the results you want from your analysis.
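If you want to quantify how influential individual sales are before deciding what to drop, one standard diagnostic (a sketch, not required for the exercise below) is Cook’s distance, which measures how much each observation shifts the fitted coefficients.

cooks <- cooks.distance(single_variable_model)  # one value per sale in the data
summary(cooks)
sum(cooks > 4 / nrow(data))  # a common rule of thumb for flagging influential points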

Filter the data to only homes with heated area less than 5,000 square feet, and prices less than $2 million. Run a model on this filtered dataset, and make a similar plot.

no_outliers = filter(data, heatedarea < 5000 & amt_price < 2000000)
no_outlier_model = lm(amt_price~heatedarea, no_outliers)
summary(no_outlier_model)

Call:
lm(formula = amt_price ~ heatedarea, data = no_outliers)

Residuals:
    Min      1Q  Median      3Q     Max 
-374900  -95232  -32866   59515 1266036 

Coefficients:
             Estimate Std. Error t value            Pr(>|t|)    
(Intercept) -2254.106  11008.308  -0.205               0.838    
heatedarea    148.294      4.805  30.861 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 157000 on 1472 degrees of freedom
Multiple R-squared:  0.3928,    Adjusted R-squared:  0.3924 
F-statistic: 952.4 on 1 and 1472 DF,  p-value: < 0.00000000000000022
ggplot(no_outliers, aes(x=heatedarea, y=amt_price)) +
  geom_point(size=0.1) +
  geom_abline(intercept=-2254.106, slope=148.3, color="red")

Exercise

Are the results different?

Multiple regression

So far we’ve only included a single variable in the model - heated area. But with regression, we can include many variables. Then the coefficient for each independent variable will be how much change in the dependent variable is associated with a 1-unit change in the independent variable, holding everything else in the model constant.

For example, these data were collected over the course of 10+ years. During that time, housing prices went up significantly. We’d expect homes that sold in 2010 to sell for less than homes that sold in 2020. We can add a variable for year of sale, but first we need to compute the year of sale.

no_outliers = mutate(no_outliers, year_of_sale=year(dte_dateof))

year_of_sale_model = lm(amt_price~heatedarea+year_of_sale, no_outliers)
summary(year_of_sale_model)

Call:
lm(formula = amt_price ~ heatedarea + year_of_sale, data = no_outliers)

Residuals:
    Min      1Q  Median      3Q     Max 
-439207  -77465  -25623   41662 1189805 

Coefficients:
                  Estimate    Std. Error t value            Pr(>|t|)    
(Intercept)  -41722596.414   2259076.545  -18.47 <0.0000000000000002 ***
heatedarea         154.810         4.345   35.63 <0.0000000000000002 ***
year_of_sale     20677.956      1119.661   18.47 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 141500 on 1471 degrees of freedom
Multiple R-squared:  0.5071,    Adjusted R-squared:  0.5065 
F-statistic: 756.8 on 2 and 1471 DF,  p-value: < 0.00000000000000022
Exercise

What is the effect of year of sale?

When two variables are correlated, including them both in the model can do strange things. As an example, let’s try adding the number of bedrooms to our model.

bedrooms_model = lm(amt_price~heatedarea+year_of_sale+bedrooms, no_outliers)
summary(bedrooms_model)

Call:
lm(formula = amt_price ~ heatedarea + year_of_sale + bedrooms, 
    data = no_outliers)

Residuals:
    Min      1Q  Median      3Q     Max 
-442838  -76245  -26490   42332 1185432 

Coefficients:
                  Estimate    Std. Error t value            Pr(>|t|)    
(Intercept)  -41768083.847   2259360.441 -18.487 <0.0000000000000002 ***
heatedarea         158.507         5.547  28.575 <0.0000000000000002 ***
year_of_sale     20707.446      1119.942  18.490 <0.0000000000000002 ***
bedrooms         -6692.410      6242.805  -1.072               0.284    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 141500 on 1470 degrees of freedom
Multiple R-squared:  0.5075,    Adjusted R-squared:  0.5065 
F-statistic: 504.9 on 3 and 1470 DF,  p-value: < 0.00000000000000022
Exercise
  • What is the relationship between an additional bedroom and price?
  • Is it what you would expect?

Let’s estimate the same model again, but without heated area. Does the result change?

bedrooms_only_model = lm(amt_price~year_of_sale+bedrooms, no_outliers)
summary(bedrooms_only_model)

Call:
lm(formula = amt_price ~ year_of_sale + bedrooms, data = no_outliers)

Residuals:
    Min      1Q  Median      3Q     Max 
-295893 -110560  -42581   56848 1155161 

Coefficients:
              Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  -36703493    2808184  -13.07 <0.0000000000000002 ***
year_of_sale     18184       1392   13.06 <0.0000000000000002 ***
bedrooms        104202       6096   17.09 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 176400 on 1471 degrees of freedom
Multiple R-squared:  0.234, Adjusted R-squared:  0.2329 
F-statistic: 224.6 on 2 and 1471 DF,  p-value: < 0.00000000000000022
Exercise
  • Is the coefficient for bedrooms now what you would expect?

What’s going on? In the first model, we estimated the relationship between number of bedrooms and price holding square footage constant - that is, the value of an additional bedroom without increasing the total size of the home. The trend lately has been towards larger rooms, and adding a bedroom without increasing the size of the home means smaller rooms, so it is reasonable to think that an additional bedroom in a home of the same size might not be particularly valuable. The model above shows a slightly negative coefficient, though it is not statistically significant (see the discussion of p-values above), so we can’t be sure it wouldn’t be positive if we had a different sample of 1,500 homes.

Here, we plot the relationship between heated area and price, with color indicating number of bedrooms. Note that the colors at the bottom and top of the point cloud are all mixed together, indicating that there isn’t really a trend in price for different numbers of bedrooms with the same square footage.

ggplot(no_outliers, aes(x=heatedarea, y=amt_price, color=as.factor(bedrooms))) +
  geom_point() +
  ylim(0, 1000000)
Warning: Removed 18 rows containing missing values (`geom_point()`).

This is known as Simpson’s Paradox - cases where adding a variable can change the sign of the coefficient of another variable. It is a specific case of the more general omitted variable bias and collinearity concepts.

Omitted variable bias is the notion that whenever you add a variable to a model, it may change the coefficients for any other related variable. In general, this means that you can only interpret the coefficients of a regression in relation to what else is included in the model. For instance, you would likely find that the number of TVs a household owns is highly predictive of the price of their home. But this is not because more TVs cause households to buy more expensive homes - rather, it is likely that higher incomes lead to buying more expensive homes and to buying more TVs. Adding income into the model would likely make the effect of TVs go away - because then the model would be asking the question of how an additional TV is related to home value, holding income constant - and it’s probably not very related.

In extreme cases, adding additional variables to the model that are correlated with each other may result in counterintuitive, statistically insignificant results. This is known as a collinearity or multicollinearity problem, and it is something you should look into if you are seeing unexpected results in a regression.
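A quick way to check for this (a sketch) is to look at the correlation between the suspect variables; variance inflation factors are a more formal diagnostic, available for example from the car package if you have it installed.

# Heated area and bedrooms are likely correlated, which is why adding both
# to the model changes the bedrooms coefficient so much
cor(no_outliers$heatedarea, no_outliers$bedrooms, use="complete.obs")

# If the car package is installed, variance inflation factors flag collinearity:
# car::vif(bedrooms_model)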

Categorical variables

So far we’ve only included continuous, numerical variables in the model. But we know other things likely affect home prices as well, for instance what the home looks like from the street. The data contain a variable “siding” that says what the siding (wall covering) on the outside of the home is. People may pay more for certain types of siding if they are more durable, more attractive, etc. But it’s not a number, so we can’t just put it in the model with a coefficient.

The most common way to put a categorical variable into a regression model is to use “dummy variables”. There will be one variable for each category, which is one if the home is in that category and zero otherwise. This allows the model to estimate a coefficient for each of the categories - so, for instance, we will have the amount of money brick siding is associated with, or vinyl siding, etc. You have to leave one category out; that is the base category, and everything else is relative to that. (If we didn’t, there wouldn’t be a unique set of coefficients for the computer to estimate. You could get exactly the same prediction from many models, by increasing the coefficient for the constant and decreasing the coefficients for all of the dummy variables.)

We can create dummy variables in R by including non-numeric values in the regression. R will automatically create one variable for each category. If you wanted to create dummy variables for a numeric column (e.g., if siding types were coded as 1, 2, 3, etc.), you would convert the column to a factor using as.factor() and then include that converted column in your model.
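To see what R builds behind the scenes, you can look at the category counts and the design matrix it constructs (a quick sketch); each siding category except the base level gets its own 0/1 column.

table(no_outliers$siding)                        # how many sales fall in each siding category
head(model.matrix(~ siding, data=no_outliers))   # one 0/1 dummy column per non-base category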

# siding is a text column, so R will create dummy variables for it automatically
siding_model = lm(amt_price~heatedarea+year_of_sale+siding, no_outliers)
summary(siding_model)

Call:
lm(formula = amt_price ~ heatedarea + year_of_sale + siding, 
    data = no_outliers)

Residuals:
    Min      1Q  Median      3Q     Max 
-359582  -73168  -14072   45243 1230773 

Coefficients:
                        Estimate    Std. Error t value             Pr(>|t|)    
(Intercept)        -43094934.375   2115711.407 -20.369 < 0.0000000000000002 ***
heatedarea               143.727         4.154  34.602 < 0.0000000000000002 ***
year_of_sale           21342.131      1048.490  20.355 < 0.0000000000000002 ***
sidingFACE BRICK      113144.035      8505.793  13.302 < 0.0000000000000002 ***
sidingHARDIPLANK       98806.378     13382.101   7.383 0.000000000000257504 ***
sidingMASONITE         49318.140     10870.466   4.537 0.000006174692803023 ***
sidingOther           146101.235     17823.518   8.197 0.000000000000000532 ***
sidingWOOD ON SHTG     95982.927     16522.051   5.809 0.000000007678134355 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 131800 on 1466 degrees of freedom
Multiple R-squared:  0.5736,    Adjusted R-squared:  0.5716 
F-statistic: 281.7 on 7 and 1466 DF,  p-value: < 0.00000000000000022
Exercise

How much more valuable is Hardiplank than Vinyl?

It is often said that the three most important aspects of a home are location, location, and location. Similar homes in different neighborhoods may sell for very different prices. There is an area variable in the file, which represents different parts of the Charlotte metro area. Let’s add this as a categorical variable.

area_model = lm(amt_price~heatedarea+year_of_sale+siding+area, no_outliers)
summary(area_model)

Call:
lm(formula = amt_price ~ heatedarea + year_of_sale + siding + 
    area, data = no_outliers)

Residuals:
    Min      1Q  Median      3Q     Max 
-367903  -66180  -11709   42231 1237776 

Coefficients:
                        Estimate    Std. Error t value             Pr(>|t|)    
(Intercept)        -43198383.741   1974671.081 -21.876 < 0.0000000000000002 ***
heatedarea               146.823         4.192  35.027 < 0.0000000000000002 ***
year_of_sale           21438.790       978.729  21.905 < 0.0000000000000002 ***
sidingFACE BRICK       77448.093      8465.564   9.149 < 0.0000000000000002 ***
sidingHARDIPLANK       72223.507     12663.184   5.703 0.000000014193167020 ***
sidingMASONITE         45508.625     10285.145   4.425 0.000010375155096964 ***
sidingOther           110696.430     16864.297   6.564 0.000000000072622804 ***
sidingWOOD ON SHTG     58559.028     15751.101   3.718             0.000209 ***
area03102            -132564.029     16208.826  -8.179 0.000000000000000618 ***
area03103            -158558.057     17022.265  -9.315 < 0.0000000000000002 ***
area03104             -36864.732     14097.470  -2.615             0.009015 ** 
area03105               2324.833     15406.446   0.151             0.880075    
area03106            -101954.418     13749.553  -7.415 0.000000000000205171 ***
area03107            -136177.004     14884.338  -9.149 < 0.0000000000000002 ***
area03108            -100300.375     14565.747  -6.886 0.000000000008497905 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 122500 on 1459 degrees of freedom
Multiple R-squared:  0.6336,    Adjusted R-squared:   0.63 
F-statistic: 180.2 on 14 and 1459 DF,  p-value: < 0.00000000000000022
Exercise

How much do the different areas differ in price?

You may also see coefficients for dummy variables for areas or groups referred to as “fixed effects”.

Predicting from a linear regression

A linear regression is just estimating an equation that predicts the dependent variable based on the independent variables. We can write out the equation. For instance, suppose we wanted to estimate the price of a 2000 square foot home sold in 2021, with Hardiplank siding.

-4.309e7 + 1.437e2 * 2000 + 2.134e4 * 2021 + 9.881e4
[1] 424350

We can use our model that includes area to predict the price of a 2000 square foot home sold in 2021 with Masonite siding in area 03103.

-4.320e7 + 1.468e2 * 2000 + 2.144e4 * 2021 + 4.551e4 - 1.586e5
[1] 310750

Normally, you wouldn’t make predictions manually like this. Instead, you would create a new dataset that has all of your independent variables, and then apply the predict() function to it. For instance, here I’ll make a dataset of several different types of homes and predict the sale prices of all of them. A developer might use a model like this to forecast profits from a new development (probably with a somewhat more complex model that includes more home features). Here we create a table with the candidate homes. Each line is a column in the table; lines where there is only one item will have that item repeated for every row of the table.

candidate_homes = tibble(
  heatedarea=c(1200, 1500, 2000, 1200, 1500, 2000),
  year_of_sale=2021,
  siding=c("HARDIPLANK", "HARDIPLANK", "HARDIPLANK", "FACE BRICK", "FACE BRICK", "FACE BRICK")
)

Now, we can use predict() to create a new column in candidate_homes with the forecasted price. It should be close to what we found above (a little different, because we used rounded coefficients above while predict() uses the original, un-rounded coefficients).

candidate_homes$forecast_price = predict(siding_model, candidate_homes)
candidate_homes
# A tibble: 6 × 4
  heatedarea year_of_sale siding     forecast_price
       <dbl>        <dbl> <chr>               <dbl>
1       1200         2021 HARDIPLANK        308791.
2       1500         2021 HARDIPLANK        351909.
3       2000         2021 HARDIPLANK        423773.
4       1200         2021 FACE BRICK        323129.
5       1500         2021 FACE BRICK        366247.
6       2000         2021 FACE BRICK        438110.

We will estimate the price of a 2000 square foot home sold in 2021 with Hardiplank siding in areas 03103, 03105, or 03108. A developer might use a result like this to decide which of several sites would be most profitable to build on.

candidate_sites = tibble(
  heatedarea=2000,
  year_of_sale=2021,
  siding="HARDIPLANK",
  area=c("03103", "03105", "03108")
)
candidate_sites$forecast_price = predict(area_model, candidate_sites)
candidate_sites
# A tibble: 3 × 5
  heatedarea year_of_sale siding     area  forecast_price
       <dbl>        <dbl> <chr>      <chr>          <dbl>
1       2000         2021 HARDIPLANK 03103        336723.
2       2000         2021 HARDIPLANK 03105        497606.
3       2000         2021 HARDIPLANK 03108        394981.

While prediction is often seen as a primary function of linear regression, it is actually not that common in urban analytics. People use linear models for interpretation more often - to interpret the coefficients and see what the relationships between different variables are, holding other things constant.

These models had quite high predictive power (R-squared), which is common in models of home prices. In models of other phenomena (especially behavioral phenomena), much lower R-squared values are common. This may represent an issue for prediction, but does not necessarily represent an issue for interpretation. It’s only an issue for interpretation if there are other variables left out that would affect the coefficients for the variables included.

Regression has a lot more pieces than we discussed here. I recommend the books An Introduction to Statistical Learning with Applications in R, and The Effect, if you want to learn more. Both are available online as free ebooks.

Notes on statistics and machine learning

Statistical modeling assumes a model of how the data were generated, fits the parameters of that assumed model to the data, and uses the result mainly for interpretation, though also for prediction.

Machine learning makes no assumptions about how the data were generated; it builds a predictive model from the available data and uses it mainly for prediction, and sometimes for interpretation.

References

This tutorial borrows from and expands on PLAN372 Spring 2023 tutorial on linear regression and machine learning.