library(tidyverse)
options(scipen = 999)
Linear regression
Linear regression is a fundamental statistical method used to predict a target or outcome variable based on one or more input features. It’s called “linear” regression because it models the relationship between the target and the inputs as a linear function.
In the simplest case, where there is just one input feature, linear regression can be expressed mathematically as: y = β0 + β1 * x + ε
Here:
- y is the target variable that we want to predict.
- x is the input feature.
- β0 (beta zero) is the y-intercept of the regression line, representing the predicted value of y when x is zero.
- β1 (beta one) is the slope of the regression line, representing the predicted increase in y resulting from a one unit increase in x.
- ε (epsilon) is the residual error term that captures the difference between the actual and predicted values of y. It accounts for the randomness or unpredictability in the data that cannot be captured by the model.
In multiple linear regression, where there is more than one input feature, the equation is extended to: y = β0 + β1 * x1 + β2 * x2 + ... + βn * xn + ε
Here, x1, x2, …, xn are the input features, and β1, β2, …, βn are the coefficients that the model learns during the training process.
The objective of linear regression is to find the values of the coefficients (β0, β1, …, βn) that minimize the sum of the squared residuals, a method known as least squares.
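As a small illustration (using made-up x and y vectors, not the Charlotte data below), the least-squares slope and intercept for a single predictor can be computed directly from the sample covariance and variance; lm() reports the same values:
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
# Least-squares estimates for y = b0 + b1 * x
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept
c(b0, b1)
# Should match the coefficients from lm()
coef(lm(y ~ x))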
Linear regression makes several key assumptions:
- Linearity: The relationship between the input and the output is linear.
- Independence: The residuals are independent of each other.
- Homoscedasticity: The variance of the residuals is constant across all levels of the input variables.
- Normality: The residuals are normally distributed.
Violations of these assumptions can lead to problems in the estimation and interpretation of the model.
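One common way to check these assumptions is to look at diagnostic plots of the residuals. Here is a rough sketch, assuming a fitted model object called fit (a hypothetical name - substitute your own lm() result):
# Residuals vs. fitted values: look for curvature (non-linearity) or fanning (heteroscedasticity)
plot(fitted(fit), resid(fit))
abline(h = 0, lty = 2)
# Histogram and Q-Q plot of residuals: roughly bell-shaped / near the line suggests approximate normality
hist(resid(fit))
qqnorm(resid(fit))
qqline(resid(fit))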
A linear model of home sales price in Charlotte
We will use data on home sales in Charlotte to build a model of home sales price. We have data on roughly 1,500 single-family home sales in Mecklenburg County from 2010 to the present. This includes the sale price of the home as well as a number of attributes of the home collected by the tax assessor.
First, we need to load libraries. For this exercise, we only need tidyverse.
Next, we’ll load our data.
data <- read_csv("charlotte_home_sales.csv")
Rows: 1506 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): houseunit, stdir, stname, sttype, stsuffix, actype, vacantorim, s...
dbl (8): amt_price, houseno, heatedarea, yearbuilt, fullbaths, halfbaths, ...
date (1): dte_dateof
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The columns of the data are:
- dte_dateof - date of sale
- houseno, houseunit, stdir, stname, sttype, stsuffix - components of the street address (number, unit number, street direction, street name, street type, and street suffix)
- amt_price - sale price in dollars
- heatedarea - area of the home that is heated (i.e. living area), in square feet
- yearbuilt - year the home was built
- fullbaths - number of full baths
- halfbaths - number of half baths
- bedrooms - number of bedrooms
- actype - type of air conditioning/cooling
- vacantorim - VACant or IMProved (there is a building on the lot)
- totalac - total acreage of the lot
- siding - type of siding (wall coverage)
- area - the part of the Charlotte metropolitan area where the home is located (Public Use Microdata Area code)
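If you want to confirm these column names and types yourself (an optional step, not shown in the output above), glimpse() gives a quick overview:
# Quick look at column names, types, and a few values from each column
glimpse(data)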
Before we start doing any regressions, it’s always a good idea to get to know our data a bit first. Make a histogram of sale price:
ggplot(data, aes(x=amt_price)) +
geom_histogram(bins=50)
Make a histogram of heated area.
ggplot(data, aes(x=heatedarea)) +
geom_histogram(bins=50)
In linear regression, the input variables don't need to be perfectly normal (following the bell curve), but they shouldn't be heavily skewed either. The above histograms look good enough.
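If a variable were heavily right-skewed (as prices often are), one common check - optional here, and not something we need for this exercise - is to view it on a log scale:
# Optional: view sale price on a log10 scale to judge skewness
ggplot(data, aes(x=amt_price)) +
  geom_histogram(bins=50) +
  scale_x_log10()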
Now, we can start thinking about building our model. One variable that is likely to be associated with home prices is the size of the home (heatedarea
). We can make a scatterplot of these variables before we start thinking about a model.
ggplot(data, aes(x=heatedarea, y=amt_price)) +
geom_point(size=0.1)
It’s hard to read because of the outliers (we’ll return to this issue shortly). For now, we can just set the limits to look at the bulk of the data.
ggplot(data, aes(x=heatedarea, y=amt_price)) +
geom_point(size=0.1) +
xlim(0, 5000) +
ylim(0, 2000000)
Warning: Removed 32 rows containing missing values (`geom_point()`).
It does look like there’s a positive trend. We’re ready to build our first model! We’ll express price as a function of heated area. In R, you use the lm()
function to estimate a linear regression. The first argument is a formula, which expresses the mathematical equation you want to estimate. The dependent variable is on the left, followed by a ~
(tilde), followed by the independent variable(s). We also specify the dataset we are using for estimation. We do not need to specify a constant/intercept. R will include that automatically.
single_variable_model <- lm(amt_price~heatedarea, data)
To see the results of our model, we can run summary()
summary(single_variable_model)
Call:
lm(formula = amt_price ~ heatedarea, data = data)
Residuals:
Min 1Q Median 3Q Max
-2358329 -106534 -23728 62963 3890211
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -117500.988 12423.706 -9.458 <0.0000000000000002 ***
heatedarea 206.183 4.644 44.398 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 261500 on 1504 degrees of freedom
Multiple R-squared: 0.5672, Adjusted R-squared: 0.5669
F-statistic: 1971 on 1 and 1504 DF, p-value: < 0.00000000000000022
Interpretation of the summary
- Intercept: The value of price when the heated area is zero
- Slope: Increase in price for each unit increase in heated area
- t-value: The t-value measures how many standard errors the coefficient estimate is away from zero. Large t-values (either positive or negative) indicate that the null hypothesis (that the coefficient is zero, i.e. has no effect) is unlikely.
- Pr(>|t|): This is the p-value associated with the null hypothesis. A small p-value (typically ≤ 0.05) indicates strong evidence that the coefficient is different from zero. In this case, the p-values for both the intercept and heatedarea are very small, indicating strong evidence against the null hypothesis.
- Residuals: The residuals are the differences between the observed and predicted values of the dependent variable. The summary provides the minimum, 1st quartile (Q1), median, 3rd quartile (Q3), and maximum residuals. These can be used to understand the distribution of the residuals.
- Residual standard error: This is the standard deviation of the residuals. It tells us the typical difference between the observed and predicted values of the dependent variable.
- Multiple R-squared and Adjusted R-squared: These are measures of how well the model fits the data. In this case, the model explains about 56.72% (or 56.69% when adjusted for the number of predictors) of the variance in amt_price.
- F-statistic and its p-value: The F-statistic tests the overall significance of the model. The null hypothesis is that all of the regression coefficients are equal to zero. A very small p-value (like the one we have here) indicates that at least one of the predictors is significantly related to the dependent variable.
The first section includes some information about the residuals - the difference between the predictions and the actual values in the data. The next section details the coefficients - what the estimate was, and how much random variation there might be in that estimate due to sampling error (we’ll cover this in a moment). The last interesting statistic is the R-squared, which we’ll discuss below.
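If you want to work with these numbers directly rather than reading them off the printed summary, you can pull them out of the fitted model object. A short sketch:
# Extract pieces of the fitted model programmatically
coef(single_variable_model)                  # intercept and slope
confint(single_variable_model)               # 95% confidence intervals for the coefficients
summary(single_variable_model)$r.squared     # R-squared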
Seeing our model visually
Let’s plot our regression line on top of our scatterplot. We can use geom_abline
for this, which accepts an intercept and slope. The intercept is the coefficient for the constant. In a regression with only one variable, the coefficient for that variable is the slope of the line that predicts the outcome. We will find the intercept and the slope in the regression output above, and enter them into the geom_abline below to plot the regression line over the scatterplot. Note: e+05 is scientific notation - 1.5e+05 would be 1.5 times 10 to the 5th, or 150,000.
ggplot(data, aes(x=heatedarea, y=amt_price)) +
geom_point(size=0.1) +
geom_abline(intercept=-1.175e5, slope=206.2, color="red") +
xlim(0, 5000) +
ylim(0, 2000000)
Warning: Removed 32 rows containing missing values (`geom_point()`).
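Rather than typing the coefficients in by hand, you could also pull them from the model object, or let ggplot fit and draw the line for you with geom_smooth(). A sketch of both approaches (note that the geom_smooth() version refits the line using only the points inside the plot limits, so it won't exactly match the model above):
# Option 1: take the intercept and slope directly from the model object
ggplot(data, aes(x=heatedarea, y=amt_price)) +
  geom_point(size=0.1) +
  geom_abline(intercept=coef(single_variable_model)[1],
              slope=coef(single_variable_model)[2],
              color="red") +
  xlim(0, 5000) +
  ylim(0, 2000000)

# Option 2: let ggplot estimate and draw the regression line itself
ggplot(data, aes(x=heatedarea, y=amt_price)) +
  geom_point(size=0.1) +
  geom_smooth(method="lm", color="red") +
  xlim(0, 5000) +
  ylim(0, 2000000)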
Dealing with outliers
That line doesn’t look like it goes quite through the middle of the data. If we remove the limits on the plot we can see why:
ggplot(data, aes(x=heatedarea, y=amt_price)) +
geom_point(size=0.1) +
geom_abline(intercept=-117501, slope=206.2, color="red")
There are outliers well off to the right and top. Because regression is minimizing the sum of squared residuals, outliers can have a very large influence. Even though there are not very many of them, they are far from the regression line. We can remove the outliers and run the model again. Whether or not to remove outliers is up to you and what makes sense for your analysis. However, you should avoid “cherry-picking” data - filtering the data until you get the results you want from your analysis.
Filter the data to only homes with heated area less than 5,000 square feet, and prices less than $2 million. Run a model on this filtered dataset, and make a similar plot.
no_outliers = filter(data, heatedarea < 5000 & amt_price < 2000000)
no_outlier_model = lm(amt_price~heatedarea, no_outliers)
summary(no_outlier_model)
Call:
lm(formula = amt_price ~ heatedarea, data = no_outliers)
Residuals:
Min 1Q Median 3Q Max
-374900 -95232 -32866 59515 1266036
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2254.106 11008.308 -0.205 0.838
heatedarea 148.294 4.805 30.861 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 157000 on 1472 degrees of freedom
Multiple R-squared: 0.3928, Adjusted R-squared: 0.3924
F-statistic: 952.4 on 1 and 1472 DF, p-value: < 0.00000000000000022
ggplot(no_outliers, aes(x=heatedarea, y=amt_price)) +
geom_point(size=0.1) +
geom_abline(intercept=-2254.106, slope=148.3, color="red")
Multiple regression
So far we’ve only included a single variable in the model - heated area. But with regression, we can include many variables. Then the coefficient for each independent variable will be how much change in the dependent variable is associated with a 1-unit change in the independent variable, holding everything else in the model constant.
For example, these data were collected over the course of 10+ years. During that time, housing prices went up significantly. We’d expect homes that sold in 2010 to sell for less than homes that sold in 2020. We can add a variable for year of sale, but first we need to compute the year of sale.
no_outliers = mutate(no_outliers, year_of_sale=year(dte_dateof))

year_built_model = lm(amt_price~heatedarea+year_of_sale, no_outliers)
summary(year_built_model)
Call:
lm(formula = amt_price ~ heatedarea + year_of_sale, data = no_outliers)
Residuals:
Min 1Q Median 3Q Max
-439207 -77465 -25623 41662 1189805
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41722596.414 2259076.545 -18.47 <0.0000000000000002 ***
heatedarea 154.810 4.345 35.63 <0.0000000000000002 ***
year_of_sale 20677.956 1119.661 18.47 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 141500 on 1471 degrees of freedom
Multiple R-squared: 0.5071, Adjusted R-squared: 0.5065
F-statistic: 756.8 on 2 and 1471 DF, p-value: < 0.00000000000000022
When two variables are correlated, including them both in the model can do strange things. As an example, let’s try adding the number of bedrooms to our model.
bedrooms_model = lm(amt_price~heatedarea+year_of_sale+bedrooms, no_outliers)
summary(bedrooms_model)
Call:
lm(formula = amt_price ~ heatedarea + year_of_sale + bedrooms,
data = no_outliers)
Residuals:
Min 1Q Median 3Q Max
-442838 -76245 -26490 42332 1185432
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41768083.847 2259360.441 -18.487 <0.0000000000000002 ***
heatedarea 158.507 5.547 28.575 <0.0000000000000002 ***
year_of_sale 20707.446 1119.942 18.490 <0.0000000000000002 ***
bedrooms -6692.410 6242.805 -1.072 0.284
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 141500 on 1470 degrees of freedom
Multiple R-squared: 0.5075, Adjusted R-squared: 0.5065
F-statistic: 504.9 on 3 and 1470 DF, p-value: < 0.00000000000000022
Let’s estimate the same model again, but without heated area. Does the result change?
bedrooms_only_model = lm(amt_price~year_of_sale+bedrooms, no_outliers)
summary(bedrooms_only_model)
Call:
lm(formula = amt_price ~ year_of_sale + bedrooms, data = no_outliers)
Residuals:
Min 1Q Median 3Q Max
-295893 -110560 -42581 56848 1155161
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -36703493 2808184 -13.07 <0.0000000000000002 ***
year_of_sale 18184 1392 13.06 <0.0000000000000002 ***
bedrooms 104202 6096 17.09 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 176400 on 1471 degrees of freedom
Multiple R-squared: 0.234, Adjusted R-squared: 0.2329
F-statistic: 224.6 on 2 and 1471 DF, p-value: < 0.00000000000000022
What’s going on? In the first model, we estimated the relationship between the number of bedrooms and price holding square footage constant - that is, the value of an additional bedroom without increasing the total size of the home. The trend lately has been toward larger rooms, and adding a bedroom without increasing the size of the home means smaller rooms, so it is reasonable to think that an additional bedroom in a home of the same size might not be particularly valuable. The model above shows a slightly negative coefficient, though it’s not statistically significant (we’ll discuss that in a bit), so we can’t be sure it wouldn’t be positive if we had a different sample of 1,500 homes.
Here, we plot the relationship between heated area and price, with color indicating number of bedrooms. Note that the colors at the bottom and top of the point cloud are all mixed together, indicating that there isn’t really a trend in price for different numbers of bedrooms with the same square footage.
ggplot(no_outliers, aes(x=heatedarea, y=amt_price, color=as.factor(bedrooms))) +
geom_point() +
ylim(0, 1000000)
Warning: Removed 18 rows containing missing values (`geom_point()`).
This is known as Simpson’s Paradox - cases where adding a variable can change the sign of the coefficient of another variable. It is a specific case of the more general omitted variable bias and collinearity concepts.
Omitted variable bias is the notion that whenever you add a variable to a model, it may change the coefficients for any other related variable. In general, this means that you can only interpret the coefficients of a regression in relation to what else is included in the model. For instance, you would likely find that the number of TVs a household owns is highly predictive of the price of their home. But this is not because more TVs cause households to buy more expensive homes - rather, it is likely that higher incomes lead to buying more expensive homes and to buying more TVs. Adding income into the model would likely make the effect of TVs go away - because then the model would be asking the question of how an additional TV is related to home value, holding income constant - and it’s probably not very related.
In extreme cases, adding additional variables to the model that are correlated with each other may result in counterintuitive, statistically-insignificant results. This is known as a collinearity or multicollinearity problem, and is something you should look into if you are experiencing unexpected results in a regression.
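One quick way to investigate suspected collinearity is to look at the correlation between the predictors - a minimal sketch for the two predictors here (you could also compute variance inflation factors with, for example, the car package's vif() function, if you have that package installed):
# Correlation between heated area and number of bedrooms
cor(no_outliers$heatedarea, no_outliers$bedrooms, use="complete.obs")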
Categorical variables
So far we’ve only included continuous, numerical variables in the model. But we know other things likely affect home prices as well, for instance what the home looks like from the street. The data contain a variable “siding” that says what the siding (wall covering) on the outside of the home is. People may pay more for certain types of siding if they are more durable, more attractive, etc. But it’s not a number, so we can’t just put it in the model with a coefficient.
The most common way to put a categorical variable into a regression model is to use “dummy variables”. There will be one variable for each category, which is one if the home is in that category and zero otherwise. This allows the model to estimate a coefficient for each of the categories—so for instance we will have the amount of money brick siding is associated with, or vinyl siding, etc. You have to leave one category out; that is the base category, and everything else is relative to that. (If we didn’t, there wouldn’t be a unique set of coefficients for the computer to estimate. You could get exactly the same prediction from many models, by increasing the coefficient for the constant and decreasing the coefficient for all of the dummy variables.)
We can create dummy variables in R by including non-numeric values in the regression. R will automatically create one variable for each category. If you wanted to create dummy variables for a numeric column (e.g., if siding types were coded as 1, 2, 3, etc.), you would convert the column to a factor using as.factor() and then include that converted column in your model.
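If you want to see exactly which dummy variables R creates, or change which category is used as the base, here is a rough sketch (siding_releveled is a hypothetical new object name, used so the later models are not affected):
# See the siding categories R will create dummies from
table(no_outliers$siding)

# Inspect the 0/1 dummy columns R builds behind the scenes
head(model.matrix(~ siding, no_outliers))

# Change the base (omitted) category, e.g. to FACE BRICK, before fitting a model
siding_releveled = mutate(no_outliers, siding = relevel(factor(siding), ref = "FACE BRICK"))
Now let’s fit a model that includes siding.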
siding_model = lm(amt_price~heatedarea+year_of_sale+siding, no_outliers)
summary(siding_model)
Call:
lm(formula = amt_price ~ heatedarea + year_of_sale + siding,
data = no_outliers)
Residuals:
Min 1Q Median 3Q Max
-359582 -73168 -14072 45243 1230773
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -43094934.375 2115711.407 -20.369 < 0.0000000000000002 ***
heatedarea 143.727 4.154 34.602 < 0.0000000000000002 ***
year_of_sale 21342.131 1048.490 20.355 < 0.0000000000000002 ***
sidingFACE BRICK 113144.035 8505.793 13.302 < 0.0000000000000002 ***
sidingHARDIPLANK 98806.378 13382.101 7.383 0.000000000000257504 ***
sidingMASONITE 49318.140 10870.466 4.537 0.000006174692803023 ***
sidingOther 146101.235 17823.518 8.197 0.000000000000000532 ***
sidingWOOD ON SHTG 95982.927 16522.051 5.809 0.000000007678134355 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 131800 on 1466 degrees of freedom
Multiple R-squared: 0.5736, Adjusted R-squared: 0.5716
F-statistic: 281.7 on 7 and 1466 DF, p-value: < 0.00000000000000022
It is often said that the three most important aspects of a home are location, location, and location. Similar homes in different neighborhoods may sell for very different prices. There is an area variable in the file, which represents different parts of the Charlotte metro area. Let’s add it as a categorical variable.
area_model = lm(amt_price~heatedarea+year_of_sale+siding+area, no_outliers)
summary(area_model)
Call:
lm(formula = amt_price ~ heatedarea + year_of_sale + siding +
area, data = no_outliers)
Residuals:
Min 1Q Median 3Q Max
-367903 -66180 -11709 42231 1237776
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -43198383.741 1974671.081 -21.876 < 0.0000000000000002 ***
heatedarea 146.823 4.192 35.027 < 0.0000000000000002 ***
year_of_sale 21438.790 978.729 21.905 < 0.0000000000000002 ***
sidingFACE BRICK 77448.093 8465.564 9.149 < 0.0000000000000002 ***
sidingHARDIPLANK 72223.507 12663.184 5.703 0.000000014193167020 ***
sidingMASONITE 45508.625 10285.145 4.425 0.000010375155096964 ***
sidingOther 110696.430 16864.297 6.564 0.000000000072622804 ***
sidingWOOD ON SHTG 58559.028 15751.101 3.718 0.000209 ***
area03102 -132564.029 16208.826 -8.179 0.000000000000000618 ***
area03103 -158558.057 17022.265 -9.315 < 0.0000000000000002 ***
area03104 -36864.732 14097.470 -2.615 0.009015 **
area03105 2324.833 15406.446 0.151 0.880075
area03106 -101954.418 13749.553 -7.415 0.000000000000205171 ***
area03107 -136177.004 14884.338 -9.149 < 0.0000000000000002 ***
area03108 -100300.375 14565.747 -6.886 0.000000000008497905 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 122500 on 1459 degrees of freedom
Multiple R-squared: 0.6336, Adjusted R-squared: 0.63
F-statistic: 180.2 on 14 and 1459 DF, p-value: < 0.00000000000000022
You may also see coefficients for dummy variables for areas or groups referred to as “fixed effects”.
Predicting from a linear regression
A linear regression is just estimating an equation that predicts the dependent variable based on the independent variables. We can write out the equation. For instance, suppose we wanted to estimate the price of a 2000 square foot home sold in 2021, with Hardiplank siding.
-4.309e7 + 1.437e2 * 2000 + 2.134e4 * 2021 + 9.881e4
[1] 424350
Now let’s use our model that includes area to predict the price of a 2000 square foot home sold in 2021 with Masonite siding in area 03103.
-4.320e7 + 1.468e2 * 2000 + 2.144e4 * 2021 + 4.551e4 - 1.586e5
[1] 310750
Normally, you wouldn’t make predictions manually like this. Instead, you would create a new dataset that has all of your independent variables, and then apply the predict() function to it. For instance, here I’ll make a dataset of several different types of homes, and predict the sales prices of all of them. A developer might use a model like this to forecast profits from a new development - probably with a somewhat more complex model including more home features. Here we create a table with the candidate homes. Each line of the tibble() call defines a column of the table. Columns where only one value is given will have that value repeated for every row of the table.
candidate_homes = tibble(
  heatedarea=c(1200, 1500, 2000, 1200, 1500, 2000),
  year_of_sale=2021,
  siding=c("HARDIPLANK", "HARDIPLANK", "HARDIPLANK", "FACE BRICK", "FACE BRICK", "FACE BRICK")
)
Now, we can use predict()
to create a new column in candidate_homes with the forecasted price. It should be close to what we found above (a little different because we used rounded coefficients above, but predict() uses the original, un-rounded coefficients).
candidate_homes$forecast_price = predict(siding_model, candidate_homes)
candidate_homes
# A tibble: 6 × 4
heatedarea year_of_sale siding forecast_price
<dbl> <dbl> <chr> <dbl>
1 1200 2021 HARDIPLANK 308791.
2 1500 2021 HARDIPLANK 351909.
3 2000 2021 HARDIPLANK 423773.
4 1200 2021 FACE BRICK 323129.
5 1500 2021 FACE BRICK 366247.
6 2000 2021 FACE BRICK 438110.
We will estimate the price of a 2000 square foot home sold in 2021 with Hardiplank siding in areas 03103, 03105, or 03108. A developer might use a result like this to decide which of several sites would be most profitable to build on.
candidate_sites = tibble(
  heatedarea=2000,
  year_of_sale=2021,
  siding="HARDIPLANK",
  area=c("03103", "03105", "03108")
)

candidate_sites$forecast_price = predict(area_model, candidate_sites)
candidate_sites
# A tibble: 3 × 5
heatedarea year_of_sale siding area forecast_price
<dbl> <dbl> <chr> <chr> <dbl>
1 2000 2021 HARDIPLANK 03103 336723.
2 2000 2021 HARDIPLANK 03105 497606.
3 2000 2021 HARDIPLANK 03108 394981.
While prediction is often seen as a primary function of linear regression, it is actually not that common in urban analytics. People use linear models for interpretation more often - to interpret the coefficients and see what the relationships between different variables are, holding other things constant.
These models had quite high predictive power (R2), which is common in models of home prices. In models of other phenomena (especially behavioral phenomena), much lower R2 is common. This may represent an issue for prediction, but does not necessarily represent an issue for interpretation. It’s only an issue for interpretation if there are other variables left out that would affect the coefficients for the variables included.
Regression has a lot more pieces than we discussed here. I recommend the books An Introduction to Statistical Learning with Applications in R, and The Effect, if you want to learn more. Both are available online as free ebooks.
Notes on statistics and machine learning
Statistical modeling assumes a model of how the data were generated, fits the parameters of that assumed model to the data, and uses the result mainly for interpretation, though also for prediction.
Machine learning does not assume a particular data-generating process; it builds a predictive model from the available data and uses it mainly for prediction, and sometimes for interpretation.
References
This tutorial borrows from and expands on PLAN372 Spring 2023 tutorial on linear regression and machine learning.