What factors most influence the selling price of a house?

To answer this question I will be using a multiple linear regression model.

Introduction

I will be using the Housing prices in Ames, Iowa dataset from the openintro.org website at https://www.openintro.org/data/index.php?data=ames

The dataset has information that comes from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010.

The dataset has 2930 observations on 82 variables. It has so many characteristics for each house; a PID or parcel ID number, the price, the neighborhood of the house, the zoning of the house, the number of square feet of each floor, how many full and half bathrooms it has, if it has a fireplace, a garage, a porch, a pool, the size and shape of the lot, what type of utilities the house has, and so much more.

I will explain below in more detail the variables I will be using.

Data Analysis

I am going to look at the dimensions and head of the data as well as check for any missing information.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(lubridate)
library(dplyr)


setwd("~/Downloads/Data 101 Course materials/Data Sets")
homes <- read_csv("ames.csv")
## Rows: 2930 Columns: 82
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MS.Zoning, Street, Alley, Lot.Shape, Land.Contour, Utilities, Lot....
## dbl (39): Order, PID, area, price, MS.SubClass, Lot.Frontage, Lot.Area, Over...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
homes
## # A tibble: 2,930 × 82
##    Order     PID  area  price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
##    <dbl>   <dbl> <dbl>  <dbl>       <dbl> <chr>            <dbl>    <dbl> <chr> 
##  1     1  5.26e8  1656 215000          20 RL                 141    31770 Pave  
##  2     2  5.26e8   896 105000          20 RH                  80    11622 Pave  
##  3     3  5.26e8  1329 172000          20 RL                  81    14267 Pave  
##  4     4  5.26e8  2110 244000          20 RL                  93    11160 Pave  
##  5     5  5.27e8  1629 189900          60 RL                  74    13830 Pave  
##  6     6  5.27e8  1604 195500          60 RL                  78     9978 Pave  
##  7     7  5.27e8  1338 213500         120 RL                  41     4920 Pave  
##  8     8  5.27e8  1280 191500         120 RL                  43     5005 Pave  
##  9     9  5.27e8  1616 236500         120 RL                  39     5389 Pave  
## 10    10  5.27e8  1804 189000          60 RL                  60     7500 Pave  
## # ℹ 2,920 more rows
## # ℹ 73 more variables: Alley <chr>, Lot.Shape <chr>, Land.Contour <chr>,
## #   Utilities <chr>, Lot.Config <chr>, Land.Slope <chr>, Neighborhood <chr>,
## #   Condition.1 <chr>, Condition.2 <chr>, Bldg.Type <chr>, House.Style <chr>,
## #   Overall.Qual <dbl>, Overall.Cond <dbl>, Year.Built <dbl>,
## #   Year.Remod.Add <dbl>, Roof.Style <chr>, Roof.Matl <chr>,
## #   Exterior.1st <chr>, Exterior.2nd <chr>, Mas.Vnr.Type <chr>, …
str(homes)
## spc_tbl_ [2,930 × 82] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Order          : num [1:2930] 1 2 3 4 5 6 7 8 9 10 ...
##  $ PID            : num [1:2930] 5.26e+08 5.26e+08 5.26e+08 5.26e+08 5.27e+08 ...
##  $ area           : num [1:2930] 1656 896 1329 2110 1629 ...
##  $ price          : num [1:2930] 215000 105000 172000 244000 189900 ...
##  $ MS.SubClass    : num [1:2930] 20 20 20 20 60 60 120 120 120 60 ...
##  $ MS.Zoning      : chr [1:2930] "RL" "RH" "RL" "RL" ...
##  $ Lot.Frontage   : num [1:2930] 141 80 81 93 74 78 41 43 39 60 ...
##  $ Lot.Area       : num [1:2930] 31770 11622 14267 11160 13830 ...
##  $ Street         : chr [1:2930] "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley          : chr [1:2930] NA NA NA NA ...
##  $ Lot.Shape      : chr [1:2930] "IR1" "Reg" "IR1" "Reg" ...
##  $ Land.Contour   : chr [1:2930] "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities      : chr [1:2930] "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ Lot.Config     : chr [1:2930] "Corner" "Inside" "Corner" "Corner" ...
##  $ Land.Slope     : chr [1:2930] "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood   : chr [1:2930] "NAmes" "NAmes" "NAmes" "NAmes" ...
##  $ Condition.1    : chr [1:2930] "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition.2    : chr [1:2930] "Norm" "Norm" "Norm" "Norm" ...
##  $ Bldg.Type      : chr [1:2930] "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ House.Style    : chr [1:2930] "1Story" "1Story" "1Story" "1Story" ...
##  $ Overall.Qual   : num [1:2930] 6 5 6 7 5 6 8 8 8 7 ...
##  $ Overall.Cond   : num [1:2930] 5 6 6 5 5 6 5 5 5 5 ...
##  $ Year.Built     : num [1:2930] 1960 1961 1958 1968 1997 ...
##  $ Year.Remod.Add : num [1:2930] 1960 1961 1958 1968 1998 ...
##  $ Roof.Style     : chr [1:2930] "Hip" "Gable" "Hip" "Hip" ...
##  $ Roof.Matl      : chr [1:2930] "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior.1st   : chr [1:2930] "BrkFace" "VinylSd" "Wd Sdng" "BrkFace" ...
##  $ Exterior.2nd   : chr [1:2930] "Plywood" "VinylSd" "Wd Sdng" "BrkFace" ...
##  $ Mas.Vnr.Type   : chr [1:2930] "Stone" "None" "BrkFace" "None" ...
##  $ Mas.Vnr.Area   : num [1:2930] 112 0 108 0 0 20 0 0 0 0 ...
##  $ Exter.Qual     : chr [1:2930] "TA" "TA" "TA" "Gd" ...
##  $ Exter.Cond     : chr [1:2930] "TA" "TA" "TA" "TA" ...
##  $ Foundation     : chr [1:2930] "CBlock" "CBlock" "CBlock" "CBlock" ...
##  $ Bsmt.Qual      : chr [1:2930] "TA" "TA" "TA" "TA" ...
##  $ Bsmt.Cond      : chr [1:2930] "Gd" "TA" "TA" "TA" ...
##  $ Bsmt.Exposure  : chr [1:2930] "Gd" "No" "No" "No" ...
##  $ BsmtFin.Type.1 : chr [1:2930] "BLQ" "Rec" "ALQ" "ALQ" ...
##  $ BsmtFin.SF.1   : num [1:2930] 639 468 923 1065 791 ...
##  $ BsmtFin.Type.2 : chr [1:2930] "Unf" "LwQ" "Unf" "Unf" ...
##  $ BsmtFin.SF.2   : num [1:2930] 0 144 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : num [1:2930] 441 270 406 1045 137 ...
##  $ Total.Bsmt.SF  : num [1:2930] 1080 882 1329 2110 928 ...
##  $ Heating        : chr [1:2930] "GasA" "GasA" "GasA" "GasA" ...
##  $ Heating.QC     : chr [1:2930] "Fa" "TA" "TA" "Ex" ...
##  $ Central.Air    : chr [1:2930] "Y" "Y" "Y" "Y" ...
##  $ Electrical     : chr [1:2930] "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1st.Flr.SF    : num [1:2930] 1656 896 1329 2110 928 ...
##  $ X2nd.Flr.SF    : num [1:2930] 0 0 0 0 701 678 0 0 0 776 ...
##  $ Low.Qual.Fin.SF: num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Full.Bath : num [1:2930] 1 0 0 1 0 0 1 0 1 0 ...
##  $ Bsmt.Half.Bath : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : num [1:2930] 1 1 1 2 2 2 2 2 2 2 ...
##  $ Half.Bath      : num [1:2930] 0 0 1 1 1 1 0 0 0 1 ...
##  $ Bedroom.AbvGr  : num [1:2930] 3 2 3 3 3 3 2 2 2 3 ...
##  $ Kitchen.AbvGr  : num [1:2930] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : chr [1:2930] "TA" "TA" "Gd" "Ex" ...
##  $ TotRms.AbvGrd  : num [1:2930] 7 5 6 8 6 7 6 5 5 7 ...
##  $ Functional     : chr [1:2930] "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces     : num [1:2930] 2 0 0 2 1 1 0 0 1 1 ...
##  $ Fireplace.Qu   : chr [1:2930] "Gd" NA NA "TA" ...
##  $ Garage.Type    : chr [1:2930] "Attchd" "Attchd" "Attchd" "Attchd" ...
##  $ Garage.Yr.Blt  : num [1:2930] 1960 1961 1958 1968 1997 ...
##  $ Garage.Finish  : chr [1:2930] "Fin" "Unf" "Unf" "Fin" ...
##  $ Garage.Cars    : num [1:2930] 2 1 1 2 2 2 2 2 2 2 ...
##  $ Garage.Area    : num [1:2930] 528 730 312 522 482 470 582 506 608 442 ...
##  $ Garage.Qual    : chr [1:2930] "TA" "TA" "TA" "TA" ...
##  $ Garage.Cond    : chr [1:2930] "TA" "TA" "TA" "TA" ...
##  $ Paved.Drive    : chr [1:2930] "P" "Y" "Y" "Y" ...
##  $ Wood.Deck.SF   : num [1:2930] 210 140 393 0 212 360 0 0 237 140 ...
##  $ Open.Porch.SF  : num [1:2930] 62 0 36 0 34 36 0 82 152 60 ...
##  $ Enclosed.Porch : num [1:2930] 0 0 0 0 0 0 170 0 0 0 ...
##  $ X3Ssn.Porch    : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : num [1:2930] 0 120 0 0 0 0 0 144 0 0 ...
##  $ Pool.Area      : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : chr [1:2930] NA NA NA NA ...
##  $ Fence          : chr [1:2930] NA "MnPrv" NA NA ...
##  $ Misc.Feature   : chr [1:2930] NA NA "Gar2" NA ...
##  $ Misc.Val       : num [1:2930] 0 0 12500 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : num [1:2930] 5 6 6 4 3 6 4 1 3 6 ...
##  $ Yr.Sold        : num [1:2930] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Sale.Type      : chr [1:2930] "WD" "WD" "WD" "WD" ...
##  $ Sale.Condition : chr [1:2930] "Normal" "Normal" "Normal" "Normal" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Order = col_double(),
##   ..   PID = col_double(),
##   ..   area = col_double(),
##   ..   price = col_double(),
##   ..   MS.SubClass = col_double(),
##   ..   MS.Zoning = col_character(),
##   ..   Lot.Frontage = col_double(),
##   ..   Lot.Area = col_double(),
##   ..   Street = col_character(),
##   ..   Alley = col_character(),
##   ..   Lot.Shape = col_character(),
##   ..   Land.Contour = col_character(),
##   ..   Utilities = col_character(),
##   ..   Lot.Config = col_character(),
##   ..   Land.Slope = col_character(),
##   ..   Neighborhood = col_character(),
##   ..   Condition.1 = col_character(),
##   ..   Condition.2 = col_character(),
##   ..   Bldg.Type = col_character(),
##   ..   House.Style = col_character(),
##   ..   Overall.Qual = col_double(),
##   ..   Overall.Cond = col_double(),
##   ..   Year.Built = col_double(),
##   ..   Year.Remod.Add = col_double(),
##   ..   Roof.Style = col_character(),
##   ..   Roof.Matl = col_character(),
##   ..   Exterior.1st = col_character(),
##   ..   Exterior.2nd = col_character(),
##   ..   Mas.Vnr.Type = col_character(),
##   ..   Mas.Vnr.Area = col_double(),
##   ..   Exter.Qual = col_character(),
##   ..   Exter.Cond = col_character(),
##   ..   Foundation = col_character(),
##   ..   Bsmt.Qual = col_character(),
##   ..   Bsmt.Cond = col_character(),
##   ..   Bsmt.Exposure = col_character(),
##   ..   BsmtFin.Type.1 = col_character(),
##   ..   BsmtFin.SF.1 = col_double(),
##   ..   BsmtFin.Type.2 = col_character(),
##   ..   BsmtFin.SF.2 = col_double(),
##   ..   Bsmt.Unf.SF = col_double(),
##   ..   Total.Bsmt.SF = col_double(),
##   ..   Heating = col_character(),
##   ..   Heating.QC = col_character(),
##   ..   Central.Air = col_character(),
##   ..   Electrical = col_character(),
##   ..   X1st.Flr.SF = col_double(),
##   ..   X2nd.Flr.SF = col_double(),
##   ..   Low.Qual.Fin.SF = col_double(),
##   ..   Bsmt.Full.Bath = col_double(),
##   ..   Bsmt.Half.Bath = col_double(),
##   ..   Full.Bath = col_double(),
##   ..   Half.Bath = col_double(),
##   ..   Bedroom.AbvGr = col_double(),
##   ..   Kitchen.AbvGr = col_double(),
##   ..   Kitchen.Qual = col_character(),
##   ..   TotRms.AbvGrd = col_double(),
##   ..   Functional = col_character(),
##   ..   Fireplaces = col_double(),
##   ..   Fireplace.Qu = col_character(),
##   ..   Garage.Type = col_character(),
##   ..   Garage.Yr.Blt = col_double(),
##   ..   Garage.Finish = col_character(),
##   ..   Garage.Cars = col_double(),
##   ..   Garage.Area = col_double(),
##   ..   Garage.Qual = col_character(),
##   ..   Garage.Cond = col_character(),
##   ..   Paved.Drive = col_character(),
##   ..   Wood.Deck.SF = col_double(),
##   ..   Open.Porch.SF = col_double(),
##   ..   Enclosed.Porch = col_double(),
##   ..   X3Ssn.Porch = col_double(),
##   ..   Screen.Porch = col_double(),
##   ..   Pool.Area = col_double(),
##   ..   Pool.QC = col_character(),
##   ..   Fence = col_character(),
##   ..   Misc.Feature = col_character(),
##   ..   Misc.Val = col_double(),
##   ..   Mo.Sold = col_double(),
##   ..   Yr.Sold = col_double(),
##   ..   Sale.Type = col_character(),
##   ..   Sale.Condition = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(homes)
## # A tibble: 6 × 82
##   Order      PID  area  price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
##   <dbl>    <dbl> <dbl>  <dbl>       <dbl> <chr>            <dbl>    <dbl> <chr> 
## 1     1   5.26e8  1656 215000          20 RL                 141    31770 Pave  
## 2     2   5.26e8   896 105000          20 RH                  80    11622 Pave  
## 3     3   5.26e8  1329 172000          20 RL                  81    14267 Pave  
## 4     4   5.26e8  2110 244000          20 RL                  93    11160 Pave  
## 5     5   5.27e8  1629 189900          60 RL                  74    13830 Pave  
## 6     6   5.27e8  1604 195500          60 RL                  78     9978 Pave  
## # ℹ 73 more variables: Alley <chr>, Lot.Shape <chr>, Land.Contour <chr>,
## #   Utilities <chr>, Lot.Config <chr>, Land.Slope <chr>, Neighborhood <chr>,
## #   Condition.1 <chr>, Condition.2 <chr>, Bldg.Type <chr>, House.Style <chr>,
## #   Overall.Qual <dbl>, Overall.Cond <dbl>, Year.Built <dbl>,
## #   Year.Remod.Add <dbl>, Roof.Style <chr>, Roof.Matl <chr>,
## #   Exterior.1st <chr>, Exterior.2nd <chr>, Mas.Vnr.Type <chr>,
## #   Mas.Vnr.Area <dbl>, Exter.Qual <chr>, Exter.Cond <chr>, Foundation <chr>, …
colSums(is.na(homes))
##           Order             PID            area           price     MS.SubClass 
##               0               0               0               0               0 
##       MS.Zoning    Lot.Frontage        Lot.Area          Street           Alley 
##               0             490               0               0            2732 
##       Lot.Shape    Land.Contour       Utilities      Lot.Config      Land.Slope 
##               0               0               0               0               0 
##    Neighborhood     Condition.1     Condition.2       Bldg.Type     House.Style 
##               0               0               0               0               0 
##    Overall.Qual    Overall.Cond      Year.Built  Year.Remod.Add      Roof.Style 
##               0               0               0               0               0 
##       Roof.Matl    Exterior.1st    Exterior.2nd    Mas.Vnr.Type    Mas.Vnr.Area 
##               0               0               0              23              23 
##      Exter.Qual      Exter.Cond      Foundation       Bsmt.Qual       Bsmt.Cond 
##               0               0               0              80              80 
##   Bsmt.Exposure  BsmtFin.Type.1    BsmtFin.SF.1  BsmtFin.Type.2    BsmtFin.SF.2 
##              83              80               1              81               1 
##     Bsmt.Unf.SF   Total.Bsmt.SF         Heating      Heating.QC     Central.Air 
##               1               1               0               0               0 
##      Electrical     X1st.Flr.SF     X2nd.Flr.SF Low.Qual.Fin.SF  Bsmt.Full.Bath 
##               1               0               0               0               2 
##  Bsmt.Half.Bath       Full.Bath       Half.Bath   Bedroom.AbvGr   Kitchen.AbvGr 
##               2               0               0               0               0 
##    Kitchen.Qual   TotRms.AbvGrd      Functional      Fireplaces    Fireplace.Qu 
##               0               0               0               0            1422 
##     Garage.Type   Garage.Yr.Blt   Garage.Finish     Garage.Cars     Garage.Area 
##             157             159             159               1               1 
##     Garage.Qual     Garage.Cond     Paved.Drive    Wood.Deck.SF   Open.Porch.SF 
##             159             159               0               0               0 
##  Enclosed.Porch     X3Ssn.Porch    Screen.Porch       Pool.Area         Pool.QC 
##               0               0               0               0            2917 
##           Fence    Misc.Feature        Misc.Val         Mo.Sold         Yr.Sold 
##            2358            2824               0               0               0 
##       Sale.Type  Sale.Condition 
##               0               0

There are missing data but there is only 1 missing value in the columns I will be using. 1 missing in Total.Bsmt.SF so it should not be a major issue.

df_homes <- homes |>
select(price, MS.Zoning, Neighborhood, Bldg.Type, Lot.Area, Year.Remod.Add, Overall.Cond, X1st.Flr.SF,  X2nd.Flr.SF, Total.Bsmt.SF, Sale.Condition) |>
  filter(Sale.Condition == "Normal") |>
  filter(Bldg.Type == "1Fam") |>
    filter(MS.Zoning == "RL") |>
  mutate(Total.Sq = X1st.Flr.SF + X2nd.Flr.SF + Total.Bsmt.SF)
df_homes
## # A tibble: 1,679 × 12
##     price MS.Zoning Neighborhood Bldg.Type Lot.Area Year.Remod.Add Overall.Cond
##     <dbl> <chr>     <chr>        <chr>        <dbl>          <dbl>        <dbl>
##  1 215000 RL        NAmes        1Fam         31770           1960            5
##  2 172000 RL        NAmes        1Fam         14267           1958            6
##  3 244000 RL        NAmes        1Fam         11160           1968            5
##  4 189900 RL        Gilbert      1Fam         13830           1998            5
##  5 195500 RL        Gilbert      1Fam          9978           1998            6
##  6 189000 RL        Gilbert      1Fam          7500           1999            5
##  7 175900 RL        Gilbert      1Fam         10000           1994            5
##  8 185000 RL        Gilbert      1Fam          7980           2007            7
##  9 180400 RL        Gilbert      1Fam          8402           1998            5
## 10 171500 RL        Gilbert      1Fam         10176           1990            5
## # ℹ 1,669 more rows
## # ℹ 5 more variables: X1st.Flr.SF <dbl>, X2nd.Flr.SF <dbl>,
## #   Total.Bsmt.SF <dbl>, Sale.Condition <chr>, Total.Sq <dbl>

Here I am pulling out the data that I want to use from this dataset. I am using the following variables: price, MS.Zoning, Neighborhood, Bldg.Type, Lot.Area, Year.Remod.Add, Overall.Cond, X1st.Flr.SF, X2nd.Flr.SF, Total.Bsmt.SF, Sale.Condition. I added the square feet on the first floor, second floor and basement to create a new column called total square feet.

I tried to compare homes with the same categorical values, so I filtered the houses to be the same type; single family homes with the same type zoning code with normal sale conditions (no foreclosures or family sales). This is to enable me to compare like with like as much as feasible and reduce conditions that could skew the sale price.

Here is what each of the variables mean in a bit more detail:

price - (Continuous): Sale Price of the house

MS.Zoning - (Nominal): Identifies the general zoning classification of the sale. (Agriculture, Commercial, Floating Village Residential, Industrial, Residential High Density, Residential Low Density, Residential Low Density Park, Residential Medium Density)

Neighborhood - (Nominal): Physical locations within Ames city limits (map available)

Bldg.Type - (Nominal): Type of dwelling (Single-family Detached, Two-family Conversion; originally built as one-family dwelling, Duplex, Townhouse End Unit, Townhouse Inside Unit)

Lot.Area - (Continuous): Lot size in square feet

Year.Remod.Add - (Discrete): Remodel date (same as construction date if no remodeling or additions)

Overall.Cond - (Ordinal): Rates the overall condition of the house with 10 -Very Excellent and 1 - Very Poor.

X1st.Flr.SF - (Continuous) First Floor square feet X2nd.Flr.SF - (Continuous) Second floor square feet Total.Bsmt.SF - (Continuous): Total square feet of basement area

Sale.Condition - (Nominal): Condition of sale (Normal Sale, Abnormal Sale - trade, foreclosure, short sale, Adjoining Land Purchase, Allocation - two linked properties with separate deeds, typically condo with a garage unit, Family Sale between family members, Partial Home was not completed when last assessed (associated with New Homes))

names(df_homes) <- gsub("price", "Price", names(df_homes)) 
names(df_homes)
##  [1] "Price"          "MS.Zoning"      "Neighborhood"   "Bldg.Type"     
##  [5] "Lot.Area"       "Year.Remod.Add" "Overall.Cond"   "X1st.Flr.SF"   
##  [9] "X2nd.Flr.SF"    "Total.Bsmt.SF"  "Sale.Condition" "Total.Sq"

I uppercased the P in price to make it consistent with the other column names.

df_homes |>
  mutate(across(c(MS.Zoning, Bldg.Type, Neighborhood, Sale.Condition), as.factor))
## # A tibble: 1,679 × 12
##     Price MS.Zoning Neighborhood Bldg.Type Lot.Area Year.Remod.Add Overall.Cond
##     <dbl> <fct>     <fct>        <fct>        <dbl>          <dbl>        <dbl>
##  1 215000 RL        NAmes        1Fam         31770           1960            5
##  2 172000 RL        NAmes        1Fam         14267           1958            6
##  3 244000 RL        NAmes        1Fam         11160           1968            5
##  4 189900 RL        Gilbert      1Fam         13830           1998            5
##  5 195500 RL        Gilbert      1Fam          9978           1998            6
##  6 189000 RL        Gilbert      1Fam          7500           1999            5
##  7 175900 RL        Gilbert      1Fam         10000           1994            5
##  8 185000 RL        Gilbert      1Fam          7980           2007            7
##  9 180400 RL        Gilbert      1Fam          8402           1998            5
## 10 171500 RL        Gilbert      1Fam         10176           1990            5
## # ℹ 1,669 more rows
## # ℹ 5 more variables: X1st.Flr.SF <dbl>, X2nd.Flr.SF <dbl>,
## #   Total.Bsmt.SF <dbl>, Sale.Condition <fct>, Total.Sq <dbl>

To save memory, I made the categorical variables factors.

Regression Analysis

I will be using the lm function.

multiple_model <- lm(Price ~ Year.Remod.Add + Lot.Area + Total.Sq + Overall.Cond, data = df_homes)
multiple_model
## 
## Call:
## lm(formula = Price ~ Year.Remod.Add + Lot.Area + Total.Sq + Overall.Cond, 
##     data = df_homes)
## 
## Coefficients:
##    (Intercept)  Year.Remod.Add        Lot.Area        Total.Sq    Overall.Cond  
##     -1.558e+06       7.722e+02       5.528e-01       7.633e+01       1.741e+03
summary(multiple_model)
## 
## Call:
## lm(formula = Price ~ Year.Remod.Add + Lot.Area + Total.Sq + Overall.Cond, 
##     data = df_homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -147415  -18336   -1540   15413  234572 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.558e+06  8.714e+04 -17.884  < 2e-16 ***
## Year.Remod.Add  7.722e+02  4.470e+01  17.276  < 2e-16 ***
## Lot.Area        5.528e-01  1.021e-01   5.414 7.05e-08 ***
## Total.Sq        7.633e+01  1.226e+00  62.260  < 2e-16 ***
## Overall.Cond    1.741e+03  7.707e+02   2.259    0.024 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33300 on 1674 degrees of freedom
## Multiple R-squared:  0.7957, Adjusted R-squared:  0.7952 
## F-statistic:  1630 on 4 and 1674 DF,  p-value: < 2.2e-16

Explaining the Coefficients

For every year the house is newer/has been renovated, the price increases by $772

For every additional square foot of the lot, the price increases by $0.55

For every extra square foot of house, the price increases by $76

For every 1 increase of the overall condition score, the price increases by $1,741

The intercept, what y is when x is 0, i.e. the price of house when everything is 0, is -1.558e+06 or -$1,558,000.

We can see that Year.Remod.Add, Lot.Area, Total.Sq are all very statistically significant. The Overall.Cond is also statistically significant. The Adjusted R-squared value is 0.7952. That means the variability can be explained by the elements selected around 80% of the time.

The p-value is very low at less than 2.2e-16.

Checking Assumptions:

1. Linearity:

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
crPlots(multiple_model)

Each plot shows the predictor on the x-axis and partial residuals on the y-axis.

The straight line = the fitted linear effect. The dashed loess line = the actual trend. If the loess line follows the straight line closely → linearity assumption looks good.

Year.Remod.Add: Slight but clear positive linear relationship with price. The dashed line follows the straight line incredibly closely with a small number of dots not following the line within the higher values, from 1990-2010. Linearity is generally very good.

Lot.Area: Cloud-like clustering around the 1000 with some values farther from the line, both above and below.

Total.Sq: There is a clear upward trend that is generally linear although there are some dots veer farther from the line at the higher values. Overall, the linearity looks good.

Overall.Cond: Overall it is linear, though there are more dots higher up and farther away from the line at 5.

Overall: Linearity is generally good, with only slight deviations in places for Lot.Area, Total.Sq and Overall.Cond

2. Independence

plot(resid(multiple_model), type="b",
     main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)

Residuals vs Order: Mostly centered around 0. A few bursts, most are positive, a few are negative, slight pattern, bursts appearing at around 400 apart.

3. Core diagnostics (linearity, homoscedasticity, normality, influence)

par(mfrow=c(2,2)); plot(multiple_model); par(mfrow=c(1,1))

Residuals vs Fitted: Mostly following the line, though a few are far above the line within the higher values. However, the residuals generally follow the red line which indicates the homoscedasticity assumption is generally met.

Scale–Location: Cloud like spread with more variation as the values increase. There is some heteroscedasticity (variance grows for sale price when the values get higher.)

Q–Q plot: Tails deviate but only very slightly at the bottom left and a bit more at the top right tail. However, the residuals are very close to normal as it generally does follow the line very clearly.

Residuals vs Leverage: Cloud like formation then the number of dots clearly decreases at around 0.01 through 0.3.

4. Multicollinearity

cor(df_homes[, c("Lot.Area", "Overall.Cond", "Total.Sq", "Year.Remod.Add")], use = "complete.obs")
##                   Lot.Area Overall.Cond   Total.Sq Year.Remod.Add
## Lot.Area        1.00000000  -0.02968746  0.2543952     0.03936975
## Overall.Cond   -0.02968746   1.00000000 -0.1876532     0.05627162
## Total.Sq        0.25439520  -0.18765317  1.0000000     0.36068033
## Year.Remod.Add  0.03936975   0.05627162  0.3606803     1.00000000

Generally there are low correlations among predictors, like Year.Remod.Add and Lot.Area = 0.04, Overall.Cond and Lot.Area = -0.03 and Year.Remod.Add and Overall.Cond. = 0.06

The highest correlations are: Total.Sq to Lot.Area = 0.25 Total.Sq to Year.Remod.Add = 0.36.

Total.Sq is moderately related to the Year.Remod.Add (|r| ≲ 0.36) is a slight concern.

RMSE

residuals_multiple <- resid(multiple_model)

rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 33251.63

Multiple model, RMSE = 33251.63 meaning the predictions miss by around $33,252 on average. However, sale prices are often hundreds of thousands of dollars.

Conclusion

Overall, this model is good at showing which factors most influence the selling price of a house which are the total square feet of the house, the lot area (lot size) and year built or renovated. The overall condition of the house also plays a role.

The Adjusted R-squared is 80%, which means the variation in price can be explained by these variables 80% of time.

This is very helpful information for home buyers and sellers to have as well as for real estate agents and investors when considering the purchase or sale of a house.

Future Steps

It may be interesting to look at homes with different zoning codes (Agriculture, Commercial, Residential High Density, Residential Medium Density), or different types of houses (Duplex, Townhouse End Unit, Townhouse Inside Unit) to see if there are different results or if the same elements have a strong effect on the sale price of the house.

Sources

https://www.openintro.org/data/index.php?data=ames

De Cock, Dean. “Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project.” Journal of Statistics Education 19.3 (2011).

http://www.cityofames.org/assessor/ http://beacon.schneidercorp.com/Default.aspx

Pardoe , I. (2008), “Modeling home prices using realtor data”, Journal of Statistics Education Volume 16, Number 2 (2008).