I will be using the Housing prices in Ames, Iowa dataset from the openintro.org website at https://www.openintro.org/data/index.php?data=ames
The dataset has information that comes from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010.
I am going to look at the dimensions and head of the data as well as check for any missing information.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(lubridate)
library(dplyr)
setwd("~/Downloads/Data 101 Course materials/Data Sets")
homes <- read_csv("ames.csv")
## Rows: 2930 Columns: 82
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MS.Zoning, Street, Alley, Lot.Shape, Land.Contour, Utilities, Lot....
## dbl (39): Order, PID, area, price, MS.SubClass, Lot.Frontage, Lot.Area, Over...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
homes
## # A tibble: 2,930 × 82
## Order PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 1 5.26e8 1656 215000 20 RL 141 31770 Pave
## 2 2 5.26e8 896 105000 20 RH 80 11622 Pave
## 3 3 5.26e8 1329 172000 20 RL 81 14267 Pave
## 4 4 5.26e8 2110 244000 20 RL 93 11160 Pave
## 5 5 5.27e8 1629 189900 60 RL 74 13830 Pave
## 6 6 5.27e8 1604 195500 60 RL 78 9978 Pave
## 7 7 5.27e8 1338 213500 120 RL 41 4920 Pave
## 8 8 5.27e8 1280 191500 120 RL 43 5005 Pave
## 9 9 5.27e8 1616 236500 120 RL 39 5389 Pave
## 10 10 5.27e8 1804 189000 60 RL 60 7500 Pave
## # ℹ 2,920 more rows
## # ℹ 73 more variables: Alley <chr>, Lot.Shape <chr>, Land.Contour <chr>,
## # Utilities <chr>, Lot.Config <chr>, Land.Slope <chr>, Neighborhood <chr>,
## # Condition.1 <chr>, Condition.2 <chr>, Bldg.Type <chr>, House.Style <chr>,
## # Overall.Qual <dbl>, Overall.Cond <dbl>, Year.Built <dbl>,
## # Year.Remod.Add <dbl>, Roof.Style <chr>, Roof.Matl <chr>,
## # Exterior.1st <chr>, Exterior.2nd <chr>, Mas.Vnr.Type <chr>, …
str(homes)
## spc_tbl_ [2,930 × 82] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Order : num [1:2930] 1 2 3 4 5 6 7 8 9 10 ...
## $ PID : num [1:2930] 5.26e+08 5.26e+08 5.26e+08 5.26e+08 5.27e+08 ...
## $ area : num [1:2930] 1656 896 1329 2110 1629 ...
## $ price : num [1:2930] 215000 105000 172000 244000 189900 ...
## $ MS.SubClass : num [1:2930] 20 20 20 20 60 60 120 120 120 60 ...
## $ MS.Zoning : chr [1:2930] "RL" "RH" "RL" "RL" ...
## $ Lot.Frontage : num [1:2930] 141 80 81 93 74 78 41 43 39 60 ...
## $ Lot.Area : num [1:2930] 31770 11622 14267 11160 13830 ...
## $ Street : chr [1:2930] "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr [1:2930] NA NA NA NA ...
## $ Lot.Shape : chr [1:2930] "IR1" "Reg" "IR1" "Reg" ...
## $ Land.Contour : chr [1:2930] "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr [1:2930] "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ Lot.Config : chr [1:2930] "Corner" "Inside" "Corner" "Corner" ...
## $ Land.Slope : chr [1:2930] "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr [1:2930] "NAmes" "NAmes" "NAmes" "NAmes" ...
## $ Condition.1 : chr [1:2930] "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition.2 : chr [1:2930] "Norm" "Norm" "Norm" "Norm" ...
## $ Bldg.Type : chr [1:2930] "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ House.Style : chr [1:2930] "1Story" "1Story" "1Story" "1Story" ...
## $ Overall.Qual : num [1:2930] 6 5 6 7 5 6 8 8 8 7 ...
## $ Overall.Cond : num [1:2930] 5 6 6 5 5 6 5 5 5 5 ...
## $ Year.Built : num [1:2930] 1960 1961 1958 1968 1997 ...
## $ Year.Remod.Add : num [1:2930] 1960 1961 1958 1968 1998 ...
## $ Roof.Style : chr [1:2930] "Hip" "Gable" "Hip" "Hip" ...
## $ Roof.Matl : chr [1:2930] "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior.1st : chr [1:2930] "BrkFace" "VinylSd" "Wd Sdng" "BrkFace" ...
## $ Exterior.2nd : chr [1:2930] "Plywood" "VinylSd" "Wd Sdng" "BrkFace" ...
## $ Mas.Vnr.Type : chr [1:2930] "Stone" "None" "BrkFace" "None" ...
## $ Mas.Vnr.Area : num [1:2930] 112 0 108 0 0 20 0 0 0 0 ...
## $ Exter.Qual : chr [1:2930] "TA" "TA" "TA" "Gd" ...
## $ Exter.Cond : chr [1:2930] "TA" "TA" "TA" "TA" ...
## $ Foundation : chr [1:2930] "CBlock" "CBlock" "CBlock" "CBlock" ...
## $ Bsmt.Qual : chr [1:2930] "TA" "TA" "TA" "TA" ...
## $ Bsmt.Cond : chr [1:2930] "Gd" "TA" "TA" "TA" ...
## $ Bsmt.Exposure : chr [1:2930] "Gd" "No" "No" "No" ...
## $ BsmtFin.Type.1 : chr [1:2930] "BLQ" "Rec" "ALQ" "ALQ" ...
## $ BsmtFin.SF.1 : num [1:2930] 639 468 923 1065 791 ...
## $ BsmtFin.Type.2 : chr [1:2930] "Unf" "LwQ" "Unf" "Unf" ...
## $ BsmtFin.SF.2 : num [1:2930] 0 144 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Unf.SF : num [1:2930] 441 270 406 1045 137 ...
## $ Total.Bsmt.SF : num [1:2930] 1080 882 1329 2110 928 ...
## $ Heating : chr [1:2930] "GasA" "GasA" "GasA" "GasA" ...
## $ Heating.QC : chr [1:2930] "Fa" "TA" "TA" "Ex" ...
## $ Central.Air : chr [1:2930] "Y" "Y" "Y" "Y" ...
## $ Electrical : chr [1:2930] "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1st.Flr.SF : num [1:2930] 1656 896 1329 2110 928 ...
## $ X2nd.Flr.SF : num [1:2930] 0 0 0 0 701 678 0 0 0 776 ...
## $ Low.Qual.Fin.SF: num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Full.Bath : num [1:2930] 1 0 0 1 0 0 1 0 1 0 ...
## $ Bsmt.Half.Bath : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
## $ Full.Bath : num [1:2930] 1 1 1 2 2 2 2 2 2 2 ...
## $ Half.Bath : num [1:2930] 0 0 1 1 1 1 0 0 0 1 ...
## $ Bedroom.AbvGr : num [1:2930] 3 2 3 3 3 3 2 2 2 3 ...
## $ Kitchen.AbvGr : num [1:2930] 1 1 1 1 1 1 1 1 1 1 ...
## $ Kitchen.Qual : chr [1:2930] "TA" "TA" "Gd" "Ex" ...
## $ TotRms.AbvGrd : num [1:2930] 7 5 6 8 6 7 6 5 5 7 ...
## $ Functional : chr [1:2930] "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : num [1:2930] 2 0 0 2 1 1 0 0 1 1 ...
## $ Fireplace.Qu : chr [1:2930] "Gd" NA NA "TA" ...
## $ Garage.Type : chr [1:2930] "Attchd" "Attchd" "Attchd" "Attchd" ...
## $ Garage.Yr.Blt : num [1:2930] 1960 1961 1958 1968 1997 ...
## $ Garage.Finish : chr [1:2930] "Fin" "Unf" "Unf" "Fin" ...
## $ Garage.Cars : num [1:2930] 2 1 1 2 2 2 2 2 2 2 ...
## $ Garage.Area : num [1:2930] 528 730 312 522 482 470 582 506 608 442 ...
## $ Garage.Qual : chr [1:2930] "TA" "TA" "TA" "TA" ...
## $ Garage.Cond : chr [1:2930] "TA" "TA" "TA" "TA" ...
## $ Paved.Drive : chr [1:2930] "P" "Y" "Y" "Y" ...
## $ Wood.Deck.SF : num [1:2930] 210 140 393 0 212 360 0 0 237 140 ...
## $ Open.Porch.SF : num [1:2930] 62 0 36 0 34 36 0 82 152 60 ...
## $ Enclosed.Porch : num [1:2930] 0 0 0 0 0 0 170 0 0 0 ...
## $ X3Ssn.Porch : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
## $ Screen.Porch : num [1:2930] 0 120 0 0 0 0 0 144 0 0 ...
## $ Pool.Area : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
## $ Pool.QC : chr [1:2930] NA NA NA NA ...
## $ Fence : chr [1:2930] NA "MnPrv" NA NA ...
## $ Misc.Feature : chr [1:2930] NA NA "Gar2" NA ...
## $ Misc.Val : num [1:2930] 0 0 12500 0 0 0 0 0 0 0 ...
## $ Mo.Sold : num [1:2930] 5 6 6 4 3 6 4 1 3 6 ...
## $ Yr.Sold : num [1:2930] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ Sale.Type : chr [1:2930] "WD" "WD" "WD" "WD" ...
## $ Sale.Condition : chr [1:2930] "Normal" "Normal" "Normal" "Normal" ...
## - attr(*, "spec")=
## .. cols(
## .. Order = col_double(),
## .. PID = col_double(),
## .. area = col_double(),
## .. price = col_double(),
## .. MS.SubClass = col_double(),
## .. MS.Zoning = col_character(),
## .. Lot.Frontage = col_double(),
## .. Lot.Area = col_double(),
## .. Street = col_character(),
## .. Alley = col_character(),
## .. Lot.Shape = col_character(),
## .. Land.Contour = col_character(),
## .. Utilities = col_character(),
## .. Lot.Config = col_character(),
## .. Land.Slope = col_character(),
## .. Neighborhood = col_character(),
## .. Condition.1 = col_character(),
## .. Condition.2 = col_character(),
## .. Bldg.Type = col_character(),
## .. House.Style = col_character(),
## .. Overall.Qual = col_double(),
## .. Overall.Cond = col_double(),
## .. Year.Built = col_double(),
## .. Year.Remod.Add = col_double(),
## .. Roof.Style = col_character(),
## .. Roof.Matl = col_character(),
## .. Exterior.1st = col_character(),
## .. Exterior.2nd = col_character(),
## .. Mas.Vnr.Type = col_character(),
## .. Mas.Vnr.Area = col_double(),
## .. Exter.Qual = col_character(),
## .. Exter.Cond = col_character(),
## .. Foundation = col_character(),
## .. Bsmt.Qual = col_character(),
## .. Bsmt.Cond = col_character(),
## .. Bsmt.Exposure = col_character(),
## .. BsmtFin.Type.1 = col_character(),
## .. BsmtFin.SF.1 = col_double(),
## .. BsmtFin.Type.2 = col_character(),
## .. BsmtFin.SF.2 = col_double(),
## .. Bsmt.Unf.SF = col_double(),
## .. Total.Bsmt.SF = col_double(),
## .. Heating = col_character(),
## .. Heating.QC = col_character(),
## .. Central.Air = col_character(),
## .. Electrical = col_character(),
## .. X1st.Flr.SF = col_double(),
## .. X2nd.Flr.SF = col_double(),
## .. Low.Qual.Fin.SF = col_double(),
## .. Bsmt.Full.Bath = col_double(),
## .. Bsmt.Half.Bath = col_double(),
## .. Full.Bath = col_double(),
## .. Half.Bath = col_double(),
## .. Bedroom.AbvGr = col_double(),
## .. Kitchen.AbvGr = col_double(),
## .. Kitchen.Qual = col_character(),
## .. TotRms.AbvGrd = col_double(),
## .. Functional = col_character(),
## .. Fireplaces = col_double(),
## .. Fireplace.Qu = col_character(),
## .. Garage.Type = col_character(),
## .. Garage.Yr.Blt = col_double(),
## .. Garage.Finish = col_character(),
## .. Garage.Cars = col_double(),
## .. Garage.Area = col_double(),
## .. Garage.Qual = col_character(),
## .. Garage.Cond = col_character(),
## .. Paved.Drive = col_character(),
## .. Wood.Deck.SF = col_double(),
## .. Open.Porch.SF = col_double(),
## .. Enclosed.Porch = col_double(),
## .. X3Ssn.Porch = col_double(),
## .. Screen.Porch = col_double(),
## .. Pool.Area = col_double(),
## .. Pool.QC = col_character(),
## .. Fence = col_character(),
## .. Misc.Feature = col_character(),
## .. Misc.Val = col_double(),
## .. Mo.Sold = col_double(),
## .. Yr.Sold = col_double(),
## .. Sale.Type = col_character(),
## .. Sale.Condition = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
The dataset has with 2930 observations on 82 variables.
head(homes)
## # A tibble: 6 × 82
## Order PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 1 5.26e8 1656 215000 20 RL 141 31770 Pave
## 2 2 5.26e8 896 105000 20 RH 80 11622 Pave
## 3 3 5.26e8 1329 172000 20 RL 81 14267 Pave
## 4 4 5.26e8 2110 244000 20 RL 93 11160 Pave
## 5 5 5.27e8 1629 189900 60 RL 74 13830 Pave
## 6 6 5.27e8 1604 195500 60 RL 78 9978 Pave
## # ℹ 73 more variables: Alley <chr>, Lot.Shape <chr>, Land.Contour <chr>,
## # Utilities <chr>, Lot.Config <chr>, Land.Slope <chr>, Neighborhood <chr>,
## # Condition.1 <chr>, Condition.2 <chr>, Bldg.Type <chr>, House.Style <chr>,
## # Overall.Qual <dbl>, Overall.Cond <dbl>, Year.Built <dbl>,
## # Year.Remod.Add <dbl>, Roof.Style <chr>, Roof.Matl <chr>,
## # Exterior.1st <chr>, Exterior.2nd <chr>, Mas.Vnr.Type <chr>,
## # Mas.Vnr.Area <dbl>, Exter.Qual <chr>, Exter.Cond <chr>, Foundation <chr>, …
colSums(is.na(homes))
## Order PID area price MS.SubClass
## 0 0 0 0 0
## MS.Zoning Lot.Frontage Lot.Area Street Alley
## 0 490 0 0 2732
## Lot.Shape Land.Contour Utilities Lot.Config Land.Slope
## 0 0 0 0 0
## Neighborhood Condition.1 Condition.2 Bldg.Type House.Style
## 0 0 0 0 0
## Overall.Qual Overall.Cond Year.Built Year.Remod.Add Roof.Style
## 0 0 0 0 0
## Roof.Matl Exterior.1st Exterior.2nd Mas.Vnr.Type Mas.Vnr.Area
## 0 0 0 23 23
## Exter.Qual Exter.Cond Foundation Bsmt.Qual Bsmt.Cond
## 0 0 0 80 80
## Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2
## 83 80 1 81 1
## Bsmt.Unf.SF Total.Bsmt.SF Heating Heating.QC Central.Air
## 1 1 0 0 0
## Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Bsmt.Full.Bath
## 1 0 0 0 2
## Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr Kitchen.AbvGr
## 2 0 0 0 0
## Kitchen.Qual TotRms.AbvGrd Functional Fireplaces Fireplace.Qu
## 0 0 0 0 1422
## Garage.Type Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area
## 157 159 159 1 1
## Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF Open.Porch.SF
## 159 159 0 0 0
## Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area Pool.QC
## 0 0 0 0 2917
## Fence Misc.Feature Misc.Val Mo.Sold Yr.Sold
## 2358 2824 0 0 0
## Sale.Type Sale.Condition
## 0 0
There are missing data but there is only 1 missing value in the columns I will be using. 1 missing in Total.Bsmt.SF so it should not be a major issue.
df_homes <- homes |>
select(price, MS.Zoning, Neighborhood, Bldg.Type, Lot.Area, Year.Remod.Add, Overall.Cond, X1st.Flr.SF, X2nd.Flr.SF, Total.Bsmt.SF, Sale.Condition) |>
filter(Sale.Condition == "Normal") |>
filter(Bldg.Type == "1Fam") |>
filter(MS.Zoning == "RL") |>
mutate(Total.Sq = X1st.Flr.SF + X2nd.Flr.SF + Total.Bsmt.SF)
df_homes
## # A tibble: 1,679 × 12
## price MS.Zoning Neighborhood Bldg.Type Lot.Area Year.Remod.Add Overall.Cond
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 215000 RL NAmes 1Fam 31770 1960 5
## 2 172000 RL NAmes 1Fam 14267 1958 6
## 3 244000 RL NAmes 1Fam 11160 1968 5
## 4 189900 RL Gilbert 1Fam 13830 1998 5
## 5 195500 RL Gilbert 1Fam 9978 1998 6
## 6 189000 RL Gilbert 1Fam 7500 1999 5
## 7 175900 RL Gilbert 1Fam 10000 1994 5
## 8 185000 RL Gilbert 1Fam 7980 2007 7
## 9 180400 RL Gilbert 1Fam 8402 1998 5
## 10 171500 RL Gilbert 1Fam 10176 1990 5
## # ℹ 1,669 more rows
## # ℹ 5 more variables: X1st.Flr.SF <dbl>, X2nd.Flr.SF <dbl>,
## # Total.Bsmt.SF <dbl>, Sale.Condition <chr>, Total.Sq <dbl>
Here I am pulling out the data that I want to use from this dataset. I am using the following variables: price, MS.Zoning, Neighborhood, Bldg.Type, Lot.Area, Year.Remod.Add, Overall.Cond, X1st.Flr.SF, X2nd.Flr.SF, Total.Bsmt.SF, Sale.Condition. I added the square feet on the first floor, second floor and basement to create a new column called total square feet.
I tried to compare homes with the same categorical values, so I filtered the houses to be the same type; single family homes with the same type zoning code with normal sale conditions (no foreclosures or family sales). This is to enable me to compare like with like as much as feasible and reduce conditions that could skew the sale price.
Here is what each of the variables mean in a bit more detail:
price - (Continuous): Sale Price of the house
MS.Zoning - (Nominal): Identifies the general zoning classification of the sale. (Agriculture, Commercial, Floating Village Residential, Industrial, Residential High Density, Residential Low Density, Residential Low Density Park, Residential Medium Density)
Neighborhood - (Nominal): Physical locations within Ames city limits (map available)
Bldg.Type - (Nominal): Type of dwelling (Single-family Detached, Two-family Conversion; originally built as one-family dwelling, Duplex, Townhouse End Unit, Townhouse Inside Unit)
Lot.Area - (Continuous): Lot size in square feet
Year.Remod.Add - (Discrete): Remodel date (same as construction date if no remodeling or additions)
Overall.Cond - (Ordinal): Rates the overall condition of the house with 10 -Very Excellent and 1 - Very Poor.
X1st.Flr.SF - (Continuous) First Floor square feet X2nd.Flr.SF - (Continuous) Second floor square feet Total.Bsmt.SF - (Continuous): Total square feet of basement area
Sale.Condition - (Nominal): Condition of sale (Normal Sale, Abnormal Sale - trade, foreclosure, short sale, Adjoining Land Purchase, Allocation - two linked properties with separate deeds, typically condo with a garage unit, Family Sale between family members, Partial Home was not completed when last assessed (associated with New Homes))
names(df_homes) <- gsub("price", "Price", names(df_homes))
names(df_homes)
## [1] "Price" "MS.Zoning" "Neighborhood" "Bldg.Type"
## [5] "Lot.Area" "Year.Remod.Add" "Overall.Cond" "X1st.Flr.SF"
## [9] "X2nd.Flr.SF" "Total.Bsmt.SF" "Sale.Condition" "Total.Sq"
I uppercased the P in price to make it consistent with the other column names.
df_homes |>
mutate(across(c(MS.Zoning, Bldg.Type, Neighborhood, Sale.Condition), as.factor))
## # A tibble: 1,679 × 12
## Price MS.Zoning Neighborhood Bldg.Type Lot.Area Year.Remod.Add Overall.Cond
## <dbl> <fct> <fct> <fct> <dbl> <dbl> <dbl>
## 1 215000 RL NAmes 1Fam 31770 1960 5
## 2 172000 RL NAmes 1Fam 14267 1958 6
## 3 244000 RL NAmes 1Fam 11160 1968 5
## 4 189900 RL Gilbert 1Fam 13830 1998 5
## 5 195500 RL Gilbert 1Fam 9978 1998 6
## 6 189000 RL Gilbert 1Fam 7500 1999 5
## 7 175900 RL Gilbert 1Fam 10000 1994 5
## 8 185000 RL Gilbert 1Fam 7980 2007 7
## 9 180400 RL Gilbert 1Fam 8402 1998 5
## 10 171500 RL Gilbert 1Fam 10176 1990 5
## # ℹ 1,669 more rows
## # ℹ 5 more variables: X1st.Flr.SF <dbl>, X2nd.Flr.SF <dbl>,
## # Total.Bsmt.SF <dbl>, Sale.Condition <fct>, Total.Sq <dbl>
To save memory, I made the categorical variables factors.
multiple_model <- lm(Price ~ Year.Remod.Add + Lot.Area + Total.Sq + Overall.Cond, data = df_homes)
multiple_model
##
## Call:
## lm(formula = Price ~ Year.Remod.Add + Lot.Area + Total.Sq + Overall.Cond,
## data = df_homes)
##
## Coefficients:
## (Intercept) Year.Remod.Add Lot.Area Total.Sq Overall.Cond
## -1.558e+06 7.722e+02 5.528e-01 7.633e+01 1.741e+03
summary(multiple_model)
##
## Call:
## lm(formula = Price ~ Year.Remod.Add + Lot.Area + Total.Sq + Overall.Cond,
## data = df_homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -147415 -18336 -1540 15413 234572
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.558e+06 8.714e+04 -17.884 < 2e-16 ***
## Year.Remod.Add 7.722e+02 4.470e+01 17.276 < 2e-16 ***
## Lot.Area 5.528e-01 1.021e-01 5.414 7.05e-08 ***
## Total.Sq 7.633e+01 1.226e+00 62.260 < 2e-16 ***
## Overall.Cond 1.741e+03 7.707e+02 2.259 0.024 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33300 on 1674 degrees of freedom
## Multiple R-squared: 0.7957, Adjusted R-squared: 0.7952
## F-statistic: 1630 on 4 and 1674 DF, p-value: < 2.2e-16
Explaining the Coefficients For every year the house is newer/has been renovated, the price increases by $772 For every additional square foot of the lot, the price increases by $0.55 For every extra square foot of house, the price increases by $76 For every 1 increase of the overall condition score, the price increases by $1,741
For every 1 increase in the overall condition score, the price increases by $1,479.
The intercept, what y is when x is 0, i.e. the price of house when everything is 0, is -1.558e+06 or -$1,558,000.
We can see that Year.Remod.Add, Lot.Area, Total.Sq are all very statistically significant. The Overall.Cond is also statistically significant. The Adjusted R-squared value, which is the variability in the values selected impacting the selling price of the house is 0.7952 or around 80% of the time.
The p-value is very low at less than 2.2e-16.
Linearity:
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
crPlots(multiple_model)
Each plot shows the predictor on the x-axis and partial residuals on the
y-axis.
The straight line = the fitted linear effect. The dashed loess line = the actual trend. If the loess line follows the straight line closely → linearity assumption looks good.
Year.Remod.Add: Slight but clear positive linear relationship with price. The loess line follows the straight line very closely with a small number of dots not following the line within the higher values, from 1990-2010. Linearity is generally good.
Lot.Area: Cloud-like clustering around the 1000 with some values farther from the line, both above and below. There is some non-linearity.
Total.Sq: There is a clear upward trend. Follows the loess line pretty closely with some dots farther from the line at the higher values. Overall, the linearity looks good.
Overall.Cond: It follows the loess line reasonably well, though there is a slight dip away from the line at the lower values (from 2 to 5).
Overall: Linearity is generally good, with only slight deviations for Lot.Area and in places for Overall.Cond
Independence
plot(resid(multiple_model), type="b",
main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)
Residuals vs Order: Mostly centered around 0. A few bursts, most are positive, a few are negative, but no clear pattern.
3. Core diagnostics (linearity, homoscedasticity, normality, influence)
par(mfrow=c(2,2)); plot(multiple_model); par(mfrow=c(1,1))
Residuals vs Fitted: Mostly following the red line, though a few are far above the line within the higher values. However, the residuals generally follow the red line which indicates the homoscedasticity assumption is generally met.
Scale–Location: Cloud like spread with more variation as the values increase. There is some heteroscedasticity (variance grows for sale price when the values get higher.)
Q–Q plot: Tails deviate but only very slightly at the bottom left and a bit more at the top right tail. However, the residuals are very close to normal as it generally does follow the line very clearly.
Residuals vs Leverage: Cloud like formation then the number of dots clearly decreases at around 0.01 through 0.3. It seems to cluster around the line initially and then follow it closely.
4. Multicollinearity
cor(df_homes[, c("Lot.Area", "Overall.Cond", "Total.Sq", "Year.Remod.Add")], use = "complete.obs")
## Lot.Area Overall.Cond Total.Sq Year.Remod.Add
## Lot.Area 1.00000000 -0.02968746 0.2543952 0.03936975
## Overall.Cond -0.02968746 1.00000000 -0.1876532 0.05627162
## Total.Sq 0.25439520 -0.18765317 1.0000000 0.36068033
## Year.Remod.Add 0.03936975 0.05627162 0.3606803 1.00000000
Generally there are low correlations among predictors, like Year.Remod.Add and Lot.Area = 0.04, Overall.Cond and Lot.Area = -0.03 and Year.Remod.Add and Overall.Cond. = 0.06
The highest correlations are: Total.Sq to Lot.Area = 0.25 Total.Sq to Year.Remod.Add = 0.36.
Total.Sq is moderately related to the Year.Remod.Add (|r| ≲ 0.36) is a slight concern.
The implications are the Coefficient SEs for Total_Sq/Year.Remod.Add may be inflated. The individual p-values are significant and the model seems to fit well.
In the powerpoint -
Add summary of the output of the model with r squared and p value the 4
plot and the linearity;
residuals_multiple <- resid(multiple_model)
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 33251.63
Multiple model, RMSE = 33251.63 meaning the predictions miss by around $33,252 on average. However, sale prices are often hundreds of thousands of dollars.
Overall, this model is good at showing which factors most influence the selling price of a house which are the total square feet of the house, the lot area (lot size) and year built or renovated. The overall condition of the house also plays a role.
This is helpful information for home buyers and sellers to have as well as real estate agents and investors when considering the purchase or sale of a house.
It may be interesting to look at homes with different zoning codes (Agriculture, Commercial, Residential High Density, Residential Medium Density), or different types of houses (Duplex, Townhouse End Unit, Townhouse Inside Unit) to see if there are different results or if the same elements have a strong effect on the sale price of the house.
https://www.openintro.org/data/index.php?data=ames
De Cock, Dean. “Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project.” Journal of Statistics Education 19.3 (2011).
http://www.cityofames.org/assessor/ http://beacon.schneidercorp.com/Default.aspx
Pardoe , I. (2008), “Modeling home prices using realtor data”, Journal of Statistics Education Volume 16, Number 2 (2008).