What factors most influence the selling price of a house?

Introduction

I will be using the "Housing prices in Ames, Iowa" data set from OpenIntro, available at https://www.openintro.org/data/index.php?data=ames

The data come from the Ames Assessor’s Office, which used them to compute assessed values for individual residential properties sold in Ames, IA, from 2006 to 2010.

Data Analysis

I am going to look at the dimensions and head of the data as well as check for any missing information.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(lubridate)
library(dplyr)


setwd("~/Downloads/Data 101 Course materials/Data Sets")
homes <- read_csv("ames.csv")
## Rows: 2930 Columns: 82
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MS.Zoning, Street, Alley, Lot.Shape, Land.Contour, Utilities, Lot....
## dbl (39): Order, PID, area, price, MS.SubClass, Lot.Frontage, Lot.Area, Over...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
homes
## # A tibble: 2,930 × 82
##    Order     PID  area  price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
##    <dbl>   <dbl> <dbl>  <dbl>       <dbl> <chr>            <dbl>    <dbl> <chr> 
##  1     1  5.26e8  1656 215000          20 RL                 141    31770 Pave  
##  2     2  5.26e8   896 105000          20 RH                  80    11622 Pave  
##  3     3  5.26e8  1329 172000          20 RL                  81    14267 Pave  
##  4     4  5.26e8  2110 244000          20 RL                  93    11160 Pave  
##  5     5  5.27e8  1629 189900          60 RL                  74    13830 Pave  
##  6     6  5.27e8  1604 195500          60 RL                  78     9978 Pave  
##  7     7  5.27e8  1338 213500         120 RL                  41     4920 Pave  
##  8     8  5.27e8  1280 191500         120 RL                  43     5005 Pave  
##  9     9  5.27e8  1616 236500         120 RL                  39     5389 Pave  
## 10    10  5.27e8  1804 189000          60 RL                  60     7500 Pave  
## # ℹ 2,920 more rows
## # ℹ 73 more variables: Alley <chr>, Lot.Shape <chr>, Land.Contour <chr>,
## #   Utilities <chr>, Lot.Config <chr>, Land.Slope <chr>, Neighborhood <chr>,
## #   Condition.1 <chr>, Condition.2 <chr>, Bldg.Type <chr>, House.Style <chr>,
## #   Overall.Qual <dbl>, Overall.Cond <dbl>, Year.Built <dbl>,
## #   Year.Remod.Add <dbl>, Roof.Style <chr>, Roof.Matl <chr>,
## #   Exterior.1st <chr>, Exterior.2nd <chr>, Mas.Vnr.Type <chr>, …
str(homes)
## spc_tbl_ [2,930 × 82] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Order          : num [1:2930] 1 2 3 4 5 6 7 8 9 10 ...
##  $ PID            : num [1:2930] 5.26e+08 5.26e+08 5.26e+08 5.26e+08 5.27e+08 ...
##  $ area           : num [1:2930] 1656 896 1329 2110 1629 ...
##  $ price          : num [1:2930] 215000 105000 172000 244000 189900 ...
##  $ MS.SubClass    : num [1:2930] 20 20 20 20 60 60 120 120 120 60 ...
##  $ MS.Zoning      : chr [1:2930] "RL" "RH" "RL" "RL" ...
##  $ Lot.Frontage   : num [1:2930] 141 80 81 93 74 78 41 43 39 60 ...
##  $ Lot.Area       : num [1:2930] 31770 11622 14267 11160 13830 ...
##  $ Street         : chr [1:2930] "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley          : chr [1:2930] NA NA NA NA ...
##  $ Lot.Shape      : chr [1:2930] "IR1" "Reg" "IR1" "Reg" ...
##  $ Land.Contour   : chr [1:2930] "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities      : chr [1:2930] "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ Lot.Config     : chr [1:2930] "Corner" "Inside" "Corner" "Corner" ...
##  $ Land.Slope     : chr [1:2930] "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood   : chr [1:2930] "NAmes" "NAmes" "NAmes" "NAmes" ...
##  $ Condition.1    : chr [1:2930] "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition.2    : chr [1:2930] "Norm" "Norm" "Norm" "Norm" ...
##  $ Bldg.Type      : chr [1:2930] "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ House.Style    : chr [1:2930] "1Story" "1Story" "1Story" "1Story" ...
##  $ Overall.Qual   : num [1:2930] 6 5 6 7 5 6 8 8 8 7 ...
##  $ Overall.Cond   : num [1:2930] 5 6 6 5 5 6 5 5 5 5 ...
##  $ Year.Built     : num [1:2930] 1960 1961 1958 1968 1997 ...
##  $ Year.Remod.Add : num [1:2930] 1960 1961 1958 1968 1998 ...
##  $ Roof.Style     : chr [1:2930] "Hip" "Gable" "Hip" "Hip" ...
##  $ Roof.Matl      : chr [1:2930] "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior.1st   : chr [1:2930] "BrkFace" "VinylSd" "Wd Sdng" "BrkFace" ...
##  $ Exterior.2nd   : chr [1:2930] "Plywood" "VinylSd" "Wd Sdng" "BrkFace" ...
##  $ Mas.Vnr.Type   : chr [1:2930] "Stone" "None" "BrkFace" "None" ...
##  $ Mas.Vnr.Area   : num [1:2930] 112 0 108 0 0 20 0 0 0 0 ...
##  $ Exter.Qual     : chr [1:2930] "TA" "TA" "TA" "Gd" ...
##  $ Exter.Cond     : chr [1:2930] "TA" "TA" "TA" "TA" ...
##  $ Foundation     : chr [1:2930] "CBlock" "CBlock" "CBlock" "CBlock" ...
##  $ Bsmt.Qual      : chr [1:2930] "TA" "TA" "TA" "TA" ...
##  $ Bsmt.Cond      : chr [1:2930] "Gd" "TA" "TA" "TA" ...
##  $ Bsmt.Exposure  : chr [1:2930] "Gd" "No" "No" "No" ...
##  $ BsmtFin.Type.1 : chr [1:2930] "BLQ" "Rec" "ALQ" "ALQ" ...
##  $ BsmtFin.SF.1   : num [1:2930] 639 468 923 1065 791 ...
##  $ BsmtFin.Type.2 : chr [1:2930] "Unf" "LwQ" "Unf" "Unf" ...
##  $ BsmtFin.SF.2   : num [1:2930] 0 144 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : num [1:2930] 441 270 406 1045 137 ...
##  $ Total.Bsmt.SF  : num [1:2930] 1080 882 1329 2110 928 ...
##  $ Heating        : chr [1:2930] "GasA" "GasA" "GasA" "GasA" ...
##  $ Heating.QC     : chr [1:2930] "Fa" "TA" "TA" "Ex" ...
##  $ Central.Air    : chr [1:2930] "Y" "Y" "Y" "Y" ...
##  $ Electrical     : chr [1:2930] "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1st.Flr.SF    : num [1:2930] 1656 896 1329 2110 928 ...
##  $ X2nd.Flr.SF    : num [1:2930] 0 0 0 0 701 678 0 0 0 776 ...
##  $ Low.Qual.Fin.SF: num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Full.Bath : num [1:2930] 1 0 0 1 0 0 1 0 1 0 ...
##  $ Bsmt.Half.Bath : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : num [1:2930] 1 1 1 2 2 2 2 2 2 2 ...
##  $ Half.Bath      : num [1:2930] 0 0 1 1 1 1 0 0 0 1 ...
##  $ Bedroom.AbvGr  : num [1:2930] 3 2 3 3 3 3 2 2 2 3 ...
##  $ Kitchen.AbvGr  : num [1:2930] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : chr [1:2930] "TA" "TA" "Gd" "Ex" ...
##  $ TotRms.AbvGrd  : num [1:2930] 7 5 6 8 6 7 6 5 5 7 ...
##  $ Functional     : chr [1:2930] "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces     : num [1:2930] 2 0 0 2 1 1 0 0 1 1 ...
##  $ Fireplace.Qu   : chr [1:2930] "Gd" NA NA "TA" ...
##  $ Garage.Type    : chr [1:2930] "Attchd" "Attchd" "Attchd" "Attchd" ...
##  $ Garage.Yr.Blt  : num [1:2930] 1960 1961 1958 1968 1997 ...
##  $ Garage.Finish  : chr [1:2930] "Fin" "Unf" "Unf" "Fin" ...
##  $ Garage.Cars    : num [1:2930] 2 1 1 2 2 2 2 2 2 2 ...
##  $ Garage.Area    : num [1:2930] 528 730 312 522 482 470 582 506 608 442 ...
##  $ Garage.Qual    : chr [1:2930] "TA" "TA" "TA" "TA" ...
##  $ Garage.Cond    : chr [1:2930] "TA" "TA" "TA" "TA" ...
##  $ Paved.Drive    : chr [1:2930] "P" "Y" "Y" "Y" ...
##  $ Wood.Deck.SF   : num [1:2930] 210 140 393 0 212 360 0 0 237 140 ...
##  $ Open.Porch.SF  : num [1:2930] 62 0 36 0 34 36 0 82 152 60 ...
##  $ Enclosed.Porch : num [1:2930] 0 0 0 0 0 0 170 0 0 0 ...
##  $ X3Ssn.Porch    : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : num [1:2930] 0 120 0 0 0 0 0 144 0 0 ...
##  $ Pool.Area      : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : chr [1:2930] NA NA NA NA ...
##  $ Fence          : chr [1:2930] NA "MnPrv" NA NA ...
##  $ Misc.Feature   : chr [1:2930] NA NA "Gar2" NA ...
##  $ Misc.Val       : num [1:2930] 0 0 12500 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : num [1:2930] 5 6 6 4 3 6 4 1 3 6 ...
##  $ Yr.Sold        : num [1:2930] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Sale.Type      : chr [1:2930] "WD" "WD" "WD" "WD" ...
##  $ Sale.Condition : chr [1:2930] "Normal" "Normal" "Normal" "Normal" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Order = col_double(),
##   ..   PID = col_double(),
##   ..   area = col_double(),
##   ..   price = col_double(),
##   ..   MS.SubClass = col_double(),
##   ..   MS.Zoning = col_character(),
##   ..   Lot.Frontage = col_double(),
##   ..   Lot.Area = col_double(),
##   ..   Street = col_character(),
##   ..   Alley = col_character(),
##   ..   Lot.Shape = col_character(),
##   ..   Land.Contour = col_character(),
##   ..   Utilities = col_character(),
##   ..   Lot.Config = col_character(),
##   ..   Land.Slope = col_character(),
##   ..   Neighborhood = col_character(),
##   ..   Condition.1 = col_character(),
##   ..   Condition.2 = col_character(),
##   ..   Bldg.Type = col_character(),
##   ..   House.Style = col_character(),
##   ..   Overall.Qual = col_double(),
##   ..   Overall.Cond = col_double(),
##   ..   Year.Built = col_double(),
##   ..   Year.Remod.Add = col_double(),
##   ..   Roof.Style = col_character(),
##   ..   Roof.Matl = col_character(),
##   ..   Exterior.1st = col_character(),
##   ..   Exterior.2nd = col_character(),
##   ..   Mas.Vnr.Type = col_character(),
##   ..   Mas.Vnr.Area = col_double(),
##   ..   Exter.Qual = col_character(),
##   ..   Exter.Cond = col_character(),
##   ..   Foundation = col_character(),
##   ..   Bsmt.Qual = col_character(),
##   ..   Bsmt.Cond = col_character(),
##   ..   Bsmt.Exposure = col_character(),
##   ..   BsmtFin.Type.1 = col_character(),
##   ..   BsmtFin.SF.1 = col_double(),
##   ..   BsmtFin.Type.2 = col_character(),
##   ..   BsmtFin.SF.2 = col_double(),
##   ..   Bsmt.Unf.SF = col_double(),
##   ..   Total.Bsmt.SF = col_double(),
##   ..   Heating = col_character(),
##   ..   Heating.QC = col_character(),
##   ..   Central.Air = col_character(),
##   ..   Electrical = col_character(),
##   ..   X1st.Flr.SF = col_double(),
##   ..   X2nd.Flr.SF = col_double(),
##   ..   Low.Qual.Fin.SF = col_double(),
##   ..   Bsmt.Full.Bath = col_double(),
##   ..   Bsmt.Half.Bath = col_double(),
##   ..   Full.Bath = col_double(),
##   ..   Half.Bath = col_double(),
##   ..   Bedroom.AbvGr = col_double(),
##   ..   Kitchen.AbvGr = col_double(),
##   ..   Kitchen.Qual = col_character(),
##   ..   TotRms.AbvGrd = col_double(),
##   ..   Functional = col_character(),
##   ..   Fireplaces = col_double(),
##   ..   Fireplace.Qu = col_character(),
##   ..   Garage.Type = col_character(),
##   ..   Garage.Yr.Blt = col_double(),
##   ..   Garage.Finish = col_character(),
##   ..   Garage.Cars = col_double(),
##   ..   Garage.Area = col_double(),
##   ..   Garage.Qual = col_character(),
##   ..   Garage.Cond = col_character(),
##   ..   Paved.Drive = col_character(),
##   ..   Wood.Deck.SF = col_double(),
##   ..   Open.Porch.SF = col_double(),
##   ..   Enclosed.Porch = col_double(),
##   ..   X3Ssn.Porch = col_double(),
##   ..   Screen.Porch = col_double(),
##   ..   Pool.Area = col_double(),
##   ..   Pool.QC = col_character(),
##   ..   Fence = col_character(),
##   ..   Misc.Feature = col_character(),
##   ..   Misc.Val = col_double(),
##   ..   Mo.Sold = col_double(),
##   ..   Yr.Sold = col_double(),
##   ..   Sale.Type = col_character(),
##   ..   Sale.Condition = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(homes)
## # A tibble: 6 × 82
##   Order      PID  area  price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
##   <dbl>    <dbl> <dbl>  <dbl>       <dbl> <chr>            <dbl>    <dbl> <chr> 
## 1     1   5.26e8  1656 215000          20 RL                 141    31770 Pave  
## 2     2   5.26e8   896 105000          20 RH                  80    11622 Pave  
## 3     3   5.26e8  1329 172000          20 RL                  81    14267 Pave  
## 4     4   5.26e8  2110 244000          20 RL                  93    11160 Pave  
## 5     5   5.27e8  1629 189900          60 RL                  74    13830 Pave  
## 6     6   5.27e8  1604 195500          60 RL                  78     9978 Pave  
## # ℹ 73 more variables: Alley <chr>, Lot.Shape <chr>, Land.Contour <chr>,
## #   Utilities <chr>, Lot.Config <chr>, Land.Slope <chr>, Neighborhood <chr>,
## #   Condition.1 <chr>, Condition.2 <chr>, Bldg.Type <chr>, House.Style <chr>,
## #   Overall.Qual <dbl>, Overall.Cond <dbl>, Year.Built <dbl>,
## #   Year.Remod.Add <dbl>, Roof.Style <chr>, Roof.Matl <chr>,
## #   Exterior.1st <chr>, Exterior.2nd <chr>, Mas.Vnr.Type <chr>,
## #   Mas.Vnr.Area <dbl>, Exter.Qual <chr>, Exter.Cond <chr>, Foundation <chr>, …
colSums(is.na(homes))
##           Order             PID            area           price     MS.SubClass 
##               0               0               0               0               0 
##       MS.Zoning    Lot.Frontage        Lot.Area          Street           Alley 
##               0             490               0               0            2732 
##       Lot.Shape    Land.Contour       Utilities      Lot.Config      Land.Slope 
##               0               0               0               0               0 
##    Neighborhood     Condition.1     Condition.2       Bldg.Type     House.Style 
##               0               0               0               0               0 
##    Overall.Qual    Overall.Cond      Year.Built  Year.Remod.Add      Roof.Style 
##               0               0               0               0               0 
##       Roof.Matl    Exterior.1st    Exterior.2nd    Mas.Vnr.Type    Mas.Vnr.Area 
##               0               0               0              23              23 
##      Exter.Qual      Exter.Cond      Foundation       Bsmt.Qual       Bsmt.Cond 
##               0               0               0              80              80 
##   Bsmt.Exposure  BsmtFin.Type.1    BsmtFin.SF.1  BsmtFin.Type.2    BsmtFin.SF.2 
##              83              80               1              81               1 
##     Bsmt.Unf.SF   Total.Bsmt.SF         Heating      Heating.QC     Central.Air 
##               1               1               0               0               0 
##      Electrical     X1st.Flr.SF     X2nd.Flr.SF Low.Qual.Fin.SF  Bsmt.Full.Bath 
##               1               0               0               0               2 
##  Bsmt.Half.Bath       Full.Bath       Half.Bath   Bedroom.AbvGr   Kitchen.AbvGr 
##               2               0               0               0               0 
##    Kitchen.Qual   TotRms.AbvGrd      Functional      Fireplaces    Fireplace.Qu 
##               0               0               0               0            1422 
##     Garage.Type   Garage.Yr.Blt   Garage.Finish     Garage.Cars     Garage.Area 
##             157             159             159               1               1 
##     Garage.Qual     Garage.Cond     Paved.Drive    Wood.Deck.SF   Open.Porch.SF 
##             159             159               0               0               0 
##  Enclosed.Porch     X3Ssn.Porch    Screen.Porch       Pool.Area         Pool.QC 
##               0               0               0               0            2917 
##           Fence    Misc.Feature        Misc.Val         Mo.Sold         Yr.Sold 
##            2358            2824               0               0               0 
##       Sale.Type  Sale.Condition 
##               0               0

Several columns have missing values, but among the columns I will be using, only Total.Bsmt.SF is affected, and it has a single missing value.
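To make this check easier to read, the missing-value count can be restricted to the columns of interest. The following is a minimal sketch: `count_missing` is a hypothetical helper (not part of the original analysis), and the `toy` data frame holds made-up values rather than the Ames data.

```r
# Hypothetical helper: count missing values only in the named columns.
count_missing <- function(data, cols) {
  colSums(is.na(data[, cols, drop = FALSE]))
}

# Made-up toy data (NOT the Ames data), used only to show the call.
toy <- data.frame(
  price         = c(100000, 150000, 200000),
  Lot.Area      = c(5000, 7000, 9000),
  Total.Bsmt.SF = c(800, NA, 1200)
)

missing_counts <- count_missing(toy, c("price", "Lot.Area", "Total.Bsmt.SF"))
print(missing_counts)
```

With the data in this report, the same helper could be called as `count_missing(homes, c("price", "Lot.Area", "Total.Bsmt.SF"))`, assuming `homes` has been loaded as shown earlier.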

df_homes <- homes |>
  select(price, MS.Zoning, Neighborhood, Bldg.Type, Lot.Area, Year.Remod.Add,
         Overall.Cond, X1st.Flr.SF, X2nd.Flr.SF, Total.Bsmt.SF, Full.Bath,
         Half.Bath, Land.Slope, Sale.Condition) |>
  filter(Sale.Condition == "Normal") |>
  filter(Neighborhood == "BrkSide") |>
  filter(Bldg.Type == "1Fam") |>
  filter(MS.Zoning == "RL") |>
  filter(Land.Slope == "Gtl") |>
  mutate(Total_Sq = X1st.Flr.SF + X2nd.Flr.SF + Total.Bsmt.SF)
df_homes
## # A tibble: 35 × 15
##     price MS.Zoning Neighborhood Bldg.Type Lot.Area Year.Remod.Add Overall.Cond
##     <dbl> <chr>     <chr>        <chr>        <dbl>          <dbl>        <dbl>
##  1 116500 RL        BrkSide      1Fam          7207           2008            7
##  2  76500 RL        BrkSide      1Fam          5350           1966            2
##  3 209500 RL        BrkSide      1Fam          7793           2005            7
##  4  82500 RL        BrkSide      1Fam          5330           1950            7
##  5 110000 RL        BrkSide      1Fam          7015           1950            4
##  6 223500 RL        BrkSide      1Fam         21384           2004            6
##  7 149000 RL        BrkSide      1Fam          6615           1950            6
##  8 205000 RL        BrkSide      1Fam          7264           2007            7
##  9 137000 RL        BrkSide      1Fam          4960           1982            7
## 10 121000 RL        BrkSide      1Fam          8854           1950            6
## # ℹ 25 more rows
## # ℹ 8 more variables: X1st.Flr.SF <dbl>, X2nd.Flr.SF <dbl>,
## #   Total.Bsmt.SF <dbl>, Full.Bath <dbl>, Half.Bath <dbl>, Land.Slope <chr>,
## #   Sale.Condition <chr>, Total_Sq <dbl>
names(df_homes) <- gsub("price", "Price", names(df_homes)) 
names(df_homes)
##  [1] "Price"          "MS.Zoning"      "Neighborhood"   "Bldg.Type"     
##  [5] "Lot.Area"       "Year.Remod.Add" "Overall.Cond"   "X1st.Flr.SF"   
##  [9] "X2nd.Flr.SF"    "Total.Bsmt.SF"  "Full.Bath"      "Half.Bath"     
## [13] "Land.Slope"     "Sale.Condition" "Total_Sq"
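# Convert the categorical columns to factors and save the result back to df_homes
df_homes <- df_homes |>
  mutate(across(c(MS.Zoning, Bldg.Type, Neighborhood, Sale.Condition, Land.Slope), as.factor))
df_homes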
## # A tibble: 35 × 15
##     Price MS.Zoning Neighborhood Bldg.Type Lot.Area Year.Remod.Add Overall.Cond
##     <dbl> <fct>     <fct>        <fct>        <dbl>          <dbl>        <dbl>
##  1 116500 RL        BrkSide      1Fam          7207           2008            7
##  2  76500 RL        BrkSide      1Fam          5350           1966            2
##  3 209500 RL        BrkSide      1Fam          7793           2005            7
##  4  82500 RL        BrkSide      1Fam          5330           1950            7
##  5 110000 RL        BrkSide      1Fam          7015           1950            4
##  6 223500 RL        BrkSide      1Fam         21384           2004            6
##  7 149000 RL        BrkSide      1Fam          6615           1950            6
##  8 205000 RL        BrkSide      1Fam          7264           2007            7
##  9 137000 RL        BrkSide      1Fam          4960           1982            7
## 10 121000 RL        BrkSide      1Fam          8854           1950            6
## # ℹ 25 more rows
## # ℹ 8 more variables: X1st.Flr.SF <dbl>, X2nd.Flr.SF <dbl>,
## #   Total.Bsmt.SF <dbl>, Full.Bath <dbl>, Half.Bath <dbl>, Land.Slope <fct>,
## #   Sale.Condition <fct>, Total_Sq <dbl>
multiple_model <- lm(Price ~ Year.Remod.Add + Lot.Area + Total_Sq + Overall.Cond, data = df_homes)
multiple_model
## 
## Call:
## lm(formula = Price ~ Year.Remod.Add + Lot.Area + Total_Sq + Overall.Cond, 
##     data = df_homes)
## 
## Coefficients:
##    (Intercept)  Year.Remod.Add        Lot.Area        Total_Sq    Overall.Cond  
##     -75737.267           8.024           3.077          58.333        8695.172
summary(multiple_model)
## 
## Call:
## lm(formula = Price ~ Year.Remod.Add + Lot.Area + Total_Sq + Overall.Cond, 
##     data = df_homes)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37676  -8089   -961   9191  35584 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -7.574e+04  2.177e+05  -0.348  0.73041    
## Year.Remod.Add  8.024e+00  1.136e+02   0.071  0.94414    
## Lot.Area        3.077e+00  8.786e-01   3.502  0.00147 ** 
## Total_Sq        5.833e+01  4.628e+00  12.603 1.62e-13 ***
## Overall.Cond    8.695e+03  1.918e+03   4.534 8.65e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13990 on 30 degrees of freedom
## Multiple R-squared:  0.9287, Adjusted R-squared:  0.9192 
## F-statistic: 97.73 on 4 and 30 DF,  p-value: < 2.2e-16
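
In the summary above, Year.Remod.Add has a large p-value (0.944). One way to check whether such a predictor can be dropped is a nested-model F test with `anova()`. The sketch below uses simulated data (set up so that the year variable has no real effect), not the Ames data, and the names `sim`, `full`, and `reduced` are illustrative assumptions.

```r
# Simulated data for illustration only (NOT the Ames data).
set.seed(1)
n <- 35
sim <- data.frame(
  Total_Sq     = runif(n, 1000, 3000),
  Overall.Cond = sample(3:8, n, replace = TRUE),
  Year.Remod   = sample(1950:2008, n, replace = TRUE)
)
# Price depends on size and condition only; Year.Remod has no true effect here.
sim$Price <- 20000 + 55 * sim$Total_Sq + 8000 * sim$Overall.Cond +
  rnorm(n, sd = 14000)

full    <- lm(Price ~ Year.Remod + Total_Sq + Overall.Cond, data = sim)
reduced <- lm(Price ~ Total_Sq + Overall.Cond, data = sim)

# F test of the reduced model against the full model.
comparison <- anova(reduced, full)
print(comparison)

# Adjusted R-squared for both models, for a side-by-side look.
adj_r2 <- c(full    = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
print(round(adj_r2, 4))
```

A large p-value in the `anova()` table would suggest that the extra predictor does not improve the fit enough to justify keeping it.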
par(mfrow=c(2,2)); plot(multiple_model); par(mfrow=c(1,1))

Residuals vs Fitted: The points mostly follow the red line, with a dense cluster at the lower fitted values and fewer points as the fitted values increase. The red line stays roughly flat, so linearity looks reasonable. The spread of the residuals varies slightly across the range, which suggests mild heteroscedasticity, but not severe enough to invalidate the model.

Scale–Location: The points form a cloud-like spread, with more variation as the fitted values increase. This indicates some heteroscedasticity: the residual variance grows for higher predicted sale prices.

Q–Q plot: Both tails deviate from the reference line, with the left and right tails falling slightly below it. The residuals are therefore not perfectly normal, but they generally follow the line, so the normality assumption is approximately met.

Residuals vs Leverage: The dashed Cook's distance contours sit far from the red line in places, especially toward the higher leverage values. The points mostly follow the red line, except at the far left, where they cluster above it and the red line dips down. Points that stay well inside the dashed contours are not unduly influential.
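
The visual checks above can be supplemented with numeric ones. The sketch below is an assumption-laden illustration on simulated data (not the Ames model): it runs a Shapiro-Wilk test for normality, a simple Breusch-Pagan-style auxiliary regression for non-constant variance, and flags points by Cook's distance using the common 4/n rule of thumb. The object names are hypothetical.

```r
# Simulated data for illustration only (NOT the Ames data).
set.seed(42)
n <- 35
x <- runif(n, 1000, 3000)
y <- 20000 + 55 * x + rnorm(n, sd = 14000)
toy_model <- lm(y ~ x)

# Normality of residuals: a small p-value suggests a departure from normality.
sw <- shapiro.test(resid(toy_model))

# Breusch-Pagan-style check: regress squared residuals on the fitted values.
# n * R-squared is compared to a chi-squared distribution with 1 df.
aux     <- lm(I(resid(toy_model)^2) ~ fitted(toy_model))
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

# Influence: Cook's distance per observation, flagged with the 4/n rule of thumb.
cd          <- cooks.distance(toy_model)
influential <- which(cd > 4 / n)

print(sw$p.value)
print(bp_p)
print(influential)
```

For the model in this report, `toy_model` would be replaced by `multiple_model`. The auxiliary regression would still run, though regressing on the fitted values alone is a simplification when there are several predictors.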

plot(resid(multiple_model), type="b",
     main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)

cor(df_homes[, c("Lot.Area", "Overall.Cond", "Total_Sq")], use = "complete.obs")
##               Lot.Area Overall.Cond  Total_Sq
## Lot.Area     1.0000000    0.1604025 0.5788556
## Overall.Cond 0.1604025    1.0000000 0.1069479
## Total_Sq     0.5788556    0.1069479 1.0000000

Low to moderate correlations among predictors: Lot.Area–Total_Sq = 0.58, Lot.Area–Overall.Cond = 0.16, and Overall.Cond–Total_Sq = 0.11. This indicates weak multicollinearity.

Year.Remod.Add was not included in this correlation matrix, so its relationship to the other predictors is not checked here.

Implications: With correlations this low, the coefficient standard errors for Lot.Area, Total_Sq, and Overall.Cond are unlikely to be badly inflated. This is consistent with all three being statistically significant in the model summary.

Next step would be to keep Lot.Area, Total_Sq, and Overall.Cond, and possibly drop Year.Remod.Add, which is not significant (p = 0.944).
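
Pairwise correlations only capture two-variable relationships. A variance inflation factor (VIF) measures how well each predictor is explained by all of the others combined, using VIF = 1 / (1 - R²). The sketch below computes it by hand in base R on simulated data (not the Ames data). `vif_manual` is a hypothetical helper written for this illustration and assumes a model with at least two predictors.

```r
# Hypothetical helper: VIF for each predictor column of a fitted lm.
vif_manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept column
  vifs <- vapply(seq_len(ncol(X)), function(j) {
    others <- X[, -j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared  # predictor j on the rest
    1 / (1 - r2)
  }, numeric(1))
  names(vifs) <- colnames(X)
  vifs
}

# Simulated data for illustration only (NOT the Ames data).
set.seed(7)
n <- 35
toy <- data.frame(Total_Sq = runif(n, 1000, 3000))
toy$Lot.Area     <- 3 * toy$Total_Sq + rnorm(n, sd = 1500)  # related to size
toy$Overall.Cond <- sample(3:8, n, replace = TRUE)          # unrelated
toy$Price <- 20000 + 55 * toy$Total_Sq + 3 * toy$Lot.Area +
  8000 * toy$Overall.Cond + rnorm(n, sd = 14000)

toy_model <- lm(Price ~ Lot.Area + Total_Sq + Overall.Cond, data = toy)
vifs <- vif_manual(toy_model)
print(round(vifs, 2))
```

A VIF near 1 means a predictor is nearly unrelated to the others. Values above roughly 5 to 10 are often treated as a warning sign, although that threshold is a convention rather than a rule.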


In an ideal diagnostic plot, the points scatter evenly around the reference line with no clear pattern. Large or patterned residuals would indicate that the model is less reliable for predicting sale price and would need to be adjusted.

residuals_multiple <- resid(multiple_model)

# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 12949.28
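
The RMSE above (12949.28) is slightly smaller than the residual standard error reported by `summary()` (13990). Both use the same residual sum of squares, but the RMSE divides by n = 35, while the residual standard error divides by the residual degrees of freedom, 30. The sketch below checks this relationship on simulated data (not the Ames data) and then with the rounded numbers from the summary output. The function names are hypothetical.

```r
# Hypothetical helpers showing the two denominators.
rmse_from_model <- function(model) sqrt(mean(resid(model)^2))
rse_from_model  <- function(model) {
  sqrt(sum(resid(model)^2) / df.residual(model))
}

# Simulated data for illustration only (NOT the Ames data).
set.seed(3)
n <- 35
x <- runif(n, 1000, 3000)
y <- 20000 + 55 * x + rnorm(n, sd = 14000)
toy_model <- lm(y ~ x)

toy_rmse <- rmse_from_model(toy_model)
toy_rse  <- rse_from_model(toy_model)

# RMSE equals RSE scaled by sqrt(residual df / n).
scaled_rse <- toy_rse * sqrt(df.residual(toy_model) / n)
print(c(rmse = toy_rmse, rse = toy_rse, scaled_rse = scaled_rse))

# Same conversion using the rounded values from the summary output above:
# residual standard error 13990 on 30 degrees of freedom, n = 35.
approx_rmse <- 13990 * sqrt(30 / 35)
print(approx_rmse)
```

The small gap between `approx_rmse` and the reported 12949.28 comes from the residual standard error being rounded to 13990 in the printed summary.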

Statistical Analysis

Conclusion

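For this sample of 35 single-family homes in the BrkSide neighborhood, total square footage, overall condition, and lot area were significant predictors of sale price, while remodel year was not. The model explained about 92% of the variation in price (adjusted R-squared = 0.9192). This is helpful information for home buyers and sellers as well as real estate agents and investors.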

Future Steps

It may be interesting to look at homes in a different neighborhood, with different zoning, or of a different building type to see whether the results differ or whether the same factors still have a strong effect on sale price. The analysis could also be repeated based on when the sales took place.

Sources

https://www.openintro.org/data/index.php?data=ames

De Cock, Dean. “Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project.” Journal of Statistics Education 19.3 (2011).