I will be using the Ames, Iowa housing prices data set from the OpenIntro website: https://www.openintro.org/data/index.php?data=ames
The data come from the Ames Assessor’s Office, which used them to compute assessed values for individual residential properties sold in Ames, IA from 2006 to 2010.
I will look at the dimensions and first rows of the data and check for missing values.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(lubridate)
library(dplyr)
setwd("~/Downloads/Data 101 Course materials/Data Sets")
homes <- read_csv("ames.csv")
## Rows: 2930 Columns: 82
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MS.Zoning, Street, Alley, Lot.Shape, Land.Contour, Utilities, Lot....
## dbl (39): Order, PID, area, price, MS.SubClass, Lot.Frontage, Lot.Area, Over...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
homes
## # A tibble: 2,930 × 82
## Order PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 1 5.26e8 1656 215000 20 RL 141 31770 Pave
## 2 2 5.26e8 896 105000 20 RH 80 11622 Pave
## 3 3 5.26e8 1329 172000 20 RL 81 14267 Pave
## 4 4 5.26e8 2110 244000 20 RL 93 11160 Pave
## 5 5 5.27e8 1629 189900 60 RL 74 13830 Pave
## 6 6 5.27e8 1604 195500 60 RL 78 9978 Pave
## 7 7 5.27e8 1338 213500 120 RL 41 4920 Pave
## 8 8 5.27e8 1280 191500 120 RL 43 5005 Pave
## 9 9 5.27e8 1616 236500 120 RL 39 5389 Pave
## 10 10 5.27e8 1804 189000 60 RL 60 7500 Pave
## # ℹ 2,920 more rows
## # ℹ 73 more variables: Alley <chr>, Lot.Shape <chr>, Land.Contour <chr>,
## # Utilities <chr>, Lot.Config <chr>, Land.Slope <chr>, Neighborhood <chr>,
## # Condition.1 <chr>, Condition.2 <chr>, Bldg.Type <chr>, House.Style <chr>,
## # Overall.Qual <dbl>, Overall.Cond <dbl>, Year.Built <dbl>,
## # Year.Remod.Add <dbl>, Roof.Style <chr>, Roof.Matl <chr>,
## # Exterior.1st <chr>, Exterior.2nd <chr>, Mas.Vnr.Type <chr>, …
str(homes)
## spc_tbl_ [2,930 × 82] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Order : num [1:2930] 1 2 3 4 5 6 7 8 9 10 ...
## $ PID : num [1:2930] 5.26e+08 5.26e+08 5.26e+08 5.26e+08 5.27e+08 ...
## $ area : num [1:2930] 1656 896 1329 2110 1629 ...
## $ price : num [1:2930] 215000 105000 172000 244000 189900 ...
## $ MS.SubClass : num [1:2930] 20 20 20 20 60 60 120 120 120 60 ...
## $ MS.Zoning : chr [1:2930] "RL" "RH" "RL" "RL" ...
## $ Lot.Frontage : num [1:2930] 141 80 81 93 74 78 41 43 39 60 ...
## $ Lot.Area : num [1:2930] 31770 11622 14267 11160 13830 ...
## $ Street : chr [1:2930] "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr [1:2930] NA NA NA NA ...
## $ Lot.Shape : chr [1:2930] "IR1" "Reg" "IR1" "Reg" ...
## $ Land.Contour : chr [1:2930] "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr [1:2930] "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ Lot.Config : chr [1:2930] "Corner" "Inside" "Corner" "Corner" ...
## $ Land.Slope : chr [1:2930] "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr [1:2930] "NAmes" "NAmes" "NAmes" "NAmes" ...
## $ Condition.1 : chr [1:2930] "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition.2 : chr [1:2930] "Norm" "Norm" "Norm" "Norm" ...
## $ Bldg.Type : chr [1:2930] "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ House.Style : chr [1:2930] "1Story" "1Story" "1Story" "1Story" ...
## $ Overall.Qual : num [1:2930] 6 5 6 7 5 6 8 8 8 7 ...
## $ Overall.Cond : num [1:2930] 5 6 6 5 5 6 5 5 5 5 ...
## $ Year.Built : num [1:2930] 1960 1961 1958 1968 1997 ...
## $ Year.Remod.Add : num [1:2930] 1960 1961 1958 1968 1998 ...
## $ Roof.Style : chr [1:2930] "Hip" "Gable" "Hip" "Hip" ...
## $ Roof.Matl : chr [1:2930] "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior.1st : chr [1:2930] "BrkFace" "VinylSd" "Wd Sdng" "BrkFace" ...
## $ Exterior.2nd : chr [1:2930] "Plywood" "VinylSd" "Wd Sdng" "BrkFace" ...
## $ Mas.Vnr.Type : chr [1:2930] "Stone" "None" "BrkFace" "None" ...
## $ Mas.Vnr.Area : num [1:2930] 112 0 108 0 0 20 0 0 0 0 ...
## $ Exter.Qual : chr [1:2930] "TA" "TA" "TA" "Gd" ...
## $ Exter.Cond : chr [1:2930] "TA" "TA" "TA" "TA" ...
## $ Foundation : chr [1:2930] "CBlock" "CBlock" "CBlock" "CBlock" ...
## $ Bsmt.Qual : chr [1:2930] "TA" "TA" "TA" "TA" ...
## $ Bsmt.Cond : chr [1:2930] "Gd" "TA" "TA" "TA" ...
## $ Bsmt.Exposure : chr [1:2930] "Gd" "No" "No" "No" ...
## $ BsmtFin.Type.1 : chr [1:2930] "BLQ" "Rec" "ALQ" "ALQ" ...
## $ BsmtFin.SF.1 : num [1:2930] 639 468 923 1065 791 ...
## $ BsmtFin.Type.2 : chr [1:2930] "Unf" "LwQ" "Unf" "Unf" ...
## $ BsmtFin.SF.2 : num [1:2930] 0 144 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Unf.SF : num [1:2930] 441 270 406 1045 137 ...
## $ Total.Bsmt.SF : num [1:2930] 1080 882 1329 2110 928 ...
## $ Heating : chr [1:2930] "GasA" "GasA" "GasA" "GasA" ...
## $ Heating.QC : chr [1:2930] "Fa" "TA" "TA" "Ex" ...
## $ Central.Air : chr [1:2930] "Y" "Y" "Y" "Y" ...
## $ Electrical : chr [1:2930] "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1st.Flr.SF : num [1:2930] 1656 896 1329 2110 928 ...
## $ X2nd.Flr.SF : num [1:2930] 0 0 0 0 701 678 0 0 0 776 ...
## $ Low.Qual.Fin.SF: num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Full.Bath : num [1:2930] 1 0 0 1 0 0 1 0 1 0 ...
## $ Bsmt.Half.Bath : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
## $ Full.Bath : num [1:2930] 1 1 1 2 2 2 2 2 2 2 ...
## $ Half.Bath : num [1:2930] 0 0 1 1 1 1 0 0 0 1 ...
## $ Bedroom.AbvGr : num [1:2930] 3 2 3 3 3 3 2 2 2 3 ...
## $ Kitchen.AbvGr : num [1:2930] 1 1 1 1 1 1 1 1 1 1 ...
## $ Kitchen.Qual : chr [1:2930] "TA" "TA" "Gd" "Ex" ...
## $ TotRms.AbvGrd : num [1:2930] 7 5 6 8 6 7 6 5 5 7 ...
## $ Functional : chr [1:2930] "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : num [1:2930] 2 0 0 2 1 1 0 0 1 1 ...
## $ Fireplace.Qu : chr [1:2930] "Gd" NA NA "TA" ...
## $ Garage.Type : chr [1:2930] "Attchd" "Attchd" "Attchd" "Attchd" ...
## $ Garage.Yr.Blt : num [1:2930] 1960 1961 1958 1968 1997 ...
## $ Garage.Finish : chr [1:2930] "Fin" "Unf" "Unf" "Fin" ...
## $ Garage.Cars : num [1:2930] 2 1 1 2 2 2 2 2 2 2 ...
## $ Garage.Area : num [1:2930] 528 730 312 522 482 470 582 506 608 442 ...
## $ Garage.Qual : chr [1:2930] "TA" "TA" "TA" "TA" ...
## $ Garage.Cond : chr [1:2930] "TA" "TA" "TA" "TA" ...
## $ Paved.Drive : chr [1:2930] "P" "Y" "Y" "Y" ...
## $ Wood.Deck.SF : num [1:2930] 210 140 393 0 212 360 0 0 237 140 ...
## $ Open.Porch.SF : num [1:2930] 62 0 36 0 34 36 0 82 152 60 ...
## $ Enclosed.Porch : num [1:2930] 0 0 0 0 0 0 170 0 0 0 ...
## $ X3Ssn.Porch : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
## $ Screen.Porch : num [1:2930] 0 120 0 0 0 0 0 144 0 0 ...
## $ Pool.Area : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
## $ Pool.QC : chr [1:2930] NA NA NA NA ...
## $ Fence : chr [1:2930] NA "MnPrv" NA NA ...
## $ Misc.Feature : chr [1:2930] NA NA "Gar2" NA ...
## $ Misc.Val : num [1:2930] 0 0 12500 0 0 0 0 0 0 0 ...
## $ Mo.Sold : num [1:2930] 5 6 6 4 3 6 4 1 3 6 ...
## $ Yr.Sold : num [1:2930] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ Sale.Type : chr [1:2930] "WD" "WD" "WD" "WD" ...
## $ Sale.Condition : chr [1:2930] "Normal" "Normal" "Normal" "Normal" ...
## - attr(*, "spec")=
## .. cols(
## .. Order = col_double(),
## .. PID = col_double(),
## .. area = col_double(),
## .. price = col_double(),
## .. MS.SubClass = col_double(),
## .. MS.Zoning = col_character(),
## .. Lot.Frontage = col_double(),
## .. Lot.Area = col_double(),
## .. Street = col_character(),
## .. Alley = col_character(),
## .. Lot.Shape = col_character(),
## .. Land.Contour = col_character(),
## .. Utilities = col_character(),
## .. Lot.Config = col_character(),
## .. Land.Slope = col_character(),
## .. Neighborhood = col_character(),
## .. Condition.1 = col_character(),
## .. Condition.2 = col_character(),
## .. Bldg.Type = col_character(),
## .. House.Style = col_character(),
## .. Overall.Qual = col_double(),
## .. Overall.Cond = col_double(),
## .. Year.Built = col_double(),
## .. Year.Remod.Add = col_double(),
## .. Roof.Style = col_character(),
## .. Roof.Matl = col_character(),
## .. Exterior.1st = col_character(),
## .. Exterior.2nd = col_character(),
## .. Mas.Vnr.Type = col_character(),
## .. Mas.Vnr.Area = col_double(),
## .. Exter.Qual = col_character(),
## .. Exter.Cond = col_character(),
## .. Foundation = col_character(),
## .. Bsmt.Qual = col_character(),
## .. Bsmt.Cond = col_character(),
## .. Bsmt.Exposure = col_character(),
## .. BsmtFin.Type.1 = col_character(),
## .. BsmtFin.SF.1 = col_double(),
## .. BsmtFin.Type.2 = col_character(),
## .. BsmtFin.SF.2 = col_double(),
## .. Bsmt.Unf.SF = col_double(),
## .. Total.Bsmt.SF = col_double(),
## .. Heating = col_character(),
## .. Heating.QC = col_character(),
## .. Central.Air = col_character(),
## .. Electrical = col_character(),
## .. X1st.Flr.SF = col_double(),
## .. X2nd.Flr.SF = col_double(),
## .. Low.Qual.Fin.SF = col_double(),
## .. Bsmt.Full.Bath = col_double(),
## .. Bsmt.Half.Bath = col_double(),
## .. Full.Bath = col_double(),
## .. Half.Bath = col_double(),
## .. Bedroom.AbvGr = col_double(),
## .. Kitchen.AbvGr = col_double(),
## .. Kitchen.Qual = col_character(),
## .. TotRms.AbvGrd = col_double(),
## .. Functional = col_character(),
## .. Fireplaces = col_double(),
## .. Fireplace.Qu = col_character(),
## .. Garage.Type = col_character(),
## .. Garage.Yr.Blt = col_double(),
## .. Garage.Finish = col_character(),
## .. Garage.Cars = col_double(),
## .. Garage.Area = col_double(),
## .. Garage.Qual = col_character(),
## .. Garage.Cond = col_character(),
## .. Paved.Drive = col_character(),
## .. Wood.Deck.SF = col_double(),
## .. Open.Porch.SF = col_double(),
## .. Enclosed.Porch = col_double(),
## .. X3Ssn.Porch = col_double(),
## .. Screen.Porch = col_double(),
## .. Pool.Area = col_double(),
## .. Pool.QC = col_character(),
## .. Fence = col_character(),
## .. Misc.Feature = col_character(),
## .. Misc.Val = col_double(),
## .. Mo.Sold = col_double(),
## .. Yr.Sold = col_double(),
## .. Sale.Type = col_character(),
## .. Sale.Condition = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
head(homes)
## # A tibble: 6 × 82
## Order PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 1 5.26e8 1656 215000 20 RL 141 31770 Pave
## 2 2 5.26e8 896 105000 20 RH 80 11622 Pave
## 3 3 5.26e8 1329 172000 20 RL 81 14267 Pave
## 4 4 5.26e8 2110 244000 20 RL 93 11160 Pave
## 5 5 5.27e8 1629 189900 60 RL 74 13830 Pave
## 6 6 5.27e8 1604 195500 60 RL 78 9978 Pave
## # ℹ 73 more variables: Alley <chr>, Lot.Shape <chr>, Land.Contour <chr>,
## # Utilities <chr>, Lot.Config <chr>, Land.Slope <chr>, Neighborhood <chr>,
## # Condition.1 <chr>, Condition.2 <chr>, Bldg.Type <chr>, House.Style <chr>,
## # Overall.Qual <dbl>, Overall.Cond <dbl>, Year.Built <dbl>,
## # Year.Remod.Add <dbl>, Roof.Style <chr>, Roof.Matl <chr>,
## # Exterior.1st <chr>, Exterior.2nd <chr>, Mas.Vnr.Type <chr>,
## # Mas.Vnr.Area <dbl>, Exter.Qual <chr>, Exter.Cond <chr>, Foundation <chr>, …
colSums(is.na(homes))
## Order PID area price MS.SubClass
## 0 0 0 0 0
## MS.Zoning Lot.Frontage Lot.Area Street Alley
## 0 490 0 0 2732
## Lot.Shape Land.Contour Utilities Lot.Config Land.Slope
## 0 0 0 0 0
## Neighborhood Condition.1 Condition.2 Bldg.Type House.Style
## 0 0 0 0 0
## Overall.Qual Overall.Cond Year.Built Year.Remod.Add Roof.Style
## 0 0 0 0 0
## Roof.Matl Exterior.1st Exterior.2nd Mas.Vnr.Type Mas.Vnr.Area
## 0 0 0 23 23
## Exter.Qual Exter.Cond Foundation Bsmt.Qual Bsmt.Cond
## 0 0 0 80 80
## Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2
## 83 80 1 81 1
## Bsmt.Unf.SF Total.Bsmt.SF Heating Heating.QC Central.Air
## 1 1 0 0 0
## Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Bsmt.Full.Bath
## 1 0 0 0 2
## Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr Kitchen.AbvGr
## 2 0 0 0 0
## Kitchen.Qual TotRms.AbvGrd Functional Fireplaces Fireplace.Qu
## 0 0 0 0 1422
## Garage.Type Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area
## 157 159 159 1 1
## Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF Open.Porch.SF
## 159 159 0 0 0
## Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area Pool.QC
## 0 0 0 0 2917
## Fence Misc.Feature Misc.Val Mo.Sold Yr.Sold
## 2358 2824 0 0 0
## Sale.Type Sale.Condition
## 0 0
Several columns have missing values, but among the columns I will be using only Total.Bsmt.SF is affected, with 1 missing value.
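The full 82-column printout is hard to scan, so one option is to keep only the columns that have at least one NA. The sketch below uses a small made-up data frame (`toy`, with invented values) in place of `homes`; the same two lines should work on the real data.

```r
# Toy stand-in for `homes`; the values are invented for illustration only.
toy <- data.frame(
  price         = c(215000, 105000, 172000, 244000),
  Total.Bsmt.SF = c(1080, NA, 1329, 2110),
  Alley         = c(NA, NA, "Grvl", NA)
)

# Count NAs per column, then keep only the columns that have any,
# sorted from most to fewest missing values.
na_counts <- colSums(is.na(toy))
na_only <- sort(na_counts[na_counts > 0], decreasing = TRUE)
na_only
```

On the real data, replacing `toy` with `homes` would list only the columns that need attention.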
df_homes <- homes |>
select(price, MS.Zoning, Neighborhood, Bldg.Type, Lot.Area, Year.Remod.Add, Overall.Cond, X1st.Flr.SF, X2nd.Flr.SF, Total.Bsmt.SF, Full.Bath, Half.Bath, Land.Slope, Sale.Condition) |>
filter(Sale.Condition == "Normal") |>
filter(Neighborhood == "BrkSide") |>
filter(Bldg.Type == "1Fam") |>
filter(MS.Zoning == "RL") |>
filter(Land.Slope =="Gtl") |>
mutate(Total_Sq = X1st.Flr.SF + X2nd.Flr.SF + Total.Bsmt.SF)
df_homes
## # A tibble: 35 × 15
## price MS.Zoning Neighborhood Bldg.Type Lot.Area Year.Remod.Add Overall.Cond
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 116500 RL BrkSide 1Fam 7207 2008 7
## 2 76500 RL BrkSide 1Fam 5350 1966 2
## 3 209500 RL BrkSide 1Fam 7793 2005 7
## 4 82500 RL BrkSide 1Fam 5330 1950 7
## 5 110000 RL BrkSide 1Fam 7015 1950 4
## 6 223500 RL BrkSide 1Fam 21384 2004 6
## 7 149000 RL BrkSide 1Fam 6615 1950 6
## 8 205000 RL BrkSide 1Fam 7264 2007 7
## 9 137000 RL BrkSide 1Fam 4960 1982 7
## 10 121000 RL BrkSide 1Fam 8854 1950 6
## # ℹ 25 more rows
## # ℹ 8 more variables: X1st.Flr.SF <dbl>, X2nd.Flr.SF <dbl>,
## # Total.Bsmt.SF <dbl>, Full.Bath <dbl>, Half.Bath <dbl>, Land.Slope <chr>,
## # Sale.Condition <chr>, Total_Sq <dbl>
df_homes <- df_homes |> rename(Price = price)
names(df_homes)
## [1] "Price" "MS.Zoning" "Neighborhood" "Bldg.Type"
## [5] "Lot.Area" "Year.Remod.Add" "Overall.Cond" "X1st.Flr.SF"
## [9] "X2nd.Flr.SF" "Total.Bsmt.SF" "Full.Bath" "Half.Bath"
## [13] "Land.Slope" "Sale.Condition" "Total_Sq"
# Assign the result so the factor conversion is actually kept in df_homes
df_homes <- df_homes |>
mutate(across(c(MS.Zoning, Bldg.Type, Neighborhood, Sale.Condition, Land.Slope), as.factor))
df_homes
## # A tibble: 35 × 15
## Price MS.Zoning Neighborhood Bldg.Type Lot.Area Year.Remod.Add Overall.Cond
## <dbl> <fct> <fct> <fct> <dbl> <dbl> <dbl>
## 1 116500 RL BrkSide 1Fam 7207 2008 7
## 2 76500 RL BrkSide 1Fam 5350 1966 2
## 3 209500 RL BrkSide 1Fam 7793 2005 7
## 4 82500 RL BrkSide 1Fam 5330 1950 7
## 5 110000 RL BrkSide 1Fam 7015 1950 4
## 6 223500 RL BrkSide 1Fam 21384 2004 6
## 7 149000 RL BrkSide 1Fam 6615 1950 6
## 8 205000 RL BrkSide 1Fam 7264 2007 7
## 9 137000 RL BrkSide 1Fam 4960 1982 7
## 10 121000 RL BrkSide 1Fam 8854 1950 6
## # ℹ 25 more rows
## # ℹ 8 more variables: X1st.Flr.SF <dbl>, X2nd.Flr.SF <dbl>,
## # Total.Bsmt.SF <dbl>, Full.Bath <dbl>, Half.Bath <dbl>, Land.Slope <fct>,
## # Sale.Condition <fct>, Total_Sq <dbl>
multiple_model <- lm(Price ~ Year.Remod.Add + Lot.Area + Total_Sq + Overall.Cond, data = df_homes)
multiple_model
##
## Call:
## lm(formula = Price ~ Year.Remod.Add + Lot.Area + Total_Sq + Overall.Cond,
## data = df_homes)
##
## Coefficients:
## (Intercept) Year.Remod.Add Lot.Area Total_Sq Overall.Cond
## -75737.267 8.024 3.077 58.333 8695.172
summary(multiple_model)
##
## Call:
## lm(formula = Price ~ Year.Remod.Add + Lot.Area + Total_Sq + Overall.Cond,
## data = df_homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37676 -8089 -961 9191 35584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.574e+04 2.177e+05 -0.348 0.73041
## Year.Remod.Add 8.024e+00 1.136e+02 0.071 0.94414
## Lot.Area 3.077e+00 8.786e-01 3.502 0.00147 **
## Total_Sq 5.833e+01 4.628e+00 12.603 1.62e-13 ***
## Overall.Cond 8.695e+03 1.918e+03 4.534 8.65e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13990 on 30 degrees of freedom
## Multiple R-squared: 0.9287, Adjusted R-squared: 0.9192
## F-statistic: 97.73 on 4 and 30 DF, p-value: < 2.2e-16
par(mfrow=c(2,2)); plot(multiple_model); par(mfrow=c(1,1))
Residuals vs Fitted: The points mostly follow the red line, with a dense cluster around 70 and fewer points as the fitted values get higher. Linearity therefore looks reasonable. The spread of the residuals around the line is roughly even, so the homoscedasticity assumption is generally met. There is some variation that suggests mild heteroscedasticity, but it is not severe enough to invalidate the model.
Scale–Location: The points form a cloud-like spread with more variation as the fitted values increase. This indicates some heteroscedasticity: the residual variance grows at higher predicted sale prices, especially between 75 and 85.
Q–Q plot: The tails deviate from the line, with both the left and right tails falling a bit low. The residuals are not perfectly normal, but they generally follow the line, so the normality assumption is reasonably met.
Residuals vs Leverage: The dashed Cook's distance line is far from the red line in places, especially toward the higher values, and the points mostly follow the red line, except at the far left where the points cluster higher and the red line dips down.
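The visual normality check can be complemented with a formal test. As a minimal sketch on simulated data (the `sim` data frame and its coefficients are hypothetical, not the Ames data), `shapiro.test()` from base R tests the residuals for normality; a large p-value means there is no evidence against normality.

```r
set.seed(101)

# Simulated data standing in for df_homes (hypothetical values).
n <- 35
sim <- data.frame(Total_Sq = runif(n, 1000, 3000))
sim$Price <- 20000 + 58 * sim$Total_Sq + rnorm(n, sd = 14000)

sim_model <- lm(Price ~ Total_Sq, data = sim)

# Shapiro-Wilk test: the null hypothesis is that the residuals
# come from a normal distribution.
sw <- shapiro.test(resid(sim_model))
sw$p.value
```

On the model from this report, the equivalent call would be `shapiro.test(resid(multiple_model))`.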
plot(resid(multiple_model), type="b",
main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)
cor(df_homes[, c("Lot.Area", "Overall.Cond", "Total_Sq")], use = "complete.obs")
## Lot.Area Overall.Cond Total_Sq
## Lot.Area 1.0000000 0.1604025 0.5788556
## Overall.Cond 0.1604025 1.0000000 0.1069479
## Total_Sq 0.5788556 0.1069479 1.0000000
Correlations among predictors are low to moderate: Lot.Area–Total_Sq = 0.58, Lot.Area–Overall.Cond = 0.16, Overall.Cond–Total_Sq = 0.11. This indicates weak multicollinearity.
Year.Remod.Add was not included in this correlation matrix, so its relationship to the other predictors is not checked here.
Implications: The coefficient SEs for Lot.Area and Total_Sq may be slightly inflated by their moderate correlation, but both are still clearly significant in the model (p = 0.00147 and p = 1.62e-13).
A next step would be to consider dropping Year.Remod.Add, whose coefficient is not significant (p = 0.944), and to add explicit checks for linearity, independence, and multicollinearity.
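Pairwise correlations only capture two-variable relationships, so a variance inflation factor (VIF) check, which compares each predictor against all the others, can be a useful addition. The sketch below computes VIFs by hand in base R on simulated predictors (`x1`, `x2`, `x3` are hypothetical names and values, not the Ames columns). A common rule of thumb treats VIFs above roughly 5 to 10 as a concern.

```r
set.seed(101)
n <- 35

# Simulated predictors (hypothetical): x2 is built to correlate with x1.
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.8)
x3 <- rnorm(n)
preds <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
vif_manual <- sapply(names(preds), function(v) {
  others <- setdiff(names(preds), v)
  r2 <- summary(lm(reformulate(others, response = v), data = preds))$r.squared
  1 / (1 - r2)
})
vif_manual
```

For the model in this report, `preds` would be the predictor columns of `df_homes` (Year.Remod.Add, Lot.Area, Total_Sq, Overall.Cond).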
An ideal outcome would show the points scattered evenly and close to the line. Large residuals would indicate that the model is less reliable for predicting sale price accurately and would need to be adjusted.
residuals_multiple <- resid(multiple_model)
# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 12949.28
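The RMSE of 12949.28 is a little smaller than the residual standard error of 13990 reported by `summary()` because the RMSE divides the sum of squared residuals by the number of observations (35), while the residual standard error divides by the residual degrees of freedom (30). The sketch below checks that relationship on simulated data (`sim` and `fit` are hypothetical, not the Ames model).

```r
set.seed(101)
n <- 35

# Simulated data with two predictors (hypothetical values).
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = sim)

rmse <- sqrt(mean(resid(fit)^2))  # divides by n
rse  <- sigma(fit)                # divides by the residual degrees of freedom

# The two differ only by the factor sqrt(df / n).
all.equal(rmse, rse * sqrt(df.residual(fit) / n))
```

Applied to this report, 13990 × sqrt(30 / 35) is approximately 12950, which agrees with the RMSE above up to rounding.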
These results are useful for home buyers and sellers as well as real estate agents and investors: for this subset of BrkSide homes, the model explains about 93% of the variation in sale price (R-squared = 0.9287), with an RMSE of about $12,949.
It would be interesting to repeat the analysis for homes in a different neighborhood, with different zoning, or of a different building type to see whether the results differ or whether the same variables have a strong effect on sale price. The analysis could also be rerun based on when the sales took place.
https://www.openintro.org/data/index.php?data=ames
De Cock, Dean. “Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project.” Journal of Statistics Education 19.3 (2011).