This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
This exercise is under construction. Please report any errors at https://forms.gle/2W4tffs4YJA1jeBv9
Goal: Understand and experience outlier detection techniques in action.
Background: The data for this question has been adapted from https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. Please review information at https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview before you get started.
Before starting: 1. You are not allowed to search for solutions to this assignment. 2. You are allowed to search for information about packages and functions that can help you.
Individual assignment only: 70 total points (Rmd and html solution) Team assignment: 20 points (written analysis)
Start by entering your name and today’s date in Lines 3 and 4, respectively, to indicate your compliance with the Fuqua Honor Code. Then, run the chunk of code below by clicking on the green arrow (that points to the right) on the top right of the chunk. Tip: I numbered the code chunks to match their question numbers. Chunk 1 specifies the knitting parameters.
Read and store the data from the file PricesBefore2009.csv into a variable called before2009. Then, inspect the data. Rubric: 1 point each for reading and storing; 1 point each for using 2 R commands for inspecting. Tip: I recommend using the read_csv() function from the tidyverse package for this and all subsequent assignments.
rm(list =ls())
ls()
## character(0)
before2009 <- read.csv("PricesBefore2009.csv")
head(before2009,20)
## X Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 7 20 RL 75 10084 Pave <NA> Reg
## 7 7 9 50 RM 51 6120 Pave <NA> Reg
## 8 8 10 190 RL 50 7420 Pave <NA> Reg
## 9 9 11 20 RL 70 11200 Pave <NA> Reg
## 10 10 12 60 RL 85 11924 Pave <NA> IR1
## 11 11 13 20 RL NA 12968 Pave <NA> IR2
## 12 12 14 20 RL 91 10652 Pave <NA> IR1
## 13 13 15 20 RL NA 10920 Pave <NA> IR1
## 14 14 16 45 RM 51 6120 Pave <NA> Reg
## 15 15 18 90 RL 72 10791 Pave <NA> Reg
## 16 16 19 20 RL 66 13695 Pave <NA> Reg
## 17 17 21 60 RL 101 14215 Pave <NA> IR1
## 18 18 22 45 RM 57 7449 Pave Grvl Reg
## 19 19 23 20 RL 75 9742 Pave <NA> Reg
## 20 20 24 120 RM 44 4224 Pave <NA> Reg
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
## 1 Lvl AllPub Inside Gtl CollgCr Norm Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr Norm
## 3 Lvl AllPub Inside Gtl CollgCr Norm Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm Norm
## 6 Lvl AllPub Inside Gtl Somerst Norm Norm
## 7 Lvl AllPub Inside Gtl OldTown Artery Norm
## 8 Lvl AllPub Corner Gtl BrkSide Artery Artery
## 9 Lvl AllPub Inside Gtl Sawyer Norm Norm
## 10 Lvl AllPub Inside Gtl NridgHt Norm Norm
## 11 Lvl AllPub Inside Gtl Sawyer Norm Norm
## 12 Lvl AllPub Inside Gtl CollgCr Norm Norm
## 13 Lvl AllPub Corner Gtl NAmes Norm Norm
## 14 Lvl AllPub Corner Gtl BrkSide Norm Norm
## 15 Lvl AllPub Inside Gtl Sawyer Norm Norm
## 16 Lvl AllPub Inside Gtl SawyerW RRAe Norm
## 17 Lvl AllPub Corner Gtl NridgHt Norm Norm
## 18 Bnk AllPub Inside Gtl IDOTRR Norm Norm
## 19 Lvl AllPub Inside Gtl CollgCr Norm Norm
## 20 Lvl AllPub Inside Gtl MeadowV Norm Norm
## BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
## 1 1Fam 2Story 7 5 2003 2003 Gable
## 2 1Fam 1Story 6 8 1976 1976 Gable
## 3 1Fam 2Story 7 5 2001 2002 Gable
## 4 1Fam 2Story 7 5 1915 1970 Gable
## 5 1Fam 2Story 8 5 2000 2000 Gable
## 6 1Fam 1Story 8 5 2004 2005 Gable
## 7 1Fam 1.5Fin 7 5 1931 1950 Gable
## 8 2fmCon 1.5Unf 5 6 1939 1950 Gable
## 9 1Fam 1Story 5 5 1965 1965 Hip
## 10 1Fam 2Story 9 5 2005 2006 Hip
## 11 1Fam 1Story 5 6 1962 1962 Hip
## 12 1Fam 1Story 7 5 2006 2007 Gable
## 13 1Fam 1Story 6 5 1960 1960 Hip
## 14 1Fam 1.5Unf 7 8 1929 2001 Gable
## 15 Duplex 1Story 4 5 1967 1967 Gable
## 16 1Fam 1Story 5 5 2004 2004 Gable
## 17 1Fam 2Story 8 5 2005 2006 Gable
## 18 1Fam 1.5Unf 7 7 1930 1950 Gable
## 19 1Fam 1Story 8 5 2002 2002 Hip
## 20 TwnhsE 1Story 5 7 1976 1976 Gable
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
## 1 CompShg VinylSd VinylSd BrkFace 196 Gd TA
## 2 CompShg MetalSd MetalSd None 0 TA TA
## 3 CompShg VinylSd VinylSd BrkFace 162 Gd TA
## 4 CompShg Wd Sdng Wd Shng None 0 TA TA
## 5 CompShg VinylSd VinylSd BrkFace 350 Gd TA
## 6 CompShg VinylSd VinylSd Stone 186 Gd TA
## 7 CompShg BrkFace Wd Shng None 0 TA TA
## 8 CompShg MetalSd MetalSd None 0 TA TA
## 9 CompShg HdBoard HdBoard None 0 TA TA
## 10 CompShg WdShing Wd Shng Stone 286 Ex TA
## 11 CompShg HdBoard Plywood None 0 TA TA
## 12 CompShg VinylSd VinylSd Stone 306 Gd TA
## 13 CompShg MetalSd MetalSd BrkFace 212 TA TA
## 14 CompShg Wd Sdng Wd Sdng None 0 TA TA
## 15 CompShg MetalSd MetalSd None 0 TA TA
## 16 CompShg VinylSd VinylSd None 0 TA TA
## 17 CompShg VinylSd VinylSd BrkFace 380 Gd TA
## 18 CompShg Wd Sdng Wd Sdng None 0 TA TA
## 19 CompShg VinylSd VinylSd BrkFace 281 Gd TA
## 20 CompShg CemntBd CmentBd None 0 TA TA
## Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 1 PConc Gd TA No GLQ 706
## 2 CBlock Gd TA Gd ALQ 978
## 3 PConc Gd TA Mn GLQ 486
## 4 BrkTil TA Gd No ALQ 216
## 5 PConc Gd TA Av GLQ 655
## 6 PConc Ex TA Av GLQ 1369
## 7 BrkTil TA TA No Unf 0
## 8 BrkTil TA TA No GLQ 851
## 9 CBlock TA TA No Rec 906
## 10 PConc Ex TA No GLQ 998
## 11 CBlock TA TA No ALQ 737
## 12 PConc Gd TA Av Unf 0
## 13 CBlock TA TA No BLQ 733
## 14 BrkTil TA TA No Unf 0
## 15 Slab <NA> <NA> <NA> <NA> 0
## 16 PConc TA TA No GLQ 646
## 17 PConc Ex TA Av Unf 0
## 18 PConc TA TA No Unf 0
## 19 PConc Gd TA No Unf 0
## 20 PConc Gd TA No GLQ 840
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## 1 Unf 0 150 856 GasA Ex Y
## 2 Unf 0 284 1262 GasA Ex Y
## 3 Unf 0 434 920 GasA Ex Y
## 4 Unf 0 540 756 GasA Gd Y
## 5 Unf 0 490 1145 GasA Ex Y
## 6 Unf 0 317 1686 GasA Ex Y
## 7 Unf 0 952 952 GasA Gd Y
## 8 Unf 0 140 991 GasA Ex Y
## 9 Unf 0 134 1040 GasA Ex Y
## 10 Unf 0 177 1175 GasA Ex Y
## 11 Unf 0 175 912 GasA TA Y
## 12 Unf 0 1494 1494 GasA Ex Y
## 13 Unf 0 520 1253 GasA TA Y
## 14 Unf 0 832 832 GasA Ex Y
## 15 <NA> 0 0 0 GasA TA Y
## 16 Unf 0 468 1114 GasA Ex Y
## 17 Unf 0 1158 1158 GasA Ex Y
## 18 Unf 0 637 637 GasA Ex Y
## 19 Unf 0 1777 1777 GasA Ex Y
## 20 Unf 0 200 1040 GasA TA Y
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## 1 SBrkr 856 854 0 1710 1
## 2 SBrkr 1262 0 0 1262 0
## 3 SBrkr 920 866 0 1786 1
## 4 SBrkr 961 756 0 1717 1
## 5 SBrkr 1145 1053 0 2198 1
## 6 SBrkr 1694 0 0 1694 1
## 7 FuseF 1022 752 0 1774 0
## 8 SBrkr 1077 0 0 1077 1
## 9 SBrkr 1040 0 0 1040 1
## 10 SBrkr 1182 1142 0 2324 1
## 11 SBrkr 912 0 0 912 1
## 12 SBrkr 1494 0 0 1494 0
## 13 SBrkr 1253 0 0 1253 1
## 14 FuseA 854 0 0 854 0
## 15 SBrkr 1296 0 0 1296 0
## 16 SBrkr 1114 0 0 1114 1
## 17 SBrkr 1158 1218 0 2376 0
## 18 FuseF 1108 0 0 1108 0
## 19 SBrkr 1795 0 0 1795 0
## 20 SBrkr 1060 0 0 1060 1
## BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## 1 0 2 1 3 1 Gd
## 2 1 2 0 3 1 TA
## 3 0 2 1 3 1 Gd
## 4 0 1 0 3 1 Gd
## 5 0 2 1 4 1 Gd
## 6 0 2 0 3 1 Gd
## 7 0 2 0 2 2 TA
## 8 0 1 0 2 2 TA
## 9 0 1 0 3 1 TA
## 10 0 3 0 4 1 Ex
## 11 0 1 0 2 1 TA
## 12 0 2 0 3 1 Gd
## 13 0 1 1 2 1 TA
## 14 0 1 0 2 1 TA
## 15 0 2 0 2 2 TA
## 16 0 1 1 3 1 Gd
## 17 0 3 1 4 1 Gd
## 18 0 1 0 3 1 Gd
## 19 0 2 0 3 1 Gd
## 20 0 1 0 3 1 TA
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 1 8 Typ 0 <NA> Attchd 2003
## 2 6 Typ 1 TA Attchd 1976
## 3 6 Typ 1 TA Attchd 2001
## 4 7 Typ 1 Gd Detchd 1998
## 5 9 Typ 1 TA Attchd 2000
## 6 7 Typ 1 Gd Attchd 2004
## 7 8 Min1 2 TA Detchd 1931
## 8 5 Typ 2 TA Attchd 1939
## 9 5 Typ 0 <NA> Detchd 1965
## 10 11 Typ 2 Gd BuiltIn 2005
## 11 4 Typ 0 <NA> Detchd 1962
## 12 7 Typ 1 Gd Attchd 2006
## 13 5 Typ 1 Fa Attchd 1960
## 14 5 Typ 0 <NA> Detchd 1991
## 15 6 Typ 0 <NA> CarPort 1967
## 16 6 Typ 0 <NA> Detchd 2004
## 17 9 Typ 1 Gd BuiltIn 2005
## 18 6 Typ 1 Gd Attchd 1930
## 19 7 Typ 1 Gd Attchd 2002
## 20 6 Typ 1 TA Attchd 1976
## GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
## 1 RFn 2 548 TA TA Y
## 2 RFn 2 460 TA TA Y
## 3 RFn 2 608 TA TA Y
## 4 Unf 3 642 TA TA Y
## 5 RFn 3 836 TA TA Y
## 6 RFn 2 636 TA TA Y
## 7 Unf 2 468 Fa TA Y
## 8 RFn 1 205 Gd TA Y
## 9 Unf 1 384 TA TA Y
## 10 Fin 3 736 TA TA Y
## 11 Unf 1 352 TA TA Y
## 12 RFn 3 840 TA TA Y
## 13 RFn 1 352 TA TA Y
## 14 Unf 2 576 TA TA Y
## 15 Unf 2 516 TA TA Y
## 16 Unf 2 576 TA TA Y
## 17 RFn 3 853 TA TA Y
## 18 Unf 1 280 TA TA N
## 19 RFn 2 534 TA TA Y
## 20 Unf 2 572 TA TA Y
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1 0 61 0 0 0 0 <NA>
## 2 298 0 0 0 0 0 <NA>
## 3 0 42 0 0 0 0 <NA>
## 4 0 35 272 0 0 0 <NA>
## 5 192 84 0 0 0 0 <NA>
## 6 255 57 0 0 0 0 <NA>
## 7 90 0 205 0 0 0 <NA>
## 8 0 4 0 0 0 0 <NA>
## 9 0 0 0 0 0 0 <NA>
## 10 147 21 0 0 0 0 <NA>
## 11 140 0 0 0 176 0 <NA>
## 12 160 33 0 0 0 0 <NA>
## 13 0 213 176 0 0 0 <NA>
## 14 48 112 0 0 0 0 <NA>
## 15 0 0 0 0 0 0 <NA>
## 16 0 102 0 0 0 0 <NA>
## 17 240 154 0 0 0 0 <NA>
## 18 0 0 205 0 0 0 <NA>
## 19 171 159 0 0 0 0 <NA>
## 20 100 110 0 0 0 0 <NA>
## Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 <NA> <NA> 0 2 2008 WD Normal 208500
## 2 <NA> <NA> 0 5 2007 WD Normal 181500
## 3 <NA> <NA> 0 9 2008 WD Normal 223500
## 4 <NA> <NA> 0 2 2006 WD Abnorml 140000
## 5 <NA> <NA> 0 12 2008 WD Normal 250000
## 6 <NA> <NA> 0 8 2007 WD Normal 307000
## 7 <NA> <NA> 0 4 2008 WD Abnorml 129900
## 8 <NA> <NA> 0 1 2008 WD Normal 118000
## 9 <NA> <NA> 0 2 2008 WD Normal 129500
## 10 <NA> <NA> 0 7 2006 New Partial 345000
## 11 <NA> <NA> 0 9 2008 WD Normal 144000
## 12 <NA> <NA> 0 8 2007 New Partial 279500
## 13 GdWo <NA> 0 5 2008 WD Normal 157000
## 14 GdPrv <NA> 0 7 2007 WD Normal 132000
## 15 <NA> Shed 500 10 2006 WD Normal 90000
## 16 <NA> <NA> 0 6 2008 WD Normal 159000
## 17 <NA> <NA> 0 11 2006 New Partial 325300
## 18 GdPrv <NA> 0 6 2007 WD Normal 139400
## 19 <NA> <NA> 0 9 2008 WD Normal 230000
## 20 <NA> <NA> 0 6 2007 WD Normal 129900
str(before2009)
## 'data.frame': 1933 obs. of 82 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Id : int 1 2 3 4 5 7 9 10 11 12 ...
## $ MSSubClass : int 60 20 60 70 60 20 50 190 20 60 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 75 51 50 70 85 ...
## $ LotArea : int 8450 9600 11250 9550 14260 10084 6120 7420 11200 11924 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 8 7 5 5 9 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 5 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 2004 1931 1939 1965 2005 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 2005 1950 1950 1965 2006 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 186 0 0 0 286 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 1369 0 851 906 998 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 317 952 140 134 177 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 1686 952 991 1040 1175 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 1694 1022 1077 1040 1182 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 0 752 0 0 1142 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1694 1774 1077 1040 2324 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 0 1 1 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 2 2 1 1 3 ...
## $ HalfBath : int 1 0 1 0 1 0 0 0 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 3 2 2 3 4 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 2 2 1 1 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 7 8 5 5 11 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 1 2 2 0 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 2004 1931 1939 1965 2005 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 1 1 3 ...
## $ GarageArea : int 548 460 608 642 836 636 468 205 384 736 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 255 90 0 0 147 ...
## $ OpenPorchSF : int 61 0 42 35 84 57 0 4 0 21 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 205 0 0 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MoSold : int 2 5 9 2 12 8 4 1 2 7 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2007 2008 2008 2008 2006 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : num 208500 181500 223500 140000 250000 ...
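The index column X in the output above is worth a second look. The following is a minimal sketch, using a made-up three-row data frame and a temporary file rather than PricesBefore2009.csv, of how read.csv() names an unnamed first column, followed by two further inspection commands (dim() and summary()). The object names toy, tmp, and toy_back are illustrative only.

```r
# Toy data standing in for the real file (values are made up)
toy <- data.frame(Id = c(1, 2, 3), SalePrice = c(208500, 181500, 223500))

# write.csv() stores row names in an unnamed first column by default
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp)

# read.csv() gives that unnamed column the name "X"
toy_back <- read.csv(tmp)
names(toy_back)

# Two more ways to inspect a data frame
dim(toy_back)      # number of rows and columns
summary(toy_back)  # per-column summaries
```

This is presumably why the data frame read with read.csv() has a column called X, which matters later when columns are dropped by name.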
Convert the following columns to character or factor type: MSSubClass, OverallQual, OverallCond. Then inspect the result to verify that your code works. Rubric: 3 points (1 point each) for conversion and 1 point for verification.
# Convert the columns to factor type
before2009$MSSubClass <- as.factor(before2009$MSSubClass)
before2009$OverallQual <- as.factor(before2009$OverallQual)
before2009$OverallCond <- as.factor(before2009$OverallCond)
# Inspect the result to verify the conversion
str(before2009[, c("MSSubClass", "OverallQual","OverallCond")])
## 'data.frame': 1933 obs. of 3 variables:
## $ MSSubClass : Factor w/ 16 levels "20","30","40",..: 6 1 6 7 6 1 5 16 1 6 ...
## $ OverallQual: Factor w/ 10 levels "1","2","3","4",..: 7 6 7 7 8 8 7 5 5 9 ...
## $ OverallCond: Factor w/ 9 levels "1","2","3","4",..: 5 8 5 5 5 5 5 6 5 5 ...
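When several columns need the same conversion, the three assignments above can be collapsed into one step. A minimal sketch on made-up data, assuming the columns to convert are listed in a character vector (toy and cat_cols are hypothetical names):

```r
# Toy data frame with integer-coded categories (values are made up)
toy <- data.frame(
  MSSubClass  = c(60L, 20L, 60L, 70L),
  OverallQual = c(7L, 6L, 7L, 8L),
  LotArea     = c(8450L, 9600L, 11250L, 9550L)
)

# Columns to treat as categorical
cat_cols <- c("MSSubClass", "OverallQual")

# Convert them all in one step
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

# Verify: TRUE for the converted columns, FALSE for LotArea
sapply(toy, is.factor)
levels(toy$MSSubClass)
```

Treating these codes as factors matters for the regression later: lm() then estimates a separate effect per level instead of a single slope across the numeric codes.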
How many NAs does each column have? Display your answer as a dataframe (or tibble) called beforeNAs. The dataset beforeNAs should contain two columns, one containing the names of the columns of before2009, and the other containing the number of NAs in each column. Then, print only the first 10 (head) rows of this dataframe to verify that your code works. Rubric: 6 points for constructing beforeNAs and 1 point for verification.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#install.packages("tidyverse")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.2 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Count the NAs in each column (map_int() returns one integer per column)
na_counts <- map_int(before2009, ~ sum(is.na(.x)))
beforeNAs <- tibble(Columns = names(na_counts), NAs = unname(na_counts))
beforeNAs %>% head(10)
## # A tibble: 10 × 2
## Columns NAs
## <chr> <int>
## 1 X 0
## 2 Id 0
## 3 MSSubClass 0
## 4 MSZoning 3
## 5 LotFrontage 317
## 6 LotArea 0
## 7 Street 0
## 8 Alley 1797
## 9 LotShape 0
## 10 LandContour 0
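The same count can be obtained without the tidyverse. A minimal base R sketch on made-up data, shown only as an alternative to the approach above (toy, na_counts, and toyNAs are hypothetical names):

```r
# Toy data with some missing values (values are made up)
toy <- data.frame(
  LotFrontage = c(65, NA, 68, NA),
  Alley       = c(NA, NA, NA, "Grvl"),
  LotArea     = c(8450, 9600, 11250, 9550),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))

# Two-column summary: column name and its NA count
toyNAs <- data.frame(
  Columns = names(na_counts),
  NAs     = as.integer(na_counts),
  row.names = NULL,
  stringsAsFactors = FALSE
)
toyNAs
```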
Drop (remove) all the columns (except SalePrice) that have 20 or more missing values. Also, drop (remove) the columns called X1 (named X if the file was read with read.csv()), Id, and Utilities (all its values are the same). While some of the columns we drop here may contribute to the predictive accuracy of our model, the majority of the information will be contained in the remaining variables. Then, print only the first 10 (head) rows of this dataframe to verify that your code works. Rubric: 8 points for dropping the columns and 1 point for verification.
# Count NAs per column; zero out the response so it is never dropped
count_NA <- sapply(before2009, function(x) sum(is.na(x)))
count_NA["SalePrice"] <- 0
cols_NA <- names(count_NA[count_NA >= 20])
# X is the index column as named by read.csv() (the prompt calls it X1)
cols_drop <- c("X", "Id", "Utilities")
dropCols <- union(cols_NA, cols_drop)
# all_of() is the supported way to select with an external character vector
before2009 <- select(before2009, -all_of(dropCols))
head(before2009, 20)
## MSSubClass MSZoning LotArea Street LotShape LandContour LotConfig LandSlope
## 1 60 RL 8450 Pave Reg Lvl Inside Gtl
## 2 20 RL 9600 Pave Reg Lvl FR2 Gtl
## 3 60 RL 11250 Pave IR1 Lvl Inside Gtl
## 4 70 RL 9550 Pave IR1 Lvl Corner Gtl
## 5 60 RL 14260 Pave IR1 Lvl FR2 Gtl
## 6 20 RL 10084 Pave Reg Lvl Inside Gtl
## 7 50 RM 6120 Pave Reg Lvl Inside Gtl
## 8 190 RL 7420 Pave Reg Lvl Corner Gtl
## 9 20 RL 11200 Pave Reg Lvl Inside Gtl
## 10 60 RL 11924 Pave IR1 Lvl Inside Gtl
## 11 20 RL 12968 Pave IR2 Lvl Inside Gtl
## 12 20 RL 10652 Pave IR1 Lvl Inside Gtl
## 13 20 RL 10920 Pave IR1 Lvl Corner Gtl
## 14 45 RM 6120 Pave Reg Lvl Corner Gtl
## 15 90 RL 10791 Pave Reg Lvl Inside Gtl
## 16 20 RL 13695 Pave Reg Lvl Inside Gtl
## 17 60 RL 14215 Pave IR1 Lvl Corner Gtl
## 18 45 RM 7449 Pave Reg Bnk Inside Gtl
## 19 20 RL 9742 Pave Reg Lvl Inside Gtl
## 20 120 RM 4224 Pave Reg Lvl Inside Gtl
## Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual
## 1 CollgCr Norm Norm 1Fam 2Story 7
## 2 Veenker Feedr Norm 1Fam 1Story 6
## 3 CollgCr Norm Norm 1Fam 2Story 7
## 4 Crawfor Norm Norm 1Fam 2Story 7
## 5 NoRidge Norm Norm 1Fam 2Story 8
## 6 Somerst Norm Norm 1Fam 1Story 8
## 7 OldTown Artery Norm 1Fam 1.5Fin 7
## 8 BrkSide Artery Artery 2fmCon 1.5Unf 5
## 9 Sawyer Norm Norm 1Fam 1Story 5
## 10 NridgHt Norm Norm 1Fam 2Story 9
## 11 Sawyer Norm Norm 1Fam 1Story 5
## 12 CollgCr Norm Norm 1Fam 1Story 7
## 13 NAmes Norm Norm 1Fam 1Story 6
## 14 BrkSide Norm Norm 1Fam 1.5Unf 7
## 15 Sawyer Norm Norm Duplex 1Story 4
## 16 SawyerW RRAe Norm 1Fam 1Story 5
## 17 NridgHt Norm Norm 1Fam 2Story 8
## 18 IDOTRR Norm Norm 1Fam 1.5Unf 7
## 19 CollgCr Norm Norm 1Fam 1Story 8
## 20 MeadowV Norm Norm TwnhsE 1Story 5
## OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st
## 1 5 2003 2003 Gable CompShg VinylSd
## 2 8 1976 1976 Gable CompShg MetalSd
## 3 5 2001 2002 Gable CompShg VinylSd
## 4 5 1915 1970 Gable CompShg Wd Sdng
## 5 5 2000 2000 Gable CompShg VinylSd
## 6 5 2004 2005 Gable CompShg VinylSd
## 7 5 1931 1950 Gable CompShg BrkFace
## 8 6 1939 1950 Gable CompShg MetalSd
## 9 5 1965 1965 Hip CompShg HdBoard
## 10 5 2005 2006 Hip CompShg WdShing
## 11 6 1962 1962 Hip CompShg HdBoard
## 12 5 2006 2007 Gable CompShg VinylSd
## 13 5 1960 1960 Hip CompShg MetalSd
## 14 8 1929 2001 Gable CompShg Wd Sdng
## 15 5 1967 1967 Gable CompShg MetalSd
## 16 5 2004 2004 Gable CompShg VinylSd
## 17 5 2005 2006 Gable CompShg VinylSd
## 18 7 1930 1950 Gable CompShg Wd Sdng
## 19 5 2002 2002 Hip CompShg VinylSd
## 20 7 1976 1976 Gable CompShg CemntBd
## Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtFinSF1
## 1 VinylSd BrkFace 196 Gd TA PConc 706
## 2 MetalSd None 0 TA TA CBlock 978
## 3 VinylSd BrkFace 162 Gd TA PConc 486
## 4 Wd Shng None 0 TA TA BrkTil 216
## 5 VinylSd BrkFace 350 Gd TA PConc 655
## 6 VinylSd Stone 186 Gd TA PConc 1369
## 7 Wd Shng None 0 TA TA BrkTil 0
## 8 MetalSd None 0 TA TA BrkTil 851
## 9 HdBoard None 0 TA TA CBlock 906
## 10 Wd Shng Stone 286 Ex TA PConc 998
## 11 Plywood None 0 TA TA CBlock 737
## 12 VinylSd Stone 306 Gd TA PConc 0
## 13 MetalSd BrkFace 212 TA TA CBlock 733
## 14 Wd Sdng None 0 TA TA BrkTil 0
## 15 MetalSd None 0 TA TA Slab 0
## 16 VinylSd None 0 TA TA PConc 646
## 17 VinylSd BrkFace 380 Gd TA PConc 0
## 18 Wd Sdng None 0 TA TA PConc 0
## 19 VinylSd BrkFace 281 Gd TA PConc 0
## 20 CmentBd None 0 TA TA PConc 840
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 317 1686 GasA Ex Y SBrkr
## 7 0 952 952 GasA Gd Y FuseF
## 8 0 140 991 GasA Ex Y SBrkr
## 9 0 134 1040 GasA Ex Y SBrkr
## 10 0 177 1175 GasA Ex Y SBrkr
## 11 0 175 912 GasA TA Y SBrkr
## 12 0 1494 1494 GasA Ex Y SBrkr
## 13 0 520 1253 GasA TA Y SBrkr
## 14 0 832 832 GasA Ex Y FuseA
## 15 0 0 0 GasA TA Y SBrkr
## 16 0 468 1114 GasA Ex Y SBrkr
## 17 0 1158 1158 GasA Ex Y SBrkr
## 18 0 637 637 GasA Ex Y FuseF
## 19 0 1777 1777 GasA Ex Y SBrkr
## 20 0 200 1040 GasA TA Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## 1 856 854 0 1710 1 0
## 2 1262 0 0 1262 0 1
## 3 920 866 0 1786 1 0
## 4 961 756 0 1717 1 0
## 5 1145 1053 0 2198 1 0
## 6 1694 0 0 1694 1 0
## 7 1022 752 0 1774 0 0
## 8 1077 0 0 1077 1 0
## 9 1040 0 0 1040 1 0
## 10 1182 1142 0 2324 1 0
## 11 912 0 0 912 1 0
## 12 1494 0 0 1494 0 0
## 13 1253 0 0 1253 1 0
## 14 854 0 0 854 0 0
## 15 1296 0 0 1296 0 0
## 16 1114 0 0 1114 1 0
## 17 1158 1218 0 2376 0 0
## 18 1108 0 0 1108 0 0
## 19 1795 0 0 1795 0 0
## 20 1060 0 0 1060 1 0
## FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 1 2 1 3 1 Gd 8
## 2 2 0 3 1 TA 6
## 3 2 1 3 1 Gd 6
## 4 1 0 3 1 Gd 7
## 5 2 1 4 1 Gd 9
## 6 2 0 3 1 Gd 7
## 7 2 0 2 2 TA 8
## 8 1 0 2 2 TA 5
## 9 1 0 3 1 TA 5
## 10 3 0 4 1 Ex 11
## 11 1 0 2 1 TA 4
## 12 2 0 3 1 Gd 7
## 13 1 1 2 1 TA 5
## 14 1 0 2 1 TA 5
## 15 2 0 2 2 TA 6
## 16 1 1 3 1 Gd 6
## 17 3 1 4 1 Gd 9
## 18 1 0 3 1 Gd 6
## 19 2 0 3 1 Gd 7
## 20 1 0 3 1 TA 6
## Functional Fireplaces GarageCars GarageArea PavedDrive WoodDeckSF
## 1 Typ 0 2 548 Y 0
## 2 Typ 1 2 460 Y 298
## 3 Typ 1 2 608 Y 0
## 4 Typ 1 3 642 Y 0
## 5 Typ 1 3 836 Y 192
## 6 Typ 1 2 636 Y 255
## 7 Min1 2 2 468 Y 90
## 8 Typ 2 1 205 Y 0
## 9 Typ 0 1 384 Y 0
## 10 Typ 2 3 736 Y 147
## 11 Typ 0 1 352 Y 140
## 12 Typ 1 3 840 Y 160
## 13 Typ 1 1 352 Y 0
## 14 Typ 0 2 576 Y 48
## 15 Typ 0 2 516 Y 0
## 16 Typ 0 2 576 Y 0
## 17 Typ 1 3 853 Y 240
## 18 Typ 1 1 280 N 0
## 19 Typ 1 2 534 Y 171
## 20 Typ 1 2 572 Y 100
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea MiscVal MoSold
## 1 61 0 0 0 0 0 2
## 2 0 0 0 0 0 0 5
## 3 42 0 0 0 0 0 9
## 4 35 272 0 0 0 0 2
## 5 84 0 0 0 0 0 12
## 6 57 0 0 0 0 0 8
## 7 0 205 0 0 0 0 4
## 8 4 0 0 0 0 0 1
## 9 0 0 0 0 0 0 2
## 10 21 0 0 0 0 0 7
## 11 0 0 0 176 0 0 9
## 12 33 0 0 0 0 0 8
## 13 213 176 0 0 0 0 5
## 14 112 0 0 0 0 0 7
## 15 0 0 0 0 0 500 10
## 16 102 0 0 0 0 0 6
## 17 154 0 0 0 0 0 11
## 18 0 205 0 0 0 0 6
## 19 159 0 0 0 0 0 9
## 20 110 0 0 0 0 0 6
## YrSold SaleType SaleCondition SalePrice
## 1 2008 WD Normal 208500
## 2 2007 WD Normal 181500
## 3 2008 WD Normal 223500
## 4 2006 WD Abnorml 140000
## 5 2008 WD Normal 250000
## 6 2007 WD Normal 307000
## 7 2008 WD Abnorml 129900
## 8 2008 WD Normal 118000
## 9 2008 WD Normal 129500
## 10 2006 New Partial 345000
## 11 2008 WD Normal 144000
## 12 2007 New Partial 279500
## 13 2008 WD Normal 157000
## 14 2007 WD Normal 132000
## 15 2006 WD Normal 90000
## 16 2008 WD Normal 159000
## 17 2006 New Partial 325300
## 18 2007 WD Normal 139400
## 19 2008 WD Normal 230000
## 20 2007 WD Normal 129900
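The dropping logic above can be checked on a small case where the expected result is easy to see by eye. A minimal base R sketch on made-up data, with a threshold of 2 standing in for the 20 used on the real data (toy, threshold, to_drop, and toy_small are hypothetical names):

```r
# Toy data: Alley and SalePrice both have many NAs (values are made up)
toy <- data.frame(
  Id        = 1:5,
  Alley     = c(NA, NA, NA, "Grvl", NA),
  LotArea   = c(8450, 9600, 11250, 9550, 14260),
  SalePrice = c(208500, NA, NA, 140000, NA),
  stringsAsFactors = FALSE
)

threshold <- 2  # stands in for the 20 used on the real data

# Columns whose NA count reaches the threshold
na_counts <- colSums(is.na(toy))
too_many  <- names(na_counts)[na_counts >= threshold]

# Protect the response, then add the columns that are dropped by name
to_drop <- union(setdiff(too_many, "SalePrice"), "Id")

# Keep everything that is not in the drop list
toy_small <- toy[, !(names(toy) %in% to_drop), drop = FALSE]
names(toy_small)
```

Here SalePrice survives despite its three NAs because it is removed from the drop list explicitly, which mirrors the "except SalePrice" requirement.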
Conduct a multiple linear regression on all variables. Set SalePrice as the response and store the results in regBefore2009. Then, print the summary of regBefore2009 to verify that your code works. Tip: The formula for regression is lm(SalePrice ~ ., data = before2009). Rubric: 4 points for setting regBefore2009 and 1 point for verification.
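The `.` in `SalePrice ~ .` pulls in every remaining column, and lm() expands each factor into dummy variables. When two columns carry the same information, one of the resulting coefficients cannot be estimated and is reported as NA, which summary() flags as "not defined because of singularities". A minimal sketch on made-up data (the names toy, type, is_c, and fit are hypothetical, not part of the assignment):

```r
set.seed(1)  # reproducible toy data

toy <- data.frame(
  type = factor(rep(c("a", "b", "c"), each = 4)),
  x    = rnorm(12)
)
# is_c repeats what the dummy variable for type == "c" already encodes
toy$is_c <- as.integer(toy$type == "c")
toy$y    <- 2 * toy$x + rnorm(12)

# Regress y on every other column
fit <- lm(y ~ ., data = toy)

# The redundant column gets an NA coefficient
coef(fit)
```

In the full model, a coefficient such as BldgTypeDuplex may be NA for the same kind of reason, though confirming which columns overlap would require checking the data.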
regBefore2009 <- lm(SalePrice ~., data = before2009)
summary(regBefore2009)
##
## Call:
## lm(formula = SalePrice ~ ., data = before2009)
##
## Residuals:
## Min 1Q Median 3Q Max
## -178924 -4505 -76 4002 157196
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.564e+06 9.825e+05 -2.610 0.009139 **
## MSSubClass30 -2.601e+02 2.741e+03 -0.095 0.924405
## MSSubClass40 -2.076e+02 8.694e+03 -0.024 0.980951
## MSSubClass45 6.521e+03 1.191e+04 0.547 0.584141
## MSSubClass50 3.890e+03 4.941e+03 0.787 0.431210
## MSSubClass60 1.242e+03 4.708e+03 0.264 0.791926
## MSSubClass70 3.569e+03 4.989e+03 0.715 0.474411
## MSSubClass75 6.586e+02 7.683e+03 0.086 0.931698
## MSSubClass80 -1.153e+04 7.335e+03 -1.572 0.116120
## MSSubClass85 -7.530e+03 5.808e+03 -1.296 0.195004
## MSSubClass90 -1.215e+04 4.669e+03 -2.602 0.009344 **
## MSSubClass120 -8.063e+03 7.107e+03 -1.135 0.256704
## MSSubClass150 2.546e+03 1.911e+04 0.133 0.894003
## MSSubClass160 -5.545e+03 8.866e+03 -0.625 0.531741
## MSSubClass180 -5.791e+03 9.898e+03 -0.585 0.558595
## MSSubClass190 -8.616e+03 1.274e+04 -0.676 0.499018
## MSZoningFV 4.095e+04 6.833e+03 5.993 2.51e-09 ***
## MSZoningRH 2.490e+04 6.983e+03 3.567 0.000372 ***
## MSZoningRL 3.009e+04 5.710e+03 5.269 1.55e-07 ***
## MSZoningRM 3.019e+04 5.331e+03 5.663 1.74e-08 ***
## LotArea 5.643e-01 7.679e-02 7.348 3.13e-13 ***
## StreetPave 2.965e+04 6.601e+03 4.492 7.54e-06 ***
## LotShapeIR2 4.777e+03 2.431e+03 1.965 0.049604 *
## LotShapeIR3 7.886e+03 5.038e+03 1.565 0.117678
## LotShapeReg 5.291e+02 9.504e+02 0.557 0.577754
## LandContourHLS 1.283e+04 2.912e+03 4.406 1.12e-05 ***
## LandContourLow 8.402e+02 4.041e+03 0.208 0.835341
## LandContourLvl 1.041e+04 2.160e+03 4.818 1.58e-06 ***
## LotConfigCulDSac 6.820e+03 1.939e+03 3.517 0.000449 ***
## LotConfigFR2 -7.573e+03 2.562e+03 -2.956 0.003161 **
## LotConfigFR3 -1.235e+04 5.077e+03 -2.432 0.015109 *
## LotConfigInside -3.084e+03 1.062e+03 -2.904 0.003728 **
## LandSlopeMod 1.175e+04 2.376e+03 4.943 8.46e-07 ***
## LandSlopeSev -2.130e+04 7.277e+03 -2.926 0.003476 **
## NeighborhoodBlueste -1.081e+04 9.673e+03 -1.118 0.263830
## NeighborhoodBrDale -2.852e+03 6.729e+03 -0.424 0.671753
## NeighborhoodBrkSide -1.128e+04 5.441e+03 -2.073 0.038294 *
## NeighborhoodClearCr -2.081e+04 5.730e+03 -3.631 0.000290 ***
## NeighborhoodCollgCr -1.806e+04 4.275e+03 -4.224 2.53e-05 ***
## NeighborhoodCrawfor 3.828e+03 4.936e+03 0.775 0.438221
## NeighborhoodEdwards -2.614e+04 4.686e+03 -5.578 2.83e-08 ***
## NeighborhoodGilbert -1.959e+04 4.545e+03 -4.312 1.71e-05 ***
## NeighborhoodIDOTRR -1.865e+04 5.904e+03 -3.159 0.001613 **
## NeighborhoodMeadowV -2.288e+04 6.805e+03 -3.362 0.000792 ***
## NeighborhoodMitchel -2.956e+04 4.753e+03 -6.220 6.26e-10 ***
## NeighborhoodNAmes -2.254e+04 4.574e+03 -4.929 9.09e-07 ***
## NeighborhoodNoRidge 1.408e+04 5.029e+03 2.799 0.005181 **
## NeighborhoodNPkVill 5.527e+03 1.022e+04 0.541 0.588839
## NeighborhoodNridgHt 1.329e+04 4.439e+03 2.994 0.002796 **
## NeighborhoodNWAmes -2.666e+04 4.712e+03 -5.659 1.79e-08 ***
## NeighborhoodOldTown -2.330e+04 5.415e+03 -4.303 1.78e-05 ***
## NeighborhoodSawyer -1.786e+04 4.745e+03 -3.763 0.000174 ***
## NeighborhoodSawyerW -1.319e+04 4.649e+03 -2.838 0.004596 **
## NeighborhoodSomerst -1.647e+04 5.199e+03 -3.169 0.001559 **
## NeighborhoodStoneBr 2.869e+04 5.073e+03 5.656 1.81e-08 ***
## NeighborhoodSWISU -1.355e+04 5.847e+03 -2.318 0.020557 *
## NeighborhoodTimber -1.265e+04 4.795e+03 -2.638 0.008427 **
## NeighborhoodVeenker -4.591e+03 5.740e+03 -0.800 0.423928
## Condition1Feedr 2.806e+03 2.954e+03 0.950 0.342241
## Condition1Norm 1.340e+04 2.470e+03 5.425 6.62e-08 ***
## Condition1PosA 1.274e+04 5.569e+03 2.288 0.022259 *
## Condition1PosN 7.585e+03 4.569e+03 1.660 0.097089 .
## Condition1RRAe -1.425e+04 4.652e+03 -3.063 0.002226 **
## Condition1RRAn 1.101e+04 3.895e+03 2.826 0.004765 **
## Condition1RRNe -1.495e+03 8.783e+03 -0.170 0.864886
## Condition1RRNn 6.815e+03 9.224e+03 0.739 0.460134
## Condition2Feedr -8.786e+03 1.040e+04 -0.845 0.398354
## Condition2Norm -3.235e+03 9.104e+03 -0.355 0.722366
## Condition2PosA -7.289e+03 1.437e+04 -0.507 0.612077
## Condition2PosN -2.439e+05 1.368e+04 -17.834 < 2e-16 ***
## Condition2RRAe -1.074e+05 2.368e+04 -4.537 6.11e-06 ***
## Condition2RRAn -7.330e+03 1.870e+04 -0.392 0.695047
## Condition2RRNn -3.959e+02 1.467e+04 -0.027 0.978476
## BldgType2fmCon -7.442e+02 1.277e+04 -0.058 0.953546
## BldgTypeDuplex NA NA NA NA
## BldgTypeTwnhs -1.811e+04 7.659e+03 -2.365 0.018163 *
## BldgTypeTwnhsE -1.752e+04 7.082e+03 -2.474 0.013460 *
## HouseStyle1.5Unf 7.953e+03 1.114e+04 0.714 0.475592
## HouseStyle1Story 1.086e+04 4.960e+03 2.190 0.028633 *
## HouseStyle2.5Fin -1.193e+04 1.025e+04 -1.164 0.244636
## HouseStyle2.5Unf -8.651e+03 7.330e+03 -1.180 0.238061
## HouseStyle2Story -6.055e+03 4.809e+03 -1.259 0.208150
## HouseStyleSFoyer 1.538e+04 6.497e+03 2.367 0.018057 *
## HouseStyleSLvl 1.827e+04 7.821e+03 2.336 0.019626 *
## OverallQual2 2.961e+04 1.988e+04 1.490 0.136484
## OverallQual3 3.496e+04 1.859e+04 1.880 0.060248 .
## OverallQual4 3.600e+04 1.847e+04 1.949 0.051465 .
## OverallQual5 4.016e+04 1.853e+04 2.167 0.030388 *
## OverallQual6 4.530e+04 1.858e+04 2.438 0.014872 *
## OverallQual7 5.296e+04 1.860e+04 2.847 0.004468 **
## OverallQual8 6.663e+04 1.867e+04 3.569 0.000368 ***
## OverallQual9 8.783e+04 1.890e+04 4.647 3.63e-06 ***
## OverallQual10 1.369e+05 1.948e+04 7.028 3.03e-12 ***
## OverallCond2 1.315e+04 2.332e+04 0.564 0.572872
## OverallCond3 2.004e+04 1.342e+04 1.494 0.135405
## OverallCond4 2.603e+04 1.336e+04 1.948 0.051607 .
## OverallCond5 3.390e+04 1.336e+04 2.536 0.011290 *
## OverallCond6 4.023e+04 1.342e+04 2.998 0.002759 **
## OverallCond7 4.572e+04 1.345e+04 3.400 0.000690 ***
## OverallCond8 5.177e+04 1.350e+04 3.836 0.000130 ***
## OverallCond9 6.002e+04 1.400e+04 4.288 1.90e-05 ***
## YearBuilt 3.595e+02 4.731e+01 7.599 4.92e-14 ***
## YearRemodAdd 1.055e+02 3.212e+01 3.286 0.001038 **
## RoofStyleGable -4.815e+03 8.656e+03 -0.556 0.578143
## RoofStyleGambrel -2.186e+03 9.745e+03 -0.224 0.822557
## RoofStyleHip -3.696e+03 8.704e+03 -0.425 0.671165
## RoofStyleMansard 5.986e+03 1.100e+04 0.544 0.586339
## RoofStyleShed 7.902e+04 1.515e+04 5.215 2.06e-07 ***
## RoofMatlCompShg 6.617e+05 2.022e+04 32.718 < 2e-16 ***
## RoofMatlMembran 7.396e+05 2.887e+04 25.620 < 2e-16 ***
## RoofMatlMetal 6.992e+05 2.872e+04 24.346 < 2e-16 ***
## RoofMatlRoll 6.538e+05 2.634e+04 24.818 < 2e-16 ***
## RoofMatlTar&Grv 6.672e+05 2.175e+04 30.676 < 2e-16 ***
## RoofMatlWdShake 6.437e+05 2.152e+04 29.911 < 2e-16 ***
## RoofMatlWdShngl 7.432e+05 2.132e+04 34.854 < 2e-16 ***
## Exterior1stAsphShn -1.850e+04 2.276e+04 -0.813 0.416323
## Exterior1stBrkComm -7.338e+03 1.387e+04 -0.529 0.596760
## Exterior1stBrkFace 7.740e+03 7.342e+03 1.054 0.291930
## Exterior1stCemntBd -1.151e+04 1.224e+04 -0.940 0.347353
## Exterior1stHdBoard -1.014e+04 7.086e+03 -1.431 0.152611
## Exterior1stImStucc -6.986e+04 1.799e+04 -3.884 0.000107 ***
## Exterior1stMetalSd 1.209e+03 7.963e+03 0.152 0.879352
## Exterior1stPlywood -1.547e+04 6.952e+03 -2.226 0.026170 *
## Exterior1stStone -2.641e+04 1.549e+04 -1.705 0.088463 .
## Exterior1stStucco -4.405e+03 8.089e+03 -0.544 0.586173
## Exterior1stVinylSd -1.611e+04 8.066e+03 -1.997 0.045966 *
## Exterior1stWd Sdng -8.627e+03 6.955e+03 -1.240 0.215021
## Exterior1stWdShing -3.086e+03 7.381e+03 -0.418 0.675971
## Exterior2ndAsphShn 2.238e+03 1.431e+04 0.156 0.875675
## Exterior2ndBrk Cmn 4.508e+03 1.355e+04 0.333 0.739463
## Exterior2ndBrkFace -3.269e+02 8.265e+03 -0.040 0.968454
## Exterior2ndCmentBd 1.024e+04 1.261e+04 0.812 0.417099
## Exterior2ndHdBoard 1.608e+03 7.571e+03 0.212 0.831880
## Exterior2ndImStucc 3.464e+04 8.906e+03 3.890 0.000104 ***
## Exterior2ndMetalSd -3.830e+03 8.357e+03 -0.458 0.646778
## Exterior2ndOther -1.006e+04 1.825e+04 -0.551 0.581676
## Exterior2ndPlywood 3.119e+03 7.283e+03 0.428 0.668547
## Exterior2ndStone 1.465e+04 1.425e+04 1.029 0.303846
## Exterior2ndStucco -2.973e+03 8.510e+03 -0.349 0.726834
## Exterior2ndVinylSd 1.104e+04 8.440e+03 1.308 0.191016
## Exterior2ndWd Sdng 4.350e+03 7.496e+03 0.580 0.561750
## Exterior2ndWd Shng -4.197e+03 7.836e+03 -0.536 0.592332
## MasVnrTypeBrkFace 8.602e+03 3.955e+03 2.175 0.029777 *
## MasVnrTypeNone 1.164e+04 3.963e+03 2.938 0.003349 **
## MasVnrTypeStone 1.311e+04 4.195e+03 3.125 0.001807 **
## MasVnrArea 1.884e+01 3.395e+00 5.550 3.32e-08 ***
## ExterQualFa 1.336e+04 6.288e+03 2.125 0.033726 *
## ExterQualGd -8.024e+03 3.085e+03 -2.601 0.009375 **
## ExterQualTA -1.003e+04 3.394e+03 -2.956 0.003160 **
## ExterCondFa -3.499e+03 7.066e+03 -0.495 0.620488
## ExterCondGd -1.037e+04 6.377e+03 -1.625 0.104247
## ExterCondTA -6.483e+03 6.373e+03 -1.017 0.309186
## FoundationCBlock 1.819e+03 1.801e+03 1.010 0.312470
## FoundationPConc 6.034e+03 1.964e+03 3.072 0.002159 **
## FoundationSlab 6.341e+03 4.468e+03 1.419 0.156047
## FoundationStone 4.014e+03 7.200e+03 0.557 0.577301
## FoundationWood -2.481e+04 1.172e+04 -2.116 0.034456 *
## BsmtFinSF1 3.360e+01 2.392e+00 14.048 < 2e-16 ***
## BsmtFinSF2 2.188e+01 3.201e+00 6.835 1.14e-11 ***
## BsmtUnfSF 1.270e+01 2.228e+00 5.698 1.43e-08 ***
## TotalBsmtSF NA NA NA NA
## HeatingGasA 2.343e+03 1.677e+04 0.140 0.888850
## HeatingGasW -6.304e+03 1.730e+04 -0.364 0.715552
## HeatingGrav -3.139e+03 1.894e+04 -0.166 0.868428
## HeatingOthW -2.839e+04 2.062e+04 -1.377 0.168675
## HeatingWall 4.965e+03 2.129e+04 0.233 0.815610
## HeatingQCFa -1.928e+03 2.706e+03 -0.712 0.476280
## HeatingQCGd -3.737e+03 1.183e+03 -3.160 0.001606 **
## HeatingQCPo 1.191e+04 1.245e+04 0.956 0.339135
## HeatingQCTA -3.674e+03 1.197e+03 -3.069 0.002180 **
## CentralAirY -3.873e+02 2.091e+03 -0.185 0.853039
## ElectricalFuseF -2.989e+03 3.552e+03 -0.842 0.400166
## ElectricalFuseP -9.314e+03 7.068e+03 -1.318 0.187772
## ElectricalMix 9.603e+03 2.643e+04 0.363 0.716405
## ElectricalSBrkr -1.970e+03 1.715e+03 -1.148 0.251012
## X1stFlrSF 5.350e+01 2.858e+00 18.718 < 2e-16 ***
## X2ndFlrSF 6.487e+01 3.035e+00 21.375 < 2e-16 ***
## LowQualFinSF 1.211e+01 1.099e+01 1.102 0.270735
## GrLivArea NA NA NA NA
## BsmtFullBath 2.144e+03 1.114e+03 1.925 0.054395 .
## BsmtHalfBath 8.477e+02 1.648e+03 0.514 0.607154
## FullBath 3.703e+03 1.295e+03 2.860 0.004292 **
## HalfBath -7.771e+01 1.241e+03 -0.063 0.950057
## BedroomAbvGr -3.233e+03 7.977e+02 -4.053 5.28e-05 ***
## KitchenAbvGr -6.874e+03 4.161e+03 -1.652 0.098739 .
## KitchenQualFa -1.566e+04 3.644e+03 -4.296 1.84e-05 ***
## KitchenQualGd -2.059e+04 2.131e+03 -9.663 < 2e-16 ***
## KitchenQualTA -1.784e+04 2.372e+03 -7.523 8.70e-14 ***
## TotRmsAbvGrd 7.895e+02 5.523e+02 1.430 0.153012
## FunctionalMaj2 -5.674e+03 9.442e+03 -0.601 0.547938
## FunctionalMin1 5.206e+03 5.666e+03 0.919 0.358376
## FunctionalMin2 6.318e+03 5.832e+03 1.083 0.278778
## FunctionalMod -5.581e+03 6.326e+03 -0.882 0.377781
## FunctionalSev -5.304e+04 1.823e+04 -2.910 0.003666 **
## FunctionalTyp 1.758e+04 5.075e+03 3.465 0.000544 ***
## Fireplaces 4.200e+03 7.970e+02 5.270 1.54e-07 ***
## GarageCars 2.574e+03 1.278e+03 2.015 0.044109 *
## GarageArea 1.710e+01 4.373e+00 3.910 9.61e-05 ***
## PavedDriveP -3.467e+03 2.994e+03 -1.158 0.247036
## PavedDriveY -2.394e+03 1.905e+03 -1.257 0.208887
## WoodDeckSF 1.471e+01 3.400e+00 4.325 1.61e-05 ***
## OpenPorchSF 1.655e+01 6.399e+00 2.586 0.009807 **
## EnclosedPorch 6.099e+00 6.780e+00 0.900 0.368493
## X3SsnPorch 5.566e+01 1.690e+01 3.293 0.001012 **
## ScreenPorch 2.423e+01 7.027e+00 3.449 0.000577 ***
## PoolArea 6.602e+01 9.534e+00 6.924 6.20e-12 ***
## MiscVal -5.596e-01 6.572e-01 -0.852 0.394563
## MoSold -5.360e+02 1.417e+02 -3.784 0.000160 ***
## YrSold 4.488e+02 4.868e+02 0.922 0.356638
## SaleTypeCon 3.738e+04 9.712e+03 3.849 0.000123 ***
## SaleTypeConLD 1.265e+04 5.268e+03 2.402 0.016432 *
## SaleTypeConLI -5.307e+03 9.946e+03 -0.534 0.593718
## SaleTypeConLw -5.056e+02 8.633e+03 -0.059 0.953308
## SaleTypeCWD 2.103e+04 5.336e+03 3.942 8.40e-05 ***
## SaleTypeNew 1.496e+04 8.815e+03 1.697 0.089953 .
## SaleTypeOth 1.122e+04 9.702e+03 1.156 0.247642
## SaleTypeWD -1.063e+03 2.538e+03 -0.419 0.675428
## SaleConditionAdjLand 9.288e+03 5.828e+03 1.594 0.111226
## SaleConditionAlloca 6.959e+03 6.002e+03 1.160 0.246397
## SaleConditionFamily -2.638e+02 3.233e+03 -0.082 0.934961
## SaleConditionNormal 4.383e+03 1.704e+03 2.573 0.010165 *
## SaleConditionPartial 5.334e+03 8.442e+03 0.632 0.527558
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15630 on 1684 degrees of freedom
## (30 observations deleted due to missingness)
## Multiple R-squared: 0.966, Adjusted R-squared: 0.9616
## F-statistic: 219.2 on 218 and 1684 DF, p-value: < 2.2e-16
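The NA rows in the coefficient table above (BldgTypeDuplex, TotalBsmtSF, GrLivArea) are what lm() reports when a predictor is an exact linear combination of other predictors. In the rows printed by head(), TotalBsmtSF equals BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF and GrLivArea equals X1stFlrSF + X2ndFlrSF + LowQualFinSF, which is consistent with that explanation. The sketch below reproduces the behaviour on simulated data; the names a, b, total and y are hypothetical and do not come from the housing data.

```r
# Simulated data (hypothetical names): `total` is an exact sum of `a` and `b`
set.seed(1)
n <- 50
a <- rnorm(n)
b <- rnorm(n)
total <- a + b
y <- 2 * a - b + rnorm(n)

# lm() cannot estimate a separate effect for `total`, so it reports NA
fit <- lm(y ~ a + b + total)
coef(fit)

# alias() lists which coefficients are linear combinations of the others
alias(fit)$Complete
```

Dropping one of the redundant columns from the formula removes the NA row without changing the fitted values.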
Using the result of this and your general understanding of which variables should be important in determining SalePrice, choose a maximum of 15 variables, fit a smaller regression, and call it regBefore2009optimal. Then print the summary of regBefore2009optimal to verify that your code works. Tip: Normally you would do a more detailed variable selection using a backward or stepwise selection approach, but this is NOT required for this question. Tip: This is the formula for regression: lm(SalePrice ~ var1 + var2 + … + varN, data = before2009), where var1, etc. are the variables of your choice. Tip: Pick the variables with the lowest Pr(>|t|). Rubric: 8 points for setting regBefore2009optimal and 1 point for verification.
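# Rank the coefficients of the full model by p-value (smallest first)
summary_reg <- summary(regBefore2009)
coef_df <- as.data.frame(summary_reg$coefficients)
coef_df_pval <- coef_df[order(coef_df[, "Pr(>|t|)"]), ]
# Names of the 40 coefficients with the smallest p-values
top_var <- rownames(coef_df_pval)[1:40]
top_var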
## [1] "RoofMatlWdShngl" "RoofMatlCompShg" "RoofMatlTar&Grv"
## [4] "RoofMatlWdShake" "RoofMatlMembran" "RoofMatlRoll"
## [7] "RoofMatlMetal" "X2ndFlrSF" "X1stFlrSF"
## [10] "Condition2PosN" "BsmtFinSF1" "KitchenQualGd"
## [13] "YearBuilt" "KitchenQualTA" "LotArea"
## [16] "OverallQual10" "PoolArea" "BsmtFinSF2"
## [19] "NeighborhoodMitchel" "MSZoningFV" "BsmtUnfSF"
## [22] "MSZoningRM" "NeighborhoodNWAmes" "NeighborhoodStoneBr"
## [25] "NeighborhoodEdwards" "MasVnrArea" "Condition1Norm"
## [28] "Fireplaces" "MSZoningRL" "RoofStyleShed"
## [31] "LandSlopeMod" "NeighborhoodNAmes" "LandContourLvl"
## [34] "OverallQual9" "Condition2RRAe" "StreetPave"
## [37] "LandContourHLS" "WoodDeckSF" "NeighborhoodGilbert"
## [40] "NeighborhoodOldTown"
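The names listed above are coefficient names, so a factor such as RoofMatl appears once per dummy level rather than once per variable. One way to map each coefficient back to the variable it belongs to is the "assign" attribute of the model matrix. The sketch below uses the built-in mtcars data as a stand-in; the model and the object names (term_idx, term_names, ranked) are illustrative assumptions, not part of the assignment.

```r
# Stand-in model with two factors and one numeric predictor
fit <- lm(mpg ~ factor(cyl) + wt + factor(gear), data = mtcars)
coef_tab <- summary(fit)$coefficients

# "assign" gives, for each model-matrix column, the index of its term
# (0 is the intercept, 1 is the first term in the formula, and so on)
mm <- model.matrix(fit)
term_idx <- attr(mm, "assign")
term_names <- c("(Intercept)", attr(terms(fit), "term.labels"))[term_idx + 1]
names(term_names) <- colnames(mm)

# Coefficients ranked by p-value, each labelled with its source variable
ranked <- data.frame(
  coefficient = rownames(coef_tab),
  term = unname(term_names[rownames(coef_tab)]),
  p_value = coef_tab[, "Pr(>|t|)"]
)
ranked <- ranked[order(ranked$p_value), ]

# Distinct variables in order of their most significant coefficient
unique(ranked$term[ranked$term != "(Intercept)"])
```

Reading the variables off this list avoids counting the same factor several times when picking at most 15 of them.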
regBefore2009optimal <- lm(SalePrice ~ RoofMatl + LandSlope + BsmtUnfSF + OverallQual +
  Condition2 + MSZoning + Neighborhood + LotArea + OverallCond +
  Foundation + BedroomAbvGr + EnclosedPorch + BsmtFinSF1 + BsmtFinSF2 +
  MasVnrType, data = before2009)
summary(regBefore2009optimal)
##
## Call:
## lm(formula = SalePrice ~ RoofMatl + LandSlope + BsmtUnfSF + OverallQual +
## Condition2 + MSZoning + Neighborhood + LotArea + OverallCond +
## Foundation + BedroomAbvGr + EnclosedPorch + BsmtFinSF1 +
## BsmtFinSF2 + MasVnrType, data = before2009)
##
## Residuals:
## Min 1Q Median 3Q Max
## -126988 -15520 -1473 13956 187597
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.723e+05 5.381e+04 -12.493 < 2e-16 ***
## RoofMatlCompShg 5.839e+05 3.168e+04 18.433 < 2e-16 ***
## RoofMatlMembran 6.160e+05 4.481e+04 13.747 < 2e-16 ***
## RoofMatlMetal 6.445e+05 4.510e+04 14.292 < 2e-16 ***
## RoofMatlRoll 5.786e+05 4.287e+04 13.495 < 2e-16 ***
## RoofMatlTar&Grv 5.862e+05 3.250e+04 18.040 < 2e-16 ***
## RoofMatlWdShake 5.870e+05 3.362e+04 17.460 < 2e-16 ***
## RoofMatlWdShngl 6.633e+05 3.346e+04 19.825 < 2e-16 ***
## LandSlopeMod 1.080e+04 3.589e+03 3.009 0.002654 **
## LandSlopeSev -4.146e+04 1.213e+04 -3.418 0.000645 ***
## BsmtUnfSF 2.807e+01 2.443e+00 11.488 < 2e-16 ***
## OverallQual2 4.370e+04 3.308e+04 1.321 0.186691
## OverallQual3 3.858e+04 3.060e+04 1.261 0.207594
## OverallQual4 4.067e+04 3.048e+04 1.334 0.182334
## OverallQual5 4.746e+04 3.054e+04 1.554 0.120390
## OverallQual6 6.640e+04 3.062e+04 2.169 0.030228 *
## OverallQual7 9.085e+04 3.065e+04 2.964 0.003074 **
## OverallQual8 1.245e+05 3.070e+04 4.054 5.24e-05 ***
## OverallQual9 1.729e+05 3.092e+04 5.594 2.56e-08 ***
## OverallQual10 2.870e+05 3.162e+04 9.079 < 2e-16 ***
## Condition2Feedr -7.713e+03 1.734e+04 -0.445 0.656452
## Condition2Norm 1.214e+03 1.464e+04 0.083 0.933906
## Condition2PosA -4.793e+04 2.348e+04 -2.041 0.041388 *
## Condition2PosN -2.343e+05 2.270e+04 -10.320 < 2e-16 ***
## Condition2RRAe 2.725e+04 3.238e+04 0.842 0.400075
## Condition2RRAn -2.672e+04 3.244e+04 -0.824 0.410288
## Condition2RRNn 1.542e+03 2.498e+04 0.062 0.950802
## MSZoningFV 4.900e+04 1.121e+04 4.373 1.30e-05 ***
## MSZoningRH 2.496e+04 1.181e+04 2.113 0.034729 *
## MSZoningRL 4.049e+04 9.214e+03 4.394 1.18e-05 ***
## MSZoningRM 3.501e+04 8.642e+03 4.052 5.30e-05 ***
## NeighborhoodBlueste -2.377e+04 1.643e+04 -1.447 0.148186
## NeighborhoodBrDale -3.148e+04 9.979e+03 -3.155 0.001634 **
## NeighborhoodBrkSide -1.670e+04 8.244e+03 -2.025 0.042981 *
## NeighborhoodClearCr -1.231e+04 9.202e+03 -1.338 0.181206
## NeighborhoodCollgCr -9.268e+03 6.955e+03 -1.333 0.182795
## NeighborhoodCrawfor 8.885e+03 7.769e+03 1.144 0.252898
## NeighborhoodEdwards -3.248e+04 7.502e+03 -4.329 1.58e-05 ***
## NeighborhoodGilbert -3.688e+03 7.360e+03 -0.501 0.616348
## NeighborhoodIDOTRR -2.445e+04 8.843e+03 -2.765 0.005757 **
## NeighborhoodMeadowV -3.192e+04 9.947e+03 -3.209 0.001355 **
## NeighborhoodMitchel -3.232e+04 7.699e+03 -4.197 2.83e-05 ***
## NeighborhoodNAmes -2.768e+04 7.232e+03 -3.827 0.000134 ***
## NeighborhoodNoRidge 4.475e+04 8.059e+03 5.553 3.22e-08 ***
## NeighborhoodNPkVill -2.614e+04 1.190e+04 -2.196 0.028248 *
## NeighborhoodNridgHt 2.934e+04 7.500e+03 3.912 9.49e-05 ***
## NeighborhoodNWAmes -2.052e+04 7.602e+03 -2.699 0.007009 **
## NeighborhoodOldTown -2.335e+04 8.078e+03 -2.891 0.003889 **
## NeighborhoodSawyer -3.012e+04 7.633e+03 -3.946 8.23e-05 ***
## NeighborhoodSawyerW -8.314e+03 7.599e+03 -1.094 0.274112
## NeighborhoodSomerst -8.833e+03 8.598e+03 -1.027 0.304385
## NeighborhoodStoneBr 3.185e+04 8.446e+03 3.770 0.000168 ***
## NeighborhoodSWISU -2.465e+04 9.145e+03 -2.696 0.007091 **
## NeighborhoodTimber -6.031e+03 7.992e+03 -0.755 0.450570
## NeighborhoodVeenker 1.021e+04 9.536e+03 1.071 0.284485
## LotArea 1.321e+00 1.176e-01 11.231 < 2e-16 ***
## OverallCond2 1.281e+04 2.991e+04 0.428 0.668375
## OverallCond3 2.431e+04 2.223e+04 1.094 0.274248
## OverallCond4 3.282e+04 2.204e+04 1.489 0.136691
## OverallCond5 3.912e+04 2.199e+04 1.779 0.075474 .
## OverallCond6 4.538e+04 2.203e+04 2.060 0.039556 *
## OverallCond7 5.190e+04 2.205e+04 2.354 0.018690 *
## OverallCond8 5.651e+04 2.213e+04 2.553 0.010746 *
## OverallCond9 7.254e+04 2.274e+04 3.190 0.001445 **
## FoundationCBlock 4.564e+03 2.831e+03 1.612 0.107069
## FoundationPConc 1.883e+04 3.082e+03 6.109 1.22e-09 ***
## FoundationSlab 3.196e+04 6.716e+03 4.759 2.10e-06 ***
## FoundationStone -8.538e+03 1.225e+04 -0.697 0.485947
## FoundationWood 7.437e+03 2.099e+04 0.354 0.723098
## BedroomAbvGr 1.307e+04 9.057e+02 14.432 < 2e-16 ***
## EnclosedPorch -1.836e+00 1.095e+01 -0.168 0.866779
## BsmtFinSF1 5.586e+01 2.483e+00 22.499 < 2e-16 ***
## BsmtFinSF2 5.029e+01 4.559e+00 11.030 < 2e-16 ***
## MasVnrTypeBrkFace 2.000e+04 6.728e+03 2.973 0.002989 **
## MasVnrTypeNone 1.593e+04 6.630e+03 2.402 0.016394 *
## MasVnrTypeStone 2.614e+04 7.143e+03 3.660 0.000260 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28640 on 1828 degrees of freedom
## (29 observations deleted due to missingness)
## Multiple R-squared: 0.8759, Adjusted R-squared: 0.8708
## F-statistic: 172 on 75 and 1828 DF, p-value: < 2.2e-16
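The task notes that a backward or stepwise selection would normally be used instead of picking variables by hand. As a minimal sketch of that alternative, drop1() tests each whole term (so a factor is judged once, not level by level) and step() removes terms while the AIC keeps improving. The example runs on the built-in mtcars data; the choice of predictors is arbitrary and only for illustration.

```r
# Start from a deliberately over-specified model on a built-in dataset
full <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)

# F-test for removing each term in turn from the full model
drop1(full, test = "F")

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)
```

On the housing model the same two calls would take regBefore2009 as the starting point, although with more than 200 coefficients the search can be slow.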
Display diagnostic plots of your regression. Tip: The diagnostic plots include a QQ plot, a Residuals vs Fitted Values plot, a \(\sqrt{|Standardized \; Residuals|}\) vs Fitted Values plot, and a Standardized Residuals vs Leverage plot. Do not worry if your residuals have a slight curve to them. Tip: Google “Plotting Diagnostics for Linear Models - CRAN” and don’t use any arguments for the function autoplot at this time.
#install.packages("ggfortify")
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.3.2
regBefore2009optimal %>%
autoplot()
## Warning: Removed 1904 rows containing missing values (`geom_line()`).
## Warning: Removed 7 rows containing missing values (`geom_point()`).
## Warning: Removed 14 rows containing missing values (`geom_line()`).
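The quantities behind the four diagnostic plots can also be computed directly, which makes it easier to list candidate outliers by row. The sketch below uses base R functions on the built-in mtcars data. The cut-offs (|standardized residual| > 2, leverage > 2p/n, Cook's distance > 4/n) are common rules of thumb and are assumptions here, not thresholds set by the assignment.

```r
# Stand-in model on a built-in dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(fit)
p <- length(coef(fit))

# Per-observation diagnostics that the standard plots are built from
diag_df <- data.frame(
  std_resid = rstandard(fit),      # standardized residuals
  leverage  = hatvalues(fit),      # diagonal of the hat matrix
  cooks_d   = cooks.distance(fit)  # influence on the fitted coefficients
)

# Rule-of-thumb flags (assumed thresholds, adjust as needed)
flagged <- subset(
  diag_df,
  abs(std_resid) > 2 | leverage > 2 * p / n | cooks_d > 4 / n
)
flagged

# Base-R version of the four plots: plot(fit, which = c(1, 2, 3, 5))
```

Rows that appear in flagged are candidates to inspect, not observations to delete automatically.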
Now read in the PricesAfter2009.csv data and assign it to a variable called after2009. The dataset contains house prices after 2009. Then, repeat your data manipulation operations from Q2 and Q3 on this new dataset. Drop (remove) the same unnecessary columns that you dropped in Q5. Rubric: 1 point for reading and 4 points for data manipulation.
after2009 <- read.csv("PricesAfter2009.csv")
head(after2009,20)
## X Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 6 50 RL 85 14115 Pave <NA> IR1
## 2 2 8 60 RL NA 10382 Pave <NA> IR1
## 3 3 17 20 RL NA 11241 Pave <NA> IR1
## 4 4 20 20 RL 70 7560 Pave <NA> Reg
## 5 5 25 20 RL NA 8246 Pave <NA> IR1
## 6 6 26 20 RL 110 14230 Pave <NA> Reg
## 7 7 27 20 RL 60 7200 Pave <NA> Reg
## 8 8 28 20 RL 98 11478 Pave <NA> Reg
## 9 9 34 20 RL 70 10552 Pave <NA> IR1
## 10 10 37 20 RL 112 10859 Pave <NA> Reg
## 11 11 38 20 RL 74 8532 Pave <NA> Reg
## 12 12 39 20 RL 68 7922 Pave <NA> Reg
## 13 13 46 120 RL 61 7658 Pave <NA> Reg
## 14 14 47 50 RL 48 12822 Pave <NA> IR1
## 15 15 49 190 RM 33 4456 Pave <NA> Reg
## 16 16 53 90 RM 110 8472 Grvl <NA> IR2
## 17 17 57 160 FV 24 2645 Pave Pave Reg
## 18 18 64 70 RM 50 10300 Pave <NA> IR1
## 19 19 65 60 RL NA 9375 Pave <NA> Reg
## 20 20 67 20 RL NA 19900 Pave <NA> Reg
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
## 1 Lvl AllPub Inside Gtl Mitchel Norm Norm
## 2 Lvl AllPub Corner Gtl NWAmes PosN Norm
## 3 Lvl AllPub CulDSac Gtl NAmes Norm Norm
## 4 Lvl AllPub Inside Gtl NAmes Norm Norm
## 5 Lvl AllPub Inside Gtl Sawyer Norm Norm
## 6 Lvl AllPub Corner Gtl NridgHt Norm Norm
## 7 Lvl AllPub Corner Gtl NAmes Norm Norm
## 8 Lvl AllPub Inside Gtl NridgHt Norm Norm
## 9 Lvl AllPub Inside Gtl NAmes Norm Norm
## 10 Lvl AllPub Corner Gtl CollgCr Norm Norm
## 11 Lvl AllPub Inside Gtl NAmes Norm Norm
## 12 Lvl AllPub Inside Gtl NAmes Norm Norm
## 13 Lvl AllPub Inside Gtl NridgHt Norm Norm
## 14 Lvl AllPub CulDSac Gtl Mitchel Norm Norm
## 15 Lvl AllPub Inside Gtl OldTown Norm Norm
## 16 Bnk AllPub Corner Mod IDOTRR RRNn Norm
## 17 Lvl AllPub Inside Gtl Somerst Norm Norm
## 18 Bnk AllPub Inside Gtl OldTown RRAn Feedr
## 19 Lvl AllPub Inside Gtl CollgCr Norm Norm
## 20 Lvl AllPub Inside Gtl NAmes PosA Norm
## BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
## 1 1Fam 1.5Fin 5 5 1993 1995 Gable
## 2 1Fam 2Story 7 6 1973 1973 Gable
## 3 1Fam 1Story 6 7 1970 1970 Gable
## 4 1Fam 1Story 5 6 1958 1965 Hip
## 5 1Fam 1Story 5 8 1968 2001 Gable
## 6 1Fam 1Story 8 5 2007 2007 Gable
## 7 1Fam 1Story 5 7 1951 2000 Gable
## 8 1Fam 1Story 8 5 2007 2008 Gable
## 9 1Fam 1Story 5 5 1959 1959 Hip
## 10 1Fam 1Story 5 5 1994 1995 Gable
## 11 1Fam 1Story 5 6 1954 1990 Hip
## 12 1Fam 1Story 5 7 1953 2007 Gable
## 13 TwnhsE 1Story 9 5 2005 2005 Hip
## 14 1Fam 1.5Fin 7 5 2003 2003 Gable
## 15 2fmCon 2Story 4 5 1920 2008 Gable
## 16 Duplex 1Story 5 5 1963 1963 Gable
## 17 Twnhs 2Story 8 5 1999 2000 Gable
## 18 1Fam 2Story 7 6 1921 1950 Gable
## 19 1Fam 2Story 7 5 1997 1998 Gable
## 20 1Fam 1Story 7 5 1970 1989 Gable
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
## 1 CompShg VinylSd VinylSd None 0 TA TA
## 2 CompShg HdBoard HdBoard Stone 240 TA TA
## 3 CompShg Wd Sdng Wd Sdng BrkFace 180 TA TA
## 4 CompShg BrkFace Plywood None 0 TA TA
## 5 CompShg Plywood Plywood None 0 TA Gd
## 6 CompShg VinylSd VinylSd Stone 640 Gd TA
## 7 CompShg Wd Sdng Wd Sdng None 0 TA TA
## 8 CompShg VinylSd VinylSd Stone 200 Gd TA
## 9 CompShg BrkFace BrkFace None 0 TA TA
## 10 CompShg VinylSd VinylSd None 0 TA TA
## 11 CompShg Wd Sdng Wd Sdng BrkFace 650 TA TA
## 12 CompShg VinylSd VinylSd None 0 TA Gd
## 13 CompShg MetalSd MetalSd BrkFace 412 Ex TA
## 14 CompShg VinylSd VinylSd None 0 Gd TA
## 15 CompShg MetalSd MetalSd None 0 TA TA
## 16 CompShg Wd Sdng Wd Sdng None 0 Fa TA
## 17 CompShg MetalSd MetalSd BrkFace 456 Gd TA
## 18 CompShg Stucco Stucco None 0 TA TA
## 19 CompShg VinylSd VinylSd BrkFace 573 TA TA
## 20 CompShg Plywood Plywood BrkFace 287 TA TA
## Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 1 Wood Gd TA No GLQ 732
## 2 CBlock Gd TA Mn ALQ 859
## 3 CBlock TA TA No ALQ 578
## 4 CBlock TA TA No LwQ 504
## 5 CBlock TA TA Mn Rec 188
## 6 PConc Gd TA No Unf 0
## 7 CBlock TA TA Mn BLQ 234
## 8 PConc Ex TA No GLQ 1218
## 9 CBlock TA TA No Rec 1018
## 10 PConc Gd TA No Unf 0
## 11 CBlock TA TA No Rec 1213
## 12 CBlock TA TA No GLQ 731
## 13 PConc Ex TA No GLQ 456
## 14 PConc Ex TA No GLQ 1351
## 15 BrkTil TA TA No Unf 0
## 16 CBlock Gd TA Gd LwQ 104
## 17 PConc Gd TA No GLQ 649
## 18 BrkTil TA TA No Unf 0
## 19 PConc Gd TA No GLQ 739
## 20 CBlock Gd TA Gd GLQ 912
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## 1 Unf 0 64 796 GasA Ex Y
## 2 BLQ 32 216 1107 GasA Ex Y
## 3 Unf 0 426 1004 GasA Ex Y
## 4 Unf 0 525 1029 GasA TA Y
## 5 ALQ 668 204 1060 GasA Ex Y
## 6 Unf 0 1566 1566 GasA Ex Y
## 7 Rec 486 180 900 GasA TA Y
## 8 Unf 0 486 1704 GasA Ex Y
## 9 Unf 0 380 1398 GasA Gd Y
## 10 Unf 0 1097 1097 GasA Ex Y
## 11 Unf 0 84 1297 GasA Gd Y
## 12 Unf 0 326 1057 GasA TA Y
## 13 Unf 0 1296 1752 GasA Ex Y
## 14 Unf 0 83 1434 GasA Ex Y
## 15 Unf 0 736 736 GasA Gd Y
## 16 GLQ 712 0 816 GasA TA N
## 17 Unf 0 321 970 GasA Ex Y
## 18 Unf 0 576 576 GasA Gd Y
## 19 Unf 0 318 1057 GasA Ex Y
## 20 Unf 0 1035 1947 GasA TA Y
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## 1 SBrkr 796 566 0 1362 1
## 2 SBrkr 1107 983 0 2090 1
## 3 SBrkr 1004 0 0 1004 1
## 4 SBrkr 1339 0 0 1339 0
## 5 SBrkr 1060 0 0 1060 1
## 6 SBrkr 1600 0 0 1600 0
## 7 SBrkr 900 0 0 900 0
## 8 SBrkr 1704 0 0 1704 1
## 9 SBrkr 1700 0 0 1700 0
## 10 SBrkr 1097 0 0 1097 0
## 11 SBrkr 1297 0 0 1297 0
## 12 SBrkr 1057 0 0 1057 1
## 13 SBrkr 1752 0 0 1752 1
## 14 SBrkr 1518 631 0 2149 1
## 15 SBrkr 736 716 0 1452 0
## 16 SBrkr 816 0 0 816 1
## 17 SBrkr 983 756 0 1739 1
## 18 SBrkr 902 808 0 1710 0
## 19 SBrkr 1057 977 0 2034 1
## 20 SBrkr 2207 0 0 2207 1
## BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## 1 0 1 1 1 1 TA
## 2 0 2 1 3 1 TA
## 3 0 1 0 2 1 TA
## 4 0 1 0 3 1 TA
## 5 0 1 0 3 1 Gd
## 6 0 2 0 3 1 Gd
## 7 1 1 0 3 1 Gd
## 8 0 2 0 3 1 Gd
## 9 1 1 1 4 1 Gd
## 10 0 1 1 3 1 TA
## 11 1 1 0 3 1 TA
## 12 0 1 0 3 1 Gd
## 13 0 2 0 2 1 Ex
## 14 0 1 1 1 1 Gd
## 15 0 2 0 2 3 TA
## 16 0 1 0 2 1 TA
## 17 0 2 1 3 1 Gd
## 18 0 2 0 3 1 TA
## 19 0 2 1 3 1 Gd
## 20 0 2 0 3 1 TA
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 1 5 Typ 0 <NA> Attchd 1993
## 2 7 Typ 2 TA Attchd 1973
## 3 5 Typ 1 TA Attchd 1970
## 4 6 Min1 0 <NA> Attchd 1958
## 5 6 Typ 1 TA Attchd 1968
## 6 7 Typ 1 Gd Attchd 2007
## 7 5 Typ 0 <NA> Detchd 2005
## 8 7 Typ 1 Gd Attchd 2008
## 9 6 Typ 1 Gd Attchd 1959
## 10 6 Typ 0 <NA> Attchd 1995
## 11 5 Typ 1 TA Attchd 1954
## 12 5 Typ 0 <NA> Detchd 1953
## 13 6 Typ 1 Gd Attchd 2005
## 14 6 Typ 1 Ex Attchd 2003
## 15 8 Typ 0 <NA> <NA> NA
## 16 5 Typ 0 <NA> CarPort 1963
## 17 7 Typ 0 <NA> Attchd 1999
## 18 9 Typ 0 <NA> Detchd 1990
## 19 8 Typ 0 <NA> Attchd 1998
## 20 7 Min1 1 Gd Attchd 1970
## GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
## 1 Unf 2 480 TA TA Y
## 2 RFn 2 484 TA TA Y
## 3 Fin 2 480 TA TA Y
## 4 Unf 1 294 TA TA Y
## 5 Unf 1 270 TA TA Y
## 6 RFn 3 890 TA TA Y
## 7 Unf 2 576 TA TA Y
## 8 RFn 3 772 TA TA Y
## 9 RFn 2 447 TA TA Y
## 10 Unf 2 672 TA TA Y
## 11 Fin 2 498 TA TA Y
## 12 Unf 1 246 TA TA Y
## 13 RFn 2 576 TA TA Y
## 14 RFn 2 670 TA TA Y
## 15 <NA> 0 0 <NA> <NA> N
## 16 Unf 2 516 TA TA Y
## 17 Fin 2 480 TA TA Y
## 18 Unf 2 480 TA TA Y
## 19 RFn 2 645 TA TA Y
## 20 RFn 2 576 TA TA Y
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1 40 30 0 320 0 0 NA
## 2 235 204 228 0 0 0 NA
## 3 0 0 0 0 0 0 NA
## 4 0 0 0 0 0 0 NA
## 5 406 90 0 0 0 0 NA
## 6 0 56 0 0 0 0 NA
## 7 222 32 0 0 0 0 NA
## 8 0 50 0 0 0 0 NA
## 9 0 38 0 0 0 0 NA
## 10 392 64 0 0 0 0 NA
## 11 0 0 0 0 0 0 NA
## 12 0 52 0 0 0 0 NA
## 13 196 82 0 0 0 0 NA
## 14 168 43 0 0 198 0 NA
## 15 0 0 102 0 0 0 NA
## 16 106 0 0 0 0 0 NA
## 17 115 0 0 0 0 0 NA
## 18 12 11 64 0 0 0 NA
## 19 576 36 0 0 0 0 NA
## 20 301 0 0 0 0 0 NA
## Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 MnPrv Shed 700 10 2009 WD Normal 143000
## 2 <NA> Shed 350 11 2009 WD Normal 200000
## 3 <NA> Shed 700 3 2010 WD Normal 149000
## 4 MnPrv <NA> 0 5 2009 COD Abnorml 139000
## 5 MnPrv <NA> 0 5 2010 WD Normal 154000
## 6 <NA> <NA> 0 7 2009 WD Normal 256300
## 7 <NA> <NA> 0 5 2010 WD Normal 134800
## 8 <NA> <NA> 0 5 2010 WD Normal 306000
## 9 <NA> <NA> 0 4 2010 WD Normal 165500
## 10 <NA> <NA> 0 6 2009 WD Normal 145000
## 11 <NA> <NA> 0 10 2009 WD Normal 153000
## 12 <NA> <NA> 0 1 2010 WD Abnorml 109000
## 13 <NA> <NA> 0 2 2010 WD Normal 319900
## 14 <NA> <NA> 0 8 2009 WD Abnorml 239686
## 15 <NA> <NA> 0 6 2009 New Partial 113000
## 16 <NA> <NA> 0 5 2010 WD Normal 110000
## 17 <NA> <NA> 0 8 2009 WD Abnorml 172500
## 18 GdPrv <NA> 0 4 2010 WD Normal 140000
## 19 GdPrv <NA> 0 2 2009 WD Normal 219500
## 20 <NA> <NA> 0 7 2010 WD Normal 180000
str(after2009)
## 'data.frame': 986 obs. of 82 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Id : int 6 8 17 20 25 26 27 28 34 37 ...
## $ MSSubClass : int 50 60 20 20 20 20 20 20 20 20 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 85 NA NA 70 NA 110 60 98 70 112 ...
## $ LotArea : int 14115 10382 11241 7560 8246 14230 7200 11478 10552 10859 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "IR1" "IR1" "IR1" "Reg" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "Corner" "CulDSac" "Inside" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "Mitchel" "NWAmes" "NAmes" "NAmes" ...
## $ Condition1 : chr "Norm" "PosN" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "1.5Fin" "2Story" "1Story" "1Story" ...
## $ OverallQual : int 5 7 6 5 5 8 5 8 5 5 ...
## $ OverallCond : int 5 6 7 6 8 5 7 5 5 5 ...
## $ YearBuilt : int 1993 1973 1970 1958 1968 2007 1951 2007 1959 1994 ...
## $ YearRemodAdd : int 1995 1973 1970 1965 2001 2007 2000 2008 1959 1995 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Hip" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "HdBoard" "Wd Sdng" "BrkFace" ...
## $ Exterior2nd : chr "VinylSd" "HdBoard" "Wd Sdng" "Plywood" ...
## $ MasVnrType : chr "None" "Stone" "BrkFace" "None" ...
## $ MasVnrArea : int 0 240 180 0 0 640 0 200 0 0 ...
## $ ExterQual : chr "TA" "TA" "TA" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "Wood" "CBlock" "CBlock" "CBlock" ...
## $ BsmtQual : chr "Gd" "Gd" "TA" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "TA" ...
## $ BsmtExposure : chr "No" "Mn" "No" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "ALQ" "LwQ" ...
## $ BsmtFinSF1 : int 732 859 578 504 188 0 234 1218 1018 0 ...
## $ BsmtFinType2 : chr "Unf" "BLQ" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 32 0 0 668 0 486 0 0 0 ...
## $ BsmtUnfSF : int 64 216 426 525 204 1566 180 486 380 1097 ...
## $ TotalBsmtSF : int 796 1107 1004 1029 1060 1566 900 1704 1398 1097 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "TA" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 796 1107 1004 1339 1060 1600 900 1704 1700 1097 ...
## $ X2ndFlrSF : int 566 983 0 0 0 0 0 0 0 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1362 2090 1004 1339 1060 1600 900 1704 1700 1097 ...
## $ BsmtFullBath : int 1 1 1 0 1 0 0 1 0 0 ...
## $ BsmtHalfBath : int 0 0 0 0 0 0 1 0 1 0 ...
## $ FullBath : int 1 2 1 1 1 2 1 2 1 1 ...
## $ HalfBath : int 1 1 0 0 0 0 0 0 1 1 ...
## $ BedroomAbvGr : int 1 3 2 3 3 3 3 3 4 3 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ KitchenQual : chr "TA" "TA" "TA" "TA" ...
## $ TotRmsAbvGrd : int 5 7 5 6 6 7 5 7 6 6 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Min1" ...
## $ Fireplaces : int 0 2 1 0 1 1 0 1 1 0 ...
## $ FireplaceQu : chr NA "TA" "TA" NA ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Attchd" ...
## $ GarageYrBlt : int 1993 1973 1970 1958 1968 2007 2005 2008 1959 1995 ...
## $ GarageFinish : chr "Unf" "RFn" "Fin" "Unf" ...
## $ GarageCars : int 2 2 2 1 1 3 2 3 2 2 ...
## $ GarageArea : int 480 484 480 294 270 890 576 772 447 672 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 40 235 0 0 406 0 222 0 0 392 ...
## $ OpenPorchSF : int 30 204 0 0 90 56 32 50 38 64 ...
## $ EnclosedPorch: int 0 228 0 0 0 0 0 0 0 0 ...
## $ X3SsnPorch : int 320 0 0 0 0 0 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : logi NA NA NA NA NA NA ...
## $ Fence : chr "MnPrv" NA NA "MnPrv" ...
## $ MiscFeature : chr "Shed" "Shed" "Shed" NA ...
## $ MiscVal : int 700 350 700 0 0 0 0 0 0 0 ...
## $ MoSold : int 10 11 3 5 5 7 5 5 4 6 ...
## $ YrSold : int 2009 2009 2010 2009 2010 2009 2010 2010 2010 2009 ...
## $ SaleType : chr "WD" "WD" "WD" "COD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : num 143000 200000 149000 139000 154000 ...
after2009$MSSubClass <- as.factor(after2009$MSSubClass)
after2009$OverallQual <- as.factor(after2009$OverallQual)
after2009$OverallCond <- as.factor(after2009$OverallCond)
str(after2009[, c("MSSubClass", "OverallQual","OverallCond")])
## 'data.frame': 986 obs. of 3 variables:
## $ MSSubClass : Factor w/ 15 levels "20","30","40",..: 5 6 1 1 1 1 1 1 1 1 ...
## $ OverallQual: Factor w/ 10 levels "1","2","3","4",..: 5 7 6 5 5 8 5 8 5 5 ...
## $ OverallCond: Factor w/ 9 levels "1","2","3","4",..: 5 6 7 6 8 5 7 5 5 5 ...
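The three as.factor() assignments above repeat the same operation. A compact alternative, sketched here on a toy data frame, converts a named set of columns in one step with lapply(). The toy values and the name factor_cols are illustrative assumptions.

```r
# Toy data frame standing in for after2009 (values are illustrative only)
df <- data.frame(
  MSSubClass  = c(20, 60, 20),
  OverallQual = c(5, 7, 6),
  LotArea     = c(8450, 9600, 11250)
)

# Columns that are codes or ratings rather than quantities
factor_cols <- c("MSSubClass", "OverallQual")

# Convert all of them at once; other columns are left untouched
df[factor_cols] <- lapply(df[factor_cols], as.factor)
str(df)
```

Keeping the column names in one vector also makes it easy to apply the identical conversion to both before2009 and after2009.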
# Count missing values per column in the after-2009 data
count_NA_after <- sapply(after2009, function(x) sum(is.na(x)))
# Never drop the response, whatever its NA count
count_NA_after["SalePrice"] <- 0
cols_NA_after <- names(count_NA_after[count_NA_after >= 20])
cols_drop_after <- c("X","Id","Utilities")
dropCols_after <- union(cols_NA_after, cols_drop_after)
after2009 <- select(after2009, -all_of(dropCols_after))
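Because the same cleaning has to be applied to both datasets, the NA count and the column drop can be wrapped in one small function. The sketch below is written in base R and uses a hypothetical helper name, drop_sparse_cols, with a toy data frame; the threshold of 20 used in the code above becomes the na_threshold argument.

```r
# Hypothetical helper: drop columns with many NAs plus a fixed list of columns
drop_sparse_cols <- function(df, na_threshold = 20,
                             always_drop = character(), keep = character()) {
  # Number of missing values in each column
  na_count <- vapply(df, function(x) sum(is.na(x)), integer(1))
  # Columns at or above the threshold, except those explicitly protected
  sparse <- setdiff(names(na_count)[na_count >= na_threshold], keep)
  df[, setdiff(names(df), union(sparse, always_drop)), drop = FALSE]
}

# Toy data: Alley is mostly missing, Id is an identifier, SalePrice is protected
toy <- data.frame(
  Id        = 1:4,
  Alley     = c(NA, NA, NA, "Pave"),
  LotArea   = c(1, 2, NA, 4),
  SalePrice = c(NA, NA, NA, 10)
)
cleaned <- drop_sparse_cols(toy, na_threshold = 2,
                            always_drop = "Id", keep = "SalePrice")
names(cleaned)
```

Calling the helper once per dataset with the same arguments keeps the two cleaning steps from drifting apart through copy-and-paste slips.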
head(after2009, 20)
## MSSubClass MSZoning LotArea Street LotShape LandContour LotConfig LandSlope
## 1 50 RL 14115 Pave IR1 Lvl Inside Gtl
## 2 60 RL 10382 Pave IR1 Lvl Corner Gtl
## 3 20 RL 11241 Pave IR1 Lvl CulDSac Gtl
## 4 20 RL 7560 Pave Reg Lvl Inside Gtl
## 5 20 RL 8246 Pave IR1 Lvl Inside Gtl
## 6 20 RL 14230 Pave Reg Lvl Corner Gtl
## 7 20 RL 7200 Pave Reg Lvl Corner Gtl
## 8 20 RL 11478 Pave Reg Lvl Inside Gtl
## 9 20 RL 10552 Pave IR1 Lvl Inside Gtl
## 10 20 RL 10859 Pave Reg Lvl Corner Gtl
## 11 20 RL 8532 Pave Reg Lvl Inside Gtl
## 12 20 RL 7922 Pave Reg Lvl Inside Gtl
## 13 120 RL 7658 Pave Reg Lvl Inside Gtl
## 14 50 RL 12822 Pave IR1 Lvl CulDSac Gtl
## 15 190 RM 4456 Pave Reg Lvl Inside Gtl
## 16 90 RM 8472 Grvl IR2 Bnk Corner Mod
## 17 160 FV 2645 Pave Reg Lvl Inside Gtl
## 18 70 RM 10300 Pave IR1 Bnk Inside Gtl
## 19 60 RL 9375 Pave Reg Lvl Inside Gtl
## 20 20 RL 19900 Pave Reg Lvl Inside Gtl
## Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual
## 1 Mitchel Norm Norm 1Fam 1.5Fin 5
## 2 NWAmes PosN Norm 1Fam 2Story 7
## 3 NAmes Norm Norm 1Fam 1Story 6
## 4 NAmes Norm Norm 1Fam 1Story 5
## 5 Sawyer Norm Norm 1Fam 1Story 5
## 6 NridgHt Norm Norm 1Fam 1Story 8
## 7 NAmes Norm Norm 1Fam 1Story 5
## 8 NridgHt Norm Norm 1Fam 1Story 8
## 9 NAmes Norm Norm 1Fam 1Story 5
## 10 CollgCr Norm Norm 1Fam 1Story 5
## 11 NAmes Norm Norm 1Fam 1Story 5
## 12 NAmes Norm Norm 1Fam 1Story 5
## 13 NridgHt Norm Norm TwnhsE 1Story 9
## 14 Mitchel Norm Norm 1Fam 1.5Fin 7
## 15 OldTown Norm Norm 2fmCon 2Story 4
## 16 IDOTRR RRNn Norm Duplex 1Story 5
## 17 Somerst Norm Norm Twnhs 2Story 8
## 18 OldTown RRAn Feedr 1Fam 2Story 7
## 19 CollgCr Norm Norm 1Fam 2Story 7
## 20 NAmes PosA Norm 1Fam 1Story 7
## OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st
## 1 5 1993 1995 Gable CompShg VinylSd
## 2 6 1973 1973 Gable CompShg HdBoard
## 3 7 1970 1970 Gable CompShg Wd Sdng
## 4 6 1958 1965 Hip CompShg BrkFace
## 5 8 1968 2001 Gable CompShg Plywood
## 6 5 2007 2007 Gable CompShg VinylSd
## 7 7 1951 2000 Gable CompShg Wd Sdng
## 8 5 2007 2008 Gable CompShg VinylSd
## 9 5 1959 1959 Hip CompShg BrkFace
## 10 5 1994 1995 Gable CompShg VinylSd
## 11 6 1954 1990 Hip CompShg Wd Sdng
## 12 7 1953 2007 Gable CompShg VinylSd
## 13 5 2005 2005 Hip CompShg MetalSd
## 14 5 2003 2003 Gable CompShg VinylSd
## 15 5 1920 2008 Gable CompShg MetalSd
## 16 5 1963 1963 Gable CompShg Wd Sdng
## 17 5 1999 2000 Gable CompShg MetalSd
## 18 6 1921 1950 Gable CompShg Stucco
## 19 5 1997 1998 Gable CompShg VinylSd
## 20 5 1970 1989 Gable CompShg Plywood
## Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtFinSF1
## 1 VinylSd None 0 TA TA Wood 732
## 2 HdBoard Stone 240 TA TA CBlock 859
## 3 Wd Sdng BrkFace 180 TA TA CBlock 578
## 4 Plywood None 0 TA TA CBlock 504
## 5 Plywood None 0 TA Gd CBlock 188
## 6 VinylSd Stone 640 Gd TA PConc 0
## 7 Wd Sdng None 0 TA TA CBlock 234
## 8 VinylSd Stone 200 Gd TA PConc 1218
## 9 BrkFace None 0 TA TA CBlock 1018
## 10 VinylSd None 0 TA TA PConc 0
## 11 Wd Sdng BrkFace 650 TA TA CBlock 1213
## 12 VinylSd None 0 TA Gd CBlock 731
## 13 MetalSd BrkFace 412 Ex TA PConc 456
## 14 VinylSd None 0 Gd TA PConc 1351
## 15 MetalSd None 0 TA TA BrkTil 0
## 16 Wd Sdng None 0 Fa TA CBlock 104
## 17 MetalSd BrkFace 456 Gd TA PConc 649
## 18 Stucco None 0 TA TA BrkTil 0
## 19 VinylSd BrkFace 573 TA TA PConc 739
## 20 Plywood BrkFace 287 TA TA CBlock 912
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 64 796 GasA Ex Y SBrkr
## 2 32 216 1107 GasA Ex Y SBrkr
## 3 0 426 1004 GasA Ex Y SBrkr
## 4 0 525 1029 GasA TA Y SBrkr
## 5 668 204 1060 GasA Ex Y SBrkr
## 6 0 1566 1566 GasA Ex Y SBrkr
## 7 486 180 900 GasA TA Y SBrkr
## 8 0 486 1704 GasA Ex Y SBrkr
## 9 0 380 1398 GasA Gd Y SBrkr
## 10 0 1097 1097 GasA Ex Y SBrkr
## 11 0 84 1297 GasA Gd Y SBrkr
## 12 0 326 1057 GasA TA Y SBrkr
## 13 0 1296 1752 GasA Ex Y SBrkr
## 14 0 83 1434 GasA Ex Y SBrkr
## 15 0 736 736 GasA Gd Y SBrkr
## 16 712 0 816 GasA TA N SBrkr
## 17 0 321 970 GasA Ex Y SBrkr
## 18 0 576 576 GasA Gd Y SBrkr
## 19 0 318 1057 GasA Ex Y SBrkr
## 20 0 1035 1947 GasA TA Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## 1 796 566 0 1362 1 0
## 2 1107 983 0 2090 1 0
## 3 1004 0 0 1004 1 0
## 4 1339 0 0 1339 0 0
## 5 1060 0 0 1060 1 0
## 6 1600 0 0 1600 0 0
## 7 900 0 0 900 0 1
## 8 1704 0 0 1704 1 0
## 9 1700 0 0 1700 0 1
## 10 1097 0 0 1097 0 0
## 11 1297 0 0 1297 0 1
## 12 1057 0 0 1057 1 0
## 13 1752 0 0 1752 1 0
## 14 1518 631 0 2149 1 0
## 15 736 716 0 1452 0 0
## 16 816 0 0 816 1 0
## 17 983 756 0 1739 1 0
## 18 902 808 0 1710 0 0
## 19 1057 977 0 2034 1 0
## 20 2207 0 0 2207 1 0
## FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 1 1 1 1 1 TA 5
## 2 2 1 3 1 TA 7
## 3 1 0 2 1 TA 5
## 4 1 0 3 1 TA 6
## 5 1 0 3 1 Gd 6
## 6 2 0 3 1 Gd 7
## 7 1 0 3 1 Gd 5
## 8 2 0 3 1 Gd 7
## 9 1 1 4 1 Gd 6
## 10 1 1 3 1 TA 6
## 11 1 0 3 1 TA 5
## 12 1 0 3 1 Gd 5
## 13 2 0 2 1 Ex 6
## 14 1 1 1 1 Gd 6
## 15 2 0 2 3 TA 8
## 16 1 0 2 1 TA 5
## 17 2 1 3 1 Gd 7
## 18 2 0 3 1 TA 9
## 19 2 1 3 1 Gd 8
## 20 2 0 3 1 TA 7
## Functional Fireplaces GarageCars GarageArea PavedDrive WoodDeckSF
## 1 Typ 0 2 480 Y 40
## 2 Typ 2 2 484 Y 235
## 3 Typ 1 2 480 Y 0
## 4 Min1 0 1 294 Y 0
## 5 Typ 1 1 270 Y 406
## 6 Typ 1 3 890 Y 0
## 7 Typ 0 2 576 Y 222
## 8 Typ 1 3 772 Y 0
## 9 Typ 1 2 447 Y 0
## 10 Typ 0 2 672 Y 392
## 11 Typ 1 2 498 Y 0
## 12 Typ 0 1 246 Y 0
## 13 Typ 1 2 576 Y 196
## 14 Typ 1 2 670 Y 168
## 15 Typ 0 0 0 N 0
## 16 Typ 0 2 516 Y 106
## 17 Typ 0 2 480 Y 115
## 18 Typ 0 2 480 Y 12
## 19 Typ 0 2 645 Y 576
## 20 Min1 1 2 576 Y 301
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea MiscVal MoSold
## 1 30 0 320 0 0 700 10
## 2 204 228 0 0 0 350 11
## 3 0 0 0 0 0 700 3
## 4 0 0 0 0 0 0 5
## 5 90 0 0 0 0 0 5
## 6 56 0 0 0 0 0 7
## 7 32 0 0 0 0 0 5
## 8 50 0 0 0 0 0 5
## 9 38 0 0 0 0 0 4
## 10 64 0 0 0 0 0 6
## 11 0 0 0 0 0 0 10
## 12 52 0 0 0 0 0 1
## 13 82 0 0 0 0 0 2
## 14 43 0 0 198 0 0 8
## 15 0 102 0 0 0 0 6
## 16 0 0 0 0 0 0 5
## 17 0 0 0 0 0 0 8
## 18 11 64 0 0 0 0 4
## 19 36 0 0 0 0 0 2
## 20 0 0 0 0 0 0 7
## YrSold SaleType SaleCondition SalePrice
## 1 2009 WD Normal 143000
## 2 2009 WD Normal 200000
## 3 2010 WD Normal 149000
## 4 2009 COD Abnorml 139000
## 5 2010 WD Normal 154000
## 6 2009 WD Normal 256300
## 7 2010 WD Normal 134800
## 8 2010 WD Normal 306000
## 9 2010 WD Normal 165500
## 10 2009 WD Normal 145000
## 11 2009 WD Normal 153000
## 12 2010 WD Abnorml 109000
## 13 2010 WD Normal 319900
## 14 2009 WD Abnorml 239686
## 15 2009 New Partial 113000
## 16 2010 WD Normal 110000
## 17 2009 WD Abnorml 172500
## 18 2010 WD Normal 140000
## 19 2009 WD Normal 219500
## 20 2010 WD Normal 180000
Local authorities found in 2011 that there was housing fraud taking place in several neighborhoods, including NAmes, Gilbert and NridgHt, in 2009 and 2010. Make a density plot (which data scientists often use to catch outliers or anomalous activity) of SalePrice (after 2009) for all the neighborhoods (with or without fraud) and arrange them all in a grid. Tip: I recommend using ggplot2 for these plots with facet_wrap(~ Neighborhood). Your call will look something like this: ggplot(data = …, aes(…)) + geom_density() + facet_wrap(~ …) + ggtitle(“…”) + xlab(‘…’)
#install.packages("ggplot2")
library(ggplot2)
ggplot(data = after2009, aes(x = SalePrice)) +
geom_density() +
facet_wrap(~ Neighborhood) +
ggtitle("Density Plot of SalePrice by Neighborhood") +
xlab('SalePrice')
## Warning: Removed 5 rows containing non-finite values (`stat_density()`).
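The warning above says that five rows were dropped because their SalePrice was not finite (for example, NA). A minimal sketch of how one might count such rows and see where they fall before plotting; the data frame `toy` and its values are made up for illustration and stand in for after2009:

```r
# Hypothetical data frame standing in for after2009
toy <- data.frame(
  Neighborhood = c("NAmes", "NAmes", "Gilbert", "Gilbert", "NridgHt"),
  SalePrice = c(150000, NA, 180000, Inf, NA)
)

# Flag the rows a density estimate would drop (NA, NaN, Inf, -Inf)
bad <- !is.finite(toy$SalePrice)

sum(bad)                       # total number of non-finite prices
table(toy$Neighborhood[bad])   # neighborhoods in which they occur
clean <- toy[!bad, ]           # rows that are safe to plot
```

Checking this first makes it clear whether the dropped rows are concentrated in one neighborhood, which would itself be worth a closer look.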
As you can see, the density plot for NAmes between 2009 and 2010 does not look any different from other density plots. If there are fraudsters, they are making an effort to mask their activities. Now, make 2 density plots, one for SalePrice in NAmes before 2009 and the other for after 2009. Compare the two to see if there is visual evidence of anomalous activity. Then, do the same for Gilbert and see if anything anomalous is detectable between these plots. Tip: I recommend using the gridExtra library’s grid.arrange function for all four plots so you can see the plots for each neighborhood side by side.
#install.packages("gridExtra")
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.2
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
NAmes_before2009 <- before2009[before2009$Neighborhood == "NAmes", ]
NAmes_after2009 <- after2009[after2009$Neighborhood == "NAmes", ]
plot_NAmes_before2009 <- ggplot(NAmes_before2009, aes(x = SalePrice)) +
geom_density() + ggtitle("NAmes Before 2009") + xlim(0,400000) +xlab("SalePrice")
plot_NAmes_after2009 <- ggplot(NAmes_after2009, aes(x = SalePrice)) +
geom_density() + ggtitle("NAmes After 2009") + xlim(0,400000) + xlab("SalePrice")
Gilbert_before2009 <- before2009[before2009$Neighborhood == "Gilbert", ]
Gilbert_after2009 <- after2009[after2009$Neighborhood == "Gilbert", ]
plot_Gilbert_before2009 <- ggplot(Gilbert_before2009, aes(x = SalePrice)) +
geom_density() + ggtitle("Gilbert Before 2009") + xlim(0,400000) +xlab("SalePrice")
plot_Gilbert_after2009 <- ggplot(Gilbert_after2009, aes(x = SalePrice)) +
geom_density() + ggtitle("Gilbert After 2009") + xlim(0,400000) +xlab("SalePrice")
grid.arrange(plot_NAmes_before2009, plot_NAmes_after2009, plot_Gilbert_before2009,
plot_Gilbert_after2009, ncol = 2)
## Warning: Removed 4 rows containing non-finite values (`stat_density()`).
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).
We pick up this story from new Question 12 below and continue the investigation after you have learned regression in more detail. Tip: I bookended this assignment with the regression module so you can reinforce your understanding and apply it. (I also wanted to have empathy for your learning-life blend.) This will also, hopefully, cement your understanding and build your confidence.
Analyze the visualizations above for Gilbert and NAmes to detect possible fraud. Tip: Look for a fraud pattern.
### This section doesn't require code. Just answer the question as a comment.
# The SalePrice distributions before and after 2009 differ for both NAmes and Gilbert. Part of that shift is expected: slightly lower prices after 2009 are consistent with the financial crisis. The Gilbert plot after 2009, however, is bimodal, with a local peak at around 145,000 that does not appear in the Gilbert distribution before 2009. This may indicate fraud affecting a group of homes with sale prices near 145,000. In NAmes, the distribution after 2009 has a slightly lower kurtosis than before 2009, which could reflect fraud that pushed some prices below the mean while keeping the overall shape roughly intact. Overall, the density plots are consistent with fraud after 2009 and warrant further investigation. We suggest subsetting the Gilbert sales priced around the first local peak (140,000 to 150,000) and checking whether other variables show abnormalities compared with similarly priced homes before 2009.
#
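The follow-up suggested in the answer above (subset a suspicious price band, then look for a pattern) can be sketched as follows. This is only an illustration on simulated data: the data frame `toy`, the planted price 145000.5, and the number of planted rows are assumptions, not values from the assignment files.

```r
# Simulated stand-in for after2009 (hypothetical values, not the real data)
set.seed(1)
toy <- data.frame(
  Neighborhood = "Gilbert",
  SalePrice = round(rnorm(120, mean = 190000, sd = 25000))
)
toy$SalePrice[1:15] <- 145000.5   # planted block of identical prices

# Keep only the sales inside the suspicious price band
band <- subset(toy, SalePrice >= 140000 & SalePrice <= 150000)

# Count how often each exact price occurs; a real market rarely repeats
# the same non-round price many times
price_counts <- sort(table(band$SalePrice), decreasing = TRUE)
head(price_counts, 3)
```

A single price that accounts for most of the band is the kind of pattern a density plot only hints at through a narrow local peak.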
You may feel that the fraudsters were not very careful in masking their activity after identifying the fraud pattern. However, we don’t have sufficient evidence to claim that this is fraudulent activity (just based on the density plots). We will now use multiple linear regression to attempt to get more evidence. Run a regression on the data in after2009 using variables you already know to be good at predicting the SalePrice. Store the result in a variable called regAfter2009optimal. Then print the summary of regAfter2009optimal to verify that your code works. Tip: You can reuse your previous work on before2009. Rubric: 4 points for regression, 1 point for printing summary.
regAfter2009optimal <- lm(SalePrice ~ RoofMatl + LandSlope + BsmtUnfSF +
                            OverallQual + Condition2 + MSZoning + Neighborhood +
                            LotArea + OverallCond + Foundation + BedroomAbvGr +
                            EnclosedPorch + BsmtFinSF1 + BsmtFinSF2 + MasVnrType,
                          data = after2009)
summary(regAfter2009optimal)
##
## Call:
## lm(formula = SalePrice ~ RoofMatl + LandSlope + BsmtUnfSF + OverallQual +
## Condition2 + MSZoning + Neighborhood + LotArea + OverallCond +
## Foundation + BedroomAbvGr + EnclosedPorch + BsmtFinSF1 +
## BsmtFinSF2 + MasVnrType, data = after2009)
##
## Residuals:
## Min 1Q Median 3Q Max
## -243969 -13856 127 14684 238698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.886e+04 5.753e+04 0.849 0.395872
## RoofMatlTar&Grv 1.856e+04 1.391e+04 1.334 0.182580
## RoofMatlWdShake 7.726e+04 2.476e+04 3.121 0.001860 **
## RoofMatlWdShngl -1.076e+04 3.251e+04 -0.331 0.740735
## LandSlopeMod -2.704e+02 5.305e+03 -0.051 0.959363
## LandSlopeSev -7.526e+04 2.174e+04 -3.462 0.000561 ***
## BsmtUnfSF 2.568e+01 3.843e+00 6.681 4.13e-11 ***
## OverallQual2 -1.350e+05 4.692e+04 -2.877 0.004105 **
## OverallQual3 -1.199e+05 4.632e+04 -2.588 0.009806 **
## OverallQual4 -1.141e+05 4.534e+04 -2.517 0.012001 *
## OverallQual5 -1.122e+05 4.519e+04 -2.482 0.013230 *
## OverallQual6 -9.499e+04 4.541e+04 -2.092 0.036739 *
## OverallQual7 -7.374e+04 4.555e+04 -1.619 0.105802
## OverallQual8 -5.187e+04 4.579e+04 -1.133 0.257623
## OverallQual9 2.456e+04 4.616e+04 0.532 0.594773
## OverallQual10 7.218e+04 4.760e+04 1.517 0.129732
## Condition2Feedr 3.476e+04 3.739e+04 0.930 0.352778
## Condition2Norm 2.134e+04 3.378e+04 0.632 0.527682
## Condition2PosA 6.305e+04 4.788e+04 1.317 0.188188
## Condition2PosN -1.972e+05 4.806e+04 -4.104 4.43e-05 ***
## MSZoningFV 4.551e+04 1.895e+04 2.401 0.016556 *
## MSZoningRH 2.095e+04 1.773e+04 1.182 0.237706
## MSZoningRL 3.404e+04 1.429e+04 2.382 0.017427 *
## MSZoningRM 3.030e+04 1.327e+04 2.283 0.022669 *
## NeighborhoodBlueste 5.123e+03 1.906e+04 0.269 0.788103
## NeighborhoodBrDale 2.061e+03 1.719e+04 0.120 0.904626
## NeighborhoodBrkSide 1.432e+03 1.439e+04 0.099 0.920763
## NeighborhoodClearCr 2.748e+04 1.533e+04 1.792 0.073452 .
## NeighborhoodCollgCr -1.987e+02 1.211e+04 -0.016 0.986908
## NeighborhoodCrawfor 3.172e+04 1.376e+04 2.306 0.021322 *
## NeighborhoodEdwards -1.744e+04 1.297e+04 -1.345 0.179001
## NeighborhoodGilbert 2.554e+03 1.259e+04 0.203 0.839267
## NeighborhoodIDOTRR -1.767e+04 1.633e+04 -1.082 0.279391
## NeighborhoodMeadowV -1.513e+04 1.705e+04 -0.887 0.375234
## NeighborhoodMitchel -2.466e+03 1.301e+04 -0.190 0.849686
## NeighborhoodNAmes -1.207e+04 1.253e+04 -0.964 0.335502
## NeighborhoodNoRidge 5.246e+04 1.360e+04 3.857 0.000123 ***
## NeighborhoodNPkVill -1.985e+03 1.503e+04 -0.132 0.894976
## NeighborhoodNridgHt 8.247e+03 1.270e+04 0.649 0.516341
## NeighborhoodNWAmes -2.732e+03 1.294e+04 -0.211 0.832786
## NeighborhoodOldTown -1.649e+04 1.413e+04 -1.167 0.243435
## NeighborhoodSawyer -1.734e+04 1.326e+04 -1.308 0.191330
## NeighborhoodSawyerW 4.546e+03 1.253e+04 0.363 0.716771
## NeighborhoodSomerst 2.615e+03 1.639e+04 0.160 0.873298
## NeighborhoodStoneBr 4.473e+04 1.481e+04 3.021 0.002592 **
## NeighborhoodSWISU -8.673e+03 1.444e+04 -0.600 0.548330
## NeighborhoodTimber 7.695e+03 1.377e+04 0.559 0.576318
## NeighborhoodVeenker 4.835e+04 2.014e+04 2.401 0.016573 *
## LotArea 1.290e+00 1.765e-01 7.307 5.96e-13 ***
## OverallCond2 5.772e+04 2.699e+04 2.139 0.032692 *
## OverallCond3 4.801e+04 2.466e+04 1.947 0.051844 .
## OverallCond4 5.431e+04 2.440e+04 2.226 0.026285 *
## OverallCond5 6.200e+04 2.398e+04 2.585 0.009880 **
## OverallCond6 6.349e+04 2.399e+04 2.647 0.008266 **
## OverallCond7 7.203e+04 2.403e+04 2.998 0.002794 **
## OverallCond8 7.083e+04 2.438e+04 2.905 0.003757 **
## OverallCond9 7.590e+04 2.535e+04 2.994 0.002826 **
## FoundationCBlock 5.948e+02 4.528e+03 0.131 0.895529
## FoundationPConc 1.804e+04 5.022e+03 3.592 0.000346 ***
## FoundationSlab 1.666e+04 9.084e+03 1.834 0.066999 .
## FoundationStone 4.058e+04 1.517e+04 2.675 0.007612 **
## FoundationWood -1.902e+04 1.917e+04 -0.992 0.321310
## BedroomAbvGr 1.269e+04 1.434e+03 8.845 < 2e-16 ***
## EnclosedPorch 1.616e+01 1.853e+01 0.872 0.383414
## BsmtFinSF1 5.861e+01 3.953e+00 14.828 < 2e-16 ***
## BsmtFinSF2 3.723e+01 6.907e+00 5.390 8.97e-08 ***
## MasVnrTypeBrkFace -4.891e+03 1.469e+04 -0.333 0.739188
## MasVnrTypeNone -5.224e+03 1.458e+04 -0.358 0.720281
## MasVnrTypeStone 1.395e+04 1.507e+04 0.926 0.354951
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31500 on 910 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.8405, Adjusted R-squared: 0.8286
## F-statistic: 70.51 on 68 and 910 DF, p-value: < 2.2e-16
Now, display diagnostic plots of your regression (regAfter2009optimal). Tip: You already know how to use autoplot.
regAfter2009optimal %>%
autoplot()
## Warning: Removed 979 rows containing missing values (`geom_line()`).
## Warning: Removed 3 rows containing missing values (`geom_point()`).
## Warning: Removed 10 rows containing missing values (`geom_line()`).
Now, let’s focus on the Residual vs. Fitted graph by plotting it by itself using ggplot. Tip: Call ggplot with the data parameter in regAfter2009optimal. The aes parameters are (.fitted, .resid), respectively. You can use stat_smooth() for the trendline and appropriately title the plot and label both axes. Tip: Check out cheatsheets such as https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf.
library(broom)
library(ggplot2)
regAfter2009optimal_aug <- augment(regAfter2009optimal)
ggplot(regAfter2009optimal_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
stat_smooth(method = "loess", colour = "blue") +
ggtitle("Residuals vs Fitted") +
xlab("Fitted values") +
ylab("Residuals")
## `geom_smooth()` using formula = 'y ~ x'
Identify any outliers in the visualization from the last two chunks.
### This section doesn't require code. Just answer the question as a comment.
# In the "Residuals vs Fitted" plot, the points labeled 280, 348, and 533 lie far from the fitted trend line.
# Q-Q plot: points that deviate from the straight reference line, such as 280, may be outliers.
# Scale-Location: points far from the horizontal center, such as 280, would be outliers.
# Residuals vs Leverage: points outside the dashed lines, such as 529, are outliers.
# These points can be identified as outliers because their residuals deviate markedly from the rest.
# The second ggplot shows a pattern similar to the first graph. We also see heteroscedasticity that increases for larger fitted values. Residuals vs Fitted: the outliers are the points far from the smoothed trend line, such as the point whose residual exceeds 2e+05.
##
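Reading outliers off the plots can be backed up with numbers. The sketch below flags observations by standardized residual and Cook's distance on simulated data; the cutoffs (|standardized residual| > 3 and Cook's distance > 4/n) are common rules of thumb assumed here, not thresholds given in the assignment, and the variables `x`, `y`, and `fit` are hypothetical.

```r
# Simulated regression with two planted gross outliers (hypothetical data)
set.seed(42)
n <- 200
x <- runif(n, 50, 250)
y <- 1000 * x + rnorm(n, sd = 15000)
y[c(10, 120)] <- y[c(10, 120)] + 200000
fit <- lm(y ~ x)

std_res <- rstandard(fit)        # standardized residuals
cooks   <- cooks.distance(fit)   # influence of each observation on the fit

# Rule-of-thumb cutoffs (assumptions): |std. residual| > 3 or Cook's D > 4/n
flagged <- which(abs(std_res) > 3 | cooks > 4 / n)

data.frame(row     = flagged,
           std_res = round(std_res[flagged], 2),
           cooks_d = round(cooks[flagged], 3))
```

The same two functions accept an lm object such as regAfter2009optimal, so the row labels printed on the diagnostic plots can be cross-checked against this table.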
Now, let’s think like a fraudster and do something smarter fraudsters may do. Instead of misrepresenting values by just reporting the mean value of the houses sold in NAmes before 2009, what is a more clever and nuanced way for the fraudsters to report these values? Specifically, consider a method smarter fraudsters may use to set the prices in the rows that are misrepresented. Then, using this method, generate and set values for the SalePrice in those rows. Then, try your fraud inspection techniques of comparing old and new density plots as well as using the diagnostic plots to show that the fraud is now much harder to catch. Tip: You must use exact commands/functions to set the values and tell us why you chose to generate values this way. You must share the resulting diagnostic plots with us. Tip: Consider using more information (instead of the mean values) to generate the fraudulent values using what you learned from your work above. You can do this in two steps: Step 1: Find the rows set by the stupid fraudsters (by searching for the SalePrice of 142769.7). Step 2: Use a smarter way to generate and replace these values. Tip: For plotting, you may use ggplot to plot NAmes and NAmes. My ggplot call looked like this: before2009 %>% filter(Neighborhood == "???") %>% ggplot(aes(x = SalePrice)) + geom_density(fill = "???", alpha = 0.5) + ggtitle("???") + xlab("???") Tip: Always refine your model as fraudsters adapt their methods after they find out that you can catch them. Rubric: 10 points each for the fraud method and the plots.
### This section requires you to first explain your idea. Just answer this as a comment.
# Instead of simply reporting the average sale price of houses sold in NAmes before 2009, a more sophisticated method is to use a regression model to predict the SalePrice of each row we wish to manipulate, and then add random noise so that the reported SalePrice does not match the model's predictions exactly.
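# Step 1: locate the rows the naive fraudsters set to the NAmes mean price.
# which() drops NA comparisons; the column name is case-sensitive (SalePrice).
fraud_rows <- which(after2009$SalePrice == 142769.7)
fraud_price <- after2009[fraud_rows, ]
# Step 2: replace those prices with model predictions plus random noise.
fraud_price_new <- predict(regAfter2009optimal, newdata = fraud_price)
set.seed(123)
random_noise <- rnorm(length(fraud_price_new))
after2009$SalePrice[fraud_rows] <- fraud_price_new + random_noise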
p_before <- ggplot(before2009%>% filter(Neighborhood == "Gilbert"), aes(x = SalePrice)) + geom_density(fill = "red", alpha = 0.5) + ggtitle("Density Plot for Gilbert Before 2009") +xlab("SalePrice")
p_after <- ggplot(after2009%>% filter(Neighborhood == "Gilbert"), aes(x = SalePrice)) + geom_density(fill = "blue", alpha = 0.5) + ggtitle("Density Plot for Gilbert After 2009 with fraud") +xlab("SalePrice")
p_before1 <- ggplot(before2009%>% filter(Neighborhood == "NAmes"), aes(x = SalePrice)) + geom_density(fill = "yellow", alpha = 0.5) + ggtitle("Density Plot for NAmes Before 2009") +xlab("SalePrice")
p_after1 <- ggplot(after2009%>% filter(Neighborhood == "NAmes"), aes(x = SalePrice)) + geom_density(fill = "green", alpha = 0.5) + ggtitle("Density Plot for NAmes After 2009 with fraud") +xlab("SalePrice")
grid.arrange(p_before, p_after,p_before1,p_after1, ncol = 2)
## Warning: Removed 4 rows containing non-finite values (`stat_density()`).
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).
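To see why replacing prices with a single mean value is easier to catch than replacing them with varied values, the sketch below compares the two on simulated data. It does not use the regression-based approach from the chunk above: the "smart" version simply draws from a normal distribution with the before-period mean and standard deviation, which is an illustrative alternative. All vectors and the choice of 60 fraudulent rows are assumptions.

```r
# Hypothetical prices, not the assignment data
set.seed(2009)
before <- rnorm(300, mean = 145000, sd = 25000)   # "clean" earlier prices
after  <- rnorm(300, mean = 140000, sd = 25000)   # later prices
fraud_idx <- 1:60                                 # rows to misreport (assumption)

naive <- after
naive[fraud_idx] <- mean(before)                  # every row gets the same value

smart <- after
smart[fraud_idx] <- rnorm(length(fraud_idx), mean(before), sd(before))

# Largest gap between two empirical CDFs (the two-sample KS statistic)
ks_stat <- function(a, b) {
  grid <- sort(c(a, b))
  max(abs(ecdf(a)(grid) - ecdf(b)(grid)))
}

# Distance from the "before" distribution, and number of repeated prices
c(naive = ks_stat(before, naive), smart = ks_stat(before, smart))
c(naive = sum(duplicated(naive)), smart = sum(duplicated(smart)))
```

The naive replacement leaves a block of exact duplicates and a jump in the empirical CDF, while the varied replacement leaves no duplicates and typically sits closer to the earlier distribution, which is what makes it harder to spot in a density plot.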
Now, run a regression on the new data in after2009 using variables you know are good at predicting SalePrice. Store the result in a variable called regAfter2009optimalFraud. Then print the summary of regAfter2009optimalFraud to verify that your code works. Tip: You can reuse your previous work on before2009. Rubric: 4 points for regression, 1 point for printing summary.
regAfter2009optimalFraud <- lm(SalePrice ~ RoofMatl + LandSlope + BsmtUnfSF +
                                 OverallQual + Condition2 + MSZoning + Neighborhood +
                                 LotArea + OverallCond + Foundation + BedroomAbvGr +
                                 EnclosedPorch + BsmtFinSF1 + BsmtFinSF2 +
                                 MasVnrType, data = after2009)
summary(regAfter2009optimalFraud)
##
## Call:
## lm(formula = SalePrice ~ RoofMatl + LandSlope + BsmtUnfSF + OverallQual +
## Condition2 + MSZoning + Neighborhood + LotArea + OverallCond +
## Foundation + BedroomAbvGr + EnclosedPorch + BsmtFinSF1 +
## BsmtFinSF2 + MasVnrType, data = after2009)
##
## Residuals:
## Min 1Q Median 3Q Max
## -243969 -13856 127 14684 238698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.886e+04 5.753e+04 0.849 0.395872
## RoofMatlTar&Grv 1.856e+04 1.391e+04 1.334 0.182580
## RoofMatlWdShake 7.726e+04 2.476e+04 3.121 0.001860 **
## RoofMatlWdShngl -1.076e+04 3.251e+04 -0.331 0.740735
## LandSlopeMod -2.704e+02 5.305e+03 -0.051 0.959363
## LandSlopeSev -7.526e+04 2.174e+04 -3.462 0.000561 ***
## BsmtUnfSF 2.568e+01 3.843e+00 6.681 4.13e-11 ***
## OverallQual2 -1.350e+05 4.692e+04 -2.877 0.004105 **
## OverallQual3 -1.199e+05 4.632e+04 -2.588 0.009806 **
## OverallQual4 -1.141e+05 4.534e+04 -2.517 0.012001 *
## OverallQual5 -1.122e+05 4.519e+04 -2.482 0.013230 *
## OverallQual6 -9.499e+04 4.541e+04 -2.092 0.036739 *
## OverallQual7 -7.374e+04 4.555e+04 -1.619 0.105802
## OverallQual8 -5.187e+04 4.579e+04 -1.133 0.257623
## OverallQual9 2.456e+04 4.616e+04 0.532 0.594773
## OverallQual10 7.218e+04 4.760e+04 1.517 0.129732
## Condition2Feedr 3.476e+04 3.739e+04 0.930 0.352778
## Condition2Norm 2.134e+04 3.378e+04 0.632 0.527682
## Condition2PosA 6.305e+04 4.788e+04 1.317 0.188188
## Condition2PosN -1.972e+05 4.806e+04 -4.104 4.43e-05 ***
## MSZoningFV 4.551e+04 1.895e+04 2.401 0.016556 *
## MSZoningRH 2.095e+04 1.773e+04 1.182 0.237706
## MSZoningRL 3.404e+04 1.429e+04 2.382 0.017427 *
## MSZoningRM 3.030e+04 1.327e+04 2.283 0.022669 *
## NeighborhoodBlueste 5.123e+03 1.906e+04 0.269 0.788103
## NeighborhoodBrDale 2.061e+03 1.719e+04 0.120 0.904626
## NeighborhoodBrkSide 1.432e+03 1.439e+04 0.099 0.920763
## NeighborhoodClearCr 2.748e+04 1.533e+04 1.792 0.073452 .
## NeighborhoodCollgCr -1.987e+02 1.211e+04 -0.016 0.986908
## NeighborhoodCrawfor 3.172e+04 1.376e+04 2.306 0.021322 *
## NeighborhoodEdwards -1.744e+04 1.297e+04 -1.345 0.179001
## NeighborhoodGilbert 2.554e+03 1.259e+04 0.203 0.839267
## NeighborhoodIDOTRR -1.767e+04 1.633e+04 -1.082 0.279391
## NeighborhoodMeadowV -1.513e+04 1.705e+04 -0.887 0.375234
## NeighborhoodMitchel -2.466e+03 1.301e+04 -0.190 0.849686
## NeighborhoodNAmes -1.207e+04 1.253e+04 -0.964 0.335502
## NeighborhoodNoRidge 5.246e+04 1.360e+04 3.857 0.000123 ***
## NeighborhoodNPkVill -1.985e+03 1.503e+04 -0.132 0.894976
## NeighborhoodNridgHt 8.247e+03 1.270e+04 0.649 0.516341
## NeighborhoodNWAmes -2.732e+03 1.294e+04 -0.211 0.832786
## NeighborhoodOldTown -1.649e+04 1.413e+04 -1.167 0.243435
## NeighborhoodSawyer -1.734e+04 1.326e+04 -1.308 0.191330
## NeighborhoodSawyerW 4.546e+03 1.253e+04 0.363 0.716771
## NeighborhoodSomerst 2.615e+03 1.639e+04 0.160 0.873298
## NeighborhoodStoneBr 4.473e+04 1.481e+04 3.021 0.002592 **
## NeighborhoodSWISU -8.673e+03 1.444e+04 -0.600 0.548330
## NeighborhoodTimber 7.695e+03 1.377e+04 0.559 0.576318
## NeighborhoodVeenker 4.835e+04 2.014e+04 2.401 0.016573 *
## LotArea 1.290e+00 1.765e-01 7.307 5.96e-13 ***
## OverallCond2 5.772e+04 2.699e+04 2.139 0.032692 *
## OverallCond3 4.801e+04 2.466e+04 1.947 0.051844 .
## OverallCond4 5.431e+04 2.440e+04 2.226 0.026285 *
## OverallCond5 6.200e+04 2.398e+04 2.585 0.009880 **
## OverallCond6 6.349e+04 2.399e+04 2.647 0.008266 **
## OverallCond7 7.203e+04 2.403e+04 2.998 0.002794 **
## OverallCond8 7.083e+04 2.438e+04 2.905 0.003757 **
## OverallCond9 7.590e+04 2.535e+04 2.994 0.002826 **
## FoundationCBlock 5.948e+02 4.528e+03 0.131 0.895529
## FoundationPConc 1.804e+04 5.022e+03 3.592 0.000346 ***
## FoundationSlab 1.666e+04 9.084e+03 1.834 0.066999 .
## FoundationStone 4.058e+04 1.517e+04 2.675 0.007612 **
## FoundationWood -1.902e+04 1.917e+04 -0.992 0.321310
## BedroomAbvGr 1.269e+04 1.434e+03 8.845 < 2e-16 ***
## EnclosedPorch 1.616e+01 1.853e+01 0.872 0.383414
## BsmtFinSF1 5.861e+01 3.953e+00 14.828 < 2e-16 ***
## BsmtFinSF2 3.723e+01 6.907e+00 5.390 8.97e-08 ***
## MasVnrTypeBrkFace -4.891e+03 1.469e+04 -0.333 0.739188
## MasVnrTypeNone -5.224e+03 1.458e+04 -0.358 0.720281
## MasVnrTypeStone 1.395e+04 1.507e+04 0.926 0.354951
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31500 on 910 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.8405, Adjusted R-squared: 0.8286
## F-statistic: 70.51 on 68 and 910 DF, p-value: < 2.2e-16
Now, display diagnostic plots of your regression (regAfter2009optimalFraud). Tip: You already know how to use autoplot.
regAfter2009optimalFraud %>%
autoplot()
## Warning: Removed 979 rows containing missing values (`geom_line()`).
## Warning: Removed 3 rows containing missing values (`geom_point()`).
## Warning: Removed 10 rows containing missing values (`geom_line()`).
Now, look for outliers in the diagnostic plots of your regression (regAfter2009optimal). Tip: You already know how to use autoplot.
### This section doesn't require code. Just answer the question as a comment.
# Residuals vs Fitted: points far from the horizontal center line, such as 280, are outliers.
# Q-Q plot: points that deviate from the straight reference line, such as 533, may be outliers.
# Scale-Location: points far from the horizontal center, such as 280, would be outliers.
# Residuals vs Leverage: points outside the dashed lines, such as 524, are outliers.
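When diagnostics from two fits are compared, it helps to confirm numerically that changing the data actually changed the model before reading anything into the plots. A minimal sketch on simulated data; `d`, `fit_old`, and `fit_new` are hypothetical stand-ins for after2009 and the two regressions:

```r
# Hypothetical data and fits, not the assignment's models
set.seed(7)
d <- data.frame(x = rnorm(100))
d$y <- 3 * d$x + rnorm(100)
fit_old <- lm(y ~ x, data = d)

# Replace ten responses with predictions plus noise, then refit
d_new <- d
d_new$y[1:10] <- predict(fit_old, newdata = d_new[1:10, ]) + rnorm(10)
fit_new <- lm(y ~ x, data = d_new)

# TRUE here would mean the replacement never reached the data
identical(coef(fit_old), coef(fit_new))

# Size of the change in each coefficient
coef(fit_new) - coef(fit_old)

# Residual standard error before and after the replacement
c(old = sigma(fit_old), new = sigma(fit_new))
```

If two summaries match to every printed digit, the first thing to check is whether the assignment step that modified SalePrice selected any rows at all.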
Knit to html after eliminating all the errors. Submit both the Rmd and html files. Tip: Do not worry about minor formatting issues.
### This section doesn't require code. Just knit and submit the Rmd and html files.###