##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
As a final project for Data 605 (Fundamentals of Computational Mathematics), we will explore questions pertaining to probability, descriptive and inferential statistics, linear algebra and correlation, calculus-based probability and statistics, and modeling. For this exploration we will use the data set from the “House Prices: Advanced Regression Techniques” competition on Kaggle.com; see the link below.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
In order to load the .csv file for this competition, we registered at www.kaggle.com and downloaded the training data (train.csv) for the “Advanced Regression Techniques” competition. We will assume that the file resides in the working directory for the remainder of the analysis.
# Load raw data set
my_data <- read.csv(file = "train.csv", header = TRUE, sep = ",")
Let us perform some basic exploration of the data. This data set has 81 variables and 1460 observations. Based on the descriptions of the various variables (see the data set text documentation), we may conclude that the dependent variable is SalePrice. The remaining variables are either qualitative or quantitative in nature.
# Display the first and last few rows
head(my_data)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
tail(my_data)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1455 1455 20 FV 62 7500 Pave Pave Reg
## 1456 1456 60 RL 62 7917 Pave <NA> Reg
## 1457 1457 20 RL 85 13175 Pave <NA> Reg
## 1458 1458 70 RL 66 9042 Pave <NA> Reg
## 1459 1459 20 RL 68 9717 Pave <NA> Reg
## 1460 1460 20 RL 75 9937 Pave <NA> Reg
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1455 Lvl AllPub Inside Gtl Somerst Norm
## 1456 Lvl AllPub Inside Gtl Gilbert Norm
## 1457 Lvl AllPub Inside Gtl NWAmes Norm
## 1458 Lvl AllPub Inside Gtl Crawfor Norm
## 1459 Lvl AllPub Inside Gtl NAmes Norm
## 1460 Lvl AllPub Inside Gtl Edwards Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1455 Norm 1Fam 1Story 7 5 2004
## 1456 Norm 1Fam 2Story 6 5 1999
## 1457 Norm 1Fam 1Story 6 6 1978
## 1458 Norm 1Fam 2Story 7 9 1941
## 1459 Norm 1Fam 1Story 5 6 1950
## 1460 Norm 1Fam 1Story 5 6 1965
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1455 2005 Gable CompShg VinylSd VinylSd None
## 1456 2000 Gable CompShg VinylSd VinylSd None
## 1457 1988 Gable CompShg Plywood Plywood Stone
## 1458 2006 Gable CompShg CemntBd CmentBd None
## 1459 1996 Hip CompShg MetalSd MetalSd None
## 1460 1965 Gable CompShg HdBoard HdBoard None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond
## 1455 0 Gd TA PConc Gd TA
## 1456 0 TA TA PConc Gd TA
## 1457 119 TA TA CBlock Gd TA
## 1458 0 Ex Gd Stone TA Gd
## 1459 0 TA TA CBlock TA TA
## 1460 0 Gd TA CBlock TA TA
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## 1455 No GLQ 410 Unf 0
## 1456 No Unf 0 Unf 0
## 1457 No ALQ 790 Rec 163
## 1458 No GLQ 275 Unf 0
## 1459 Mn GLQ 49 Rec 1029
## 1460 No BLQ 830 LwQ 290
## BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1455 811 1221 GasA Ex Y SBrkr
## 1456 953 953 GasA Ex Y SBrkr
## 1457 589 1542 GasA TA Y SBrkr
## 1458 877 1152 GasA Ex Y SBrkr
## 1459 0 1078 GasA Gd Y FuseA
## 1460 136 1256 GasA Gd Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## 1455 1221 0 0 1221 1 0
## 1456 953 694 0 1647 0 0
## 1457 2073 0 0 2073 1 0
## 1458 1188 1152 0 2340 0 0
## 1459 1078 0 0 1078 1 0
## 1460 1256 0 0 1256 1 0
## FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 1455 2 0 2 1 Gd 6
## 1456 2 1 3 1 TA 7
## 1457 2 0 3 1 TA 7
## 1458 2 0 4 1 Gd 9
## 1459 1 0 2 1 Gd 5
## 1460 1 1 3 1 TA 6
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
## 1455 Typ 0 <NA> Attchd 2004 RFn
## 1456 Typ 1 TA Attchd 1999 RFn
## 1457 Min1 2 TA Attchd 1978 Unf
## 1458 Typ 2 Gd Attchd 1941 RFn
## 1459 Typ 0 <NA> Attchd 1950 Unf
## 1460 Typ 0 <NA> Attchd 1965 Fin
## GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF
## 1455 2 400 TA TA Y 0
## 1456 2 460 TA TA Y 0
## 1457 2 500 TA TA Y 349
## 1458 1 252 TA TA Y 0
## 1459 1 240 TA TA Y 366
## 1460 1 276 TA TA Y 736
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1455 113 0 0 0 0 <NA>
## 1456 40 0 0 0 0 <NA>
## 1457 0 0 0 0 0 <NA>
## 1458 60 0 0 0 0 <NA>
## 1459 0 112 0 0 0 <NA>
## 1460 68 0 0 0 0 <NA>
## Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
## 1455 <NA> <NA> 0 10 2009 WD Normal
## 1456 <NA> <NA> 0 8 2007 WD Normal
## 1457 MnPrv <NA> 0 2 2010 WD Normal
## 1458 GdPrv Shed 2500 5 2010 WD Normal
## 1459 <NA> <NA> 0 4 2010 WD Normal
## 1460 <NA> <NA> 0 6 2008 WD Normal
## SalePrice
## 1455 185000
## 1456 175000
## 1457 210000
## 1458 266500
## 1459 142125
## 1460 147500
From the display, we observe that quite a few independent variables have missing observations, indicated by NA. These will have to be accounted for if any of these variables is used in the analysis. Let us now run the summary function on the data to obtain basic statistics.
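As a rough illustration of how the missing observations could be tallied, the sketch below counts NA values per column. The data frame `toy` and its values are made up for the example and only stand in for the housing data.

```r
# Toy data frame standing in for the housing data (values are made up).
toy <- data.frame(
  LotFrontage = c(65, NA, 68, NA, 84),
  Alley       = c(NA, NA, "Pave", NA, "Grvl"),
  LotArea     = c(8450, 9600, 11250, 9550, 14260),
  stringsAsFactors = FALSE
)

# Count missing values per column, then show only columns that have any.
na_counts <- colSums(is.na(toy))
na_counts[na_counts > 0]

# Share of missing values per column, as a proportion of rows.
na_share <- na_counts / nrow(toy)
na_share
```

The same two calls should apply unchanged to the full data frame, since `is.na()` works on both numeric and character or factor columns.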
# Summary function on Data Set
summary(my_data)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 C (all): 10 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 FV : 65 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 RH : 16 Median : 69.00
## Mean : 730.5 Mean : 56.9 RL :1151 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 RM : 218 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape LandContour
## Min. : 1300 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63
## 1st Qu.: 7554 Pave:1454 Pave: 41 IR2: 41 HLS: 50
## Median : 9478 NA's:1369 IR3: 10 Low: 36
## Mean : 10517 Reg:925 Lvl:1311
## 3rd Qu.: 11602
## Max. :215245
##
## Utilities LotConfig LandSlope Neighborhood Condition1
## AllPub:1459 Corner : 263 Gtl:1382 NAmes :225 Norm :1260
## NoSeWa: 1 CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48
## FR3 : 4 Edwards:100 RRAn : 26
## Inside :1052 Somerst: 86 PosN : 19
## Gilbert: 79 RRAe : 11
## (Other):707 (Other): 15
## Condition2 BldgType HouseStyle OverallQual
## Norm :1445 1Fam :1220 1Story :726 Min. : 1.000
## Feedr : 6 2fmCon: 31 2Story :445 1st Qu.: 5.000
## Artery : 2 Duplex: 52 1.5Fin :154 Median : 6.000
## PosN : 2 Twnhs : 43 SLvl : 65 Mean : 6.099
## RRNn : 2 TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000
## PosA : 1 1.5Unf : 14 Max. :10.000
## (Other): 2 (Other): 19
## OverallCond YearBuilt YearRemodAdd RoofStyle
## Min. :1.000 Min. :1872 Min. :1950 Flat : 13
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 Gable :1141
## Median :5.000 Median :1973 Median :1994 Gambrel: 11
## Mean :5.575 Mean :1971 Mean :1985 Hip : 286
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 Mansard: 7
## Max. :9.000 Max. :2010 Max. :2010 Shed : 2
##
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea
## CompShg:1434 VinylSd:515 VinylSd:504 BrkCmn : 15 Min. : 0.0
## Tar&Grv: 11 HdBoard:222 MetalSd:214 BrkFace:445 1st Qu.: 0.0
## WdShngl: 6 MetalSd:220 HdBoard:207 None :864 Median : 0.0
## WdShake: 5 Wd Sdng:206 Wd Sdng:197 Stone :128 Mean : 103.7
## ClyTile: 1 Plywood:108 Plywood:142 NA's : 8 3rd Qu.: 166.0
## Membran: 1 CemntBd: 61 CmentBd: 60 Max. :1600.0
## (Other): 2 (Other):128 (Other):136 NA's :8
## ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## Ex: 52 Ex: 3 BrkTil:146 Ex :121 Fa : 45 Av :221
## Fa: 14 Fa: 28 CBlock:634 Fa : 35 Gd : 65 Gd :134
## Gd:488 Gd: 146 PConc :647 Gd :618 Po : 2 Mn :114
## TA:906 Po: 1 Slab : 24 TA :649 TA :1311 No :953
## TA:1282 Stone : 6 NA's: 37 NA's: 37 NA's: 38
## Wood : 3
##
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## ALQ :220 Min. : 0.0 ALQ : 19 Min. : 0.00
## BLQ :148 1st Qu.: 0.0 BLQ : 33 1st Qu.: 0.00
## GLQ :418 Median : 383.5 GLQ : 14 Median : 0.00
## LwQ : 74 Mean : 443.6 LwQ : 46 Mean : 46.55
## Rec :133 3rd Qu.: 712.2 Rec : 54 3rd Qu.: 0.00
## Unf :430 Max. :5644.0 Unf :1256 Max. :1474.00
## NA's: 37 NA's: 38
## BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## Min. : 0.0 Min. : 0.0 Floor: 1 Ex:741 N: 95
## 1st Qu.: 223.0 1st Qu.: 795.8 GasA :1428 Fa: 49 Y:1365
## Median : 477.5 Median : 991.5 GasW : 18 Gd:241
## Mean : 567.2 Mean :1057.4 Grav : 7 Po: 1
## 3rd Qu.: 808.0 3rd Qu.:1298.2 OthW : 2 TA:428
## Max. :2336.0 Max. :6110.0 Wall : 4
##
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## FuseA: 94 Min. : 334 Min. : 0 Min. : 0.000
## FuseF: 27 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## FuseP: 3 Median :1087 Median : 0 Median : 0.000
## Mix : 1 Mean :1163 Mean : 347 Mean : 5.845
## SBrkr:1334 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## NA's : 1 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## Min. :0.0000 Min. :0.000 Min. :0.000 Ex:100
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Fa: 39
## Median :0.0000 Median :3.000 Median :1.000 Gd:586
## Mean :0.3829 Mean :2.866 Mean :1.047 TA:735
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :2.0000 Max. :8.000 Max. :3.000
##
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType
## Min. : 2.000 Maj1: 14 Min. :0.000 Ex : 24 2Types : 6
## 1st Qu.: 5.000 Maj2: 5 1st Qu.:0.000 Fa : 33 Attchd :870
## Median : 6.000 Min1: 31 Median :1.000 Gd :380 Basment: 19
## Mean : 6.518 Min2: 34 Mean :0.613 Po : 20 BuiltIn: 88
## 3rd Qu.: 7.000 Mod : 15 3rd Qu.:1.000 TA :313 CarPort: 9
## Max. :14.000 Sev : 1 Max. :3.000 NA's:690 Detchd :387
## Typ :1360 NA's : 81
## GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## Min. :1900 Fin :352 Min. :0.000 Min. : 0.0 Ex : 3
## 1st Qu.:1961 RFn :422 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48
## Median :1980 Unf :605 Median :2.000 Median : 480.0 Gd : 14
## Mean :1979 NA's: 81 Mean :1.767 Mean : 473.0 Po : 3
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 TA :1311
## Max. :2010 Max. :4.000 Max. :1418.0 NA's: 81
## NA's :81
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch
## Ex : 2 N: 90 Min. : 0.00 Min. : 0.00 Min. : 0.00
## Fa : 35 P: 30 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Gd : 9 Y:1340 Median : 0.00 Median : 25.00 Median : 0.00
## Po : 7 Mean : 94.24 Mean : 46.66 Mean : 21.95
## TA :1326 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00
## NA's: 81 Max. :857.00 Max. :547.00 Max. :552.00
##
## X3SsnPorch ScreenPorch PoolArea PoolQC
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Ex : 2
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2
## Median : 0.00 Median : 0.00 Median : 0.000 Gd : 3
## Mean : 3.41 Mean : 15.06 Mean : 2.759 NA's:1453
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :508.00 Max. :480.00 Max. :738.000
##
## Fence MiscFeature MiscVal MoSold
## GdPrv: 59 Gar2: 2 Min. : 0.00 Min. : 1.000
## GdWo : 54 Othr: 2 1st Qu.: 0.00 1st Qu.: 5.000
## MnPrv: 157 Shed: 49 Median : 0.00 Median : 6.000
## MnWw : 11 TenC: 1 Mean : 43.49 Mean : 6.322
## NA's :1179 NA's:1406 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :15500.00 Max. :12.000
##
## YrSold SaleType SaleCondition SalePrice
## Min. :2006 WD :1267 Abnorml: 101 Min. : 34900
## 1st Qu.:2007 New : 122 AdjLand: 4 1st Qu.:129975
## Median :2008 COD : 43 Alloca : 12 Median :163000
## Mean :2008 ConLD : 9 Family : 20 Mean :180921
## 3rd Qu.:2009 ConLI : 5 Normal :1198 3rd Qu.:214000
## Max. :2010 ConLw : 5 Partial: 125 Max. :755000
## (Other): 9
For the remainder of the analysis, we need to select one of the independent quantitative variables (one requirement is that the distribution of this variable is skewed to the right). We will plot histograms of various independent quantitative variables to determine the shape of their distributions (please refer to Appendix A for the variable types). We will first consider quantitative variables with no missing values (no NAs).
Let us consider the following variables: LotArea, BsmtFinSF1, BsmtUnfSF, TotalBsmtSF, X1stFlrSF, GrLivArea, GarageArea
hist(my_data$LotArea)
hist(my_data$BsmtFinSF1)
hist(my_data$BsmtUnfSF)
hist(my_data$TotalBsmtSF)
hist(my_data$X1stFlrSF)
hist(my_data$GrLivArea)
hist(my_data$GarageArea)
We will consider TotalBsmtSF as our independent quantitative variable X. From the histogram, the distribution of this variable appears skewed to the right. We will check this by calculating the mean and the median and comparing them.
my_data %>% summarise(mean_g = mean(TotalBsmtSF), median_g = median(TotalBsmtSF))
## mean_g median_g
## 1 1057.429 991.5
summary(my_data$TotalBsmtSF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 795.8 991.5 1057.0 1298.0 6110.0
Since mean > median for this variable, the distribution is consistent with a right skew. Also, looking at the summary statistics, there are no missing values for the variable ‘TotalBsmtSF’ (Total Basement Square Feet), which makes it a good choice for the X variable.
Let us denote by X the independent variable TotalBsmtSF (total square feet of basement area) and by Y the dependent variable SalePrice (the property’s sale price in dollars). Let x be the 3rd quartile of X and y the 2nd quartile (median) of Y.
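Comparing the mean with the median is a quick heuristic rather than a guarantee of skewness. A minimal sketch of a complementary check, using a moment-based skewness coefficient, follows. The helper `skewness` and the vector `v` are hypothetical and not part of the original analysis.

```r
# Made-up right-skewed sample: most values are moderate, one is very large.
v <- c(0, 700, 800, 900, 1000, 1100, 1300, 6000)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5 (both dividing by n).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

mean(v) > median(v)   # heuristic check: mean above median
skewness(v)           # a positive value points to a right skew
```

A value near zero would suggest a roughly symmetric distribution, so this coefficient can back up what the histogram shows.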
We will now calculate the following:
By definition of the quartiles, we have P(X<= x) = .75 and P(Y<= y) = .50; this therefore implies that P(X
Since we have P(Y>y), the remaining question is to determine P(X>x & Y>y). It is highly unlikely that the two events X>x and Y>y are independent, since the initial premise of the data set is that there might be some correlation between SalePrice (the Y variable) and the other variables.
To determine P(X>x & Y>y) we will use the data we have. Probability calculations will be rounded to 2 decimal places.
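The idea of estimating these probabilities directly from the data can be sketched on a small made-up sample. The vectors `X` and `Y` below are toy values, not the housing data, and the approach (taking the mean of a logical vector) is one possible shortcut rather than the method used in the code that follows.

```r
# Toy paired observations (made up); X and Y stand in for
# TotalBsmtSF and SalePrice.
X <- c(500, 800, 900, 1000, 1100, 1200, 1500, 2000)
Y <- c(100, 120, 150, 160, 170, 200, 260, 300) * 1000

# Ask for the quartiles by probability rather than by position.
x <- unname(quantile(X, probs = 0.75))  # 3rd quartile of X
y <- unname(quantile(Y, probs = 0.50))  # 2nd quartile (median) of Y

# The mean of a logical vector is the proportion of TRUE values.
p_joint <- mean(X > x & Y > y)   # estimate of P(X > x & Y > y)
p_B     <- mean(Y > y)           # estimate of P(Y > y)
p_cond  <- p_joint / p_B         # estimate of P(X > x | Y > y)
```

Requesting `probs = 0.75` explicitly avoids relying on the position of the quartile in the default `quantile()` output.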
# find x and y
x <- quantile(my_data$TotalBsmtSF)[4]
y <- quantile(my_data$SalePrice)[3]
# counts with respect of x and y
below_x <- my_data %>% filter(TotalBsmtSF <= x) %>% summarise(n())
below_strictly_x <- my_data %>% filter(TotalBsmtSF < x) %>% summarise(n())
above_x <- my_data %>% filter(TotalBsmtSF > x) %>% summarise(n())
below_y <- my_data %>% filter(SalePrice <= y) %>% summarise(n())
above_y <- my_data %>% filter(SalePrice > y) %>% summarise(n())
# First row of Grid
below_x_below_y <- my_data %>% filter(TotalBsmtSF <= x , SalePrice <= y) %>% summarise(n())
below_x_above_y <- my_data %>% filter(TotalBsmtSF <= x , SalePrice > y) %>% summarise(n())
below_strictly_x_above_y <- my_data %>% filter(TotalBsmtSF < x , SalePrice > y) %>% summarise(n())
# Second row of Grid
above_x_below_y <- my_data %>% filter(TotalBsmtSF > x , SalePrice <= y) %>% summarise(n())
above_x_above_y <- my_data %>% filter(TotalBsmtSF > x , SalePrice > y) %>% summarise(n())
total <- nrow(my_data)
# Sanity Check P(X <= x) and P(X > x) calculated
p_below_x <- round(below_x / total,2)
p_below_strictly_x <- round(below_strictly_x/total, 2)
p_above_x <- round(above_x / total,2)
# Calculate P(Y>y)
p_above_y <- round(above_y/total,2)
p_above_x_above_y <- round(above_x_above_y / total, 2)
p_below_strictly_x_above_y <- round(below_strictly_x_above_y / total, 2)
Based on our calculation, we have P(X>x & Y>y) = 0.23. Hence, substituting back into the formula, we obtain P(X>x | Y>y) = 0.46. (This value uses the rounded probabilities 0.23 and 0.50; the unrounded counts give 329/728 ≈ 0.45.)
P(X>x, Y>y) represents the probability that the total basement square footage of a house is above the 3rd quartile and that its sale price is above the 2nd quartile. This has already been calculated above: 0.23.
P(X
\(P(X<x \mid Y>y) = \frac{P(X<x \,\&\, Y>y)}{P(Y>y)}\)
\(P(X<x \mid Y>y) =\) 0.54.
| x/y | below 2nd quartile | above 2nd quartile | Total |
|---|---|---|---|
| below 3rd quartile | 696 | 399 | 1095 |
| above 3rd quartile | 36 | 329 | 365 |
| Total | 732 | 728 | 1460 |
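The probabilities discussed here can be recomputed from the counts in the table. The sketch below types those counts into a matrix by hand. The row and column labels are chosen for the example, and "below" is assumed to mean "at or below", matching the `<=` filters in the code above.

```r
# Counts copied from the table: rows split on x (3rd quartile of X),
# columns split on y (2nd quartile of Y).
counts <- matrix(c(696, 399,
                    36, 329),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(X = c("X<=x", "X>x"),
                                 Y = c("Y<=y", "Y>y")))
n <- sum(counts)                               # total observations

p_A_and_B   <- counts["X>x", "Y>y"] / n        # P(X > x & Y > y)
p_B         <- sum(counts[, "Y>y"]) / n        # P(Y > y)
p_A         <- sum(counts["X>x", ]) / n        # P(X > x)
p_A_given_B <- p_A_and_B / p_B                 # P(X > x | Y > y)

round(c(joint = p_A_and_B, P_B = p_B, P_A = p_A, cond = p_A_given_B), 2)
```

Working from unrounded counts and rounding only at the end avoids the small discrepancy that appears when rounded probabilities are divided.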
Let A be the event that an observation is above the 3rd quartile of X, and let B be the event that an observation is above the 2nd quartile of Y.
Hence A = {X>x} and B = {Y>y}. If A and B are independent, then knowing B should not change the probability of A; that is, P(A|B) = P(A).
This can be derived from the conditional probability formula:
\(P(A|B)=\frac{P(A\cap B)}{P(B)}\). When A and B are independent we have \(P(A\cap B)=P(A)\cdot P(B)\), which leads to the following:
If A and B are independent, we have \(P(A|B)=\frac{P(A)\cdot P(B)}{P(B)}=P(A)\).
Let us verify by calculation:
P(A|B) = 0.45
P(A) = 0.25
Since these two values are not the same, the events A and B are not independent.
# Build contingency table for A and B and run the chi-square test
m_tbl <- table(my_data$TotalBsmtSF > x, my_data$SalePrice>y)
m_tst <- chisq.test(m_tbl)
Let H0: A and B are independent
Ha: A and B are not independent
The result of the chi-square test indicates that the p-value is extremely small (< 0.05); hence we reject H0.
Results of Chi-Square test:
m_tst
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: m_tbl
## X-squared = 313.61, df = 1, p-value < 2.2e-16
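As a cross-check, the test can be rerun from the four cell counts alone, without the raw data. This is a sketch that assumes the counts 696, 399, 36 and 329 from the contingency table; it also prints the counts that would be expected under independence.

```r
# 2 x 2 counts: rows are X <= x / X > x, columns are Y <= y / Y > y.
counts <- matrix(c(696, 399,
                    36, 329),
                 nrow = 2, byrow = TRUE)

# For 2 x 2 tables chisq.test() applies Yates' continuity
# correction by default (correct = TRUE).
tst <- chisq.test(counts)

tst$expected            # counts expected if A and B were independent
unname(tst$statistic)   # should agree with the X-squared value above
tst$p.value
```

The gap between the observed count of 329 in the "both above" cell and its expected count under independence is what drives the large statistic.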
Still considering the variables X and Y selected above, we now provide univariate statistics. Let us run some basic statistics on each variable.
mean_X <-round(mean(my_data$TotalBsmtSF), 2)
sd_X <- round(sd(my_data$TotalBsmtSF),4)
median_X <-round(median(my_data$TotalBsmtSF), 2)
mean_Y <- round(mean(my_data$SalePrice),2)
sd_Y <- round(sd(my_data$SalePrice),4)
median_Y <- round(median(my_data$SalePrice), 2)
t_obs <- nrow(my_data)
The mean of X is 1057.43, the standard deviation is 438.7053, and the median is 991.5.
The mean of Y is 180921.2, the standard deviation is 79442.503, and the median is 163000.
# Box Plots
ggplot(my_data, aes(x=1, y=TotalBsmtSF)) + geom_boxplot() + scale_x_continuous(breaks = NULL) + theme(axis.title.x = element_blank())
ggplot(my_data, aes(x=1, y=SalePrice)) + geom_boxplot() + scale_x_continuous(breaks = NULL) + scale_y_continuous(labels = comma) + theme(axis.title.x = element_blank())
# Histograms
ggplot(my_data, aes(x=TotalBsmtSF)) + geom_histogram(binwidth = 20)
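# A bin width of 10 dollars leaves most bins nearly empty; 10000 is an assumed, more readable choice
ggplot(my_data, aes(x=SalePrice)) + geom_histogram(binwidth = 10000) + scale_x_continuous(labels = comma)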
From the box plots and histograms, there appear to be outliers at the high end of both variables.
Let us now look at the scatter plot of X and Y. Because the data points are concentrated in the lower part of the graph, we will use transparency.
sp <- ggplot(my_data, aes(x=TotalBsmtSF, y=SalePrice))
sp + geom_point(alpha = 0.2, colour = 'blue')
From the scatter plot, there is a clear positive relationship between the total basement square footage and the sale price of the house.
result <- t.test(my_data$TotalBsmtSF, my_data$SalePrice, alternative = 'two.sided', paired = TRUE, conf.level = 0.95)
From the result of the calculation, we have a 95% confidence interval of [-183928.34, -175799.19] for the difference in means between X and Y; hence we are 95% confident that the true mean difference lies in this interval.
We will now compute the correlation matrix for X and Y and test the correlation between these two variables at the 99% confidence level.
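The interval reported by a paired t-test can be reproduced by hand from the per-pair differences. The sketch below does this on made-up vectors `a` and `b` (not the housing data) and compares the result with `t.test`.

```r
# Made-up paired measurements; not the housing data.
a <- c(10, 12, 9, 14, 11, 13)
b <- c(8, 11, 10, 10, 9, 12)

d  <- a - b                 # per-pair differences
n  <- length(d)
se <- sd(d) / sqrt(n)       # standard error of the mean difference

# 95% interval: mean difference +/- t quantile times standard error.
ci_manual <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * se

# Same interval as computed by t.test() with paired = TRUE.
ci_ttest <- t.test(a, b, paired = TRUE, conf.level = 0.95)$conf.int

all.equal(ci_manual, as.numeric(ci_ttest))
```

This makes explicit that the paired test works on the differences within each pair, so the interval describes the mean of those differences.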
H0: correlation between X and Y is 0
Ha: correlation between X and Y is not 0
# MASS masks dplyr's select(), so call dplyr::select() explicitly
df <- dplyr::select(my_data,TotalBsmtSF, SalePrice)
m_A <- cor(df)
c_A <- cor.test(my_data$TotalBsmtSF, my_data$SalePrice, conf.level = 0.99)
c_A
##
## Pearson's product-moment correlation
##
## data: my_data$TotalBsmtSF and my_data$SalePrice
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.5697562 0.6539251
## sample estimates:
## cor
## 0.6135806
The p-value for the correlation test is very small (\(9.4842294 \times 10^{-152}\)), far below the 0.01 significance level that corresponds to the 99% confidence level; therefore we reject H0 and conclude that the correlation is not zero. As a rule of thumb, if \(|r| \ge \frac{2}{\sqrt{n}}\), where r is the correlation and n is the sample size, then a relationship exists. Since our sample size is so large, we clearly have:
\(|r| \ge \frac{2}{\sqrt{n}}\), since 0.6135806 \(\ge\) 0.0523424.
We can conclude that there is a positive relationship between X (total basement square feet) and Y (sale price).
We will now conduct PCA on the correlation matrix. Since the two variables we have selected for the analysis are measured in different units, it is preferable to use the correlation matrix as the basis for PCA (correlation being a standardized measure).
Let us find the inverse of the correlation matrix, also known as the precision matrix.
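The figures reported by the correlation test can be recovered from r and n alone. The sketch below assumes the standard t statistic for a Pearson correlation and a Fisher z-transformation for the confidence interval; the variable names are chosen for the example.

```r
r <- 0.6135806   # sample correlation reported above
n <- 1460        # number of observations

# t statistic for H0: correlation = 0, with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 99% confidence interval via the Fisher z-transformation.
z    <- atanh(r)
half <- qnorm(1 - 0.01 / 2) / sqrt(n - 3)
ci   <- tanh(z + c(-1, 1) * half)

# Rule-of-thumb threshold 2 / sqrt(n).
threshold <- 2 / sqrt(n)

c(t = t_stat, lower = ci[1], upper = ci[2], threshold = threshold)
```

Because the lower bound of the interval stays well above zero, the conclusion does not depend on the exact confidence level chosen.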
m_A_inv <- solve(m_A)
M1 <- m_A %*% m_A_inv
M2 <- m_A_inv %*% m_A
Correlation matrix: \(\begin{pmatrix} 1 & 0.6135806 \\ 0.6135806 & 1 \end{pmatrix}\); its inverse, the precision matrix: \(\begin{pmatrix} 1.6038006 & -0.9840609 \\ -0.9840609 & 1.6038006 \end{pmatrix}\).
Since the two matrices are inverses of each other, we expect the product of the correlation matrix with the precision matrix to be the identity matrix.
Indeed, we have Correlation_Matrix × Precision_Matrix = \(\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\) and Precision_Matrix × Correlation_Matrix = \(\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\).
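For a 2 x 2 correlation matrix the inverse has a simple closed form, which gives an independent check on `solve()`. The sketch below assumes only the correlation value 0.6135806 reported above; the object names are chosen for the example.

```r
r <- 0.6135806
R <- matrix(c(1, r, r, 1), nrow = 2)   # 2 x 2 correlation matrix

# Numerical inverse.
R_inv_solve <- solve(R)

# Closed form for the 2 x 2 case: 1 / (1 - r^2) * [[1, -r], [-r, 1]].
R_inv_closed <- matrix(c(1, -r, -r, 1), nrow = 2) / (1 - r^2)

all.equal(R_inv_solve, R_inv_closed)     # the two inverses should agree
all.equal(R %*% R_inv_solve, diag(2))    # product should be the identity
```

Comparing with `all.equal()` rather than `==` allows for the tiny floating-point differences that matrix inversion introduces.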
PCA, or Principal Component Analysis, is a data reduction technique. As we understand it, the data points are projected onto a direction chosen to preserve as much variability as possible; once this first direction is found, the next one is chosen to be orthogonal to the first while maximizing the remaining variability, and so on. This process reduces the number of observed variables to a smaller number of principal components that account for most of the variance of the observed variables.
From the correlation matrix, we will calculate the eigenvectors and eigenvalues. The largest eigenvalue indicates the most variability and corresponds to the eigenvector that represents the first component.
# Eigen vectors and Eigen values of correlation matrix
m_A_eigen <- eigen(m_A)
#m_A_inv_eigen <- eigen(m_A_inv)
m_A_eigen
## $values
## [1] 1.6135806 0.3864194
##
## $vectors
## [,1] [,2]
## [1,] 0.7071068 -0.7071068
## [2,] 0.7071068 0.7071068
#m_A_inv_eigen
From the eigenvalues, 1.6135806 and 0.3864194, we can conclude that the first principal component is given by the eigenvector (0.7071068, 0.7071068) and the second by (-0.7071068, 0.7071068). The first component accounts for 1.6135806 / 2 ≈ 81% of the total variance.
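For two standardized variables the eigen decomposition has a known pattern: the eigenvalues are 1 + r and 1 - r, and the eigenvector entries all have absolute value 1/sqrt(2). A short sketch, assuming only the reported correlation value, checks this.

```r
r <- 0.6135806
R <- matrix(c(1, r, r, 1), nrow = 2)
e <- eigen(R)

# Eigenvalues of a 2 x 2 correlation matrix are 1 + r and 1 - r.
all.equal(e$values, c(1 + r, 1 - r))

# Share of the total variance carried by each component.
prop_var <- e$values / sum(e$values)
prop_var

# Eigenvectors are only defined up to sign, so compare absolute values.
all.equal(abs(e$vectors), matrix(1 / sqrt(2), nrow = 2, ncol = 2))
```

The sign of an eigenvector can differ between software versions, which is why the comparison uses absolute values.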
We will conduct PCA on quantitative variables from the training set as follows:
LotFrontage, we will impute missing data with 0,
LotArea,
YearBuilt,
YearRemodAdd,
MasVnrArea, we will impute missing data with 0,
BsmtFinSF1,
BsmtFinSF2,
BsmtUnfSF,
TotalBsmtSF,
X1stFlrSF,
X2ndFlrSF,
LowQualFinSF,
GrLivArea,
BsmtFullBath,
BsmtHalfBath,
FullBath,
HalfBath,
BedroomAbvGr,
KitchenAbvGr,
TotRmsAbvGrd,
Fireplaces,
GarageYrBlt, we will impute missing data with the house build year,
GarageCars,
GarageArea
We will first build this new data set and then impute the missing data for these variables as indicated.
df1 <- dplyr::select(my_data, LotFrontage, LotArea, YearBuilt, YearRemodAdd, MasVnrArea, BsmtFinSF1, BsmtFinSF2,BsmtUnfSF, TotalBsmtSF, X1stFlrSF, X2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, TotRmsAbvGrd, Fireplaces, GarageYrBlt, GarageCars, GarageArea)
# impute missing data
df1$LotFrontage[is.na(df1$LotFrontage)]<-0
df1$MasVnrArea[is.na(df1$MasVnrArea)]<-0
df1 <- df1 %>% mutate(GarageYrBlt = ifelse(is.na(GarageYrBlt), YearBuilt, GarageYrBlt))
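The imputation rules can be tried out on a tiny made-up data frame before applying them to the full set. The sketch below uses base R only; the data frame `toy` and its values are invented for the example, and replacing NA with 0 simply follows the choice made in this analysis.

```r
# Toy rows with the same column names as the variables being imputed.
toy <- data.frame(
  LotFrontage = c(65, NA, 68),
  MasVnrArea  = c(196, 0, NA),
  YearBuilt   = c(2003, 1976, 1950),
  GarageYrBlt = c(2003, NA, 1960)
)

# Replace missing values with 0.
toy$LotFrontage[is.na(toy$LotFrontage)] <- 0
toy$MasVnrArea[is.na(toy$MasVnrArea)]   <- 0

# Fall back to the build year where the garage year is missing.
toy$GarageYrBlt <- ifelse(is.na(toy$GarageYrBlt),
                          toy$YearBuilt, toy$GarageYrBlt)

# PCA cannot handle NA, so confirm nothing is left missing.
stopifnot(!anyNA(toy))
toy
```

A final `anyNA()` check of this kind is a cheap safeguard, because `prcomp()` stops with an error if any missing value remains.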
We will perform PCA on this data set.
prin_comp <- prcomp(df1, center = TRUE, scale. = TRUE)
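To see how much variance each component carries, the standard deviations returned by `prcomp` can be squared and normalized. The sketch below uses simulated data with invented column names, so the numbers are illustrative only. It also checks that, with `scale. = TRUE`, the squared standard deviations match the eigenvalues of the correlation matrix.

```r
set.seed(1)
# Made-up correlated columns on very different scales.
a <- rnorm(200)
toy <- data.frame(
  area  = 1000 + 300 * a + rnorm(200, sd = 100),
  rooms = 6 + a + rnorm(200),
  price = 180000 + 50000 * a + rnorm(200, sd = 30000)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# With scaling, component variances equal the eigenvalues of cor(toy).
eig <- eigen(cor(toy))$values
all.equal(pca$sdev^2, eig)

# Proportion and cumulative proportion of variance per component.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(prop_var)
```

The cumulative proportions give a simple basis for deciding how many components to keep.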
Let us now examine some key results of PCA. First we will plot the variables on a graph with PC1 and PC2 as axes.
biplot(prin_comp, scale = 0)
Unfortunately, the results are difficult to interpret for variability along PC1. We would expect X2ndFlrSF, BedroomAbvGr, TotRmsAbvGrd, GrLivArea, BsmtFinSF1 and (possibly) BsmtFullBath to carry the largest loadings. To get confirmation, we will look at the rotation matrix.
prin_comp$rotation
## PC1 PC2 PC3 PC4
## LotFrontage -0.107942712 0.031879350 0.11578258 -0.201992974
## LotArea -0.116287525 -0.023751975 0.29321734 0.001608433
## YearBuilt -0.250777046 -0.214371658 -0.33094074 0.083738271
## YearRemodAdd -0.222274311 -0.107477888 -0.29501259 0.049595993
## MasVnrArea -0.213317465 -0.028376777 0.02565112 0.048500894
## BsmtFinSF1 -0.149087754 -0.311156932 0.31586683 0.269215581
## BsmtFinSF2 0.015173227 -0.070534416 0.18889936 0.056799637
## BsmtUnfSF -0.126321173 0.138107898 -0.19960478 -0.556690778
## TotalBsmtSF -0.276650525 -0.210326435 0.19680776 -0.259927710
## X1stFlrSF -0.278448687 -0.149066580 0.27368130 -0.291096099
## X2ndFlrSF -0.148658060 0.415466735 -0.06283344 0.333536969
## LowQualFinSF 0.014675123 0.129056351 0.12126231 -0.065103163
## GrLivArea -0.326986077 0.247413326 0.16036640 0.056897790
## BsmtFullBath -0.077602016 -0.311686918 0.26363529 0.288699967
## BsmtHalfBath 0.013227881 0.002926211 0.06334389 0.033836542
## FullBath -0.286504039 0.140195572 -0.11350494 -0.059246660
## HalfBath -0.135802999 0.207823705 -0.13666140 0.414387939
## BedroomAbvGr -0.132845034 0.379561926 0.14494678 -0.004296945
## KitchenAbvGr 0.005442026 0.171565697 0.13734063 -0.135947137
## TotRmsAbvGrd -0.268648739 0.333118647 0.14998861 0.000976127
## Fireplaces -0.195478268 0.004670823 0.21057776 0.094437226
## GarageYrBlt -0.258308258 -0.186244169 -0.36318499 0.058798422
## GarageCars -0.313815483 -0.090553435 -0.14110813 0.006250847
## GarageArea -0.308738507 -0.122143433 -0.08063520 -0.010309663
## PC5 PC6 PC7 PC8
## LotFrontage -0.103636085 -0.067201700 0.15390070 0.329392373
## LotArea 0.236359298 -0.072518277 -0.01232332 -0.204829830
## YearBuilt -0.032688850 -0.067068008 -0.09232849 -0.002264289
## YearRemodAdd -0.030787946 -0.196932570 -0.02746442 0.113779586
## MasVnrArea 0.087224641 0.391865445 0.06102250 0.056790217
## BsmtFinSF1 -0.166865837 0.196098966 0.02437687 0.210171895
## BsmtFinSF2 0.253120072 -0.659634433 -0.34182048 -0.274986345
## BsmtUnfSF 0.141362403 0.053072713 0.15906132 -0.134162389
## TotalBsmtSF 0.061976295 0.014770039 0.05985784 -0.017742031
## X1stFlrSF 0.009027335 0.022615952 -0.02810272 -0.046147775
## X2ndFlrSF 0.011148682 0.001507863 0.06173446 -0.015776719
## LowQualFinSF -0.019839890 -0.455036236 0.36879121 0.572627737
## GrLivArea 0.014066933 -0.024214014 0.06473399 0.005929383
## BsmtFullBath -0.300669644 -0.131968773 0.07673101 -0.042339087
## BsmtHalfBath 0.482343579 0.183877377 -0.58071360 0.531546262
## FullBath -0.147252695 -0.092999464 -0.16917358 -0.025685709
## HalfBath 0.141687951 0.036432511 0.13891434 -0.076187189
## BedroomAbvGr -0.079060589 -0.040432742 -0.15373392 0.042135331
## KitchenAbvGr -0.547981979 0.101616129 -0.44645276 -0.087939153
## TotRmsAbvGrd -0.108527031 -0.034258483 -0.05234999 0.009544590
## Fireplaces 0.346647662 0.124606598 0.20593676 -0.244546866
## GarageYrBlt -0.061795252 -0.136855369 -0.08603007 0.054550678
## GarageCars 0.010532411 -0.002114496 -0.06463626 -0.001460012
## GarageArea -0.003173496 -0.004776191 -0.03641875 0.042071145
## PC9 PC10 PC11 PC12
## LotFrontage -0.740855966 0.218644481 -0.27560663 0.06319602
## LotArea -0.054720216 0.606120579 0.38396756 -0.50937189
## YearBuilt 0.085162076 0.068424524 -0.15266104 -0.14086430
## YearRemodAdd 0.167117397 0.275137501 -0.24754450 0.01866748
## MasVnrArea -0.055823971 -0.456713779 -0.09385641 -0.62667075
## BsmtFinSF1 0.090385507 0.065431024 -0.08456485 0.04855522
## BsmtFinSF2 -0.180384063 -0.346929982 -0.16858494 -0.08852110
## BsmtUnfSF 0.062823406 -0.025842925 -0.07073996 -0.03865640
## TotalBsmtSF 0.090914826 -0.085576035 -0.22115875 -0.02100544
## X1stFlrSF 0.119169822 -0.074325141 -0.09493369 0.08674646
## X2ndFlrSF -0.053708283 0.047031298 -0.01262683 0.01194043
## LowQualFinSF 0.310992688 -0.195093018 0.28786619 -0.13763508
## GrLivArea 0.071831077 -0.033661985 -0.05369416 0.06100173
## BsmtFullBath 0.024655177 0.021441649 -0.08901473 0.01333057
## BsmtHalfBath 0.069981100 0.057232064 -0.01924082 0.10050585
## FullBath 0.229745149 0.176203649 -0.05362661 -0.01588288
## HalfBath -0.140774563 -0.127792579 -0.11876591 -0.04919915
## BedroomAbvGr 0.002330621 0.078813298 -0.15052306 -0.05268625
## KitchenAbvGr 0.048000051 -0.115364910 0.19120823 -0.02267002
## TotRmsAbvGrd 0.036696599 -0.013580197 -0.05276266 0.06127635
## Fireplaces 0.176902793 0.005092835 0.04178891 0.44632165
## GarageYrBlt 0.032590204 0.033102955 0.05043677 -0.07158271
## GarageCars -0.227454612 -0.134685895 0.44390069 0.18209946
## GarageArea -0.268523106 -0.161858248 0.45846837 0.15392849
## PC13 PC14 PC15 PC16
## LotFrontage -0.207972882 0.205523963 -0.140465378 0.012716860
## LotArea -0.113313253 -0.071758102 -0.009754141 -0.005664877
## YearBuilt -0.065106514 -0.026519516 -0.443896821 0.052086020
## YearRemodAdd -0.219717279 0.079388664 0.516162019 -0.426223864
## MasVnrArea -0.018364906 0.340535683 0.062477255 -0.212310967
## BsmtFinSF1 0.078802494 -0.058356074 -0.008577452 0.258757311
## BsmtFinSF2 -0.027312609 0.108438835 0.036929580 0.052434830
## BsmtUnfSF -0.066237395 -0.274797047 0.047255071 -0.104135822
## TotalBsmtSF 0.005168619 -0.297572305 0.052257738 0.183410779
## X1stFlrSF -0.019099960 -0.078868766 0.051850172 0.111034017
## X2ndFlrSF 0.021034730 0.068277482 0.229187754 0.158183755
## LowQualFinSF -0.157843923 0.009210298 -0.137971779 -0.021789710
## GrLivArea -0.011182931 -0.000450640 0.215770185 0.211076602
## BsmtFullBath 0.058252338 -0.151852444 0.062355931 -0.370097158
## BsmtHalfBath -0.086150713 -0.106163602 0.026090363 -0.055805339
## FullBath 0.108149166 0.429133463 -0.043555824 0.382796154
## HalfBath -0.368746084 -0.502846195 -0.109356533 0.137758013
## BedroomAbvGr 0.501113606 -0.170596049 -0.359861980 -0.396217027
## KitchenAbvGr -0.553574027 -0.043707592 -0.078208061 -0.079749131
## TotRmsAbvGrd 0.077652127 -0.060247335 0.040215394 -0.126763253
## Fireplaces -0.302838824 0.357911165 -0.336810648 -0.279803722
## GarageYrBlt -0.004387633 -0.037261395 -0.305675983 0.025034339
## GarageCars 0.111972636 -0.016712775 0.073564141 -0.094852577
## GarageArea 0.165819335 -0.060787380 0.132828171 -0.029261784
## PC17 PC18 PC19 PC20
## LotFrontage -0.051037863 0.005475805 0.013715636 -0.01812178
## LotArea 0.024064367 -0.016827673 -0.004725283 0.01955012
## YearBuilt -0.035467928 -0.099113366 0.018824562 0.23502603
## YearRemodAdd 0.300977521 0.168981474 -0.088678446 -0.01864759
## MasVnrArea -0.006284062 0.020509997 0.022116404 -0.01570259
## BsmtFinSF1 0.214156061 0.112912459 -0.284168415 0.13748492
## BsmtFinSF2 0.029911285 0.028095552 -0.111488395 0.04684240
## BsmtUnfSF -0.347440305 0.030843961 -0.086336945 0.02390129
## TotalBsmtSF -0.116298913 0.158786397 -0.423389758 0.18423387
## X1stFlrSF 0.274889795 -0.135130803 0.443859683 -0.33345728
## X2ndFlrSF -0.275354611 -0.168735994 -0.371653865 -0.08179725
## LowQualFinSF -0.003365334 0.075743919 -0.014539185 0.01723154
## GrLivArea -0.026822555 -0.232577794 0.016454069 -0.31167571
## BsmtFullBath -0.619777913 -0.038689215 0.242737308 -0.08272200
## BsmtHalfBath -0.242152704 -0.045321188 0.078201077 -0.01548062
## FullBath -0.259363907 0.424019339 0.264157612 0.02415114
## HalfBath 0.101015017 0.313587530 0.287169968 -0.04497212
## BedroomAbvGr 0.177230936 0.284024621 -0.142328722 -0.22249894
## KitchenAbvGr 0.003057937 0.054618219 -0.182253289 -0.06122493
## TotRmsAbvGrd 0.133695037 -0.445394789 0.215122494 0.62912394
## Fireplaces -0.020352044 0.037222203 -0.138441601 -0.01846767
## GarageYrBlt 0.024637103 -0.440476006 -0.160445473 -0.31391361
## GarageCars -0.010009629 0.230848471 0.082287116 0.28395220
## GarageArea 0.034145290 0.047090485 -0.082787015 -0.18513167
## PC21 PC22 PC23 PC24
## LotFrontage 0.007395706 -0.0216147861 3.018241e-16 -1.005889e-16
## LotArea 0.005511807 -0.0046389545 2.906665e-16 -3.815000e-16
## YearBuilt -0.570089389 0.3302550110 2.646796e-16 2.155836e-16
## YearRemodAdd 0.003824775 0.0287699387 -6.903257e-16 3.154166e-16
## MasVnrArea 0.047978832 -0.0235906161 -2.095595e-16 2.402605e-16
## BsmtFinSF1 0.066273099 -0.0424492938 1.737657e-02 5.781645e-01
## BsmtFinSF2 0.016701886 -0.0048305951 6.145992e-03 2.044935e-01
## BsmtUnfSF 0.035240262 0.0086463349 1.683439e-02 5.601247e-01
## TotalBsmtSF 0.110536332 -0.0371998682 -1.671393e-02 -5.561169e-01
## X1stFlrSF -0.180714536 -0.0539374655 4.913397e-01 -1.476707e-02
## X2ndFlrSF -0.200335459 -0.0227150421 5.548126e-01 -1.667473e-02
## LowQualFinSF -0.038898585 0.0115785131 6.179826e-02 -1.857329e-03
## GrLivArea -0.302971381 -0.0574794846 -6.678674e-01 2.007256e-02
## BsmtFullBath 0.024727093 -0.0009081554 -2.269157e-17 1.780140e-17
## BsmtHalfBath 0.014043163 0.0068453432 4.346230e-17 -1.198273e-16
## FullBath 0.269022580 0.0348893208 -8.329774e-17 4.827869e-17
## HalfBath 0.189717516 0.0026948617 -7.188683e-17 -2.086940e-18
## BedroomAbvGr -0.052448710 -0.0317913225 1.055906e-16 3.319138e-16
## KitchenAbvGr -0.044747163 0.0070674175 -5.494251e-17 1.157658e-16
## TotRmsAbvGrd 0.272985109 0.1015726004 5.653153e-17 -2.483031e-16
## Fireplaces 0.068296834 0.0285671714 7.625148e-17 -1.981755e-17
## GarageYrBlt 0.432531080 -0.3449819791 -3.237769e-17 -2.096795e-16
## GarageCars -0.276625560 -0.5696061949 7.910578e-17 1.135158e-16
## GarageArea 0.177905934 0.6496013645 -6.126922e-17 -1.283654e-17
Let us now examine the scree plot and the cumulative scree plot.
#compute standard deviation of each principal component
std_dev <- prin_comp$sdev
#compute variance
pr_var <- std_dev^2
#check variance of first 10 components
pr_var[1:10]
## [1] 6.1042470 3.1050505 2.2238539 1.8753994 1.1963679 1.0489798 1.0379231
## [8] 0.9777722 0.9202734 0.8185473
#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:20]
## [1] 0.254343624 0.129377106 0.092660578 0.078141644 0.049848661
## [6] 0.043707492 0.043246795 0.040740510 0.038344725 0.034106136
## [11] 0.033947264 0.028265587 0.025747604 0.024712960 0.020003264
## [16] 0.016665549 0.013363184 0.009332517 0.008403620 0.006091195
#scree plot
plot(prop_varex, xlab = "Principal Component", ylab = "Proportion of Variance Explained",type = "b")
#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component", ylab = "Cumulative Proportion of Variance Explained", type = "b")
From the scree plots, it appears that over 92% of the variability can be explained with the first 15 principal components (about 93.7%, summing the proportions printed above).
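The number of components needed to reach a given share of the variance can also be read off programmatically rather than from the plot. The sketch below is only an illustration: it re-enters the 20 proportions printed above by hand (the names `prop_varex_20`, `threshold` and `n_comp` are ours), and it takes the 92% threshold from the sentence above.

```r
# First 20 proportions of variance explained, copied from the printed output
prop_varex_20 <- c(0.254343624, 0.129377106, 0.092660578, 0.078141644,
                   0.049848661, 0.043707492, 0.043246795, 0.040740510,
                   0.038344725, 0.034106136, 0.033947264, 0.028265587,
                   0.025747604, 0.024712960, 0.020003264, 0.016665549,
                   0.013363184, 0.009332517, 0.008403620, 0.006091195)

# Cumulative share of variance explained by the first k components
cum_varex <- cumsum(prop_varex_20)

# Smallest number of components whose cumulative share reaches the threshold
threshold <- 0.92
n_comp <- which(cum_varex >= threshold)[1]

n_comp
cum_varex[n_comp]
```

With the full `prop_varex` vector from the analysis, the same two lines (`cumsum` followed by `which`) apply unchanged.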
We will now fit a closed-form distribution to the data from our variable X (Total Basement Square Feet). Since we will fit an exponential probability density function, we need to ensure that X takes values in the interval \(\left[ 0, \infty \right)\).
The minimum of X is 0. Since this minimum is greater than or equal to 0, all values already lie within the interval and there is no need to shift the data to the right.
# Fitting to exponential distribution
fd <- fitdistr(my_data$TotalBsmtSF, 'exponential')
fd_est <- fd$estimate
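For the exponential distribution, the maximum-likelihood estimate of the rate has a closed form: the reciprocal of the sample mean. The value returned by fitdistr can therefore be checked by hand. The sketch below illustrates this on simulated data, not on the housing data; the sample `x`, its size and the seed are arbitrary assumptions made for the example.

```r
library(MASS)  # provides fitdistr

set.seed(605)                  # arbitrary seed, for reproducibility only
x <- rexp(500, rate = 0.001)   # simulated stand-in for a non-negative variable

fit <- fitdistr(x, "exponential")
rate_mle  <- unname(fit$estimate)  # rate estimated by fitdistr
rate_hand <- 1 / mean(x)           # closed-form maximum-likelihood estimate

# Both values should agree
c(fitdistr = rate_mle, by_hand = rate_hand)
```

Applied to our data, the same check would compare `fd_est` with `1 / mean(my_data$TotalBsmtSF)`.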
We will now draw a random sample of 1000 observations from the fitted exponential distribution and compare its histogram with that of our original variable.
r_X <- rexp(1000, fd_est)
hist(r_X)
In contrast, let us look at the histogram for our X variable (Total Basement Square Feet).
hist(my_data$TotalBsmtSF)
The variable Total Basement Square Feet has a sizeable number of observations at 0, since some houses do not have a basement. However, barring this spike at 0, the distribution of the variable more closely resembles a normal distribution skewed to the right by a few outliers with high values.
From the result of the fitdistr function, we know that the best-fitting exponential probability distribution has a rate of \(9.4568957 \times 10^{-4}\). This leads to the following PDF:
\(f(x) = \lambda e^{-\lambda x}\), where \(\lambda = 9.4568957 \times 10^{-4}\).
Hence, its cumulative distribution function (CDF) is given by: \(F(x) = 1 - e^{-\lambda x}\), where \(\lambda = 0.0009456896\).
In order to find the pth percentile \(x_p\), we solve \(F(x_p) = p\). Let us derive the general formula for the exponential distribution.
\(F(x_p) = 1 - e^{-\lambda x_p} = p \quad \Leftrightarrow \quad e^{-\lambda x_p} = 1 - p\),
taking the logarithm of both sides, we get:
\(-\lambda x_p = \ln(1-p) \quad \Leftrightarrow \quad x_p = -\frac{1}{\lambda} \ln(1-p), \quad \text{with } 0 \le p < 1\)
Let us now find the 5th and 95th percentiles.
5th & 95th percentiles
With p = 0.05 and p = 0.95, substituting p and \(\lambda\) into the formula we derived, we obtain:
f_perc <- function (p, l_rte){
xp <- round(-1/l_rte*(log(1-p)),2)
return(xp)
}
xp_5 <- f_perc(0.05, fd_est)
xp_95 <- f_perc(0.95, fd_est)
xp_5
## rate
## 54.24
xp_95
## rate
## 3167.78
For the fitted exponential distribution, the 5th percentile is 54.24 and the 95th percentile is 3167.78. Hence, under this model, 90% of the probability mass falls between these two values.
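The derived percentile formula can be cross-checked against R's built-in exponential quantile function, qexp. The sketch below uses the rate as printed in the text (rounded to ten decimal places), so the last digits may differ slightly from the values computed with the unrounded `fd_est`; the names `lambda`, `by_formula` and `by_qexp` are ours.

```r
# Rate reported by fitdistr in the text above (rounded)
lambda <- 0.0009456896

# Percentiles from the derived formula x_p = -ln(1 - p) / lambda
p <- c(0.05, 0.95)
by_formula <- -log(1 - p) / lambda

# Same percentiles from the built-in quantile function
by_qexp <- qexp(p, rate = lambda)

round(by_formula, 2)
round(by_qexp, 2)
all.equal(by_formula, by_qexp)
```

Using qexp directly avoids re-deriving the inverse CDF and makes the hand-written `f_perc` function easy to validate.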
Let us now consider the empirical data. First we will build a 95% confidence interval for the mean, assuming a normal distribution, and then find the empirical 5th and 95th percentiles.
# 95% confidence interval assuming normal distribution
error <- qnorm(0.975)*sd_X/sqrt(t_obs)
lower_bound <- mean_X - error
upper_bound <- mean_X + error
Confidence interval: 1034.9268 to 1079.9332.
# Percentiles
xp_empirical <- quantile(my_data$TotalBsmtSF, c(.05, .95))
xp_empirical
## 5% 95%
## 519.3 1753.0
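The empirical 5th and 95th percentiles (519.3 and 1753.0) are far from those of the fitted exponential distribution (54.24 and 3167.78). The results confirm that, although the distribution of variable X is skewed to the right, it is closer to a normal distribution (with the exception of the spike at 0) than to an exponential distribution.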
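To make this comparison concrete, the percentiles of a normal distribution can be placed next to the empirical and exponential ones. The sketch below is a rough illustration built only from numbers reported above. It recovers the sample mean and standard deviation from the confidence interval, which assumes that `t_obs` is the 1460 observations of the data set; all names in the block are ours.

```r
# Values reported in the text above
ci_lower <- 1034.9268
ci_upper <- 1079.9332
n_obs    <- 1460          # assumption: t_obs equals the number of observations
lambda   <- 0.0009456896  # fitted exponential rate

# Recover the sample mean and standard deviation from the 95% confidence interval
mean_hat <- (ci_lower + ci_upper) / 2
sd_hat   <- (ci_upper - ci_lower) / 2 * sqrt(n_obs) / qnorm(0.975)

# 5th and 95th percentiles under each description of the data
p <- c(0.05, 0.95)
comparison <- rbind(
  empirical   = c(519.3, 1753.0),
  normal      = qnorm(p, mean = mean_hat, sd = sd_hat),
  exponential = qexp(p, rate = lambda)
)
colnames(comparison) <- c("5%", "95%")
round(comparison, 1)
```

The row whose values sit closest to the empirical row indicates which of the two parametric shapes describes the bulk of the data better.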
We want to be able to predict the Sale Price based on a selection of some (or all) of the independent variables. From our previous analysis and from a survey of a few people working in real estate, we will consider the following variables as possible predictors. We should note that some are quantitative and some are categorical.
From observation of the data, we will exclude the following variables, since most of their values are NA or they take a single value: ‘Street’, ‘Alley’, ‘Utilities’, ‘CentralAir’, ‘FireplaceQu’, ‘PoolQC’, ‘Fence’, ‘MiscFeature’, ‘GarageQual’, ‘GarageCond’.
From the results of the survey and from the PCA components, we will add variables to our model one at a time. The Adjusted R2 obtained after each addition is shown in parentheses:
GrLivArea (0.5021)
TotRmsAbvGrd (0.5097)
GarageCars (0.6293)
LotArea (0.6331)
YearBuilt (0.6922)
YearRemodAdd (0.7038)
Neighborhood (0.7763)
OverallCond (0.7855)
Foundation (0.7875)
BedroomAbvGr (0.7933)
FullBath (0.7932): when FullBath was added, the Adj. R2 value went down, so we replace this variable with another.
BsmtFinType1 (0.7972)
BsmtFinSF1 (0.8008)
BsmtFinType2 (0.8012)
BsmtFinSF2 (0.8017)
LowQualFinSF (0.8018)
MSSubClass (0.8168)
MSZoning (0.8173)
LandContour (0.8198)
Condition1 (0.8219)
Condition2 (0.8263)
BldgType (0.8296)
HouseStyle (0.8323)
RoofMatl (0.8687)
Exterior1st (0.8719)
Exterior2nd (0.8733)
ExterQual (0.8851)
ExterCond (0.8852)
BsmtQual (0.8915)
BsmtCond (0.8914): when this variable was added, the Adj. R2 value went down, so we replace it with another.
Heating (0.8913): when this variable was added, the Adj. R2 value went down, so we replace it with another.
CentralAir (0.8916)
X1stFlrSF (0.8915): when this variable was added, the Adj. R2 value went down, so we replace it with another.
KitchenAbvGr (0.8925)
KitchenQual (0.8964)
PavedDrive (0.8962): when this variable was added, the Adj. R2 value went down, so we replace it with another.
PoolArea (0.8979)
attach(my_data)
my.lm <- lm(SalePrice~GrLivArea)
my.lm_p <- lm(SalePrice~GrLivArea + TotRmsAbvGrd + GarageCars + LotArea + YearBuilt + YearRemodAdd + Neighborhood + OverallCond + Foundation + BedroomAbvGr + BsmtFinType1 + BsmtFinSF1 + BsmtFinType2 + BsmtFinSF2 + LowQualFinSF + MSSubClass + MSZoning + LandContour + Condition1 + Condition2 + BldgType + HouseStyle + RoofMatl + Exterior1st + Exterior2nd + ExterQual + ExterCond + BsmtQual + CentralAir + X1stFlrSF + KitchenAbvGr + KitchenQual + PoolArea)
As we add each variable, we check the Adjusted R2 value, and we repeat the process as long as the Adjusted R2 keeps increasing.
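The add-and-check procedure described here can be automated with a short loop. The sketch below is only an illustration on the built-in mtcars data, not the housing model: the response `mpg`, the candidate list and its order are arbitrary assumptions. Each candidate is tried in turn and kept only if it raises the Adjusted R2.

```r
# Forward selection on adjusted R-squared, in a fixed candidate order,
# illustrated on the built-in mtcars data (mpg as the response)
candidates <- c("wt", "hp", "qsec", "drat", "cyl")
selected   <- character(0)
best_adj   <- -Inf

for (v in candidates) {
  trial  <- c(selected, v)
  fit    <- lm(reformulate(trial, response = "mpg"), data = mtcars)
  adj_r2 <- summary(fit)$adj.r.squared
  # Keep the candidate only if the adjusted R-squared improves
  if (adj_r2 > best_adj) {
    selected <- trial
    best_adj <- adj_r2
  }
}

selected   # variables retained
best_adj   # adjusted R-squared of the retained model
```

Because the candidates are visited in a fixed order, the result depends on that order; it is a convenience for bookkeeping rather than an exhaustive search.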
summary(my.lm_p)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + TotRmsAbvGrd + GarageCars +
## LotArea + YearBuilt + YearRemodAdd + Neighborhood + OverallCond +
## Foundation + BedroomAbvGr + BsmtFinType1 + BsmtFinSF1 + BsmtFinType2 +
## BsmtFinSF2 + LowQualFinSF + MSSubClass + MSZoning + LandContour +
## Condition1 + Condition2 + BldgType + HouseStyle + RoofMatl +
## Exterior1st + Exterior2nd + ExterQual + ExterCond + BsmtQual +
## CentralAir + X1stFlrSF + KitchenAbvGr + KitchenQual + PoolArea)
##
## Residuals:
## Min 1Q Median 3Q Max
## -178551 -10503 417 10767 178551
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.570e+06 1.790e+05 -8.773 < 2e-16 ***
## GrLivArea 7.328e+01 5.222e+00 14.034 < 2e-16 ***
## TotRmsAbvGrd 1.327e+03 1.013e+03 1.310 0.190396
## GarageCars 7.972e+03 1.367e+03 5.832 6.91e-09 ***
## LotArea 5.665e-01 8.736e-02 6.485 1.26e-10 ***
## YearBuilt 4.693e+02 7.213e+01 6.505 1.11e-10 ***
## YearRemodAdd 4.975e+01 5.673e+01 0.877 0.380644
## NeighborhoodBlueste 1.803e+03 2.060e+04 0.088 0.930267
## NeighborhoodBrDale 2.117e+03 1.151e+04 0.184 0.854076
## NeighborhoodBrkSide -1.072e+04 9.819e+03 -1.092 0.274952
## NeighborhoodClearCr -2.077e+04 9.701e+03 -2.141 0.032480 *
## NeighborhoodCollgCr -1.315e+04 7.548e+03 -1.742 0.081688 .
## NeighborhoodCrawfor 5.344e+03 8.947e+03 0.597 0.550423
## NeighborhoodEdwards -2.631e+04 8.309e+03 -3.166 0.001582 **
## NeighborhoodGilbert -1.804e+04 8.040e+03 -2.244 0.025020 *
## NeighborhoodIDOTRR -1.115e+04 1.116e+04 -0.999 0.318149
## NeighborhoodMeadowV -1.647e+04 1.170e+04 -1.408 0.159336
## NeighborhoodMitchel -2.910e+04 8.537e+03 -3.409 0.000672 ***
## NeighborhoodNAmes -2.454e+04 8.117e+03 -3.023 0.002548 **
## NeighborhoodNoRidge 2.540e+04 8.674e+03 2.928 0.003470 **
## NeighborhoodNPkVill 1.008e+04 1.491e+04 0.676 0.499229
## NeighborhoodNridgHt 2.235e+04 7.621e+03 2.933 0.003413 **
## NeighborhoodNWAmes -2.598e+04 8.364e+03 -3.106 0.001936 **
## NeighborhoodOldTown -2.177e+04 1.001e+04 -2.174 0.029878 *
## NeighborhoodSawyer -2.250e+04 8.510e+03 -2.644 0.008285 **
## NeighborhoodSawyerW -1.392e+04 8.200e+03 -1.697 0.089915 .
## NeighborhoodSomerst 5.735e+03 9.224e+03 0.622 0.534223
## NeighborhoodStoneBr 3.796e+04 8.651e+03 4.388 1.24e-05 ***
## NeighborhoodSWISU -1.401e+04 1.020e+04 -1.374 0.169555
## NeighborhoodTimber -1.555e+04 8.653e+03 -1.797 0.072575 .
## NeighborhoodVeenker 3.386e+02 1.091e+04 0.031 0.975252
## OverallCond 7.152e+03 8.917e+02 8.020 2.36e-15 ***
## FoundationCBlock 2.093e+03 3.328e+03 0.629 0.529523
## FoundationPConc 5.243e+03 3.658e+03 1.433 0.151961
## FoundationStone 3.662e+03 1.120e+04 0.327 0.743664
## FoundationWood -2.374e+04 1.555e+04 -1.527 0.127013
## BedroomAbvGr -4.678e+03 1.432e+03 -3.268 0.001112 **
## BsmtFinType1BLQ 3.206e+03 2.932e+03 1.093 0.274468
## BsmtFinType1GLQ 8.215e+03 2.655e+03 3.094 0.002021 **
## BsmtFinType1LwQ -6.355e+03 3.923e+03 -1.620 0.105533
## BsmtFinType1Rec -8.277e+02 3.146e+03 -0.263 0.792545
## BsmtFinType1Unf 7.097e+03 3.078e+03 2.306 0.021296 *
## BsmtFinSF1 2.503e+01 2.881e+00 8.688 < 2e-16 ***
## BsmtFinType2BLQ -2.212e+04 8.023e+03 -2.757 0.005911 **
## BsmtFinType2GLQ -9.404e+03 9.929e+03 -0.947 0.343710
## BsmtFinType2LwQ -2.337e+04 7.811e+03 -2.992 0.002825 **
## BsmtFinType2Rec -1.940e+04 7.444e+03 -2.606 0.009265 **
## BsmtFinType2Unf -1.679e+04 7.987e+03 -2.102 0.035782 *
## BsmtFinSF2 1.238e+01 8.387e+00 1.476 0.140241
## LowQualFinSF -4.802e+01 1.867e+01 -2.572 0.010235 *
## MSSubClass -3.801e+01 8.845e+01 -0.430 0.667447
## MSZoningFV 3.771e+04 1.245e+04 3.029 0.002506 **
## MSZoningRH 3.644e+04 1.262e+04 2.888 0.003948 **
## MSZoningRL 3.660e+04 1.060e+04 3.453 0.000571 ***
## MSZoningRM 3.272e+04 9.821e+03 3.331 0.000889 ***
## LandContourHLS 1.970e+04 5.464e+03 3.606 0.000323 ***
## LandContourLow 9.540e+02 6.710e+03 0.142 0.886963
## LandContourLvl 7.663e+03 3.881e+03 1.975 0.048523 *
## Condition1Feedr -3.247e+02 5.357e+03 -0.061 0.951685
## Condition1Norm 7.750e+03 4.398e+03 1.762 0.078263 .
## Condition1PosA -2.067e+03 1.064e+04 -0.194 0.846005
## Condition1PosN 3.305e+03 7.875e+03 0.420 0.674798
## Condition1RRAe -1.944e+04 9.540e+03 -2.038 0.041781 *
## Condition1RRAn 3.592e+03 7.261e+03 0.495 0.620902
## Condition1RRNe -5.971e+03 1.907e+04 -0.313 0.754258
## Condition1RRNn -3.056e+03 1.336e+04 -0.229 0.819127
## Condition2Feedr 5.860e+03 2.372e+04 0.247 0.804938
## Condition2Norm 3.286e+03 2.040e+04 0.161 0.872047
## Condition2PosA 3.301e+04 3.894e+04 0.848 0.396764
## Condition2PosN -2.084e+05 2.852e+04 -7.308 4.75e-13 ***
## Condition2RRAe -2.506e+04 3.402e+04 -0.737 0.461506
## Condition2RRAn -6.333e+03 3.326e+04 -0.190 0.849016
## Condition2RRNn 1.278e+04 2.833e+04 0.451 0.651956
## BldgType2fmCon 3.039e+03 1.336e+04 0.228 0.820008
## BldgTypeDuplex -4.252e+03 7.599e+03 -0.559 0.575924
## BldgTypeTwnhs -2.913e+04 1.053e+04 -2.766 0.005750 **
## BldgTypeTwnhsE -2.117e+04 9.560e+03 -2.215 0.026939 *
## HouseStyle1.5Unf 1.669e+04 7.999e+03 2.086 0.037147 *
## HouseStyle1Story 7.289e+03 4.521e+03 1.612 0.107186
## HouseStyle2.5Fin -1.027e+04 1.264e+04 -0.813 0.416596
## HouseStyle2.5Unf 7.904e+03 9.427e+03 0.839 0.401900
## HouseStyle2Story -3.458e+03 3.666e+03 -0.943 0.345622
## HouseStyleSFoyer 9.796e+03 6.708e+03 1.460 0.144420
## HouseStyleSLvl 4.052e+03 5.605e+03 0.723 0.469802
## RoofMatlCompShg 6.273e+05 3.233e+04 19.401 < 2e-16 ***
## RoofMatlMembran 6.885e+05 4.293e+04 16.037 < 2e-16 ***
## RoofMatlMetal 6.721e+05 4.213e+04 15.952 < 2e-16 ***
## RoofMatlRoll 6.400e+05 4.239e+04 15.097 < 2e-16 ***
## RoofMatlTar&Grv 6.155e+05 3.309e+04 18.598 < 2e-16 ***
## RoofMatlWdShake 6.332e+05 3.483e+04 18.180 < 2e-16 ***
## RoofMatlWdShngl 7.132e+05 3.373e+04 21.142 < 2e-16 ***
## Exterior1stBrkComm -5.874e+04 3.472e+04 -1.692 0.090933 .
## Exterior1stBrkFace 1.479e+04 1.394e+04 1.061 0.289008
## Exterior1stCBlock -1.316e+04 2.924e+04 -0.450 0.652598
## Exterior1stCemntBd 7.059e+02 2.038e+04 0.035 0.972372
## Exterior1stHdBoard -3.280e+03 1.398e+04 -0.235 0.814609
## Exterior1stImStucc -4.748e+04 3.013e+04 -1.576 0.115258
## Exterior1stMetalSd 2.640e+03 1.571e+04 0.168 0.866575
## Exterior1stPlywood -7.526e+03 1.389e+04 -0.542 0.587946
## Exterior1stStone -2.638e+04 2.588e+04 -1.020 0.308088
## Exterior1stStucco 1.382e+03 1.531e+04 0.090 0.928093
## Exterior1stVinylSd -1.582e+04 1.439e+04 -1.100 0.271724
## Exterior1stWd Sdng -7.139e+03 1.349e+04 -0.529 0.596709
## Exterior1stWdShing -3.073e+03 1.457e+04 -0.211 0.833024
## Exterior2ndAsphShn 8.376e+03 2.291e+04 0.366 0.714681
## Exterior2ndBrk Cmn 8.643e+03 2.208e+04 0.391 0.695513
## Exterior2ndBrkFace 6.267e+03 1.459e+04 0.430 0.667536
## Exterior2ndCBlock NA NA NA NA
## Exterior2ndCmentBd 6.577e+03 2.023e+04 0.325 0.745149
## Exterior2ndHdBoard 5.028e+03 1.364e+04 0.369 0.712535
## Exterior2ndImStucc 2.595e+04 1.553e+04 1.671 0.094995 .
## Exterior2ndMetalSd 2.014e+03 1.548e+04 0.130 0.896463
## Exterior2ndOther 2.080e+04 2.973e+04 0.700 0.484287
## Exterior2ndPlywood 6.638e+03 1.335e+04 0.497 0.619169
## Exterior2ndStone -3.131e+04 2.152e+04 -1.455 0.145870
## Exterior2ndStucco 1.056e+04 1.463e+04 0.722 0.470465
## Exterior2ndVinylSd 2.133e+04 1.396e+04 1.528 0.126690
## Exterior2ndWd Sdng 1.264e+04 1.317e+04 0.960 0.337459
## Exterior2ndWd Shng 4.504e+03 1.366e+04 0.330 0.741644
## ExterQualFa -2.506e+04 1.116e+04 -2.246 0.024860 *
## ExterQualGd -3.018e+04 5.070e+03 -5.953 3.39e-09 ***
## ExterQualTA -3.407e+04 5.587e+03 -6.098 1.42e-09 ***
## ExterCondFa -1.483e+04 1.949e+04 -0.761 0.446769
## ExterCondGd -2.118e+04 1.839e+04 -1.152 0.249613
## ExterCondPo -4.037e+04 3.291e+04 -1.227 0.220169
## ExterCondTA -1.877e+04 1.837e+04 -1.022 0.307083
## BsmtQualFa -2.726e+04 6.461e+03 -4.219 2.62e-05 ***
## BsmtQualGd -2.723e+04 3.507e+03 -7.765 1.66e-14 ***
## BsmtQualTA -2.543e+04 4.338e+03 -5.861 5.83e-09 ***
## CentralAirY 4.252e+03 3.840e+03 1.107 0.268392
## X1stFlrSF 4.673e+00 5.846e+00 0.799 0.424299
## KitchenAbvGr -2.115e+04 5.813e+03 -3.638 0.000286 ***
## KitchenQualFa -2.764e+04 6.548e+03 -4.221 2.60e-05 ***
## KitchenQualGd -2.632e+04 3.663e+03 -7.186 1.13e-12 ***
## KitchenQualTA -2.677e+04 4.157e+03 -6.440 1.68e-10 ***
## PoolArea 8.589e+01 1.897e+01 4.528 6.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25360 on 1287 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.9076, Adjusted R-squared: 0.8979
## F-statistic: 94.29 on 134 and 1287 DF, p-value: < 2.2e-16
We will now apply our model to the test data set.
tst_data <- read.csv(file="test.csv",head=TRUE,sep=",")
result_data <- predict(my.lm_p, tst_data, interval="predict")
I have not been able to run this prediction. It fails with: Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor Foundation has new levels Slab
I have not been able to resolve this for all the categorical variables. I have checked both data sets and they have the same levels.
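One possible explanation, which has not been verified on this data set, is that the levels known to the fitted model differ from the levels present in the raw training data. lm() removes rows with missing values (38 observations here) and then drops factor levels that no longer occur in the remaining rows. If every row with a given level also has an NA in another predictor, that level disappears from the model. The toy example below reproduces this behaviour; the data frame, its column names and values are invented for illustration.

```r
# Toy data: every "Slab" row has a missing value in another predictor (q),
# so lm() drops those rows and, with them, the "Slab" level
train <- data.frame(
  y = c(10, 12, 11, 15, 14, 16, 9, 8),
  f = factor(c("A", "A", "A", "B", "B", "B", "Slab", "Slab")),
  q = c(1, 2, 3, 4, 5, 6, NA, NA)
)
fit <- lm(y ~ f + q, data = train)

levels(train$f)   # levels present in the raw data
fit$xlevels$f     # levels the fitted model actually knows about

# Predicting for a row with the dropped level raises an error,
# which is captured here as a message instead of stopping the script
new_rows <- data.frame(f = factor("Slab", levels = levels(train$f)), q = 2)
res <- tryCatch(predict(fit, new_rows), error = function(e) conditionMessage(e))
res
```

For our model, comparing `my.lm_p$xlevels$Foundation` with `levels(my_data$Foundation)` would show whether this is what happens with the Slab level.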
Kaggle user name: vbriot