Your final is due by the end of day on 12/16/2018. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.
Problem 1 Pick one of the quantitative independent variables (Xi) from the data set below, and define that variable as X. Also, pick one of the dependent variables (Yi) below, and define that as Y.
Let's pick Y1 as the dependent variable Y and X4 as the independent variable X.
p1data = read.csv("https://raw.githubusercontent.com/vindication09/DATA-605/master/data605_data.csv", header = TRUE)
p1<-subset(p1data, select=c(Y1, X4))
summary(p1)
## Y1 X4
## Min. :10.30 Min. :-1.000
## 1st Qu.:18.55 1st Qu.: 4.350
## Median :20.55 Median : 8.500
## Mean :20.29 Mean : 8.595
## 3rd Qu.:22.07 3rd Qu.:12.525
## Max. :28.70 Max. :19.900
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
Before we compute any probabilities, we need to count the observations that fall on either side of the required quartile cutoffs. These counts will also be used to fill in the table of counts later in this problem.
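For reference, a quick sketch of where those cutoff values come from (they match the summary above: 18.55 is the 1st quartile of Y1 and 12.525 is the 3rd quartile of X4):
# Sketch: the quartile cutoffs used in the subsets below
y_cut <- quantile(p1$Y1, 0.25)  # 1st quartile of Y1
x_cut <- quantile(p1$X4, 0.75)  # 3rd quartile of X4
c(y_cut, x_cut)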
Y3q_greater = subset(p1, Y1 > 18.55)
Y3q_lesser = subset(p1, Y1 <= 18.55)
X4q_greater_given_yq3_greater = nrow(subset(Y3q_greater, X4 > 12.525))
X4q_lesser_given_Y3q_greater = nrow(subset(Y3q_greater, X4 <= 12.525))
X4q_greater_given_Y3q_lesser = nrow(subset(Y3q_lesser, X4 > 12.525))
X4q_lesser_given_Y3q_lesser = nrow(subset(Y3q_lesser, X4 <= 12.525))
a. P(X>x | Y>y) - the probability that X exceeds its 3rd quartile, given that Y exceeds its 1st quartile.
nrow(subset(Y3q_greater, X4 > 12.525)) / nrow(Y3q_greater)
## [1] 0.2666667
Given that Y is above its 1st quartile, there is roughly a 27% chance that X is above its 3rd quartile.
b. P(X>x, Y>y) - the joint probability that X exceeds its 3rd quartile and Y exceeds its 1st quartile.
nrow(subset(Y3q_greater, X4 > 12.525)) / nrow(p1)
## [1] 0.2
20% of all observations have X above its 3rd quartile and Y above its 1st quartile.
c. P(X<x | Y>y) - the probability that X is below its 3rd quartile, given that Y exceeds its 1st quartile.
nrow(subset(Y3q_greater, X4 < 12.525)) / nrow(Y3q_greater)
## [1] 0.7333333
Given that Y is above its 1st quartile, there is roughly a 73% chance that X is below its 3rd quartile.
In addition, make a table of counts as shown below.
Let's take the counts we computed and put them into a data frame.
p2 = data.frame(
X4q_lesser_given_Y3q_lesser,
X4q_greater_given_Y3q_lesser,
X4q_lesser_given_Y3q_lesser + X4q_greater_given_Y3q_lesser,
X4q_lesser_given_Y3q_greater,
X4q_greater_given_yq3_greater,
X4q_greater_given_yq3_greater + X4q_lesser_given_Y3q_greater,
X4q_lesser_given_Y3q_lesser + X4q_lesser_given_Y3q_greater,
X4q_greater_given_Y3q_lesser + X4q_greater_given_yq3_greater,
nrow(p1))
p2 = matrix(p2, nrow = 3, ncol = 3)
p2 = data.frame(p2)
rownames(p2) <- c('<=3rd q for x', '>3rd q for x', 'Total')
colnames(p2) <- c('<=1st q for y', '>1st q for y', 'Total')
print(p2)
##               <=1st q for y >1st q for y Total
## <=3rd q for x             4           11    15
## >3rd q for x              1            4     5
## Total                     5           15    20
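The same table can be produced more directly with table() and addmargins(); a sketch using the same cutoffs:
# Sketch: contingency table of the quartile indicators with marginal totals
x_ind <- factor(p1$X4 > 12.525, levels = c(FALSE, TRUE), labels = c("<=3rd q for x", ">3rd q for x"))
y_ind <- factor(p1$Y1 > 18.55, levels = c(FALSE, TRUE), labels = c("<=1st q for y", ">1st q for y"))
addmargins(table(x_ind, y_ind))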
5 points. Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
x4q_greater = nrow(subset(p1, X4 > 4.350))
Y1q_greater = nrow(subset(p1, Y1 > 18.55))
X4q_greater_Y1q_greater = nrow(subset(p1, X4 > 4.350 & Y1 > 18.55))
prob_AintersectB = X4q_greater_Y1q_greater/nrow(p1)
probA = x4q_greater / nrow(p1)
ProbB = Y1q_greater / nrow(p1)
prob_AB = probA * ProbB
print(paste0(prob_AintersectB, " Not equal to ", prob_AB, " Hence not independent"))
## [1] "0.55 Not equal to 0.5625 Hence not independent"
Run a chi-squared test
X4q_greater = subset(p1, X4 > 4.350)
Y1q_greater = subset(p1, Y1 > 18.55)
chisq.test(X4q_greater$X4, Y1q_greater$Y1)
## Warning in chisq.test(X4q_greater$X4, Y1q_greater$Y1): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: X4q_greater$X4 and Y1q_greater$Y1
## X-squared = 155, df = 144, p-value = 0.251
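Note that cross-tabulating the raw continuous values, as above, is not quite what a test of association between the two indicator variables calls for. Here is a minimal sketch that runs the chi-square test on the 2 x 2 contingency table of A (X above its 1st quartile) and B (Y above its 1st quartile), using the same cutoffs as before; with only 20 observations, expect the same small-sample warning:
# Sketch: chi-square test of association between the two quartile indicators
A <- p1$X4 > 4.350
B <- p1$Y1 > 18.55
table(A, B)
chisq.test(table(A, B))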
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
Read in the data
train = read.csv("https://raw.githubusercontent.com/vindication09/DATA-605/master/train.csv", header = TRUE)
test= read.csv("https://raw.githubusercontent.com/vindication09/DATA-605/master/test.csv", header = TRUE)
head(train, 10)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## 7 7 20 RL 75 10084 Pave <NA> Reg
## 8 8 60 RL NA 10382 Pave <NA> IR1
## 9 9 50 RM 51 6120 Pave <NA> Reg
## 10 10 190 RL 50 7420 Pave <NA> Reg
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## 7 Lvl AllPub Inside Gtl Somerst Norm
## 8 Lvl AllPub Corner Gtl NWAmes PosN
## 9 Lvl AllPub Inside Gtl OldTown Artery
## 10 Lvl AllPub Corner Gtl BrkSide Artery
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## 7 Norm 1Fam 1Story 8 5 2004
## 8 Norm 1Fam 2Story 7 6 1973
## 9 Norm 1Fam 1.5Fin 7 5 1931
## 10 Artery 2fmCon 1.5Unf 5 6 1939
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## 7 2005 Gable CompShg VinylSd VinylSd Stone
## 8 1973 Gable CompShg HdBoard HdBoard Stone
## 9 1950 Gable CompShg BrkFace Wd Shng None
## 10 1950 Gable CompShg MetalSd MetalSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond
## 1 196 Gd TA PConc Gd TA
## 2 0 TA TA CBlock Gd TA
## 3 162 Gd TA PConc Gd TA
## 4 0 TA TA BrkTil TA Gd
## 5 350 Gd TA PConc Gd TA
## 6 0 TA TA Wood Gd TA
## 7 186 Gd TA PConc Ex TA
## 8 240 TA TA CBlock Gd TA
## 9 0 TA TA BrkTil TA TA
## 10 0 TA TA BrkTil TA TA
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF
## 1 No GLQ 706 Unf 0 150
## 2 Gd ALQ 978 Unf 0 284
## 3 Mn GLQ 486 Unf 0 434
## 4 No ALQ 216 Unf 0 540
## 5 Av GLQ 655 Unf 0 490
## 6 No GLQ 732 Unf 0 64
## 7 Av GLQ 1369 Unf 0 317
## 8 Mn ALQ 859 BLQ 32 216
## 9 No Unf 0 Unf 0 952
## 10 No GLQ 851 Unf 0 140
## TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## 1 856 GasA Ex Y SBrkr 856 854
## 2 1262 GasA Ex Y SBrkr 1262 0
## 3 920 GasA Ex Y SBrkr 920 866
## 4 756 GasA Gd Y SBrkr 961 756
## 5 1145 GasA Ex Y SBrkr 1145 1053
## 6 796 GasA Ex Y SBrkr 796 566
## 7 1686 GasA Ex Y SBrkr 1694 0
## 8 1107 GasA Ex Y SBrkr 1107 983
## 9 952 GasA Gd Y FuseF 1022 752
## 10 991 GasA Ex Y SBrkr 1077 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
## 1 0 1710 1 0 2 1
## 2 0 1262 0 1 2 0
## 3 0 1786 1 0 2 1
## 4 0 1717 1 0 1 0
## 5 0 2198 1 0 2 1
## 6 0 1362 1 0 1 1
## 7 0 1694 1 0 2 0
## 8 0 2090 1 0 2 1
## 9 0 1774 0 0 2 0
## 10 0 1077 1 0 1 0
## BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 3 1 Gd 8 Typ
## 2 3 1 TA 6 Typ
## 3 3 1 Gd 6 Typ
## 4 3 1 Gd 7 Typ
## 5 4 1 Gd 9 Typ
## 6 1 1 TA 5 Typ
## 7 3 1 Gd 7 Typ
## 8 3 1 TA 7 Typ
## 9 2 2 TA 8 Min1
## 10 2 2 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## 7 1 Gd Attchd 2004 RFn 2
## 8 2 TA Attchd 1973 RFn 2
## 9 2 TA Detchd 1931 Unf 2
## 10 2 TA Attchd 1939 RFn 1
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## 7 636 TA TA Y 255 57
## 8 484 TA TA Y 235 204
## 9 468 Fa TA Y 90 0
## 10 205 Gd TA Y 0 4
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## 7 0 0 0 0 <NA> <NA> <NA>
## 8 228 0 0 0 <NA> <NA> Shed
## 9 205 0 0 0 <NA> <NA> <NA>
## 10 0 0 0 0 <NA> <NA> <NA>
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
## 7 0 8 2007 WD Normal 307000
## 8 350 11 2009 WD Normal 200000
## 9 0 4 2008 WD Abnorml 129900
## 10 0 1 2008 WD Normal 118000
Basic EDA: view the data types for each of the variables.
str(train)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
## $ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
## $ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
## $ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
## $ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
## $ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
## $ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
## $ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
## $ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
## $ ExterCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
## $ BsmtQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
## $ BsmtCond : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
## $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
## $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ HeatingQC : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
## $ GarageType : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
## $ GarageCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
## $ MiscFeature : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
It seems we have many variables in this dataset, many of them factors with multiple levels. I am almost certain that we do not need all, or even most, of these variables; we can systematically determine which variables to keep.
Check missing data
colSums(sapply(train, is.na))
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 0 259 0
## Street Alley LotShape LandContour Utilities
## 0 1369 0 0 0
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 0 0
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 8 8 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 37 37 38 37 0
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 38 0 0 0 0
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## 0 0 1 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 0 0 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 0 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 0 0 690 81 81
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 81 0 0 81 81
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 1453 1179 1406
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 0 0
## SalePrice
## 0
Right off the bat, PoolQC, MiscFeature, Alley, and Fence can be eliminated. There are only 1460 rows, and each of these variables is missing more than half of its values.
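A minimal sketch of that cleanup step, using a 50 percent missingness threshold (the threshold is an assumption; the flagged columns follow from the counts above):
# Sketch: flag and drop any column missing more than half of its values
missing_share <- colMeans(is.na(train))
names(train)[missing_share > 0.5]  # Alley, PoolQC, Fence, MiscFeature
train_reduced <- train[, missing_share <= 0.5]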
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Descriptives
nums <- unlist(lapply(train, is.numeric))
trainb<-train[ , nums]
trainc<-subset(trainb, select=-c(Id))
summary(trainc)
## MSSubClass LotFrontage LotArea OverallQual
## Min. : 20.0 Min. : 21.00 Min. : 1300 Min. : 1.000
## 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554 1st Qu.: 5.000
## Median : 50.0 Median : 69.00 Median : 9478 Median : 6.000
## Mean : 56.9 Mean : 70.05 Mean : 10517 Mean : 6.099
## 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602 3rd Qu.: 7.000
## Max. :190.0 Max. :313.00 Max. :215245 Max. :10.000
## NA's :259
## OverallCond YearBuilt YearRemodAdd MasVnrArea
## Min. :1.000 Min. :1872 Min. :1950 Min. : 0.0
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 1st Qu.: 0.0
## Median :5.000 Median :1973 Median :1994 Median : 0.0
## Mean :5.575 Mean :1971 Mean :1985 Mean : 103.7
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 3rd Qu.: 166.0
## Max. :9.000 Max. :2010 Max. :2010 Max. :1600.0
## NA's :8
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8
## Median : 383.5 Median : 0.00 Median : 477.5 Median : 991.5
## Mean : 443.6 Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :5644.0 Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## Min. : 334 Min. : 0 Min. : 0.000 Min. : 334
## 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130
## Median :1087 Median : 0 Median : 0.000 Median :1464
## Mean :1163 Mean : 347 Mean : 5.845 Mean :1515
## 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777
## Max. :4692 Max. :2065 Max. :572.000 Max. :5642
##
## BsmtFullBath BsmtHalfBath FullBath HalfBath
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000
## Mean :0.4253 Mean :0.05753 Mean :1.565 Mean :0.3829
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.00000 Max. :3.000 Max. :2.0000
##
## BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## Min. :0.000 Min. :0.000 Min. : 2.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.:0.000
## Median :3.000 Median :1.000 Median : 6.000 Median :1.000
## Mean :2.866 Mean :1.047 Mean : 6.518 Mean :0.613
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000 3rd Qu.:1.000
## Max. :8.000 Max. :3.000 Max. :14.000 Max. :3.000
##
## GarageYrBlt GarageCars GarageArea WoodDeckSF
## Min. :1900 Min. :0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5 1st Qu.: 0.00
## Median :1980 Median :2.000 Median : 480.0 Median : 0.00
## Mean :1979 Mean :1.767 Mean : 473.0 Mean : 94.24
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 3rd Qu.:168.00
## Max. :2010 Max. :4.000 Max. :1418.0 Max. :857.00
## NA's :81
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 25.00 Median : 0.00 Median : 0.00 Median : 0.00
## Mean : 46.66 Mean : 21.95 Mean : 3.41 Mean : 15.06
## 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :547.00 Max. :552.00 Max. :508.00 Max. :480.00
##
## PoolArea MiscVal MoSold YrSold
## Min. : 0.000 Min. : 0.00 Min. : 1.000 Min. :2006
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007
## Median : 0.000 Median : 0.00 Median : 6.000 Median :2008
## Mean : 2.759 Mean : 43.49 Mean : 6.322 Mean :2008
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :738.000 Max. :15500.00 Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
Data documentation: http://jse.amstat.org/v19n3/decock.pdf
Plots
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 3.4.4
plot_missing(trainc)[1]
## feature
## 1 MSSubClass
## 2 LotFrontage
## 3 LotArea
## 4 OverallQual
## 5 OverallCond
## 6 YearBuilt
## 7 YearRemodAdd
## 8 MasVnrArea
## 9 BsmtFinSF1
## 10 BsmtFinSF2
## 11 BsmtUnfSF
## 12 TotalBsmtSF
## 13 X1stFlrSF
## 14 X2ndFlrSF
## 15 LowQualFinSF
## 16 GrLivArea
## 17 BsmtFullBath
## 18 BsmtHalfBath
## 19 FullBath
## 20 HalfBath
## 21 BedroomAbvGr
## 22 KitchenAbvGr
## 23 TotRmsAbvGrd
## 24 Fireplaces
## 25 GarageYrBlt
## 26 GarageCars
## 27 GarageArea
## 28 WoodDeckSF
## 29 OpenPorchSF
## 30 EnclosedPorch
## 31 X3SsnPorch
## 32 ScreenPorch
## 33 PoolArea
## 34 MiscVal
## 35 MoSold
## 36 YrSold
## 37 SalePrice
All of the selected numerical variables have less than 20% missing data. I would try to impute the missing values with the mean or mode, but that effort is probably outside the scope of DATA 605 and better suited to DATA 621.
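For reference, a minimal sketch of simple mean imputation on the numeric columns (not applied in the rest of this analysis, per the note above):
# Sketch: fill missing values in each numeric column with that column's mean
train_imputed <- trainc
for (col in names(train_imputed)) {
  idx <- is.na(train_imputed[[col]])
  train_imputed[[col]][idx] <- mean(train_imputed[[col]], na.rm = TRUE)
}
sum(is.na(train_imputed))  # should now be zero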
plot_histogram(trainc)
Full correlation matrix with all numerical variables
correlation_matrix <- round(cor(trainc),2)
# Get lower triangle of the correlation matrix
get_lower_tri<-function(correlation_matrix){
correlation_matrix[upper.tri(correlation_matrix)] <- NA
return(correlation_matrix)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(correlation_matrix){
correlation_matrix[lower.tri(correlation_matrix)]<- NA
return(correlation_matrix)
}
upper_tri <- get_upper_tri(correlation_matrix)
library(reshape2)
# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 15, hjust = 1))+
coord_fixed()
#add nice labels
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x=element_text(size=rel(0.8), angle=90),
axis.text.y=element_text(size=rel(0.8)),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 2,
title.position = "top", title.hjust = 0.5))
The numbers are hard to read, but we can use the heat map to gauge the correlation level for each pair of variables. We will reduce this to a handful of variables a little later on.
Let's produce scatterplots of two of the predictor variables against the response, overlaying a fitted regression line and the correlation coefficient to show the relationship between each predictor and the response.
library(ggpubr)
## Loading required package: magrittr
ggscatter(trainc, x = "GrLivArea", y = "SalePrice",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Above Grade Living Area", ylab = "Sales Price");
ggscatter(trainc, x = "LotArea", y = "SalePrice",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Lot Area", ylab = "Sales Price")
Derive a correlation matrix for any THREE variables. Let's pick two of the predictors (X1stFlrSF and LotArea) along with the response variable.
corr_data<-subset(trainc,select=c("X1stFlrSF","LotArea", "SalePrice"))
correlation_matrix <- round(cor(corr_data),2)
# Get lower triangle of the correlation matrix
get_lower_tri<-function(correlation_matrix){
correlation_matrix[upper.tri(correlation_matrix)] <- NA
return(correlation_matrix)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(correlation_matrix){
correlation_matrix[lower.tri(correlation_matrix)]<- NA
return(correlation_matrix)
}
upper_tri <- get_upper_tri(correlation_matrix)
library(reshape2)
# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 15, hjust = 1))+
coord_fixed()
#add nice labels
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x=element_text(size=rel(0.8), angle=90),
axis.text.y=element_text(size=rel(0.8)),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 80% confidence interval.
cor.test(corr_data$X1stFlrSF, corr_data$SalePrice, method = c("pearson", "kendall", "spearman"), conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: corr_data$X1stFlrSF and corr_data$SalePrice
## t = 29.078, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5841687 0.6266715
## sample estimates:
## cor
## 0.6058522
cor.test(corr_data$LotArea, corr_data$SalePrice, method = c("pearson", "kendall", "spearman"), conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: corr_data$LotArea and corr_data$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
cor.test(corr_data$X1stFlrSF, corr_data$LotArea, method = c("pearson", "kendall", "spearman"), conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: corr_data$X1stFlrSF and corr_data$LotArea
## t = 11.985, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2686127 0.3297222
## sample estimates:
## cor
## 0.2994746
In all three instances we have generated an 80 percent confidence interval, and in each case the p-value is extremely small. For all three tests we can therefore reject the null hypothesis and conclude that the true correlation between the selected variables is not 0.
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
What is familywise error? It is the probability of making at least one Type I error (a false rejection of a true null hypothesis) when performing several hypothesis tests on the same data. With three correlation tests each run at alpha = 0.05, the familywise error rate is 1 - (1 - 0.05)^3, about 14 percent, so some concern is reasonable. In this case, however, all three p-values are far below even a Bonferroni-adjusted threshold of 0.05/3, so the conclusions are unchanged.
n = 3
alpha = 0.05
fwer = 1 - (1 - alpha)^n
print(paste0("Familywise error rate is ", fwer))
## [1] "Familywise error rate is 0.142625"
5 points. Linear Algebra and Correlation. Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Correlation matrix
print(correlation_matrix)
## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.00 0.30 0.61
## LotArea 0.30 1.00 0.26
## SalePrice 0.61 0.26 1.00
Invert correlation matrix
require(Matrix)
## Loading required package: Matrix
my_mat <- solve(correlation_matrix)
print(my_mat)
## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.6489230 -0.2500619 -0.9408269
## LotArea -0.2500619 1.1104234 -0.1361723
## SalePrice -0.9408269 -0.1361723 1.6093092
Multiply the correlation matrix by the precision matrix
p_mat <- correlation_matrix%*%my_mat
print(p_mat)
## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.000000e+00 1.387779e-17 0.000000e+00
## LotArea -2.775558e-17 1.000000e+00 -5.551115e-17
## SalePrice -1.110223e-16 0.000000e+00 1.000000e+00
Multiply the result by the correlation matrix. Since the previous product is the identity, this recovers the original correlation matrix.
x_mat <- p_mat%*%correlation_matrix
print("We have derived our original correlation matrix")
## [1] "We have derived our original correlation matrix"
print( x_mat)
## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.00 0.30 0.61
## LotArea 0.30 1.00 0.26
## SalePrice 0.61 0.26 1.00
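The prompt also asks for the precision matrix multiplied by the correlation matrix directly; a quick sketch confirms that this product is (numerically) the 3 x 3 identity matrix as well:
# Sketch: precision matrix times correlation matrix, rounded to suppress floating point noise
round(my_mat %*% correlation_matrix, 10)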
Conduct LU decomposition on the matrix.
lu_mat<-lu(correlation_matrix)
lu_mat2<-expand(lu_mat)
print(lu_mat2$L %*% lu_mat2$U)
## 3 x 3 Matrix of class "dgeMatrix"
## [,1] [,2] [,3]
## [1,] 1.00 0.30 0.61
## [2,] 0.30 1.00 0.26
## [3,] 0.61 0.26 1.00
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
Gr Living Area is a variable with a right skew
plot_histogram(trainc$GrLivArea);
summary(trainc$GrLivArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
The minimum of GrLivArea is 334, which is already above zero, so we do not need to shift the variable.
Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable.
library(MASS)
dist<-fitdistr(trainc$GrLivArea, densfun = 'exponential')
lambda <- dist$estimate
exp_distribution <- rexp(1000, lambda)
summary(exp_distribution);
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.69 423.94 1050.27 1530.28 2152.03 11314.29
plot_histogram(trainc$GrLivArea);
hist(exp_distribution, main = "Simulated Above Grade Living Area", xlab="", col = "blue")
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
quantile(exp_distribution, c(.05, .95))
## 5% 95%
## 82.85466 4385.03358
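Strictly speaking, these percentiles can also be taken from the fitted exponential distribution itself through its inverse CDF, rather than from the simulated sample; a sketch, which lands near 78 and 4540:
# Sketch: theoretical 5th and 95th percentiles of the fitted exponential,
# using the rate (lambda) estimated by fitdistr above
qexp(c(0.05, 0.95), rate = lambda)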
Also generate a 95% confidence interval from the empirical data, assuming normality.
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
CI(trainc$GrLivArea, ci=0.95)
## upper mean lower
## 1542.440 1515.464 1488.487
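For transparency, a sketch of the same interval computed by hand under the normality assumption (using the standard normal quantile; the result should essentially match the Rmisc output above):
# Sketch: 95% confidence interval for the mean of GrLivArea, assuming normality
xbar <- mean(trainc$GrLivArea)
s <- sd(trainc$GrLivArea)
n <- length(trainc$GrLivArea)
xbar + c(-1, 1) * qnorm(0.975) * s / sqrt(n)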
Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
quantile(trainc$GrLivArea, c(.05, .95))
## 5% 95%
## 848.0 2466.1
The exponential fit is a poor match for this variable. Its 5th percentile (roughly 80) sits far below the empirical 5th percentile of 848, and its 95th percentile (roughly 4400-4500) sits well above the empirical 2466. The exponential distribution puts too much mass near zero and has too heavy a right tail; GrLivArea is right-skewed, but not exponentially distributed, so the empirical percentiles and the normal-theory interval for the mean describe the data better than the fitted exponential.
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
When it comes to modeling, there are numerous approaches, some more involved than others. I could use PCA to reduce the dimension of the data, or create multiple dummy variables at the cost of degrees of freedom. For the sake of this course, I will keep the model simple and limit it to variables that have a decent correlation with SalePrice.
Let's see how close our response variable is to a normal distribution.
ggqqplot(trainc$SalePrice);
x <- trainc$SalePrice
h<-hist(x, breaks=10, col="red", xlab="Sales Price",
main="Histogram with Normal Curve")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
The response variable is somewhat right-skewed, but it is not obvious that a major transformation is required; we will confirm with model diagnostics.
We examined a heat map based on the correlation matrix, and we can use that to our advantage: we can systematically identify which predictors have strong correlations with the response variable.
cor(trainc[-37], trainc$SalePrice)
## [,1]
## MSSubClass -0.08428414
## LotFrontage NA
## LotArea 0.26384335
## OverallQual 0.79098160
## OverallCond -0.07785589
## YearBuilt 0.52289733
## YearRemodAdd 0.50710097
## MasVnrArea NA
## BsmtFinSF1 0.38641981
## BsmtFinSF2 -0.01137812
## BsmtUnfSF 0.21447911
## TotalBsmtSF 0.61358055
## X1stFlrSF 0.60585218
## X2ndFlrSF 0.31933380
## LowQualFinSF -0.02560613
## GrLivArea 0.70862448
## BsmtFullBath 0.22712223
## BsmtHalfBath -0.01684415
## FullBath 0.56066376
## HalfBath 0.28410768
## BedroomAbvGr 0.16821315
## KitchenAbvGr -0.13590737
## TotRmsAbvGrd 0.53372316
## Fireplaces 0.46692884
## GarageYrBlt NA
## GarageCars 0.64040920
## GarageArea 0.62343144
## WoodDeckSF 0.32441344
## OpenPorchSF 0.31585623
## EnclosedPorch -0.12857796
## X3SsnPorch 0.04458367
## ScreenPorch 0.11144657
## PoolArea 0.09240355
## MiscVal -0.02118958
## MoSold 0.04643225
## YrSold -0.02892259
We will take variables with strong positive correlations greater than .5.
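A sketch of pulling those predictor names programmatically from the correlation vector computed above (the 0.5 cutoff mirrors the statement above):
# Sketch: numeric predictors whose correlation with SalePrice exceeds 0.5
cors <- cor(trainc[-37], trainc$SalePrice)
rownames(cors)[which(cors[, 1] > 0.5)]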
mod <- lm(SalePrice~GarageArea+GarageCars+TotRmsAbvGrd+FullBath+GrLivArea+X1stFlrSF+TotalBsmtSF+YearRemodAdd+YearBuilt+OverallQual, data=trainc)
summary(mod)
##
## Call:
## lm(formula = SalePrice ~ GarageArea + GarageCars + TotRmsAbvGrd +
## FullBath + GrLivArea + X1stFlrSF + TotalBsmtSF + YearRemodAdd +
## YearBuilt + OverallQual, data = trainc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -489958 -19316 -1948 16020 290558
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.186e+06 1.291e+05 -9.187 < 2e-16 ***
## GarageArea 1.495e+01 1.031e+01 1.450 0.147384
## GarageCars 1.042e+04 3.044e+03 3.422 0.000639 ***
## TotRmsAbvGrd 3.310e+01 1.119e+03 0.030 0.976404
## FullBath -6.791e+03 2.682e+03 -2.532 0.011457 *
## GrLivArea 5.130e+01 4.233e+00 12.119 < 2e-16 ***
## X1stFlrSF 1.417e+01 4.930e+00 2.875 0.004097 **
## TotalBsmtSF 1.986e+01 4.295e+00 4.625 4.09e-06 ***
## YearRemodAdd 2.965e+02 6.363e+01 4.659 3.47e-06 ***
## YearBuilt 2.682e+02 5.035e+01 5.328 1.15e-07 ***
## OverallQual 1.960e+04 1.190e+03 16.472 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37920 on 1449 degrees of freedom
## Multiple R-squared: 0.7737, Adjusted R-squared: 0.7721
## F-statistic: 495.4 on 10 and 1449 DF, p-value: < 2.2e-16
GarageArea and TotRmsAbvGrd do not appear to be significant. We have an adjusted R-squared of .77, meaning about 77% of the variability in SalePrice is accounted for by the model.
Remove the non-significant predictors and refit the model.
mod <- lm(SalePrice~GarageCars+FullBath+GrLivArea+X1stFlrSF+TotalBsmtSF+YearRemodAdd+YearBuilt+OverallQual, data=trainc)
summary(mod)
##
## Call:
## lm(formula = SalePrice ~ GarageCars + FullBath + GrLivArea +
## X1stFlrSF + TotalBsmtSF + YearRemodAdd + YearBuilt + OverallQual,
## data = trainc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -482525 -19191 -1801 16208 289639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.188e+06 1.284e+05 -9.255 < 2e-16 ***
## GarageCars 1.395e+04 1.817e+03 7.680 2.92e-14 ***
## FullBath -7.184e+03 2.644e+03 -2.717 0.00666 **
## GrLivArea 5.177e+01 3.097e+00 16.714 < 2e-16 ***
## X1stFlrSF 1.465e+01 4.919e+00 2.979 0.00294 **
## TotalBsmtSF 2.039e+01 4.269e+00 4.775 1.98e-06 ***
## YearRemodAdd 2.957e+02 6.362e+01 4.649 3.64e-06 ***
## YearBuilt 2.699e+02 5.017e+01 5.380 8.69e-08 ***
## OverallQual 1.959e+04 1.188e+03 16.486 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37920 on 1451 degrees of freedom
## Multiple R-squared: 0.7734, Adjusted R-squared: 0.7721
## F-statistic: 618.9 on 8 and 1451 DF, p-value: < 2.2e-16
We retain an almost identical adjusted R-squared with fewer predictors.
Diagnostics
hist(mod$residuals);
qqnorm(mod$residuals);
qqline(mod$residuals)
The residuals appear roughly normally distributed. We still need to check for constant variance.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
##
## cement
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(mod)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------------
## Response : SalePrice
## Variables: fitted values of SalePrice
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 2921.5080
## Prob > Chi2 = 0.0000
The small p-value in the Breusch-Pagan test for heteroskedasticity is strong evidence against the null hypothesis of constant variance, so the constant-variance assumption is not met. Let us also check visually.
plot(fitted(mod), residuals(mod), xlab="fitted", ylab="residuals")
abline(h=0);
plot(fitted(mod), sqrt(abs(residuals(mod))), xlab="fitted", ylab=expression(sqrt(hat(epsilon))))
If we look at the residuals, we can see a parabolic shape, indicating that some transform of the response variable is needed. If we look at the square root of the absolute residuals, the parabolic pattern becomes even more prominent.
We can apply the Box-Cox transform. This workflow is described in detail in Julian Faraway's Linear Models with R.
library(MASS)
boxcox(mod, plotit=T, lambda=seq(0, 0.2, by=0.01))
According to the Box-Cox plot, the maximum log-likelihood occurs around -2700, with the confidence interval for lambda spanning roughly (0.07, 0.17). Taking a value near the center of that interval, our power transform will be 0.13.
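A sketch of extracting that value programmatically instead of reading it off the plot (re-running boxcox on the fitted model from above without plotting):
# Sketch: lambda that maximizes the Box-Cox profile log-likelihood
bc <- boxcox(mod, lambda = seq(0, 0.2, by = 0.01), plotit = FALSE)
bc$x[which.max(bc$y)]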
traind<-trainc
mod2 <- lm(SalePrice^(0.13)~GarageCars+FullBath+GrLivArea+X1stFlrSF+TotalBsmtSF+YearRemodAdd+YearBuilt+OverallQual, data=trainc)
summary(mod2)
##
## Call:
## lm(formula = SalePrice^(0.13) ~ GarageCars + FullBath + GrLivArea +
## X1stFlrSF + TotalBsmtSF + YearRemodAdd + YearBuilt + OverallQual,
## data = trainc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.38745 -0.04679 0.00302 0.05812 0.34836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.760e-01 3.544e-01 -2.471 0.0136 *
## GarageCars 5.209e-02 5.016e-03 10.383 < 2e-16 ***
## FullBath -1.291e-02 7.301e-03 -1.769 0.0771 .
## GrLivArea 1.500e-04 8.552e-06 17.542 < 2e-16 ***
## X1stFlrSF 3.827e-05 1.358e-05 2.818 0.0049 **
## TotalBsmtSF 5.718e-05 1.179e-05 4.850 1.36e-06 ***
## YearRemodAdd 1.331e-03 1.757e-04 7.579 6.19e-14 ***
## YearBuilt 1.139e-03 1.385e-04 8.219 4.50e-16 ***
## OverallQual 5.982e-02 3.281e-03 18.234 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1047 on 1451 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.8236
## F-statistic: 852.4 on 8 and 1451 DF, p-value: < 2.2e-16
The adjusted R-squared has increased from .77 to .82. (The residual standard error is now on the transformed scale, so it is not directly comparable to the earlier model.)
hist(mod2$residuals);
qqnorm(mod2$residuals);
qqline(mod2$residuals)
A slight skew has been introduced into the residuals, but it does not appear to be severe. Let's visually check for constant variance.
plot(fitted(mod2), resid(mod2), col = "dodgerblue",
pch = 20, cex = 1.5, xlab = "Fitted", ylab = "Residuals")
abline(h = 0, lty = 2, col = "darkorange", lwd = 2)
Let's examine outliers at a high level.
library(car)
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4
outlierTest(mod2)
## rstudent unadjusted p-value Bonferonni p
## 1299 -15.538912 1.7562e-50 2.5640e-47
## 524 -9.634839 2.4441e-21 3.5683e-18
## 633 -4.748411 2.2531e-06 3.2895e-03
## 31 -4.617930 4.2190e-06 6.1597e-03
## 1325 -4.171794 3.2015e-05 4.6742e-02
We have identified several outliers; the very small Bonferroni-adjusted p-values (especially for observations 1299 and 524) indicate that these points are significant outliers and would be worth investigating before finalizing the model.
Can we reduce our predictors even further by using variance inflation factors?
vif(mod2)
## GarageCars FullBath GrLivArea X1stFlrSF TotalBsmtSF
## 1.869688 2.152343 2.687048 3.668313 3.558638
## YearRemodAdd YearBuilt OverallQual
## 1.750023 2.329177 2.738752
The VIF values indicate that we do not need to remove additional predictors, since no VIF is unusually large (all are well below the common rule-of-thumb cutoff of 5).
Before we conclude modeling, let's examine all possible subsets of predictors and compare their performance on criteria such as adjusted R-squared and Mallows' Cp.
k<-ols_step_all_possible(mod2)
plot(k)
At a high level, using all 8 predictors yields the best adjusted R-squared and the lowest AIC.
Feature selection at a more detailed level
h<-ols_step_best_subset(mod2)
h
## Best Subsets Regression
## -----------------------------------------------------------------------------------------------------
## Model Index Predictors
## -----------------------------------------------------------------------------------------------------
## 1 OverallQual
## 2 GrLivArea OverallQual
## 3 GrLivArea YearBuilt OverallQual
## 4 GarageCars GrLivArea YearBuilt OverallQual
## 5 GarageCars GrLivArea TotalBsmtSF YearBuilt OverallQual
## 6 GarageCars GrLivArea TotalBsmtSF YearRemodAdd YearBuilt OverallQual
## 7 GarageCars GrLivArea X1stFlrSF TotalBsmtSF YearRemodAdd YearBuilt OverallQual
## 8 GarageCars FullBath GrLivArea X1stFlrSF TotalBsmtSF YearRemodAdd YearBuilt OverallQual
## -----------------------------------------------------------------------------------------------------
##
## Subsets Regression Summary
## ----------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## ----------------------------------------------------------------------------------------------------------------------------------------
## 1 0.6714 0.6712 0.6704 1261.4740 -1532.4749 -5678.0546 -1516.6163 0.0205 0.0205 0.0000 0.3295
## 2 0.7464 0.7461 0.7435 643.2617 -1908.7174 -6054.0257 -1887.5727 0.0158 0.0158 0.0000 0.2546
## 3 0.7871 0.7867 0.7839 308.7680 -2162.0458 -6306.7712 -2135.6149 0.0133 0.0133 0.0000 0.2141
## 4 0.8037 0.8032 0.8003 173.3529 -2278.6801 -6423.0072 -2246.9630 0.0123 0.0123 0.0000 0.1976
## 5 0.8166 0.8160 0.8073 68.6192 -2375.9748 -6519.7261 -2338.9715 0.0115 0.0115 0.0000 0.1849
## 6 0.8233 0.8225 0.8137 15.6517 -2427.8729 -6571.1893 -2385.5834 0.0111 0.0111 0.0000 0.1784
## 7 0.8242 0.8233 0.8144 10.1290 -2433.4066 -6576.6425 -2385.8308 0.0110 0.0110 0.0000 0.1778
## 8 0.8246 0.8236 0.8134 9.0000 -2434.5516 -6577.7405 -2381.6897 0.0110 0.0110 0.0000 0.1776
## ----------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
Models 4 through 8 are fairly close in adjusted R-squared, but removing any more predictors produces a sharp drop-off.
plot(h)
It would be easy to keep searching for ways to optimize the model; I would even go as far as saying a GLM should be considered here, but that is outside the scope of the class.
Let's apply the model to the test data and make some predictions.
test_results <- predict(mod2, test)
prediction <- data.frame(Id = test[,"Id"], SalePrice = test_results)
prediction[prediction<0] <- 0
prediction <- replace(prediction,is.na(prediction),0)
prediction$SalePrice <- prediction$SalePrice^(1/.13)
head(test_results)
## 1 2 3 4 5 6
## 4.525119 4.684621 4.768366 4.825383 4.915384 4.806456
write.csv(prediction, "salepricepredictions.csv")
Kaggle results: rank 4558, posted under the user name Vinicio Haro; 1 submission, scored at 0.47026.