You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. Do the following:
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(RCurl)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(e1071)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(scales)))
suppressWarnings(suppressMessages(library(cowplot)))
suppressWarnings(suppressMessages(library(corrplot)))
suppressWarnings(suppressMessages(library(caret)))
#suppressWarnings(suppressMessages(library(MASS)))
suppressWarnings(suppressMessages(library(Rmisc)))
suppressWarnings(suppressMessages(library(FactoMineR)))
suppressWarnings(suppressMessages(library(factoextra)))
Review the variables in the training dataset:
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : chr "60" "20" "60" "70" ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
Determine the number of NA values in the numeric variables.
colSums(sapply(train[numeric_var], is.na))
## Id LotFrontage LotArea OverallQual OverallCond
## 0 259 0 0 0
## YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2
## 0 0 8 0 0
## BsmtUnfSF TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF
## 0 0 0 0 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
## 0 0 0 0 0
## BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt
## 0 0 0 0 81
## GarageCars GarageArea WoodDeckSF OpenPorchSF EnclosedPorch
## 0 0 0 0 0
## X3SsnPorch ScreenPorch PoolArea MiscVal MoSold
## 0 0 0 0 0
## YrSold SalePrice TotalPorchSF
## 0 0 0
TotalBsmtSF is selected as the X variable and SalePrice as the Y variable. TotalBsmtSF is right-skewed: its mean is greater than its median.
summary(train$TotalBsmtSF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 795.8 991.5 1057.0 1298.0 6110.0
skewness(train$TotalBsmtSF)
## [1] 1.521124
\(P(X>x | Y>y)\)
\(P(X>x, Y>y)\)
\(P(X<x | Y>y)\)
Formula for Conditional Probability
\(p(x|y) = p(x,y)/p(y)\)
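As a minimal sketch of how this formula maps onto data, the events can be represented as logical vectors and the probabilities taken as their means. The vectors `x` and `y` below are hypothetical toy values, not columns of train.csv.

```r
# Toy data (hypothetical values, not taken from train.csv)
x <- c(10, 20, 30, 40, 50, 60, 70, 80)
y <- c(1, 3, 2, 5, 4, 7, 6, 8)

A <- x > quantile(x, 0.75)  # event X > x (above the 3rd quartile)
B <- y > quantile(y, 0.50)  # event Y > y (above the 2nd quartile)

p_joint <- mean(A & B)      # P(X > x, Y > y): both hold for the same row
p_B     <- mean(B)          # P(Y > y)

p_A_given_B    <- p_joint / p_B        # P(X > x | Y > y)
p_notA_given_B <- mean(!A & B) / p_B   # P(X <= x | Y > y)
```

The two conditional probabilities sum to 1 because, given B, an observation is either in A or not in A.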
Define the 3rd quartile for TotalBsmtSF (X) and the 2nd quartile for SalePrice (Y)
# TotalBsmtSF
(xQ3 <- quantile(train$TotalBsmtSF, 0.75))
## 75%
## 1298.25
# SalePrice
(yQ2 <- quantile(train$SalePrice, 0.5))
## 50%
## 163000
numerator <- filter(train, SalePrice > yQ2 & TotalBsmtSF > xQ3) %>% tally()/nrow(train)
denominator <- filter(train, SalePrice > yQ2) %>% tally()/nrow(train)
(a <- numerator/denominator)
## n
## 1 0.4519231
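# joint probability: both conditions must hold for the same house (329 of 1460)
# (the product P(X>x) * P(Y>y) = 0.1246575 would equal this only under independence)
(b <- filter(train, TotalBsmtSF > xQ3 & SalePrice > yQ2) %>% tally()/nrow(train))
## n
## 1 0.2253425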
numerator <- filter(train, SalePrice > yQ2 & TotalBsmtSF < xQ3) %>% tally()/nrow(train)
denominator <- filter(train, SalePrice > yQ2) %>% tally()/nrow(train)
(c <- numerator/denominator)
## n
## 1 0.5480769
x/y | <=2d quartile | >2d quartile | Total |
---|---|---|---|
<=3d quartile | 696 | 399 | 1095 |
>3d quartile | 36 | 329 | 365 |
Total | 732 | 728 | 1460 |
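The code that produced this table is not shown. One way to build such a table of counts is with `table()` and `addmargins()`, sketched here on hypothetical toy vectors rather than the training data.

```r
# Toy data (hypothetical values, not taken from train.csv)
x <- c(10, 20, 30, 40, 50, 60, 70, 80)
y <- c(1, 3, 2, 5, 4, 7, 6, 8)

# Label each observation relative to the quartile cut-offs
x_grp <- factor(x > quantile(x, 0.75), levels = c(FALSE, TRUE),
                labels = c("<=3d quartile", ">3d quartile"))
y_grp <- factor(y > quantile(y, 0.50), levels = c(FALSE, TRUE),
                labels = c("<=2d quartile", ">2d quartile"))

counts <- table(x = x_grp, y = y_grp)  # 2 x 2 cross-tabulation
addmargins(counts)                     # append row and column totals
```

The `counts` object can be passed directly to `chisq.test()`, which avoids retyping the cell values by hand.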
Splitting them in this manner doesn’t make them independent, although it allows for testing independence below using the chi-squared test.
Let A be the new variable counting those observations above the 3d quartile for X, and let B be the new variable counting those observations above the 2d quartile for Y.
Does \(P(A|B)=P(A)P(B)\)? Check mathematically, and then evaluate by running a Chi Square test for association.
\(P(A) = 365/1460 = 0.25\)
\(P(B) = 728/1460 = 0.4986301\)
\(P(A)P(B)\) = 0.25 * 0.4986301 = 0.1246575
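We know that \(P(A|B)\) = 0.4519231; therefore \(P(A|B) \neq P(A)P(B)\). More to the point, independence would require \(P(A|B) = P(A) = 0.25\), or equivalently \(P(A \cap B) = P(A)P(B)\); from the table, \(P(A \cap B) = 329/1460 = 0.2253425\), not 0.1246575. This suggests X and Y are not independent.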
Test the hypothesis that X is independent of Y at the .05 significance level.
# matrix values are from the table above
mat <- matrix(c(696, 399, 36, 329), 2, 2, byrow=T)
chisq.test(mat, correct=TRUE)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mat
## X-squared = 313.61, df = 1, p-value < 2.2e-16
As the p-value is far below the .05 significance level, we reject the null hypothesis that X is independent of Y. The Chi-squared test indicates dependence between X and Y (TotalBsmtSF and SalePrice).
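The test statistic can be reproduced by hand from the table of counts. The sketch below computes the expected counts under independence and applies the Yates continuity correction; the variable names are illustrative.

```r
# Observed counts, taken from the table above
mat <- matrix(c(696, 399, 36, 329), 2, 2, byrow = TRUE)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(mat), colSums(mat)) / sum(mat)

# Chi-squared statistic with Yates' continuity correction (subtract 0.5)
stat <- sum((abs(mat - expected) - 0.5)^2 / expected)

# Upper-tail p-value on (2 - 1) * (2 - 1) = 1 degree of freedom
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)
```

The observed count of 329 in the "both above" cell is far above the 182 expected under independence, which is what drives the large statistic.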
Provide univariate descriptive statistics and appropriate plots for the training data set.
Numeric Variables listed below:
LotFrontage : Linear feet of street connected to property
LotArea : Lot size in square feet
OverallQual : Rates the overall material and finish of the house
OverallCond : Rates the overall condition of the house
YearBuilt : Original construction date
YearRemodAdd : Remodel date (same as construction date if no remodeling or additions)
MasVnrArea : Masonry veneer area in square feet
BsmtFinSF1 : Type 1 finished square feet
BsmtFinSF2 : Type 2 finished square feet
BsmtUnfSF : Unfinished square feet of basement area
TotalBsmtSF : Total square feet of basement area
1stFlrSF : First Floor square feet
2ndFlrSF : Second floor square feet
LowQualFinSF : Low quality finished square feet (all floors)
GrLivArea : Above grade (ground) living area square feet
BsmtFullBath : Basement full bathrooms
BsmtHalfBath : Basement half bathrooms
FullBath : Full bathrooms above grade
HalfBath : Half baths above grade
BedroomAbvGr : Bedrooms above grade (does NOT include basement bedrooms)
KitchenAbvGr : Kitchens above grade
TotRmsAbvGrd : Total rooms above grade (does not include bathrooms)
Fireplaces : Number of fireplaces
GarageYrBlt : Year garage was built
GarageCars : Size of garage in car capacity
GarageArea : Size of garage in square feet
WoodDeckSF : Wood deck area in square feet
OpenPorchSF : Open porch area in square feet
EnclosedPorch : Enclosed porch area in square feet
3SsnPorch : Three season porch area in square feet
ScreenPorch : Screen porch area in square feet
PoolArea : Pool area in square feet
MiscVal : $Value of miscellaneous feature
MoSold : Month Sold (MM)
YrSold : Year Sold (YYYY)
SalePrice : Sale Price of the House
Generate descriptive statistics on the numerical variables in the training dataset.
## Id LotFrontage LotArea OverallQual
## Min. : 1.0 Min. : 21.00 Min. : 1300 Min. : 1.000
## 1st Qu.: 365.8 1st Qu.: 59.00 1st Qu.: 7554 1st Qu.: 5.000
## Median : 730.5 Median : 69.00 Median : 9478 Median : 6.000
## Mean : 730.5 Mean : 70.05 Mean : 10517 Mean : 6.099
## 3rd Qu.:1095.2 3rd Qu.: 80.00 3rd Qu.: 11602 3rd Qu.: 7.000
## Max. :1460.0 Max. :313.00 Max. :215245 Max. :10.000
## NA's :259
## OverallCond YearBuilt YearRemodAdd MasVnrArea
## Min. :1.000 Min. :1872 Min. :1950 Min. : 0.0
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 1st Qu.: 0.0
## Median :5.000 Median :1973 Median :1994 Median : 0.0
## Mean :5.575 Mean :1971 Mean :1985 Mean : 103.7
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 3rd Qu.: 166.0
## Max. :9.000 Max. :2010 Max. :2010 Max. :1600.0
## NA's :8
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8
## Median : 383.5 Median : 0.00 Median : 477.5 Median : 991.5
## Mean : 443.6 Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :5644.0 Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## Min. : 334 Min. : 0 Min. : 0.000 Min. : 334
## 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130
## Median :1087 Median : 0 Median : 0.000 Median :1464
## Mean :1163 Mean : 347 Mean : 5.845 Mean :1515
## 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777
## Max. :4692 Max. :2065 Max. :572.000 Max. :5642
##
## BsmtFullBath BsmtHalfBath FullBath HalfBath
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000
## Mean :0.4253 Mean :0.05753 Mean :1.565 Mean :0.3829
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.00000 Max. :3.000 Max. :2.0000
##
## BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## Min. :0.000 Min. :0.000 Min. : 2.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.:0.000
## Median :3.000 Median :1.000 Median : 6.000 Median :1.000
## Mean :2.866 Mean :1.047 Mean : 6.518 Mean :0.613
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000 3rd Qu.:1.000
## Max. :8.000 Max. :3.000 Max. :14.000 Max. :3.000
##
## GarageYrBlt GarageCars GarageArea WoodDeckSF
## Min. :1900 Min. :0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5 1st Qu.: 0.00
## Median :1980 Median :2.000 Median : 480.0 Median : 0.00
## Mean :1979 Mean :1.767 Mean : 473.0 Mean : 94.24
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 3rd Qu.:168.00
## Max. :2010 Max. :4.000 Max. :1418.0 Max. :857.00
## NA's :81
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 25.00 Median : 0.00 Median : 0.00 Median : 0.00
## Mean : 46.66 Mean : 21.95 Mean : 3.41 Mean : 15.06
## 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :547.00 Max. :552.00 Max. :508.00 Max. :480.00
##
## PoolArea MiscVal MoSold YrSold
## Min. : 0.000 Min. : 0.00 Min. : 1.000 Min. :2006
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007
## Median : 0.000 Median : 0.00 Median : 6.000 Median :2008
## Mean : 2.759 Mean : 43.49 Mean : 6.322 Mean :2008
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :738.000 Max. :15500.00 Max. :12.000 Max. :2010
##
## SalePrice TotalPorchSF
## Min. : 34900 Min. : 0.00
## 1st Qu.:129975 1st Qu.: 0.00
## Median :163000 Median : 48.00
## Mean :180921 Mean : 87.08
## 3rd Qu.:214000 3rd Qu.: 136.00
## Max. :755000 Max. :1027.00
##
X = TotalBsmtSF
Y = SalePrice
Derive a correlation matrix for two of the quantitative variables you selected.
Plot the correlation between LotArea, TotalBsmtSF, BsmtFinSF1, GrLivArea, GarageArea, PoolArea, TotalPorchSF, SalePrice:
cor_data <- select(train, LotArea, TotalBsmtSF, BsmtFinSF1, GrLivArea, GarageArea, PoolArea, TotalPorchSF, SalePrice)
mat <- cor(cor_data)
mat
## LotArea TotalBsmtSF BsmtFinSF1 GrLivArea GarageArea
## LotArea 1.00000000 0.2608331 0.21410313 0.2631162 0.18040276
## TotalBsmtSF 0.26083313 1.0000000 0.52239605 0.4548682 0.48666546
## BsmtFinSF1 0.21410313 0.5223961 1.00000000 0.2081711 0.29697039
## GrLivArea 0.26311617 0.4548682 0.20817113 1.0000000 0.46899748
## GarageArea 0.18040276 0.4866655 0.29697039 0.4689975 1.00000000
## PoolArea 0.07767239 0.1260531 0.14049129 0.1702053 0.06104727
## TotalPorchSF 0.07130996 0.1554712 0.05119947 0.2728528 0.11834590
## SalePrice 0.26384335 0.6135806 0.38641981 0.7086245 0.62343144
## PoolArea TotalPorchSF SalePrice
## LotArea 0.07767239 0.07130996 0.26384335
## TotalBsmtSF 0.12605313 0.15547122 0.61358055
## BsmtFinSF1 0.14049129 0.05119947 0.38641981
## GrLivArea 0.17020534 0.27285275 0.70862448
## GarageArea 0.06104727 0.11834590 0.62343144
## PoolArea 1.00000000 0.09473441 0.09240355
## TotalPorchSF 0.09473441 1.00000000 0.19573894
## SalePrice 0.09240355 0.19573894 1.00000000
corrplot(mat, method="square")
Looking at the resulting correlation plot, we see that GrLivArea (0.71), GarageArea (0.62), and TotalBsmtSF (0.61) are the variables most highly correlated with SalePrice. The X variable being tested, TotalBsmtSF, has a fairly strong positive correlation with SalePrice.
Provide a 95% CI for the difference in the mean of the variables.
t.test(train$TotalBsmtSF, train$SalePrice)
##
## Welch Two Sample t-test
##
## data: train$TotalBsmtSF and train$SalePrice
## t = -86.509, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -183942.2 -175785.3
## sample estimates:
## mean of x mean of y
## 1057.429 180921.196
In the training dataset, the mean total basement area is 1057.429 square feet and the mean sale price of a house is 180921.196. The 95% confidence interval for the difference in means (TotalBsmtSF minus SalePrice) is -183,942.2 to -175,785.3.
The very small p-value (< 0.05) leads us to reject the null hypothesis that the two means are equal. Because the two variables are measured in different units (square feet and dollars), this difference is expected and does not by itself say anything about a relationship between basement area and sale price; the correlation test below addresses that question.
Derive a correlation matrix for two of the quantitative variables you selected.
Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval.
Discuss the meaning of your analysis.
cor.test(train$TotalBsmtSF, train$SalePrice, method = "pearson" , conf.level = 0.99)
##
## Pearson's product-moment correlation
##
## data: train$TotalBsmtSF and train$SalePrice
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.5697562 0.6539251
## sample estimates:
## cor
## 0.6135806
The results show that we reject the null hypothesis that the correlation between basement area and sale price is 0. Indeed, basement area and sale price have a fairly strong, positive correlation of 0.614, and the 99% confidence interval (0.570 to 0.654) lies well above 0.
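For reference, the interval and t statistic reported by `cor.test()` can be reproduced from the sample correlation and sample size alone, using the Fisher z-transformation. This is a sketch using the rounded values printed above, so it matches only to several decimal places.

```r
r <- 0.6135806   # sample correlation reported above
n <- 1460        # number of observations (df = n - 2 = 1458)

# t statistic for H0: correlation = 0
t_stat <- r * sqrt((n - 2) / (1 - r^2))

# 99% confidence interval via the Fisher z-transformation
z    <- atanh(r)               # transform r to the z scale
se   <- 1 / sqrt(n - 3)        # standard error on the z scale
crit <- qnorm(1 - 0.01 / 2)    # two-sided 99% critical value
ci   <- tanh(z + c(-1, 1) * crit * se)  # back-transform to the r scale
```

Because the interval is built on the z scale and then transformed back, it is not symmetric around 0.6135806.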
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
xydata <- select(train, TotalBsmtSF, SalePrice)
cormatrix <- cor(xydata)
cormatrix
## TotalBsmtSF SalePrice
## TotalBsmtSF 1.0000000 0.6135806
## SalePrice 0.6135806 1.0000000
precmatrix <- solve(cormatrix)
precmatrix
## TotalBsmtSF SalePrice
## TotalBsmtSF 1.6038006 -0.9840609
## SalePrice -0.9840609 1.6038006
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
cormatrix %*% precmatrix
## TotalBsmtSF SalePrice
## TotalBsmtSF 1 0
## SalePrice 0 1
precmatrix %*% cormatrix
## TotalBsmtSF SalePrice
## TotalBsmtSF 1 0
## SalePrice 0 1
Both matrix operations return the identity matrix.
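With only two variables, the entries of the precision matrix have a closed form: the diagonal is the variance inflation factor \(1/(1-r^2)\) and the off-diagonal is \(-r/(1-r^2)\). A short check using the rounded correlation printed above:

```r
r <- 0.6135806                      # correlation between TotalBsmtSF and SalePrice
cormatrix  <- matrix(c(1, r, r, 1), 2, 2)
precmatrix <- solve(cormatrix)      # inverse of the correlation matrix

vif     <- 1 / (1 - r^2)            # expected diagonal entry
offdiag <- -r / (1 - r^2)           # expected off-diagonal entry
```

A VIF of about 1.6 is small, so collinearity between these two variables would not be a concern on its own.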
Principal Component Analysis (PCA) is used to extract the important information from a multivariate set of data and to express this information as a set of new variables called principal components. Essentially PCA allows for the reduction of dimensionality of a dataset.
The PCA analysis will consider the following variables and use the FactoMineR and factoextra packages.
* LotArea
* TotalBsmtSF
* BsmtFinSF1
* GrLivArea
* GarageArea
* PoolArea
* TotalPorchSF (derived from OpenPorchSF + EnclosedPorch + X3SsnPorch + ScreenPorch)
house_data <- select(train, LotArea, TotalBsmtSF, BsmtFinSF1, GrLivArea, GarageArea, PoolArea, TotalPorchSF)
Call the PCA function using scaling, number of dimensions to retain = 5, and graph = TRUE:
house.pca <- PCA(house_data, scale.unit=TRUE, ncp=5, graph=TRUE)
The proportion of the total variation explained by the principal components is shown below:
Component | eigenvalue | percentage of variance |
---|---|---|
comp 1 | 2.5021041 | 35.744344 |
comp 2 | 1.0453524 | 14.933606 |
comp 3 | 0.9590896 | 13.701280 |
comp 4 | 0.8596081 | 12.280115 |
comp 5 | 0.7598602 | 10.855146 |
comp 6 | 0.4788855 | 6.841221 |
We see that component 1 accounts for 35.7% of the variance, with an eigenvalue of 2.5. Components 1 and 2 together account for 50.68% of the total variation.
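The percentages in the table follow directly from the eigenvalues. With seven standardized variables, the eigenvalues of the correlation matrix sum to 7, so each share is the eigenvalue divided by 7. The sketch below uses the six eigenvalues listed above; the first two shares sum to roughly 50.68%.

```r
# Eigenvalues for components 1 to 6, copied from the table above
eig <- c(2.5021041, 1.0453524, 0.9590896, 0.8596081, 0.7598602, 0.4788855)

p <- 7                     # number of standardized variables in the PCA
pct     <- 100 * eig / p   # percentage of variance per component
cum_pct <- cumsum(pct)     # cumulative percentage of variance

round(cbind(eigenvalue = eig, pct = pct, cum_pct = cum_pct), 3)
```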
Scree plot:
Inspecting the scree plot, we see the “knee” at the inclusion of two components.
Correlation between the principal components and the variables:
Variable | Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 |
---|---|---|---|---|---|
LotArea | 0.4736633 | -0.1268642 | 0.0786642 | 0.8604403 | 0.0088340 |
TotalBsmtSF | 0.8098172 | -0.2069225 | -0.0214803 | -0.1612072 | 0.1242038 |
BsmtFinSF1 | 0.6283669 | -0.4049347 | 0.2582733 | -0.1765837 | 0.4878738 |
GrLivArea | 0.7295088 | 0.2649681 | -0.1905619 | 0.0257153 | -0.3469910 |
GarageArea | 0.7091088 | -0.1107127 | -0.2728814 | -0.2286004 | -0.3622159 |
PoolArea | 0.2837669 | 0.3949344 | 0.8399709 | -0.0948807 | -0.1985173 |
The FactoMineR function dimdesc() reports these correlations together with their p-values:
# show correlation for the first 3 components
dimdesc(house.pca, axes=c(1, 2, 3))
## $Dim.1
## $Dim.1$quanti
## correlation p.value
## TotalBsmtSF 0.8098172 0.000000e+00
## GrLivArea 0.7295088 8.749992e-243
## GarageArea 0.7091088 1.650947e-223
## BsmtFinSF1 0.6283669 3.177627e-161
## LotArea 0.4736633 1.610732e-82
## TotalPorchSF 0.3340004 2.213736e-39
## PoolArea 0.2837669 1.926473e-28
##
##
## $Dim.2
## $Dim.2$quanti
## correlation p.value
## TotalPorchSF 0.7642187 5.402259e-280
## PoolArea 0.3949344 1.085348e-55
## GrLivArea 0.2649681 7.017051e-25
## GarageArea -0.1107127 2.237755e-05
## LotArea -0.1268642 1.156290e-06
## TotalBsmtSF -0.2069225 1.390383e-15
## BsmtFinSF1 -0.4049347 1.023382e-58
##
##
## $Dim.3
## $Dim.3$quanti
## correlation p.value
## PoolArea 0.83997090 0.000000e+00
## BsmtFinSF1 0.25827327 1.115883e-23
## LotArea 0.07866421 2.631094e-03
## GrLivArea -0.19056188 2.096320e-13
## TotalPorchSF -0.26344985 1.323331e-24
## GarageArea -0.27288140 2.404532e-26
For Component 1, TotalBsmtSF, GrLivArea, and GarageArea are the most highly correlated variables, with TotalBsmtSF being the highest at 0.810.
Component 2 sees the highest correlation with the variable TotalPorchSF. Component 3 sees the highest correlation with the variable PoolArea.
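The component loadings (eigenvector coefficients) are obtained by dividing each variable's coordinates by the square root of the corresponding eigenvalue: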
sweep(house.pca$var$coord,2,sqrt(house.pca$eig[1:ncol(house.pca$var$coord),1]),FUN="/")
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## LotArea 0.2994450 -0.1240817 0.08032442 0.92804809 0.01013418
## TotalBsmtSF 0.5119580 -0.2023841 -0.02193368 -0.17387379 0.14248461
## BsmtFinSF1 0.3972470 -0.3960533 0.26372414 -0.19045851 0.55968113
## GrLivArea 0.4611879 0.2591566 -0.19458370 0.02773587 -0.39806262
## GarageArea 0.4482911 -0.1082845 -0.27864057 -0.24656232 -0.41552837
## PoolArea 0.1793945 0.3862723 0.85769851 -0.10233580 -0.22773591
## TotalPorchSF 0.2111515 0.7474572 -0.26900997 0.01361348 0.53232616
Based on the PCA, component 1 can be written as the following linear combination of the standardized variables:
\(PC1 = 0.299 \times LotArea + 0.512 \times TotalBsmtSF + 0.397 \times BsmtFinSF1 + 0.461 \times GrLivArea + 0.448 \times GarageArea + 0.179 \times PoolArea + 0.211 \times TotalPorchSF\)
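As a consistency check, the PC1 coefficients can be recovered from the variable-component correlations reported by `dimdesc()` and the first eigenvalue. The squared coefficients should sum to 1, since they form a unit-length eigenvector. The numbers below are copied from the output above; the object names are illustrative.

```r
# Correlations of each variable with Dim.1, copied from the dimdesc() output
cor_pc1 <- c(LotArea = 0.4736633, TotalBsmtSF = 0.8098172,
             BsmtFinSF1 = 0.6283669, GrLivArea = 0.7295088,
             GarageArea = 0.7091088, PoolArea = 0.2837669,
             TotalPorchSF = 0.3340004)
eig1 <- 2.5021041                    # eigenvalue of component 1

# Coefficient = correlation / sqrt(eigenvalue)
coef_pc1 <- cor_pc1 / sqrt(eig1)

round(coef_pc1, 3)
sum(coef_pc1^2)                      # should be (approximately) 1
```

The coefficients apply to centered and scaled variables, because the PCA was run with `scale.unit=TRUE`.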
Include the supplemental variable OverallQualityRange, where Q1 is on the low end and Q5 is on the high end of the grouping.
Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero.
Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
suppressWarnings(suppressMessages(library(MASS)))
min(train$TotalBsmtSF)
## [1] 0
# shift TotalBsmtSF above 0 by adding a very small number
TotalBsmtSF <- train$TotalBsmtSF + 0.0000001
min(TotalBsmtSF)
## [1] 1e-07
Derive the exponential distribution:
fit <- fitdistr(TotalBsmtSF, "exponential")
# find lambda
(lambda <- fit$estimate)
## rate
## 0.0009456896
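For the exponential distribution, the maximum-likelihood estimate of the rate is simply the reciprocal of the sample mean, which is why the fitted value is close to 1/1057.429. The sketch below illustrates this on synthetic data (the vector `x` is simulated, not TotalBsmtSF).

```r
library(MASS)

set.seed(1)                      # reproducible synthetic sample
x <- rexp(500, rate = 0.001)     # hypothetical data, not from train.csv

fit <- fitdistr(x, "exponential")
rate_closed_form <- 1 / mean(x)  # closed-form MLE of the rate

c(fitdistr = unname(fit$estimate), closed_form = rate_closed_form)
```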
Create the sample of 1000
sample <- rexp(1000, lambda)
Histograms - Simulated vs. Observed
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
Percentile is given by:
\[x_P = -\frac{\ln(1 - P)}{\lambda}\] where P is the percentile expressed as a proportion
# fitted exponential distribution (inverse CDF)
cdf.p5 <- log(1 - .05)/-lambda
cdf.p95 <- log(1 - .95)/-lambda
# observed
obs.p5 <- quantile(train$TotalBsmtSF, 0.05)
obs.p95 <- quantile(train$TotalBsmtSF, 0.95)
Data | 5th Percentile | 95th Percentile |
---|---|---|
Exponential (fitted CDF) | 54.23904 | 3167.776 |
Observed | 519.3 | 1753.0 |
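The percentiles in the first row of the table come from inverting the exponential CDF. Base R's `qexp()` performs the same calculation, as this short check with the fitted rate shows.

```r
lambda <- 0.0009456896            # fitted rate reported above
p <- c(0.05, 0.95)                # percentiles as proportions

by_formula <- -log(1 - p) / lambda    # inverse CDF written out
by_qexp    <- qexp(p, rate = lambda)  # built-in exponential quantile function

rbind(by_formula, by_qexp)
```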
Calculate a 95% confidence interval from the empirical data, assuming normality.
CI(train$TotalBsmtSF, 0.95)
## upper mean lower
## 1079.951 1057.429 1034.908
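An interval of this kind is the sample mean plus or minus a t critical value times the standard error. The helper below, `ci_mean()`, is a hypothetical stand-in written for illustration (it is not the Rmisc function) and is demonstrated on a small toy vector.

```r
# Hypothetical helper: t-based confidence interval for a mean
ci_mean <- function(x, conf = 0.95) {
  n    <- length(x)
  m    <- mean(x)
  se   <- sd(x) / sqrt(n)                      # standard error of the mean
  half <- qt(1 - (1 - conf) / 2, df = n - 1) * se
  c(upper = m + half, mean = m, lower = m - half)
}

toy <- c(2, 4, 4, 4, 5, 5, 7, 9)  # toy values, not from train.csv
res <- ci_mean(toy)
res
```

Note that this is an interval for the mean, not for individual observations, so it is much narrower than the range between the 5th and 95th percentiles.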
With 95% confidence, the mean of TotalBsmtSF is between 1034.908 and 1079.951. The exponential distribution is not a good fit in this case: its center is shifted left compared with the empirical data, and it has far more spread (5th and 95th percentiles of about 54 and 3168, versus 519 and 1753 observed).
Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
# create the training dataset, limited to numeric variables
numeric_var <- names(train)[which(sapply(train, is.numeric))]
house.train <- train[numeric_var]
# create the test dataset, limited to numeric variables
numeric_var <- names(test)[which(sapply(test, is.numeric))]
house.test <- test[numeric_var]
# replace missing values with 0
house.test[is.na(house.test)] <- 0
# Use the train function from the caret package to build a Random Forest model
rfFit <- train(SalePrice ~ .,
data=house.train,
method="rf",
trControl=trainControl(method="cv",number=5),
prox=TRUE, importance = TRUE,
allowParallel=TRUE)
# show the model summary
rfFit
# display the variables determined to be the most relevant
dotPlot(varImp(rfFit), main = "Random Forest Model - Most Relevant Variables")
# predict
pred_rf <- predict(rfFit, house.test)
# format: two columns, Id and predicted SalePrice
submission <- data.frame(Id = test$Id, SalePrice = pred_rf)
dim(submission) # there should be 1459 rows
write.csv(submission, file = "Kaggle_Submission2.csv", quote=FALSE, row.names=FALSE)
Username | Submission # | Score |
---|---|---|
folsom98 | 2 | 0.16427 |
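The competition scores submissions by the root-mean-squared error between the logarithm of the predicted price and the logarithm of the observed price. A minimal sketch of that metric follows; the function name `rmsle` and the example prices are illustrative only.

```r
# Root-mean-squared error on the log scale (illustrative helper)
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Hypothetical prices: every prediction is about 10.5% too high
actual    <- c(100000, 200000, 300000)
predicted <- actual * exp(0.1)
rmsle(actual, predicted)
```

Because the error is measured on the log scale, a score of 0.16427 corresponds to a typical relative error on the order of 16 to 18 percent, regardless of whether a house is cheap or expensive.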