DATA 605 Computational Mathematics Final

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. Do the following:

Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.

Required R Packages

suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(RCurl)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(e1071)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(scales)))
suppressWarnings(suppressMessages(library(cowplot)))
suppressWarnings(suppressMessages(library(corrplot)))
suppressWarnings(suppressMessages(library(caret)))
#suppressWarnings(suppressMessages(library(MASS)))
suppressWarnings(suppressMessages(library(Rmisc)))
suppressWarnings(suppressMessages(library(FactoMineR)))
suppressWarnings(suppressMessages(library(factoextra)))

Load and Review the Housing Prices Dataset

Review the variables in the training dataset:

## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : chr  "60" "20" "60" "70" ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
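
The code that loads the data is not shown in this report. The sketch below is one way a structure like the one above could be produced. It assumes the file is read with read.csv() (which prefixes names such as 3SsnPorch with X), that MSSubClass is converted to character, and that TotalPorchSF is derived as the sum of the four porch columns. The helper name load_housing and the temporary demo file are illustrative only; the three demo rows are Ids 1, 2 and 4 from the output above.

```r
# Hypothetical helper: read the Kaggle CSV and apply the adjustments
# implied by the output in this report (an assumption, not the author's code).
load_housing <- function(path) {
  df <- read.csv(path, stringsAsFactors = FALSE)  # "3SsnPorch" becomes "X3SsnPorch"
  df$MSSubClass <- as.character(df$MSSubClass)    # a dwelling-type code, not a quantity
  df$TotalPorchSF <- df$OpenPorchSF + df$EnclosedPorch +
    df$X3SsnPorch + df$ScreenPorch
  df
}

# Demo on three rows (Ids 1, 2 and 4) written to a temporary file
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Id,MSSubClass,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,SalePrice",
  "1,60,61,0,0,0,208500",
  "2,20,0,0,0,0,181500",
  "4,70,35,272,0,0,140000"
), demo_path)

train_demo <- load_housing(demo_path)
str(train_demo)
```

With the real file, the call would be load_housing("train.csv"), assuming the file sits in the working directory.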

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.

Determine the number of NA values in the numeric variables.

colSums(sapply(train[numeric_var], is.na))
##            Id   LotFrontage       LotArea   OverallQual   OverallCond 
##             0           259             0             0             0 
##     YearBuilt  YearRemodAdd    MasVnrArea    BsmtFinSF1    BsmtFinSF2 
##             0             0             8             0             0 
##     BsmtUnfSF   TotalBsmtSF     X1stFlrSF     X2ndFlrSF  LowQualFinSF 
##             0             0             0             0             0 
##     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath      HalfBath 
##             0             0             0             0             0 
##  BedroomAbvGr  KitchenAbvGr  TotRmsAbvGrd    Fireplaces   GarageYrBlt 
##             0             0             0             0            81 
##    GarageCars    GarageArea    WoodDeckSF   OpenPorchSF EnclosedPorch 
##             0             0             0             0             0 
##    X3SsnPorch   ScreenPorch      PoolArea       MiscVal        MoSold 
##             0             0             0             0             0 
##        YrSold     SalePrice  TotalPorchSF 
##             0             0             0

TotalBsmtSF is selected as the X variable and SalePrice as the Y variable. TotalBsmtSF is right skewed: its mean (1057.0) is greater than its median (991.5), and its skewness is positive.

summary(train$TotalBsmtSF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   795.8   991.5  1057.0  1298.0  6110.0
skewness(train$TotalBsmtSF)
## [1] 1.521124
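
To make the skewness figure less of a black box, here is a minimal sketch of a plain moment-based skewness in base R. It is not necessarily the exact variant returned by e1071::skewness(), whose type argument selects among estimators that differ by small-sample factors. The function name moment_skewness and the demo vector are illustrative assumptions.

```r
# Moment-based sample skewness: m3 / m2^(3/2), using 1/n central moments.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Illustrative values only (not the full TotalBsmtSF column)
basement_demo <- c(0, 796, 856, 920, 991, 1107, 1262, 1686, 6110)

mean(basement_demo) > median(basement_demo)  # mean pulled right by the large value
moment_skewness(basement_demo)               # positive for a right-skewed sample
```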

Calculate the following Probabilities

  1. \(P(X>x | Y>y)\)

  2. \(P(X>x, Y>y)\)

  3. \(P(X<x | Y>y)\)

Formula for Conditional Probability

\(p(x|y) = p(x,y)/p(y)\)

Define the 3rd quartile for TotalBsmtSF (X) and the 2nd quartile for SalePrice (Y)

# TotalBsmtSF
(xQ3 <- quantile(train$TotalBsmtSF, 0.75))
##     75% 
## 1298.25
# SalePrice
(yQ2 <- quantile(train$SalePrice, 0.5))
##    50% 
## 163000
  1. \(P(X>x | Y>y) = 0.2253425/0.4986301 = 0.4519231\)
numerator <- filter(train, SalePrice > yQ2 & TotalBsmtSF > xQ3) %>% tally()/nrow(train)

denominator <- filter(train, SalePrice > yQ2) %>% tally()/nrow(train)

(a <- numerator/denominator)
##           n
## 1 0.4519231
  2. \(P(X>x, Y>y) = 329/1460 = 0.2253425\)
# the joint probability is the share of rows satisfying both conditions;
# multiplying P(X>x) by P(Y>y) would be valid only if X and Y were independent
(b <- filter(train, TotalBsmtSF > xQ3 & SalePrice > yQ2) %>% tally()/nrow(train))
##           n
## 1 0.2253425
  3. \(P(X<x | Y>y) = 0.2732877/0.4986301 = 0.5480769\)
numerator <- filter(train, SalePrice > yQ2 & TotalBsmtSF < xQ3) %>% tally()/nrow(train)

denominator <- filter(train, SalePrice > yQ2) %>% tally()/nrow(train)

(c <- numerator/denominator)
##           n
## 1 0.5480769

| x/y | <=2d quartile | >2d quartile | Total |
|---|---|---|---|
| <=3d quartile | 696 | 399 | 1095 |
| >3d quartile | 36 | 329 | 365 |
| Total | 732 | 728 | 1460 |
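
The three probabilities can also be read directly off the table of counts. The sketch below uses only the four cell counts shown above; the object names are illustrative.

```r
# Counts copied from the contingency table (rows: X, columns: Y)
counts <- matrix(c(696, 399,
                    36, 329),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(X = c("<=Q3", ">Q3"), Y = c("<=Q2", ">Q2")))
n <- sum(counts)

p_joint <- counts[">Q3", ">Q2"] / n                      # P(X>x, Y>y)
p_y     <- sum(counts[, ">Q2"]) / n                      # P(Y>y)
p_a     <- p_joint / p_y                                 # P(X>x | Y>y)
p_c     <- counts["<=Q3", ">Q2"] / sum(counts[, ">Q2"])  # P(X<=x | Y>y)

round(c(joint = p_joint, a = p_a, c = p_c), 7)
```

Because the 3rd quartile of X (1298.25) is not an attained value, "at most x" and "less than x" give the same count here.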

Does splitting the training data in this fashion make them independent?

Splitting them in this manner doesn’t make them independent, although it allows for testing independence below using the chi-squared test.

Let A be the new variable counting those observations above the 3d quartile for X, and let B be the new variable counting those observations above the 2d quartile for Y.

Does \(P(A|B)=P(A)P(B)\)? Check mathematically, and then evaluate by running a Chi Square test for association.

\(P(A) = 365/1460 = 0.25\)
\(P(B) = 728/1460 = 0.4986301\)

\(P(A)P(B)\) = 0.25 * 0.4986301 = 0.1246575

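We know that \(P(A|B) = 0.4519231\), which equals neither \(P(A) = 0.25\) nor \(P(A)P(B) = 0.1246575\). Equivalently, the joint probability \(P(A \cap B) = 329/1460 = 0.2253425 \neq P(A)P(B)\). This suggests that X and Y are not independent.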

Evaluate by running a Chi Square test for association on x and y.

Test the hypothesis that X is independent of Y at the .05 significance level.

# matrix values are from the table above
mat <- matrix(c(696, 399, 36, 329), 2, 2, byrow=TRUE)

chisq.test(mat, correct=TRUE) 
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mat
## X-squared = 313.61, df = 1, p-value < 2.2e-16

Because the p-value is far below the .05 significance level, we reject the null hypothesis that X is independent of Y. The Chi-squared test indicates dependence between X and Y (TotalBsmtSF and SalePrice).
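
As a check on what chisq.test() does, the statistic can be rebuilt by hand from the observed counts. The sketch below computes the expected counts under independence and applies the Yates continuity correction; the variable names are illustrative.

```r
# Observed counts from the contingency table
observed <- matrix(c(696, 399, 36, 329), nrow = 2, byrow = TRUE)
n <- sum(observed)

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Yates' correction subtracts 0.5 from each |observed - expected|
yates_stat <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(yates_stat, df = 1, lower.tail = FALSE)

expected
yates_stat
p_value
```

The expected count for the cell above both quartiles is 182, far from the observed 329, which is what drives the large statistic.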

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set.

Numeric Variables listed below:

LotFrontage : Linear feet of street connected to property
LotArea : Lot size in square feet
OverallQual : Rates the overall material and finish of the house
OverallCond : Rates the overall condition of the house
YearBuilt : Original construction date
YearRemodAdd : Remodel date (same as construction date if no remodeling or additions)
MasVnrArea : Masonry veneer area in square feet
BsmtFinSF1 : Type 1 finished square feet
BsmtFinSF2 : Type 2 finished square feet
BsmtUnfSF : Unfinished square feet of basement area
TotalBsmtSF : Total square feet of basement area
1stFlrSF : First Floor square feet
2ndFlrSF : Second floor square feet
LowQualFinSF : Low quality finished square feet (all floors)
GrLivArea : Above grade (ground) living area square feet
BsmtFullBath : Basement full bathrooms
BsmtHalfBath : Basement half bathrooms
FullBath : Full bathrooms above grade
HalfBath : Half baths above grade
BedroomAbvGr : Bedrooms above grade (does NOT include basement bedrooms)
KitchenAbvGr : Kitchens above grade
TotRmsAbvGrd : Total rooms above grade (does not include bathrooms)
Fireplaces : Number of fireplaces
GarageYrBlt : Year garage was built
GarageCars : Size of garage in car capacity
GarageArea : Size of garage in square feet
WoodDeckSF : Wood deck area in square feet
OpenPorchSF : Open porch area in square feet
EnclosedPorch : Enclosed porch area in square feet
3SsnPorch : Three season porch area in square feet
ScreenPorch : Screen porch area in square feet
PoolArea : Pool area in square feet
MiscVal : $Value of miscellaneous feature
MoSold : Month Sold (MM)
YrSold : Year Sold (YYYY)
SalePrice : Sale Price of the House

Generate descriptive statistics on the numerical variables in the training dataset.

##        Id          LotFrontage        LotArea        OverallQual    
##  Min.   :   1.0   Min.   : 21.00   Min.   :  1300   Min.   : 1.000  
##  1st Qu.: 365.8   1st Qu.: 59.00   1st Qu.:  7554   1st Qu.: 5.000  
##  Median : 730.5   Median : 69.00   Median :  9478   Median : 6.000  
##  Mean   : 730.5   Mean   : 70.05   Mean   : 10517   Mean   : 6.099  
##  3rd Qu.:1095.2   3rd Qu.: 80.00   3rd Qu.: 11602   3rd Qu.: 7.000  
##  Max.   :1460.0   Max.   :313.00   Max.   :215245   Max.   :10.000  
##                   NA's   :259                                       
##   OverallCond      YearBuilt     YearRemodAdd    MasVnrArea    
##  Min.   :1.000   Min.   :1872   Min.   :1950   Min.   :   0.0  
##  1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967   1st Qu.:   0.0  
##  Median :5.000   Median :1973   Median :1994   Median :   0.0  
##  Mean   :5.575   Mean   :1971   Mean   :1985   Mean   : 103.7  
##  3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004   3rd Qu.: 166.0  
##  Max.   :9.000   Max.   :2010   Max.   :2010   Max.   :1600.0  
##                                                NA's   :8       
##    BsmtFinSF1       BsmtFinSF2        BsmtUnfSF       TotalBsmtSF    
##  Min.   :   0.0   Min.   :   0.00   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8  
##  Median : 383.5   Median :   0.00   Median : 477.5   Median : 991.5  
##  Mean   : 443.6   Mean   :  46.55   Mean   : 567.2   Mean   :1057.4  
##  3rd Qu.: 712.2   3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2  
##  Max.   :5644.0   Max.   :1474.00   Max.   :2336.0   Max.   :6110.0  
##                                                                      
##    X1stFlrSF      X2ndFlrSF     LowQualFinSF       GrLivArea   
##  Min.   : 334   Min.   :   0   Min.   :  0.000   Min.   : 334  
##  1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130  
##  Median :1087   Median :   0   Median :  0.000   Median :1464  
##  Mean   :1163   Mean   : 347   Mean   :  5.845   Mean   :1515  
##  3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777  
##  Max.   :4692   Max.   :2065   Max.   :572.000   Max.   :5642  
##                                                                
##   BsmtFullBath     BsmtHalfBath        FullBath        HalfBath     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :2.000   Median :0.0000  
##  Mean   :0.4253   Mean   :0.05753   Mean   :1.565   Mean   :0.3829  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :3.0000   Max.   :2.00000   Max.   :3.000   Max.   :2.0000  
##                                                                     
##   BedroomAbvGr    KitchenAbvGr    TotRmsAbvGrd      Fireplaces   
##  Min.   :0.000   Min.   :0.000   Min.   : 2.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 5.000   1st Qu.:0.000  
##  Median :3.000   Median :1.000   Median : 6.000   Median :1.000  
##  Mean   :2.866   Mean   :1.047   Mean   : 6.518   Mean   :0.613  
##  3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 7.000   3rd Qu.:1.000  
##  Max.   :8.000   Max.   :3.000   Max.   :14.000   Max.   :3.000  
##                                                                  
##   GarageYrBlt     GarageCars      GarageArea       WoodDeckSF    
##  Min.   :1900   Min.   :0.000   Min.   :   0.0   Min.   :  0.00  
##  1st Qu.:1961   1st Qu.:1.000   1st Qu.: 334.5   1st Qu.:  0.00  
##  Median :1980   Median :2.000   Median : 480.0   Median :  0.00  
##  Mean   :1979   Mean   :1.767   Mean   : 473.0   Mean   : 94.24  
##  3rd Qu.:2002   3rd Qu.:2.000   3rd Qu.: 576.0   3rd Qu.:168.00  
##  Max.   :2010   Max.   :4.000   Max.   :1418.0   Max.   :857.00  
##  NA's   :81                                                      
##   OpenPorchSF     EnclosedPorch      X3SsnPorch      ScreenPorch    
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median : 25.00   Median :  0.00   Median :  0.00   Median :  0.00  
##  Mean   : 46.66   Mean   : 21.95   Mean   :  3.41   Mean   : 15.06  
##  3rd Qu.: 68.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00  
##  Max.   :547.00   Max.   :552.00   Max.   :508.00   Max.   :480.00  
##                                                                     
##     PoolArea          MiscVal             MoSold           YrSold    
##  Min.   :  0.000   Min.   :    0.00   Min.   : 1.000   Min.   :2006  
##  1st Qu.:  0.000   1st Qu.:    0.00   1st Qu.: 5.000   1st Qu.:2007  
##  Median :  0.000   Median :    0.00   Median : 6.000   Median :2008  
##  Mean   :  2.759   Mean   :   43.49   Mean   : 6.322   Mean   :2008  
##  3rd Qu.:  0.000   3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009  
##  Max.   :738.000   Max.   :15500.00   Max.   :12.000   Max.   :2010  
##                                                                      
##    SalePrice       TotalPorchSF    
##  Min.   : 34900   Min.   :   0.00  
##  1st Qu.:129975   1st Qu.:   0.00  
##  Median :163000   Median :  48.00  
##  Mean   :180921   Mean   :  87.08  
##  3rd Qu.:214000   3rd Qu.: 136.00  
##  Max.   :755000   Max.   :1027.00  
## 

Plots






Overall House Quality and Condition Compared to Sale Price

Above Ground Living Area and Sale Price

House Price Compared to Year Built

Provide a scatterplot of X and Y

X = TotalBsmtSF
Y = SalePrice

Correlation

Derive a correlation matrix for two of the quantitative variables you selected.

Plot the correlation between LotArea, TotalBsmtSF, BsmtFinSF1, GrLivArea, GarageArea, PoolArea, TotalPorchSF, SalePrice:

cor_data <- select(train, LotArea, TotalBsmtSF, BsmtFinSF1, GrLivArea, GarageArea, PoolArea, TotalPorchSF, SalePrice)

mat <- cor(cor_data)
mat
##                 LotArea TotalBsmtSF BsmtFinSF1 GrLivArea GarageArea
## LotArea      1.00000000   0.2608331 0.21410313 0.2631162 0.18040276
## TotalBsmtSF  0.26083313   1.0000000 0.52239605 0.4548682 0.48666546
## BsmtFinSF1   0.21410313   0.5223961 1.00000000 0.2081711 0.29697039
## GrLivArea    0.26311617   0.4548682 0.20817113 1.0000000 0.46899748
## GarageArea   0.18040276   0.4866655 0.29697039 0.4689975 1.00000000
## PoolArea     0.07767239   0.1260531 0.14049129 0.1702053 0.06104727
## TotalPorchSF 0.07130996   0.1554712 0.05119947 0.2728528 0.11834590
## SalePrice    0.26384335   0.6135806 0.38641981 0.7086245 0.62343144
##                PoolArea TotalPorchSF  SalePrice
## LotArea      0.07767239   0.07130996 0.26384335
## TotalBsmtSF  0.12605313   0.15547122 0.61358055
## BsmtFinSF1   0.14049129   0.05119947 0.38641981
## GrLivArea    0.17020534   0.27285275 0.70862448
## GarageArea   0.06104727   0.11834590 0.62343144
## PoolArea     1.00000000   0.09473441 0.09240355
## TotalPorchSF 0.09473441   1.00000000 0.19573894
## SalePrice    0.09240355   0.19573894 1.00000000
corrplot(mat, method="square")

Looking at the resulting correlation plot, we see that GrLivArea (0.709), GarageArea (0.623), and TotalBsmtSF (0.614) are the variables most highly correlated with SalePrice. The X variable being tested, TotalBsmtSF, has a moderately strong positive correlation with SalePrice.

Provide a 95% CI for the difference in the mean of the variables.

t.test(train$TotalBsmtSF, train$SalePrice) 
## 
##  Welch Two Sample t-test
## 
## data:  train$TotalBsmtSF and train$SalePrice
## t = -86.509, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -183942.2 -175785.3
## sample estimates:
##  mean of x  mean of y 
##   1057.429 180921.196

In the housing training dataset, the mean total basement area is 1057.429 square feet and the mean sale price of a house is 180921.196 dollars. The 95% confidence interval for the difference in means (TotalBsmtSF minus SalePrice) is -183,942.2 to -175,785.3.

We see a very small p-value (< 2.2e-16, far below 0.05), which leads us to reject the null hypothesis that the two means are equal. Because the two variables are measured in different units (square feet and dollars), this result only confirms that their means differ; it is not by itself evidence of a relationship between them. The correlation test that follows addresses the relationship directly.

Derive a correlation matrix for two of the quantitative variables you selected.

Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval.
Discuss the meaning of your analysis.

cor.test(train$TotalBsmtSF, train$SalePrice, method = "pearson" , conf.level = 0.99)
## 
##  Pearson's product-moment correlation
## 
## data:  train$TotalBsmtSF and train$SalePrice
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.5697562 0.6539251
## sample estimates:
##       cor 
## 0.6135806

Because the p-value is below 2.2e-16 and the 99% confidence interval (0.570 to 0.654) excludes 0, we reject the null hypothesis that the correlation between basement area and sale price is 0. Basement area and sale price have a moderately strong, positive correlation of 0.614.
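
The t statistic and the 99% interval can be reproduced from the correlation and the sample size alone. The sketch below assumes the interval is the standard Fisher z-transformation interval for a Pearson correlation; the variable names are illustrative.

```r
r <- 0.6135806   # sample correlation from the output above
n <- 1460        # number of observations

# t statistic for H0: correlation = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Fisher z-transformation: atanh(r) is approximately normal with sd 1/sqrt(n - 3)
z    <- atanh(r)
se   <- 1 / sqrt(n - 3)
crit <- qnorm(1 - 0.01 / 2)           # two-sided 99% critical value
ci   <- tanh(z + c(-1, 1) * crit * se)

t_stat
ci
```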

Linear Algebra and Correlation.

Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

xydata <- select(train, TotalBsmtSF, SalePrice)

cormatrix <- cor(xydata)
cormatrix
##             TotalBsmtSF SalePrice
## TotalBsmtSF   1.0000000 0.6135806
## SalePrice     0.6135806 1.0000000
precmatrix <- solve(cormatrix)

precmatrix
##             TotalBsmtSF  SalePrice
## TotalBsmtSF   1.6038006 -0.9840609
## SalePrice    -0.9840609  1.6038006

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

cormatrix %*% precmatrix
##             TotalBsmtSF SalePrice
## TotalBsmtSF           1         0
## SalePrice             0         1
precmatrix %*% cormatrix
##             TotalBsmtSF SalePrice
## TotalBsmtSF           1         0
## SalePrice             0         1

Both matrix operations return the identity matrix.
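
For a 2 x 2 correlation matrix the inverse has a closed form, which shows where the variance inflation factor on the diagonal comes from: 1/(1 - r^2) on the diagonal and -r/(1 - r^2) off the diagonal. The sketch below rebuilds the matrices from the correlation alone; the names ending in _demo are illustrative.

```r
r <- 0.6135806                          # correlation of TotalBsmtSF and SalePrice
cormatrix_demo <- matrix(c(1, r, r, 1), nrow = 2)
prec_demo <- solve(cormatrix_demo)

vif      <- 1 / (1 - r^2)               # diagonal of the precision matrix
off_diag <- -r / (1 - r^2)              # off-diagonal entries

prec_demo
c(vif = vif, off_diag = off_diag)
round(cormatrix_demo %*% prec_demo, 10) # identity matrix
```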

Conduct principal components analysis and interpret. Discuss.

Principal Component Analysis (PCA) is used to extract the important information from a multivariate set of data and to express this information as a set of new variables called principal components. Essentially PCA allows for the reduction of dimensionality of a dataset.

The PCA analysis will consider the following variables and use the FactoMineR and factoextra packages.
* LotArea
* TotalBsmtSF
* BsmtFinSF1
* GrLivArea
* GarageArea
* PoolArea
* TotalPorchSF (derived from OpenPorchSF + EnclosedPorch + X3SsnPorch + ScreenPorch)

house_data <- select(train, LotArea, TotalBsmtSF, BsmtFinSF1, GrLivArea, GarageArea, PoolArea, TotalPorchSF)

Call the PCA function using scaling, number of dimensions to retain = 5, and graph = TRUE

house.pca <- PCA(house_data, scale.unit=TRUE, ncp=5, graph=TRUE)

The proportion of the total variation explained by the principal components is shown below:

| Component | Eigenvalue | Percentage of variance |
|---|---|---|
| comp 1 | 2.5021041 | 35.744344 |
| comp 2 | 1.0453524 | 14.933606 |
| comp 3 | 0.9590896 | 13.701280 |
| comp 4 | 0.8596081 | 12.280115 |
| comp 5 | 0.7598602 | 10.855146 |
| comp 6 | 0.4788855 | 6.841221 |

We see that component 1 accounts for 35.7% of the variance, with an eigenvalue of 2.5. Components 1 and 2 together account for 50.68% of the total variation.
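
With scale.unit=TRUE, the PCA eigenvalues are the eigenvalues of the correlation matrix of the seven variables. The sketch below, which does not use FactoMineR, rebuilds that matrix from the pairwise correlations reported in the Correlation section (rounded to seven decimals, SalePrice omitted) and calls base R's eigen(). The object names are illustrative.

```r
vars <- c("LotArea", "TotalBsmtSF", "BsmtFinSF1", "GrLivArea",
          "GarageArea", "PoolArea", "TotalPorchSF")

# Pairwise correlations, listed row by row above the diagonal
r_pairs <- c(0.2608331, 0.2141031, 0.2631162, 0.1804028, 0.0776724, 0.0713100,
             0.5223961, 0.4548682, 0.4866655, 0.1260531, 0.1554712,
             0.2081711, 0.2969704, 0.1404913, 0.0511995,
             0.4689975, 0.1702053, 0.2728528,
             0.0610473, 0.1183459,
             0.0947344)

R <- diag(7)
dimnames(R) <- list(vars, vars)
R[lower.tri(R)] <- r_pairs               # fills column by column below the diagonal
R[upper.tri(R)] <- t(R)[upper.tri(R)]    # mirror to make the matrix symmetric

ev  <- eigen(R, symmetric = TRUE)$values
pct <- 100 * ev / sum(ev)                # percentage of variance per component

round(cbind(eigenvalue = ev, percent = pct, cumulative = cumsum(pct)), 3)
```

The eigenvalues sum to 7, the number of standardized variables, so each percentage is simply the eigenvalue divided by 7.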

Scree plot:

Inspecting the scree plot, we see the “knee” at the inclusion of two components.

Correlation between the principal components and the variables:

| Variable | Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 |
|---|---|---|---|---|---|
| LotArea | 0.4736633 | -0.1268642 | 0.0786642 | 0.8604403 | 0.0088340 |
| TotalBsmtSF | 0.8098172 | -0.2069225 | -0.0214803 | -0.1612072 | 0.1242038 |
| BsmtFinSF1 | 0.6283669 | -0.4049347 | 0.2582733 | -0.1765837 | 0.4878738 |
| GrLivArea | 0.7295088 | 0.2649681 | -0.1905619 | 0.0257153 | -0.3469910 |
| GarageArea | 0.7091088 | -0.1107127 | -0.2728814 | -0.2286004 | -0.3622159 |
| PoolArea | 0.2837669 | 0.3949344 | 0.8399709 | -0.0948807 | -0.1985173 |

The FactoMineR function dimdesc() provides this with p-values:

# show correlation for the first 3 components
dimdesc(house.pca, axes=c(1, 2, 3))
## $Dim.1
## $Dim.1$quanti
##              correlation       p.value
## TotalBsmtSF    0.8098172  0.000000e+00
## GrLivArea      0.7295088 8.749992e-243
## GarageArea     0.7091088 1.650947e-223
## BsmtFinSF1     0.6283669 3.177627e-161
## LotArea        0.4736633  1.610732e-82
## TotalPorchSF   0.3340004  2.213736e-39
## PoolArea       0.2837669  1.926473e-28
## 
## 
## $Dim.2
## $Dim.2$quanti
##              correlation       p.value
## TotalPorchSF   0.7642187 5.402259e-280
## PoolArea       0.3949344  1.085348e-55
## GrLivArea      0.2649681  7.017051e-25
## GarageArea    -0.1107127  2.237755e-05
## LotArea       -0.1268642  1.156290e-06
## TotalBsmtSF   -0.2069225  1.390383e-15
## BsmtFinSF1    -0.4049347  1.023382e-58
## 
## 
## $Dim.3
## $Dim.3$quanti
##              correlation      p.value
## PoolArea      0.83997090 0.000000e+00
## BsmtFinSF1    0.25827327 1.115883e-23
## LotArea       0.07866421 2.631094e-03
## GrLivArea    -0.19056188 2.096320e-13
## TotalPorchSF -0.26344985 1.323331e-24
## GarageArea   -0.27288140 2.404532e-26

For Component 1, TotalBsmtSF, GrLivArea, and GarageArea are the most highly correlated variables, with TotalBsmtSF being the highest at 0.810.

Component 2 sees the highest correlation with the variable TotalPorchSF. Component 3 sees the highest correlation with the variable PoolArea.

The component loadings (the coefficient of each standardized variable in each component) are given by:

sweep(house.pca$var$coord,2,sqrt(house.pca$eig[1:ncol(house.pca$var$coord),1]),FUN="/")
##                  Dim.1      Dim.2       Dim.3       Dim.4       Dim.5
## LotArea      0.2994450 -0.1240817  0.08032442  0.92804809  0.01013418
## TotalBsmtSF  0.5119580 -0.2023841 -0.02193368 -0.17387379  0.14248461
## BsmtFinSF1   0.3972470 -0.3960533  0.26372414 -0.19045851  0.55968113
## GrLivArea    0.4611879  0.2591566 -0.19458370  0.02773587 -0.39806262
## GarageArea   0.4482911 -0.1082845 -0.27864057 -0.24656232 -0.41552837
## PoolArea     0.1793945  0.3862723  0.85769851 -0.10233580 -0.22773591
## TotalPorchSF 0.2111515  0.7474572 -0.26900997  0.01361348  0.53232616

Based on the PCA, we can derive component 1 as shown below:

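\(PC1 = 0.299 * LotArea + 0.512 * TotalBsmtSF + 0.397 * BsmtFinSF1 + 0.461 * GrLivArea + 0.448 * GarageArea + 0.179 * PoolArea + 0.211 * TotalPorchSF\)

Each variable in this expression is standardized (centered and scaled to unit variance), because the PCA was run with scale.unit=TRUE.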
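
Two properties of these coefficients can be checked from the printed numbers alone: the coefficient vector of a component has unit length, and multiplying it by the square root of the component's eigenvalue recovers the variable-component correlations. The sketch below uses the Dim.1 column and the first eigenvalue as reported; the object names are illustrative.

```r
# Dim.1 coefficients as printed in the sweep() output
loadings_pc1 <- c(LotArea = 0.2994450, TotalBsmtSF = 0.5119580,
                  BsmtFinSF1 = 0.3972470, GrLivArea = 0.4611879,
                  GarageArea = 0.4482911, PoolArea = 0.1793945,
                  TotalPorchSF = 0.2111515)
eig1 <- 2.5021041                     # eigenvalue of component 1

sum(loadings_pc1^2)                   # approximately 1 (unit-length vector)
round(loadings_pc1 * sqrt(eig1), 7)   # correlations of each variable with Dim.1
```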

Include the supplemental variable OverallQualityRange, where Q1 is on the low end and Q5 is on the high end of the grouping.

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero.

Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).

Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

suppressWarnings(suppressMessages(library(MASS)))

min(train$TotalBsmtSF)
## [1] 0
# shift TotalBsmtSF above 0 by adding a very small number 

TotalBsmtSF <- train$TotalBsmtSF + 0.0000001

min(TotalBsmtSF)
## [1] 1e-07

Derive the exponential distribution:

fit <- fitdistr(TotalBsmtSF, "exponential")

# find lambda

(lambda <- fit$estimate)
##         rate 
## 0.0009456896
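
The fitted rate is simply the reciprocal of the sample mean. A short derivation, assuming observations \(x_1, \dots, x_n\) from an exponential density \(\lambda e^{-\lambda x}\), shows why the maximum likelihood estimate takes this form:

```latex
\ell(\lambda) = \sum_{i=1}^{n} \ln\left(\lambda e^{-\lambda x_i}\right)
             = n \ln \lambda - \lambda \sum_{i=1}^{n} x_i

\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}

\hat{\lambda} = \frac{1}{1057.429} \approx 0.0009457
```

The last line uses the mean of TotalBsmtSF reported earlier; the shift of 0.0000001 is too small to change the result at this precision.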

Create the sample of 1000

sample <- rexp(1000, lambda)

Histograms - Simulated vs. Observed

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

The percentile is given by the inverse of the exponential CDF:

\[x_P = -\frac{\ln(1 - P)}{\lambda}\] where P is the percentile expressed as a proportion.

# percentiles of the fitted exponential distribution (inverse CDF)
cdf.p5 <- log(1 - .05)/-lambda
cdf.p95 <- log(1 - .95)/-lambda

# percentiles observed in the data
obs.p5 <- quantile(train$TotalBsmtSF, 0.05)
obs.p95 <- quantile(train$TotalBsmtSF, 0.95)

| Data | 5th Percentile | 95th Percentile |
|---|---|---|
| Exponential (fitted) | 54.23904 | 3167.776 |
| Observed | 519.3 | 1753.0 |
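
The closed-form percentile expression can be cross-checked against R's built-in exponential quantile function. The sketch below plugs in the fitted rate reported above; the variable names are illustrative.

```r
lambda_hat <- 0.0009456896            # fitted rate from fitdistr()
p <- c(0.05, 0.95)

closed_form <- -log(1 - p) / lambda_hat   # inverse of the CDF 1 - exp(-lambda * x)
builtin     <- qexp(p, rate = lambda_hat) # same quantiles via qexp()

closed_form
builtin
pexp(closed_form, rate = lambda_hat)      # the CDF maps them back to 0.05 and 0.95
```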

Calculate a 95% confidence interval from the empirical data, assuming normality.

CI(train$TotalBsmtSF, 0.95)
##    upper     mean    lower 
## 1079.951 1057.429 1034.908

With 95% confidence, the mean of TotalBsmtSF is between 1034.908 and 1079.951. The exponential distribution is not a good fit in this case. Its center is shifted left compared to the empirical data, and it shows more spread: its 5th percentile (54.2) is far below the observed 5th percentile (519.3), and its 95th percentile (3167.8) is far above the observed 95th percentile (1753.0).
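
For readers without the Rmisc package, an interval of this kind can be sketched in base R. The helper below assumes a t-based interval for the mean (mean plus or minus a t quantile times the standard error), which is believed to match what CI() computes. The function name mean_ci and the demo vector are illustrative.

```r
# Hypothetical helper: t-based confidence interval for a mean
mean_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  margin <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(upper = mean(x) + margin, mean = mean(x), lower = mean(x) - margin)
}

# Demo on a small illustrative vector
mean_ci(1:10)
```

Applied to the housing data, the call would be mean_ci(train$TotalBsmtSF).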

Modeling

Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

# create the training dataset, limited to numeric variables
numeric_var <- names(train)[which(sapply(train, is.numeric))]
house.train <- train[numeric_var]

# create the test dataset, limited to numeric variables
numeric_var <- names(test)[which(sapply(test, is.numeric))]
house.test <- test[numeric_var]

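# replace missing values with 0 in both sets so that training and prediction
# use the same preprocessing
house.train[is.na(house.train)] <- 0
house.test[is.na(house.test)] <- 0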

# Use the train function from the caret package to build a Random Forest model
rfFit <- train(SalePrice ~ .,
              data=house.train,
              method="rf",
              trControl=trainControl(method="cv",number=5),
              prox=TRUE, importance = TRUE,
              allowParallel=TRUE)

# show the model summary          
rfFit

# display the variables determined to be the most relevant
dotPlot(varImp(rfFit), main = "Random Forest Model - Most Relevant Variables")

# predict              
pred_rf <- predict(rfFit, house.test)

# format the submission as a data frame with Id and SalePrice columns
submission <- data.frame(Id = test$Id, SalePrice = pred_rf)

dim(submission) # there should be 1459 rows
write.csv(submission, file = "Kaggle_Submission2.csv", quote=FALSE, row.names=FALSE)

Kaggle Results

| Username | Submission # | Score |
|---|---|---|
| folsom98 | 2 | 0.16427 |
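
The competition scores submissions by the root mean squared error between the logarithm of the predicted price and the logarithm of the actual sale price, so the score above is on a log scale rather than in dollars. A minimal sketch of that metric follows; the function name rmsle and the predicted values are illustrative, and the actual values are the first three SalePrice entries of the training data.

```r
# Root mean squared error on the log scale
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

actual_demo    <- c(208500, 181500, 223500)  # first three SalePrice values in train
predicted_demo <- c(200000, 190000, 223500)  # hypothetical predictions

rmsle(actual_demo, predicted_demo)
```

Because the error is computed on logarithms, a prediction that is off by 10% contributes about the same amount whether the house is cheap or expensive.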