Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

Load and Explore the Dataset

# Load Training Dataset
raw_data <- read.csv("Final_Project_Data_Files/train.csv")

# View and explore the data
dim(raw_data)

## [1] 1460   81

head(raw_data)

#summary(raw_data)
str(raw_data)

## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
##  $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
##  $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
##  $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
##  $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
##  $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
##  $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
##  $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
##  $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
##  $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
##  $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
##  $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
##  $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
##  $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
##  $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
##  $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

Looking through the structure of the data it seems that most of the variables are categorical with only a few numeric variables to work with. So let’s pair down the dataset to include only the numeric variables relevant to our project instructions.

Reduce the Variables to Numeric Data Only

We can quickly reduce the number of variables dramatically by simply filtering out only the numeric data types using dplyr’s select_if function…

numeric_data <- raw_data %>% select_if(is.numeric)
dim(numeric_data)

## [1] 1460   38

str(numeric_data)

## 'data.frame':    1460 obs. of  38 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

That filtered out 43 of our 81 original variables leaving us with 38 remaining variables to work with, however, looking at the structure of the remaining data we can see that a lof of the integer variables are really not numeric data at all. Variables like YrSold while technically integer values really should be treated like categorical values, so let’s manually filter out some more non-numeric data. Looking at what’s left it may be simpler to simply chose the columns we want rather than remove ones we don’t. I used the data descriptions on the kaggle website to select the variables I wanted to keep from the remaining 38. While I am at it I am filtering out a few that have NA’s as well…

numeric_data <- numeric_data[,c(1, 4, 7,8, 10:17, 28:34, 38)]
str(numeric_data)

## 'data.frame':    1460 obs. of  20 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

OK, now we are down to numeric columns only, with 19 predictor variables left, plus our ID column and outcome variable. Let’s plot histograms for all of them to see what we are working with…

numeric_data[, 2:20] %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

Quite a few of the remaining variables have extremely high counts of zeros. Let’s remove a few more variables…

df <- numeric_data[,-c(6,11,16:19)]
str(df)

## 'data.frame':    1460 obs. of  14 variables:
##  $ Id          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ LotArea     : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ YearBuilt   : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd: int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ BsmtFinSF1  : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtUnfSF   : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ X1stFlrSF   : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF   : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ GrLivArea   : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ GarageArea  : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ WoodDeckSF  : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ SalePrice   : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

Univariate Summary Statistics

summary(df)

##        Id            LotArea         YearBuilt     YearRemodAdd 
##  Min.   :   1.0   Min.   :  1300   Min.   :1872   Min.   :1950  
##  1st Qu.: 365.8   1st Qu.:  7554   1st Qu.:1954   1st Qu.:1967  
##  Median : 730.5   Median :  9478   Median :1973   Median :1994  
##  Mean   : 730.5   Mean   : 10517   Mean   :1971   Mean   :1985  
##  3rd Qu.:1095.2   3rd Qu.: 11602   3rd Qu.:2000   3rd Qu.:2004  
##  Max.   :1460.0   Max.   :215245   Max.   :2010   Max.   :2010  
##    BsmtFinSF1       BsmtUnfSF       TotalBsmtSF       X1stFlrSF   
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.0   Min.   : 334  
##  1st Qu.:   0.0   1st Qu.: 223.0   1st Qu.: 795.8   1st Qu.: 882  
##  Median : 383.5   Median : 477.5   Median : 991.5   Median :1087  
##  Mean   : 443.6   Mean   : 567.2   Mean   :1057.4   Mean   :1163  
##  3rd Qu.: 712.2   3rd Qu.: 808.0   3rd Qu.:1298.2   3rd Qu.:1391  
##  Max.   :5644.0   Max.   :2336.0   Max.   :6110.0   Max.   :4692  
##    X2ndFlrSF      GrLivArea      GarageArea       WoodDeckSF    
##  Min.   :   0   Min.   : 334   Min.   :   0.0   Min.   :  0.00  
##  1st Qu.:   0   1st Qu.:1130   1st Qu.: 334.5   1st Qu.:  0.00  
##  Median :   0   Median :1464   Median : 480.0   Median :  0.00  
##  Mean   : 347   Mean   :1515   Mean   : 473.0   Mean   : 94.24  
##  3rd Qu.: 728   3rd Qu.:1777   3rd Qu.: 576.0   3rd Qu.:168.00  
##  Max.   :2065   Max.   :5642   Max.   :1418.0   Max.   :857.00  
##   OpenPorchSF       SalePrice     
##  Min.   :  0.00   Min.   : 34900  
##  1st Qu.:  0.00   1st Qu.:129975  
##  Median : 25.00   Median :163000  
##  Mean   : 46.66   Mean   :180921  
##  3rd Qu.: 68.00   3rd Qu.:214000  
##  Max.   :547.00   Max.   :755000

Plots

df[, 2:14] %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

pairs(df[, -c(1,2,5,7,12,13)])

par(mfrow=c(4, 2))
plot(df$YearBuilt,df$SalePrice)
plot(df$YearRemodAdd,df$SalePrice)
plot(df$BsmtUnfSF,df$SalePrice)
plot(df$X1stFlrSF,df$SalePrice)
plot(df$X2ndFlrSF,df$SalePrice)
plot(df$GrLivArea,df$SalePrice)
plot(df$GarageArea,df$SalePrice)

Correlation Matrix

Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

corr_data <- df[, c("BsmtUnfSF", "X1stFlrSF", "SalePrice")]
corr_matrix <- round(cor(corr_data),2)
corr_matrix

##           BsmtUnfSF X1stFlrSF SalePrice
## BsmtUnfSF      1.00      0.32      0.21
## X1stFlrSF      0.32      1.00      0.61
## SalePrice      0.21      0.61      1.00

cor.test(corr_data$BsmtUnfSF, corr_data$X1stFlrSF, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  corr_data$BsmtUnfSF and corr_data$X1stFlrSF
## t = 12.807, df = 1458, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2874940 0.3478369
## sample estimates:
##       cor 
## 0.3179874

cor.test(corr_data$BsmtUnfSF, corr_data$SalePrice, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  corr_data$BsmtUnfSF and corr_data$SalePrice
## t = 8.3847, df = 1458, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1822292 0.2462680
## sample estimates:
##       cor 
## 0.2144791

cor.test(corr_data$X1stFlrSF, corr_data$SalePrice, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  corr_data$X1stFlrSF and corr_data$SalePrice
## t = 29.078, df = 1458, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5841687 0.6266715
## sample estimates:
##       cor 
## 0.6058522

In all 3 tests we have a very small p value, therefore, we can reject the the null hypothesis. The true correlation is not 0 for any of the three pairs of variables.

The formula to estimate the familywise error rate is:

\[FWE \leq 1 – (1 – alpha_{IT})^c\]

Where:

\(\alpha_{IT}\) = alpha level for an individual test (e.g. .05), \(c\) = Number of comparisons.

Source: https://www.statisticshowto.datasciencecentral.com/familywise-error-rate/

In our case…

FWE <- 1-((1-0.05)^3)
FWE

## [1] 0.142625

So the probability of a family-wise error is just over 14%.

This is definitelty a significant risk.

Linear Algebra and Correlation

Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

precision_matrix <- solve(corr_matrix)
precision_matrix

##             BsmtUnfSF  X1stFlrSF   SalePrice
## BsmtUnfSF  1.11451514 -0.3406203 -0.02626983
## X1stFlrSF -0.34062025  1.6967113 -0.96346364
## SalePrice -0.02626983 -0.9634636  1.59322948

round(corr_matrix %*% precision_matrix, 2)

##           BsmtUnfSF X1stFlrSF SalePrice
## BsmtUnfSF         1         0         0
## X1stFlrSF         0         1         0
## SalePrice         0         0         1

round(precision_matrix %*% corr_matrix, 2)

##           BsmtUnfSF X1stFlrSF SalePrice
## BsmtUnfSF         1         0         0
## X1stFlrSF         0         1         0
## SalePrice         0         0         1

lu.decomposition(corr_matrix)

## $L
##      [,1]      [,2] [,3]
## [1,] 1.00 0.0000000    0
## [2,] 0.32 1.0000000    0
## [3,] 0.21 0.6047237    1
## 
## $U
##      [,1]   [,2]     [,3]
## [1,]    1 0.3200 0.210000
## [2,]    0 0.8976 0.542800
## [3,]    0 0.0000 0.627656

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).

min(df$X1stFlrSF)

## [1] 334

hist(df$X1stFlrSF, breaks = 20)

library(MASS)
d <- fitdistr(df$X1stFlrSF, densfun = 'exponential')
lambda <- d$estimate
exp_dist <- rexp(1000, lambda)

Plot a histogram and compare it with a histogram of your original variable.

par(mfrow=c(1,2))
hist(df$X1stFlrSF, breaks = 20)
hist(exp_dist, breaks = 20)

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

round(quantile(exp_dist, c(.05, .95)), 3)

##       5%      95% 
##   69.290 3594.928

conf_int <- t.test(df$X1stFlrSF)
conf_int

## 
##  One Sample t-test
## 
## data:  df$X1stFlrSF
## t = 114.91, df = 1459, p-value < 0.00000000000000022
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1142.780 1182.473
## sample estimates:
## mean of x 
##  1162.627

round(quantile(df$X1stFlrSF, c(.05, .95)), 3)

##      5%     95% 
##  672.95 1831.25

The simulated data is not a good fit for the observed data in this case. The simulated exponential distribution is much more skewed than our original data. While the 95the percentile isnb’t that far off, the 5th percentile is very different in our observed data vs. our simulation.

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

# Standardize predictors
means <- sapply(df[,2:13],mean)
stdev <- sapply(df[,2:13],sd)
df.scaled <- as.data.frame(scale(df[,2:13], center=means, scale=stdev))
df.scaled$SalePrice <- df$SalePrice
df.scaled$Id <- df$Id
head(df.scaled)

attach(df.scaled)
mod_1 <- lm(SalePrice ~ LotArea + YearBuilt + YearRemodAdd + BsmtFinSF1 + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GrLivArea + GarageArea + WoodDeckSF + OpenPorchSF)
summary(mod_1)

## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + YearRemodAdd + 
##     BsmtFinSF1 + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + 
##     GrLivArea + GarageArea + WoodDeckSF + OpenPorchSF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -626565  -18103   -3396   14109  281540 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  180921.2     1070.4 169.030 < 0.0000000000000002 ***
## LotArea        3773.3     1155.6   3.265             0.001119 ** 
## YearBuilt     13846.4     1496.0   9.256 < 0.0000000000000002 ***
## YearRemodAdd  11793.7     1377.5   8.561 < 0.0000000000000002 ***
## BsmtFinSF1     7883.3     3204.7   2.460             0.014013 *  
## BsmtUnfSF      1408.7     3025.7   0.466             0.641591    
## TotalBsmtSF   10654.5     3469.5   3.071             0.002174 ** 
## X1stFlrSF     15002.4     8972.3   1.672             0.094725 .  
## X2ndFlrSF     16330.1     9986.8   1.635             0.102232    
## GrLivArea     15749.9    11843.8   1.330             0.183790    
## GarageArea    11974.5     1408.2   8.503 < 0.0000000000000002 ***
## WoodDeckSF     3831.2     1148.2   3.337             0.000869 ***
## OpenPorchSF     904.3     1158.8   0.780             0.435294    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40900 on 1447 degrees of freedom
## Multiple R-squared:  0.7371, Adjusted R-squared:  0.735 
## F-statistic: 338.2 on 12 and 1447 DF,  p-value: < 0.00000000000000022

Remove Highest P-values One by One

# Remove BsmtUnfSF
mod_2 <- lm(SalePrice ~ LotArea + YearBuilt + YearRemodAdd + BsmtFinSF1 + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GrLivArea + GarageArea + WoodDeckSF + OpenPorchSF)
summary(mod_2)

## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + YearRemodAdd + 
##     BsmtFinSF1 + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GrLivArea + 
##     GarageArea + WoodDeckSF + OpenPorchSF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -626191  -18050   -3432   14165  281717 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  180921.2     1070.1 169.076 < 0.0000000000000002 ***
## LotArea        3723.4     1150.3   3.237             0.001235 ** 
## YearBuilt     13867.3     1494.9   9.276 < 0.0000000000000002 ***
## YearRemodAdd  11829.7     1375.0   8.603 < 0.0000000000000002 ***
## BsmtFinSF1     6515.0     1277.2   5.101        0.00000038270 ***
## TotalBsmtSF   11957.2     2051.0   5.830        0.00000000683 ***
## X1stFlrSF     15016.2     8969.8   1.674             0.094331 .  
## X2ndFlrSF     16389.8     9983.3   1.642             0.100863    
## GrLivArea     15734.1    11840.5   1.329             0.184111    
## GarageArea    11994.6     1407.2   8.524 < 0.0000000000000002 ***
## WoodDeckSF     3788.6     1144.3   3.311             0.000952 ***
## OpenPorchSF     896.5     1158.4   0.774             0.439081    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40890 on 1448 degrees of freedom
## Multiple R-squared:  0.7371, Adjusted R-squared:  0.7351 
## F-statistic: 369.1 on 11 and 1448 DF,  p-value: < 0.00000000000000022

# Remove OpenPorchSF
mod_3 <- lm(SalePrice ~ LotArea + YearBuilt + YearRemodAdd + BsmtFinSF1 + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GrLivArea + GarageArea + WoodDeckSF)
summary(mod_3)

## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + YearRemodAdd + 
##     BsmtFinSF1 + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GrLivArea + 
##     GarageArea + WoodDeckSF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -625862  -18181   -3391   14450  280297 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)    180921       1070 169.099 < 0.0000000000000002 ***
## LotArea          3728       1150   3.242              0.00122 ** 
## YearBuilt       13894       1494   9.298 < 0.0000000000000002 ***
## YearRemodAdd    11921       1370   8.703 < 0.0000000000000002 ***
## BsmtFinSF1       6511       1277   5.099        0.00000038738 ***
## TotalBsmtSF     12122       2040   5.944        0.00000000349 ***
## X1stFlrSF       14905       8967   1.662              0.09671 .  
## X2ndFlrSF       16408       9982   1.644              0.10043    
## GrLivArea       15965      11835   1.349              0.17755    
## GarageArea      12040       1406   8.564 < 0.0000000000000002 ***
## WoodDeckSF       3735       1142   3.271              0.00110 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40880 on 1449 degrees of freedom
## Multiple R-squared:  0.737,  Adjusted R-squared:  0.7352 
## F-statistic:   406 on 10 and 1449 DF,  p-value: < 0.00000000000000022

# Remove GrLivArea
mod_4 <- lm(SalePrice ~ LotArea + YearBuilt + YearRemodAdd + BsmtFinSF1 + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GarageArea + WoodDeckSF)
summary(mod_4)

## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + YearRemodAdd + 
##     BsmtFinSF1 + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GarageArea + 
##     WoodDeckSF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -626418  -18241   -3365   14384  279706 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)    180921       1070 169.051 < 0.0000000000000002 ***
## LotArea          3713       1150   3.228              0.00128 ** 
## YearBuilt       13559       1474   9.199 < 0.0000000000000002 ***
## YearRemodAdd    11992       1369   8.759 < 0.0000000000000002 ***
## BsmtFinSF1       6447       1276   5.050        0.00000049678 ***
## TotalBsmtSF     12210       2039   5.988        0.00000000268 ***
## X1stFlrSF       26703       1980  13.489 < 0.0000000000000002 ***
## X2ndFlrSF       29780       1177  25.306 < 0.0000000000000002 ***
## GarageArea      12011       1406   8.543 < 0.0000000000000002 ***
## WoodDeckSF       3738       1142   3.272              0.00109 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40890 on 1450 degrees of freedom
## Multiple R-squared:  0.7367, Adjusted R-squared:  0.735 
## F-statistic: 450.7 on 9 and 1450 DF,  p-value: < 0.00000000000000022

Our 4th model has all very low p-values and a moderately OK \(R^2\) vale at 0.735. Let’s see how it does on the test data…

par(mfrow=c(2,2))
plot(mod_4)

Test Data

#Load the data and remove columns same as our training data
test_df <- read.csv("Final_Project_Data_Files/test.csv")
test_df <- test_df %>% select_if(is.numeric)
test_df <- test_df[,c(1, 4, 7,8, 10:17, 28:34)]
test_df <- test_df[,-c(6,11,16:19)]
str(test_df)

## 'data.frame':    1459 obs. of  13 variables:
##  $ Id          : int  1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 ...
##  $ LotArea     : int  11622 14267 13830 9978 5005 10000 7980 8402 10176 8400 ...
##  $ YearBuilt   : int  1961 1958 1997 1998 1992 1993 1992 1998 1990 1970 ...
##  $ YearRemodAdd: int  1961 1958 1998 1998 1992 1994 2007 1998 1990 1970 ...
##  $ BsmtFinSF1  : int  468 923 791 602 263 0 935 0 637 804 ...
##  $ BsmtUnfSF   : int  270 406 137 324 1017 763 233 789 663 0 ...
##  $ TotalBsmtSF : int  882 1329 928 926 1280 763 1168 789 1300 882 ...
##  $ X1stFlrSF   : int  896 1329 928 926 1280 763 1187 789 1341 882 ...
##  $ X2ndFlrSF   : int  0 0 701 678 0 892 0 676 0 0 ...
##  $ GrLivArea   : int  896 1329 1629 1604 1280 1655 1187 1465 1341 882 ...
##  $ GarageArea  : int  730 312 482 470 506 440 420 393 506 525 ...
##  $ WoodDeckSF  : int  140 393 212 360 0 157 483 0 192 240 ...
##  $ OpenPorchSF : int  0 36 34 36 82 84 21 75 0 0 ...

# Standardize test predictors
test.scaled <- as.data.frame(scale(test_df[,2:13], center=means, scale=stdev))
test.scaled$SalePrice <- test_df$SalePrice
test.scaled$Id <- test_df$Id
head(test.scaled)

Prediction

predictions <- predict(mod_4,newdata=test.scaled)
predictions <- data.frame(as.vector(predictions))
predictions$Id <- test.scaled$Id
predictions[,c(1,2)] <- predictions[,c(2,1)]
colnames(predictions) <- c("Id", "SalePrice")
predictions[is.na(predictions)] <- 0
head(predictions)

write.csv(predictions,'predictions.csv', row.names=FALSE)

Kaggle Username Betsy - Score 0.48848 - Rank 4556

Submission Screen Shot

Can we refine the model?

step <- stepAIC(mod_4, direction="both")

## Start:  AIC=31016.6
## SalePrice ~ LotArea + YearBuilt + YearRemodAdd + BsmtFinSF1 + 
##     TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GarageArea + WoodDeckSF
## 
##                Df     Sum of Sq           RSS   AIC
## <none>                          2424729637227 31017
## - LotArea       1   17422669941 2442152307168 31025
## - WoodDeckSF    1   17900888066 2442630525293 31025
## - BsmtFinSF1    1   42653084195 2467382721422 31040
## - TotalBsmtSF   1   59951999730 2484681636956 31050
## - GarageArea    1  122034744598 2546764381825 31086
## - YearRemodAdd  1  128299406158 2553029043384 31090
## - YearBuilt     1  141511271416 2566240908643 31097
## - X1stFlrSF     1  304262368638 2728992005865 31187
## - X2ndFlrSF     1 1070855373918 3495585011145 31549

step$anova # display results

Runnning the stepAIC function on my 4th model did not make any suggested changes. Let’s see what happens when we run it on my 1st model. Will it result in the same model that I did?

step <- stepAIC(mod_1, direction="both")

## Start:  AIC=31019.95
## SalePrice ~ LotArea + YearBuilt + YearRemodAdd + BsmtFinSF1 + 
##     BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GrLivArea + 
##     GarageArea + WoodDeckSF + OpenPorchSF
## 
##                Df    Sum of Sq           RSS   AIC
## - BsmtUnfSF     1    362560078 2420686897931 31018
## - OpenPorchSF   1   1018637817 2421342975670 31019
## - GrLivArea     1   2957906201 2423282244054 31020
## <none>                         2420324337853 31020
## - X2ndFlrSF     1   4472277292 2424796615145 31021
## - X1stFlrSF     1   4676465971 2425000803824 31021
## - BsmtFinSF1    1  10121600989 2430445938842 31024
## - TotalBsmtSF   1  15773485683 2436097823536 31027
## - LotArea       1  17834511388 2438158849240 31029
## - WoodDeckSF    1  18622453832 2438946791685 31029
## - GarageArea    1 120937577527 2541261915380 31089
## - YearRemodAdd  1 122602344966 2542926682819 31090
## - YearBuilt     1 143288508865 2563612846718 31102
## 
## Step:  AIC=31018.17
## SalePrice ~ LotArea + YearBuilt + YearRemodAdd + BsmtFinSF1 + 
##     TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GrLivArea + GarageArea + 
##     WoodDeckSF + OpenPorchSF
## 
##                Df    Sum of Sq           RSS   AIC
## - OpenPorchSF   1   1001396004 2421688293935 31017
## - GrLivArea     1   2951988078 2423638886009 31018
## <none>                         2420686897931 31018
## - X2ndFlrSF     1   4505829648 2425192727579 31019
## - X1stFlrSF     1   4685120327 2425372018258 31019
## + BsmtUnfSF     1    362560078 2420324337853 31020
## - LotArea       1  17516415877 2438203313808 31027
## - WoodDeckSF    1  18327053229 2439013951160 31027
## - BsmtFinSF1    1  43498052751 2464184950682 31042
## - TotalBsmtSF   1  56816634388 2477503532319 31050
## - GarageArea    1 121458021673 2542144919604 31088
## - YearRemodAdd  1 123741885115 2544428783046 31089
## - YearBuilt     1 143849437638 2564536335569 31100
## 
## Step:  AIC=31016.77
## SalePrice ~ LotArea + YearBuilt + YearRemodAdd + BsmtFinSF1 + 
##     TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GrLivArea + GarageArea + 
##     WoodDeckSF
## 
##                Df    Sum of Sq           RSS   AIC
## - GrLivArea     1   3041343292 2424729637227 31017
## <none>                         2421688293935 31017
## - X2ndFlrSF     1   4516085337 2426204379272 31018
## - X1stFlrSF     1   4616966343 2426305260277 31018
## + OpenPorchSF   1   1001396004 2420686897931 31018
## + BsmtUnfSF     1    345318264 2421342975670 31019
## - LotArea       1  17561372771 2439249666706 31025
## - WoodDeckSF    1  17879084127 2439567378061 31026
## - BsmtFinSF1    1  43445835797 2465134129731 31041
## - TotalBsmtSF   1  59038883842 2480727177777 31050
## - GarageArea    1 122584003311 2544272297246 31087
## - YearRemodAdd  1 126586953344 2548275247279 31089
## - YearBuilt     1 144487032146 2566175326080 31099
## 
## Step:  AIC=31016.6
## SalePrice ~ LotArea + YearBuilt + YearRemodAdd + BsmtFinSF1 + 
##     TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GarageArea + WoodDeckSF
## 
##                Df     Sum of Sq           RSS   AIC
## <none>                          2424729637227 31017
## + GrLivArea     1    3041343292 2421688293935 31017
## + OpenPorchSF   1    1090751218 2423638886009 31018
## + BsmtUnfSF     1     338715581 2424390921645 31018
## - LotArea       1   17422669941 2442152307168 31025
## - WoodDeckSF    1   17900888066 2442630525293 31025
## - BsmtFinSF1    1   42653084195 2467382721422 31040
## - TotalBsmtSF   1   59951999730 2484681636956 31050
## - GarageArea    1  122034744598 2546764381825 31086
## - YearRemodAdd  1  128299406158 2553029043384 31090
## - YearBuilt     1  141511271416 2566240908643 31097
## - X1stFlrSF     1  304262368638 2728992005865 31187
## - X2ndFlrSF     1 1070855373918 3495585011145 31549

step$anova # display results

Sure enough it made the exact same changes that I did…

Let’s try something else…

# All Subsets Regression
library(leaps)
leaps <- regsubsets(x=df.scaled[,1:12], y=df.scaled[,13],nbest=3)
# plot a table of models showing variables in each model.
# models are ordered by the selection statistic.
plot(leaps,scale="r2")

attach(df.scaled)
mod_5 <- lm(SalePrice ~ LotArea + YearBuilt + YearRemodAdd + BsmtFinSF1 + TotalBsmtSF + GrLivArea + GarageArea + WoodDeckSF)
summary(mod_5)

## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + YearRemodAdd + 
##     BsmtFinSF1 + TotalBsmtSF + GrLivArea + GarageArea + WoodDeckSF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -624979  -18200   -3497   14506  281830 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)    180921       1070 169.054 < 0.0000000000000002 ***
## LotArea          3795       1147   3.310             0.000956 ***
## YearBuilt       14232       1471   9.673 < 0.0000000000000002 ***
## YearRemodAdd    11869       1369   8.667 < 0.0000000000000002 ***
## BsmtFinSF1       6585       1276   5.162          0.000000279 ***
## TotalBsmtSF     12382       1472   8.413 < 0.0000000000000002 ***
## GrLivArea       35405       1322  26.776 < 0.0000000000000002 ***
## GarageArea      12184       1397   8.722 < 0.0000000000000002 ***
## WoodDeckSF       3762       1142   3.295             0.001009 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40890 on 1451 degrees of freedom
## Multiple R-squared:  0.7365, Adjusted R-squared:  0.735 
## F-statistic: 506.9 on 8 and 1451 DF,  p-value: < 0.00000000000000022

Mod_5 Predictions

predictions2 <- predict(mod_5,newdata=test.scaled)
predictions2 <- data.frame(as.vector(predictions2))
predictions2$Id <- test.scaled$Id
predictions2[,c(1,2)] <- predictions2[,c(2,1)]
colnames(predictions2) <- c("Id", "SalePrice")
predictions2[is.na(predictions2)] <- 0
head(predictions2)

write.csv(predictions2,'predictions_mod_5.csv', row.names=FALSE)

Kaggle results… Not really an improvement :-(