DATA 605 Final Exam

library(GGally)
library(MASS)
library(modelr)
library(tidyverse)
library(stats)
set.seed(210514)

Using R, generate a random variable \(X\) that has \(10,000\) random uniform numbers from \(1\) to \(N\), where \(N\) can be any number of your choosing greater than or equal to \(6\). Then generate a random variable \(Y\) that has \(10,000\) random normal numbers with a mean of \(\mu = \sigma = (N+1)/2\).

N <- 6
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1)/2, sd = (N + 1)/2)
summary(X)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.249   3.487   3.493   4.728   5.998

summary(Y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -11.677   1.151   3.480   3.491   5.869  16.004

Calculate as a minimum the below probabilities a through c. Assume the small letter \(x\) is estimated as the median of the \(X\) variable, and the small letter \(y\) is estimated as the first quartile of the \(Y\) variable. Interpret the meaning of all probabilities.

\(P(X>x|X>y)\)

\(P(X>x|X>y)\) is the number of elements of \(X\) greater than both \(x\) and \(y\), divided by the number of elements of \(X\) greater than \(y\).

\(P((X>x)|(X>y)) = \frac{P(X>\text{max}(x,y))}{P(X>y)}\).

x <- median(X)
y <- quantile(Y, 0.25, names = FALSE)

numerator <- length(X[X > max(x,y)])
denominator <- length(X[X > y])

ans_1a <- numerator/denominator
ans_1a

## [1] 0.5154108

\(P((X>x) \cap (Y>y))\)

\(P(x>x)\), the probability that \(X\) is greater than its median, is \(0.50\). \(P(Y>y)\), the probability that \(Y\) is greater than its first quartile, is \(0.75\). Since \(X\) and \(Y\) are independent, \(P((X>x) \cap (Y>y)) = P(X>x)P(Y>y) = 0.375\).

\(P((X<x) | (X>y))\)

\(P((X<x) | (X>y))\) is the number of elements of \(X\) between \(y\) and \(x\), divided by the number of elements of \(X\) greater than \(y\).

\(P((X<x)|(X>y)) = \frac{P(y<X<x)}{P(X>y)}\)

numerator <- length(X[y < X & X < x])
denominator <- length(X[X > y])

ans_1c <- numerator/denominator
ans_1c

## [1] 0.4845892

Investigate whether \(P((X>x) \cap (Y>y)) = P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.

The statement is true when \(X\) and \(Y\) are independent, which they are. As described in the response to b, \(P((X>x) \cap (Y>y)) = P(X>x)P(Y>y) = 0.375\). We can verify this by examining a table.

XY_df <- data.frame("X" = X, "Y" = Y)

mgl_prob_X_gt_x <- XY_df %>%
  filter(X > x) %>%
  nrow() / nrow(XY_df)

mgl_prob_Y_gt_y <- XY_df %>%
  filter(Y > y) %>%
  nrow() / nrow(XY_df)

jnt_prob_X_gt_x_and_Y_gt_y <- XY_df %>%
  filter((X > x) & (Y > y)) %>%
  nrow() / nrow(XY_df)

mgl_prob_X_gt_x * mgl_prob_Y_gt_y / jnt_prob_X_gt_x_and_Y_gt_y

## [1] 1.002674

The product of the marginal probabilities is approximately equal to the joint probability based on the table. The small discrepancy is the result of the finite number of samples from the joint distribution of \(X\) and \(Y\).

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Both Fisher’s Exact Test and the Chi Square Test are used on categorical data summarized in a contingency table.

To use these tests, I’ll build a contingency table that counts instances when each variable is greater than or less than its mean.

XY <- XY_df %>%
  filter((X > mean(X)) & (Y > mean(Y))) %>%
  nrow()

XnotY <- XY_df %>%
  filter((X > mean(X)) & (Y <= mean(Y))) %>%
  nrow()

notXY <- XY_df %>%
  filter((X <= mean(X)) & (Y > mean(Y))) %>%
  nrow()

notXnotY <- XY_df %>%
  filter((X <= mean(X)) & (Y <= mean(Y))) %>%
  nrow()

cty_tbl <- matrix(c(XY, XnotY, notXY, notXnotY), nrow = 2)
colnames(cty_tbl) = c("X>Xbar", "X<=Xbar")
rownames(cty_tbl) = c("Y>Ybar", "Y<=Ybar")
cty_tbl

##         X>Xbar X<=Xbar
## Y>Ybar    2463    2527
## Y<=Ybar   2524    2486

fisher <- fisher.test(cty_tbl)
print(fisher)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  cty_tbl
## p-value = 0.3078
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.8869234 1.0391208
## sample estimates:
## odds ratio 
##  0.9599796

For Fisher’s Exact Test, the null hypothesis is that variables are independent. The alternative hypothesis is that not all variables are independent. The \(p\) value for the above test is much greater than \(\alpha = 0.05\), so we fail to reject the null. Fisher’s Exact Test does not provide evidence that \(X\) and \(Y\) are not independent.

chisq <- chisq.test(cty_tbl)
print(chisq)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cty_tbl
## X-squared = 1.0011, df = 1, p-value = 0.3171

For the Chi Square Test, the null and alternative hypotheses are the same as for Fisher’s Exact Test above. The \(p\) value for the Chi Square test just conducted is much greater than \(\alpha = 0.05\), so we fail to reject the null. Neither Fisher’s Exact Test nor the Chi Square Test provide evidence that \(X\) and \(Y\) are not independent.

The independence of \(X\) and \(Y\) is also evident in a scatterplot, which shows no relationship between the variables.

ggplot(XY_df, aes(x = X, y = Y)) +
  geom_point()

Because Fisher’s Exact Test was designed with smaller datasets in mind, it’s not the most appropriate test for this data. The Chi Square Test is a more appropriate choice.

h1.dat <- read_csv("https://raw.githubusercontent.com/dmoscoe/SPS/main/houses_train.csv")

The data set contains 1460 observations on 81 variables.

str(h1.dat)

## spec_tbl_df [1,460 x 81] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id           : num [1:1460] 1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : num [1:1460] 60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr [1:1460] "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : num [1:1460] 65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : num [1:1460] 8450 9600 11250 9550 14260 ...
##  $ Street       : chr [1:1460] "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr [1:1460] NA NA NA NA ...
##  $ LotShape     : chr [1:1460] "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr [1:1460] "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr [1:1460] "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr [1:1460] "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr [1:1460] "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr [1:1460] "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr [1:1460] "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr [1:1460] "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr [1:1460] "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr [1:1460] "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : num [1:1460] 7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : num [1:1460] 5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : num [1:1460] 2003 1976 2001 1915 2000 ...
##  $ YearRemodAdd : num [1:1460] 2003 1976 2002 1970 2000 ...
##  $ RoofStyle    : chr [1:1460] "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr [1:1460] "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr [1:1460] "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr [1:1460] "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr [1:1460] "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : num [1:1460] 196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr [1:1460] "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr [1:1460] "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr [1:1460] "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr [1:1460] "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr [1:1460] "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr [1:1460] "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr [1:1460] "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : num [1:1460] 706 978 486 216 655 ...
##  $ BsmtFinType2 : chr [1:1460] "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : num [1:1460] 0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : num [1:1460] 150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : num [1:1460] 856 1262 920 756 1145 ...
##  $ Heating      : chr [1:1460] "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr [1:1460] "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr [1:1460] "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr [1:1460] "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ 1stFlrSF     : num [1:1460] 856 1262 920 961 1145 ...
##  $ 2ndFlrSF     : num [1:1460] 854 0 866 756 1053 ...
##  $ LowQualFinSF : num [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : num [1:1460] 1710 1262 1786 1717 2198 ...
##  $ BsmtFullBath : num [1:1460] 1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : num [1:1460] 0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : num [1:1460] 2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : num [1:1460] 1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : num [1:1460] 3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : num [1:1460] 1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr [1:1460] "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : num [1:1460] 8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr [1:1460] "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : num [1:1460] 0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr [1:1460] NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr [1:1460] "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : num [1:1460] 2003 1976 2001 1998 2000 ...
##  $ GarageFinish : chr [1:1460] "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : num [1:1460] 2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : num [1:1460] 548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr [1:1460] "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr [1:1460] "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr [1:1460] "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : num [1:1460] 0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : num [1:1460] 61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: num [1:1460] 0 0 0 272 0 0 0 228 205 0 ...
##  $ 3SsnPorch    : num [1:1460] 0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : num [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : num [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr [1:1460] NA NA NA NA ...
##  $ Fence        : chr [1:1460] NA NA NA NA ...
##  $ MiscFeature  : chr [1:1460] NA NA NA NA ...
##  $ MiscVal      : num [1:1460] 0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : num [1:1460] 2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : num [1:1460] 2008 2007 2008 2006 2008 ...
##  $ SaleType     : chr [1:1460] "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr [1:1460] "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : num [1:1460] 208500 181500 223500 140000 250000 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   MSSubClass = col_double(),
##   ..   MSZoning = col_character(),
##   ..   LotFrontage = col_double(),
##   ..   LotArea = col_double(),
##   ..   Street = col_character(),
##   ..   Alley = col_character(),
##   ..   LotShape = col_character(),
##   ..   LandContour = col_character(),
##   ..   Utilities = col_character(),
##   ..   LotConfig = col_character(),
##   ..   LandSlope = col_character(),
##   ..   Neighborhood = col_character(),
##   ..   Condition1 = col_character(),
##   ..   Condition2 = col_character(),
##   ..   BldgType = col_character(),
##   ..   HouseStyle = col_character(),
##   ..   OverallQual = col_double(),
##   ..   OverallCond = col_double(),
##   ..   YearBuilt = col_double(),
##   ..   YearRemodAdd = col_double(),
##   ..   RoofStyle = col_character(),
##   ..   RoofMatl = col_character(),
##   ..   Exterior1st = col_character(),
##   ..   Exterior2nd = col_character(),
##   ..   MasVnrType = col_character(),
##   ..   MasVnrArea = col_double(),
##   ..   ExterQual = col_character(),
##   ..   ExterCond = col_character(),
##   ..   Foundation = col_character(),
##   ..   BsmtQual = col_character(),
##   ..   BsmtCond = col_character(),
##   ..   BsmtExposure = col_character(),
##   ..   BsmtFinType1 = col_character(),
##   ..   BsmtFinSF1 = col_double(),
##   ..   BsmtFinType2 = col_character(),
##   ..   BsmtFinSF2 = col_double(),
##   ..   BsmtUnfSF = col_double(),
##   ..   TotalBsmtSF = col_double(),
##   ..   Heating = col_character(),
##   ..   HeatingQC = col_character(),
##   ..   CentralAir = col_character(),
##   ..   Electrical = col_character(),
##   ..   `1stFlrSF` = col_double(),
##   ..   `2ndFlrSF` = col_double(),
##   ..   LowQualFinSF = col_double(),
##   ..   GrLivArea = col_double(),
##   ..   BsmtFullBath = col_double(),
##   ..   BsmtHalfBath = col_double(),
##   ..   FullBath = col_double(),
##   ..   HalfBath = col_double(),
##   ..   BedroomAbvGr = col_double(),
##   ..   KitchenAbvGr = col_double(),
##   ..   KitchenQual = col_character(),
##   ..   TotRmsAbvGrd = col_double(),
##   ..   Functional = col_character(),
##   ..   Fireplaces = col_double(),
##   ..   FireplaceQu = col_character(),
##   ..   GarageType = col_character(),
##   ..   GarageYrBlt = col_double(),
##   ..   GarageFinish = col_character(),
##   ..   GarageCars = col_double(),
##   ..   GarageArea = col_double(),
##   ..   GarageQual = col_character(),
##   ..   GarageCond = col_character(),
##   ..   PavedDrive = col_character(),
##   ..   WoodDeckSF = col_double(),
##   ..   OpenPorchSF = col_double(),
##   ..   EnclosedPorch = col_double(),
##   ..   `3SsnPorch` = col_double(),
##   ..   ScreenPorch = col_double(),
##   ..   PoolArea = col_double(),
##   ..   PoolQC = col_character(),
##   ..   Fence = col_character(),
##   ..   MiscFeature = col_character(),
##   ..   MiscVal = col_double(),
##   ..   MoSold = col_double(),
##   ..   YrSold = col_double(),
##   ..   SaleType = col_character(),
##   ..   SaleCondition = col_character(),
##   ..   SalePrice = col_double()
##   .. )

The data set contains 1460 observations on 81 variables. Some variables, like Alley and LotFrontage, include missing values encoded as “NA”. There do not appear to be any alternate encodings for missing values.

Provide univariate descriptive statistics and appropriate plots for the training data set.

Rather than provide descriptive statistics and visualizations for all 81 variables, I will choose a few to study more closely. To identify numeric variables that might be significant in a predictive model, I’ll examine their correlations with the response variable SalePrice.

cor <- h1.dat %>%
  keep(is.numeric) %>%
  cor()

sort(cor[,38])

##  KitchenAbvGr EnclosedPorch    MSSubClass   OverallCond        YrSold 
##   -0.13590737   -0.12857796   -0.08428414   -0.07785589   -0.02892259 
##  LowQualFinSF            Id       MiscVal  BsmtHalfBath    BsmtFinSF2 
##   -0.02560613   -0.02191672   -0.02118958   -0.01684415   -0.01137812 
##     3SsnPorch        MoSold      PoolArea   ScreenPorch  BedroomAbvGr 
##    0.04458367    0.04643225    0.09240355    0.11144657    0.16821315 
##     BsmtUnfSF  BsmtFullBath       LotArea      HalfBath   OpenPorchSF 
##    0.21447911    0.22712223    0.26384335    0.28410768    0.31585623 
##      2ndFlrSF    WoodDeckSF    BsmtFinSF1    Fireplaces  YearRemodAdd 
##    0.31933380    0.32441344    0.38641981    0.46692884    0.50710097 
##     YearBuilt  TotRmsAbvGrd      FullBath      1stFlrSF   TotalBsmtSF 
##    0.52289733    0.53372316    0.56066376    0.60585218    0.61358055 
##    GarageArea    GarageCars     GrLivArea   OverallQual     SalePrice 
##    0.62343144    0.64040920    0.70862448    0.79098160    1.00000000

Numeric variables with potentially important relationships to SalePrice include OverallQual, GrLivArea, and KitchenAbvGr. I will also examine the response variable, SalePrice.

I’ll choose categorical variables that seem like they might be important based on my knowledge of housing costs.

cat <- h1.dat %>%
  discard(is.numeric) %>%
  colnames()
cat

##  [1] "MSZoning"      "Street"        "Alley"         "LotShape"     
##  [5] "LandContour"   "Utilities"     "LotConfig"     "LandSlope"    
##  [9] "Neighborhood"  "Condition1"    "Condition2"    "BldgType"     
## [13] "HouseStyle"    "RoofStyle"     "RoofMatl"      "Exterior1st"  
## [17] "Exterior2nd"   "MasVnrType"    "ExterQual"     "ExterCond"    
## [21] "Foundation"    "BsmtQual"      "BsmtCond"      "BsmtExposure" 
## [25] "BsmtFinType1"  "BsmtFinType2"  "Heating"       "HeatingQC"    
## [29] "CentralAir"    "Electrical"    "KitchenQual"   "Functional"   
## [33] "FireplaceQu"   "GarageType"    "GarageFinish"  "GarageQual"   
## [37] "GarageCond"    "PavedDrive"    "PoolQC"        "Fence"        
## [41] "MiscFeature"   "SaleType"      "SaleCondition"

I’ll examine LotShape, LandContour, and Neighborhood.

h1_subset.dat <- h1.dat %>%
  select("Id", "OverallQual", "GrLivArea", "KitchenAbvGr", "LotShape", "LandContour", "Neighborhood", "SalePrice")
summary(h1_subset.dat)

##        Id          OverallQual       GrLivArea     KitchenAbvGr  
##  Min.   :   1.0   Min.   : 1.000   Min.   : 334   Min.   :0.000  
##  1st Qu.: 365.8   1st Qu.: 5.000   1st Qu.:1130   1st Qu.:1.000  
##  Median : 730.5   Median : 6.000   Median :1464   Median :1.000  
##  Mean   : 730.5   Mean   : 6.099   Mean   :1515   Mean   :1.047  
##  3rd Qu.:1095.2   3rd Qu.: 7.000   3rd Qu.:1777   3rd Qu.:1.000  
##  Max.   :1460.0   Max.   :10.000   Max.   :5642   Max.   :3.000  
##    LotShape         LandContour        Neighborhood         SalePrice     
##  Length:1460        Length:1460        Length:1460        Min.   : 34900  
##  Class :character   Class :character   Class :character   1st Qu.:129975  
##  Mode  :character   Mode  :character   Mode  :character   Median :163000  
##                                                           Mean   :180921  
##                                                           3rd Qu.:214000  
##                                                           Max.   :755000

As expected, some of the variables appear significantly skewed right, especially GrLivArea, the above-ground living area in square feet, and SalePrice.

h1_subset.dat %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram for Id is not meaningful. As The distributions of SalePrice and GrLivArea are similar in shape, and their close relationship is supported by the idea that larger homes will, in general, cost more. There are few houses considered to be of low OverallQual in the data set. Almost all homes have 1 kitchen, but some have 2 or 3.

h1_subset.dat %>%
  group_by(LotShape) %>%
  summarise(n()) %>%
  arrange(desc(`n()`))

## # A tibble: 4 x 2
##   LotShape `n()`
##   <chr>    <int>
## 1 Reg        925
## 2 IR1        484
## 3 IR2         41
## 4 IR3         10

The majority of lots have a “general shape” that is regular. 484 lie on property that is “slightly irregular,” 41 on land that is “moderately irregular,” and 10 on “irregular” land.

h1_subset.dat %>%
  group_by(LandContour) %>%
  summarise(n()) %>%
  arrange(desc(`n()`))

## # A tibble: 4 x 2
##   LandContour `n()`
##   <chr>       <int>
## 1 Lvl          1311
## 2 Bnk            63
## 3 HLS            50
## 4 Low            36

The data description file tells us that Lvl means “Near Flat/Level, Bnk means”Banked-Quick and significant rise from street grade to building, HLS means “Hillside - Significant slope from side to side,” and Low means “Depression”.

h1_subset.dat %>%
  group_by(Neighborhood) %>%
  summarise(n()) %>%
  arrange(desc(`n()`))

## # A tibble: 25 x 2
##    Neighborhood `n()`
##    <chr>        <int>
##  1 NAmes          225
##  2 CollgCr        150
##  3 OldTown        113
##  4 Edwards        100
##  5 Somerst         86
##  6 Gilbert         79
##  7 NridgHt         77
##  8 Sawyer          74
##  9 NWAmes          73
## 10 SawyerW         59
## # ... with 15 more rows

Neighborhood names are generally self-explanatory. IDOTRR stands for “Iowa DOT and Rail Road”, and SWISU stands for “South & West of Iowa State University.” The data are distributed unequally, favoring a few neighborhoods like North Ames, College Creek, and Old Town.

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

h1_subset.dat %>%
  keep(is.numeric) %>%
  ggpairs()

Derive a correlation matrix for any three quantitative variables in the dataset.

corr_matrix <- cor(h1.dat[,c("GrLivArea", "OverallQual", "SalePrice")])
corr_matrix

##             GrLivArea OverallQual SalePrice
## GrLivArea   1.0000000   0.5930074 0.7086245
## OverallQual 0.5930074   1.0000000 0.7909816
## SalePrice   0.7086245   0.7909816 1.0000000

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.

cor.test(h1.dat$GrLivArea, h1.dat$SalePrice, method = "pearson", conf.level = 0.80)

## 
##  Pearson's product-moment correlation
## 
## data:  h1.dat$GrLivArea and h1.dat$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245

cor.test(h1.dat$SalePrice, h1.dat$OverallQual, method = "pearson", conf.level = 0.80)

## 
##  Pearson's product-moment correlation
## 
## data:  h1.dat$SalePrice and h1.dat$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7780752 0.8032204
## sample estimates:
##       cor 
## 0.7909816

cor.test(h1.dat$GrLivArea, h1.dat$OverallQual, method = "pearson", conf.level = 0.80)

## 
##  Pearson's product-moment correlation
## 
## data:  h1.dat$GrLivArea and h1.dat$OverallQual
## t = 28.121, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5708061 0.6143422
## sample estimates:
##       cor 
## 0.5930074

Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

All pairwise correlations tested are statistically significant and positive. In a case of familywise error, a type I error occurs due to the large number of hypothesis tests being conducted. In a dataset with 81 variables, as we have here, testing each pair of variables for significant correlation is likely to lead eventually to some familywise error. For only three tests of correlation, as above, familywise error is not yet an important consideration. But if I were to continue testing for significance, then I would need to employ some adjustment to control for the possibility of familywise error.

Invert your correlation matrix from above.

corr_matrix_inverse <- solve(corr_matrix)
corr_matrix_inverse

##              GrLivArea OverallQual SalePrice
## GrLivArea    2.0200794  -0.1753704 -1.292763
## OverallQual -0.1753704   2.6865350 -2.000728
## SalePrice   -1.2927630  -2.0007280  3.498623

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

Since the precision matrix and correlation matrices are inverses, their product in either direction will be the 3-by-3 identity matrix.

round(corr_matrix %*% corr_matrix_inverse, 4)

##             GrLivArea OverallQual SalePrice
## GrLivArea           1           0         0
## OverallQual         0           1         0
## SalePrice           0           0         1

round(corr_matrix_inverse %*% corr_matrix, 4)

##             GrLivArea OverallQual SalePrice
## GrLivArea           1           0         0
## OverallQual         0           1         0
## SalePrice           0           0         1

Conduct LU decomposition on the matrix.

LUFact <- function(A) {
  #Assumes A is a square matrix. Returns L and U, lower-triangular and upper-triangular factors of A.
  
  n <- nrow(A)
  
  #initialize U = [0], L = In
  
  L <- matrix(rep(0, length.out = n^2), nrow = n)
  U <- matrix(rep(0, length.out = n^2), nrow = n)

  for (k in seq(n)) {
    L[k,k] = 1
  }
  
  for (k in seq(n)) {
    U[k,k] = A[k,k]
    
    if (k < n) {
      
      for (i in seq(k + 1, n)) {
      L[i,k] = (A[i,k] / A[k,k])
      U[k,i] = A[k,i]
      }
    }
    
    for (i in seq(k + 1, n)) {
      
      if (k < n) {
        
        for (j in seq(k + 1, n)) {
          A[i,j] = A[i,j] - (L[i,k] * U[k,j])
        }
      }
    }
  }
  
  return(list("L" = L, "U" = U))
}

LUFact(corr_matrix)

## $L
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.5930074 1.0000000    0
## [3,] 0.7086245 0.5718616    1
## 
## $U
##      [,1]      [,2]      [,3]
## [1,]    1 0.5930074 0.7086245
## [2,]    0 0.6483422 0.3707620
## [3,]    0 0.0000000 0.2858268

Select a variable in the training dataset that is skewed to the right. Then load the MASS package and run fitdistr to fit an exponential probability density function.

The variable WoodDeckSF is skewed right.

hist(h1.dat$WoodDeckSF, breaks = 50)

exp_fit <- fitdistr(h1.dat$WoodDeckSF, densfun = "exponential")
str(exp_fit)

## List of 5
##  $ estimate: Named num 0.0106
##   ..- attr(*, "names")= chr "rate"
##  $ sd      : Named num 0.000278
##   ..- attr(*, "names")= chr "rate"
##  $ vcov    : num [1, 1] 7.71e-08
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr "rate"
##   .. ..$ : chr "rate"
##  $ n       : int 1460
##  $ loglik  : num -8097
##  - attr(*, "class")= chr "fitdistr"

Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value.

exp_fit$estimate

##      rate 
## 0.0106107

exp_sample <- rexp(1000, exp_fit$estimate)

The optimal value of \(\lambda\) for this distribution is 0.01061.

Plot a histogram and compare it with a histogram of your original variable.

par(mfrow = c(1,2))
hist(h1.dat$WoodDeckSF, breaks = 50, main = "WoodDeckSF (actual)", xlab = "Wood Deck Area (sqft)", ylab = "relative frequency", freq = FALSE)
hist(exp_sample, breaks = 50, main = "WoodDeckSF (sim)", xlab = "sim'd Wood Deck Area (sqft)", ylab = "relative frequency", freq = FALSE)

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

qexp(c(0.05, 0.95), rate = exp_fit$estimate)

## [1]   4.834112 282.331352

Generate a 95% confidence interval from the empirical data, assuming normality.

I think this means, “Find a 95% confidence interval for the mean of the empirical data, assuming normality.”

ttest <- t.test(h1.dat$WoodDeckSF, conf.level = 0.95)
ttest

## 
##  One Sample t-test
## 
## data:  h1.dat$WoodDeckSF
## t = 28.731, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   87.80998 100.67906
## sample estimates:
## mean of x 
##  94.24452

Provide the empirical 5th percentile and 95th percentile of the data. Discuss.

quantile(h1.dat$WoodDeckSF, c(0.05, 0.95))

##  5% 95% 
##   0 335

ggplot(data = h1.dat, aes(x = WoodDeckSF)) +
  geom_histogram() +
  geom_vline(xintercept = quantile(exp_sample, 0.05), color = "blue", size = 1) +
  geom_vline(xintercept = quantile(exp_sample, 0.95), color = "blue", size = 1) +
  geom_vline(xintercept = quantile(h1.dat$WoodDeckSF, 0.05), color = "red", size = 1) +
  geom_vline(xintercept = quantile(h1.dat$WoodDeckSF, 0.95), color = "red", size = 1) +
  geom_vline(xintercept = ttest$conf.int[1], color = "red", size = 1) +
  geom_vline(xintercept = ttest$conf.int[2], color = "red", size = 1) +
  xlim(-15, 600) +
  geom_vline(xintercept = mean(exp_sample), color = "blue", size = 1) +
  xlim(-15, 600) +
  labs(title = "Comparing Quantiles for WoodDeckSF and a Fitted Exponential")

Red lines show the 5th and 95th percentile of the empirical data, as well as the 95% confidence interval for the mean of the empirical data. Blue lines show the 5th and 95th percentile of the fitted exponential distribution, as well as the mean of the fitted exponential distribution. The mean of the simulated data lies within the confidence interval for the empirical data, and the 5th and 95th percentiles are similar. I would proceed with caution when using the fitted exponential distribution instead of the empirical data. But based on the closeness of the values in the histogram above, there may be some instances when using the simulated data is appropriate.

Import Data

model.dat <- read_csv("https://raw.githubusercontent.com/dmoscoe/SPS/main/houses_train.csv")
predict.dat <- read_csv("https://raw.githubusercontent.com/dmoscoe/SPS/main/houses_test.csv")

Introduction

This data set contains 79 variables that describe houses in Ames, IA. The goal is to build a model that uses this data to explain the Sale Price for each house.

Tidy

Much of the work of generating a model for this dataset amounts to identifying which data to exclude. I followed a few principles in choosing which variables to include or drop from the final model:

If a group of variables all describe the same aspect of the house, such as the garage, then it might be possible to summarize them with a smaller number of variables. These variables are also likely to show correlation with each other, which may be a reason to exclude one or more of them.
If almost all the entries for a variable are the same, then the variation in this variable can’t explain variation in Sale Price. Variables like this are good candidates for exclusion.
Sale Price decisions are made by people. That means whatever actual people value most in a home will be strong determinants of Sale Price. So my informal understanding of home sales is at least potentially relevant to determining a model for Sale Price.

I start by dropping Id, Utilities, RoofMatl. Id merely signifies row numbers for observations, and is not a determinant of Sale Price. Almost every entry contains the same values for Utilities and RoofMatl.

model.dat <- model.dat[,c(2:9,11:22,24:81)]

The dataset contains a large number of NA entries. Reviewing the data dictionary explains that these entries actually do contain information. Often, they indicate that a particular feature of a house is missing, or that a measurement is zero. Based on the information in the data dictionary, I replace all NAs with more descriptive entries. The resulting dataset has no NA entries.

Here I also transform some very rare entries in Exterior1st and BldgType to the modes for those variables.

model.dat$Alley[is.na(model.dat$Alley)] <- "none"
model.dat$FireplaceQu[is.na(model.dat$FireplaceQu)] <- "none"
model.dat$PoolQC[is.na(model.dat$PoolQC)] <- "none"
model.dat$Fence[is.na(model.dat$Fence)] <- "none"
model.dat$MiscFeature[is.na(model.dat$MiscFeature)] <- "none"
model.dat$BsmtQual[is.na(model.dat$BsmtQual)] <- "none"
model.dat$BsmtCond[is.na(model.dat$BsmtCond)] <- "none"
model.dat$BsmtExposure[is.na(model.dat$BsmtExposure)] <- "none"
model.dat$BsmtFinType1[is.na(model.dat$BsmtFinType1)] <- "none"
model.dat$BsmtFinType2[is.na(model.dat$BsmtFinType2)] <- "none"
model.dat$GarageType[is.na(model.dat$GarageType)] <- "none"
model.dat$GarageFinish[is.na(model.dat$GarageFinish)] <- "none"
model.dat$GarageQual[is.na(model.dat$GarageQual)] <- "none"
model.dat$GarageCond[is.na(model.dat$GarageCond)] <- "none"
model.dat$MasVnrArea[is.na(model.dat$MasVnrArea)] <- 0
model.dat$MasVnrType[is.na(model.dat$MasVnrType)] <- "none"
model.dat$Electrical[is.na(model.dat$Electrical)] <- "SBrkr"
model.dat$GarageYrBlt[is.na(model.dat$GarageYrBlt)] <- 2006 #Most common value
model.dat$Exterior1st[model.dat$Exterior1st %in% c("AsphShn","BrkComm","CBlock","ImStucc","Stone")] <- "VinylSd" #Most common value
model.dat$BldgType[!(model.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
model.dat$LotFrontage[is.na(model.dat$LotFrontage)] <- 0

Test/Train Split

Now that NAs have been resolved, it’s time to partition the data into a training set and a testing set. I include 75% of the data in the training set, and reserve 25% for testing.

training_rows <- sample(nrow(model.dat), round(0.75 * nrow(model.dat)), replace = FALSE)
model_train.dat <- model.dat[training_rows,]
model_test.dat <- model.dat[-training_rows,]

Default Linear Model

My plan for constructing a model is to use the stepAIC function from modelr. This function eliminates variables as long as the elimination results in a reduction in the AIC. The stepAIC model requires a complete linear model to begin its work.

model_train.lm <- lm(SalePrice ~ ., data = model_train.dat)

Because the output from stepAIC is so long, I set include = FALSE in the code chunk containing it. The output of the stepAIC function is summarized below.

summary(model_train.slm)

## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + LotArea + 
##     Street + LotConfig + Neighborhood + Condition1 + Condition2 + 
##     BldgType + OverallQual + OverallCond + YearBuilt + YearRemodAdd + 
##     Exterior1st + MasVnrType + MasVnrArea + ExterQual + BsmtQual + 
##     BsmtExposure + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + `1stFlrSF` + 
##     `2ndFlrSF` + BsmtFullBath + BedroomAbvGr + KitchenAbvGr + 
##     KitchenQual + TotRmsAbvGrd + Functional + Fireplaces + GarageArea + 
##     GarageQual + GarageCond + WoodDeckSF + ScreenPorch + PoolArea + 
##     PoolQC + MoSold + SaleCondition, data = model_train.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -174744   -9792       0    9367  174744 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -4.241e+06  2.741e+05 -15.469  < 2e-16 ***
## MSSubClass           -7.668e+01  4.612e+01  -1.663 0.096693 .  
## LotFrontage           4.735e+01  2.514e+01   1.883 0.059972 .  
## LotArea               4.548e-01  8.679e-02   5.240 1.97e-07 ***
## StreetPave            3.266e+04  1.365e+04   2.392 0.016945 *  
## LotConfigCulDSac      1.008e+04  3.561e+03   2.830 0.004754 ** 
## LotConfigFR2         -4.582e+03  4.834e+03  -0.948 0.343478    
## LotConfigFR3         -1.116e+04  1.308e+04  -0.853 0.393776    
## LotConfigInside       3.052e+02  2.048e+03   0.149 0.881575    
## NeighborhoodBlueste  -1.067e+04  2.512e+04  -0.425 0.671248    
## NeighborhoodBrDale   -1.302e+04  1.034e+04  -1.259 0.208299    
## NeighborhoodBrkSide  -4.435e+03  9.164e+03  -0.484 0.628519    
## NeighborhoodClearCr  -8.250e+03  9.500e+03  -0.868 0.385369    
## NeighborhoodCollgCr  -9.526e+03  7.516e+03  -1.267 0.205285    
## NeighborhoodCrawfor   9.984e+03  8.840e+03   1.129 0.259019    
## NeighborhoodEdwards  -2.028e+04  8.304e+03  -2.443 0.014756 *  
## NeighborhoodGilbert  -7.091e+03  8.016e+03  -0.885 0.376547    
## NeighborhoodIDOTRR   -1.759e+04  9.931e+03  -1.771 0.076896 .  
## NeighborhoodMeadowV  -2.754e+04  1.144e+04  -2.407 0.016286 *  
## NeighborhoodMitchel  -2.242e+04  8.670e+03  -2.586 0.009858 ** 
## NeighborhoodNAmes    -1.537e+04  8.075e+03  -1.904 0.057245 .  
## NeighborhoodNoRidge   2.970e+04  8.981e+03   3.307 0.000976 ***
## NeighborhoodNPkVill   5.294e+03  1.212e+04   0.437 0.662478    
## NeighborhoodNridgHt   1.884e+04  7.999e+03   2.356 0.018681 *  
## NeighborhoodNWAmes   -1.713e+04  8.406e+03  -2.038 0.041810 *  
## NeighborhoodOldTown  -1.663e+04  8.817e+03  -1.886 0.059568 .  
## NeighborhoodSawyer   -1.011e+04  8.570e+03  -1.180 0.238393    
## NeighborhoodSawyerW  -6.133e+03  8.244e+03  -0.744 0.457128    
## NeighborhoodSomerst   2.448e+03  7.888e+03   0.310 0.756387    
## NeighborhoodStoneBr   3.403e+04  8.873e+03   3.835 0.000134 ***
## NeighborhoodSWISU    -1.244e+04  1.026e+04  -1.212 0.225923    
## NeighborhoodTimber   -1.239e+04  8.408e+03  -1.474 0.140906    
## NeighborhoodVeenker   1.158e+03  1.108e+04   0.105 0.916771    
## Condition1Feedr       4.268e+03  5.663e+03   0.754 0.451282    
## Condition1Norm        1.192e+04  4.617e+03   2.581 0.010000 ** 
## Condition1PosA        3.993e+03  1.057e+04   0.378 0.705719    
## Condition1PosN        1.314e+04  8.231e+03   1.596 0.110704    
## Condition1RRAe       -1.427e+04  1.022e+04  -1.396 0.163108    
## Condition1RRAn        1.231e+04  7.333e+03   1.678 0.093577 .  
## Condition1RRNe        6.039e+01  1.801e+04   0.003 0.997324    
## Condition1RRNn       -3.038e+03  1.611e+04  -0.189 0.850429    
## Condition2Feedr      -2.485e+04  2.361e+04  -1.052 0.292879    
## Condition2Norm       -1.717e+04  1.919e+04  -0.895 0.371198    
## Condition2PosA        2.267e+04  3.157e+04   0.718 0.472968    
## Condition2PosN       -2.647e+05  2.743e+04  -9.651  < 2e-16 ***
## Condition2RRAe       -3.071e+04  3.143e+04  -0.977 0.328801    
## Condition2RRAn       -9.907e+03  3.090e+04  -0.321 0.748564    
## Condition2RRNn       -1.087e+04  3.143e+04  -0.346 0.729460    
## BldgTypeNot1Fam      -9.182e+03  5.505e+03  -1.668 0.095654 .  
## OverallQual           7.584e+03  1.108e+03   6.846 1.34e-11 ***
## OverallCond           4.787e+03  9.063e+02   5.282 1.58e-07 ***
## YearBuilt             3.160e+02  7.302e+01   4.328 1.66e-05 ***
## YearRemodAdd          9.271e+01  6.121e+01   1.515 0.130165    
## Exterior1stBrkFace    2.106e+04  7.840e+03   2.686 0.007361 ** 
## Exterior1stCemntBd    7.698e+03  8.184e+03   0.941 0.347166    
## Exterior1stHdBoard   -2.975e+03  6.946e+03  -0.428 0.668558    
## Exterior1stMetalSd    2.024e+03  6.717e+03   0.301 0.763167    
## Exterior1stPlywood   -5.656e+03  7.380e+03  -0.766 0.443673    
## Exterior1stStucco     7.383e+02  8.699e+03   0.085 0.932380    
## Exterior1stVinylSd   -1.510e+03  6.814e+03  -0.222 0.824631    
## Exterior1stWd Sdng   -8.877e+02  6.753e+03  -0.131 0.895438    
## Exterior1stWdShing   -2.515e+03  8.568e+03  -0.294 0.769181    
## MasVnrTypeBrkFace     1.142e+04  8.135e+03   1.404 0.160551    
## MasVnrTypenone        1.572e+04  1.244e+04   1.264 0.206484    
## MasVnrTypeNone        1.710e+04  8.259e+03   2.071 0.038634 *  
## MasVnrTypeStone       1.493e+04  8.648e+03   1.727 0.084569 .  
## MasVnrArea            3.004e+01  6.564e+00   4.577 5.34e-06 ***
## ExterQualFa          -2.094e+04  1.047e+04  -2.000 0.045823 *  
## ExterQualGd          -2.398e+04  5.544e+03  -4.326 1.67e-05 ***
## ExterQualTA          -2.555e+04  6.041e+03  -4.229 2.57e-05 ***
## BsmtQualFa           -1.829e+04  7.195e+03  -2.543 0.011155 *  
## BsmtQualGd           -2.154e+04  3.822e+03  -5.636 2.27e-08 ***
## BsmtQualnone          1.192e+03  2.530e+04   0.047 0.962428    
## BsmtQualTA           -2.174e+04  4.747e+03  -4.580 5.26e-06 ***
## BsmtExposureGd        1.945e+04  3.276e+03   5.935 4.07e-09 ***
## BsmtExposureMn       -1.636e+03  3.477e+03  -0.470 0.638129    
## BsmtExposureNo       -5.078e+03  2.389e+03  -2.126 0.033781 *  
## BsmtExposurenone     -1.178e+04  2.384e+04  -0.494 0.621246    
## BsmtFinSF1            3.582e+01  5.280e+00   6.785 2.00e-11 ***
## BsmtFinSF2            2.200e+01  6.693e+00   3.287 0.001049 ** 
## BsmtUnfSF             2.037e+01  4.974e+00   4.095 4.57e-05 ***
## `1stFlrSF`            4.584e+01  5.511e+00   8.317 3.00e-16 ***
## `2ndFlrSF`            5.300e+01  3.856e+00  13.744  < 2e-16 ***
## BsmtFullBath          3.915e+03  2.058e+03   1.902 0.057447 .  
## BedroomAbvGr         -3.741e+03  1.462e+03  -2.558 0.010663 *  
## KitchenAbvGr         -1.321e+04  5.164e+03  -2.558 0.010689 *  
## KitchenQualFa        -1.673e+04  7.043e+03  -2.375 0.017721 *  
## KitchenQualGd        -2.108e+04  3.971e+03  -5.309 1.37e-07 ***
## KitchenQualTA        -2.162e+04  4.550e+03  -4.752 2.32e-06 ***
## TotRmsAbvGrd          2.662e+03  1.074e+03   2.479 0.013332 *  
## FunctionalMaj2       -3.090e+03  1.513e+04  -0.204 0.838204    
## FunctionalMin1        7.353e+03  9.232e+03   0.797 0.425916    
## FunctionalMin2        9.049e+03  9.207e+03   0.983 0.325910    
## FunctionalMod         8.064e+03  9.978e+03   0.808 0.419199    
## FunctionalSev        -2.690e+04  2.693e+04  -0.999 0.318088    
## FunctionalTyp         2.088e+04  7.806e+03   2.676 0.007585 ** 
## Fireplaces            2.727e+03  1.530e+03   1.783 0.074921 .  
## GarageArea            2.533e+01  5.752e+00   4.403 1.18e-05 ***
## GarageQualFa         -1.696e+05  2.598e+04  -6.527 1.08e-10 ***
## GarageQualGd         -1.608e+05  2.710e+04  -5.932 4.15e-09 ***
## GarageQualnone       -2.900e+03  1.780e+04  -0.163 0.870613    
## GarageQualPo         -1.663e+05  3.264e+04  -5.096 4.16e-07 ***
## GarageQualTA         -1.663e+05  2.555e+04  -6.511 1.19e-10 ***
## GarageCondFa          1.602e+05  3.148e+04   5.089 4.31e-07 ***
## GarageCondGd          1.516e+05  3.238e+04   4.681 3.25e-06 ***
## GarageCondnone               NA         NA      NA       NA    
## GarageCondPo          1.562e+05  3.399e+04   4.595 4.89e-06 ***
## GarageCondTA          1.568e+05  3.076e+04   5.097 4.15e-07 ***
## WoodDeckSF            1.870e+01  6.555e+00   2.853 0.004429 ** 
## ScreenPorch           2.146e+01  1.376e+01   1.560 0.119097    
## PoolArea              6.661e+03  3.724e+02  17.885  < 2e-16 ***
## PoolQCFa             -8.557e+05  5.056e+04 -16.925  < 2e-16 ***
## PoolQCGd             -3.764e+05  2.536e+04 -14.839  < 2e-16 ***
## PoolQCnone            3.443e+06  2.018e+05  17.065  < 2e-16 ***
## MoSold               -4.978e+02  2.857e+02  -1.742 0.081795 .  
## SaleConditionAdjLand  2.529e+03  1.788e+04   0.141 0.887555    
## SaleConditionAlloca   8.402e+03  9.845e+03   0.853 0.393657    
## SaleConditionFamily  -1.074e+03  6.953e+03  -0.154 0.877298    
## SaleConditionNormal   6.943e+03  3.094e+03   2.244 0.025057 *  
## SaleConditionPartial  2.217e+04  4.391e+03   5.049 5.29e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23530 on 976 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9183 
## F-statistic: 105.2 on 118 and 976 DF,  p-value: < 2.2e-16

The stepAIC model results in a value for \(R^2\) so high that I suspect overfitting. Examining the model on the test set confirms this concern.

rsquare(model_train.slm, model_test.dat)

## Warning in predict.lm(model, data): prediction from a rank-deficient fit may be
## misleading

## [1] -0.05908811

To resolve the overfit, I need to remove variables from the model. The normal procedure of backward elimination would be relevant here, except that all variables in the model include at least one level with significance less than 0.05. Typically, this would mean the variable should remain in the model. My strategy for removing variables will rely on the principles described in the Tidy section. I’ll reserve only one variable for each attribute of the house (for example, multiple variables regarding the garage will be removed until only one about the garage remains). I’ll rely on some of my informal understanding of real estate sales. And I’ll try to reduce the number of factors of highly imbalanced categorical variables when possible.

Variable transformations

model_train.dat$BldgType[!(model_train.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
model_train.dat$HouseStyle[model_train.dat$HouseStyle %in% c("2.5Fin", "2.5Unf")] <- "2Story"
model_train.dat$BsmtFullBath[model_train.dat$BsmtFullBath > 1] <- 1
model_train.dat$HalfBath[model_train.dat$HalfBath > 1] <- 1
model_train.dat$BedroomAbvGr[model_train.dat$BedroomAbvGr > 4] <- 4
model_train.dat$Fireplaces[model_train.dat$Fireplaces > 1] <- 1
model_train.dat$SaleType[model_train.dat$SaleType %in% c("Con", "ConLD", "ConLI", "ConLW", "CWD", "Oth")] <- "WD"

model_test.dat$BldgType[!(model_test.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
model_test.dat$HouseStyle[model_test.dat$HouseStyle %in% c("2.5Fin", "2.5Unf")] <- "2Story"
model_test.dat$BsmtFullBath[model_test.dat$BsmtFullBath > 1] <- 1
model_test.dat$HalfBath[model_test.dat$HalfBath > 1] <- 1
model_test.dat$BedroomAbvGr[model_test.dat$BedroomAbvGr > 4] <- 4
model_test.dat$Fireplaces[model_test.dat$Fireplaces > 1] <- 1
model_test.dat$SaleType[model_test.dat$SaleType %in% c("Con", "ConLD", "ConLI", "ConLW", "CWD", "Oth")] <- "WD"

Updated model

model_train_210515.lm <- lm(SalePrice ~ LotFrontage + LotArea + LotConfig + Neighborhood + BldgType + HouseStyle + OverallQual + OverallCond + YearBuilt + YearRemodAdd + Exterior1st + BsmtFinSF1 + `1stFlrSF` + `2ndFlrSF` + BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenQual + Fireplaces + GarageArea + SaleCondition, data = model_train.dat)

summary(model_train_210515.lm)

## 
## Call:
## lm(formula = SalePrice ~ LotFrontage + LotArea + LotConfig + 
##     Neighborhood + BldgType + HouseStyle + OverallQual + OverallCond + 
##     YearBuilt + YearRemodAdd + Exterior1st + BsmtFinSF1 + `1stFlrSF` + 
##     `2ndFlrSF` + BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + 
##     KitchenQual + Fireplaces + GarageArea + SaleCondition, data = model_train.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -377481  -12473    -565   11460  241977 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -4.986e+05  2.171e+05  -2.297 0.021805 *  
## LotFrontage           5.335e+00  3.320e+01   0.161 0.872351    
## LotArea               4.139e-01  1.098e-01   3.770 0.000173 ***
## LotConfigCulDSac      1.571e+04  4.844e+03   3.242 0.001224 ** 
## LotConfigFR2         -8.683e+03  6.581e+03  -1.319 0.187311    
## LotConfigFR3         -2.728e+04  1.693e+04  -1.612 0.107324    
## LotConfigInside       2.006e+03  2.756e+03   0.728 0.466914    
## NeighborhoodBlueste  -2.828e+04  3.436e+04  -0.823 0.410699    
## NeighborhoodBrDale   -6.279e+03  1.393e+04  -0.451 0.652261    
## NeighborhoodBrkSide  -1.001e+04  1.200e+04  -0.834 0.404238    
## NeighborhoodClearCr  -9.505e+02  1.253e+04  -0.076 0.939564    
## NeighborhoodCollgCr  -9.869e+03  9.913e+03  -0.996 0.319696    
## NeighborhoodCrawfor   9.097e+03  1.159e+04   0.785 0.432817    
## NeighborhoodEdwards  -3.111e+04  1.079e+04  -2.882 0.004030 ** 
## NeighborhoodGilbert  -1.132e+04  1.050e+04  -1.078 0.281330    
## NeighborhoodIDOTRR   -2.410e+04  1.299e+04  -1.855 0.063882 .  
## NeighborhoodMeadowV  -2.517e+04  1.528e+04  -1.648 0.099727 .  
## NeighborhoodMitchel  -1.715e+04  1.139e+04  -1.505 0.132520    
## NeighborhoodNAmes    -2.182e+04  1.050e+04  -2.077 0.038031 *  
## NeighborhoodNoRidge   5.417e+04  1.161e+04   4.664 3.51e-06 ***
## NeighborhoodNPkVill  -1.536e+03  1.632e+04  -0.094 0.925038    
## NeighborhoodNridgHt   3.665e+04  1.040e+04   3.526 0.000441 ***
## NeighborhoodNWAmes   -2.340e+04  1.089e+04  -2.149 0.031856 *  
## NeighborhoodOldTown  -2.416e+04  1.145e+04  -2.109 0.035150 *  
## NeighborhoodSawyer   -2.220e+04  1.108e+04  -2.004 0.045345 *  
## NeighborhoodSawyerW  -9.877e+03  1.079e+04  -0.916 0.360011    
## NeighborhoodSomerst   4.431e+03  1.011e+04   0.438 0.661246    
## NeighborhoodStoneBr   4.165e+04  1.173e+04   3.551 0.000401 ***
## NeighborhoodSWISU    -2.292e+04  1.360e+04  -1.685 0.092311 .  
## NeighborhoodTimber    1.731e+03  1.120e+04   0.155 0.877213    
## NeighborhoodVeenker   9.588e+03  1.490e+04   0.644 0.520026    
## BldgTypeNot1Fam      -2.073e+04  3.523e+03  -5.885 5.37e-09 ***
## HouseStyle1.5Unf      8.444e+03  1.164e+04   0.725 0.468519    
## HouseStyle1Story      2.359e+04  5.303e+03   4.448 9.61e-06 ***
## HouseStyle2Story     -8.355e+03  4.714e+03  -1.772 0.076608 .  
## HouseStyleSFoyer      1.769e+04  8.201e+03   2.157 0.031221 *  
## HouseStyleSLvl        1.168e+04  6.577e+03   1.776 0.076090 .  
## OverallQual           1.354e+04  1.391e+03   9.738  < 2e-16 ***
## OverallCond           5.566e+03  1.186e+03   4.691 3.09e-06 ***
## YearBuilt             2.326e+02  9.175e+01   2.535 0.011400 *  
## YearRemodAdd          1.507e+01  8.267e+01   0.182 0.855408    
## Exterior1stBrkFace    1.042e+04  1.039e+04   1.003 0.316225    
## Exterior1stCemntBd   -1.593e+03  1.081e+04  -0.147 0.882898    
## Exterior1stHdBoard   -1.091e+04  9.239e+03  -1.181 0.237771    
## Exterior1stMetalSd   -3.749e+03  8.874e+03  -0.423 0.672730    
## Exterior1stPlywood   -9.560e+03  9.819e+03  -0.974 0.330472    
## Exterior1stStucco    -3.060e+04  1.144e+04  -2.674 0.007616 ** 
## Exterior1stVinylSd   -6.880e+03  9.057e+03  -0.760 0.447680    
## Exterior1stWd Sdng   -3.083e+03  8.938e+03  -0.345 0.730181    
## Exterior1stWdShing   -8.416e+03  1.146e+04  -0.735 0.462768    
## BsmtFinSF1            1.052e+01  3.218e+00   3.269 0.001117 ** 
## `1stFlrSF`            4.402e+01  4.920e+00   8.947  < 2e-16 ***
## `2ndFlrSF`            6.543e+01  6.827e+00   9.583  < 2e-16 ***
## BsmtFullBath          1.353e+04  2.713e+03   4.988 7.16e-07 ***
## FullBath              9.908e+03  3.314e+03   2.989 0.002861 ** 
## HalfBath              5.066e+03  3.270e+03   1.549 0.121610    
## BedroomAbvGr         -1.114e+03  1.894e+03  -0.588 0.556722    
## KitchenQualFa        -3.601e+04  9.293e+03  -3.875 0.000113 ***
## KitchenQualGd        -4.129e+04  4.919e+03  -8.394  < 2e-16 ***
## KitchenQualTA        -4.116e+04  5.805e+03  -7.091 2.48e-12 ***
## Fireplaces            5.411e+03  2.541e+03   2.130 0.033413 *  
## GarageArea            1.937e+01  6.627e+00   2.923 0.003547 ** 
## SaleConditionAdjLand  2.284e+04  2.451e+04   0.932 0.351565    
## SaleConditionAlloca   1.402e+04  1.285e+04   1.091 0.275664    
## SaleConditionFamily   1.519e+03  9.514e+03   0.160 0.873183    
## SaleConditionNormal   7.447e+03  4.128e+03   1.804 0.071516 .  
## SaleConditionPartial  2.155e+04  5.824e+03   3.700 0.000227 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32700 on 1028 degrees of freedom
## Multiple R-squared:  0.8517, Adjusted R-squared:  0.8422 
## F-statistic: 89.45 on 66 and 1028 DF,  p-value: < 2.2e-16

rsquare(model_train_210515.lm, model_test.dat)

## [1] 0.8757467

The simplified model performs much better on the test data.

Diagnostics

ggplot(model_train.slm, aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line()

The QQ plot indicates that there are some important deviations from the linear model at the tails of the Sale Price distribution. One avenue to pursue if I were to continue this investigation might be to have separate models for predicting low, medium, and high-price houses. The first step of the model would be to predict the range of the house price, and then a separate multiple regression model for each group would be used to predict the final price.

ggplot(data = model_train.slm, aes(x = .fitted, y = .resid)) +
geom_point()

Again, we see problems at the higher values of SalePrice. The model systematically underestimates the prices of the most expensive houses in Ames.

Generate predictions for submission

predict.dat <- predict.dat %>%
  dplyr::select(LotFrontage, LotArea, LotConfig, Neighborhood, BldgType, HouseStyle, OverallQual, OverallCond, YearBuilt, YearRemodAdd, Exterior1st, BsmtFinSF1, `1stFlrSF`, `2ndFlrSF`, BsmtFullBath, FullBath, HalfBath, BedroomAbvGr, KitchenQual, Fireplaces, GarageArea, SaleType, SaleCondition)

predict.dat$LotFrontage[is.na(predict.dat$LotFrontage)] <- 0

predict.dat$Exterior1st[predict.dat$Exterior1st %in% c("AsphShn","BrkComm","CBlock","ImStucc","Stone")] <- "VinylSd"

predict.dat$Exterior1st[is.na(predict.dat$Exterior1st)] <- "VinylSd"

predict.dat$BsmtFinSF1[is.na(predict.dat$BsmtFinSF1)] <- 0
predict.dat$BsmtFullBath[is.na(predict.dat$BsmtFullBath)] <- 0
predict.dat$KitchenQual[is.na(predict.dat$KitchenQual)] <- "TA"
predict.dat$GarageArea[is.na(predict.dat$GarageArea)] <- 0
predict.dat$SaleType[is.na(predict.dat$SaleType)] <- "WD"
sum(is.na(predict.dat))

## [1] 0

predict.dat$BldgType[!(predict.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
predict.dat$HouseStyle[predict.dat$HouseStyle %in% c("2.5Fin", "2.5Unf")] <- "2Story"
predict.dat$BsmtFullBath[predict.dat$BsmtFullBath > 1] <- 1
predict.dat$HalfBath[predict.dat$HalfBath > 1] <- 1
predict.dat$BedroomAbvGr[predict.dat$BedroomAbvGr > 4] <- 4
predict.dat$Fireplaces[predict.dat$Fireplaces > 1] <- 1
predict.dat$SaleType[predict.dat$SaleType %in% c("Con", "ConLD", "ConLI", "ConLW", "CWD", "Oth")] <- "WD"

preds <- predict.lm(model_train_210515.lm, predict.dat, type = "response")
submit <- data.frame("Id" = seq(from = 1461, to = 2919), "SalePrice" = preds)
write_csv(submit, "C:/Users/dmosc/OneDrive/Desktop/y.csv")

Kaggle results

This model earned a score of 0.15828. My user name is Daniel Moscoe.