Problem 1

We pick the \(X3\) column vector for the independent variable, and the \(Y2\) column vector for the dependent variable. We assume \(x\) is estimated at \(X3\)’s third quartile and \(y\) at \(Y2\)’s first quartile.

# Choose Y2 for dependent variable
Y <- c(20.8, 14.6, 18, 7.3, 19.4, 13.5, 14.7, 15.3, 12.6, 13, 13.1, 10.3, 14.9, 14.8, 16.2, 15.7, 16.3, 11.5, 12.2, 11.8)

# Choose X3 for independent variable
X <- c(9.5, 3.7, 11.7, 7.4, 5.3, 7.4, 7.4, 8.6, 9.1, 11.4, 8.4, 7.3, 11.3, 4.4, 9.3, 10.9, 10.9, 7.7, 7.7, 11.5)

# Compile Y2 and X3 in a dataframe
prob_data <- data.frame(Y, X)

# Review a summary of the distribution
summary(prob_data)
##        Y               X         
##  Min.   : 7.30   Min.   : 3.700  
##  1st Qu.:12.50   1st Qu.: 7.400  
##  Median :14.65   Median : 8.500  
##  Mean   :14.30   Mean   : 8.545  
##  3rd Qu.:15.82   3rd Qu.:10.900  
##  Max.   :20.80   Max.   :11.700

\(X\)’s third quartile value of \(10.9\) appears twice within the distribution - to avoid excluding those observations from the probability calculations, we’ll treat the quartile inequalities as \(\leq\) and \(>\).

# Designate quantiles for independent and dependent variables
X_threshold1 <- quantile(X)[[4]]
X_threshold2 <- quantile(X)[[2]]
Y_threshold <- quantile(Y)[[2]]

# Calculate all the cases
ALLXY <- prob_data %>% tally()
XQ13 <- prob_data %>% filter(X <= X_threshold1) %>% tally()
XQ24 <- prob_data %>% filter(X > X_threshold2) %>% tally()
XQ4 <- prob_data %>% filter(X > X_threshold1) %>% tally()
YQ1 <- prob_data %>% filter(Y <= Y_threshold) %>% tally()
YQ24 <- prob_data %>% filter(Y > Y_threshold) %>% tally()
XQ13_YQ1 <- prob_data %>% filter(X <= X_threshold1 & Y <= Y_threshold) %>% tally()
XQ4_YQ1 <- prob_data %>% filter(X > X_threshold1 & Y <= Y_threshold) %>% tally()
XQ13_YQ24 <- prob_data %>% filter(X <= X_threshold1 & Y > Y_threshold) %>% tally()
XQ4_YQ24 <- prob_data %>% filter(X > X_threshold1 & Y > Y_threshold) %>% tally()
XQ24_YQ24 <- prob_data %>% filter(X > X_threshold2 & Y > Y_threshold) %>% tally()

Probability

We calculate the requested probabilities (a short sketch of the underlying arithmetic follows the list):

  1. \(P(X>x | Y>y)\)
## Y is greater than the first quartile in 15 out of 20 cases.
## 
## X is greater than the third quartile in 4 out of 20 cases.
## 
## The probability that X is greater than the third quartile conditioned on Y being greater than the first quartile is 0.2.
  2. \(P(X>x, Y>y)\)
## Y is greater than the first quartile in 15 out of 20 cases.
## 
## X is greater than the third quartile in 4 out of 20 cases.
## 
## The probability that X is greater than the third quartile and that Y is greater than the first quartile is 0.15.
  3. \(P(X<x | Y>y)\)
## Y is greater than the first quartile in 15 out of 20 cases.
## 
## X is less than the third quartile in 16 out of 20 cases.
## 
## The probability that X is less than the third quartile conditioned on Y being greater than the first quartile is 0.8.
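
For reference, a minimal sketch of how these probabilities follow from the tallies above (each tally object is a one-row tibble whose count column is named n):

XQ4_YQ24$n / YQ24$n    # P(X > x | Y > y):  3 / 15 = 0.2
XQ4_YQ24$n / ALLXY$n   # P(X > x, Y > y):   3 / 20 = 0.15
XQ13_YQ24$n / YQ24$n   # P(X <= x | Y > y): 12 / 15 = 0.8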

We put together a contingency table to summarize the counts.

# Build table with totals
contingency_table <- matrix(c(
  XQ13_YQ1, XQ4_YQ1, YQ1,
  XQ13_YQ24, XQ4_YQ24, YQ24,
  XQ13, XQ4, ALLXY),
  nrow = 3, ncol = 3, byrow = TRUE)
colnames(contingency_table) <- c('<= 3d quartile', '>3d quartile', 'Total')
rownames(contingency_table) = c('<= 1st quartile', '> 1st quartile', 'Total')

contingency_table
##                 <= 3d quartile >3d quartile Total
## <= 1st quartile 4              1            5    
## > 1st quartile  12             3            15   
## Total           16             4            20

Splitting the training data this way does not make them independent. Two random variables are independent if the probability distribution of one is not affected by the presence of the other. The sampling method described - letting \(A\) be the new variable counting those observations above the 1st quartile for \(X\), and \(B\) be the new variable counting those observations above the 1st quartile for \(Y\) - deliberately selects a specific segment of the distribution. Since this selection is non-random, it is unlikely to result in independence.

We compute the probabilities \(P(A)\), \(P(B)\), \(P(A) \times P(B)\), and \(P(AB)\).

## P(A) is 0.65.
## 
## P(B) is 0.75.
## 
## P(A) X P(B) is 0.4875.
## 
## P(AB) is 0.5.

\(P(AB)\) and \(P(A) \times P(B)\) are not equivalent.
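
A minimal sketch of how these quantities can be computed from prob_data and the quantile thresholds defined earlier:

p_A <- mean(prob_data$X > X_threshold2)                        # P(A) = 0.65
p_B <- mean(prob_data$Y > Y_threshold)                         # P(B) = 0.75
p_A * p_B                                                      # P(A) x P(B) = 0.4875
mean(prob_data$X > X_threshold2 & prob_data$Y > Y_threshold)   # P(AB) = 0.5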

We check for association with a \(\chi^2\) test, treating the quartile associations as two factors with distinct categories.

## 
##  Pearson's Chi-squared test
## 
## data:  XQ24_vec
## X-squared = 2.7217, df = 12, p-value = 0.9972

The p-value is much larger than the .05 threshold (nearly 1). Accordingly, we cannot reject the null hypothesis that there is no association between the two. This is consistent with the intuition that, because the split is not random, the variables are unlikely to be independent.
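
The exact call that produced the output above is not shown; a hypothetical alternative would be to run chisq.test() directly on the 2x2 quartile counts from the contingency table (the degrees of freedom would then be 1 rather than 12, and R will warn that the approximation may be inaccurate because some expected counts are small):

# Hypothetical alternative: chi-squared test on the 2x2 quartile counts
quartile_counts <- matrix(c(4, 1, 12, 3), nrow = 2, byrow = TRUE)
chisq.test(quartile_counts)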


Problem 2

We load in the Ames housing dataset from Kaggle’s House Prices: Advanced Regression Techniques competition.

# Load the train and test CSVs manually 
train_path <- file.path(getwd(), "train.csv")
train_df <- read_csv(file = train_path, col_names = TRUE)
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Id = col_integer(),
##   MSSubClass = col_integer(),
##   LotFrontage = col_integer(),
##   LotArea = col_integer(),
##   OverallQual = col_integer(),
##   OverallCond = col_integer(),
##   YearBuilt = col_integer(),
##   YearRemodAdd = col_integer(),
##   MasVnrArea = col_integer(),
##   BsmtFinSF1 = col_integer(),
##   BsmtFinSF2 = col_integer(),
##   BsmtUnfSF = col_integer(),
##   TotalBsmtSF = col_integer(),
##   `1stFlrSF` = col_integer(),
##   `2ndFlrSF` = col_integer(),
##   LowQualFinSF = col_integer(),
##   GrLivArea = col_integer(),
##   BsmtFullBath = col_integer(),
##   BsmtHalfBath = col_integer(),
##   FullBath = col_integer()
##   # ... with 18 more columns
## )
## See spec(...) for full column specifications.
test_path <- file.path(getwd(), "test.csv")
test_df <- read_csv(file = test_path, col_names = TRUE)
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Id = col_integer(),
##   MSSubClass = col_integer(),
##   LotFrontage = col_integer(),
##   LotArea = col_integer(),
##   OverallQual = col_integer(),
##   OverallCond = col_integer(),
##   YearBuilt = col_integer(),
##   YearRemodAdd = col_integer(),
##   MasVnrArea = col_integer(),
##   BsmtFinSF1 = col_integer(),
##   BsmtFinSF2 = col_integer(),
##   BsmtUnfSF = col_integer(),
##   TotalBsmtSF = col_integer(),
##   `1stFlrSF` = col_integer(),
##   `2ndFlrSF` = col_integer(),
##   LowQualFinSF = col_integer(),
##   GrLivArea = col_integer(),
##   BsmtFullBath = col_integer(),
##   BsmtHalfBath = col_integer(),
##   FullBath = col_integer()
##   # ... with 17 more columns
## )
## See spec(...) for full column specifications.
head(train_df)
## # A tibble: 6 x 81
##      Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
##   <int>      <int> <chr>          <int>   <int> <chr>  <chr> <chr>   
## 1     1         60 RL                65    8450 Pave   <NA>  Reg     
## 2     2         20 RL                80    9600 Pave   <NA>  Reg     
## 3     3         60 RL                68   11250 Pave   <NA>  IR1     
## 4     4         70 RL                60    9550 Pave   <NA>  IR1     
## 5     5         60 RL                84   14260 Pave   <NA>  IR1     
## 6     6         50 RL                85   14115 Pave   <NA>  IR1     
## # ... with 73 more variables: LandContour <chr>, Utilities <chr>,
## #   LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>,
## #   Condition1 <chr>, Condition2 <chr>, BldgType <chr>, HouseStyle <chr>,
## #   OverallQual <int>, OverallCond <int>, YearBuilt <int>,
## #   YearRemodAdd <int>, RoofStyle <chr>, RoofMatl <chr>,
## #   Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>,
## #   MasVnrArea <int>, ExterQual <chr>, ExterCond <chr>, Foundation <chr>,
## #   BsmtQual <chr>, BsmtCond <chr>, BsmtExposure <chr>,
## #   BsmtFinType1 <chr>, BsmtFinSF1 <int>, BsmtFinType2 <chr>,
## #   BsmtFinSF2 <int>, BsmtUnfSF <int>, TotalBsmtSF <int>, Heating <chr>,
## #   HeatingQC <chr>, CentralAir <chr>, Electrical <chr>, `1stFlrSF` <int>,
## #   `2ndFlrSF` <int>, LowQualFinSF <int>, GrLivArea <int>,
## #   BsmtFullBath <int>, BsmtHalfBath <int>, FullBath <int>,
## #   HalfBath <int>, BedroomAbvGr <int>, KitchenAbvGr <int>,
## #   KitchenQual <chr>, TotRmsAbvGrd <int>, Functional <chr>,
## #   Fireplaces <int>, FireplaceQu <chr>, GarageType <chr>,
## #   GarageYrBlt <int>, GarageFinish <chr>, GarageCars <int>,
## #   GarageArea <int>, GarageQual <chr>, GarageCond <chr>,
## #   PavedDrive <chr>, WoodDeckSF <int>, OpenPorchSF <int>,
## #   EnclosedPorch <int>, `3SsnPorch` <int>, ScreenPorch <int>,
## #   PoolArea <int>, PoolQC <chr>, Fence <chr>, MiscFeature <chr>,
## #   MiscVal <int>, MoSold <int>, YrSold <int>, SaleType <chr>,
## #   SaleCondition <chr>, SalePrice <int>
head(test_df)
## # A tibble: 6 x 80
##      Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
##   <int>      <int> <chr>          <int>   <int> <chr>  <chr> <chr>   
## 1  1461         20 RH                80   11622 Pave   <NA>  Reg     
## 2  1462         20 RL                81   14267 Pave   <NA>  IR1     
## 3  1463         60 RL                74   13830 Pave   <NA>  IR1     
## 4  1464         60 RL                78    9978 Pave   <NA>  IR1     
## 5  1465        120 RL                43    5005 Pave   <NA>  IR1     
## 6  1466         60 RL                75   10000 Pave   <NA>  IR1     
## # ... with 72 more variables: LandContour <chr>, Utilities <chr>,
## #   LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>,
## #   Condition1 <chr>, Condition2 <chr>, BldgType <chr>, HouseStyle <chr>,
## #   OverallQual <int>, OverallCond <int>, YearBuilt <int>,
## #   YearRemodAdd <int>, RoofStyle <chr>, RoofMatl <chr>,
## #   Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>,
## #   MasVnrArea <int>, ExterQual <chr>, ExterCond <chr>, Foundation <chr>,
## #   BsmtQual <chr>, BsmtCond <chr>, BsmtExposure <chr>,
## #   BsmtFinType1 <chr>, BsmtFinSF1 <int>, BsmtFinType2 <chr>,
## #   BsmtFinSF2 <int>, BsmtUnfSF <int>, TotalBsmtSF <int>, Heating <chr>,
## #   HeatingQC <chr>, CentralAir <chr>, Electrical <chr>, `1stFlrSF` <int>,
## #   `2ndFlrSF` <int>, LowQualFinSF <int>, GrLivArea <int>,
## #   BsmtFullBath <int>, BsmtHalfBath <int>, FullBath <int>,
## #   HalfBath <int>, BedroomAbvGr <int>, KitchenAbvGr <int>,
## #   KitchenQual <chr>, TotRmsAbvGrd <int>, Functional <chr>,
## #   Fireplaces <int>, FireplaceQu <chr>, GarageType <chr>,
## #   GarageYrBlt <int>, GarageFinish <chr>, GarageCars <int>,
## #   GarageArea <int>, GarageQual <chr>, GarageCond <chr>,
## #   PavedDrive <chr>, WoodDeckSF <int>, OpenPorchSF <int>,
## #   EnclosedPorch <int>, `3SsnPorch` <int>, ScreenPorch <int>,
## #   PoolArea <int>, PoolQC <chr>, Fence <chr>, MiscFeature <chr>,
## #   MiscVal <int>, MoSold <int>, YrSold <int>, SaleType <chr>,
## #   SaleCondition <chr>

Descriptive and Inferential Statistics

First, we summarize the distribution of the dependent variable in this housing dataset, the sales prices of homes, labelled SalePrice.

# Check for missing values
is.na(train_df$SalePrice) %>% sum()
## [1] 0
# Summarize descriptive statistics for SalePrice
train_df$SalePrice %>% describe()
##    vars    n     mean      sd median  trimmed     mad   min    max  range
## X1    1 1460 180921.2 79442.5 163000 170783.3 56338.8 34900 755000 720100
##    skew kurtosis      se
## X1 1.88      6.5 2079.11
# Plot the distribution
train_df %>% ggplot(aes(x = SalePrice, fill = ..count..)) +
  geom_histogram(bins=45) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(title = 'Distribution: Sale Price', x = 'Sale Price ($)', y = 'Count') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5), legend.position = 'none')

There are 1,460 homes in the dataset, and a sales price for each of them. The lowest price is $34,900 and the highest is $755,000, with a mean value of $180,921. Prices show a pronounced rightward skew.

Documentation on the Ames Dataset suggests removing observations that are outliers or partial sales - all of which have square footage exceeding 4,000 square feet. As a best practice we could exclude these observations as the first act of data cleaning, but that would lead to a result set with a different length than that required by Kaggle.

# Count observations with GrLivArea of 4,000 sq ft or more
train_df %>% 
  filter(GrLivArea >= 4000) %>%
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1     4

Turning our attention to other features, we check for missing values in the training dataset.

# Check for noisome NAs
sapply(train_df, function(x) sum(is.na(x)))
##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##             0             0             0           259             0 
##        Street         Alley      LotShape   LandContour     Utilities 
##             0          1369             0             0             0 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##             0             0             0             0             0 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##             0             0             0             0             0 
##  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
##             0             0             0             0             0 
##    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
##             8             8             0             0             0 
##      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
##            37            37            38            37             0 
##  BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##            38             0             0             0             0 
##     HeatingQC    CentralAir    Electrical      1stFlrSF      2ndFlrSF 
##             0             0             1             0             0 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##             0             0             0             0             0 
##      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
##             0             0             0             0             0 
##    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
##             0             0           690            81            81 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##            81             0             0            81            81 
##    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch     3SsnPorch 
##             0             0             0             0             0 
##   ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
##             0             0          1453          1179          1406 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##             0             0             0             0             0 
##     SalePrice 
##             0

Rather than impute values or remove observations, in this analysis we’ll exclude features with NAs.

# Remove the affected columns
clean_df <- train_df %>% 
  dplyr::select(-c(LotFrontage, Alley, MasVnrType, MasVnrArea, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, Electrical, FireplaceQu, GarageType, GarageYrBlt, GarageFinish, GarageQual, GarageCond, PoolQC, Fence, MiscFeature)) 

A large number of features remain - some categorical, some ordinal, and a number quantitative.

# Summary of all features not missing values
str(clean_df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1460 obs. of  62 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ 1stFlrSF     : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ 2ndFlrSF     : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ 3SsnPorch    : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 81
##   .. ..$ Id           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ MSSubClass   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ MSZoning     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ LotFrontage  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ LotArea      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Street       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Alley        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ LotShape     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ LandContour  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Utilities    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ LotConfig    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ LandSlope    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Neighborhood : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Condition1   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Condition2   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ BldgType     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ HouseStyle   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ OverallQual  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ OverallCond  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ YearBuilt    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ YearRemodAdd : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ RoofStyle    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ RoofMatl     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Exterior1st  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Exterior2nd  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ MasVnrType   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ MasVnrArea   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ ExterQual    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ ExterCond    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Foundation   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ BsmtQual     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ BsmtCond     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ BsmtExposure : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ BsmtFinType1 : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ BsmtFinSF1   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ BsmtFinType2 : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ BsmtFinSF2   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ BsmtUnfSF    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ TotalBsmtSF  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Heating      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ HeatingQC    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ CentralAir   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Electrical   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ 1stFlrSF     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ 2ndFlrSF     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ LowQualFinSF : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ GrLivArea    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ BsmtFullBath : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ BsmtHalfBath : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ FullBath     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ HalfBath     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ BedroomAbvGr : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ KitchenAbvGr : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ KitchenQual  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ TotRmsAbvGrd : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Functional   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Fireplaces   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ FireplaceQu  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ GarageType   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ GarageYrBlt  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ GarageFinish : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ GarageCars   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ GarageArea   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ GarageQual   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ GarageCond   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ PavedDrive   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ WoodDeckSF   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ OpenPorchSF  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ EnclosedPorch: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ 3SsnPorch    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ ScreenPorch  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ PoolArea     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ PoolQC       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Fence        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ MiscFeature  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ MiscVal      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ MoSold       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ YrSold       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ SaleType     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ SaleCondition: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ SalePrice    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

There’s a lot to investigate in the training dataset, and a good way to understand relationships between the variables would be to conduct pairwise visualization. However, the number of factor levels for categorical variables makes this computationally prohibitive. As the aim of this analysis is regression, we reduce this dimensionality to focus on quantitative variables of interest before looking at correlations.

# Gather the quantitative variables of potential interest and create a dataframe based on them
quantvar_subset1 <- c('Id', 'YearBuilt', 'YearRemodAdd', 'YrSold', 'LotArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'SalePrice')
working_df = clean_df[quantvar_subset1]

Narrowing the feature set to just quantitative variables still leaves around twenty - quite a few. We visualize correlations among these features, as well as with sale price.

# Remove Id, then create and visualize correlation matrix
working_df %>% 
  dplyr::select(-Id) %>% 
  cor() %>% 
  ggcorrplot(insig = 'blank', # leaves blank if not significant
             colors = c('midnightblue', 'white', 'red2')) + # provides custom color range
  labs(title = 'Correlation between Quantitative Variables', x = '', y = '') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5), 
        legend.position = 'none', 
        axis.text.x = element_text(size = 8, angle = 90),
        axis.text.y = element_text(size = 8))

In this grid, the stronger the positive correlation, the redder the cell; the stronger the negative correlation, the bluer. Cells that are completely white demonstrate no relationship, or more likely, none that is statistically significant.

There are few negative correlations in this dataset, and those that do appear are not particularly insightful.

With the exception of the pairing between second-floor area and overall living area, the strongest positive correlations in the grid involve sale price.

The year a home was built correlates most with garage area and total basement area, and less with first-floor and total living area; the year it was remodeled shows similar correlations with the size features, though its correlation with total living area is somewhat higher and with basement area somewhat lower.
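
The strongest relationships with sale price can also be listed numerically; a minimal sketch using the working_df defined above:

# Rank the quantitative features by their correlation with SalePrice
corr_with_price <- cor(dplyr::select(working_df, -Id))[, 'SalePrice']
round(sort(corr_with_price, decreasing = TRUE), 3)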

We look at a summary of descriptive statistics for each feature.

# Create summary of univariate descriptive statistics
working_df_stats <- as_tibble(psych::describe(working_df))
working_df_stats <- tibble::rownames_to_column(working_df_stats, var = "feature")
working_df_stats_simple <- working_df_stats %>% dplyr::select(-c(vars, n, trimmed, mad, range, kurtosis, se))
working_df_stats_simple
## # A tibble: 21 x 7
##    feature           mean       sd  median   min    max     skew
##    <chr>            <dbl>    <dbl>   <dbl> <dbl>  <dbl>    <dbl>
##  1 Id              730.     422.      730.     1   1460   0     
##  2 YearBuilt      1971.      30.2    1973   1872   2010  -0.612 
##  3 YearRemodAdd   1985.      20.6    1994   1950   2010  -0.503 
##  4 YrSold         2008.       1.33   2008   2006   2010   0.0961
##  5 LotArea       10517.    9981.     9478.  1300 215245  12.2   
##  6 BsmtFinSF1      444.     456.      384.     0   5644   1.68  
##  7 BsmtFinSF2       46.5    161.        0      0   1474   4.25  
##  8 BsmtUnfSF       567.     442.      478.     0   2336   0.918 
##  9 TotalBsmtSF    1057.     439.      992.     0   6110   1.52  
## 10 1stFlrSF       1163.     387.     1087    334   4692   1.37  
## 11 2ndFlrSF        347.     437.        0      0   2065   0.811 
## 12 LowQualFinSF      5.84    48.6       0      0    572   8.99  
## 13 GrLivArea      1515.     525.     1464    334   5642   1.36  
## 14 GarageArea      473.     214.      480      0   1418   0.180 
## 15 WoodDeckSF       94.2    125.        0      0    857   1.54  
## 16 OpenPorchSF      46.7     66.3      25      0    547   2.36  
## 17 3SsnPorch         3.41    29.3       0      0    508  10.3   
## 18 ScreenPorch      15.1     55.8       0      0    480   4.11  
## 19 PoolArea          2.76    40.2       0      0    738  14.8   
## 20 MiscVal          43.5    496.        0      0  15500  24.4   
## 21 SalePrice    180921.   79443.   163000  34900 755000   1.88

We dig into three quantitative variables in particular: GrLivArea, GarageArea, and YearRemodAdd. We start with a smaller correlation matrix specific to the three.

# Revised data frame based on those quantitative variables
narrow_working_df <- working_df %>% dplyr::select(Id, SalePrice, GrLivArea, GarageArea, YearRemodAdd)

# Subset data for pairwise comparison
narrow_corr_plot <- narrow_working_df %>% 
  dplyr::select(-Id, -SalePrice) %>% 
  ggpairs() +
  theme_minimal()

narrow_corr_plot

GrLivArea is right-skewed - that is, there is a long tail of large homes, more of which were remodeled or possibly built (when a home has not been remodeled, this feature takes the same value as YearBuilt) in the nineties and aughts, unsurprisingly bringing with them larger garages. It’s positively correlated with the other two features, more with GarageArea (.469) than YearRemodAdd (.287).

GarageArea follows an approximately normal distribution, save for the bump represented by homes without indoor or covered parking, which appear across homes large and small, old and new. It’s positively correlated with YearRemodAdd (.372).

We already discussed YearRemodAdd in the context of the other features, so a quick word on its distribution: it’s left-skewed, reflecting an uptick in remodeled and/or new homes from the nineties that peaks in the mid-aughts (seemingly before 2007-8).

We also visualize the linear relationships of the features with the response variable: first, between the total living area of a home and its sale price, then the home’s garage area, and lastly the year the home was remodeled.

# GrLivArea vs. SalePrice: fit simple linear model and append predicted value and residuals
GrLivArea_lm <- lm(SalePrice ~ GrLivArea, data = narrow_working_df)
narrow_working_df$GrLivArea_pred <- predict(GrLivArea_lm)
narrow_working_df$GrLivArea_resid <- residuals(GrLivArea_lm)

# GrLivArea vs. SalePrice: scatterplot with regression line and residual visualization
GrLivArea_residviz <- ggplot(data = narrow_working_df, aes(x = GrLivArea, y = SalePrice)) +
  geom_segment(aes(xend = GrLivArea, yend = GrLivArea_pred), alpha = .2) +
  geom_smooth(method = 'lm', se = FALSE, color = 'darkgray') +
  geom_point(aes(color = GrLivArea_resid), size = 1.5) +
  scale_color_gradient2(low = 'midnightblue', mid = 'white', high = 'red2') +
  guides(color = FALSE) +
  geom_point(aes(y = GrLivArea_pred), size = 1.5, shape = 1) +
  labs(title = 'Linear Model with Residuals: Total Living Area vs. Sale Price', x = 'Total Living Area (sq ft)', y = 'Sale Price ($)') +
  ylim(0, 800000) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

# GrLivArea vs. SalePrice: plot residuals
GrLivArea_residplot <- ggplot(data = GrLivArea_lm, aes(x = .fitted, y = .resid)) +
  geom_point(aes(y = .resid, color = .resid)) +
  scale_color_gradient2(low = "midnightblue", mid = 'white', high = 'red2') +
  stat_smooth(method = 'loess', se = TRUE, fill = 'gray95', color = 'darkgray') +
  geom_hline(yintercept = 0, col = "black", linetype = "dashed", alpha = .8, size = .5) +
  guides(color = FALSE) +
  labs(title = 'Residual Plot: Total Living Area vs. Sale Price', x = 'Fitted Values', y = 'Residuals') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

# GrLivArea: QQ plot of the feature's distribution
GrLivArea_qqplot <- ggplot(data = narrow_working_df, aes(sample = GrLivArea)) +
  stat_qq(size = 1.5) +
  stat_qq_line(color = 'darkgray') +
  labs(title = 'Quantile-Quantile Plot: Total Living Area', x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

# GarageArea vs. SalePrice: fit simple linear model and append predicted value and residuals
GarageArea_lm <- lm(SalePrice ~ GarageArea, data = narrow_working_df)
narrow_working_df$GarageArea_pred <- predict(GarageArea_lm)
narrow_working_df$GarageArea_resid <- residuals(GarageArea_lm)

# GarageArea vs. SalePrice: scatterplot with regression line and residual visualization
GarageArea_residviz <- ggplot(data = narrow_working_df, aes(x = GarageArea, y = SalePrice)) +
  geom_segment(aes(xend = GarageArea, yend = GarageArea_pred), alpha = .2) +
  geom_smooth(method = 'lm', se = FALSE, color = 'darkgray') +
  geom_point(aes(color = GarageArea_resid), size = 1.5) +
  scale_color_gradient2(low = 'midnightblue', mid = 'white', high = 'red2') +
  guides(color = FALSE) +
  geom_point(aes(y = GarageArea_pred), size = 1.5, shape = 1) +
  labs(title = 'Linear Model with Residuals: Garage Area vs. Sale Price', x = 'Garage Area (sq ft)', y = 'Sale Price ($)') +
  ylim(0, 800000) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

# GarageArea vs. SalePrice: plot residuals
GarageArea_residplot <- ggplot(data = GarageArea_lm, aes(x = .fitted, y = .resid)) +
  geom_point(aes(y = .resid, color = .resid)) +
  scale_color_gradient2(low = "midnightblue", mid = 'white', high = 'red2') +
  stat_smooth(method = 'loess', se = TRUE, fill = 'gray95', color = 'darkgray') +
  geom_hline(yintercept = 0, col = "black", linetype = "dashed", alpha = .8, size = .5) +
  guides(color = FALSE) +
  labs(title = 'Residual Plot: Garage Area vs. Sale Price', x = 'Fitted Values', y = 'Residuals') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

# GarageArea: QQ plot of the feature's distribution
GarageArea_qqplot <- ggplot(data = narrow_working_df, aes(sample = GarageArea)) +
  stat_qq(size = 1.5) +
  stat_qq_line(color = 'darkgray') +
  labs(title = 'Quantile-Quantile Plot: Garage Area', x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

# YearRemodAdd vs. SalePrice: fit simple linear model and append predicted value and residuals to original dataframe
YearRemodAdd_lm <- lm(SalePrice ~ YearRemodAdd, data = narrow_working_df)
narrow_working_df$YearRemodAdd_pred <- predict(YearRemodAdd_lm)
narrow_working_df$YearRemodAdd_resid <- residuals(YearRemodAdd_lm)

# YearRemodAdd vs. SalePrice: scatterplot with regression line and residual visualization
YearRemodAdd_residviz <- ggplot(data = narrow_working_df, aes(x = YearRemodAdd, y = SalePrice)) +
  geom_segment(aes(xend = YearRemodAdd, yend = YearRemodAdd_pred), alpha = .2) +
  geom_smooth(method = 'lm', se = FALSE, color = 'darkgray') +
  geom_point(aes(color = YearRemodAdd_resid), size = 1.5) +
  scale_color_gradient2(low = 'midnightblue', mid = 'white', high = 'red2') +
  guides(color = FALSE) +
  geom_point(aes(y = YearRemodAdd_pred), size = 1.5, shape = 1) +
  labs(title = 'Linear Model with Residuals: Year Remodeled', x = 'Year Remodeled', y = 'Sale Price ($)') +
  ylim(0, 800000) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

# YearRemodAdd vs. SalePrice: plot residuals
YearRemodAdd_residplot <- ggplot(data = YearRemodAdd_lm, aes(x = .fitted, y = .resid)) +
  geom_point(aes(y = .resid, color = .resid)) +
  scale_color_gradient2(low = "midnightblue", mid = 'white', high = 'red2') +
  stat_smooth(method = 'loess', se = TRUE, fill = 'gray95', color = 'darkgray') +
  geom_hline(yintercept = 0, col = "black", linetype = "dashed", alpha = .8, size = .5) +
  guides(color = FALSE) +
  labs(title = 'Residual Plot: Year Remodeled', x = 'Fitted Values', y = 'Residuals') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

# YearRemodAdd: QQ plot of the feature's distribution
YearRemodAdd_qqplot <- ggplot(data = narrow_working_df, aes(sample = YearRemodAdd)) +
  stat_qq(size = 1.5) +
  stat_qq_line(color = 'darkgray') +
  labs(title = 'Quantile-Quantile Plot: Year Remodeled', x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))


# summary(GrLivArea_lm)
GrLivArea_residviz

GrLivArea_residplot

GrLivArea_qqplot

# summary(GarageArea_lm)
GarageArea_residviz

GarageArea_residplot

GarageArea_qqplot

# summary(YearRemodAdd_lm)
YearRemodAdd_residviz

YearRemodAdd_residplot

YearRemodAdd_qqplot

GrLivArea reveals a positive relationship with SalePrice. The QQ plot underscores the feature’s pronounced rightward skew, and outliers exert leverage on the fit. Residuals grow in magnitude for large values of SalePrice, indicating non-constant variance rather than normally distributed errors.

GarageArea shows similar patterns - though the relationship with SalePrice isn’t as strong, there are outliers and positive skew, and the tail impact of homes with no garage space is discernible in the residuals.

Interestingly, YearRemodAdd by itself has an even tamer positive relationship with SalePrice. Records seem to commence around 1950 and remodeling / new-home construction falls off in the mid-aughts, so the negative skew in the QQ plot needs to be interpreted with that in mind. Based on the residuals, the distribution over time does not appear normal.
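
These impressions could be checked numerically; as a minimal sketch (not part of the original analysis), Shapiro-Wilk tests on each simple model's residuals would quantify the departures from normality:

# Formal normality tests on the residuals of the three simple models
shapiro.test(residuals(GrLivArea_lm))
shapiro.test(residuals(GarageArea_lm))
shapiro.test(residuals(YearRemodAdd_lm))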

We briefly revisit correlations between the features.

# Confirm correlations at the 80% confidence level
GrLivArea_GarageArea_corr <- cor.test(
  x = narrow_working_df$GrLivArea, y = narrow_working_df$GarageArea, conf.level = .8)
GarageArea_YearRemodAdd_corr <- cor.test(
  x = narrow_working_df$GarageArea, y = narrow_working_df$YearRemodAdd, conf.level = .8)
YearRemodAdd_GrLivArea_corr <- cor.test(
  x = narrow_working_df$YearRemodAdd, y = narrow_working_df$GrLivArea, conf.level = .8)

cat(paste0('Correlation between total living area and garage area at the 80% confidence level is ', round(as.numeric(GrLivArea_GarageArea_corr$estimate), 4), '.'))
## Correlation between total living area and garage area at the 80% confidence level is 0.469.
cat(paste0('\nCorrelation between garage area and year remodeled at the 80% confidence level is ', round(as.numeric(GarageArea_YearRemodAdd_corr$estimate), 4), '.'))
## 
## Correlation between garage area and year remodeled at the 80% confidence level is 0.3716.
cat(paste0('\nCorrelation between year remodeled and total living area at the 80% confidence level is ', round(as.numeric(YearRemodAdd_GrLivArea_corr$estimate), 4), '.'))
## 
## Correlation between year remodeled and total living area at the 80% confidence level is 0.2874.

This reproduces the Pearson correlation estimates from the correlation matrix, now at the 80% confidence level, and confirms that \(p < .05\) for all three pairs.

Familywise error is the probability of coming to at least one false conclusion in a series of hypothesis tests. To address this, we can apply a Bonferroni-style adjustment to the confidence level across the three tests: a familywise confidence level of \(.80\) across three tests equates to a per-test confidence level of \(1 - \frac{1 - .8}{3} \approx .933\). We rerun the correlation tests at this confidence level.
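
A quick sketch of the adjustment:

# Bonferroni-style adjustment of the familywise confidence level across k tests
k <- 3
fw_conf <- 0.80
per_test_conf <- 1 - (1 - fw_conf) / k
per_test_conf  # 0.9333...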

# Conduct correlations at 93.3% confidence level
cor.test(x = narrow_working_df$GrLivArea, y = narrow_working_df$GarageArea, conf.level = .93333)
## 
##  Pearson's product-moment correlation
## 
## data:  narrow_working_df$GrLivArea and narrow_working_df$GarageArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 93.333 percent confidence interval:
##  0.4306870 0.5056208
## sample estimates:
##       cor 
## 0.4689975
cor.test(x = narrow_working_df$GarageArea, y = narrow_working_df$YearRemodAdd, conf.level = .93333)
## 
##  Pearson's product-moment correlation
## 
## data:  narrow_working_df$GarageArea and narrow_working_df$YearRemodAdd
## t = 15.283, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 93.333 percent confidence interval:
##  0.3294698 0.4122530
## sample estimates:
##       cor 
## 0.3715998
cor.test(x = narrow_working_df$YearRemodAdd, y = narrow_working_df$GrLivArea, conf.level = .93333)
## 
##  Pearson's product-moment correlation
## 
## data:  narrow_working_df$YearRemodAdd and narrow_working_df$GrLivArea
## t = 11.457, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 93.333 percent confidence interval:
##  0.2427298 0.3308317
## sample estimates:
##       cor 
## 0.2873885

At the 93.3% confidence level \(p < .05\) for all three pairs, so we can reject the null hypothesis of zero correlation for each pair.


Linear Algebra and Correlation

We invert the correlation matrix to yield the precision matrix.

narrow_corr_matrix <- narrow_working_df %>% 
  dplyr::select(GrLivArea, GarageArea, YearRemodAdd) %>% 
  cor()

# Invert correlation matrix
prec_matrix <- solve(narrow_corr_matrix) 

cat('Correlation Matrix \n ')
## Correlation Matrix 
## 
round(narrow_corr_matrix, 8)
##              GrLivArea GarageArea YearRemodAdd
## GrLivArea    1.0000000  0.4689975    0.2873885
## GarageArea   0.4689975  1.0000000    0.3715998
## YearRemodAdd 0.2873885  0.3715998    1.0000000
cat('\n Precision Matrix \n')
## 
##  Precision Matrix
round(prec_matrix, 8)
##               GrLivArea GarageArea YearRemodAdd
## GrLivArea     1.3068513 -0.5491812   -0.1714984
## GarageArea   -0.5491812  1.3909928   -0.3590643
## YearRemodAdd -0.1714984 -0.3590643    1.1827149

We multiply these two matrices by one another to confirm that the product is the identity matrix.

# Multiply correlation and precision matrix
check_matrix1 <- prec_matrix %*% narrow_corr_matrix
check_matrix2 <- narrow_corr_matrix %*% prec_matrix

cat('Precision Matrix X Correlation Matrix \n ')
## Precision Matrix X Correlation Matrix 
## 
round(check_matrix1, 8)
##              GrLivArea GarageArea YearRemodAdd
## GrLivArea            1          0            0
## GarageArea           0          1            0
## YearRemodAdd         0          0            1
cat('\n Correlation Matrix X Precision Matrix \n ')
## 
##  Correlation Matrix X Precision Matrix 
## 
round(check_matrix2, 8)
##              GrLivArea GarageArea YearRemodAdd
## GrLivArea            1          0            0
## GarageArea           0          1            0
## YearRemodAdd         0          0            1

Lastly, we conduct lower-upper (LU) decomposition on the correlation matrix, using an existing decomposition function rather than working it out long-hand.

# Perform LU decomposition
lu_matrix <- lu.decomposition(narrow_corr_matrix)

cat('LU Decomposition Matrix \n \n')
## LU Decomposition Matrix 
## 
lu_matrix
## $L
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.4689975 1.0000000    0
## [3,] 0.2873885 0.3035933    1
## 
## $U
##      [,1]         [,2]      [,3]
## [1,]    1 4.689975e-01 0.2873885
## [2,]    0 7.800414e-01 0.2368153
## [3,]    0 2.775558e-17 0.8455123
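
As a quick sanity check (a sketch, using the lu_matrix object above), the product of \(L\) and \(U\) should reproduce the correlation matrix:

# L %*% U should reconstruct the original correlation matrix (dimnames dropped for comparison)
all.equal(lu_matrix$L %*% lu_matrix$U, unname(narrow_corr_matrix))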

Calculus-based Probability and Statistics

We explore features in the dataset skewed to the right.

# Skew by feature
working_df_stats %>% 
  dplyr::select(feature, skew) %>% 
  arrange(desc(skew))
## # A tibble: 21 x 2
##    feature          skew
##    <chr>           <dbl>
##  1 MiscVal       24.4   
##  2 PoolArea      14.8   
##  3 LotArea       12.2   
##  4 3SsnPorch     10.3   
##  5 LowQualFinSF   8.99  
##  6 BsmtFinSF2     4.25  
##  7 ScreenPorch    4.11  
##  8 OpenPorchSF    2.36  
##  9 SalePrice      1.88  
## 10 BsmtFinSF1     1.68  
## 11 WoodDeckSF     1.54  
## 12 TotalBsmtSF    1.52  
## 13 1stFlrSF       1.37  
## 14 GrLivArea      1.36  
## 15 BsmtUnfSF      0.918 
## 16 2ndFlrSF       0.811 
## 17 GarageArea     0.180 
## 18 YrSold         0.0961
## 19 Id             0     
## 20 YearRemodAdd  -0.503 
## 21 YearBuilt     -0.612

Most features show positive (i.e. rightward) skew, with only a few revealing left skew. Of the features most right-skewed, MiscVal is definitionally non-specific and PoolArea’s incidence is somewhat sparse - so we choose LotArea, or the square footage of the home’s lot.

We scale LotArea so its minimum is just above zero.

# Find the minimum value and add a new feature that scales it just above 0
scaling_value <- min(working_df$LotArea) * .999999999999999
working_df <- working_df %>% 
  mutate(scaled_LotArea = LotArea - scaling_value)  # straight subtraction does not alter the correlation

# Confirm it worked
working_df %>% 
  dplyr::select(Id, LotArea, scaled_LotArea) %>% 
  head()
## # A tibble: 6 x 3
##      Id LotArea scaled_LotArea
##   <int>   <int>          <dbl>
## 1     1    8450          7150.
## 2     2    9600          8300.
## 3     3   11250          9950.
## 4     4    9550          8250.
## 5     5   14260         12960.
## 6     6   14115         12815.

We fit an exponential probability distribution function, take 1,000 samples from the distribution based on the value of \(\lambda\), and then compare that sampled distribution to the actual.

# Fit exponential distribution
exp_fit <- fitdistr(working_df$scaled_LotArea, densfun = 'exponential')
exp_fit_lambda <- as.numeric(exp_fit$estimate)
exp_fit_distr <- as.data.frame(rexp(n = 1000, rate = exp_fit_lambda))
colnames(exp_fit_distr) <- 'sampled_values'
exp_fit_mean <- mean(exp_fit_distr$sampled_values) 
exp_fit_stdev <- exp_fit$sd
lambda_ci <- c(exp_fit_mean + c(-1, 1) * 1.96 * exp_fit_stdev)  # confidence interval @ 95% for sampled distribution

# Plot a histogram of the sampled exponential distribution
exp_fit_plot <- ggplot(data = exp_fit_distr, aes(sampled_values)) +
  geom_histogram(binwidth = 5000, fill = 'midnightblue') +
  scale_x_continuous(limits = c(0, 100000), labels = comma, breaks = seq(from = 0, to = 100000, by = 10000)) +
  scale_y_continuous(limits = c(0, 500)) +
  labs(title = 'Scaled Lot Area (Sampled)', x = 'Lot Area (sq ft)', y = 'Count') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5), legend.position = 'none')

# Plot a histogram of the empirical distribution
scaled_LotArea_plot <- ggplot(data = working_df, aes(scaled_LotArea)) +
  geom_histogram(binwidth = 5000, fill = 'dodgerblue3') +
  scale_x_continuous(limits = c(0, 100000), labels = comma, breaks = seq(from = 0, to = 100000, by = 10000)) +
  scale_y_continuous(limits = c(0, 500)) +
  labs(title = 'Scaled Lot Area (Empirical)', x = 'Lot Area (sq ft)', y = 'Count') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5), legend.position = 'none')

We also find the 5th and 95th percentiles using the CDF.

# Calculate mean and standard deviation of empirical distribution
actual_mean <- mean(working_df$scaled_LotArea)
actual_stdev <- sd(working_df$scaled_LotArea)

# Compare the simulated and empirical means
cat(paste0('The sampled distribution\'s mean is ', as.character(round(exp_fit_mean, 3)), '.'))
## The sampled distribution's mean is 8986.702.
cat(paste0('\nThe empirical distribution\'s mean is ', as.character(round(actual_mean, 3)), '.'))
## 
## The empirical distribution's mean is 9216.828.
# The 5th and 95th percentiles of the exponential CDF
exp_quantiles <- qexp(p = c(.05, .95), rate = exp_fit_lambda)
cat(paste0('\nThe 5th percentile of the exponential cumulative distribution function is ', as.character(round(exp_quantiles[1]), 3), ' and the 95th percentile is ', as.character(round(exp_quantiles[2]), 3), '.'))
## 
## The 5th percentile of the exponential cumulative distribution function is 473 and the 95th percentile is 27611.

Though both distributions are based on the same exponential \(\lambda\), the empirical distribution is less peaked and has a higher minimum value. The shape of the empirical distribution seems logical, given that there is likely less residential demand for larger properties as well as a regulatory optimum for lot size. Controlling for the long tail of large lots, a normal distribution might approximate the data better.
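
A short sketch clarifies the small gap between the sampled and empirical means noted above: for an exponential fit, the maximum-likelihood rate is the reciprocal of the sample mean, so the fitted distribution's theoretical mean matches the empirical mean exactly, and the remaining gap reflects drawing only 1,000 samples.

# The fitted exponential's theoretical mean equals the empirical mean by construction
1 / exp_fit_lambda               # theoretical mean of the fitted exponential
mean(working_df$scaled_LotArea)  # empirical mean; the sampled mean differs only through sampling noise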


Modeling

We build a multiple regression model using stepwise backward elimination. As GrLivArea and TotalBsmtSF are highly correlated with 1stFlrSF, we won’t include it. TotalBsmtSF is also correlated with BsmtFinSF1, so we keep that out. We include LotArea but not scaled_LotArea.

We also check for missing values in the test dataset that would hamper our ability to make predictions.

# Check for noisome NAs
sapply(test_df, function(x) sum(is.na(x)))
##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##             0             0             4           227             0 
##        Street         Alley      LotShape   LandContour     Utilities 
##             0          1352             0             0             2 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##             0             0             0             0             0 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##             0             0             0             0             0 
##  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
##             0             0             0             1             1 
##    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
##            16            15             0             0             0 
##      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
##            44            45            44            42             1 
##  BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##            42             1             1             1             0 
##     HeatingQC    CentralAir    Electrical      1stFlrSF      2ndFlrSF 
##             0             0             0             0             0 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##             0             0             2             2             0 
##      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
##             0             0             0             1             0 
##    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
##             2             0           730            76            78 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##            78             1             1            78            78 
##    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch     3SsnPorch 
##             0             0             0             0             0 
##   ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
##             0             0          1456          1169          1408 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##             0             0             0             1             0

The following features that we would consider including in the regression have missing values in the test dataset: BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, and GarageArea. We exclude these from the model to avoid downstream issues with prediction on the test data.

# We'll start with the following quantitative features 
multregr_fit <- lm(formula = SalePrice ~
                       YearBuilt + YearRemodAdd + YrSold + `2ndFlrSF` + LowQualFinSF + 
                     GrLivArea + WoodDeckSF + OpenPorchSF + `3SsnPorch` + ScreenPorch + 
                     PoolArea + MiscVal + LotArea, 
                      data = working_df
                      )

summary(multregr_fit)
## 
## Call:
## lm(formula = SalePrice ~ YearBuilt + YearRemodAdd + YrSold + 
##     `2ndFlrSF` + LowQualFinSF + GrLivArea + WoodDeckSF + OpenPorchSF + 
##     `3SsnPorch` + ScreenPorch + PoolArea + MiscVal + LotArea, 
##     data = working_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -498006  -20230   -2687   16807  308272 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.571e+05  1.705e+06  -0.503 0.615318    
## YearBuilt     6.946e+02  4.823e+01  14.401  < 2e-16 ***
## YearRemodAdd  5.863e+02  6.980e+01   8.400  < 2e-16 ***
## YrSold       -8.258e+02  8.495e+02  -0.972 0.331154    
## `2ndFlrSF`   -3.836e+01  3.703e+00 -10.361  < 2e-16 ***
## LowQualFinSF -8.251e+01  2.398e+01  -3.440 0.000598 ***
## GrLivArea     1.086e+02  3.480e+00  31.199  < 2e-16 ***
## WoodDeckSF    4.397e+01  9.581e+00   4.589 4.83e-06 ***
## OpenPorchSF   3.015e+01  1.823e+01   1.654 0.098438 .  
## `3SsnPorch`   3.407e+01  3.841e+01   0.887 0.375275    
## ScreenPorch   1.002e+02  2.046e+01   4.899 1.07e-06 ***
## PoolArea     -5.431e+01  2.849e+01  -1.906 0.056794 .  
## MiscVal      -1.328e+00  2.265e+00  -0.586 0.557722    
## LotArea       5.167e-01  1.200e-01   4.304 1.79e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42800 on 1446 degrees of freedom
## Multiple R-squared:  0.7123, Adjusted R-squared:  0.7097 
## F-statistic: 275.4 on 13 and 1446 DF,  p-value: < 2.2e-16

Based on the p-values of the features, candidates for removal from the model in stepwise backward elimination are MiscVal, 3SsnPorch, YrSold, and OpenPorchSF. Removing them in individual iterations led to the same conclusion, so we extract them all in one go.
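
For comparison, an automated alternative (a sketch, not what was run here) would let R's built-in step() perform backward elimination by AIC before proceeding with the manual approach below:

# Hypothetical alternative: automated backward elimination by AIC
multregr_step <- step(multregr_fit, direction = 'backward', trace = FALSE)
summary(multregr_step)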

# Remove multiple features, jumping several steps of iteration
multregr_fit2 <- lm(formula = SalePrice ~
                       YearBuilt + YearRemodAdd + `2ndFlrSF` + LowQualFinSF + 
                     GrLivArea + WoodDeckSF + ScreenPorch + PoolArea + LotArea,  
                      data = working_df
                      )

summary(multregr_fit2)
## 
## Call:
## lm(formula = SalePrice ~ YearBuilt + YearRemodAdd + `2ndFlrSF` + 
##     LowQualFinSF + GrLivArea + WoodDeckSF + ScreenPorch + PoolArea + 
##     LotArea, data = working_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -497392  -20233   -2667   16816  305240 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.548e+06  1.160e+05 -21.961  < 2e-16 ***
## YearBuilt     7.031e+02  4.807e+01  14.626  < 2e-16 ***
## YearRemodAdd  5.941e+02  6.934e+01   8.569  < 2e-16 ***
## `2ndFlrSF`   -3.859e+01  3.697e+00 -10.439  < 2e-16 ***
## LowQualFinSF -8.202e+01  2.398e+01  -3.420 0.000643 ***
## GrLivArea     1.100e+02  3.411e+00  32.239  < 2e-16 ***
## WoodDeckSF    4.237e+01  9.550e+00   4.437 9.81e-06 ***
## ScreenPorch   1.005e+02  2.041e+01   4.924 9.43e-07 ***
## PoolArea     -5.299e+01  2.844e+01  -1.863 0.062636 .  
## LotArea       5.185e-01  1.199e-01   4.323 1.64e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42820 on 1450 degrees of freedom
## Multiple R-squared:  0.7113, Adjusted R-squared:  0.7095 
## F-statistic: 396.9 on 9 and 1450 DF,  p-value: < 2.2e-16

\(R^{2}\) has dropped slightly. As PoolArea’s p-value exceeds the \(.05\) threshold, we try removing that feature in one more iteration of the model.

# We'll iterate manually and jump steps
multregr_fit3 <- lm(formula = SalePrice ~
                       YearBuilt + YearRemodAdd + `2ndFlrSF` + LowQualFinSF +
                      GrLivArea + WoodDeckSF + ScreenPorch + LotArea, 
                      data = working_df
                      )

summary(multregr_fit3)
## 
## Call:
## lm(formula = SalePrice ~ YearBuilt + YearRemodAdd + `2ndFlrSF` + 
##     LowQualFinSF + GrLivArea + WoodDeckSF + ScreenPorch + LotArea, 
##     data = working_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -519312  -20227   -2631   16853  288125 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.557e+06  1.160e+05 -22.041  < 2e-16 ***
## YearBuilt     7.038e+02  4.811e+01  14.630  < 2e-16 ***
## YearRemodAdd  5.986e+02  6.936e+01   8.630  < 2e-16 ***
## `2ndFlrSF`   -3.830e+01  3.697e+00 -10.360  < 2e-16 ***
## LowQualFinSF -8.351e+01  2.399e+01  -3.481 0.000514 ***
## GrLivArea     1.091e+02  3.385e+00  32.242  < 2e-16 ***
## WoodDeckSF    4.170e+01  9.551e+00   4.366 1.36e-05 ***
## ScreenPorch   9.928e+01  2.042e+01   4.863 1.28e-06 ***
## LotArea       5.144e-01  1.200e-01   4.286 1.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42850 on 1451 degrees of freedom
## Multiple R-squared:  0.7106, Adjusted R-squared:  0.709 
## F-statistic: 445.4 on 8 and 1451 DF,  p-value: < 2.2e-16

\(R^{2}\) fell with the removal of PoolArea, so we keep that feature and proceed with the previous iteration to predict SalePrice in the test dataset. As the remaining features have p-values below the \(.05\) threshold, we halt backward elimination here.

We examine the model’s residual fit as well as a quantile-quantile plot.

# Multiple regression model: append predicted value and residuals
working_df$multregr_pred <- predict(multregr_fit2)
working_df$multregr_resid <- residuals(multregr_fit2)

# Multiple regression model: plot residuals
multregr_residplot <- ggplot(data = multregr_fit2, aes(x = .fitted, y = .resid)) +
  geom_point(aes(y = .resid, color = .resid)) +
  scale_color_gradient2(low = "midnightblue", mid = 'white', high = 'red2') +
  stat_smooth(method = 'loess', se = TRUE, fill = 'gray95', color = 'darkgray') +
  geom_hline(yintercept = 0, col = "black", linetype = "dashed", alpha = .8, size = .5) +
  guides(color = FALSE) +
  labs(title = 'Residual Plot: Multiple Regression on Sale Price', x = 'Fitted Values', y = 'Residuals') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

# Multiple regression model: QQ plot of SalePrice
multregr_qqplot <- ggplot(data = working_df, aes(sample = SalePrice)) +
  stat_qq(size = 1.5) +
  stat_qq_line(color = 'darkgray') +
  labs(title = 'Quantile-Quantile Plot: Sale Price', x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

multregr_residplot

multregr_qqplot

While \(R^{2}\) is relatively high, the residuals do not display constant variance, suggesting that this model could benefit from extensive tuning - which I’ll get to learn about next semester in DATA 621.
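
The non-constant variance could also be checked formally; a minimal sketch, assuming the lmtest package is available:

# Breusch-Pagan test for heteroscedasticity in the chosen model
library(lmtest)
bptest(multregr_fit2)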

We predict SalePrice of homes in the holdout test dataset based on the multiple linear regression, and then save those predictions for submission to Kaggle.

# Add the scaled_LotArea feature to the test set for completeness
test_df2 <- test_df %>% 
  mutate(scaled_LotArea = LotArea - scaling_value) # %>% 
  # dplyr::select(Id, YearBuilt, YearRemodAdd, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, LowQualFinSF, GrLivArea, GarageArea, WoodDeckSF, ScreenPorch, PoolArea, scaled_LotArea)

# Check it worked
test_df2 %>% 
  dplyr::select(Id, scaled_LotArea) %>% 
  head()
## # A tibble: 6 x 2
##      Id scaled_LotArea
##   <int>          <dbl>
## 1  1461         10322.
## 2  1462         12967.
## 3  1463         12530.
## 4  1464          8678.
## 5  1465          3705.
## 6  1466          8700.
# Predict SalePrice of the test dataset
test_pred <- predict(object = multregr_fit2, newdata = test_df2)

# Convert predicted values to tabular format
test_pred_df <- data.frame(cbind(test_df2$Id, test_pred))
colnames(test_pred_df) = c('Id', 'SalePrice')
test_pred_df$Id <- as.integer(test_pred_df$Id)

# Confirm data frame structure before exporting
nrow(test_pred_df)
## [1] 1459
colnames(test_pred_df)
## [1] "Id"        "SalePrice"
# Export to CSV for submission to Kaggle
submit_path <- file.path(getwd(), "JO_submission.csv")
write_csv(x = test_pred_df, col_names = TRUE, path = submit_path)

My user name is jeremyobrien16, and this multiple regression was scored \(.21748\) on Kaggle.