Problem 1

We start by generating 10,000 draws of the random variables \(X\) and \(Y\), where:

  • \(X\) is uniformly distributed from 1 to \(N=41\), and
  • \(Y\) is normally distributed with a mean and standard deviation of \(\mu = \sigma = (N+1)/2 = 21\).
n <- 10000
N <- 41
m <- (N + 1) / 2    # mean of uniform distribution for X
X <- runif(n, min = 1, max = N)     # random variable X
Y <- rnorm(n, mean = m, sd = m)     # random variable Y

df <- data.frame(X, Y)
str(df)
## 'data.frame':    10000 obs. of  2 variables:
##  $ X: num  29.95 17.83 4.73 38.69 7.8 ...
##  $ Y: num  23.2 63.2 24.9 21 22.9 ...
summary(df)
##        X                Y          
##  Min.   : 1.002   Min.   :-54.944  
##  1st Qu.:10.835   1st Qu.:  6.699  
##  Median :20.494   Median : 20.930  
##  Mean   :20.787   Mean   : 21.015  
##  3rd Qu.:30.884   3rd Qu.: 35.167  
##  Max.   :40.998   Max.   :102.070

We compute the following values:

  • Sample median \(x\) and theoretical median \(x_{th}\) of the \(X\) variable, as well as
  • Sample 1st quartile \(y\) and theoretical 1st quartile \(y_{th}\) of the \(Y\) variable.
x <- median(X)          # sample median of X
y <- quantile(Y, 0.25)  # sample 1st quartile of Y

x_th <- qunif(0.5, min = 1, max = N)    # population median of X
y_th <- qnorm(0.25, mean = m, sd = m)   # population 1st quartile of Y 

xy = data.frame("x" = c(x, x_th), "y" = c(y, y_th), row.names = c("Sample", "Theoretical"))
kable(xy, digits = 3, align = 'cc', caption = "Sample and Theoretical Values of x & y")
Sample and Theoretical Values of x & y
x y
Sample 20.494 6.699
Theoretical 21.000 6.836

Then we plot the histograms of \(X\) and \(Y\) along with their probability density functions, and show the values of the median \(x\) and 1st quartile \(y\).

# plot histogram of X with probability density function
hist(X, breaks = 1:N, main = 'Histogram of X with median value (blue)')
abline(h = n / (N - 1), lwd = 3, col = 'violet')
abline(v = x_th, lty = 3, lwd = 3, col = 'blue')

# plot histogram of Y with probability density function
hist(Y, breaks = (m-200):(m+200), xlim = c(m-60, m+60), 
     main = 'Histogram of Y with 1st quartile (blue) and mean (red) values')
curve(n * dnorm(x, mean = m, sd = m), add = TRUE, lwd = 3, col = 'violet') 
abline(v = y_th, lty = 3, lwd = 3, col = 'blue')
abline(v = m, lty = 3, lwd = 3, col = 'red')

A. Calculate the probabilities below and interpret their meaning.

In the probability calculations in this part, we give the sample probabilities computed from the 10,000 random draws generated for \(X\) and \(Y\) as well as the theoretical probabilities computed from the specified distributions for \(X\) (uniform) and \(Y\) (normal).

  1. \(\Pr(X>x ~|~ X>y) = p_A\)

    Answer:

    This is the conditional probability that \(X>x\) given that \(X>y\), which can be expressed as:

    \[\begin{eqnarray} p_A = \Pr(X>x ~|~ X>y) &=& \frac{\Pr(X>x ~\cap~ X>y)}{\Pr(X>y)} && \\ &=& \frac{\Pr(X>x)}{\Pr(X>y)} && \text{ since } x>y \\ &=& \frac{0.5}{\Pr(X>y)} && \text{ by definition of median} \end{eqnarray}\]

    Both the sample probability and theoretical probability are approximately 58.5%, as can be seen in the table from part (c) below.

    # sample probability
    p1_s <- sum(X > x) / sum(X > y)
    # theoretical probability
    p1_th <- 0.5 / ((N - y_th) / (N - 1))
  2. \(\Pr(X>x,~ Y>y) = p_B\)

    Answer:

    This is the joint probability that \(X>x\) and \(Y>y\). Since the random variables \(X\) and \(Y\) are independent (see part B below), the theoretical probability can be expressed as:

    \[ p_B = \Pr(X>x,~ Y>y) = \Pr(X>x) \cdot \Pr(Y>y) = 0.5 \cdot 0.75 \]

    by the definitions of \(x\) (median of \(X\)) and \(y\) (1st quartile of \(Y\)).

    Both the sample probability and theoretical probability are approximately 37.5%, as can be seen in the table from part (c) below.

    # sample probability
    p2_s <- nrow(df[df$X > x & df$Y > y, ]) / n
    # theoretical probability
    p2_th <- 0.5 * 0.75
  3. \(\Pr(X<x ~|~ X>y) = p_C\)

    Answer:

    This is the conditional probability that \(X<x\) given that \(X>y\), which can be expressed as:

    \[\begin{eqnarray} p_C = \Pr(X<x ~|~ X>y) &=& \frac{\Pr(X<x ~\cap~ X>y)}{\Pr(X>y)} && \\ &=& \frac{\Pr(y < X < x)}{\Pr(X>y)} && \text{ since } y < x \end{eqnarray}\]

    Both the sample probability and theoretical probability are approximately 41.5%, as can be seen in the table below.

    Note that this also can be computed simply by using the probability \(p_A\) from part (a) above, since \(X<x\) and \(X>x\) are complements:

    \[p_C = \Pr(X<x ~|~ X>y) = 1 - \Pr(X>x ~|~ X>y) = 1 - p_A\]

    # sample probability
    p3_s <- sum(X < x & X > y) / sum(X > y)
    # theoretical probability
    p3_th <- (x_th - y_th) / (N - y_th) 
    
    # table of probabilities
    p <- data.frame("p_A" = c(p1_s, p1_th), "p_B" = c(p2_s, p2_th), "p_C" = c(p3_s, p3_th), 
                    row.names = c("Sample", "Theoretical"))
    kable(p, digits = 4, align = 'ccc', 
          caption = "Sample and Theoretical Values of Probabilities p_A, p_B, and p_C")
    Sample and Theoretical Values of Probabilities p_A, p_B, and p_C
    p_A p_B p_C
    Sample 0.5831 0.3754 0.4169
    Theoretical 0.5854 0.3750 0.4146

B. Investigate whether \(\Pr(X>x \text{ and } Y>y) = \Pr(X>x) \cdot \Pr(Y>y)\) by building a table and evaluating the marginal and joint probabilities.

We evaluate the conditions \(X > x\) and \(Y > y\) on the \(X\) and \(Y\) samples, store the results in a dataframe, and then convert this into a contingency table. The table is shown in terms of both frequency and probability.

# construct dataframe for X>x and Y>y
df2 <- data.frame(X = ifelse(X <= x, 'X <= x', 'X > x'), Y = ifelse(Y <= y, 'Y <= y', 'Y > y'))
# construct and summarize contingency table
df3 <- table(df2)
summary(df3)    
## Number of cases in table: 10000 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 0.03413, df = 1, p-value = 0.8534
# reformat table as dataframe for display
C1 <- c(df3[ ,1], sum(df3[ ,1]))
C2 <- c(df3[ ,2], sum(df3[ ,2]))
C3 <- C1 + C2
df4 <- data.frame(C1, C2, C3) 
rownames(df4) <- c('X <= x', 'X > x', 'Total')
colnames(df4) <- c('Y <= y', 'Y > y', 'Total')

# display table: frequency
kable(df4, align = 'ccc', caption = 'Contingency Table of X>x and Y>y: Frequency')
Contingency Table of X>x and Y>y: Frequency
Y <= y Y > y Total
X <= x 1254 3746 5000
X > x 1246 3754 5000
Total 2500 7500 10000
# display table: probability
kable(df4 / n, digits = 4, align = 'ccc', 
      caption = 'Contingency Table of X>x and Y>y: Probability')
Contingency Table of X>x and Y>y: Probability
Y <= y Y > y Total
X <= x 0.1254 0.3746 0.5
X > x 0.1246 0.3754 0.5
Total 0.2500 0.7500 1.0

We see that the marginal probabilities and joint probabilities are what we would expect for independent random variables \(X\) and \(Y\). For instance:

\[\begin{eqnarray} \Pr(X > x) &=& 0.5 \\ \Pr(Y > y) &=& 0.75 \\ \Pr(X > x,~ Y > y) &\approx& 0.375 = \Pr(X > x) \cdot \Pr(Y > y) \end{eqnarray}\]

Alternatively, we can check for independence by comparing the marginal and conditional probabilities:

\[\begin{eqnarray} \Pr(X > x ~|~ Y > y) &=& \frac{\Pr(X > x,~ Y > y)}{\Pr(Y > y)} = \frac{\Pr(X > x,~ Y > y)}{0.75} \approx 0.5 = \Pr(X > x) \\ \Pr(Y > y ~|~ X > x) &=& \frac{\Pr(X > x,~ Y > y)}{\Pr(X > x)} = \frac{\Pr(X > x,~ Y > y)}{0.5} \approx 0.75 = \Pr(Y > y) \end{eqnarray}\]

Again, the sample probabilities support the hypothesis that \(X\) and \(Y\) are independent.

C. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

We test for independence with the following null and alternative hypotheses:

  • \(H_0\): \(X\) and \(Y\) are independent, and deviations in the frequency of \(X > x\) and \(Y > y\) versus the expected values assuming independence are explained by random chance.
  • \(H_A\): \(X\) and \(Y\) are dependent.

We compute the relevant test statistic under the null hypothesis that \(X\) and \(Y\) are independent, and then compare the resulting p-value to the significance level \(\alpha = 0.05\). If the p-value is \(< \alpha\), then we reject \(H_0\) and conclude that the variables are dependent; however if the p-value is \(> \alpha\), then we conclude that the evidence does not support rejecting the hypothesis that \(X\) and \(Y\) are independent.

In this case, we use two test statistics: Fisher’s exact test and the chi-squared test. We compute the test statistics and related p-values using the R functions fisher.test and chisq.test, with the chi-squared test shown both with and without the “continuity correction”. Note that the p-values from Fisher’s exact test and the chi-squared test with “continuity correction” are extremely close, if not identical.

# calc Fisher's exact test
fisher.test(df3)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  df3
## p-value = 0.8716
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9202847 1.1052820
## sample estimates:
## odds ratio 
##    1.00857
# calc chi-square test with & without continuity correction
chisq.test(df3)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  df3
## X-squared = 0.026133, df = 1, p-value = 0.8716
chisq.test(df3, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  df3
## X-squared = 0.034133, df = 1, p-value = 0.8534

Generally (in almost all runs), the p-values from the two tests are \(\gg \alpha\). Hence we can conclude that the sample data for \(X\) and \(Y\) do not support the alternative hypothesis that the variables are dependent.

Finally, from Wikipedia, Fisher’s exact test provides a significance test statistic that can be computed exactly (using the hypergeometric distribution), which is valid when the marginal (row and column) totals for the contingency table are held fixed. In contrast, the chi-squared test statistic provides only an approximate significance value, which is valid only for large sample sizes and for frequency counts greater than 5-10 in each cell of the table. In our case, the sample size of 10,000 is large enough that either test can be used, and the corresponding p-values are very close.

Problem 2

We start by loading and viewing the Ames housing training dataset. The dataset has been downloaded from the Kaggle website and saved to local disk as well as my GitHub repo, which is private. For reproducibility, the R code below loads the data from disk.

The training dataset includes 1,460 observations of 81 variables, which includes the target variable SalePrice and 80 feature variables.

# local directory
path <- "train.csv"
# private GitHub repo; need to accept invitation for access
# path <- "https://raw.githubusercontent.com/kecbenson/DATA605/master/train.csv?token=AJYRCNGY6EDCSXGWG2PULFC45CHCA"

# load and view training dataset
df_train <- read.csv(path)
glimpse(df_train)
## Observations: 1,460
## Variables: 81
## $ Id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ MSSubClass    <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ MSZoning      <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, ...
## $ LotFrontage   <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, ...
## $ LotArea       <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ Street        <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, ...
## $ Alley         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LotShape      <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg, Reg...
## $ LandContour   <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl...
## $ Utilities     <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, ...
## $ LotConfig     <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside...
## $ LandSlope     <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl...
## $ Neighborhood  <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mit...
## $ Condition1    <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN,...
## $ Condition2    <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, ...
## $ BldgType      <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, ...
## $ HouseStyle    <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin, ...
## $ OverallQual   <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond   <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt     <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd  <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ RoofStyle     <fct> Gable, Gable, Gable, Gable, Gable, Gable, Gable,...
## $ RoofMatl      <fct> CompShg, CompShg, CompShg, CompShg, CompShg, Com...
## $ Exterior1st   <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, Vin...
## $ Exterior2nd   <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, Vin...
## $ MasVnrType    <fct> BrkFace, None, BrkFace, None, BrkFace, None, Sto...
## $ MasVnrArea    <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ ExterQual     <fct> Gd, TA, Gd, TA, Gd, TA, Gd, TA, TA, TA, TA, Ex, ...
## $ ExterCond     <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ Foundation    <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, PConc...
## $ BsmtQual      <fct> Gd, Gd, Gd, TA, Gd, Gd, Ex, Gd, TA, TA, TA, Ex, ...
## $ BsmtCond      <fct> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ BsmtExposure  <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No, No, ...
## $ BsmtFinType1  <fct> GLQ, ALQ, GLQ, ALQ, GLQ, GLQ, GLQ, ALQ, Unf, GLQ...
## $ BsmtFinSF1    <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinType2  <fct> Unf, Unf, Unf, Unf, Unf, Unf, Unf, BLQ, Unf, Unf...
## $ BsmtFinSF2    <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF     <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF   <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ Heating       <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, ...
## $ HeatingQC     <fct> Ex, Ex, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Ex, Ex, Ex, ...
## $ CentralAir    <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ Electrical    <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr,...
## $ X1stFlrSF     <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ X2ndFlrSF     <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea     <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ BsmtFullBath  <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, ...
## $ BsmtHalfBath  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FullBath      <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, ...
## $ HalfBath      <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ BedroomAbvGr  <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, ...
## $ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, ...
## $ KitchenQual   <fct> Gd, TA, Gd, Gd, Gd, TA, Gd, TA, TA, TA, TA, Ex, ...
## $ TotRmsAbvGrd  <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ Functional    <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Min1, Ty...
## $ Fireplaces    <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, ...
## $ FireplaceQu   <fct> NA, TA, TA, Gd, TA, NA, Gd, TA, TA, TA, NA, Gd, ...
## $ GarageType    <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Attchd, ...
## $ GarageYrBlt   <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, ...
## $ GarageFinish  <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf, RFn...
## $ GarageCars    <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea    <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ GarageQual    <fct> TA, TA, TA, TA, TA, TA, TA, TA, Fa, Gd, TA, TA, ...
## $ GarageCond    <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ PavedDrive    <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ WoodDeckSF    <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF   <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PoolQC        <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fence         <fct> NA, NA, NA, NA, NA, MnPrv, NA, NA, NA, NA, NA, N...
## $ MiscFeature   <fct> NA, NA, NA, NA, NA, Shed, NA, Shed, NA, NA, NA, ...
## $ MiscVal       <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold        <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold        <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleType      <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, New,...
## $ SaleCondition <fct> Normal, Normal, Normal, Abnorml, Normal, Normal,...
## $ SalePrice     <int> 208500, 181500, 223500, 140000, 250000, 143000, ...

A. Descriptive and Inferential Statistics

  • Provide univariate descriptive statistics and appropriate plots for the training data set.

    First let’s review summary statistics for the data using the summary function.

    summary(df_train)
    ##        Id           MSSubClass       MSZoning     LotFrontage    
    ##  Min.   :   1.0   Min.   : 20.0   C (all):  10   Min.   : 21.00  
    ##  1st Qu.: 365.8   1st Qu.: 20.0   FV     :  65   1st Qu.: 59.00  
    ##  Median : 730.5   Median : 50.0   RH     :  16   Median : 69.00  
    ##  Mean   : 730.5   Mean   : 56.9   RL     :1151   Mean   : 70.05  
    ##  3rd Qu.:1095.2   3rd Qu.: 70.0   RM     : 218   3rd Qu.: 80.00  
    ##  Max.   :1460.0   Max.   :190.0                  Max.   :313.00  
    ##                                                  NA's   :259     
    ##     LotArea        Street      Alley      LotShape  LandContour
    ##  Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63   
    ##  1st Qu.:  7554   Pave:1454   Pave:  41   IR2: 41   HLS:  50   
    ##  Median :  9478               NA's:1369   IR3: 10   Low:  36   
    ##  Mean   : 10517                           Reg:925   Lvl:1311   
    ##  3rd Qu.: 11602                                                
    ##  Max.   :215245                                                
    ##                                                                
    ##   Utilities      LotConfig    LandSlope   Neighborhood   Condition1  
    ##  AllPub:1459   Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260  
    ##  NoSeWa:   1   CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81  
    ##                FR2    :  47   Sev:  13   OldTown:113   Artery :  48  
    ##                FR3    :   4              Edwards:100   RRAn   :  26  
    ##                Inside :1052              Somerst: 86   PosN   :  19  
    ##                                          Gilbert: 79   RRAe   :  11  
    ##                                          (Other):707   (Other):  15  
    ##    Condition2     BldgType      HouseStyle   OverallQual    
    ##  Norm   :1445   1Fam  :1220   1Story :726   Min.   : 1.000  
    ##  Feedr  :   6   2fmCon:  31   2Story :445   1st Qu.: 5.000  
    ##  Artery :   2   Duplex:  52   1.5Fin :154   Median : 6.000  
    ##  PosN   :   2   Twnhs :  43   SLvl   : 65   Mean   : 6.099  
    ##  RRNn   :   2   TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.000  
    ##  PosA   :   1                 1.5Unf : 14   Max.   :10.000  
    ##  (Other):   2                 (Other): 19                   
    ##   OverallCond      YearBuilt     YearRemodAdd    RoofStyle   
    ##  Min.   :1.000   Min.   :1872   Min.   :1950   Flat   :  13  
    ##  1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967   Gable  :1141  
    ##  Median :5.000   Median :1973   Median :1994   Gambrel:  11  
    ##  Mean   :5.575   Mean   :1971   Mean   :1985   Hip    : 286  
    ##  3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004   Mansard:   7  
    ##  Max.   :9.000   Max.   :2010   Max.   :2010   Shed   :   2  
    ##                                                              
    ##     RoofMatl     Exterior1st   Exterior2nd    MasVnrType    MasVnrArea    
    ##  CompShg:1434   VinylSd:515   VinylSd:504   BrkCmn : 15   Min.   :   0.0  
    ##  Tar&Grv:  11   HdBoard:222   MetalSd:214   BrkFace:445   1st Qu.:   0.0  
    ##  WdShngl:   6   MetalSd:220   HdBoard:207   None   :864   Median :   0.0  
    ##  WdShake:   5   Wd Sdng:206   Wd Sdng:197   Stone  :128   Mean   : 103.7  
    ##  ClyTile:   1   Plywood:108   Plywood:142   NA's   :  8   3rd Qu.: 166.0  
    ##  Membran:   1   CemntBd: 61   CmentBd: 60                 Max.   :1600.0  
    ##  (Other):   2   (Other):128   (Other):136                 NA's   :8       
    ##  ExterQual ExterCond  Foundation  BsmtQual   BsmtCond    BsmtExposure
    ##  Ex: 52    Ex:   3   BrkTil:146   Ex  :121   Fa  :  45   Av  :221    
    ##  Fa: 14    Fa:  28   CBlock:634   Fa  : 35   Gd  :  65   Gd  :134    
    ##  Gd:488    Gd: 146   PConc :647   Gd  :618   Po  :   2   Mn  :114    
    ##  TA:906    Po:   1   Slab  : 24   TA  :649   TA  :1311   No  :953    
    ##            TA:1282   Stone :  6   NA's: 37   NA's:  37   NA's: 38    
    ##                      Wood  :  3                                      
    ##                                                                      
    ##  BsmtFinType1   BsmtFinSF1     BsmtFinType2   BsmtFinSF2     
    ##  ALQ :220     Min.   :   0.0   ALQ :  19    Min.   :   0.00  
    ##  BLQ :148     1st Qu.:   0.0   BLQ :  33    1st Qu.:   0.00  
    ##  GLQ :418     Median : 383.5   GLQ :  14    Median :   0.00  
    ##  LwQ : 74     Mean   : 443.6   LwQ :  46    Mean   :  46.55  
    ##  Rec :133     3rd Qu.: 712.2   Rec :  54    3rd Qu.:   0.00  
    ##  Unf :430     Max.   :5644.0   Unf :1256    Max.   :1474.00  
    ##  NA's: 37                      NA's:  38                     
    ##    BsmtUnfSF       TotalBsmtSF      Heating     HeatingQC CentralAir
    ##  Min.   :   0.0   Min.   :   0.0   Floor:   1   Ex:741    N:  95    
    ##  1st Qu.: 223.0   1st Qu.: 795.8   GasA :1428   Fa: 49    Y:1365    
    ##  Median : 477.5   Median : 991.5   GasW :  18   Gd:241              
    ##  Mean   : 567.2   Mean   :1057.4   Grav :   7   Po:  1              
    ##  3rd Qu.: 808.0   3rd Qu.:1298.2   OthW :   2   TA:428              
    ##  Max.   :2336.0   Max.   :6110.0   Wall :   4                       
    ##                                                                     
    ##  Electrical     X1stFlrSF      X2ndFlrSF     LowQualFinSF    
    ##  FuseA:  94   Min.   : 334   Min.   :   0   Min.   :  0.000  
    ##  FuseF:  27   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
    ##  FuseP:   3   Median :1087   Median :   0   Median :  0.000  
    ##  Mix  :   1   Mean   :1163   Mean   : 347   Mean   :  5.845  
    ##  SBrkr:1334   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
    ##  NA's :   1   Max.   :4692   Max.   :2065   Max.   :572.000  
    ##                                                              
    ##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
    ##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
    ##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
    ##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
    ##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
    ##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
    ##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
    ##                                                                   
    ##     HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual
    ##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex:100     
    ##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa: 39     
    ##  Median :0.0000   Median :3.000   Median :1.000   Gd:586     
    ##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   TA:735     
    ##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000              
    ##  Max.   :2.0000   Max.   :8.000   Max.   :3.000              
    ##                                                              
    ##   TotRmsAbvGrd    Functional    Fireplaces    FireplaceQu   GarageType 
    ##  Min.   : 2.000   Maj1:  14   Min.   :0.000   Ex  : 24    2Types :  6  
    ##  1st Qu.: 5.000   Maj2:   5   1st Qu.:0.000   Fa  : 33    Attchd :870  
    ##  Median : 6.000   Min1:  31   Median :1.000   Gd  :380    Basment: 19  
    ##  Mean   : 6.518   Min2:  34   Mean   :0.613   Po  : 20    BuiltIn: 88  
    ##  3rd Qu.: 7.000   Mod :  15   3rd Qu.:1.000   TA  :313    CarPort:  9  
    ##  Max.   :14.000   Sev :   1   Max.   :3.000   NA's:690    Detchd :387  
    ##                   Typ :1360                               NA's   : 81  
    ##   GarageYrBlt   GarageFinish   GarageCars      GarageArea     GarageQual 
    ##  Min.   :1900   Fin :352     Min.   :0.000   Min.   :   0.0   Ex  :   3  
    ##  1st Qu.:1961   RFn :422     1st Qu.:1.000   1st Qu.: 334.5   Fa  :  48  
    ##  Median :1980   Unf :605     Median :2.000   Median : 480.0   Gd  :  14  
    ##  Mean   :1979   NA's: 81     Mean   :1.767   Mean   : 473.0   Po  :   3  
    ##  3rd Qu.:2002                3rd Qu.:2.000   3rd Qu.: 576.0   TA  :1311  
    ##  Max.   :2010                Max.   :4.000   Max.   :1418.0   NA's:  81  
    ##  NA's   :81                                                              
    ##  GarageCond  PavedDrive   WoodDeckSF      OpenPorchSF     EnclosedPorch   
    ##  Ex  :   2   N:  90     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
    ##  Fa  :  35   P:  30     1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
    ##  Gd  :   9   Y:1340     Median :  0.00   Median : 25.00   Median :  0.00  
    ##  Po  :   7              Mean   : 94.24   Mean   : 46.66   Mean   : 21.95  
    ##  TA  :1326              3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00  
    ##  NA's:  81              Max.   :857.00   Max.   :547.00   Max.   :552.00  
    ##                                                                           
    ##    X3SsnPorch      ScreenPorch        PoolArea        PoolQC    
    ##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Ex  :   2  
    ##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000   Fa  :   2  
    ##  Median :  0.00   Median :  0.00   Median :  0.000   Gd  :   3  
    ##  Mean   :  3.41   Mean   : 15.06   Mean   :  2.759   NA's:1453  
    ##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000              
    ##  Max.   :508.00   Max.   :480.00   Max.   :738.000              
    ##                                                                 
    ##    Fence      MiscFeature    MiscVal             MoSold      
    ##  GdPrv:  59   Gar2:   2   Min.   :    0.00   Min.   : 1.000  
    ##  GdWo :  54   Othr:   2   1st Qu.:    0.00   1st Qu.: 5.000  
    ##  MnPrv: 157   Shed:  49   Median :    0.00   Median : 6.000  
    ##  MnWw :  11   TenC:   1   Mean   :   43.49   Mean   : 6.322  
    ##  NA's :1179   NA's:1406   3rd Qu.:    0.00   3rd Qu.: 8.000  
    ##                           Max.   :15500.00   Max.   :12.000  
    ##                                                              
    ##      YrSold        SaleType    SaleCondition    SalePrice     
    ##  Min.   :2006   WD     :1267   Abnorml: 101   Min.   : 34900  
    ##  1st Qu.:2007   New    : 122   AdjLand:   4   1st Qu.:129975  
    ##  Median :2008   COD    :  43   Alloca :  12   Median :163000  
    ##  Mean   :2008   ConLD  :   9   Family :  20   Mean   :180921  
    ##  3rd Qu.:2009   ConLI  :   5   Normal :1198   3rd Qu.:214000  
    ##  Max.   :2010   ConLw  :   5   Partial: 125   Max.   :755000  
    ##                 (Other):   9

    Then we review the distribution of the target variable SalePrice as well as the key variable GrLivArea. Note that both variables are skewed to the right and appear to follow a log-normal distribution.

    attach(df_train)
    par(mfrow = c(1, 2))
    hist(SalePrice)
    # log transform
    hist(log(SalePrice), main = "Histogram of log(SalePrice)")

    hist(GrLivArea)
    # log transform
    hist(log(GrLivArea), main = "Histogram of log(GrLivArea)")

    Next we view bivariate scatter plots (using the pairs function) for groups of selected numeric variables:
    • SalePrice, GrLivArea, LotArea
    • SalePrice, OverallQual, OverallCond
    • SalePrice, YearBuilt, YearRemodAdd
    pairs(cbind(SalePrice, GrLivArea, LotArea),
          main = "Bivariate plots of SalePrice, GrLivArea, LotArea")

    pairs(cbind(SalePrice, OverallQual, OverallCond),
          main = "Bivariate plots of SalePrice, OverallQual, OverallCond")

    pairs(cbind(SalePrice, YearBuilt, YearRemodAdd), 
          main = "Bivariate plots of SalePrice, YearBuilt, YearRemodAdd")

    We also view boxplots of the SalePrice variable for groups of selected categorical and discrete numeric variables:
    • Neighborhood, OverallQual
    • BldgType, HouseStyle
    • FullBath, TotBath (= FullBath + 0.5 * HalfBath)
    • BedroomAbvGr, TotRmsAbvGrd
    • YrSold, MoSold
    • SaleType, SaleCondition
    • RoofMatl, KitchenQual
    • BsmtQual, GarageCars
    par(las = 1)
    boxplot(SalePrice ~ Neighborhood, horizontal = TRUE, main = "SalePrice vs. Neighborhood")

    boxplot(SalePrice ~ OverallQual, horizontal = TRUE, main = "SalePrice vs. OverallQual")

    par(mfrow = c(1, 2))
    boxplot(SalePrice ~ BldgType, main = "SalePrice vs. BldgType")
    boxplot(SalePrice ~ HouseStyle, main = "SalePrice vs. HouseStyle")

    # define total baths
    TotBath <- FullBath + 0.5 * HalfBath
    df_train <- cbind(df_train, TotBath)
    boxplot(SalePrice ~ FullBath, main = "SalePrice vs. FullBath")
    boxplot(SalePrice ~ TotBath, main = "SalePrice vs. TotBath")

    boxplot(SalePrice ~ BedroomAbvGr, main = "SalePrice vs. BedroomAbvGr")
    boxplot(SalePrice ~ TotRmsAbvGrd, main = "SalePrice vs. TotRmsAbvGrd")

    boxplot(SalePrice ~ YrSold, main = "SalePrice vs. YrSold")
    boxplot(SalePrice ~ MoSold, main = "SalePrice vs. MoSold")

    boxplot(SalePrice ~ SaleType, main = "SalePrice vs. SaleType")
    boxplot(SalePrice ~ SaleCondition, main = "SalePrice vs. SaleCondition")

    boxplot(SalePrice ~ RoofMatl, main = "SalePrice vs. RoofMatl")
    boxplot(SalePrice ~ KitchenQual, main = "SalePrice vs. KitchenQual")

    boxplot(SalePrice ~ BsmtQual, main = "SalePrice vs. BsmtQual")
    boxplot(SalePrice ~ GarageCars, main = "SalePrice vs. GarageCars")

    Note that several pairs of variables are highly correlated; if we decide to use these variables in a multiple regression, we should make sure to avoid using both variables in any pair below, in order to mitigate collinearity issues:
    • YearBuilt and YearRemodAdd
    • FullBath and TotBath
    • BedroomAbvGr and TotRmsAbvGrd
    cor(GrLivArea, LotArea)
    ## [1] 0.2631162
    cor(YearBuilt, YearRemodAdd)
    ## [1] 0.592855
    cor(FullBath, TotBath)
    ## [1] 0.9201157
    cor(BedroomAbvGr, TotRmsAbvGrd)
    ## [1] 0.6766199

    From these plots, we see can narrow down the complete set of 80 feature variables to a short list of potential predictor variables to use in a multiple regression model for the sale price:

    • GrLivArea
    • OverallQual
    • LotArea
    • YearRemodAdd
    • Neighborhood
    • HouseStyle
    • TotBath
    • TotRmsAbvGrd
    • MoSold
    • SaleType
    • SaleCondition
    • RoofMatl
    • KitchenQual
    • BsmtQual
    • GarageCars.
  • Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

    Let’s view a scatter plot of SalePrice vs GrLivArea, and then segment the same relationship by OverallQual, Neighborhood, and HouseStyle. The slope of the best-fit line for each scatter plot gives the price per square foot of gross living area, which is a key metric used in the real estate industry. Judging from the graphs, these appear to be key variables to use when starting the regression model for SalePrice. For instance, higher quality homes (OverallQual = 7, 8, 9), certain neighborhoods (Neighborhood = NoRidge, NridgHt, StoneBr), and certain house styles (HouseStyle = 1Story) have higher prices per square foot than other homes.

    # SalePrice vs GrLivArea
    g <- ggplot(df_train, aes(x = GrLivArea, y = SalePrice))
    g + geom_point(alpha = 0.2) + 
        geom_smooth(method = "lm", se = FALSE) + 
        labs(title = "SalePrice vs. GrLivArea")

    # SalePrice vs GrLivArea, by OverallQual
    g + geom_point(aes(color = OverallQual), alpha = 0.5) + 
        geom_smooth(method = "lm", se = FALSE) + 
        facet_wrap(~ OverallQual) + guides(color = "none") +
        labs(title = "SalePrice vs. GrLivArea, by OverallQual")

    # SalePrice vs GrLivArea, by Neighborhood
    g + geom_point(aes(color = Neighborhood), alpha = 0.5) + 
        geom_smooth(method = "lm", se = FALSE) + 
        facet_wrap(~ Neighborhood) + guides(color = "none") +
        labs(title = "SalePrice vs. GrLivArea, by Neighborhood")

    # SalePrice vs GrLivArea, by HouseStyle
    g + geom_point(aes(color = HouseStyle), alpha = 0.5) + 
        geom_smooth(method = "lm", se = FALSE) + 
        facet_wrap(~ HouseStyle) + guides(color = "none") +
        labs(title = "SalePrice vs. GrLivArea, by HouseStyle")

    We also view a scatter plot of SalePrice vs log(LotArea), and then segment the same relationship by HouseStyle, BedroomAbvGr, and TotBath. We log-transform the LotArea variable because of outliers. Based on the graphs, we can see that the price per square foot of lot area is not as strong a relationship as the price per square foot of gross living area.

    # SalePrice vs LotArea
    h <- ggplot(df_train, aes(x = log(LotArea), y = SalePrice))
    h + geom_point(alpha = 0.2) + 
        geom_smooth(method = "lm", se = FALSE) + 
        labs(title = "SalePrice vs. log(LotArea)")

    # SalePrice vs LotArea, by HouseStyle
    h + geom_point(aes(color = HouseStyle), alpha = 0.5) + 
        geom_smooth(method = "lm", se = FALSE) +
        facet_wrap(~ HouseStyle) + guides(color = "none") +
        labs(title = "SalePrice vs. log(LotArea), by HouseStyle")

    # SalePrice vs LotArea, by BedroomAbvGr
    h + geom_point(aes(color = BedroomAbvGr), alpha = 0.5) + 
        geom_smooth(method = "lm", se = FALSE) + 
        facet_wrap(~ BedroomAbvGr) + guides(color = "none") +
        labs(title = "SalePrice vs. log(LotArea), by BedroomAbvGr")

    # SalePrice vs LotArea, by TotBath
    h + geom_point(aes(color = TotBath), alpha = 0.5) + 
        geom_smooth(method = "lm", se = FALSE) + 
        facet_wrap(~ TotBath) + guides(color = "none") +
        labs(title = "SalePrice vs. LotArea, by TotBath")

    Finally, since the SalePrice and GrLivArea variables are skewed to the right, let’s log-transform these variables and view the relationship segmented by OverallQual, BldgType, and HouseStyle. These graphs confirm that OverallQual, BldgType, and HouseStyle are key variables to consider in predicting SalePrice.

    # SalePrice vs. GrLivArea, by OverallQual and BldgType
    g + geom_point(aes(color = OverallQual), alpha = 0.5) + 
        scale_x_log10() + scale_y_log10() + 
        geom_smooth(method = "lm", se = FALSE) + 
        facet_wrap(~ BldgType) +
        labs(title = "log(SalePrice) vs. log(GrLivArea), by OverallQual and BldgType")

    # SalePrice vs. GrLivArea, by OverallQual and HouseStyle
    g + geom_point(aes(color = OverallQual), alpha = 0.5) + 
        scale_x_log10() + scale_y_log10() + 
        geom_smooth(method = "lm", se = FALSE) + 
        facet_wrap(~ HouseStyle) +
        labs(title = "log(SalePrice) vs. log(GrLivArea), by OverallQual and HouseStyle")

  • Derive a correlation matrix for any three quantitative variables in the dataset.

    We choose the variables LotArea, GrLivArea, and SalePrice and use the R function cor to compute the correlation matrix.

    correln <- cor(cbind('LSZ' = df_train$LotArea, 'GLA' = df_train$GrLivArea, 'PRX' = df_train$SalePrice))
    kable(correln, digits = 4, align = 'ccc', caption = 'Correlation matrix')
    Correlation matrix
    LSZ GLA PRX
    LSZ 1.0000 0.2631 0.2638
    GLA 0.2631 1.0000 0.7086
    PRX 0.2638 0.7086 1.0000
  • Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.

    We use the sample pairwise correlations \(r_{ij}\) for variables \(i\) and \(j\) from the correlation matrix to infer whether the true population correlations \(\rho_{ij}\) are different than 0, at a significance level of \(\alpha = 0.05\). The null and alternative hypotheses for this test are as follows:

    • \(H_0\): \(\rho_{ij} = 0\)
    • \(H_A\): \(\rho_{ij} \neq 0\)

    From Wikipedia, the test statistic for the correlation test is:

    \[t_{ij} = \frac{r_{ij}\sqrt{n-2}}{\sqrt{1 - r_{ij}^2}}\]

    Assuming that the null hypothesis \(H_0\) is true and that the variables \(i\) and \(j\) are independently normally distributed, then the test statistic \(t_{ij}\) follows the student’s \(t\)-distribution with \(n-2\) degrees of freedom. This is approximately true for non-normally distributed variables \(i\) and \(j\) so long as the sample size is large enough, which we can assume to be the case for \(n=1460\).

    So we compute the test statistics \(t_{ij}\) and determine the associated p-values:

    \[\begin{eqnarray} \text{LSZ-GLA:} && t_{12} = 10.41 &\implies& \Pr(|t_{12}| > 10.41) \approx 0 \\ \text{LSZ-PRX:} && t_{13} = 10.44 &\implies& \Pr(|t_{13}| > 10.44) \approx 0 \\ \text{GLA-PRX:} && t_{23} = 38.35 &\implies& \Pr(|t_{23}| > 38.35) \approx 0 \end{eqnarray}\]

    n <- nrow(df_train)     # sample size
    r <- c(correln[1,2], correln[1,3], correln[2,3])
    (t <- r * sqrt(n - 2) / sqrt(1 - r^2))  # test statistic for sample correlation
    ## [1] 10.41370 10.44463 38.34821
    (p <- 2 * (1 - pt(t, df = n - 2)))  # p-value for two-tailed test
    ## [1] 0 0 0

    The p-values are approximately 0 and in any case are \(\ll \alpha\), so we reject the null hypothesis and conclude that each pairwise correlation \(\rho_{ij} \neq 0\). Note that we can use the R function cor.test to confirm the test statistics and p-values above, and also to compute the 80% confidence intervals.

    \[\begin{eqnarray} \text{LSZ-GLA:} && 0.2316 \leq \rho_{12} \leq 0.2941 \\ \text{LSZ-PRX:} && 0.2323 \leq \rho_{13} \leq 0.2948 \\ \text{GLA-PRX:} && 0.6915 \leq \rho_{23} \leq 0.7249 \end{eqnarray}\]

    cor.test(df_train$LotArea, df_train$GrLivArea, conf.level = 0.8)
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  df_train$LotArea and df_train$GrLivArea
    ## t = 10.414, df = 1458, p-value < 2.2e-16
    ## alternative hypothesis: true correlation is not equal to 0
    ## 80 percent confidence interval:
    ##  0.2315997 0.2940809
    ## sample estimates:
    ##       cor 
    ## 0.2631162
    cor.test(df_train$LotArea, df_train$SalePrice, conf.level = 0.8)
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  df_train$LotArea and df_train$SalePrice
    ## t = 10.445, df = 1458, p-value < 2.2e-16
    ## alternative hypothesis: true correlation is not equal to 0
    ## 80 percent confidence interval:
    ##  0.2323391 0.2947946
    ## sample estimates:
    ##       cor 
    ## 0.2638434
    cor.test(df_train$GrLivArea, df_train$SalePrice, conf.level = 0.8)
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  df_train$GrLivArea and df_train$SalePrice
    ## t = 38.348, df = 1458, p-value < 2.2e-16
    ## alternative hypothesis: true correlation is not equal to 0
    ## 80 percent confidence interval:
    ##  0.6915087 0.7249450
    ## sample estimates:
    ##       cor 
    ## 0.7086245
  • Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

    In our analysis, we computed the sample correlations \(r_{ij}\) from the sample data, and then applied statistical inference (hypothesis testing) to determine whether the \(r_{ij} \neq 0\) are sufficient to conclude that the true population correlations \(\rho_{ij} \neq 0\) at a significance level of \(\alpha = 0.05\). The calculated test statistics and related p-values supported the conclusion that the \(\rho_{ij} \neq 0\) at \(\alpha = 0.05\) significance.

    In the last step, we applied the same hypothesis test on three pairwise correlations (\(\rho_{12}\), \(\rho_{13}\), and \(\rho_{23}\)). Generally, when testing multiple hypotheses on related variables at the same time, the risk is that we increase the likelihood of a legitimate rare event, i.e., a test statistic that is validly in the tail of the distribution for the null hypothesis. In this circumstance, we would incorrectly reject the null hypothesis in favor of the alternative hypothesis (a Type I error). So in the general case, we should be worried about and examine ways to mitigate the possibility of familywise error.

    In this particular case, however, familywise error should not be a substantial risk, since the p-value for each hypothesis test is extraordinarily small (\(< 2.2 \cdot 10^{-16}\)). For instance, if we were to apply the Bonferroni correction (see Wikipedia) by using a significance level of \(\alpha / 3 \approx 0.017\), we would still reject the null hypothesis and conclude that \(\rho_{ij} \neq 0\).

B. Linear Algebra and Correlation

  • Invert the correlation matrix from above; the inverse matrix is known as the precision matrix and contains variance inflation factors on the diagonal.

    We use the R function solve to invert the correlation matrix and arrive at the precision matrix.

    precisn <- solve(correln)
    kable(precisn, digits = 4, align = 'ccc', caption = 'Precision matrix')
    Precision matrix
    LSZ GLA PRX
    LSZ 1.0884 -0.1665 -0.1692
    GLA -0.1665 2.0341 -1.3975
    PRX -0.1692 -1.3975 2.0349
  • Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

    The product of the correlation matrix and the precision matrix, and vice versa, are both equal to the identity matrix.

    (res1 <- correln %*% precisn)
    ##               LSZ           GLA          PRX
    ## LSZ  1.000000e+00  0.000000e+00 0.000000e+00
    ## GLA  1.387779e-17  1.000000e+00 2.220446e-16
    ## PRX -2.775558e-17 -2.220446e-16 1.000000e+00
    near(res1, diag(nrow(res1)))
    ##      LSZ  GLA  PRX
    ## LSZ TRUE TRUE TRUE
    ## GLA TRUE TRUE TRUE
    ## PRX TRUE TRUE TRUE
    (res2 <- precisn %*% correln)
    ##               LSZ         GLA          PRX
    ## LSZ  1.000000e+00 1.94289e-16 1.387779e-16
    ## GLA -1.110223e-16 1.00000e+00 0.000000e+00
    ## PRX -1.110223e-16 0.00000e+00 1.000000e+00
    near(res2, diag(nrow(res2)))
    ##      LSZ  GLA  PRX
    ## LSZ TRUE TRUE TRUE
    ## GLA TRUE TRUE TRUE
    ## PRX TRUE TRUE TRUE
  • Conduct LU decomposition on the correlation matrix.

    In Homework Assignment #2, we wrote a function to factorize a square matrix \(A\) into the LU decomposition. We use the same function here, as shown below.

    # Define the function LU to decompose an input square matrix A into a lower triangular matrix L and an 
    # upper triangular matrix U, with A = L %*% U.  The functon works by implementing a series of row 
    # reductions to eliminate all the below-diagonal elements of A.  
    # Several features to note about the function LU:
    #   - The function takes as input any n-dimensional square matrix A
    #   - Error handling is included for either of the following events:
    #       . The input matrix A is not square
    #       . A pivot value is zero
    #   - The function returns a named list of the lower and upper triangular matrices ("lower" and "upper").
    
    LU <- function(M) {
        n <- nrow(M)
        # check M is square matrix, else stop
        if (n != ncol(M)) stop("Not a square matrix")
        # initialize L to the identity matrix with the same column & row names as M
        L <- diag(nrow = n)
        colnames(L) <- colnames(M)
        rownames(L) <- rownames(M)
        # initialize U to the original matrix M
        U <- M
    
        # loop across the columns that have below-diagonal elements
        for ( j in 1:(n-1) ) {
            # check pivot value is non-zero, else stop
            if (U[j, j] == 0) stop(paste0("Zero pivot value, column j=", j))
            # loop across the rows that have below-diagonal elements
            for ( i in (j+1):n ) {
                factor <- - U[i, j] / U[j, j] 
                # row-reduce using pivot row
                U[i, ] <- U[i, ] + U[j, ] * factor
                L[i, j] <- - factor
            }
        }
        return(list("lower" = L, "upper" = U))
    }

    Now we use this function to factorize the correlation matrix, and confirm that the resulting lower and upper triangular matrices \(L\) and \(U\) satisfy \(A = LU\).

    (L <- LU(correln)$lower)
    ##           LSZ       GLA PRX
    ## LSZ 1.0000000 0.0000000   0
    ## GLA 0.2631162 1.0000000   0
    ## PRX 0.2638434 0.6867466   1
    (U <- LU(correln)$upper)
    ##     LSZ       GLA       PRX
    ## LSZ   1 0.2631162 0.2638434
    ## GLA   0 0.9307699 0.6392030
    ## PRX   0 0.0000000 0.4914162
    (A <- L %*% U)
    ##           LSZ       GLA       PRX
    ## LSZ 1.0000000 0.2631162 0.2638434
    ## GLA 0.2631162 1.0000000 0.7086245
    ## PRX 0.2638434 0.7086245 1.0000000
    correln - A
    ##     LSZ          GLA          PRX
    ## LSZ   0 0.000000e+00 0.000000e+00
    ## GLA   0 0.000000e+00 1.110223e-16
    ## PRX   0 1.110223e-16 0.000000e+00
    near(correln, A)
    ##      LSZ  GLA  PRX
    ## LSZ TRUE TRUE TRUE
    ## GLA TRUE TRUE TRUE
    ## PRX TRUE TRUE TRUE

C. Calculus-Based Probability & Statistics

  • Select a variable in the training dataset that is skewed to the right.

    We choose to analyze the SalePrice variable from the training dataset. As we saw above, the empirical distribution for this variable is skewed to the right. Also all values of the variable are greater than zero, so there is no need to shift the distribution above zero.

  • Use the MASS package to fit an exponential probability density function to the data.

    We load the MASS package and use fitdistr to fit an exponential probability density function to the data. The optimal value of \(\lambda\) for the fitted exponential distribution is found to be:

    \[ \lambda = 5.527 \cdot 10^{-6} \]

    library(MASS)
    # fit exponential distribution to SalePrice data
    f_exp <- fitdistr(df_train$SalePrice, "exponential")
    (lambda <- f_exp$estimate)
    ##         rate 
    ## 5.527268e-06
  • Generate a histogram from the fitted distribution and compare to the histogram of the selected variable.

    We generate 1,000 draws from the fitted distribution, which we use to produce a histogram of values overlaid with a plot of the fitted exponential probability density function. Then we compare this to the histogram of the SalePrice variable overlaid with the same density function.

    # generate sample of 1,000 draws from fitted distribution
    n <- 1000
    draws1 = rexp(n, lambda)
    
    # combine dataframe for side-by-side histograms
    df_hist1 <- rbind(data.frame(obs = draws1, grp = 'sample'), 
                      data.frame(obs = df_train$SalePrice, grp = 'empirical'))
    
    # plot histograms of sample & empirical values, along with fitted probability density
    ggplot(df_hist1) + geom_histogram(aes(x = obs, y = ..density..)) + 
        stat_function(fun = dexp, col = 'blue', args = list(rate = lambda)) + 
        facet_wrap(~ grp) + 
        labs(title = 'Comparison of sample & empirical distributions for SalePrice', 
             subtitle = 'Fitted exponential probability density')

    As we can see, the shape of the exponential distribution doesn’t match the shape of the empirical data, since the exponential density function increase as SalePrice goes to zero and it also misses the “hump” (mode of the data) near $180,000. A closer fit can be found by using a log-normal distribution, which we fit using the same fitdistr function from the MASS package. After fitting the log-normal distribution, a comparison of the sample and empirical distributions shows a closer match to the data.

    # try fitting log-normal distribution to data instead
    f_lnorm <- fitdistr(df_train$SalePrice, "lognormal")
    ml <- f_lnorm$estimate[1]
    sl <- f_lnorm$estimate[2]
    
    # generate sample of 1,000 draws from fitted distribution
    draws2 = rlnorm(n, meanlog = ml, sdlog = sl)
    
    # combine dataframe for side-by-side histograms
    df_hist2 <- rbind(data.frame(obs = draws2, grp = 'sample'), 
                      data.frame(obs = df_train$SalePrice, grp = 'empirical'))
    
    # plot histograms of sample & empirical values, along with fitted probability density
    ggplot(df_hist2) + geom_histogram(aes(x = obs, y = ..density..)) + 
        stat_function(fun = dlnorm, col = 'blue', args = list(meanlog = ml, sdlog = sl)) + 
        facet_wrap(~ grp) + 
        labs(title = 'Comparison of sample & empirical distributions for SalePrice', 
             subtitle = 'Fitted log-normal probability density')

  • Find the 5th and 95th percentiles from the fitted and empirical distributions, and compare this to the 95% confidence interval from the empirical data.

    From the table below, we see that:

    • The 5th and 95th percentiles from the fitted exponential distribution are approximately $9,300 and $542,000, with the sample distribution showing slight deviations compared to the theoretical (exact) distribution.
    • The 5th and 95th percentiles from the fitted log-normal distribution are approximately $86,400 and $321,500, with the sample distribution showing slight deviations compared to the theoretical (exact) distribution.

    For comparison:

    • The 5th and 95th percentiles from the empirical SalePrice data are $88,000 and $326,100.

    Consistent with the observation noted earlier, the 5th and 95th percentiles of the empirical distribution are better approximated by the fitted log-normal distribution than by the fitted exponential distribution.

    # range of percentiles
    pct <- c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975)
    # create & display table
    tble <- rbind('Exponential - theoretical' = qexp(pct, lambda), 
                  'Exponential - sample' = quantile(draws1, pct), 
                  'Log-normal - theoretical' = qlnorm(pct, ml, sl), 
                  'Log-normal - sample' = quantile(draws2, pct), 
                  'Empirical distribution' = quantile(df_train$SalePrice, pct))
    kable(tble, digits = 0, row.names = pct, format.args = list(big.mark = ','),
          caption = 'Percentiles from fitted and empirical distributions for SalePrice')
    Percentiles from fitted and empirical distributions for SalePrice
    2.5% 5% 25% 50% 75% 95% 97.5%
    Exponential - theoretical 4,581 9,280 52,048 125,405 250,810 541,991 667,396
    Exponential - sample 5,529 11,261 54,782 129,538 244,845 520,520 649,719
    Log-normal - theoretical 76,222 86,443 127,353 166,717 218,247 321,536 364,650
    Log-normal - sample 74,958 85,937 129,514 170,064 222,498 340,981 382,732
    Empirical distribution 80,000 88,000 129,975 163,000 214,000 326,100 384,511

    Next, from the sample SalePrice data in the training dataset, we can estimate the population mean \(\mu\) and construct a 95% confidence interval for the mean home sale price. We know from the Central Limit Theorem that the distribution of the sample mean will approximate a normal distribution for large enough sample size. Furthermore, assuming the SalePrice variable is normally distributed, we can use the Z-score (for larger sample sizes) or t-statistic (for smaller sample sizes) to determine the confidence interval of the sample mean.

    First, from the training dataset, we estimate the sample mean \(m\) and sample standard deviation \(s\):

    \[\begin{eqnarray} m &=& 180,921 \\ s &=& 79,443 \end{eqnarray}\]

    (m <- mean(df_train$SalePrice))
    ## [1] 180921.2
    (s <- sd(df_train$SalePrice))
    ## [1] 79442.5

    Then we find the critical \(Z\) and \(t\) values corresponding to p-values of 2.5% and 97.5%, where \(n=1460\) is the sample size and \(df = n-1 = 1459\) is the degrees of freedom for determining the critical t-values. Putting all this together, we arrive at the confidence intervals based on the Z-score and t-statistic:

    \[\begin{eqnarray} \text{Z-score:} && m - Z_{\text{crit}} \cdot \frac{s}{\sqrt{n}} &\leq& \mu \leq m + Z_{\text{crit}} \cdot \frac{s}{\sqrt{n}} &~\implies~& 176,846 &\leq& \mu \leq 184,996 \\ \text{t-statistic:} && m - t_{\text{crit}} \cdot \frac{s}{\sqrt{n}} &\leq& \mu \leq m + t_{\text{crit}} \cdot \frac{s}{\sqrt{n}} &~\implies~& 176,843 &\leq& \mu \leq 185,000 \end{eqnarray}\]

    n1 <- length(df_train$SalePrice)
    zcrit <- qnorm(c(0.025, 0.975))
    tcrit <- qt(c(0.025, 0.975), df = n1 - 1)
    m + zcrit * s / sqrt(n1)
    ## [1] 176846.2 184996.2
    m + tcrit * s / sqrt(n1)
    ## [1] 176842.8 184999.6

    As expected, the confidence intervals are virtually identical since the simple size of \(n=1460\) is very large.

    Finally, note that the 95% confidence interval tells us that we can be 95% confident that the true population mean \(\mu\) lies within the interval, or alternatively, that if we generate a large number of samples having the same sample size of \(n=1460\) and construct the corresponding confidence intervals, then 95% of these intervals will contain the true population mean.

    This concept differs from the 5th and 95th percentiles calculated above, which simply indicate the values of the SalePrice variable below which 5% and 95% of the distribution lies.

D. Modeling

  • Build some type of multiple regression model and submit the results to the Kaggle competition board.

    In part A above we arrived at the following short list of feature variables to use as potential predictors in a multiple regression model for SalePrice:
    • GrLivArea
    • OverallQual
    • LotArea
    • YearRemodAdd
    • Neighborhood
    • HouseStyle
    • TotBath
    • TotRmsAbvGrd
    • MoSold
    • SaleType
    • SaleCondition
    • RoofMatl
    • KitchenQual
    • BsmtQual
    • GarageCars.

    We will use this short list to develop several models for SalePrice, proceeding in order of increasing complexity. Because of the large number of variables, we use a version of forward selection, whereby we start with the most important variables and then selectively add new variables to the model.

    • Model 0: base model

      After some experimentation and applying some business logic, we arrive at a base linear model lm0 to start the analysis:

      \[\text{SalePrice} ~\sim~ \text{GrLivArea:Neighborhood} + \text{OverallQual}\]

      This model assumes that home prices are primarily determined by gross living area (total square feet), the price per square foot segmented by neighborhood (estimated slope coefficient), and an adjustment based on the overall quality of the house. This base model can account for \(R^2_{\text{adj}} = 0.836\) of the total variability in home prices in the training dataset.

      # test #0: 0.8363
      lm0 <- lm(SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual))
      s0 <- summary(lm0)
      a0 <- anova(lm0)
      s0$adj.r.squared
      ## [1] 0.8362588
    • Model 1: add HouseStyle to Model 0

      This model takes our base model and adds the next most important feature variable, HouseStyle:

      \[\begin{eqnarray} \text{SalePrice} &~\sim~& \text{GrLivArea:Neighborhood} + \text{OverallQual} \\ && + \text{HouseStyle} \end{eqnarray}\]

      The table below Model 3 records the adjusted \(R^2\) values for each candidate variable that can be added to Model 0. This version of the model has an \(R^2_{\text{adj}} = 0.848\).

      # test #1: 0.8478
      lm1 <- lm(SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual) + HouseStyle)
      s1 <- summary(lm1)
      a1 <- anova(lm1)
      s1$adj.r.squared
      ## [1] 0.8477604
    • Model 2: add RoofMatl to Model 1

      This model takes Model 1 and adds the next most important feature variable, RoofMatl:

      \[\begin{eqnarray} \text{SalePrice} &~\sim~& \text{GrLivArea:Neighborhood} + \text{OverallQual} \\ && + \text{HouseStyle} + \text{RoofMatl} \end{eqnarray}\]

      The table below Model 3 records the adjusted \(R^2\) values for each candidate variable that can be added to Model 1. This version of the model has an \(R^2_{\text{adj}} = 0.860\).

      # test #2: 0.8595
      lm2 <- lm(SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual) + HouseStyle + RoofMatl)
      s2 <- summary(lm2)
      a2 <- anova(lm2)
      s2$adj.r.squared
      ## [1] 0.8595235
    • Model 3: add SaleCondition to Model 2

      This model takes Model 2 and adds the next most important feature variable, SaleCondition:

      \[\begin{eqnarray} \text{SalePrice} &~\sim~& \text{GrLivArea:Neighborhood} + \text{OverallQual} \\ && + \text{HouseStyle} + \text{RoofMatl} + \text{SaleCondition} \end{eqnarray}\]

      The table below records the adjusted \(R^2\) values for each candidate variable that can be added to Model 2. This version of the model has an \(R^2_{\text{adj}} = 0.866\).

      # test #3: 0.8658
      lm3 <- lm(SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual) + HouseStyle + RoofMatl + SaleCondition)
      s3 <- summary(lm3)
      a3 <- anova(lm3)
      s3$adj.r.squared
      ## [1] 0.8657639

    The following table details the \(R^2_{\text{adj}}\) values that result from adding each variable (one at a time) to Models 0, 1, and 2, and supports the variables selected at each modeling step.

    Variable Added Model 0 Model 1 Model 2
    Starting \(R^2_{\text{adj}}\) 0.836 0.848 0.860
    LotArea 0.834 0.852 0.864
    YearBuilt 0.839 0.849 0.861
    YearRemodAdd 0.841 0.852 0.864
    HouseStyle 0.848
    BedroomAbvGr 0.838 0.848 0.861
    TotRmsAbvGrd 0.837 0.848 0.860
    FullBath 0.836 0.848 0.860
    HalfBath 0.837 0.849 0.860
    TotBath 0.836 0.848 0.859
    RoofMatl 0.847 0.860
    KitchenQual 0.843 0.853 0.865
    BsmtQual 0.842 0.853 0.865
    GarageCars 0.843 0.852 0.864
    SaleCondition 0.842 0.853 0.866
    SaleType 0.841 0.852 0.864

    We stop here since adding new variables to the model is yielding diminishing returns, as well as increasing the likelihood of over-fitting and collinearity among the predictor variables.

    Interestingly, some variables that might be expected to increase the explanatory power of the model turn out not to do so. For example, the variables YearRemodAdd, BedroomAbvGr, TotBath, KitchenQual, and GarageCars don’t appear to increase the \(R^2_{\text{adj}}\) materially. This may be the case because they might be correlated to other variables that already included in the model, such as OverallQual.

    Finally, we use the lm0 through lm3 models to predict values of the target variable SalePrice for the test dataset. The test dataset includes 1,459 observations with the same 80 feature variables found in the training dataset. We use the predict function to predict values of SalePrice for each of the 4 models and then save the results into 4 CSV files, which we submit to the Kaggle competition board.

    # detach training dataset to avoid confusion
    detach(df_train)
    
    # local directory
    path1 <- "test.csv"
    # private GitHub repo; need to accept invitation for access
    # path <- "https://raw.githubusercontent.com/kecbenson/DATA605/master/test.csv?token=AJYRCND5HTO6SZNTFJD2QT245L2FE"
    
    # load the test dataset
    df_test <- read.csv(path1)
    
    # predict values
    p0 <- predict(lm0, df_test)
    p1 <- predict(lm1, df_test)
    p2 <- predict(lm2, df_test)
    p3 <- predict(lm3, df_test)
    
    # save csv files
    write.csv(cbind(Id = df_test$Id, SalePrice = p0), file = "kb0.csv", row.names = FALSE)
    write.csv(cbind(Id = df_test$Id, SalePrice = p1), file = "kb1.csv", row.names = FALSE)
    write.csv(cbind(Id = df_test$Id, SalePrice = p2), file = "kb2.csv", row.names = FALSE)
    write.csv(cbind(Id = df_test$Id, SalePrice = p3), file = "kb3.csv", row.names = FALSE)
  • Provide a complete model summary and results with analysis.

    For each of the 4 models, we show below the model summary, ANOVA table, and diagnostic plots.

    • Model 0
    s0
    ## 
    ## Call:
    ## lm(formula = SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual))
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -235927  -15564    -315   13429  212193 
    ## 
    ## Coefficients:
    ##                                Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)                   3.065e+04  2.280e+04   1.345 0.178999    
    ## as.factor(OverallQual)2       1.077e+03  2.938e+04   0.037 0.970769    
    ## as.factor(OverallQual)3       1.993e+04  2.390e+04   0.834 0.404315    
    ## as.factor(OverallQual)4       3.610e+04  2.298e+04   1.571 0.116326    
    ## as.factor(OverallQual)5       4.626e+04  2.286e+04   2.023 0.043247 *  
    ## as.factor(OverallQual)6       5.675e+04  2.291e+04   2.478 0.013340 *  
    ## as.factor(OverallQual)7       7.263e+04  2.301e+04   3.156 0.001631 ** 
    ## as.factor(OverallQual)8       1.022e+05  2.320e+04   4.403 1.15e-05 ***
    ## as.factor(OverallQual)9       1.710e+05  2.382e+04   7.177 1.14e-12 ***
    ## as.factor(OverallQual)10      2.095e+05  2.473e+04   8.469  < 2e-16 ***
    ## GrLivArea:NeighborhoodBlmngtn 6.106e+01  6.244e+00   9.778  < 2e-16 ***
    ## GrLivArea:NeighborhoodBlueste 3.661e+01  1.642e+01   2.229 0.025959 *  
    ## GrLivArea:NeighborhoodBrDale  1.902e+01  7.544e+00   2.522 0.011777 *  
    ## GrLivArea:NeighborhoodBrkSide 4.202e+01  4.111e+00  10.223  < 2e-16 ***
    ## GrLivArea:NeighborhoodClearCr 6.738e+01  3.753e+00  17.955  < 2e-16 ***
    ## GrLivArea:NeighborhoodCollgCr 6.510e+01  3.046e+00  21.368  < 2e-16 ***
    ## GrLivArea:NeighborhoodCrawfor 6.502e+01  3.063e+00  21.225  < 2e-16 ***
    ## GrLivArea:NeighborhoodEdwards 2.761e+01  2.985e+00   9.252  < 2e-16 ***
    ## GrLivArea:NeighborhoodGilbert 5.775e+01  3.139e+00  18.401  < 2e-16 ***
    ## GrLivArea:NeighborhoodIDOTRR  2.429e+01  5.192e+00   4.678 3.17e-06 ***
    ## GrLivArea:NeighborhoodMeadowV 2.589e+01  7.219e+00   3.586 0.000347 ***
    ## GrLivArea:NeighborhoodMitchel 5.465e+01  4.148e+00  13.177  < 2e-16 ***
    ## GrLivArea:NeighborhoodNAmes   4.845e+01  2.738e+00  17.697  < 2e-16 ***
    ## GrLivArea:NeighborhoodNoRidge 8.216e+01  2.784e+00  29.508  < 2e-16 ***
    ## GrLivArea:NeighborhoodNPkVill 4.373e+01  8.893e+00   4.918 9.76e-07 ***
    ## GrLivArea:NeighborhoodNridgHt 8.368e+01  3.374e+00  24.800  < 2e-16 ***
    ## GrLivArea:NeighborhoodNWAmes  5.527e+01  2.935e+00  18.834  < 2e-16 ***
    ## GrLivArea:NeighborhoodOldTown 3.102e+01  2.753e+00  11.268  < 2e-16 ***
    ## GrLivArea:NeighborhoodSawyer  4.788e+01  3.809e+00  12.570  < 2e-16 ***
    ## GrLivArea:NeighborhoodSawyerW 5.905e+01  3.295e+00  17.922  < 2e-16 ***
    ## GrLivArea:NeighborhoodSomerst 6.867e+01  3.448e+00  19.918  < 2e-16 ***
    ## GrLivArea:NeighborhoodStoneBr 9.014e+01  4.218e+00  21.373  < 2e-16 ***
    ## GrLivArea:NeighborhoodSWISU   3.391e+01  3.741e+00   9.066  < 2e-16 ***
    ## GrLivArea:NeighborhoodTimber  7.162e+01  3.760e+00  19.047  < 2e-16 ***
    ## GrLivArea:NeighborhoodVeenker 8.462e+01  6.665e+00  12.696  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 32150 on 1425 degrees of freedom
    ## Multiple R-squared:  0.8401, Adjusted R-squared:  0.8363 
    ## F-statistic: 220.2 on 34 and 1425 DF,  p-value: < 2.2e-16
    a0
    ## Analysis of Variance Table
    ## 
    ## Response: SalePrice
    ##                          Df     Sum Sq    Mean Sq F value    Pr(>F)    
    ## as.factor(OverallQual)    9 6.2999e+12 6.9999e+11 677.370 < 2.2e-16 ***
    ## GrLivArea:Neighborhood   25 1.4355e+12 5.7418e+10  55.563 < 2.2e-16 ***
    ## Residuals              1425 1.4726e+12 1.0334e+09                      
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    par(mfrow = c(2, 2))
    plot(lm0)

    • Model 1
    s1
    ## 
    ## Call:
    ## lm(formula = SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual) + 
    ##     HouseStyle)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -257430  -14498       4   13780  199565 
    ## 
    ## Coefficients:
    ##                                 Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)                     3381.558  22352.296   0.151 0.879773    
    ## as.factor(OverallQual)2         -519.774  28329.494  -0.018 0.985364    
    ## as.factor(OverallQual)3        16447.678  23049.615   0.714 0.475606    
    ## as.factor(OverallQual)4        34160.783  22165.786   1.541 0.123504    
    ## as.factor(OverallQual)5        43535.724  22063.594   1.973 0.048668 *  
    ## as.factor(OverallQual)6        55459.169  22104.371   2.509 0.012219 *  
    ## as.factor(OverallQual)7        69594.343  22207.933   3.134 0.001761 ** 
    ## as.factor(OverallQual)8        93176.866  22405.838   4.159 3.39e-05 ***
    ## as.factor(OverallQual)9       157022.204  23031.721   6.818 1.37e-11 ***
    ## as.factor(OverallQual)10      188671.291  23964.640   7.873 6.84e-15 ***
    ## HouseStyle1.5Unf                3602.793   8860.407   0.407 0.684351    
    ## HouseStyle1Story               17800.974   3292.811   5.406 7.55e-08 ***
    ## HouseStyle2.5Fin               -5893.478  12699.471  -0.464 0.642667    
    ## HouseStyle2.5Unf              -17302.051  10057.488  -1.720 0.085593 .  
    ## HouseStyle2Story               -7236.439   3396.604  -2.130 0.033303 *  
    ## HouseStyleSFoyer               26685.498   6125.392   4.357 1.42e-05 ***
    ## HouseStyleSLvl                 17147.874   4974.213   3.447 0.000583 ***
    ## GrLivArea:NeighborhoodBlmngtn     70.466      6.131  11.493  < 2e-16 ***
    ## GrLivArea:NeighborhoodBlueste     61.969     16.044   3.862 0.000117 ***
    ## GrLivArea:NeighborhoodBrDale      50.248      7.935   6.332 3.23e-10 ***
    ## GrLivArea:NeighborhoodBrkSide     61.868      4.508  13.725  < 2e-16 ***
    ## GrLivArea:NeighborhoodClearCr     80.385      3.866  20.795  < 2e-16 ***
    ## GrLivArea:NeighborhoodCollgCr     80.559      3.354  24.017  < 2e-16 ***
    ## GrLivArea:NeighborhoodCrawfor     79.490      3.319  23.951  < 2e-16 ***
    ## GrLivArea:NeighborhoodEdwards     41.229      3.210  12.842  < 2e-16 ***
    ## GrLivArea:NeighborhoodGilbert     76.961      3.609  21.323  < 2e-16 ***
    ## GrLivArea:NeighborhoodIDOTRR      45.721      5.520   8.283 2.75e-16 ***
    ## GrLivArea:NeighborhoodMeadowV     45.867      7.304   6.279 4.51e-10 ***
    ## GrLivArea:NeighborhoodMitchel     65.658      4.181  15.705  < 2e-16 ***
    ## GrLivArea:NeighborhoodNAmes       60.153      2.929  20.534  < 2e-16 ***
    ## GrLivArea:NeighborhoodNoRidge     97.345      3.088  31.520  < 2e-16 ***
    ## GrLivArea:NeighborhoodNPkVill     64.459      8.840   7.292 5.07e-13 ***
    ## GrLivArea:NeighborhoodNridgHt     98.937      3.613  27.380  < 2e-16 ***
    ## GrLivArea:NeighborhoodNWAmes      68.214      3.141  21.719  < 2e-16 ***
    ## GrLivArea:NeighborhoodOldTown     49.363      3.341  14.776  < 2e-16 ***
    ## GrLivArea:NeighborhoodSawyer      59.315      3.883  15.275  < 2e-16 ***
    ## GrLivArea:NeighborhoodSawyerW     73.835      3.521  20.971  < 2e-16 ***
    ## GrLivArea:NeighborhoodSomerst     86.875      3.816  22.764  < 2e-16 ***
    ## GrLivArea:NeighborhoodStoneBr    104.479      4.329  24.136  < 2e-16 ***
    ## GrLivArea:NeighborhoodSWISU       49.959      4.402  11.349  < 2e-16 ***
    ## GrLivArea:NeighborhoodTimber      84.666      3.878  21.832  < 2e-16 ***
    ## GrLivArea:NeighborhoodVeenker     95.874      6.558  14.620  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 31000 on 1418 degrees of freedom
    ## Multiple R-squared:  0.852,  Adjusted R-squared:  0.8478 
    ## F-statistic: 199.2 on 41 and 1418 DF,  p-value: < 2.2e-16
    a1
    ## Analysis of Variance Table
    ## 
    ## Response: SalePrice
    ##                          Df     Sum Sq    Mean Sq F value    Pr(>F)    
    ## as.factor(OverallQual)    9 6.2999e+12 6.9999e+11 728.545 < 2.2e-16 ***
    ## HouseStyle                7 7.7976e+10 1.1139e+10  11.594 2.143e-14 ***
    ## GrLivArea:Neighborhood   25 1.4676e+12 5.8706e+10  61.101 < 2.2e-16 ***
    ## Residuals              1418 1.3624e+12 9.6080e+08                      
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    par(mfrow = c(2, 2))
    plot(lm1)

    • Model 2
    s2
    ## 
    ## Call:
    ## lm(formula = SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual) + 
    ##     HouseStyle + RoofMatl)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -248210  -13847     207   13572  199904 
    ## 
    ## Coefficients:
    ##                                 Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)                   -3.310e+05  4.100e+04  -8.073 1.46e-15 ***
    ## as.factor(OverallQual)2        2.596e+03  2.722e+04   0.095  0.92401    
    ## as.factor(OverallQual)3        1.561e+04  2.215e+04   0.705  0.48093    
    ## as.factor(OverallQual)4        3.397e+04  2.129e+04   1.595  0.11085    
    ## as.factor(OverallQual)5        4.370e+04  2.119e+04   2.062  0.03941 *  
    ## as.factor(OverallQual)6        5.515e+04  2.123e+04   2.597  0.00949 ** 
    ## as.factor(OverallQual)7        6.920e+04  2.133e+04   3.243  0.00121 ** 
    ## as.factor(OverallQual)8        9.246e+04  2.152e+04   4.296 1.86e-05 ***
    ## as.factor(OverallQual)9        1.538e+05  2.213e+04   6.948 5.63e-12 ***
    ## as.factor(OverallQual)10       1.926e+05  2.311e+04   8.333  < 2e-16 ***
    ## HouseStyle1.5Unf               7.925e+03  8.527e+03   0.929  0.35284    
    ## HouseStyle1Story               2.039e+04  3.187e+03   6.398 2.14e-10 ***
    ## HouseStyle2.5Fin              -1.949e+04  1.230e+04  -1.584  0.11352    
    ## HouseStyle2.5Unf              -1.726e+04  9.668e+03  -1.785  0.07441 .  
    ## HouseStyle2Story              -5.735e+03  3.270e+03  -1.754  0.07969 .  
    ## HouseStyleSFoyer               2.977e+04  5.913e+03   5.034 5.43e-07 ***
    ## HouseStyleSLvl                 1.813e+04  4.827e+03   3.755  0.00018 ***
    ## RoofMatlCompShg                3.250e+05  3.391e+04   9.585  < 2e-16 ***
    ## RoofMatlMembran                3.702e+05  4.542e+04   8.152 7.83e-16 ***
    ## RoofMatlMetal                  3.561e+05  4.565e+04   7.799 1.20e-14 ***
    ## RoofMatlRoll                   3.049e+05  4.515e+04   6.754 2.09e-11 ***
    ## RoofMatlTar&Grv                3.315e+05  3.474e+04   9.543  < 2e-16 ***
    ## RoofMatlWdShake                3.041e+05  3.631e+04   8.375  < 2e-16 ***
    ## RoofMatlWdShngl                3.867e+05  3.569e+04  10.836  < 2e-16 ***
    ## GrLivArea:NeighborhoodBlmngtn  7.549e+01  5.913e+00  12.766  < 2e-16 ***
    ## GrLivArea:NeighborhoodBlueste  6.776e+01  1.543e+01   4.392 1.21e-05 ***
    ## GrLivArea:NeighborhoodBrDale   5.713e+01  7.671e+00   7.447 1.65e-13 ***
    ## GrLivArea:NeighborhoodBrkSide  6.812e+01  4.393e+00  15.506  < 2e-16 ***
    ## GrLivArea:NeighborhoodClearCr  8.268e+01  3.848e+00  21.489  < 2e-16 ***
    ## GrLivArea:NeighborhoodCollgCr  8.538e+01  3.264e+00  26.161  < 2e-16 ***
    ## GrLivArea:NeighborhoodCrawfor  8.386e+01  3.224e+00  26.010  < 2e-16 ***
    ## GrLivArea:NeighborhoodEdwards  5.392e+01  3.366e+00  16.020  < 2e-16 ***
    ## GrLivArea:NeighborhoodGilbert  8.180e+01  3.510e+00  23.304  < 2e-16 ***
    ## GrLivArea:NeighborhoodIDOTRR   5.264e+01  5.365e+00   9.812  < 2e-16 ***
    ## GrLivArea:NeighborhoodMeadowV  5.184e+01  7.052e+00   7.351 3.33e-13 ***
    ## GrLivArea:NeighborhoodMitchel  7.093e+01  4.059e+00  17.476  < 2e-16 ***
    ## GrLivArea:NeighborhoodNAmes    6.518e+01  2.904e+00  22.443  < 2e-16 ***
    ## GrLivArea:NeighborhoodNoRidge  9.933e+01  2.984e+00  33.289  < 2e-16 ***
    ## GrLivArea:NeighborhoodNPkVill  7.048e+01  8.522e+00   8.270 3.06e-16 ***
    ## GrLivArea:NeighborhoodNridgHt  1.030e+02  3.494e+00  29.485  < 2e-16 ***
    ## GrLivArea:NeighborhoodNWAmes   7.198e+01  3.067e+00  23.465  < 2e-16 ***
    ## GrLivArea:NeighborhoodOldTown  5.390e+01  3.267e+00  16.499  < 2e-16 ***
    ## GrLivArea:NeighborhoodSawyer   6.477e+01  3.783e+00  17.122  < 2e-16 ***
    ## GrLivArea:NeighborhoodSawyerW  7.825e+01  3.418e+00  22.894  < 2e-16 ***
    ## GrLivArea:NeighborhoodSomerst  9.172e+01  3.701e+00  24.783  < 2e-16 ***
    ## GrLivArea:NeighborhoodStoneBr  1.085e+02  4.176e+00  25.976  < 2e-16 ***
    ## GrLivArea:NeighborhoodSWISU    5.605e+01  4.273e+00  13.117  < 2e-16 ***
    ## GrLivArea:NeighborhoodTimber   8.984e+01  3.803e+00  23.626  < 2e-16 ***
    ## GrLivArea:NeighborhoodVeenker  9.662e+01  6.378e+00  15.149  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 29780 on 1411 degrees of freedom
    ## Multiple R-squared:  0.8641, Adjusted R-squared:  0.8595 
    ## F-statistic:   187 on 48 and 1411 DF,  p-value: < 2.2e-16
    a2
    ## Analysis of Variance Table
    ## 
    ## Response: SalePrice
    ##                          Df     Sum Sq    Mean Sq F value    Pr(>F)    
    ## as.factor(OverallQual)    9 6.2999e+12 6.9999e+11 789.551 < 2.2e-16 ***
    ## HouseStyle                7 7.7976e+10 1.1139e+10  12.565  1.05e-15 ***
    ## RoofMatl                  7 1.2232e+11 1.7475e+10  19.710 < 2.2e-16 ***
    ## GrLivArea:Neighborhood   25 1.4568e+12 5.8272e+10  65.728 < 2.2e-16 ***
    ## Residuals              1411 1.2509e+12 8.8656e+08                      
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    par(mfrow = c(2, 2))
    plot(lm2)

    • Model 3
    s3
    ## 
    ## Call:
    ## lm(formula = SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual) + 
    ##     HouseStyle + RoofMatl + SaleCondition)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -264444  -13552       0   13841  190903 
    ## 
    ## Coefficients:
    ##                                 Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)                   -3.598e+05  4.027e+04  -8.933  < 2e-16 ***
    ## as.factor(OverallQual)2        6.939e+03  2.662e+04   0.261  0.79440    
    ## as.factor(OverallQual)3        1.767e+04  2.165e+04   0.816  0.41454    
    ## as.factor(OverallQual)4        3.539e+04  2.082e+04   1.700  0.08935 .  
    ## as.factor(OverallQual)5        4.488e+04  2.072e+04   2.166  0.03048 *  
    ## as.factor(OverallQual)6        5.536e+04  2.076e+04   2.667  0.00775 ** 
    ## as.factor(OverallQual)7        6.848e+04  2.086e+04   3.283  0.00105 ** 
    ## as.factor(OverallQual)8        9.194e+04  2.105e+04   4.368 1.34e-05 ***
    ## as.factor(OverallQual)9        1.532e+05  2.164e+04   7.078 2.31e-12 ***
    ## as.factor(OverallQual)10       1.909e+05  2.260e+04   8.448  < 2e-16 ***
    ## HouseStyle1.5Unf               6.211e+03  8.346e+03   0.744  0.45684    
    ## HouseStyle1Story               1.973e+04  3.141e+03   6.279 4.52e-10 ***
    ## HouseStyle2.5Fin              -1.712e+04  1.205e+04  -1.421  0.15566    
    ## HouseStyle2.5Unf              -1.840e+04  9.461e+03  -1.945  0.05200 .  
    ## HouseStyle2Story              -5.395e+03  3.205e+03  -1.683  0.09251 .  
    ## HouseStyleSFoyer               3.075e+04  5.859e+03   5.248 1.78e-07 ***
    ## HouseStyleSLvl                 1.933e+04  4.743e+03   4.075 4.86e-05 ***
    ## RoofMatlCompShg                3.407e+05  3.325e+04  10.245  < 2e-16 ***
    ## RoofMatlMembran                3.868e+05  4.450e+04   8.692  < 2e-16 ***
    ## RoofMatlMetal                  3.696e+05  4.471e+04   8.268 3.12e-16 ***
    ## RoofMatlRoll                   3.178e+05  4.420e+04   7.189 1.05e-12 ***
    ## RoofMatlTar&Grv                3.496e+05  3.416e+04  10.236  < 2e-16 ***
    ## RoofMatlWdShake                3.238e+05  3.562e+04   9.090  < 2e-16 ***
    ## RoofMatlWdShngl                4.015e+05  3.500e+04  11.472  < 2e-16 ***
    ## SaleConditionAdjLand           3.984e+03  1.526e+04   0.261  0.79411    
    ## SaleConditionAlloca           -6.204e+03  9.224e+03  -0.673  0.50136    
    ## SaleConditionFamily            1.425e+03  7.220e+03   0.197  0.84355    
    ## SaleConditionNormal            1.416e+04  3.071e+03   4.610 4.40e-06 ***
    ## SaleConditionPartial           3.369e+04  4.318e+03   7.802 1.18e-14 ***
    ## GrLivArea:NeighborhoodBlmngtn  7.181e+01  5.820e+00  12.339  < 2e-16 ***
    ## GrLivArea:NeighborhoodBlueste  6.664e+01  1.509e+01   4.416 1.08e-05 ***
    ## GrLivArea:NeighborhoodBrDale   5.846e+01  7.537e+00   7.756 1.67e-14 ***
    ## GrLivArea:NeighborhoodBrkSide  6.766e+01  4.318e+00  15.668  < 2e-16 ***
    ## GrLivArea:NeighborhoodClearCr  8.228e+01  3.774e+00  21.801  < 2e-16 ***
    ## GrLivArea:NeighborhoodCollgCr  8.356e+01  3.223e+00  25.928  < 2e-16 ***
    ## GrLivArea:NeighborhoodCrawfor  8.447e+01  3.205e+00  26.359  < 2e-16 ***
    ## GrLivArea:NeighborhoodEdwards  5.327e+01  3.323e+00  16.030  < 2e-16 ***
    ## GrLivArea:NeighborhoodGilbert  7.975e+01  3.474e+00  22.955  < 2e-16 ***
    ## GrLivArea:NeighborhoodIDOTRR   5.352e+01  5.282e+00  10.132  < 2e-16 ***
    ## GrLivArea:NeighborhoodMeadowV  5.017e+01  6.902e+00   7.268 6.02e-13 ***
    ## GrLivArea:NeighborhoodMitchel  7.125e+01  3.995e+00  17.833  < 2e-16 ***
    ## GrLivArea:NeighborhoodNAmes    6.527e+01  2.853e+00  22.877  < 2e-16 ***
    ## GrLivArea:NeighborhoodNoRidge  9.978e+01  2.929e+00  34.059  < 2e-16 ***
    ## GrLivArea:NeighborhoodNPkVill  7.058e+01  8.341e+00   8.461  < 2e-16 ***
    ## GrLivArea:NeighborhoodNridgHt  9.868e+01  3.504e+00  28.164  < 2e-16 ***
    ## GrLivArea:NeighborhoodNWAmes   7.224e+01  3.014e+00  23.969  < 2e-16 ***
    ## GrLivArea:NeighborhoodOldTown  5.399e+01  3.224e+00  16.747  < 2e-16 ***
    ## GrLivArea:NeighborhoodSawyer   6.455e+01  3.712e+00  17.393  < 2e-16 ***
    ## GrLivArea:NeighborhoodSawyerW  7.892e+01  3.399e+00  23.222  < 2e-16 ***
    ## GrLivArea:NeighborhoodSomerst  8.735e+01  3.720e+00  23.480  < 2e-16 ***
    ## GrLivArea:NeighborhoodStoneBr  1.050e+02  4.137e+00  25.387  < 2e-16 ***
    ## GrLivArea:NeighborhoodSWISU    5.599e+01  4.194e+00  13.348  < 2e-16 ***
    ## GrLivArea:NeighborhoodTimber   8.808e+01  3.747e+00  23.508  < 2e-16 ***
    ## GrLivArea:NeighborhoodVeenker  9.615e+01  6.242e+00  15.405  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 29110 on 1406 degrees of freedom
    ## Multiple R-squared:  0.8706, Adjusted R-squared:  0.8658 
    ## F-statistic: 178.5 on 53 and 1406 DF,  p-value: < 2.2e-16
    a3
    ## Analysis of Variance Table
    ## 
    ## Response: SalePrice
    ##                          Df     Sum Sq    Mean Sq F value    Pr(>F)    
    ## as.factor(OverallQual)    9 6.2999e+12 6.9999e+11 826.256 < 2.2e-16 ***
    ## HouseStyle                7 7.7976e+10 1.1139e+10  13.149 < 2.2e-16 ***
    ## RoofMatl                  7 1.2232e+11 1.7475e+10  20.627 < 2.2e-16 ***
    ## SaleCondition             5 8.9597e+10 1.7919e+10  21.152 < 2.2e-16 ***
    ## GrLivArea:Neighborhood   25 1.4270e+12 5.7080e+10  67.377 < 2.2e-16 ***
    ## Residuals              1406 1.1911e+12 8.4718e+08                      
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    par(mfrow = c(2, 2))
    plot(lm3)

    In general, the model summary and ANOVA tables show that:
    • All variable families are statistically significant
    • Some members of variable families may be insignificant
    • The overall model relationships are significant.

    For instance, in Model 3, the ANOVA table indicates that the variable family OverallQual is highly significant (with p-value < 2.2e-16); however, from the summary table, we can see that coefficient estimates for OverallQual = 2 and 3 are not significant. In addition, the overall model relationship is highly significant, as indicated by the F-statistic in the summary table (with p-value < 2.2e-16).

    From the diagnostic plots, we can observe that:
    • The assumption of constant variance of the residuals may be an issue, as the residual vs. fitted value plots suggest that higher predicted values may be associated with greater variance in the residuals.
    • The assumption of approximately normally distributed residuals also may be an issue, as the quantile-quantile plots indicate deviations from normality in the tails of the residual distribution.

    It is worth mentioning that multi-collinearity of the predictor variables should be evaluated, by either reviewing the variance inflation factors or inspecting the correlation matrix.

    Finally, it should be kept in mind that we stopped adding variables after Model 3, because of diminishing returns in the \(R^2_{\text{adj}}\) and a desire to avoid potential over-fitting and collinearity issues. However, it is possible that adding more variables to the model may help mitigate the shortcomings highlighted above.

  • Report your Kaggle.com user name and score.

    My user name is “kecbenson”, and scores for the 4 models are as follows:

    Model Score
    lm0 0.1752
    lm1 0.1724
    lm2 0.1692
    lm3 0.1657
  • Suggestions for further work.

    • Continue adding more variables to the model to see if this improves performance
    • Explore other functional forms for the regression model relationship; for instance we could try:

      • Log-transform the SalePrice variable in the regression, since the variable is skewed to the right
      • Transform other variables that appear to be non-normally distributed
      • Develop the regression on a sale price per square foot variable, i.e., change the target variable to SalePrice / GrLivArea
    • Try other regression estimation approaches such as generalized linear models, non-linear models, or more advanced techniques (ridge, lasso, etc.).