We start by generating 10,000 draws of the random variables \(X\) and \(Y\), where:
n <- 10000
N <- 41
m <- (N + 1) / 2 # mean of uniform distribution for X
X <- runif(n, min = 1, max = N) # random variable X
Y <- rnorm(n, mean = m, sd = m) # random variable Y
df <- data.frame(X, Y)
str(df)
## 'data.frame': 10000 obs. of 2 variables:
## $ X: num 29.95 17.83 4.73 38.69 7.8 ...
## $ Y: num 23.2 63.2 24.9 21 22.9 ...
summary(df)
## X Y
## Min. : 1.002 Min. :-54.944
## 1st Qu.:10.835 1st Qu.: 6.699
## Median :20.494 Median : 20.930
## Mean :20.787 Mean : 21.015
## 3rd Qu.:30.884 3rd Qu.: 35.167
## Max. :40.998 Max. :102.070
We compute the following values:
x <- median(X) # sample median of X
y <- quantile(Y, 0.25) # sample 1st quartile of Y
x_th <- qunif(0.5, min = 1, max = N) # population median of X
y_th <- qnorm(0.25, mean = m, sd = m) # population 1st quartile of Y
xy <- data.frame("x" = c(x, x_th), "y" = c(y, y_th), row.names = c("Sample", "Theoretical"))
kable(xy, digits = 3, align = 'cc', caption = "Sample and Theoretical Values of x & y")
| | x | y |
|---|---|---|
| Sample | 20.494 | 6.699 |
| Theoretical | 21.000 | 6.836 |
Then we plot the histograms of \(X\) and \(Y\) along with their probability density functions, and show the values of the median \(x\) and 1st quartile \(y\).
# plot histogram of X with probability density function
hist(X, breaks = 1:N, main = 'Histogram of X with median value (blue)')
abline(h = n / (N - 1), lwd = 3, col = 'violet')
abline(v = x_th, lty = 3, lwd = 3, col = 'blue')
# plot histogram of Y with probability density function
hist(Y, breaks = (m-200):(m+200), xlim = c(m-60, m+60),
main = 'Histogram of Y with 1st quartile (blue) and mean (red) values')
curve(n * dnorm(x, mean = m, sd = m), add = TRUE, lwd = 3, col = 'violet')
abline(v = y_th, lty = 3, lwd = 3, col = 'blue')
abline(v = m, lty = 3, lwd = 3, col = 'red')
In the probability calculations in this part, we give the sample probabilities computed from the 10,000 random draws generated for \(X\) and \(Y\) as well as the theoretical probabilities computed from the specified distributions for \(X\) (uniform) and \(Y\) (normal).
\(\Pr(X>x ~|~ X>y) = p_A\)
Answer:
This is the conditional probability that \(X>x\) given that \(X>y\), which can be expressed as:
\[\begin{eqnarray} p_A = \Pr(X>x ~|~ X>y) &=& \frac{\Pr(X>x ~\cap~ X>y)}{\Pr(X>y)} && \\ &=& \frac{\Pr(X>x)}{\Pr(X>y)} && \text{ since } x>y \\ &=& \frac{0.5}{\Pr(X>y)} && \text{ by definition of median} \end{eqnarray}\]
Both the sample probability and theoretical probability are approximately 58.5%, as can be seen in the table from part (c) below.
# sample probability
p1_s <- sum(X > x) / sum(X > y)
# theoretical probability
p1_th <- 0.5 / ((N - y_th) / (N - 1))
\(\Pr(X>x,~ Y>y) = p_B\)
Answer:
This is the joint probability that \(X>x\) and \(Y>y\). Since the random variables \(X\) and \(Y\) are independent (see part B below), the theoretical probability can be expressed as:
\[ p_B = \Pr(X>x,~ Y>y) = \Pr(X>x) \cdot \Pr(Y>y) = 0.5 \cdot 0.75 \]
by the definitions of \(x\) (median of \(X\)) and \(y\) (1st quartile of \(Y\)).
Both the sample probability and theoretical probability are approximately 37.5%, as can be seen in the table from part (c) below.
# sample probability
p2_s <- nrow(df[df$X > x & df$Y > y, ]) / n
# theoretical probability
p2_th <- 0.5 * 0.75
\(\Pr(X<x ~|~ X>y) = p_C\)
Answer:
This is the conditional probability that \(X<x\) given that \(X>y\), which can be expressed as:
\[\begin{eqnarray} p_C = \Pr(X<x ~|~ X>y) &=& \frac{\Pr(X<x ~\cap~ X>y)}{\Pr(X>y)} && \\ &=& \frac{\Pr(y < X < x)}{\Pr(X>y)} && \text{ since } y < x \end{eqnarray}\]
Both the sample probability and theoretical probability are approximately 41.5%, as can be seen in the table below.
Note that this also can be computed simply by using the probability \(p_A\) from part (a) above, since \(X<x\) and \(X>x\) are complements:
\[p_C = \Pr(X<x ~|~ X>y) = 1 - \Pr(X>x ~|~ X>y) = 1 - p_A\]
# sample probability
p3_s <- sum(X < x & X > y) / sum(X > y)
# theoretical probability
p3_th <- (x_th - y_th) / (N - y_th)
# table of probabilities
p <- data.frame("p_A" = c(p1_s, p1_th), "p_B" = c(p2_s, p2_th), "p_C" = c(p3_s, p3_th),
row.names = c("Sample", "Theoretical"))
kable(p, digits = 4, align = 'ccc',
caption = "Sample and Theoretical Values of Probabilities p_A, p_B, and p_C")
| | p_A | p_B | p_C |
|---|---|---|---|
| Sample | 0.5831 | 0.3754 | 0.4169 |
| Theoretical | 0.5854 | 0.3750 | 0.4146 |
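As a quick numerical sanity check (a minimal sketch using the objects defined above), the complement identity from part (a) should hold exactly for the sample estimates, since ties such as \(X = x\) occur with probability zero for continuous draws:
# verify p_C = 1 - p_A on the sample estimates; expected: TRUE
all.equal(p3_s, 1 - p1_s)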
We evaluate the conditions \(X > x\) and \(Y > y\) on the \(X\) and \(Y\) samples, store the results in a dataframe, and then convert this into a contingency table. The table is shown in terms of both frequency and probability.
# construct dataframe for X>x and Y>y
df2 <- data.frame(X = ifelse(X <= x, 'X <= x', 'X > x'), Y = ifelse(Y <= y, 'Y <= y', 'Y > y'))
# construct and summarize contingency table
df3 <- table(df2)
summary(df3)
## Number of cases in table: 10000
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 0.03413, df = 1, p-value = 0.8534
# reformat table as dataframe for display
C1 <- c(df3[ ,1], sum(df3[ ,1]))
C2 <- c(df3[ ,2], sum(df3[ ,2]))
C3 <- C1 + C2
df4 <- data.frame(C1, C2, C3)
rownames(df4) <- c('X <= x', 'X > x', 'Total')
colnames(df4) <- c('Y <= y', 'Y > y', 'Total')
# display table: frequency
kable(df4, align = 'ccc', caption = 'Contingency Table of X>x and Y>y: Frequency')
| | Y <= y | Y > y | Total |
|---|---|---|---|
| X <= x | 1254 | 3746 | 5000 |
| X > x | 1246 | 3754 | 5000 |
| Total | 2500 | 7500 | 10000 |
# display table: probability
kable(df4 / n, digits = 4, align = 'ccc',
caption = 'Contingency Table of X>x and Y>y: Probability')
| | Y <= y | Y > y | Total |
|---|---|---|---|
| X <= x | 0.1254 | 0.3746 | 0.5 |
| X > x | 0.1246 | 0.3754 | 0.5 |
| Total | 0.2500 | 0.7500 | 1.0 |
We see that the marginal probabilities and joint probabilities are what we would expect for independent random variables \(X\) and \(Y\). For instance:
\[\begin{eqnarray} \Pr(X > x) &=& 0.5 \\ \Pr(Y > y) &=& 0.75 \\ \Pr(X > x,~ Y > y) &\approx& 0.375 = \Pr(X > x) \cdot \Pr(Y > y) \end{eqnarray}\]
Alternatively, we can check for independence by comparing the marginal and conditional probabilities:
\[\begin{eqnarray} \Pr(X > x ~|~ Y > y) &=& \frac{\Pr(X > x,~ Y > y)}{\Pr(Y > y)} = \frac{\Pr(X > x,~ Y > y)}{0.75} \approx 0.5 = \Pr(X > x) \\ \Pr(Y > y ~|~ X > x) &=& \frac{\Pr(X > x,~ Y > y)}{\Pr(X > x)} = \frac{\Pr(X > x,~ Y > y)}{0.5} \approx 0.75 = \Pr(Y > y) \end{eqnarray}\]
Again, the sample probabilities support the hypothesis that \(X\) and \(Y\) are independent.
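For illustration, here is a minimal sketch that computes these conditional probabilities directly from the contingency table df3 constructed above:
# sample conditional probabilities read off the contingency table
df3["X > x", "Y > y"] / sum(df3[, "Y > y"]) # Pr(X > x | Y > y), approx. 0.5
df3["X > x", "Y > y"] / sum(df3["X > x", ]) # Pr(Y > y | X > x), approx. 0.75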
We test for independence with the following null and alternative hypotheses:

- \(H_0\): \(X\) and \(Y\) are independent.
- \(H_1\): \(X\) and \(Y\) are dependent (not independent).
We compute the relevant test statistic under the null hypothesis that \(X\) and \(Y\) are independent, and then compare the resulting p-value to the significance level \(\alpha = 0.05\). If the p-value is \(< \alpha\), then we reject \(H_0\) and conclude that the variables are dependent; however, if the p-value is \(> \alpha\), then we conclude that the evidence does not support rejecting the hypothesis that \(X\) and \(Y\) are independent.
In this case, we use two tests: Fisher’s exact test and the chi-squared test. We compute the test statistics and related p-values using the R functions fisher.test and chisq.test, with the chi-squared test shown both with and without the “continuity correction”. Note that the p-values from Fisher’s exact test and the chi-squared test with continuity correction are extremely close, if not identical.
# calc Fisher's exact test
fisher.test(df3)
##
## Fisher's Exact Test for Count Data
##
## data: df3
## p-value = 0.8716
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9202847 1.1052820
## sample estimates:
## odds ratio
## 1.00857
# calc chi-square test with & without continuity correction
chisq.test(df3)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df3
## X-squared = 0.026133, df = 1, p-value = 0.8716
chisq.test(df3, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: df3
## X-squared = 0.034133, df = 1, p-value = 0.8534
Generally (in almost all runs), the p-values from the two tests are \(\gg \alpha\). Hence we can conclude that the sample data for \(X\) and \(Y\) do not support the alternative hypothesis that the variables are dependent.
Finally, from Wikipedia, Fisher’s exact test provides a significance test statistic that can be computed exactly (using the hypergeometric distribution), which is valid when the marginal (row and column) totals for the contingency table are held fixed. In contrast, the chi-squared test statistic provides only an approximate significance value, which is valid only for large sample sizes and for frequency counts greater than 5-10 in each cell of the table. In our case, the sample size of 10,000 is large enough that either test can be used, and the corresponding p-values are very close.
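To make the chi-squared mechanics concrete, here is a minimal sketch that reproduces the uncorrected Pearson statistic by hand from the contingency table df3 (the intermediate names expected and X2 are ours):
# expected cell counts under independence: outer product of margins / total
expected <- outer(rowSums(df3), colSums(df3)) / sum(df3)
# Pearson chi-squared statistic without continuity correction
X2 <- sum((df3 - expected)^2 / expected)
# p-value with (2-1)*(2-1) = 1 degree of freedom
pchisq(X2, df = 1, lower.tail = FALSE)
The result should match the output of chisq.test(df3, correct = FALSE) above.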
We start by loading and viewing the Ames housing training dataset. The dataset has been downloaded from the Kaggle website and saved to local disk as well as my GitHub repo, which is private. For reproducibility, the R code below loads the data from disk.
The training dataset includes 1,460 observations of 81 variables, which includes the target variable SalePrice and 80 feature variables.
# local directory
path <- "train.csv"
# private GitHub repo; need to accept invitation for access
# path <- "https://raw.githubusercontent.com/kecbenson/DATA605/master/train.csv?token=AJYRCNGY6EDCSXGWG2PULFC45CHCA"
# load and view training dataset
df_train <- read.csv(path)
glimpse(df_train)
## Observations: 1,460
## Variables: 81
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ MSZoning <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, ...
## $ LotFrontage <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, ...
## $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, ...
## $ Alley <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LotShape <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg, Reg...
## $ LandContour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl...
## $ Utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, ...
## $ LotConfig <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside...
## $ LandSlope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl...
## $ Neighborhood <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mit...
## $ Condition1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN,...
## $ Condition2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, ...
## $ BldgType <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, ...
## $ HouseStyle <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin, ...
## $ OverallQual <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ RoofStyle <fct> Gable, Gable, Gable, Gable, Gable, Gable, Gable,...
## $ RoofMatl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, Com...
## $ Exterior1st <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, Vin...
## $ Exterior2nd <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, Vin...
## $ MasVnrType <fct> BrkFace, None, BrkFace, None, BrkFace, None, Sto...
## $ MasVnrArea <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ ExterQual <fct> Gd, TA, Gd, TA, Gd, TA, Gd, TA, TA, TA, TA, Ex, ...
## $ ExterCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ Foundation <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, PConc...
## $ BsmtQual <fct> Gd, Gd, Gd, TA, Gd, Gd, Ex, Gd, TA, TA, TA, Ex, ...
## $ BsmtCond <fct> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ BsmtExposure <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No, No, ...
## $ BsmtFinType1 <fct> GLQ, ALQ, GLQ, ALQ, GLQ, GLQ, GLQ, ALQ, Unf, GLQ...
## $ BsmtFinSF1 <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinType2 <fct> Unf, Unf, Unf, Unf, Unf, Unf, Unf, BLQ, Unf, Unf...
## $ BsmtFinSF2 <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, ...
## $ HeatingQC <fct> Ex, Ex, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Ex, Ex, Ex, ...
## $ CentralAir <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr,...
## $ X1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ X2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ BsmtFullBath <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, ...
## $ BsmtHalfBath <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FullBath <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, ...
## $ HalfBath <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ BedroomAbvGr <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, ...
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, ...
## $ KitchenQual <fct> Gd, TA, Gd, Gd, Gd, TA, Gd, TA, TA, TA, TA, Ex, ...
## $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Min1, Ty...
## $ Fireplaces <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, ...
## $ FireplaceQu <fct> NA, TA, TA, Gd, TA, NA, Gd, TA, TA, TA, NA, Gd, ...
## $ GarageType <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Attchd, ...
## $ GarageYrBlt <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, ...
## $ GarageFinish <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf, RFn...
## $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ GarageQual <fct> TA, TA, TA, TA, TA, TA, TA, TA, Fa, Gd, TA, TA, ...
## $ GarageCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, ...
## $ PavedDrive <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ WoodDeckSF <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PoolQC <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Fence <fct> NA, NA, NA, NA, NA, MnPrv, NA, NA, NA, NA, NA, N...
## $ MiscFeature <fct> NA, NA, NA, NA, NA, Shed, NA, Shed, NA, NA, NA, ...
## $ MiscVal <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleType <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, New,...
## $ SaleCondition <fct> Normal, Normal, Normal, Abnorml, Normal, Normal,...
## $ SalePrice <int> 208500, 181500, 223500, 140000, 250000, 143000, ...
Provide univariate descriptive statistics and appropriate plots for the training data set.
First let’s review summary statistics for the data using the summary function.
summary(df_train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 C (all): 10 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 FV : 65 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 RH : 16 Median : 69.00
## Mean : 730.5 Mean : 56.9 RL :1151 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 RM : 218 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape LandContour
## Min. : 1300 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63
## 1st Qu.: 7554 Pave:1454 Pave: 41 IR2: 41 HLS: 50
## Median : 9478 NA's:1369 IR3: 10 Low: 36
## Mean : 10517 Reg:925 Lvl:1311
## 3rd Qu.: 11602
## Max. :215245
##
## Utilities LotConfig LandSlope Neighborhood Condition1
## AllPub:1459 Corner : 263 Gtl:1382 NAmes :225 Norm :1260
## NoSeWa: 1 CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48
## FR3 : 4 Edwards:100 RRAn : 26
## Inside :1052 Somerst: 86 PosN : 19
## Gilbert: 79 RRAe : 11
## (Other):707 (Other): 15
## Condition2 BldgType HouseStyle OverallQual
## Norm :1445 1Fam :1220 1Story :726 Min. : 1.000
## Feedr : 6 2fmCon: 31 2Story :445 1st Qu.: 5.000
## Artery : 2 Duplex: 52 1.5Fin :154 Median : 6.000
## PosN : 2 Twnhs : 43 SLvl : 65 Mean : 6.099
## RRNn : 2 TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000
## PosA : 1 1.5Unf : 14 Max. :10.000
## (Other): 2 (Other): 19
## OverallCond YearBuilt YearRemodAdd RoofStyle
## Min. :1.000 Min. :1872 Min. :1950 Flat : 13
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 Gable :1141
## Median :5.000 Median :1973 Median :1994 Gambrel: 11
## Mean :5.575 Mean :1971 Mean :1985 Hip : 286
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 Mansard: 7
## Max. :9.000 Max. :2010 Max. :2010 Shed : 2
##
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea
## CompShg:1434 VinylSd:515 VinylSd:504 BrkCmn : 15 Min. : 0.0
## Tar&Grv: 11 HdBoard:222 MetalSd:214 BrkFace:445 1st Qu.: 0.0
## WdShngl: 6 MetalSd:220 HdBoard:207 None :864 Median : 0.0
## WdShake: 5 Wd Sdng:206 Wd Sdng:197 Stone :128 Mean : 103.7
## ClyTile: 1 Plywood:108 Plywood:142 NA's : 8 3rd Qu.: 166.0
## Membran: 1 CemntBd: 61 CmentBd: 60 Max. :1600.0
## (Other): 2 (Other):128 (Other):136 NA's :8
## ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## Ex: 52 Ex: 3 BrkTil:146 Ex :121 Fa : 45 Av :221
## Fa: 14 Fa: 28 CBlock:634 Fa : 35 Gd : 65 Gd :134
## Gd:488 Gd: 146 PConc :647 Gd :618 Po : 2 Mn :114
## TA:906 Po: 1 Slab : 24 TA :649 TA :1311 No :953
## TA:1282 Stone : 6 NA's: 37 NA's: 37 NA's: 38
## Wood : 3
##
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## ALQ :220 Min. : 0.0 ALQ : 19 Min. : 0.00
## BLQ :148 1st Qu.: 0.0 BLQ : 33 1st Qu.: 0.00
## GLQ :418 Median : 383.5 GLQ : 14 Median : 0.00
## LwQ : 74 Mean : 443.6 LwQ : 46 Mean : 46.55
## Rec :133 3rd Qu.: 712.2 Rec : 54 3rd Qu.: 0.00
## Unf :430 Max. :5644.0 Unf :1256 Max. :1474.00
## NA's: 37 NA's: 38
## BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## Min. : 0.0 Min. : 0.0 Floor: 1 Ex:741 N: 95
## 1st Qu.: 223.0 1st Qu.: 795.8 GasA :1428 Fa: 49 Y:1365
## Median : 477.5 Median : 991.5 GasW : 18 Gd:241
## Mean : 567.2 Mean :1057.4 Grav : 7 Po: 1
## 3rd Qu.: 808.0 3rd Qu.:1298.2 OthW : 2 TA:428
## Max. :2336.0 Max. :6110.0 Wall : 4
##
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## FuseA: 94 Min. : 334 Min. : 0 Min. : 0.000
## FuseF: 27 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## FuseP: 3 Median :1087 Median : 0 Median : 0.000
## Mix : 1 Mean :1163 Mean : 347 Mean : 5.845
## SBrkr:1334 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## NA's : 1 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## Min. :0.0000 Min. :0.000 Min. :0.000 Ex:100
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Fa: 39
## Median :0.0000 Median :3.000 Median :1.000 Gd:586
## Mean :0.3829 Mean :2.866 Mean :1.047 TA:735
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :2.0000 Max. :8.000 Max. :3.000
##
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType
## Min. : 2.000 Maj1: 14 Min. :0.000 Ex : 24 2Types : 6
## 1st Qu.: 5.000 Maj2: 5 1st Qu.:0.000 Fa : 33 Attchd :870
## Median : 6.000 Min1: 31 Median :1.000 Gd :380 Basment: 19
## Mean : 6.518 Min2: 34 Mean :0.613 Po : 20 BuiltIn: 88
## 3rd Qu.: 7.000 Mod : 15 3rd Qu.:1.000 TA :313 CarPort: 9
## Max. :14.000 Sev : 1 Max. :3.000 NA's:690 Detchd :387
## Typ :1360 NA's : 81
## GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## Min. :1900 Fin :352 Min. :0.000 Min. : 0.0 Ex : 3
## 1st Qu.:1961 RFn :422 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48
## Median :1980 Unf :605 Median :2.000 Median : 480.0 Gd : 14
## Mean :1979 NA's: 81 Mean :1.767 Mean : 473.0 Po : 3
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 TA :1311
## Max. :2010 Max. :4.000 Max. :1418.0 NA's: 81
## NA's :81
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch
## Ex : 2 N: 90 Min. : 0.00 Min. : 0.00 Min. : 0.00
## Fa : 35 P: 30 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Gd : 9 Y:1340 Median : 0.00 Median : 25.00 Median : 0.00
## Po : 7 Mean : 94.24 Mean : 46.66 Mean : 21.95
## TA :1326 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00
## NA's: 81 Max. :857.00 Max. :547.00 Max. :552.00
##
## X3SsnPorch ScreenPorch PoolArea PoolQC
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Ex : 2
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2
## Median : 0.00 Median : 0.00 Median : 0.000 Gd : 3
## Mean : 3.41 Mean : 15.06 Mean : 2.759 NA's:1453
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :508.00 Max. :480.00 Max. :738.000
##
## Fence MiscFeature MiscVal MoSold
## GdPrv: 59 Gar2: 2 Min. : 0.00 Min. : 1.000
## GdWo : 54 Othr: 2 1st Qu.: 0.00 1st Qu.: 5.000
## MnPrv: 157 Shed: 49 Median : 0.00 Median : 6.000
## MnWw : 11 TenC: 1 Mean : 43.49 Mean : 6.322
## NA's :1179 NA's:1406 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :15500.00 Max. :12.000
##
## YrSold SaleType SaleCondition SalePrice
## Min. :2006 WD :1267 Abnorml: 101 Min. : 34900
## 1st Qu.:2007 New : 122 AdjLand: 4 1st Qu.:129975
## Median :2008 COD : 43 Alloca : 12 Median :163000
## Mean :2008 ConLD : 9 Family : 20 Mean :180921
## 3rd Qu.:2009 ConLI : 5 Normal :1198 3rd Qu.:214000
## Max. :2010 ConLw : 5 Partial: 125 Max. :755000
## (Other): 9
Then we review the distribution of the target variable SalePrice as well as the key variable GrLivArea. Note that both variables are skewed to the right and appear to follow a log-normal distribution.
attach(df_train)
par(mfrow = c(1, 2))
hist(SalePrice)
# log transform
hist(log(SalePrice), main = "Histogram of log(SalePrice)")
hist(GrLivArea)
# log transform
hist(log(GrLivArea), main = "Histogram of log(GrLivArea)")
Next we review bivariate scatter plots (using the pairs function) for groups of selected numeric variables:

- SalePrice, GrLivArea, LotArea
- SalePrice, OverallQual, OverallCond
- SalePrice, YearBuilt, YearRemodAdd

pairs(cbind(SalePrice, GrLivArea, LotArea),
main = "Bivariate plots of SalePrice, GrLivArea, LotArea")
pairs(cbind(SalePrice, OverallQual, OverallCond),
main = "Bivariate plots of SalePrice, OverallQual, OverallCond")
pairs(cbind(SalePrice, YearBuilt, YearRemodAdd),
main = "Bivariate plots of SalePrice, YearBuilt, YearRemodAdd")
Next we review boxplots of the SalePrice variable for groups of selected categorical and discrete numeric variables:

- Neighborhood, OverallQual
- BldgType, HouseStyle
- FullBath, TotBath (= FullBath + 0.5 * HalfBath)
- BedroomAbvGr, TotRmsAbvGrd
- YrSold, MoSold
- SaleType, SaleCondition
- RoofMatl, KitchenQual
- BsmtQual, GarageCars

par(las = 1)
boxplot(SalePrice ~ Neighborhood, horizontal = TRUE, main = "SalePrice vs. Neighborhood")
boxplot(SalePrice ~ OverallQual, horizontal = TRUE, main = "SalePrice vs. OverallQual")
par(mfrow = c(1, 2))
boxplot(SalePrice ~ BldgType, main = "SalePrice vs. BldgType")
boxplot(SalePrice ~ HouseStyle, main = "SalePrice vs. HouseStyle")
# define total baths
TotBath <- FullBath + 0.5 * HalfBath
df_train <- cbind(df_train, TotBath)
boxplot(SalePrice ~ FullBath, main = "SalePrice vs. FullBath")
boxplot(SalePrice ~ TotBath, main = "SalePrice vs. TotBath")
boxplot(SalePrice ~ BedroomAbvGr, main = "SalePrice vs. BedroomAbvGr")
boxplot(SalePrice ~ TotRmsAbvGrd, main = "SalePrice vs. TotRmsAbvGrd")
boxplot(SalePrice ~ YrSold, main = "SalePrice vs. YrSold")
boxplot(SalePrice ~ MoSold, main = "SalePrice vs. MoSold")
boxplot(SalePrice ~ SaleType, main = "SalePrice vs. SaleType")
boxplot(SalePrice ~ SaleCondition, main = "SalePrice vs. SaleCondition")
boxplot(SalePrice ~ RoofMatl, main = "SalePrice vs. RoofMatl")
boxplot(SalePrice ~ KitchenQual, main = "SalePrice vs. KitchenQual")
boxplot(SalePrice ~ BsmtQual, main = "SalePrice vs. BsmtQual")
boxplot(SalePrice ~ GarageCars, main = "SalePrice vs. GarageCars")
We also check the sample correlations between several related pairs of variables:

- GrLivArea and LotArea
- YearBuilt and YearRemodAdd
- FullBath and TotBath
- BedroomAbvGr and TotRmsAbvGrd

cor(GrLivArea, LotArea)
## [1] 0.2631162
cor(YearBuilt, YearRemodAdd)
## [1] 0.592855
cor(FullBath, TotBath)
## [1] 0.9201157
cor(BedroomAbvGr, TotRmsAbvGrd)
## [1] 0.6766199
From these plots, we can narrow down the complete set of 80 feature variables to a short list of potential predictor variables to use in a multiple regression model for the sale price:

- GrLivArea
- OverallQual
- LotArea
- YearRemodAdd
- Neighborhood
- HouseStyle
- TotBath
- TotRmsAbvGrd
- MoSold
- SaleType
- SaleCondition
- RoofMatl
- KitchenQual
- BsmtQual
- GarageCars

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
Let’s view a scatter plot of SalePrice vs GrLivArea, and then segment the same relationship by OverallQual, Neighborhood, and HouseStyle. The slope of the best-fit line for each scatter plot gives the price per square foot of gross living area, which is a key metric used in the real estate industry. Judging from the graphs, these appear to be key variables to use when starting the regression model for SalePrice. For instance, higher quality homes (OverallQual = 7, 8, 9), certain neighborhoods (Neighborhood = NoRidge, NridgHt, StoneBr), and certain house styles (HouseStyle = 1Story) have higher prices per square foot than other homes.
# SalePrice vs GrLivArea
g <- ggplot(df_train, aes(x = GrLivArea, y = SalePrice))
g + geom_point(alpha = 0.2) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "SalePrice vs. GrLivArea")
# SalePrice vs GrLivArea, by OverallQual
g + geom_point(aes(color = OverallQual), alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ OverallQual) + guides(color = "none") +
labs(title = "SalePrice vs. GrLivArea, by OverallQual")
# SalePrice vs GrLivArea, by Neighborhood
g + geom_point(aes(color = Neighborhood), alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ Neighborhood) + guides(color = "none") +
labs(title = "SalePrice vs. GrLivArea, by Neighborhood")
# SalePrice vs GrLivArea, by HouseStyle
g + geom_point(aes(color = HouseStyle), alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ HouseStyle) + guides(color = "none") +
labs(title = "SalePrice vs. GrLivArea, by HouseStyle")
We also view a scatter plot of SalePrice vs log(LotArea), and then segment the same relationship by HouseStyle, BedroomAbvGr, and TotBath. We log-transform the LotArea variable because of outliers. Based on the graphs, we can see that the price per square foot of lot area is not as strong a relationship as the price per square foot of gross living area.
# SalePrice vs LotArea
h <- ggplot(df_train, aes(x = log(LotArea), y = SalePrice))
h + geom_point(alpha = 0.2) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "SalePrice vs. log(LotArea)")
# SalePrice vs LotArea, by HouseStyle
h + geom_point(aes(color = HouseStyle), alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ HouseStyle) + guides(color = "none") +
labs(title = "SalePrice vs. log(LotArea), by HouseStyle")
# SalePrice vs LotArea, by BedroomAbvGr
h + geom_point(aes(color = BedroomAbvGr), alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ BedroomAbvGr) + guides(color = "none") +
labs(title = "SalePrice vs. log(LotArea), by BedroomAbvGr")
# SalePrice vs LotArea, by TotBath
h + geom_point(aes(color = TotBath), alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ TotBath) + guides(color = "none") +
labs(title = "SalePrice vs. log(LotArea), by TotBath")
Finally, since the SalePrice and GrLivArea variables are skewed to the right, let’s log-transform these variables and view the relationship segmented by OverallQual, BldgType, and HouseStyle. These graphs confirm that OverallQual, BldgType, and HouseStyle are key variables to consider in predicting SalePrice.
# SalePrice vs. GrLivArea, by OverallQual and BldgType
g + geom_point(aes(color = OverallQual), alpha = 0.5) +
scale_x_log10() + scale_y_log10() +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ BldgType) +
labs(title = "log(SalePrice) vs. log(GrLivArea), by OverallQual and BldgType")
# SalePrice vs. GrLivArea, by OverallQual and HouseStyle
g + geom_point(aes(color = OverallQual), alpha = 0.5) +
scale_x_log10() + scale_y_log10() +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ HouseStyle) +
labs(title = "log(SalePrice) vs. log(GrLivArea), by OverallQual and HouseStyle")
Derive a correlation matrix for any three quantitative variables in the dataset.
We choose the variables LotArea, GrLivArea, and SalePrice and use the R function cor to compute the correlation matrix.
correln <- cor(cbind('LSZ' = df_train$LotArea, 'GLA' = df_train$GrLivArea, 'PRX' = df_train$SalePrice))
kable(correln, digits = 4, align = 'ccc', caption = 'Correlation matrix')
| | LSZ | GLA | PRX |
|---|---|---|---|
| LSZ | 1.0000 | 0.2631 | 0.2638 |
| GLA | 0.2631 | 1.0000 | 0.7086 |
| PRX | 0.2638 | 0.7086 | 1.0000 |
Test the hypotheses that the correlations between each pairwise set of variables are 0 and provide an 80% confidence interval.
We use the sample pairwise correlations \(r_{ij}\) for variables \(i\) and \(j\) from the correlation matrix to infer whether the true population correlations \(\rho_{ij}\) are different from 0, at a significance level of \(\alpha = 0.05\). The null and alternative hypotheses for this test are as follows:

- \(H_0\): \(\rho_{ij} = 0\)
- \(H_1\): \(\rho_{ij} \neq 0\)
From Wikipedia, the test statistic for the correlation test is:
\[t_{ij} = \frac{r_{ij}\sqrt{n-2}}{\sqrt{1 - r_{ij}^2}}\]
Assuming that the null hypothesis \(H_0\) is true and that the variables \(i\) and \(j\) are independently normally distributed, the test statistic \(t_{ij}\) follows the Student’s \(t\)-distribution with \(n-2\) degrees of freedom. This holds approximately for non-normally distributed variables \(i\) and \(j\) as long as the sample size is large enough, which we can assume to be the case for \(n=1460\).
So we compute the test statistics \(t_{ij}\) and determine the associated p-values:
\[\begin{eqnarray} \text{LSZ-GLA:} && t_{12} = 10.41 &\implies& \Pr(|t_{12}| > 10.41) \approx 0 \\ \text{LSZ-PRX:} && t_{13} = 10.44 &\implies& \Pr(|t_{13}| > 10.44) \approx 0 \\ \text{GLA-PRX:} && t_{23} = 38.35 &\implies& \Pr(|t_{23}| > 38.35) \approx 0 \end{eqnarray}\]
n <- nrow(df_train) # sample size
r <- c(correln[1,2], correln[1,3], correln[2,3])
(t <- r * sqrt(n - 2) / sqrt(1 - r^2)) # test statistic for sample correlation
## [1] 10.41370 10.44463 38.34821
(p <- 2 * (1 - pt(t, df = n - 2))) # p-value for two-tailed test
## [1] 0 0 0
The p-values are approximately 0 and in any case are \(\ll \alpha\), so we reject the null hypothesis and conclude that each pairwise correlation \(\rho_{ij} \neq 0\). Note that we can use the R function cor.test to confirm the test statistics and p-values above, and also to compute the 80% confidence intervals.
\[\begin{eqnarray} \text{LSZ-GLA:} && 0.2316 \leq \rho_{12} \leq 0.2941 \\ \text{LSZ-PRX:} && 0.2323 \leq \rho_{13} \leq 0.2948 \\ \text{GLA-PRX:} && 0.6915 \leq \rho_{23} \leq 0.7249 \end{eqnarray}\]
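Under the hood, cor.test builds this interval with the Fisher z-transformation; below is a minimal sketch of that calculation (the helper ci_cor is our own, not from the source):
# approximate CI for a correlation via Fisher's z-transformation
ci_cor <- function(r, n, level = 0.8) {
  z <- atanh(r) # Fisher z-transform of the sample correlation
  se <- 1 / sqrt(n - 3) # approximate standard error on the z scale
  zc <- qnorm(1 - (1 - level) / 2) # two-sided critical value
  tanh(z + c(-1, 1) * zc * se) # back-transform to the correlation scale
}
ci_cor(correln[1, 2], n) # compare with the cor.test interval below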
cor.test(df_train$LotArea, df_train$GrLivArea, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: df_train$LotArea and df_train$GrLivArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2315997 0.2940809
## sample estimates:
## cor
## 0.2631162
cor.test(df_train$LotArea, df_train$SalePrice, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: df_train$LotArea and df_train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
cor.test(df_train$GrLivArea, df_train$SalePrice, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: df_train$GrLivArea and df_train$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
In our analysis, we computed the sample correlations \(r_{ij}\) from the sample data, and then applied statistical inference (hypothesis testing) to determine whether the nonzero \(r_{ij}\) provide sufficient evidence to conclude that the true population correlations \(\rho_{ij} \neq 0\) at a significance level of \(\alpha = 0.05\). The calculated test statistics and related p-values supported the conclusion that the \(\rho_{ij} \neq 0\) at \(\alpha = 0.05\) significance.
In the last step, we applied the same hypothesis test to three pairwise correlations (\(\rho_{12}\), \(\rho_{13}\), and \(\rho_{23}\)). Generally, when testing multiple hypotheses on related variables at the same time, the risk is that the chance of observing at least one extreme test statistic purely by chance increases; in that event, we would incorrectly reject a true null hypothesis (a Type I error). So in the general case, we should be worried about familywise error and examine ways to mitigate it.
In this particular case, however, familywise error should not be a substantial risk, since the p-value for each hypothesis test is extraordinarily small (\(< 2.2 \cdot 10^{-16}\)). For instance, if we were to apply the Bonferroni correction (see Wikipedia) by using a significance level of \(\alpha / 3 \approx 0.017\), we would still reject the null hypothesis and conclude that \(\rho_{ij} \neq 0\).
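For illustration, a minimal sketch of the Bonferroni adjustment using base R's p.adjust, applied to the vector p of p-values computed above:
# Bonferroni-adjusted p-values for the three pairwise tests
p.adjust(p, method = "bonferroni") # still effectively 0, far below alpha = 0.05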
Invert the correlation matrix from above; the inverse matrix is known as the precision matrix and contains variance inflation factors on the diagonal.
We use the R function solve to invert the correlation matrix and arrive at the precision matrix.
precisn <- solve(correln)
kable(precisn, digits = 4, align = 'ccc', caption = 'Precision matrix')
| | LSZ | GLA | PRX |
|---|---|---|---|
| LSZ | 1.0884 | -0.1665 | -0.1692 |
| GLA | -0.1665 | 2.0341 | -1.3975 |
| PRX | -0.1692 | -1.3975 | 2.0349 |
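As a cross-check of the variance-inflation interpretation (a sketch using our own regression fit, not from the source): the diagonal entry for a variable equals \(1/(1-R^2)\) from regressing that standardized variable on the other two.
# VIF check: regress standardized SalePrice on the other two variables
fit <- lm(scale(SalePrice) ~ scale(LotArea) + scale(GrLivArea), data = df_train)
1 / (1 - summary(fit)$r.squared) # should match the PRX diagonal entry (~2.035)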
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
Both the product of the correlation matrix with the precision matrix and the reverse product equal the identity matrix, up to floating-point rounding error, as the near checks below confirm.
(res1 <- correln %*% precisn)
## LSZ GLA PRX
## LSZ 1.000000e+00 0.000000e+00 0.000000e+00
## GLA 1.387779e-17 1.000000e+00 2.220446e-16
## PRX -2.775558e-17 -2.220446e-16 1.000000e+00
near(res1, diag(nrow(res1)))
## LSZ GLA PRX
## LSZ TRUE TRUE TRUE
## GLA TRUE TRUE TRUE
## PRX TRUE TRUE TRUE
(res2 <- precisn %*% correln)
## LSZ GLA PRX
## LSZ 1.000000e+00 1.94289e-16 1.387779e-16
## GLA -1.110223e-16 1.00000e+00 0.000000e+00
## PRX -1.110223e-16 0.00000e+00 1.000000e+00
near(res2, diag(nrow(res2)))
## LSZ GLA PRX
## LSZ TRUE TRUE TRUE
## GLA TRUE TRUE TRUE
## PRX TRUE TRUE TRUE
Conduct LU decomposition on the correlation matrix.
In Homework Assignment #2, we wrote a function to factorize a square matrix \(A\) into the LU decomposition. We use the same function here, as shown below.
# Define the function LU to decompose an input square matrix A into a lower triangular matrix L and an
# upper triangular matrix U, with A = L %*% U. The function works by implementing a series of row
# reductions to eliminate all the below-diagonal elements of A.
# Several features to note about the function LU:
# - The function takes as input any n-dimensional square matrix A
# - Error handling is included for either of the following events:
# . The input matrix A is not square
# . A pivot value is zero
# - The function returns a named list of the lower and upper triangular matrices ("lower" and "upper").
LU <- function(M) {
n <- nrow(M)
# check M is square matrix, else stop
if (n != ncol(M)) stop("Not a square matrix")
# initialize L to the identity matrix with the same column & row names as M
L <- diag(nrow = n)
colnames(L) <- colnames(M)
rownames(L) <- rownames(M)
# initialize U to the original matrix M
U <- M
# loop across the columns that have below-diagonal elements
for ( j in 1:(n-1) ) {
# check pivot value is non-zero, else stop
if (U[j, j] == 0) stop(paste0("Zero pivot value, column j=", j))
# loop across the rows that have below-diagonal elements
for ( i in (j+1):n ) {
factor <- - U[i, j] / U[j, j]
# row-reduce using pivot row
U[i, ] <- U[i, ] + U[j, ] * factor
L[i, j] <- - factor
}
}
return(list("lower" = L, "upper" = U))
}
Now we use this function to factorize the correlation matrix, and confirm that the resulting lower and upper triangular matrices \(L\) and \(U\) satisfy \(A = LU\).
(L <- LU(correln)$lower)
## LSZ GLA PRX
## LSZ 1.0000000 0.0000000 0
## GLA 0.2631162 1.0000000 0
## PRX 0.2638434 0.6867466 1
(U <- LU(correln)$upper)
## LSZ GLA PRX
## LSZ 1 0.2631162 0.2638434
## GLA 0 0.9307699 0.6392030
## PRX 0 0.0000000 0.4914162
(A <- L %*% U)
## LSZ GLA PRX
## LSZ 1.0000000 0.2631162 0.2638434
## GLA 0.2631162 1.0000000 0.7086245
## PRX 0.2638434 0.7086245 1.0000000
correln - A
## LSZ GLA PRX
## LSZ 0 0.000000e+00 0.000000e+00
## GLA 0 0.000000e+00 1.110223e-16
## PRX 0 1.110223e-16 0.000000e+00
near(correln, A)
## LSZ GLA PRX
## LSZ TRUE TRUE TRUE
## GLA TRUE TRUE TRUE
## PRX TRUE TRUE TRUE
Select a variable in the training dataset that is skewed to the right.
We choose to analyze the SalePrice variable from the training dataset. As we saw above, the empirical distribution for this variable is skewed to the right. Also, all values of the variable are greater than zero, so there is no need to shift the distribution above zero.
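As a quick numerical check of the skew (a minimal sketch; the computation below is ours, not from the source), we can compute the sample skewness coefficient directly:
# sample skewness of SalePrice; a positive value indicates right skew
sp <- df_train$SalePrice
mean((sp - mean(sp))^3) / sd(sp)^3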
Use the MASS package to fit an exponential probability density function to the data.
We load the MASS package and use fitdistr to fit an exponential probability density function to the data. The optimal value of \(\lambda\) for the fitted exponential distribution is found to be:
\[ \lambda = 5.527 \cdot 10^{-6} \]
library(MASS)
# fit exponential distribution to SalePrice data
f_exp <- fitdistr(df_train$SalePrice, "exponential")
(lambda <- f_exp$estimate)
## rate
## 5.527268e-06
Generate a histogram from the fitted distribution and compare to the histogram of the selected variable.
We generate 1,000 draws from the fitted distribution, which we use to produce a histogram of values overlaid with a plot of the fitted exponential probability density function. Then we compare this to the histogram of the SalePrice variable overlaid with the same density function.
# generate sample of 1,000 draws from fitted distribution
n <- 1000
draws1 <- rexp(n, lambda)
# combine dataframe for side-by-side histograms
df_hist1 <- rbind(data.frame(obs = draws1, grp = 'sample'),
data.frame(obs = df_train$SalePrice, grp = 'empirical'))
# plot histograms of sample & empirical values, along with fitted probability density
ggplot(df_hist1) + geom_histogram(aes(x = obs, y = ..density..)) +
stat_function(fun = dexp, col = 'blue', args = list(rate = lambda)) +
facet_wrap(~ grp) +
labs(title = 'Comparison of sample & empirical distributions for SalePrice',
subtitle = 'Fitted exponential probability density')
As we can see, the shape of the exponential distribution doesn’t match the shape of the empirical data: the exponential density increases as SalePrice goes to zero, and it misses the “hump” (the mode of the data) near $180,000. A closer fit can be found by using a log-normal distribution, which we fit using the same fitdistr function from the MASS package. After fitting the log-normal distribution, a comparison of the sample and empirical distributions shows a much closer match to the data.
# try fitting log-normal distribution to data instead
f_lnorm <- fitdistr(df_train$SalePrice, "lognormal")
ml <- f_lnorm$estimate[1]
sl <- f_lnorm$estimate[2]
# generate sample of 1,000 draws from fitted distribution
draws2 <- rlnorm(n, meanlog = ml, sdlog = sl)
# combine dataframe for side-by-side histograms
df_hist2 <- rbind(data.frame(obs = draws2, grp = 'sample'),
data.frame(obs = df_train$SalePrice, grp = 'empirical'))
# plot histograms of sample & empirical values, along with fitted probability density
ggplot(df_hist2) + geom_histogram(aes(x = obs, y = ..density..)) +
stat_function(fun = dlnorm, col = 'blue', args = list(meanlog = ml, sdlog = sl)) +
facet_wrap(~ grp) +
labs(title = 'Comparison of sample & empirical distributions for SalePrice',
subtitle = 'Fitted log-normal probability density')
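Incidentally, for the log-normal the maximum-likelihood estimates have a closed form: the fitted meanlog and sdlog are just the mean and (population) standard deviation of log(SalePrice). A minimal sketch to confirm fitdistr's estimates:
# closed-form MLE for the log-normal parameters; should agree with f_lnorm$estimate
logp <- log(df_train$SalePrice)
c(meanlog = mean(logp), sdlog = sqrt(mean((logp - mean(logp))^2)))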
Find the 5th and 95th percentiles from the fitted and empirical distributions, and compare this to the 95% confidence interval from the empirical data.
From the table below, we see that:

- the 5th and 95th percentiles of the fitted exponential distribution are approximately $9,280 and $541,991;
- the 5th and 95th percentiles of the fitted log-normal distribution are approximately $86,443 and $321,536.

For comparison:

- the 5th and 95th percentiles of the empirical SalePrice data are $88,000 and $326,100.

Consistent with the observation noted earlier, the 5th and 95th percentiles of the empirical distribution are better approximated by the fitted log-normal distribution than by the fitted exponential distribution.
# range of percentiles
pct <- c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975)
# create & display table
tble <- rbind('Exponential - theoretical' = qexp(pct, lambda),
'Exponential - sample' = quantile(draws1, pct),
'Log-normal - theoretical' = qlnorm(pct, ml, sl),
'Log-normal - sample' = quantile(draws2, pct),
'Empirical distribution' = quantile(df_train$SalePrice, pct))
kable(tble, digits = 0, row.names = pct, format.args = list(big.mark = ','),
caption = 'Percentiles from fitted and empirical distributions for SalePrice')
| | 2.5% | 5% | 25% | 50% | 75% | 95% | 97.5% |
|---|---|---|---|---|---|---|---|
| Exponential - theoretical | 4,581 | 9,280 | 52,048 | 125,405 | 250,810 | 541,991 | 667,396 |
| Exponential - sample | 5,529 | 11,261 | 54,782 | 129,538 | 244,845 | 520,520 | 649,719 |
| Log-normal - theoretical | 76,222 | 86,443 | 127,353 | 166,717 | 218,247 | 321,536 | 364,650 |
| Log-normal - sample | 74,958 | 85,937 | 129,514 | 170,064 | 222,498 | 340,981 | 382,732 |
| Empirical distribution | 80,000 | 88,000 | 129,975 | 163,000 | 214,000 | 326,100 | 384,511 |
Next, from the sample SalePrice data in the training dataset, we can estimate the population mean \(\mu\) and construct a 95% confidence interval for the mean home sale price. We know from the Central Limit Theorem that the distribution of the sample mean approximates a normal distribution for a large enough sample size. Furthermore, assuming the SalePrice variable is normally distributed, we can use the Z-score (for larger sample sizes) or the t-statistic (for smaller sample sizes) to determine the confidence interval for the population mean.
First, from the training dataset, we estimate the sample mean \(m\) and sample standard deviation \(s\):
\[\begin{eqnarray} m &=& 180,921 \\ s &=& 79,443 \end{eqnarray}\]
(m <- mean(df_train$SalePrice))
## [1] 180921.2
(s <- sd(df_train$SalePrice))
## [1] 79442.5
Then we find the critical \(Z\) and \(t\) values corresponding to cumulative probabilities of 2.5% and 97.5%, where \(n=1460\) is the sample size and \(df = n-1 = 1459\) is the degrees of freedom for the critical t-values. Putting all this together, we arrive at the confidence intervals based on the Z-score and t-statistic:
\[\begin{eqnarray} \text{Z-score:} && m - Z_{\text{crit}} \cdot \frac{s}{\sqrt{n}} &\leq& \mu \leq m + Z_{\text{crit}} \cdot \frac{s}{\sqrt{n}} &~\implies~& 176,846 &\leq& \mu \leq 184,996 \\ \text{t-statistic:} && m - t_{\text{crit}} \cdot \frac{s}{\sqrt{n}} &\leq& \mu \leq m + t_{\text{crit}} \cdot \frac{s}{\sqrt{n}} &~\implies~& 176,843 &\leq& \mu \leq 185,000 \end{eqnarray}\]
n1 <- length(df_train$SalePrice)
zcrit <- qnorm(c(0.025, 0.975))
tcrit <- qt(c(0.025, 0.975), df = n1 - 1)
m + zcrit * s / sqrt(n1)
## [1] 176846.2 184996.2
m + tcrit * s / sqrt(n1)
## [1] 176842.8 184999.6
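As a cross-check (a sketch), base R's t.test reproduces the t-based interval directly:
# 95% confidence interval for the mean sale price from t.test
t.test(df_train$SalePrice)$conf.int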
As expected, the two confidence intervals are virtually identical, since the sample size of \(n=1460\) is very large.
Finally, note that the 95% confidence interval tells us that we can be 95% confident that the true population mean \(\mu\) lies within the interval, or alternatively, that if we generate a large number of samples having the same sample size of \(n=1460\) and construct the corresponding confidence intervals, then 95% of these intervals will contain the true population mean.
This concept differs from the 5th and 95th percentiles calculated above, which simply indicate the values of the SalePrice variable below which 5% and 95% of the distribution lies.
Build some type of multiple regression model and submit the results to the Kaggle competition board.
In part A above, we arrived at the following short list of feature variables to use as potential predictors in a multiple regression model for SalePrice:

- GrLivArea
- OverallQual
- LotArea
- YearRemodAdd
- Neighborhood
- HouseStyle
- TotBath
- TotRmsAbvGrd
- MoSold
- SaleType
- SaleCondition
- RoofMatl
- KitchenQual
- BsmtQual
- GarageCars

We will use this short list to develop several models for SalePrice, proceeding in order of increasing complexity. Because of the large number of variables, we use a version of forward selection, whereby we start with the most important variables and then selectively add new variables to the model.
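As an aside, this kind of forward search can also be automated with base R's step function. The sketch below is illustrative only: the reduced scope formula is our assumption (limited to a few of the short-listed variables), and step selects by AIC rather than adjusted \(R^2\).
# hypothetical automated forward selection over a reduced scope (AIC-based)
base_fit <- lm(SalePrice ~ 1, data = df_train)
fwd <- step(base_fit, direction = "forward",
            scope = ~ GrLivArea + OverallQual + KitchenQual + GarageCars,
            trace = FALSE)
formula(fwd) # the formula chosen by the automated search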
Model 0: base model
After some experimentation and applying some business logic, we arrive at a base linear model lm0 to start the analysis:
\[\text{SalePrice} ~\sim~ \text{GrLivArea:Neighborhood} + \text{OverallQual}\]
This model assumes that home prices are primarily determined by gross living area (total square feet), the price per square foot segmented by neighborhood (estimated slope coefficient), and an adjustment based on the overall quality of the house. This base model accounts for about 84% of the total variability in home prices in the training dataset (\(R^2_{\text{adj}} = 0.836\)).
# test #0: 0.8363
lm0 <- lm(SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual))
s0 <- summary(lm0)
a0 <- anova(lm0)
s0$adj.r.squared
## [1] 0.8362588
Model 1: add HouseStyle to Model 0
This model takes our base model and adds the next most important feature variable, HouseStyle:
\[\begin{eqnarray} \text{SalePrice} &~\sim~& \text{GrLivArea:Neighborhood} + \text{OverallQual} \\ && + \text{HouseStyle} \end{eqnarray}\]
The table below Model 3 records the adjusted \(R^2\) values for each candidate variable that can be added to Model 0. This version of the model has an \(R^2_{\text{adj}} = 0.848\).
# test #1: 0.8478
lm1 <- lm(SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual) + HouseStyle)
s1 <- summary(lm1)
a1 <- anova(lm1)
s1$adj.r.squared
## [1] 0.8477604
Model 2: add RoofMatl to Model 1
This model takes Model 1 and adds the next most important feature variable, RoofMatl:
\[\begin{eqnarray} \text{SalePrice} &~\sim~& \text{GrLivArea:Neighborhood} + \text{OverallQual} \\ && + \text{HouseStyle} + \text{RoofMatl} \end{eqnarray}\]
The table below Model 3 records the adjusted \(R^2\) values for each candidate variable that can be added to Model 1. This version of the model has an \(R^2_{\text{adj}} = 0.860\).
# test #2: 0.8595
lm2 <- lm(SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual) + HouseStyle + RoofMatl)
s2 <- summary(lm2)
a2 <- anova(lm2)
s2$adj.r.squared
## [1] 0.8595235
Model 3: add SaleCondition to Model 2
This model takes Model 2 and adds the next most important feature variable, SaleCondition:
\[\begin{eqnarray} \text{SalePrice} &~\sim~& \text{GrLivArea:Neighborhood} + \text{OverallQual} \\ && + \text{HouseStyle} + \text{RoofMatl} + \text{SaleCondition} \end{eqnarray}\]
The table below records the adjusted \(R^2\) values for each candidate variable that can be added to Model 2. This version of the model has an \(R^2_{\text{adj}} = 0.866\).
# test #3: 0.8658
lm3 <- lm(SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual) + HouseStyle + RoofMatl + SaleCondition)
s3 <- summary(lm3)
a3 <- anova(lm3)
s3$adj.r.squared
## [1] 0.8657639
The following table details the \(R^2_{\text{adj}}\) values that result from adding each variable (one at a time) to Models 0, 1, and 2, and supports the variables selected at each modeling step.
| Variable Added | Model 0 | Model 1 | Model 2 |
|---|---|---|---|
| Starting \(R^2_{\text{adj}}\) | 0.836 | 0.848 | 0.860 |
| LotArea | 0.834 | 0.852 | 0.864 |
| YearBuilt | 0.839 | 0.849 | 0.861 |
| YearRemodAdd | 0.841 | 0.852 | 0.864 |
| HouseStyle | 0.848 | – | – |
| BedroomAbvGr | 0.838 | 0.848 | 0.861 |
| TotRmsAbvGrd | 0.837 | 0.848 | 0.860 |
| FullBath | 0.836 | 0.848 | 0.860 |
| HalfBath | 0.837 | 0.849 | 0.860 |
| TotBath | 0.836 | 0.848 | 0.859 |
| RoofMatl | 0.847 | 0.860 | – |
| KitchenQual | 0.843 | 0.853 | 0.865 |
| BsmtQual | 0.842 | 0.853 | 0.865 |
| GarageCars | 0.843 | 0.852 | 0.864 |
| SaleCondition | 0.842 | 0.853 | 0.866 |
| SaleType | 0.841 | 0.852 | 0.864 |
We stop here since adding new variables to the model is yielding diminishing returns, as well as increasing the likelihood of over-fitting and collinearity among the predictor variables.
Interestingly, some variables that might be expected to increase the explanatory power of the model turn out not to do so. For example, the variables YearRemodAdd, BedroomAbvGr, TotBath, KitchenQual, and GarageCars don’t appear to increase the \(R^2_{\text{adj}}\) materially. This may be because they are correlated with other variables that are already included in the model, such as OverallQual.
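We can probe this hypothesis for the numeric candidates (a quick sketch; factor variables such as KitchenQual would first need an explicit ordering):
# correlation of OverallQual with two numeric short-listed candidates
cor(df_train$OverallQual, df_train$GarageCars)
cor(df_train$OverallQual, df_train$TotBath)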
Finally, we use the lm0 through lm3 models to predict values of the target variable SalePrice for the test dataset. The test dataset includes 1,459 observations with the same 80 feature variables found in the training dataset. We use the predict function to predict values of SalePrice for each of the 4 models and then save the results into 4 CSV files, which we submit to the Kaggle competition board.
# detach training dataset to avoid confusion
detach(df_train)
# local directory
path1 <- "test.csv"
# private GitHub repo; need to accept invitation for access
# path <- "https://raw.githubusercontent.com/kecbenson/DATA605/master/test.csv?token=AJYRCND5HTO6SZNTFJD2QT245L2FE"
# load the test dataset
df_test <- read.csv(path1)
# predict values
p0 <- predict(lm0, df_test)
p1 <- predict(lm1, df_test)
p2 <- predict(lm2, df_test)
p3 <- predict(lm3, df_test)
# save csv files
write.csv(cbind(Id = df_test$Id, SalePrice = p0), file = "kb0.csv", row.names = FALSE)
write.csv(cbind(Id = df_test$Id, SalePrice = p1), file = "kb1.csv", row.names = FALSE)
write.csv(cbind(Id = df_test$Id, SalePrice = p2), file = "kb2.csv", row.names = FALSE)
write.csv(cbind(Id = df_test$Id, SalePrice = p3), file = "kb3.csv", row.names = FALSE)
Provide a complete model summary and results with analysis.
For each of the 4 models, we show below the model summary, ANOVA table, and diagnostic plots.
s0
##
## Call:
## lm(formula = SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual))
##
## Residuals:
## Min 1Q Median 3Q Max
## -235927 -15564 -315 13429 212193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.065e+04 2.280e+04 1.345 0.178999
## as.factor(OverallQual)2 1.077e+03 2.938e+04 0.037 0.970769
## as.factor(OverallQual)3 1.993e+04 2.390e+04 0.834 0.404315
## as.factor(OverallQual)4 3.610e+04 2.298e+04 1.571 0.116326
## as.factor(OverallQual)5 4.626e+04 2.286e+04 2.023 0.043247 *
## as.factor(OverallQual)6 5.675e+04 2.291e+04 2.478 0.013340 *
## as.factor(OverallQual)7 7.263e+04 2.301e+04 3.156 0.001631 **
## as.factor(OverallQual)8 1.022e+05 2.320e+04 4.403 1.15e-05 ***
## as.factor(OverallQual)9 1.710e+05 2.382e+04 7.177 1.14e-12 ***
## as.factor(OverallQual)10 2.095e+05 2.473e+04 8.469 < 2e-16 ***
## GrLivArea:NeighborhoodBlmngtn 6.106e+01 6.244e+00 9.778 < 2e-16 ***
## GrLivArea:NeighborhoodBlueste 3.661e+01 1.642e+01 2.229 0.025959 *
## GrLivArea:NeighborhoodBrDale 1.902e+01 7.544e+00 2.522 0.011777 *
## GrLivArea:NeighborhoodBrkSide 4.202e+01 4.111e+00 10.223 < 2e-16 ***
## GrLivArea:NeighborhoodClearCr 6.738e+01 3.753e+00 17.955 < 2e-16 ***
## GrLivArea:NeighborhoodCollgCr 6.510e+01 3.046e+00 21.368 < 2e-16 ***
## GrLivArea:NeighborhoodCrawfor 6.502e+01 3.063e+00 21.225 < 2e-16 ***
## GrLivArea:NeighborhoodEdwards 2.761e+01 2.985e+00 9.252 < 2e-16 ***
## GrLivArea:NeighborhoodGilbert 5.775e+01 3.139e+00 18.401 < 2e-16 ***
## GrLivArea:NeighborhoodIDOTRR 2.429e+01 5.192e+00 4.678 3.17e-06 ***
## GrLivArea:NeighborhoodMeadowV 2.589e+01 7.219e+00 3.586 0.000347 ***
## GrLivArea:NeighborhoodMitchel 5.465e+01 4.148e+00 13.177 < 2e-16 ***
## GrLivArea:NeighborhoodNAmes 4.845e+01 2.738e+00 17.697 < 2e-16 ***
## GrLivArea:NeighborhoodNoRidge 8.216e+01 2.784e+00 29.508 < 2e-16 ***
## GrLivArea:NeighborhoodNPkVill 4.373e+01 8.893e+00 4.918 9.76e-07 ***
## GrLivArea:NeighborhoodNridgHt 8.368e+01 3.374e+00 24.800 < 2e-16 ***
## GrLivArea:NeighborhoodNWAmes 5.527e+01 2.935e+00 18.834 < 2e-16 ***
## GrLivArea:NeighborhoodOldTown 3.102e+01 2.753e+00 11.268 < 2e-16 ***
## GrLivArea:NeighborhoodSawyer 4.788e+01 3.809e+00 12.570 < 2e-16 ***
## GrLivArea:NeighborhoodSawyerW 5.905e+01 3.295e+00 17.922 < 2e-16 ***
## GrLivArea:NeighborhoodSomerst 6.867e+01 3.448e+00 19.918 < 2e-16 ***
## GrLivArea:NeighborhoodStoneBr 9.014e+01 4.218e+00 21.373 < 2e-16 ***
## GrLivArea:NeighborhoodSWISU 3.391e+01 3.741e+00 9.066 < 2e-16 ***
## GrLivArea:NeighborhoodTimber 7.162e+01 3.760e+00 19.047 < 2e-16 ***
## GrLivArea:NeighborhoodVeenker 8.462e+01 6.665e+00 12.696 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32150 on 1425 degrees of freedom
## Multiple R-squared: 0.8401, Adjusted R-squared: 0.8363
## F-statistic: 220.2 on 34 and 1425 DF, p-value: < 2.2e-16
a0
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(OverallQual) 9 6.2999e+12 6.9999e+11 677.370 < 2.2e-16 ***
## GrLivArea:Neighborhood 25 1.4355e+12 5.7418e+10 55.563 < 2.2e-16 ***
## Residuals 1425 1.4726e+12 1.0334e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(2, 2))
plot(lm0)
s1
##
## Call:
## lm(formula = SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual) +
## HouseStyle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -257430 -14498 4 13780 199565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3381.558 22352.296 0.151 0.879773
## as.factor(OverallQual)2 -519.774 28329.494 -0.018 0.985364
## as.factor(OverallQual)3 16447.678 23049.615 0.714 0.475606
## as.factor(OverallQual)4 34160.783 22165.786 1.541 0.123504
## as.factor(OverallQual)5 43535.724 22063.594 1.973 0.048668 *
## as.factor(OverallQual)6 55459.169 22104.371 2.509 0.012219 *
## as.factor(OverallQual)7 69594.343 22207.933 3.134 0.001761 **
## as.factor(OverallQual)8 93176.866 22405.838 4.159 3.39e-05 ***
## as.factor(OverallQual)9 157022.204 23031.721 6.818 1.37e-11 ***
## as.factor(OverallQual)10 188671.291 23964.640 7.873 6.84e-15 ***
## HouseStyle1.5Unf 3602.793 8860.407 0.407 0.684351
## HouseStyle1Story 17800.974 3292.811 5.406 7.55e-08 ***
## HouseStyle2.5Fin -5893.478 12699.471 -0.464 0.642667
## HouseStyle2.5Unf -17302.051 10057.488 -1.720 0.085593 .
## HouseStyle2Story -7236.439 3396.604 -2.130 0.033303 *
## HouseStyleSFoyer 26685.498 6125.392 4.357 1.42e-05 ***
## HouseStyleSLvl 17147.874 4974.213 3.447 0.000583 ***
## GrLivArea:NeighborhoodBlmngtn 70.466 6.131 11.493 < 2e-16 ***
## GrLivArea:NeighborhoodBlueste 61.969 16.044 3.862 0.000117 ***
## GrLivArea:NeighborhoodBrDale 50.248 7.935 6.332 3.23e-10 ***
## GrLivArea:NeighborhoodBrkSide 61.868 4.508 13.725 < 2e-16 ***
## GrLivArea:NeighborhoodClearCr 80.385 3.866 20.795 < 2e-16 ***
## GrLivArea:NeighborhoodCollgCr 80.559 3.354 24.017 < 2e-16 ***
## GrLivArea:NeighborhoodCrawfor 79.490 3.319 23.951 < 2e-16 ***
## GrLivArea:NeighborhoodEdwards 41.229 3.210 12.842 < 2e-16 ***
## GrLivArea:NeighborhoodGilbert 76.961 3.609 21.323 < 2e-16 ***
## GrLivArea:NeighborhoodIDOTRR 45.721 5.520 8.283 2.75e-16 ***
## GrLivArea:NeighborhoodMeadowV 45.867 7.304 6.279 4.51e-10 ***
## GrLivArea:NeighborhoodMitchel 65.658 4.181 15.705 < 2e-16 ***
## GrLivArea:NeighborhoodNAmes 60.153 2.929 20.534 < 2e-16 ***
## GrLivArea:NeighborhoodNoRidge 97.345 3.088 31.520 < 2e-16 ***
## GrLivArea:NeighborhoodNPkVill 64.459 8.840 7.292 5.07e-13 ***
## GrLivArea:NeighborhoodNridgHt 98.937 3.613 27.380 < 2e-16 ***
## GrLivArea:NeighborhoodNWAmes 68.214 3.141 21.719 < 2e-16 ***
## GrLivArea:NeighborhoodOldTown 49.363 3.341 14.776 < 2e-16 ***
## GrLivArea:NeighborhoodSawyer 59.315 3.883 15.275 < 2e-16 ***
## GrLivArea:NeighborhoodSawyerW 73.835 3.521 20.971 < 2e-16 ***
## GrLivArea:NeighborhoodSomerst 86.875 3.816 22.764 < 2e-16 ***
## GrLivArea:NeighborhoodStoneBr 104.479 4.329 24.136 < 2e-16 ***
## GrLivArea:NeighborhoodSWISU 49.959 4.402 11.349 < 2e-16 ***
## GrLivArea:NeighborhoodTimber 84.666 3.878 21.832 < 2e-16 ***
## GrLivArea:NeighborhoodVeenker 95.874 6.558 14.620 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31000 on 1418 degrees of freedom
## Multiple R-squared: 0.852, Adjusted R-squared: 0.8478
## F-statistic: 199.2 on 41 and 1418 DF, p-value: < 2.2e-16
a1
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(OverallQual) 9 6.2999e+12 6.9999e+11 728.545 < 2.2e-16 ***
## HouseStyle 7 7.7976e+10 1.1139e+10 11.594 2.143e-14 ***
## GrLivArea:Neighborhood 25 1.4676e+12 5.8706e+10 61.101 < 2.2e-16 ***
## Residuals 1418 1.3624e+12 9.6080e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# residual diagnostic plots for Model 1
par(mfrow = c(2, 2))
plot(lm1)
s2
##
## Call:
## lm(formula = SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual) +
## HouseStyle + RoofMatl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -248210 -13847 207 13572 199904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.310e+05 4.100e+04 -8.073 1.46e-15 ***
## as.factor(OverallQual)2 2.596e+03 2.722e+04 0.095 0.92401
## as.factor(OverallQual)3 1.561e+04 2.215e+04 0.705 0.48093
## as.factor(OverallQual)4 3.397e+04 2.129e+04 1.595 0.11085
## as.factor(OverallQual)5 4.370e+04 2.119e+04 2.062 0.03941 *
## as.factor(OverallQual)6 5.515e+04 2.123e+04 2.597 0.00949 **
## as.factor(OverallQual)7 6.920e+04 2.133e+04 3.243 0.00121 **
## as.factor(OverallQual)8 9.246e+04 2.152e+04 4.296 1.86e-05 ***
## as.factor(OverallQual)9 1.538e+05 2.213e+04 6.948 5.63e-12 ***
## as.factor(OverallQual)10 1.926e+05 2.311e+04 8.333 < 2e-16 ***
## HouseStyle1.5Unf 7.925e+03 8.527e+03 0.929 0.35284
## HouseStyle1Story 2.039e+04 3.187e+03 6.398 2.14e-10 ***
## HouseStyle2.5Fin -1.949e+04 1.230e+04 -1.584 0.11352
## HouseStyle2.5Unf -1.726e+04 9.668e+03 -1.785 0.07441 .
## HouseStyle2Story -5.735e+03 3.270e+03 -1.754 0.07969 .
## HouseStyleSFoyer 2.977e+04 5.913e+03 5.034 5.43e-07 ***
## HouseStyleSLvl 1.813e+04 4.827e+03 3.755 0.00018 ***
## RoofMatlCompShg 3.250e+05 3.391e+04 9.585 < 2e-16 ***
## RoofMatlMembran 3.702e+05 4.542e+04 8.152 7.83e-16 ***
## RoofMatlMetal 3.561e+05 4.565e+04 7.799 1.20e-14 ***
## RoofMatlRoll 3.049e+05 4.515e+04 6.754 2.09e-11 ***
## RoofMatlTar&Grv 3.315e+05 3.474e+04 9.543 < 2e-16 ***
## RoofMatlWdShake 3.041e+05 3.631e+04 8.375 < 2e-16 ***
## RoofMatlWdShngl 3.867e+05 3.569e+04 10.836 < 2e-16 ***
## GrLivArea:NeighborhoodBlmngtn 7.549e+01 5.913e+00 12.766 < 2e-16 ***
## GrLivArea:NeighborhoodBlueste 6.776e+01 1.543e+01 4.392 1.21e-05 ***
## GrLivArea:NeighborhoodBrDale 5.713e+01 7.671e+00 7.447 1.65e-13 ***
## GrLivArea:NeighborhoodBrkSide 6.812e+01 4.393e+00 15.506 < 2e-16 ***
## GrLivArea:NeighborhoodClearCr 8.268e+01 3.848e+00 21.489 < 2e-16 ***
## GrLivArea:NeighborhoodCollgCr 8.538e+01 3.264e+00 26.161 < 2e-16 ***
## GrLivArea:NeighborhoodCrawfor 8.386e+01 3.224e+00 26.010 < 2e-16 ***
## GrLivArea:NeighborhoodEdwards 5.392e+01 3.366e+00 16.020 < 2e-16 ***
## GrLivArea:NeighborhoodGilbert 8.180e+01 3.510e+00 23.304 < 2e-16 ***
## GrLivArea:NeighborhoodIDOTRR 5.264e+01 5.365e+00 9.812 < 2e-16 ***
## GrLivArea:NeighborhoodMeadowV 5.184e+01 7.052e+00 7.351 3.33e-13 ***
## GrLivArea:NeighborhoodMitchel 7.093e+01 4.059e+00 17.476 < 2e-16 ***
## GrLivArea:NeighborhoodNAmes 6.518e+01 2.904e+00 22.443 < 2e-16 ***
## GrLivArea:NeighborhoodNoRidge 9.933e+01 2.984e+00 33.289 < 2e-16 ***
## GrLivArea:NeighborhoodNPkVill 7.048e+01 8.522e+00 8.270 3.06e-16 ***
## GrLivArea:NeighborhoodNridgHt 1.030e+02 3.494e+00 29.485 < 2e-16 ***
## GrLivArea:NeighborhoodNWAmes 7.198e+01 3.067e+00 23.465 < 2e-16 ***
## GrLivArea:NeighborhoodOldTown 5.390e+01 3.267e+00 16.499 < 2e-16 ***
## GrLivArea:NeighborhoodSawyer 6.477e+01 3.783e+00 17.122 < 2e-16 ***
## GrLivArea:NeighborhoodSawyerW 7.825e+01 3.418e+00 22.894 < 2e-16 ***
## GrLivArea:NeighborhoodSomerst 9.172e+01 3.701e+00 24.783 < 2e-16 ***
## GrLivArea:NeighborhoodStoneBr 1.085e+02 4.176e+00 25.976 < 2e-16 ***
## GrLivArea:NeighborhoodSWISU 5.605e+01 4.273e+00 13.117 < 2e-16 ***
## GrLivArea:NeighborhoodTimber 8.984e+01 3.803e+00 23.626 < 2e-16 ***
## GrLivArea:NeighborhoodVeenker 9.662e+01 6.378e+00 15.149 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29780 on 1411 degrees of freedom
## Multiple R-squared: 0.8641, Adjusted R-squared: 0.8595
## F-statistic: 187 on 48 and 1411 DF, p-value: < 2.2e-16
a2
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(OverallQual) 9 6.2999e+12 6.9999e+11 789.551 < 2.2e-16 ***
## HouseStyle 7 7.7976e+10 1.1139e+10 12.565 1.05e-15 ***
## RoofMatl 7 1.2232e+11 1.7475e+10 19.710 < 2.2e-16 ***
## GrLivArea:Neighborhood 25 1.4568e+12 5.8272e+10 65.728 < 2.2e-16 ***
## Residuals 1411 1.2509e+12 8.8656e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# residual diagnostic plots for Model 2
par(mfrow = c(2, 2))
plot(lm2)
s3
##
## Call:
## lm(formula = SalePrice ~ GrLivArea:Neighborhood + as.factor(OverallQual) +
## HouseStyle + RoofMatl + SaleCondition)
##
## Residuals:
## Min 1Q Median 3Q Max
## -264444 -13552 0 13841 190903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.598e+05 4.027e+04 -8.933 < 2e-16 ***
## as.factor(OverallQual)2 6.939e+03 2.662e+04 0.261 0.79440
## as.factor(OverallQual)3 1.767e+04 2.165e+04 0.816 0.41454
## as.factor(OverallQual)4 3.539e+04 2.082e+04 1.700 0.08935 .
## as.factor(OverallQual)5 4.488e+04 2.072e+04 2.166 0.03048 *
## as.factor(OverallQual)6 5.536e+04 2.076e+04 2.667 0.00775 **
## as.factor(OverallQual)7 6.848e+04 2.086e+04 3.283 0.00105 **
## as.factor(OverallQual)8 9.194e+04 2.105e+04 4.368 1.34e-05 ***
## as.factor(OverallQual)9 1.532e+05 2.164e+04 7.078 2.31e-12 ***
## as.factor(OverallQual)10 1.909e+05 2.260e+04 8.448 < 2e-16 ***
## HouseStyle1.5Unf 6.211e+03 8.346e+03 0.744 0.45684
## HouseStyle1Story 1.973e+04 3.141e+03 6.279 4.52e-10 ***
## HouseStyle2.5Fin -1.712e+04 1.205e+04 -1.421 0.15566
## HouseStyle2.5Unf -1.840e+04 9.461e+03 -1.945 0.05200 .
## HouseStyle2Story -5.395e+03 3.205e+03 -1.683 0.09251 .
## HouseStyleSFoyer 3.075e+04 5.859e+03 5.248 1.78e-07 ***
## HouseStyleSLvl 1.933e+04 4.743e+03 4.075 4.86e-05 ***
## RoofMatlCompShg 3.407e+05 3.325e+04 10.245 < 2e-16 ***
## RoofMatlMembran 3.868e+05 4.450e+04 8.692 < 2e-16 ***
## RoofMatlMetal 3.696e+05 4.471e+04 8.268 3.12e-16 ***
## RoofMatlRoll 3.178e+05 4.420e+04 7.189 1.05e-12 ***
## RoofMatlTar&Grv 3.496e+05 3.416e+04 10.236 < 2e-16 ***
## RoofMatlWdShake 3.238e+05 3.562e+04 9.090 < 2e-16 ***
## RoofMatlWdShngl 4.015e+05 3.500e+04 11.472 < 2e-16 ***
## SaleConditionAdjLand 3.984e+03 1.526e+04 0.261 0.79411
## SaleConditionAlloca -6.204e+03 9.224e+03 -0.673 0.50136
## SaleConditionFamily 1.425e+03 7.220e+03 0.197 0.84355
## SaleConditionNormal 1.416e+04 3.071e+03 4.610 4.40e-06 ***
## SaleConditionPartial 3.369e+04 4.318e+03 7.802 1.18e-14 ***
## GrLivArea:NeighborhoodBlmngtn 7.181e+01 5.820e+00 12.339 < 2e-16 ***
## GrLivArea:NeighborhoodBlueste 6.664e+01 1.509e+01 4.416 1.08e-05 ***
## GrLivArea:NeighborhoodBrDale 5.846e+01 7.537e+00 7.756 1.67e-14 ***
## GrLivArea:NeighborhoodBrkSide 6.766e+01 4.318e+00 15.668 < 2e-16 ***
## GrLivArea:NeighborhoodClearCr 8.228e+01 3.774e+00 21.801 < 2e-16 ***
## GrLivArea:NeighborhoodCollgCr 8.356e+01 3.223e+00 25.928 < 2e-16 ***
## GrLivArea:NeighborhoodCrawfor 8.447e+01 3.205e+00 26.359 < 2e-16 ***
## GrLivArea:NeighborhoodEdwards 5.327e+01 3.323e+00 16.030 < 2e-16 ***
## GrLivArea:NeighborhoodGilbert 7.975e+01 3.474e+00 22.955 < 2e-16 ***
## GrLivArea:NeighborhoodIDOTRR 5.352e+01 5.282e+00 10.132 < 2e-16 ***
## GrLivArea:NeighborhoodMeadowV 5.017e+01 6.902e+00 7.268 6.02e-13 ***
## GrLivArea:NeighborhoodMitchel 7.125e+01 3.995e+00 17.833 < 2e-16 ***
## GrLivArea:NeighborhoodNAmes 6.527e+01 2.853e+00 22.877 < 2e-16 ***
## GrLivArea:NeighborhoodNoRidge 9.978e+01 2.929e+00 34.059 < 2e-16 ***
## GrLivArea:NeighborhoodNPkVill 7.058e+01 8.341e+00 8.461 < 2e-16 ***
## GrLivArea:NeighborhoodNridgHt 9.868e+01 3.504e+00 28.164 < 2e-16 ***
## GrLivArea:NeighborhoodNWAmes 7.224e+01 3.014e+00 23.969 < 2e-16 ***
## GrLivArea:NeighborhoodOldTown 5.399e+01 3.224e+00 16.747 < 2e-16 ***
## GrLivArea:NeighborhoodSawyer 6.455e+01 3.712e+00 17.393 < 2e-16 ***
## GrLivArea:NeighborhoodSawyerW 7.892e+01 3.399e+00 23.222 < 2e-16 ***
## GrLivArea:NeighborhoodSomerst 8.735e+01 3.720e+00 23.480 < 2e-16 ***
## GrLivArea:NeighborhoodStoneBr 1.050e+02 4.137e+00 25.387 < 2e-16 ***
## GrLivArea:NeighborhoodSWISU 5.599e+01 4.194e+00 13.348 < 2e-16 ***
## GrLivArea:NeighborhoodTimber 8.808e+01 3.747e+00 23.508 < 2e-16 ***
## GrLivArea:NeighborhoodVeenker 9.615e+01 6.242e+00 15.405 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29110 on 1406 degrees of freedom
## Multiple R-squared: 0.8706, Adjusted R-squared: 0.8658
## F-statistic: 178.5 on 53 and 1406 DF, p-value: < 2.2e-16
a3
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(OverallQual) 9 6.2999e+12 6.9999e+11 826.256 < 2.2e-16 ***
## HouseStyle 7 7.7976e+10 1.1139e+10 13.149 < 2.2e-16 ***
## RoofMatl 7 1.2232e+11 1.7475e+10 20.627 < 2.2e-16 ***
## SaleCondition 5 8.9597e+10 1.7919e+10 21.152 < 2.2e-16 ***
## GrLivArea:Neighborhood 25 1.4270e+12 5.7080e+10 67.377 < 2.2e-16 ***
## Residuals 1406 1.1911e+12 8.4718e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# residual diagnostic plots for Model 3
par(mfrow = c(2, 2))
plot(lm3)
For instance, in Model 3, the ANOVA table indicates that the variable family OverallQual is highly significant as a whole (p-value < 2.2e-16); however, the summary table shows that the coefficient estimates for OverallQual levels 2 and 3 are not individually significant, as the quick check below illustrates. In addition, the overall model relationship is highly significant, as indicated by the F-statistic in the summary table (p-value < 2.2e-16).
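As a minimal check, we can pull the individual OverallQual coefficients and p-values directly from the fitted lm3 object summarized above:
# extract the OverallQual coefficient estimates and p-values from Model 3
coefs <- summary(lm3)$coefficients
idx <- grepl("OverallQual", rownames(coefs))  # rows for the OverallQual dummies
round(coefs[idx, c("Estimate", "Pr(>|t|)")], 4)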
It is also worth mentioning that multicollinearity among the predictor variables should be evaluated, either by reviewing the variance inflation factors (VIFs) or by inspecting the correlation matrix; a sketch of both checks follows.
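A minimal sketch of these checks, assuming the car package is installed and that the model data frame is named train (a hypothetical name, since the models above were fit on variables in the workspace):
# generalized VIFs (GVIFs) for each model term; values well above ~5-10
# would suggest problematic collinearity
library(car)
vif(lm3)
# pairwise correlations among the numeric variables
cor(train[, c("GrLivArea", "OverallQual", "SalePrice")])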
Finally, bear in mind that we stopped adding variables after Model 3 because of diminishing returns in \(R^2_{\text{adj}}\) (compared below) and a desire to avoid potential over-fitting and collinearity issues. However, adding more variables to the model might help mitigate the shortcomings highlighted above.
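The diminishing returns are easy to see by collecting the adjusted \(R^2\) values from the four model objects:
# adjusted R-squared for each model, taken from the summaries above
sapply(list(lm0 = lm0, lm1 = lm1, lm2 = lm2, lm3 = lm3),
       function(m) round(summary(m)$adj.r.squared, 4))
##    lm0    lm1    lm2    lm3
## 0.8363 0.8478 0.8595 0.8658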
Report your Kaggle.com user name and score.
My user name is “kecbenson”, and the scores for the four models are as follows:
| Model | Score |
|---|---|
| lm0 | 0.1752 |
| lm1 | 0.1724 |
| lm2 | 0.1692 |
| lm3 | 0.1657 |
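For reference, a minimal sketch of how a submission file can be generated from one of these models; the test-set object name (test) and the output file name are assumptions, not the exact code used here:
# predict sale prices on the Kaggle test set and write a submission file
pred <- predict(lm3, newdata = test)             # `test` is a hypothetical name
pred[is.na(pred)] <- median(pred, na.rm = TRUE)  # simple fallback for missing predictions
write.csv(data.frame(Id = test$Id, SalePrice = pred),
          "submission_lm3.csv", row.names = FALSE)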
Suggestions for further work.
- Explore other functional forms for the regression relationship; for instance, we could try (see the sketch after this list):
    - modeling the log of the SalePrice variable in the regression, since the variable is skewed to the right
    - modeling the ratio SalePrice / GrLivArea (i.e., price per square foot)
- Try other regression estimation approaches, such as generalized linear models, non-linear models, or more advanced techniques (ridge, lasso, etc.).
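As a sketch of the first and last suggestions, assuming a data frame train holding the variables used in lm3 and that the glmnet package is installed:
# log-price version of Model 3; log() pulls in the right tail of SalePrice
lm_log <- lm(log(SalePrice) ~ GrLivArea:Neighborhood + as.factor(OverallQual) +
               HouseStyle + RoofMatl + SaleCondition, data = train)
summary(lm_log)$adj.r.squared
# lasso fit on the same design matrix, with the penalty chosen by cross-validation
library(glmnet)
X <- model.matrix(~ GrLivArea:Neighborhood + as.factor(OverallQual) +
                    HouseStyle + RoofMatl + SaleCondition, data = train)[, -1]
cv_fit <- cv.glmnet(X, log(train$SalePrice), alpha = 1)  # alpha = 1 => lasso
coef(cv_fit, s = "lambda.min")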