library(GGally)
library(MASS)
library(modelr)
library(tidyverse)
library(stats)
set.seed(210514)
N <- 6
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1)/2, sd = (N + 1)/2)
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.249 3.487 3.493 4.728 5.998
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -11.677 1.151 3.480 3.491 5.869 16.004
Calculate as a minimum the below probabilities a through c. Assume the small letter \(x\) is estimated as the median of the \(X\) variable, and the small letter \(y\) is estimated as the first quartile of the \(Y\) variable. Interpret the meaning of all probabilities.
\(P(X>x|X>y)\) is the number of elements of \(X\) greater than both \(x\) and \(y\), divided by the number of elements of \(X\) greater than \(y\).
\(P((X>x)|(X>y)) = \frac{P(X>\text{max}(x,y))}{P(X>y)}\).
x <- median(X)
y <- quantile(Y, 0.25, names = FALSE)
numerator <- length(X[X > max(x,y)])
denominator <- length(X[X > y])
ans_1a <- numerator/denominator
ans_1a
## [1] 0.5154108
\(P(x>x)\), the probability that \(X\) is greater than its median, is \(0.50\). \(P(Y>y)\), the probability that \(Y\) is greater than its first quartile, is \(0.75\). Since \(X\) and \(Y\) are independent, \(P((X>x) \cap (Y>y)) = P(X>x)P(Y>y) = 0.375\).
\(P((X<x) | (X>y))\) is the number of elements of \(X\) between \(y\) and \(x\), divided by the number of elements of \(X\) greater than \(y\).
\(P((X<x)|(X>y)) = \frac{P(y<X<x)}{P(X>y)}\)
numerator <- length(X[y < X & X < x])
denominator <- length(X[X > y])
ans_1c <- numerator/denominator
ans_1c
## [1] 0.4845892
Investigate whether \(P((X>x) \cap (Y>y)) = P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
The statement is true when \(X\) and \(Y\) are independent, which they are. As described in the response to b, \(P((X>x) \cap (Y>y)) = P(X>x)P(Y>y) = 0.375\). We can verify this by examining a table.
XY_df <- data.frame("X" = X, "Y" = Y)
mgl_prob_X_gt_x <- XY_df %>%
filter(X > x) %>%
nrow() / nrow(XY_df)
mgl_prob_Y_gt_y <- XY_df %>%
filter(Y > y) %>%
nrow() / nrow(XY_df)
jnt_prob_X_gt_x_and_Y_gt_y <- XY_df %>%
filter((X > x) & (Y > y)) %>%
nrow() / nrow(XY_df)
mgl_prob_X_gt_x * mgl_prob_Y_gt_y / jnt_prob_X_gt_x_and_Y_gt_y
## [1] 1.002674
The product of the marginal probabilities is approximately equal to the joint probability based on the table. The small discrepancy is the result of the finite number of samples from the joint distribution of \(X\) and \(Y\).
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Both Fisher’s Exact Test and the Chi Square Test are used on categorical data summarized in a contingency table.
To use these tests, I’ll build a contingency table that counts instances when each variable is greater than or less than its mean.
XY <- XY_df %>%
filter((X > mean(X)) & (Y > mean(Y))) %>%
nrow()
XnotY <- XY_df %>%
filter((X > mean(X)) & (Y <= mean(Y))) %>%
nrow()
notXY <- XY_df %>%
filter((X <= mean(X)) & (Y > mean(Y))) %>%
nrow()
notXnotY <- XY_df %>%
filter((X <= mean(X)) & (Y <= mean(Y))) %>%
nrow()
cty_tbl <- matrix(c(XY, XnotY, notXY, notXnotY), nrow = 2)
colnames(cty_tbl) = c("X>Xbar", "X<=Xbar")
rownames(cty_tbl) = c("Y>Ybar", "Y<=Ybar")
cty_tbl
## X>Xbar X<=Xbar
## Y>Ybar 2463 2527
## Y<=Ybar 2524 2486
fisher <- fisher.test(cty_tbl)
print(fisher)
##
## Fisher's Exact Test for Count Data
##
## data: cty_tbl
## p-value = 0.3078
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.8869234 1.0391208
## sample estimates:
## odds ratio
## 0.9599796
For Fisher’s Exact Test, the null hypothesis is that variables are independent. The alternative hypothesis is that not all variables are independent. The \(p\) value for the above test is much greater than \(\alpha = 0.05\), so we fail to reject the null. Fisher’s Exact Test does not provide evidence that \(X\) and \(Y\) are not independent.
chisq <- chisq.test(cty_tbl)
print(chisq)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cty_tbl
## X-squared = 1.0011, df = 1, p-value = 0.3171
For the Chi Square Test, the null and alternative hypotheses are the same as for Fisher’s Exact Test above. The \(p\) value for the Chi Square test just conducted is much greater than \(\alpha = 0.05\), so we fail to reject the null. Neither Fisher’s Exact Test nor the Chi Square Test provide evidence that \(X\) and \(Y\) are not independent.
The independence of \(X\) and \(Y\) is also evident in a scatterplot, which shows no relationship between the variables.
ggplot(XY_df, aes(x = X, y = Y)) +
geom_point()
Because Fisher’s Exact Test was designed with smaller datasets in mind, it’s not the most appropriate test for this data. The Chi Square Test is a more appropriate choice.
h1.dat <- read_csv("https://raw.githubusercontent.com/dmoscoe/SPS/main/houses_train.csv")
The data set contains 1460 observations on 81 variables.
str(h1.dat)
## spec_tbl_df [1,460 x 81] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:1460] 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : num [1:1460] 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr [1:1460] "RL" "RL" "RL" "RL" ...
## $ LotFrontage : num [1:1460] 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : num [1:1460] 8450 9600 11250 9550 14260 ...
## $ Street : chr [1:1460] "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr [1:1460] NA NA NA NA ...
## $ LotShape : chr [1:1460] "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr [1:1460] "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr [1:1460] "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr [1:1460] "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr [1:1460] "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr [1:1460] "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr [1:1460] "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr [1:1460] "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr [1:1460] "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr [1:1460] "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : num [1:1460] 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : num [1:1460] 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : num [1:1460] 2003 1976 2001 1915 2000 ...
## $ YearRemodAdd : num [1:1460] 2003 1976 2002 1970 2000 ...
## $ RoofStyle : chr [1:1460] "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr [1:1460] "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr [1:1460] "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr [1:1460] "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr [1:1460] "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : num [1:1460] 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr [1:1460] "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr [1:1460] "TA" "TA" "TA" "TA" ...
## $ Foundation : chr [1:1460] "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr [1:1460] "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr [1:1460] "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr [1:1460] "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr [1:1460] "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : num [1:1460] 706 978 486 216 655 ...
## $ BsmtFinType2 : chr [1:1460] "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : num [1:1460] 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : num [1:1460] 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : num [1:1460] 856 1262 920 756 1145 ...
## $ Heating : chr [1:1460] "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr [1:1460] "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr [1:1460] "Y" "Y" "Y" "Y" ...
## $ Electrical : chr [1:1460] "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ 1stFlrSF : num [1:1460] 856 1262 920 961 1145 ...
## $ 2ndFlrSF : num [1:1460] 854 0 866 756 1053 ...
## $ LowQualFinSF : num [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : num [1:1460] 1710 1262 1786 1717 2198 ...
## $ BsmtFullBath : num [1:1460] 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : num [1:1460] 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : num [1:1460] 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : num [1:1460] 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : num [1:1460] 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : num [1:1460] 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr [1:1460] "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : num [1:1460] 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr [1:1460] "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : num [1:1460] 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr [1:1460] NA "TA" "TA" "Gd" ...
## $ GarageType : chr [1:1460] "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : num [1:1460] 2003 1976 2001 1998 2000 ...
## $ GarageFinish : chr [1:1460] "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : num [1:1460] 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : num [1:1460] 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr [1:1460] "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr [1:1460] "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr [1:1460] "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : num [1:1460] 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : num [1:1460] 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: num [1:1460] 0 0 0 272 0 0 0 228 205 0 ...
## $ 3SsnPorch : num [1:1460] 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : num [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : num [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr [1:1460] NA NA NA NA ...
## $ Fence : chr [1:1460] NA NA NA NA ...
## $ MiscFeature : chr [1:1460] NA NA NA NA ...
## $ MiscVal : num [1:1460] 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : num [1:1460] 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : num [1:1460] 2008 2007 2008 2006 2008 ...
## $ SaleType : chr [1:1460] "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr [1:1460] "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : num [1:1460] 208500 181500 223500 140000 250000 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. MSSubClass = col_double(),
## .. MSZoning = col_character(),
## .. LotFrontage = col_double(),
## .. LotArea = col_double(),
## .. Street = col_character(),
## .. Alley = col_character(),
## .. LotShape = col_character(),
## .. LandContour = col_character(),
## .. Utilities = col_character(),
## .. LotConfig = col_character(),
## .. LandSlope = col_character(),
## .. Neighborhood = col_character(),
## .. Condition1 = col_character(),
## .. Condition2 = col_character(),
## .. BldgType = col_character(),
## .. HouseStyle = col_character(),
## .. OverallQual = col_double(),
## .. OverallCond = col_double(),
## .. YearBuilt = col_double(),
## .. YearRemodAdd = col_double(),
## .. RoofStyle = col_character(),
## .. RoofMatl = col_character(),
## .. Exterior1st = col_character(),
## .. Exterior2nd = col_character(),
## .. MasVnrType = col_character(),
## .. MasVnrArea = col_double(),
## .. ExterQual = col_character(),
## .. ExterCond = col_character(),
## .. Foundation = col_character(),
## .. BsmtQual = col_character(),
## .. BsmtCond = col_character(),
## .. BsmtExposure = col_character(),
## .. BsmtFinType1 = col_character(),
## .. BsmtFinSF1 = col_double(),
## .. BsmtFinType2 = col_character(),
## .. BsmtFinSF2 = col_double(),
## .. BsmtUnfSF = col_double(),
## .. TotalBsmtSF = col_double(),
## .. Heating = col_character(),
## .. HeatingQC = col_character(),
## .. CentralAir = col_character(),
## .. Electrical = col_character(),
## .. `1stFlrSF` = col_double(),
## .. `2ndFlrSF` = col_double(),
## .. LowQualFinSF = col_double(),
## .. GrLivArea = col_double(),
## .. BsmtFullBath = col_double(),
## .. BsmtHalfBath = col_double(),
## .. FullBath = col_double(),
## .. HalfBath = col_double(),
## .. BedroomAbvGr = col_double(),
## .. KitchenAbvGr = col_double(),
## .. KitchenQual = col_character(),
## .. TotRmsAbvGrd = col_double(),
## .. Functional = col_character(),
## .. Fireplaces = col_double(),
## .. FireplaceQu = col_character(),
## .. GarageType = col_character(),
## .. GarageYrBlt = col_double(),
## .. GarageFinish = col_character(),
## .. GarageCars = col_double(),
## .. GarageArea = col_double(),
## .. GarageQual = col_character(),
## .. GarageCond = col_character(),
## .. PavedDrive = col_character(),
## .. WoodDeckSF = col_double(),
## .. OpenPorchSF = col_double(),
## .. EnclosedPorch = col_double(),
## .. `3SsnPorch` = col_double(),
## .. ScreenPorch = col_double(),
## .. PoolArea = col_double(),
## .. PoolQC = col_character(),
## .. Fence = col_character(),
## .. MiscFeature = col_character(),
## .. MiscVal = col_double(),
## .. MoSold = col_double(),
## .. YrSold = col_double(),
## .. SaleType = col_character(),
## .. SaleCondition = col_character(),
## .. SalePrice = col_double()
## .. )
The data set contains 1460 observations on 81 variables. Some variables, like Alley and LotFrontage, include missing values encoded as “NA”. There do not appear to be any alternate encodings for missing values.
Provide univariate descriptive statistics and appropriate plots for the training data set.
Rather than provide descriptive statistics and visualizations for all 81 variables, I will choose a few to study more closely. To identify numeric variables that might be significant in a predictive model, I’ll examine their correlations with the response variable SalePrice.
cor <- h1.dat %>%
keep(is.numeric) %>%
cor()
sort(cor[,38])
## KitchenAbvGr EnclosedPorch MSSubClass OverallCond YrSold
## -0.13590737 -0.12857796 -0.08428414 -0.07785589 -0.02892259
## LowQualFinSF Id MiscVal BsmtHalfBath BsmtFinSF2
## -0.02560613 -0.02191672 -0.02118958 -0.01684415 -0.01137812
## 3SsnPorch MoSold PoolArea ScreenPorch BedroomAbvGr
## 0.04458367 0.04643225 0.09240355 0.11144657 0.16821315
## BsmtUnfSF BsmtFullBath LotArea HalfBath OpenPorchSF
## 0.21447911 0.22712223 0.26384335 0.28410768 0.31585623
## 2ndFlrSF WoodDeckSF BsmtFinSF1 Fireplaces YearRemodAdd
## 0.31933380 0.32441344 0.38641981 0.46692884 0.50710097
## YearBuilt TotRmsAbvGrd FullBath 1stFlrSF TotalBsmtSF
## 0.52289733 0.53372316 0.56066376 0.60585218 0.61358055
## GarageArea GarageCars GrLivArea OverallQual SalePrice
## 0.62343144 0.64040920 0.70862448 0.79098160 1.00000000
Numeric variables with potentially important relationships to SalePrice include OverallQual, GrLivArea, and KitchenAbvGr. I will also examine the response variable, SalePrice.
I’ll choose categorical variables that seem like they might be important based on my knowledge of housing costs.
cat <- h1.dat %>%
discard(is.numeric) %>%
colnames()
cat
## [1] "MSZoning" "Street" "Alley" "LotShape"
## [5] "LandContour" "Utilities" "LotConfig" "LandSlope"
## [9] "Neighborhood" "Condition1" "Condition2" "BldgType"
## [13] "HouseStyle" "RoofStyle" "RoofMatl" "Exterior1st"
## [17] "Exterior2nd" "MasVnrType" "ExterQual" "ExterCond"
## [21] "Foundation" "BsmtQual" "BsmtCond" "BsmtExposure"
## [25] "BsmtFinType1" "BsmtFinType2" "Heating" "HeatingQC"
## [29] "CentralAir" "Electrical" "KitchenQual" "Functional"
## [33] "FireplaceQu" "GarageType" "GarageFinish" "GarageQual"
## [37] "GarageCond" "PavedDrive" "PoolQC" "Fence"
## [41] "MiscFeature" "SaleType" "SaleCondition"
I’ll examine LotShape, LandContour, and Neighborhood.
h1_subset.dat <- h1.dat %>%
select("Id", "OverallQual", "GrLivArea", "KitchenAbvGr", "LotShape", "LandContour", "Neighborhood", "SalePrice")
summary(h1_subset.dat)
## Id OverallQual GrLivArea KitchenAbvGr
## Min. : 1.0 Min. : 1.000 Min. : 334 Min. :0.000
## 1st Qu.: 365.8 1st Qu.: 5.000 1st Qu.:1130 1st Qu.:1.000
## Median : 730.5 Median : 6.000 Median :1464 Median :1.000
## Mean : 730.5 Mean : 6.099 Mean :1515 Mean :1.047
## 3rd Qu.:1095.2 3rd Qu.: 7.000 3rd Qu.:1777 3rd Qu.:1.000
## Max. :1460.0 Max. :10.000 Max. :5642 Max. :3.000
## LotShape LandContour Neighborhood SalePrice
## Length:1460 Length:1460 Length:1460 Min. : 34900
## Class :character Class :character Class :character 1st Qu.:129975
## Mode :character Mode :character Mode :character Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
As expected, some of the variables appear significantly skewed right, especially GrLivArea, the above-ground living area in square feet, and SalePrice.
h1_subset.dat %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram for Id is not meaningful. As The distributions of SalePrice and GrLivArea are similar in shape, and their close relationship is supported by the idea that larger homes will, in general, cost more. There are few houses considered to be of low OverallQual in the data set. Almost all homes have 1 kitchen, but some have 2 or 3.
h1_subset.dat %>%
group_by(LotShape) %>%
summarise(n()) %>%
arrange(desc(`n()`))
## # A tibble: 4 x 2
## LotShape `n()`
## <chr> <int>
## 1 Reg 925
## 2 IR1 484
## 3 IR2 41
## 4 IR3 10
The majority of lots have a “general shape” that is regular. 484 lie on property that is “slightly irregular,” 41 on land that is “moderately irregular,” and 10 on “irregular” land.
h1_subset.dat %>%
group_by(LandContour) %>%
summarise(n()) %>%
arrange(desc(`n()`))
## # A tibble: 4 x 2
## LandContour `n()`
## <chr> <int>
## 1 Lvl 1311
## 2 Bnk 63
## 3 HLS 50
## 4 Low 36
The data description file tells us that Lvl means “Near Flat/Level, Bnk means”Banked-Quick and significant rise from street grade to building, HLS means “Hillside - Significant slope from side to side,” and Low means “Depression”.
h1_subset.dat %>%
group_by(Neighborhood) %>%
summarise(n()) %>%
arrange(desc(`n()`))
## # A tibble: 25 x 2
## Neighborhood `n()`
## <chr> <int>
## 1 NAmes 225
## 2 CollgCr 150
## 3 OldTown 113
## 4 Edwards 100
## 5 Somerst 86
## 6 Gilbert 79
## 7 NridgHt 77
## 8 Sawyer 74
## 9 NWAmes 73
## 10 SawyerW 59
## # ... with 15 more rows
Neighborhood names are generally self-explanatory. IDOTRR stands for “Iowa DOT and Rail Road”, and SWISU stands for “South & West of Iowa State University.” The data are distributed unequally, favoring a few neighborhoods like North Ames, College Creek, and Old Town.
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
h1_subset.dat %>%
keep(is.numeric) %>%
ggpairs()
Derive a correlation matrix for any three quantitative variables in the dataset.
corr_matrix <- cor(h1.dat[,c("GrLivArea", "OverallQual", "SalePrice")])
corr_matrix
## GrLivArea OverallQual SalePrice
## GrLivArea 1.0000000 0.5930074 0.7086245
## OverallQual 0.5930074 1.0000000 0.7909816
## SalePrice 0.7086245 0.7909816 1.0000000
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
cor.test(h1.dat$GrLivArea, h1.dat$SalePrice, method = "pearson", conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: h1.dat$GrLivArea and h1.dat$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
cor.test(h1.dat$SalePrice, h1.dat$OverallQual, method = "pearson", conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: h1.dat$SalePrice and h1.dat$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
cor.test(h1.dat$GrLivArea, h1.dat$OverallQual, method = "pearson", conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: h1.dat$GrLivArea and h1.dat$OverallQual
## t = 28.121, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5708061 0.6143422
## sample estimates:
## cor
## 0.5930074
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
All pairwise correlations tested are statistically significant and positive. In a case of familywise error, a type I error occurs due to the large number of hypothesis tests being conducted. In a dataset with 81 variables, as we have here, testing each pair of variables for significant correlation is likely to lead eventually to some familywise error. For only three tests of correlation, as above, familywise error is not yet an important consideration. But if I were to continue testing for significance, then I would need to employ some adjustment to control for the possibility of familywise error.
Invert your correlation matrix from above.
corr_matrix_inverse <- solve(corr_matrix)
corr_matrix_inverse
## GrLivArea OverallQual SalePrice
## GrLivArea 2.0200794 -0.1753704 -1.292763
## OverallQual -0.1753704 2.6865350 -2.000728
## SalePrice -1.2927630 -2.0007280 3.498623
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
Since the precision matrix and correlation matrices are inverses, their product in either direction will be the 3-by-3 identity matrix.
round(corr_matrix %*% corr_matrix_inverse, 4)
## GrLivArea OverallQual SalePrice
## GrLivArea 1 0 0
## OverallQual 0 1 0
## SalePrice 0 0 1
round(corr_matrix_inverse %*% corr_matrix, 4)
## GrLivArea OverallQual SalePrice
## GrLivArea 1 0 0
## OverallQual 0 1 0
## SalePrice 0 0 1
Conduct LU decomposition on the matrix.
LUFact <- function(A) {
#Assumes A is a square matrix. Returns L and U, lower-triangular and upper-triangular factors of A.
n <- nrow(A)
#initialize U = [0], L = In
L <- matrix(rep(0, length.out = n^2), nrow = n)
U <- matrix(rep(0, length.out = n^2), nrow = n)
for (k in seq(n)) {
L[k,k] = 1
}
for (k in seq(n)) {
U[k,k] = A[k,k]
if (k < n) {
for (i in seq(k + 1, n)) {
L[i,k] = (A[i,k] / A[k,k])
U[k,i] = A[k,i]
}
}
for (i in seq(k + 1, n)) {
if (k < n) {
for (j in seq(k + 1, n)) {
A[i,j] = A[i,j] - (L[i,k] * U[k,j])
}
}
}
}
return(list("L" = L, "U" = U))
}
LUFact(corr_matrix)
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.5930074 1.0000000 0
## [3,] 0.7086245 0.5718616 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.5930074 0.7086245
## [2,] 0 0.6483422 0.3707620
## [3,] 0 0.0000000 0.2858268
Select a variable in the training dataset that is skewed to the right. Then load the MASS package and run fitdistr to fit an exponential probability density function.
The variable WoodDeckSF is skewed right.
hist(h1.dat$WoodDeckSF, breaks = 50)
exp_fit <- fitdistr(h1.dat$WoodDeckSF, densfun = "exponential")
str(exp_fit)
## List of 5
## $ estimate: Named num 0.0106
## ..- attr(*, "names")= chr "rate"
## $ sd : Named num 0.000278
## ..- attr(*, "names")= chr "rate"
## $ vcov : num [1, 1] 7.71e-08
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr "rate"
## .. ..$ : chr "rate"
## $ n : int 1460
## $ loglik : num -8097
## - attr(*, "class")= chr "fitdistr"
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value.
exp_fit$estimate
## rate
## 0.0106107
exp_sample <- rexp(1000, exp_fit$estimate)
The optimal value of \(\lambda\) for this distribution is 0.01061.
Plot a histogram and compare it with a histogram of your original variable.
par(mfrow = c(1,2))
hist(h1.dat$WoodDeckSF, breaks = 50, main = "WoodDeckSF (actual)", xlab = "Wood Deck Area (sqft)", ylab = "relative frequency", freq = FALSE)
hist(exp_sample, breaks = 50, main = "WoodDeckSF (sim)", xlab = "sim'd Wood Deck Area (sqft)", ylab = "relative frequency", freq = FALSE)
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
qexp(c(0.05, 0.95), rate = exp_fit$estimate)
## [1] 4.834112 282.331352
Generate a 95% confidence interval from the empirical data, assuming normality.
I think this means, “Find a 95% confidence interval for the mean of the empirical data, assuming normality.”
ttest <- t.test(h1.dat$WoodDeckSF, conf.level = 0.95)
ttest
##
## One Sample t-test
##
## data: h1.dat$WoodDeckSF
## t = 28.731, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 87.80998 100.67906
## sample estimates:
## mean of x
## 94.24452
Provide the empirical 5th percentile and 95th percentile of the data. Discuss.
quantile(h1.dat$WoodDeckSF, c(0.05, 0.95))
## 5% 95%
## 0 335
ggplot(data = h1.dat, aes(x = WoodDeckSF)) +
geom_histogram() +
geom_vline(xintercept = quantile(exp_sample, 0.05), color = "blue", size = 1) +
geom_vline(xintercept = quantile(exp_sample, 0.95), color = "blue", size = 1) +
geom_vline(xintercept = quantile(h1.dat$WoodDeckSF, 0.05), color = "red", size = 1) +
geom_vline(xintercept = quantile(h1.dat$WoodDeckSF, 0.95), color = "red", size = 1) +
geom_vline(xintercept = ttest$conf.int[1], color = "red", size = 1) +
geom_vline(xintercept = ttest$conf.int[2], color = "red", size = 1) +
xlim(-15, 600) +
geom_vline(xintercept = mean(exp_sample), color = "blue", size = 1) +
xlim(-15, 600) +
labs(title = "Comparing Quantiles for WoodDeckSF and a Fitted Exponential")
Red lines show the 5th and 95th percentile of the empirical data, as well as the 95% confidence interval for the mean of the empirical data. Blue lines show the 5th and 95th percentile of the fitted exponential distribution, as well as the mean of the fitted exponential distribution. The mean of the simulated data lies within the confidence interval for the empirical data, and the 5th and 95th percentiles are similar. I would proceed with caution when using the fitted exponential distribution instead of the empirical data. But based on the closeness of the values in the histogram above, there may be some instances when using the simulated data is appropriate.
model.dat <- read_csv("https://raw.githubusercontent.com/dmoscoe/SPS/main/houses_train.csv")
predict.dat <- read_csv("https://raw.githubusercontent.com/dmoscoe/SPS/main/houses_test.csv")
This data set contains 79 variables that describe houses in Ames, IA. The goal is to build a model that uses this data to explain the Sale Price for each house.
Much of the work of generating a model for this dataset amounts to identifying which data to exclude. I followed a few principles in choosing which variables to include or drop from the final model:
If a group of variables all describe the same aspect of the house, such as the garage, then it might be possible to summarize them with a smaller number of variables. These variables are also likely to show correlation with each other, which may be a reason to exclude one or more of them.
If almost all the entries for a variable are the same, then the variation in this variable can’t explain variation in Sale Price. Variables like this are good candidates for exclusion.
Sale Price decisions are made by people. That means whatever actual people value most in a home will be strong determinants of Sale Price. So my informal understanding of home sales is at least potentially relevant to determining a model for Sale Price.
I start by dropping Id, Utilities, RoofMatl. Id merely signifies row numbers for observations, and is not a determinant of Sale Price. Almost every entry contains the same values for Utilities and RoofMatl.
model.dat <- model.dat[,c(2:9,11:22,24:81)]
The dataset contains a large number of NA entries. Reviewing the data dictionary explains that these entries actually do contain information. Often, they indicate that a particular feature of a house is missing, or that a measurement is zero. Based on the information in the data dictionary, I replace all NAs with more descriptive entries. The resulting dataset has no NA entries.
Here I also transform some very rare entries in Exterior1st and BldgType to the modes for those variables.
model.dat$Alley[is.na(model.dat$Alley)] <- "none"
model.dat$FireplaceQu[is.na(model.dat$FireplaceQu)] <- "none"
model.dat$PoolQC[is.na(model.dat$PoolQC)] <- "none"
model.dat$Fence[is.na(model.dat$Fence)] <- "none"
model.dat$MiscFeature[is.na(model.dat$MiscFeature)] <- "none"
model.dat$BsmtQual[is.na(model.dat$BsmtQual)] <- "none"
model.dat$BsmtCond[is.na(model.dat$BsmtCond)] <- "none"
model.dat$BsmtExposure[is.na(model.dat$BsmtExposure)] <- "none"
model.dat$BsmtFinType1[is.na(model.dat$BsmtFinType1)] <- "none"
model.dat$BsmtFinType2[is.na(model.dat$BsmtFinType2)] <- "none"
model.dat$GarageType[is.na(model.dat$GarageType)] <- "none"
model.dat$GarageFinish[is.na(model.dat$GarageFinish)] <- "none"
model.dat$GarageQual[is.na(model.dat$GarageQual)] <- "none"
model.dat$GarageCond[is.na(model.dat$GarageCond)] <- "none"
model.dat$MasVnrArea[is.na(model.dat$MasVnrArea)] <- 0
model.dat$MasVnrType[is.na(model.dat$MasVnrType)] <- "none"
model.dat$Electrical[is.na(model.dat$Electrical)] <- "SBrkr"
model.dat$GarageYrBlt[is.na(model.dat$GarageYrBlt)] <- 2006 #Most common value
model.dat$Exterior1st[model.dat$Exterior1st %in% c("AsphShn","BrkComm","CBlock","ImStucc","Stone")] <- "VinylSd" #Most common value
model.dat$BldgType[!(model.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
model.dat$LotFrontage[is.na(model.dat$LotFrontage)] <- 0
Now that NAs have been resolved, it’s time to partition the data into a training set and a testing set. I include 75% of the data in the training set, and reserve 25% for testing.
training_rows <- sample(nrow(model.dat), round(0.75 * nrow(model.dat)), replace = FALSE)
model_train.dat <- model.dat[training_rows,]
model_test.dat <- model.dat[-training_rows,]
My plan for constructing a model is to use the stepAIC function from modelr. This function eliminates variables as long as the elimination results in a reduction in the AIC. The stepAIC model requires a complete linear model to begin its work.
model_train.lm <- lm(SalePrice ~ ., data = model_train.dat)
Because the output from stepAIC is so long, I set include = FALSE in the code chunk containing it. The output of the stepAIC function is summarized below.
summary(model_train.slm)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + LotArea +
## Street + LotConfig + Neighborhood + Condition1 + Condition2 +
## BldgType + OverallQual + OverallCond + YearBuilt + YearRemodAdd +
## Exterior1st + MasVnrType + MasVnrArea + ExterQual + BsmtQual +
## BsmtExposure + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + `1stFlrSF` +
## `2ndFlrSF` + BsmtFullBath + BedroomAbvGr + KitchenAbvGr +
## KitchenQual + TotRmsAbvGrd + Functional + Fireplaces + GarageArea +
## GarageQual + GarageCond + WoodDeckSF + ScreenPorch + PoolArea +
## PoolQC + MoSold + SaleCondition, data = model_train.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -174744 -9792 0 9367 174744
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.241e+06 2.741e+05 -15.469 < 2e-16 ***
## MSSubClass -7.668e+01 4.612e+01 -1.663 0.096693 .
## LotFrontage 4.735e+01 2.514e+01 1.883 0.059972 .
## LotArea 4.548e-01 8.679e-02 5.240 1.97e-07 ***
## StreetPave 3.266e+04 1.365e+04 2.392 0.016945 *
## LotConfigCulDSac 1.008e+04 3.561e+03 2.830 0.004754 **
## LotConfigFR2 -4.582e+03 4.834e+03 -0.948 0.343478
## LotConfigFR3 -1.116e+04 1.308e+04 -0.853 0.393776
## LotConfigInside 3.052e+02 2.048e+03 0.149 0.881575
## NeighborhoodBlueste -1.067e+04 2.512e+04 -0.425 0.671248
## NeighborhoodBrDale -1.302e+04 1.034e+04 -1.259 0.208299
## NeighborhoodBrkSide -4.435e+03 9.164e+03 -0.484 0.628519
## NeighborhoodClearCr -8.250e+03 9.500e+03 -0.868 0.385369
## NeighborhoodCollgCr -9.526e+03 7.516e+03 -1.267 0.205285
## NeighborhoodCrawfor 9.984e+03 8.840e+03 1.129 0.259019
## NeighborhoodEdwards -2.028e+04 8.304e+03 -2.443 0.014756 *
## NeighborhoodGilbert -7.091e+03 8.016e+03 -0.885 0.376547
## NeighborhoodIDOTRR -1.759e+04 9.931e+03 -1.771 0.076896 .
## NeighborhoodMeadowV -2.754e+04 1.144e+04 -2.407 0.016286 *
## NeighborhoodMitchel -2.242e+04 8.670e+03 -2.586 0.009858 **
## NeighborhoodNAmes -1.537e+04 8.075e+03 -1.904 0.057245 .
## NeighborhoodNoRidge 2.970e+04 8.981e+03 3.307 0.000976 ***
## NeighborhoodNPkVill 5.294e+03 1.212e+04 0.437 0.662478
## NeighborhoodNridgHt 1.884e+04 7.999e+03 2.356 0.018681 *
## NeighborhoodNWAmes -1.713e+04 8.406e+03 -2.038 0.041810 *
## NeighborhoodOldTown -1.663e+04 8.817e+03 -1.886 0.059568 .
## NeighborhoodSawyer -1.011e+04 8.570e+03 -1.180 0.238393
## NeighborhoodSawyerW -6.133e+03 8.244e+03 -0.744 0.457128
## NeighborhoodSomerst 2.448e+03 7.888e+03 0.310 0.756387
## NeighborhoodStoneBr 3.403e+04 8.873e+03 3.835 0.000134 ***
## NeighborhoodSWISU -1.244e+04 1.026e+04 -1.212 0.225923
## NeighborhoodTimber -1.239e+04 8.408e+03 -1.474 0.140906
## NeighborhoodVeenker 1.158e+03 1.108e+04 0.105 0.916771
## Condition1Feedr 4.268e+03 5.663e+03 0.754 0.451282
## Condition1Norm 1.192e+04 4.617e+03 2.581 0.010000 **
## Condition1PosA 3.993e+03 1.057e+04 0.378 0.705719
## Condition1PosN 1.314e+04 8.231e+03 1.596 0.110704
## Condition1RRAe -1.427e+04 1.022e+04 -1.396 0.163108
## Condition1RRAn 1.231e+04 7.333e+03 1.678 0.093577 .
## Condition1RRNe 6.039e+01 1.801e+04 0.003 0.997324
## Condition1RRNn -3.038e+03 1.611e+04 -0.189 0.850429
## Condition2Feedr -2.485e+04 2.361e+04 -1.052 0.292879
## Condition2Norm -1.717e+04 1.919e+04 -0.895 0.371198
## Condition2PosA 2.267e+04 3.157e+04 0.718 0.472968
## Condition2PosN -2.647e+05 2.743e+04 -9.651 < 2e-16 ***
## Condition2RRAe -3.071e+04 3.143e+04 -0.977 0.328801
## Condition2RRAn -9.907e+03 3.090e+04 -0.321 0.748564
## Condition2RRNn -1.087e+04 3.143e+04 -0.346 0.729460
## BldgTypeNot1Fam -9.182e+03 5.505e+03 -1.668 0.095654 .
## OverallQual 7.584e+03 1.108e+03 6.846 1.34e-11 ***
## OverallCond 4.787e+03 9.063e+02 5.282 1.58e-07 ***
## YearBuilt 3.160e+02 7.302e+01 4.328 1.66e-05 ***
## YearRemodAdd 9.271e+01 6.121e+01 1.515 0.130165
## Exterior1stBrkFace 2.106e+04 7.840e+03 2.686 0.007361 **
## Exterior1stCemntBd 7.698e+03 8.184e+03 0.941 0.347166
## Exterior1stHdBoard -2.975e+03 6.946e+03 -0.428 0.668558
## Exterior1stMetalSd 2.024e+03 6.717e+03 0.301 0.763167
## Exterior1stPlywood -5.656e+03 7.380e+03 -0.766 0.443673
## Exterior1stStucco 7.383e+02 8.699e+03 0.085 0.932380
## Exterior1stVinylSd -1.510e+03 6.814e+03 -0.222 0.824631
## Exterior1stWd Sdng -8.877e+02 6.753e+03 -0.131 0.895438
## Exterior1stWdShing -2.515e+03 8.568e+03 -0.294 0.769181
## MasVnrTypeBrkFace 1.142e+04 8.135e+03 1.404 0.160551
## MasVnrTypenone 1.572e+04 1.244e+04 1.264 0.206484
## MasVnrTypeNone 1.710e+04 8.259e+03 2.071 0.038634 *
## MasVnrTypeStone 1.493e+04 8.648e+03 1.727 0.084569 .
## MasVnrArea 3.004e+01 6.564e+00 4.577 5.34e-06 ***
## ExterQualFa -2.094e+04 1.047e+04 -2.000 0.045823 *
## ExterQualGd -2.398e+04 5.544e+03 -4.326 1.67e-05 ***
## ExterQualTA -2.555e+04 6.041e+03 -4.229 2.57e-05 ***
## BsmtQualFa -1.829e+04 7.195e+03 -2.543 0.011155 *
## BsmtQualGd -2.154e+04 3.822e+03 -5.636 2.27e-08 ***
## BsmtQualnone 1.192e+03 2.530e+04 0.047 0.962428
## BsmtQualTA -2.174e+04 4.747e+03 -4.580 5.26e-06 ***
## BsmtExposureGd 1.945e+04 3.276e+03 5.935 4.07e-09 ***
## BsmtExposureMn -1.636e+03 3.477e+03 -0.470 0.638129
## BsmtExposureNo -5.078e+03 2.389e+03 -2.126 0.033781 *
## BsmtExposurenone -1.178e+04 2.384e+04 -0.494 0.621246
## BsmtFinSF1 3.582e+01 5.280e+00 6.785 2.00e-11 ***
## BsmtFinSF2 2.200e+01 6.693e+00 3.287 0.001049 **
## BsmtUnfSF 2.037e+01 4.974e+00 4.095 4.57e-05 ***
## `1stFlrSF` 4.584e+01 5.511e+00 8.317 3.00e-16 ***
## `2ndFlrSF` 5.300e+01 3.856e+00 13.744 < 2e-16 ***
## BsmtFullBath 3.915e+03 2.058e+03 1.902 0.057447 .
## BedroomAbvGr -3.741e+03 1.462e+03 -2.558 0.010663 *
## KitchenAbvGr -1.321e+04 5.164e+03 -2.558 0.010689 *
## KitchenQualFa -1.673e+04 7.043e+03 -2.375 0.017721 *
## KitchenQualGd -2.108e+04 3.971e+03 -5.309 1.37e-07 ***
## KitchenQualTA -2.162e+04 4.550e+03 -4.752 2.32e-06 ***
## TotRmsAbvGrd 2.662e+03 1.074e+03 2.479 0.013332 *
## FunctionalMaj2 -3.090e+03 1.513e+04 -0.204 0.838204
## FunctionalMin1 7.353e+03 9.232e+03 0.797 0.425916
## FunctionalMin2 9.049e+03 9.207e+03 0.983 0.325910
## FunctionalMod 8.064e+03 9.978e+03 0.808 0.419199
## FunctionalSev -2.690e+04 2.693e+04 -0.999 0.318088
## FunctionalTyp 2.088e+04 7.806e+03 2.676 0.007585 **
## Fireplaces 2.727e+03 1.530e+03 1.783 0.074921 .
## GarageArea 2.533e+01 5.752e+00 4.403 1.18e-05 ***
## GarageQualFa -1.696e+05 2.598e+04 -6.527 1.08e-10 ***
## GarageQualGd -1.608e+05 2.710e+04 -5.932 4.15e-09 ***
## GarageQualnone -2.900e+03 1.780e+04 -0.163 0.870613
## GarageQualPo -1.663e+05 3.264e+04 -5.096 4.16e-07 ***
## GarageQualTA -1.663e+05 2.555e+04 -6.511 1.19e-10 ***
## GarageCondFa 1.602e+05 3.148e+04 5.089 4.31e-07 ***
## GarageCondGd 1.516e+05 3.238e+04 4.681 3.25e-06 ***
## GarageCondnone NA NA NA NA
## GarageCondPo 1.562e+05 3.399e+04 4.595 4.89e-06 ***
## GarageCondTA 1.568e+05 3.076e+04 5.097 4.15e-07 ***
## WoodDeckSF 1.870e+01 6.555e+00 2.853 0.004429 **
## ScreenPorch 2.146e+01 1.376e+01 1.560 0.119097
## PoolArea 6.661e+03 3.724e+02 17.885 < 2e-16 ***
## PoolQCFa -8.557e+05 5.056e+04 -16.925 < 2e-16 ***
## PoolQCGd -3.764e+05 2.536e+04 -14.839 < 2e-16 ***
## PoolQCnone 3.443e+06 2.018e+05 17.065 < 2e-16 ***
## MoSold -4.978e+02 2.857e+02 -1.742 0.081795 .
## SaleConditionAdjLand 2.529e+03 1.788e+04 0.141 0.887555
## SaleConditionAlloca 8.402e+03 9.845e+03 0.853 0.393657
## SaleConditionFamily -1.074e+03 6.953e+03 -0.154 0.877298
## SaleConditionNormal 6.943e+03 3.094e+03 2.244 0.025057 *
## SaleConditionPartial 2.217e+04 4.391e+03 5.049 5.29e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23530 on 976 degrees of freedom
## Multiple R-squared: 0.9271, Adjusted R-squared: 0.9183
## F-statistic: 105.2 on 118 and 976 DF, p-value: < 2.2e-16
The stepAIC model results in a value for \(R^2\) so high that I suspect overfitting. Examining the model on the test set confirms this concern.
rsquare(model_train.slm, model_test.dat)
## Warning in predict.lm(model, data): prediction from a rank-deficient fit may be
## misleading
## [1] -0.05908811
To resolve the overfit, I need to remove variables from the model. The normal procedure of backward elimination would be relevant here, except that all variables in the model include at least one level with significance less than 0.05. Typically, this would mean the variable should remain in the model. My strategy for removing variables will rely on the principles described in the Tidy section. I’ll reserve only one variable for each attribute of the house (for example, multiple variables regarding the garage will be removed until only one about the garage remains). I’ll rely on some of my informal understanding of real estate sales. And I’ll try to reduce the number of factors of highly imbalanced categorical variables when possible.
model_train.dat$BldgType[!(model_train.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
model_train.dat$HouseStyle[model_train.dat$HouseStyle %in% c("2.5Fin", "2.5Unf")] <- "2Story"
model_train.dat$BsmtFullBath[model_train.dat$BsmtFullBath > 1] <- 1
model_train.dat$HalfBath[model_train.dat$HalfBath > 1] <- 1
model_train.dat$BedroomAbvGr[model_train.dat$BedroomAbvGr > 4] <- 4
model_train.dat$Fireplaces[model_train.dat$Fireplaces > 1] <- 1
model_train.dat$SaleType[model_train.dat$SaleType %in% c("Con", "ConLD", "ConLI", "ConLW", "CWD", "Oth")] <- "WD"
model_test.dat$BldgType[!(model_test.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
model_test.dat$HouseStyle[model_test.dat$HouseStyle %in% c("2.5Fin", "2.5Unf")] <- "2Story"
model_test.dat$BsmtFullBath[model_test.dat$BsmtFullBath > 1] <- 1
model_test.dat$HalfBath[model_test.dat$HalfBath > 1] <- 1
model_test.dat$BedroomAbvGr[model_test.dat$BedroomAbvGr > 4] <- 4
model_test.dat$Fireplaces[model_test.dat$Fireplaces > 1] <- 1
model_test.dat$SaleType[model_test.dat$SaleType %in% c("Con", "ConLD", "ConLI", "ConLW", "CWD", "Oth")] <- "WD"
model_train_210515.lm <- lm(SalePrice ~ LotFrontage + LotArea + LotConfig + Neighborhood + BldgType + HouseStyle + OverallQual + OverallCond + YearBuilt + YearRemodAdd + Exterior1st + BsmtFinSF1 + `1stFlrSF` + `2ndFlrSF` + BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenQual + Fireplaces + GarageArea + SaleCondition, data = model_train.dat)
summary(model_train_210515.lm)
##
## Call:
## lm(formula = SalePrice ~ LotFrontage + LotArea + LotConfig +
## Neighborhood + BldgType + HouseStyle + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + Exterior1st + BsmtFinSF1 + `1stFlrSF` +
## `2ndFlrSF` + BsmtFullBath + FullBath + HalfBath + BedroomAbvGr +
## KitchenQual + Fireplaces + GarageArea + SaleCondition, data = model_train.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -377481 -12473 -565 11460 241977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.986e+05 2.171e+05 -2.297 0.021805 *
## LotFrontage 5.335e+00 3.320e+01 0.161 0.872351
## LotArea 4.139e-01 1.098e-01 3.770 0.000173 ***
## LotConfigCulDSac 1.571e+04 4.844e+03 3.242 0.001224 **
## LotConfigFR2 -8.683e+03 6.581e+03 -1.319 0.187311
## LotConfigFR3 -2.728e+04 1.693e+04 -1.612 0.107324
## LotConfigInside 2.006e+03 2.756e+03 0.728 0.466914
## NeighborhoodBlueste -2.828e+04 3.436e+04 -0.823 0.410699
## NeighborhoodBrDale -6.279e+03 1.393e+04 -0.451 0.652261
## NeighborhoodBrkSide -1.001e+04 1.200e+04 -0.834 0.404238
## NeighborhoodClearCr -9.505e+02 1.253e+04 -0.076 0.939564
## NeighborhoodCollgCr -9.869e+03 9.913e+03 -0.996 0.319696
## NeighborhoodCrawfor 9.097e+03 1.159e+04 0.785 0.432817
## NeighborhoodEdwards -3.111e+04 1.079e+04 -2.882 0.004030 **
## NeighborhoodGilbert -1.132e+04 1.050e+04 -1.078 0.281330
## NeighborhoodIDOTRR -2.410e+04 1.299e+04 -1.855 0.063882 .
## NeighborhoodMeadowV -2.517e+04 1.528e+04 -1.648 0.099727 .
## NeighborhoodMitchel -1.715e+04 1.139e+04 -1.505 0.132520
## NeighborhoodNAmes -2.182e+04 1.050e+04 -2.077 0.038031 *
## NeighborhoodNoRidge 5.417e+04 1.161e+04 4.664 3.51e-06 ***
## NeighborhoodNPkVill -1.536e+03 1.632e+04 -0.094 0.925038
## NeighborhoodNridgHt 3.665e+04 1.040e+04 3.526 0.000441 ***
## NeighborhoodNWAmes -2.340e+04 1.089e+04 -2.149 0.031856 *
## NeighborhoodOldTown -2.416e+04 1.145e+04 -2.109 0.035150 *
## NeighborhoodSawyer -2.220e+04 1.108e+04 -2.004 0.045345 *
## NeighborhoodSawyerW -9.877e+03 1.079e+04 -0.916 0.360011
## NeighborhoodSomerst 4.431e+03 1.011e+04 0.438 0.661246
## NeighborhoodStoneBr 4.165e+04 1.173e+04 3.551 0.000401 ***
## NeighborhoodSWISU -2.292e+04 1.360e+04 -1.685 0.092311 .
## NeighborhoodTimber 1.731e+03 1.120e+04 0.155 0.877213
## NeighborhoodVeenker 9.588e+03 1.490e+04 0.644 0.520026
## BldgTypeNot1Fam -2.073e+04 3.523e+03 -5.885 5.37e-09 ***
## HouseStyle1.5Unf 8.444e+03 1.164e+04 0.725 0.468519
## HouseStyle1Story 2.359e+04 5.303e+03 4.448 9.61e-06 ***
## HouseStyle2Story -8.355e+03 4.714e+03 -1.772 0.076608 .
## HouseStyleSFoyer 1.769e+04 8.201e+03 2.157 0.031221 *
## HouseStyleSLvl 1.168e+04 6.577e+03 1.776 0.076090 .
## OverallQual 1.354e+04 1.391e+03 9.738 < 2e-16 ***
## OverallCond 5.566e+03 1.186e+03 4.691 3.09e-06 ***
## YearBuilt 2.326e+02 9.175e+01 2.535 0.011400 *
## YearRemodAdd 1.507e+01 8.267e+01 0.182 0.855408
## Exterior1stBrkFace 1.042e+04 1.039e+04 1.003 0.316225
## Exterior1stCemntBd -1.593e+03 1.081e+04 -0.147 0.882898
## Exterior1stHdBoard -1.091e+04 9.239e+03 -1.181 0.237771
## Exterior1stMetalSd -3.749e+03 8.874e+03 -0.423 0.672730
## Exterior1stPlywood -9.560e+03 9.819e+03 -0.974 0.330472
## Exterior1stStucco -3.060e+04 1.144e+04 -2.674 0.007616 **
## Exterior1stVinylSd -6.880e+03 9.057e+03 -0.760 0.447680
## Exterior1stWd Sdng -3.083e+03 8.938e+03 -0.345 0.730181
## Exterior1stWdShing -8.416e+03 1.146e+04 -0.735 0.462768
## BsmtFinSF1 1.052e+01 3.218e+00 3.269 0.001117 **
## `1stFlrSF` 4.402e+01 4.920e+00 8.947 < 2e-16 ***
## `2ndFlrSF` 6.543e+01 6.827e+00 9.583 < 2e-16 ***
## BsmtFullBath 1.353e+04 2.713e+03 4.988 7.16e-07 ***
## FullBath 9.908e+03 3.314e+03 2.989 0.002861 **
## HalfBath 5.066e+03 3.270e+03 1.549 0.121610
## BedroomAbvGr -1.114e+03 1.894e+03 -0.588 0.556722
## KitchenQualFa -3.601e+04 9.293e+03 -3.875 0.000113 ***
## KitchenQualGd -4.129e+04 4.919e+03 -8.394 < 2e-16 ***
## KitchenQualTA -4.116e+04 5.805e+03 -7.091 2.48e-12 ***
## Fireplaces 5.411e+03 2.541e+03 2.130 0.033413 *
## GarageArea 1.937e+01 6.627e+00 2.923 0.003547 **
## SaleConditionAdjLand 2.284e+04 2.451e+04 0.932 0.351565
## SaleConditionAlloca 1.402e+04 1.285e+04 1.091 0.275664
## SaleConditionFamily 1.519e+03 9.514e+03 0.160 0.873183
## SaleConditionNormal 7.447e+03 4.128e+03 1.804 0.071516 .
## SaleConditionPartial 2.155e+04 5.824e+03 3.700 0.000227 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32700 on 1028 degrees of freedom
## Multiple R-squared: 0.8517, Adjusted R-squared: 0.8422
## F-statistic: 89.45 on 66 and 1028 DF, p-value: < 2.2e-16
rsquare(model_train_210515.lm, model_test.dat)
## [1] 0.8757467
The simplified model performs much better on the test data.
ggplot(model_train.slm, aes(sample = .resid)) +
geom_qq() +
geom_qq_line()
The QQ plot indicates that there are some important deviations from the linear model at the tails of the Sale Price distribution. One avenue to pursue if I were to continue this investigation might be to have separate models for predicting low, medium, and high-price houses. The first step of the model would be to predict the range of the house price, and then a separate multiple regression model for each group would be used to predict the final price.
ggplot(data = model_train.slm, aes(x = .fitted, y = .resid)) +
geom_point()
Again, we see problems at the higher values of SalePrice. The model systematically underestimates the prices of the most expensive houses in Ames.
predict.dat <- predict.dat %>%
dplyr::select(LotFrontage, LotArea, LotConfig, Neighborhood, BldgType, HouseStyle, OverallQual, OverallCond, YearBuilt, YearRemodAdd, Exterior1st, BsmtFinSF1, `1stFlrSF`, `2ndFlrSF`, BsmtFullBath, FullBath, HalfBath, BedroomAbvGr, KitchenQual, Fireplaces, GarageArea, SaleType, SaleCondition)
predict.dat$LotFrontage[is.na(predict.dat$LotFrontage)] <- 0
predict.dat$Exterior1st[predict.dat$Exterior1st %in% c("AsphShn","BrkComm","CBlock","ImStucc","Stone")] <- "VinylSd"
predict.dat$Exterior1st[is.na(predict.dat$Exterior1st)] <- "VinylSd"
predict.dat$BsmtFinSF1[is.na(predict.dat$BsmtFinSF1)] <- 0
predict.dat$BsmtFullBath[is.na(predict.dat$BsmtFullBath)] <- 0
predict.dat$KitchenQual[is.na(predict.dat$KitchenQual)] <- "TA"
predict.dat$GarageArea[is.na(predict.dat$GarageArea)] <- 0
predict.dat$SaleType[is.na(predict.dat$SaleType)] <- "WD"
sum(is.na(predict.dat))
## [1] 0
predict.dat$BldgType[!(predict.dat$BldgType %in% c("1Fam"))] <- "Not1Fam"
predict.dat$HouseStyle[predict.dat$HouseStyle %in% c("2.5Fin", "2.5Unf")] <- "2Story"
predict.dat$BsmtFullBath[predict.dat$BsmtFullBath > 1] <- 1
predict.dat$HalfBath[predict.dat$HalfBath > 1] <- 1
predict.dat$BedroomAbvGr[predict.dat$BedroomAbvGr > 4] <- 4
predict.dat$Fireplaces[predict.dat$Fireplaces > 1] <- 1
predict.dat$SaleType[predict.dat$SaleType %in% c("Con", "ConLD", "ConLI", "ConLW", "CWD", "Oth")] <- "WD"
preds <- predict.lm(model_train_210515.lm, predict.dat, type = "response")
submit <- data.frame("Id" = seq(from = 1461, to = 2919), "SalePrice" = preds)
write_csv(submit, "C:/Users/dmosc/OneDrive/Desktop/y.csv")
This model earned a score of 0.15828. My user name is Daniel Moscoe.