Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.
library(tidyverse)
library(matrixcalc)
library(scales)
library(psych)
library(MASS)
library(kableExtra)
library(GGally)
library(fastDummies)set.seed(605)
N <- 9
n <- 10000
X <- runif(n, 1, N)
mean <- (N+1)/2
sd <- (N+1)/2
Y <- rnorm(n, mean=mean, sd=sd)
Recall for independent events, \(P(X \ and \ Y) = P(X)P(Y)\) and for dependent events \(P(X \ and \ Y) = P(X)P(Y \ | \ X)\)
## [1] 5.03
## 25%
## 1.65
\(P(X>x \ | \ X>y) = \frac{P(X>x \ and \ X>y)}{P(X>y)}\) where the numerator simplifies to \(P(X>x)\)
## [1] 0.547
This is a joint probability.
## [1] 0.375
Similarly to part a, we are looking for \(P(X<x \ | \ X>y) = \frac{P(X<x \ and \ X>y)}{P(X>y)} = \frac{P(y<X<x)}{P(X>y)}\).
## [1] 0.453
m <- matrix(c(length(X[X<x])/n*length(X[Y<y])/n,
length(X[X>x])/n*length(X[Y<y])/n,
length(X[X<x])/n*length(X[Y>y])/n,
length(X[X>x])/n*length(X[Y>y])/n),
ncol=2,byrow=TRUE)
colnames(m) <- c("P(X<x)","P(X>x)")
rownames(m) <- c("P(Y<y)","P(Y>y)")| P(X<x) | P(X>x) | |
|---|---|---|
| P(Y<y) | 0.125 | 0.125 |
| P(Y>y) | 0.375 | 0.375 |
Marginal probability of X:
## P(X<x) P(X>x)
## 0.5 0.5
Marginal probability of Y:
## P(Y<y) P(Y>y)
## 0.25 0.75
If we compare the product of the two marginal probabilities P(X>x)P(Y>y) = 0.5*0.75 = 0.375 we get the same result as the joint probability in part b (also shown in bottom right of table above). The statement holds.
We see that from both test results below, we obtain a very high p-value (1) so we conclude that independence holds. Fisher’s test is exact in its accuracy, and is usually used for small sample sizes summarized in 2x2 contingency tables. The Chi-squared test is used for large sample sizes of arbitraty dimensions. Since we are working with a 2x2 contingency table, Fisher’s test is appriopriate, but the chi-square might be preffered because of the large sample size. Given that the random variables were generated independently, it should not be surprising that both results hold.
##
## Fisher's Exact Test for Count Data
##
## data: m * n
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.912 1.096
## sample estimates:
## odds ratio
## 1
##
## Pearson's Chi-squared test
##
## data: m * n
## X-squared = 0, df = 1, p-value = 1
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques.
Load data set:
## Id MSSubClass MSZoning LotFrontage
## Min. : 1 Min. : 20.0 Length:1460 Min. : 21
## 1st Qu.: 366 1st Qu.: 20.0 Class :character 1st Qu.: 59
## Median : 730 Median : 50.0 Mode :character Median : 69
## Mean : 730 Mean : 56.9 Mean : 70
## 3rd Qu.:1095 3rd Qu.: 70.0 3rd Qu.: 80
## Max. :1460 Max. :190.0 Max. :313
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## LandSlope Neighborhood Condition1
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Condition2 BldgType HouseStyle OverallQual
## Length:1460 Length:1460 Length:1460 Min. : 1.0
## Class :character Class :character Class :character 1st Qu.: 5.0
## Mode :character Mode :character Mode :character Median : 6.0
## Mean : 6.1
## 3rd Qu.: 7.0
## Max. :10.0
##
## OverallCond YearBuilt YearRemodAdd RoofStyle
## Min. :1.00 Min. :1872 Min. :1950 Length:1460
## 1st Qu.:5.00 1st Qu.:1954 1st Qu.:1967 Class :character
## Median :5.00 Median :1973 Median :1994 Mode :character
## Mean :5.58 Mean :1971 Mean :1985
## 3rd Qu.:6.00 3rd Qu.:2000 3rd Qu.:2004
## Max. :9.00 Max. :2010 Max. :2010
##
## RoofMatl Exterior1st Exterior2nd
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## MasVnrType MasVnrArea ExterQual ExterCond
## Length:1460 Min. : 0 Length:1460 Length:1460
## Class :character 1st Qu.: 0 Class :character Class :character
## Mode :character Median : 0 Mode :character Mode :character
## Mean : 104
## 3rd Qu.: 166
## Max. :1600
## NA's :8
## Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0 Length:1460
## Class :character Class :character 1st Qu.: 0 Class :character
## Mode :character Mode :character Median : 384 Mode :character
## Mean : 444
## 3rd Qu.: 712
## Max. :5644
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0 Min. : 0 Min. : 0 Length:1460
## 1st Qu.: 0 1st Qu.: 223 1st Qu.: 796 Class :character
## Median : 0 Median : 478 Median : 992 Mode :character
## Mean : 47 Mean : 567 Mean :1057
## 3rd Qu.: 0 3rd Qu.: 808 3rd Qu.:1298
## Max. :1474 Max. :2336 Max. :6110
##
## HeatingQC CentralAir Electrical 1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0 Min. : 334 Min. :0.000
## 1st Qu.: 0 1st Qu.: 0 1st Qu.:1130 1st Qu.:0.000
## Median : 0 Median : 0 Median :1464 Median :0.000
## Mean : 347 Mean : 6 Mean :1515 Mean :0.425
## 3rd Qu.: 728 3rd Qu.: 0 3rd Qu.:1777 3rd Qu.:1.000
## Max. :2065 Max. :572 Max. :5642 Max. :3.000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.000 Min. :0.00 Min. :0.000 Min. :0.00
## 1st Qu.:0.000 1st Qu.:1.00 1st Qu.:0.000 1st Qu.:2.00
## Median :0.000 Median :2.00 Median :0.000 Median :3.00
## Mean :0.058 Mean :1.56 Mean :0.383 Mean :2.87
## 3rd Qu.:0.000 3rd Qu.:2.00 3rd Qu.:1.000 3rd Qu.:3.00
## Max. :2.000 Max. :3.00 Max. :2.000 Max. :8.00
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.00 Length:1460 Min. : 2.00 Length:1460
## 1st Qu.:1.00 Class :character 1st Qu.: 5.00 Class :character
## Median :1.00 Mode :character Median : 6.00 Mode :character
## Mean :1.05 Mean : 6.52
## 3rd Qu.:1.00 3rd Qu.: 7.00
## Max. :3.00 Max. :14.00
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.00 Min. : 0 Length:1460
## Class :character 1st Qu.:1.00 1st Qu.: 334 Class :character
## Mode :character Median :2.00 Median : 480 Mode :character
## Mean :1.77 Mean : 473
## 3rd Qu.:2.00 3rd Qu.: 576
## Max. :4.00 Max. :1418
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0 Min. : 0
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0
## Mode :character Mode :character Median : 0 Median : 25
## Mean : 94 Mean : 47
## 3rd Qu.:168 3rd Qu.: 68
## Max. :857 Max. :547
##
## EnclosedPorch 3SsnPorch ScreenPorch PoolArea
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 0 Median : 0 Median : 0 Median : 0
## Mean : 22 Mean : 3 Mean : 15 Mean : 3
## 3rd Qu.: 0 3rd Qu.: 0 3rd Qu.: 0 3rd Qu.: 0
## Max. :552 Max. :508 Max. :480 Max. :738
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0
## Class :character Class :character Class :character 1st Qu.: 0
## Mode :character Mode :character Mode :character Median : 0
## Mean : 43
## 3rd Qu.: 0
## Max. :15500
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.00 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.00 1st Qu.:2007 Class :character Class :character
## Median : 6.00 Median :2008 Mode :character Mode :character
## Mean : 6.32 Mean :2008
## 3rd Qu.: 8.00 3rd Qu.:2009
## Max. :12.00 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
Let’s take a closer look at the dependent variable: SalePrice. We learn from the information below that the variable is skewed to the right with an average SalePrice of 180,921 and a standard deviation of 79,442.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
## vars n mean sd median trimmed mad min max range skew
## X1 1 1460 180921 79442 163000 170783 56339 34900 755000 720100 1.88
## kurtosis se
## X1 6.5 2079
ggplot(train, aes(x=SalePrice)) +
geom_histogram(color="black", fill="white") +
scale_x_continuous(breaks=seq(0,max(train$SalePrice),100000), labels=function(x) format(x, big.mark = ",",scientific = FALSE)) +
ggtitle('Dependent Variable: SalePrice')| LotArea | GrLivArea | GarageArea | |
|---|---|---|---|
| LotArea | 1.000 | 0.263 | 0.180 |
| GrLivArea | 0.263 | 1.000 | 0.469 |
| GarageArea | 0.180 | 0.469 | 1.000 |
##
## Pearson's product-moment correlation
##
## data: train$LotArea and train$GrLivArea
## t = 10, df = 1458, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.232 0.294
## sample estimates:
## cor
## 0.263
##
## Pearson's product-moment correlation
##
## data: train$LotArea and train$GarageArea
## t = 7, df = 1458, p-value = 4e-12
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.148 0.213
## sample estimates:
## cor
## 0.18
##
## Pearson's product-moment correlation
##
## data: train$GrLivArea and train$GarageArea
## t = 20, df = 1458, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.442 0.495
## sample estimates:
## cor
## 0.469
The analysis above concludes that the tested variables are not independent of each other and that there are statistically significant correlations between the variables, although the correlations are not strong. I would be concerned with familywise errors as the three variables are connected: LotArea is in part composed of GarageArea (if there is one) and GrLivingArea.
if(det(cor_matrix)>0) # Verify that the correlation matrix is invertible
pres_matrix <- solve(cor_matrix)
pres_matrix## LotArea GrLivArea GarageArea
## LotArea 1.0792 -0.247 -0.0789
## GrLivArea -0.2470 1.339 -0.5832
## GarageArea -0.0789 -0.583 1.2877
## LotArea GrLivArea GarageArea
## LotArea 1 0 0
## GrLivArea 0 1 0
## GarageArea 0 0 1
## LotArea GrLivArea GarageArea
## LotArea 1 0 0
## GrLivArea 0 1 0
## GarageArea 0 0 1
## [,1] [,2] [,3]
## [1,] 1.000 0.000 0
## [2,] 0.263 1.000 0
## [3,] 0.180 0.453 1
## [,1] [,2] [,3]
## [1,] 1 0.263 0.180
## [2,] 0 0.931 0.422
## [3,] 0 0.000 0.777
## [,1] [,2] [,3]
## [1,] 1.000 0.263 0.180
## [2,] 0.263 1.000 0.469
## [3,] 0.180 0.469 1.000
Many times, it makes sense to fit a closed form distribution to data.
The GrLivArea variable was selected. We can see from the summary that the min value is not 0, so there is no need to shift.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
Fit an exponential distribution to the data.
## rate
## 0.00066
Sample the exponential distribution generated from our rate estimate.
df_1 <- data.frame(obs = gr_liv_area)
df_1$data<- 'empirical'
df_2 <- data.frame(obs = sim_exp_dist)
df_2$data <- 'fit'
df <- rbind(df_1, df_2)
ggplot(data=df, aes(obs, fill=data)) +
geom_density(alpha = 0.4) +
ggtitle('Comparing Empirical and Fit Data')## [1] 77.7 4539.9
mu <- mean(gr_liv_area)
sd <- sd(gr_liv_area)
n <- length(gr_liv_area)
me <- qnorm(.95)*sd/sqrt(n)
ci <- c(mu-me,mu+me)
ci## [1] 1493 1538
## 5% 95%
## 848 2466
The comparative histogram and the mismatched percentiles shows that the exponential distribution is not a good fit for our data. However, we note that the mean of the exponential distribution which is the reciprocal of the rate we estimated above, does fit within the confidence interval for the mean. Aalso note that we usually refer to lambda as the mean of the exponential distribution and 1/lambda as the rate.
## rate
## 1515
## [1] "Id" "MSSubClass" "MSZoning" "LotFrontage"
## [5] "LotArea" "Street" "Alley" "LotShape"
## [9] "LandContour" "Utilities" "LotConfig" "LandSlope"
## [13] "Neighborhood" "Condition1" "Condition2" "BldgType"
## [17] "HouseStyle" "OverallQual" "OverallCond" "YearBuilt"
## [21] "YearRemodAdd" "RoofStyle" "RoofMatl" "Exterior1st"
## [25] "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
## [29] "ExterCond" "Foundation" "BsmtQual" "BsmtCond"
## [33] "BsmtExposure" "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
## [37] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "Heating"
## [41] "HeatingQC" "CentralAir" "Electrical" "1stFlrSF"
## [45] "2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [49] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [53] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
## [57] "Fireplaces" "FireplaceQu" "GarageType" "GarageYrBlt"
## [61] "GarageFinish" "GarageCars" "GarageArea" "GarageQual"
## [65] "GarageCond" "PavedDrive" "WoodDeckSF" "OpenPorchSF"
## [69] "EnclosedPorch" "3SsnPorch" "ScreenPorch" "PoolArea"
## [73] "PoolQC" "Fence" "MiscFeature" "MiscVal"
## [77] "MoSold" "YrSold" "SaleType" "SaleCondition"
## [81] "SalePrice"
Below we identify the features with missing values. Studying the variables names gives us idea of how we might deal with the NA values. For example, NA values for LotFrontage simply mean that the home is within the permiter of the property line and NA values can be replaced by zero. The feature Alley could be changed to a binary variable (alley, no alley). However, these are probably not the most important features so we decide to drop all of them. In general, we can remove the feature entirely if more than 15% of its observations are NA.
NAs <- train %>%
summarise_all(funs(sum(is.na(.)))) %>%
gather(key='Feature', value='num_NA') %>%
mutate(prop = num_NA/dim(train)[1]) %>%
filter(num_NA>0) %>%
arrange(desc(prop))| Feature | num_NA | prop |
|---|---|---|
| PoolQC | 1453 | 0.995 |
| MiscFeature | 1406 | 0.963 |
| Alley | 1369 | 0.938 |
| Fence | 1179 | 0.808 |
| FireplaceQu | 690 | 0.473 |
| LotFrontage | 259 | 0.177 |
| GarageType | 81 | 0.055 |
| GarageYrBlt | 81 | 0.055 |
| GarageFinish | 81 | 0.055 |
| GarageQual | 81 | 0.055 |
| GarageCond | 81 | 0.055 |
| BsmtExposure | 38 | 0.026 |
| BsmtFinType2 | 38 | 0.026 |
| BsmtQual | 37 | 0.025 |
| BsmtCond | 37 | 0.025 |
| BsmtFinType1 | 37 | 0.025 |
| MasVnrType | 8 | 0.005 |
| MasVnrArea | 8 | 0.005 |
| Electrical | 1 | 0.001 |
Studying the correlation matrix and histogram for quantitative values helps us select features of interest. From the most correlated variables, we select the following: OverallQual, TotalBsmtSF, GrLivArea, GarageArea, GarageCars. We also select LotArea and FullBath, which are more moderately correlated but intuitively have some importance.
pairs.panels(train[c('SalePrice','LotArea','OverallQual','OverallCond','GrLivArea','FullBath', 'GarageArea')],
method = "pearson", hist.col = "#00AFBB",density = TRUE, ellipses = TRUE)We proceed with the selection of interesting features and include a qualitative variable ’ExterQual that may bear some weight:
selected_features <- c('Id','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','GarageCars','LotArea','FullBath', 'ExterQual')
train_slim <- train %>% dplyr::select(selected_features)
train_slim$SalePrice <- train$SalePriceStudying the scatterplot matrix of the selected variables one more time, we can identify interesting outliers in the LotArea, GrLivArea and TotalBsmtAreaSF.
ggplot(train_slim, aes(LotArea, SalePrice)) +
geom_point(alpha=0.2) +
ggtitle('SalePrice vs LotArea')Upon closer look at LotArea, the outliers are significant and can perhaps be associated with much larger lots than differ from the typical home market. Given that the correlation between LotArea and SalePrice was actually quite low (0.26) we throw out this variable altogether and update our selected_features vector for future uses.
train_slim <- train_slim %>% dplyr::select(-LotArea)
selected_features <- c('Id','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','GarageCars','FullBath','ExterQual')ggplot(train_slim, aes(GrLivArea, SalePrice)) +
geom_point(alpha=0.2) +
ggtitle('SalePrice vs GrLivArea')The picture is clearer for GrLivArea, but the two rightmost point do not make sense as the observations appear to have too low a price for such area. We will throw out those two points by refering to their index.
ggplot(train_slim, aes(TotalBsmtSF, SalePrice)) +
geom_point(alpha=0.2) +
ggtitle('SalePrice vs TotalBsmtSF')There is one significant outlier nearly double the area of the closer observation. We remove this value as well (in fact it is the same Id=1299 that we decided to remove previously). Notice as well that there are a number of observations with TotalBsmtSF = 0. This is also the case for the GarageArea variable. We create two binary categorical variables to account for that:
train_slim <- train_slim %>% filter(Id != 1299)
train_slim <- train_slim %>% filter(Id != 524)
train_slim <- train_slim %>% mutate(HasBasement = ifelse(TotalBsmtSF > 0, 1, 0))
train_slim <- train_slim %>% mutate(HasGarage = ifelse(GarageArea > 0, 1, 0))The box plot below reveals that there are quite a few outliers when studying the ExterQual variable. However, we will limit our outlier elimination to the few observations above.
ggplot(train, aes(x=ExterQual, y=SalePrice)) +
geom_boxplot(outlier.colour="red", outlier.shape=1,
outlier.size=4) +
scale_y_continuous(breaks=seq(0,max(train$SalePrice),100000), labels=function(x) format(x, big.mark = ",",scientific = FALSE)) +
ggtitle('SalePrice vs ExterQual') Generate dummy variables for the ExterQual feature.
Train the model with the selected features and update it using backwards elimination until we are left with the most statistically significant features.
selected.lm <- lm(SalePrice ~ ., data=train_slim[,-1])
modified.lm <- update(selected.lm, .~. - ExterQual_TA, data=train_slim[-1])
modified.lm <- update(modified.lm, .~. - ExterQual_Fa, data=train_slim[-1])
modified.lm <- update(modified.lm, .~. - FullBath, data=train_slim[-1])
modified.lm <- update(modified.lm, .~. - HasGarage, data=train_slim[-1])
modified.lm <- update(modified.lm, .~. - GarageCars, data=train_slim[-1])
summary(modified.lm)##
## Call:
## lm(formula = SalePrice ~ OverallQual + TotalBsmtSF + GrLivArea +
## GarageArea + HasBasement + ExterQual_Ex + ExterQual_Gd, data = train_slim[-1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -144863 -18366 77 16840 256435
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -53194.19 6830.36 -7.79 1.3e-14 ***
## OverallQual 16917.21 1059.46 15.97 < 2e-16 ***
## TotalBsmtSF 44.54 2.81 15.82 < 2e-16 ***
## GrLivArea 53.64 2.16 24.86 < 2e-16 ***
## GarageArea 45.94 5.19 8.86 < 2e-16 ***
## HasBasement -27162.97 6138.79 -4.42 1.0e-05 ***
## ExterQual_Ex 78836.42 5816.29 13.55 < 2e-16 ***
## ExterQual_Gd 15580.21 2471.59 6.30 3.8e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32800 on 1450 degrees of freedom
## Multiple R-squared: 0.83, Adjusted R-squared: 0.83
## F-statistic: 1.01e+03 on 7 and 1450 DF, p-value: <2e-16
The model results are decent, with an adjusted R-squared of 83%. However, we must verify the assumptions for linear regression by studying the residuals.
While the residuals are mostly evenly distributed between the 1st and 3rd quartiles, there is significant skew as revealed by the min and max values. Visually, we can identify a outliers outliers influencial in the residual which makes us cautious about accepting this model.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -142358 -17494 -51 0 17367 254937
When testing for normality, both the significant deviation at the tail ends of the plot below and the result of the Shapiro-Wilk test (H0: normal, HA: not normal) causes us to further question the validity of this model. The density plot comparison below clearly shows the deviation from normal.
##
## Shapiro-Wilk normality test
##
## data: selected.lm$residuals
## W = 0.9, p-value <2e-16
res <- selected.lm$residuals
df_1 <- data.frame(residuals = res)
df_1$data<- 'model'
df_2 <- data.frame(residuals = rnorm(length(res),mean(res),sd(res)))
df_2$data <- 'normal'
df <- rbind(df_1, df_2)
ggplot(data=df, aes(residuals, fill=data)) +
geom_density(alpha = 0.4) +
ggtitle('Residual Comparison')Despite a potentially invalid model, we try to predict the SalePrice of the test data using the model. As was done for the training data, we read the data, extract the selected features from the test data, deal with any missing values and add the categorical variables.
test_slim <- test %>% dplyr::select(selected_features)
test_slim[is.na(test_slim)] <- 0
test_slim <- test_slim %>% mutate(HasGarage = ifelse(GarageArea > 0, 1, 0))
test_slim <- test_slim %>% mutate(HasBasement = ifelse(TotalBsmtSF > 0, 1, 0))
test_slim <- fastDummies::dummy_cols(test_slim, select_columns = "ExterQual")
test_slim <- test_slim %>% dplyr::select(-ExterQual)Prepare the output for submission:
My Kaggle username is maelillien and my best prediction score is 0.21558 which is not great compared to the leaderboard, but it is a 795 spot improvement from my initial attempt which had a score of 0.93893. The current model score is 0.22516 which is slightly worse than the best attempt but this current model shows more feature engineering. The first attempt did not include the qualitative variables.
While the model yields decent prediction results, we saw that the underlying assumptions of the linear model are not fully met. To correct that and improve the model, there are a number of methods to consider: