Website: https://www.kaggle.com/c/house-prices-advanced-regression-techniques provides a dataset of 1,460 residential homes in Ames, Iowa, described by 79 explanatory variables covering nearly every aspect of each property; the task is to predict the final sale price of each home.
# read in training data, use ID column as row names
train <- read.csv('train.csv', row.names = 1)
head(train)
## MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 60 RL 65 8450 Pave <NA> Reg
## 2 20 RL 80 9600 Pave <NA> Reg
## 3 60 RL 68 11250 Pave <NA> IR1
## 4 70 RL 60 9550 Pave <NA> IR1
## 5 60 RL 84 14260 Pave <NA> IR1
## 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
Final sale price (in dollars) is defined as the dependent variable \(Y\) and the above grade living area (in square feet) is defined as the independent variable \(X\).
# Choose above grade living area as my quantitative independent variable
X <- train$GrLivArea
# Choose Sale Price as my dependent variable as per the Kaggle Competition
Y <- train$SalePrice
# Let's get the median and mean to show it is skewed to the right
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
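The mean (1515) exceeds the median (1464), which already suggests a right-skewed distribution.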
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
# Plot above grade living area to visually show it is skewed to the right
ggplot(data = train) + geom_density(aes(x=GrLivArea), fill="blue") + labs(x="Above Grade Living Area") + ggtitle("Above Grade Living Area Density") + theme_light()
# We have the quartiles for X above; here we get them for Y
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 130000 163000 180900 214000 755000
“x” is estimated as the 1st quartile of the X variable: x = 1130
“y” is estimated as the 2nd quartile (median) of the Y variable: y = 163,000
# count the number of rows with SalePrice > 163,000
sum(Y > 163000)
## [1] 728
# Total number of rows
length(Y)
## [1] 1460
length(X)
## [1] 1460
# Rows where both Y > y and X > x
sum(Y > 163000 & X > 1130)
## [1] 720
Q1x <- quantile(X,0.25)
Q1x
## 25%
## 1129.5
Q2y <- quantile(Y,0.5)
Q2y
## 50%
## 163000
# The probability P(X>x | Y>y)
(sum(Y > Q2y & X > Q1x) / length(X)) / (sum(Y > Q2y) / length(Y))
## [1] 0.989011
# The probability P(X>x and Y>y)
sum(X > Q1x & Y > Q2y) / length(X)
## [1] 0.4931507
# The probability P(X<x | Y>y)
(sum(Y > Q2y & X < Q1x) / length(X)) / (sum(Y > Q2y) / length(Y))
## [1] 0.01098901
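The same probabilities can be read off a 2×2 quadrant table of counts; a minimal sketch using the cutpoints computed above:
# quadrant counts relative to the cutpoints x = Q1x and y = Q2y
quad <- table(X > Q1x, Y > Q2y, dnn = c("X > x", "Y > y"))
quad
# P(X>x | Y>y) as the joint count over the Y>y column total: 720/728 = 0.989, as above
quad["TRUE", "TRUE"] / sum(quad[, "TRUE"])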
Does splitting the training data in this fashion make them independent? In other words, does \(P(X \mid Y) = P(X)\), equivalently \(P(X \cap Y) = P(X)\,P(Y)\)? Check mathematically, and then evaluate by running a Chi Square test for association. You might have to research this.
# P(X) = P(above the 1st quartile for X)
px <- sum(X > Q1x) / length(X)
px
## [1] 0.75
# P(Y) = P(above the 2nd quartile for Y)
py <- sum(Y > Q2y) / length(Y)
py
## [1] 0.4986301
# P(X)*P(Y)
pxy <- px*py
pxy
## [1] 0.3739726
# P(X|Y) = P(Y and X) / P(Y)
pxxyy <- (sum(X > Q1x & Y > Q2y) / length(X)) / py
pxxyy
## [1] 0.989011
Since \(P(X>x \mid Y>y) = 0.989 \ne 0.75 = P(X>x)\), and equivalently \(P(X>x \cap Y>y) = 0.493 \ne 0.374 = P(X>x)\,P(Y>y)\), my variables are not independent, so splitting the data in this fashion does not make them independent.
The Chi Square test evaluates the hypothesis that above grade living area (in square feet) is independent of the sale price. We could debate a 5% versus 1% significance level, but in this case that's not necessary: with a p-value well under 1%, we reject the null hypothesis that the above grade living area is independent of the sale price of a house.
if (!require(MASS)) install.packages("MASS")
## Loading required package: MASS
library(MASS)
tbl <- table(train$GrLivArea > Q1x, train$SalePrice > Q2y)
chisq.test(tbl)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl
## X-squared = 439.85, df = 1, p-value < 2.2e-16
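The chi-square statistic compares the observed quadrant counts against those expected under independence (row total × column total / n); the expected counts can be inspected directly from the test object:
# counts expected if the GrLivArea and SalePrice splits were independent
chisq.test(tbl)$expected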
Provide univariate descriptive statistics and appropriate plots for both variables. Provide a scatterplot of X and Y. Transform both variables simultaneously using Box-Cox transformations. You might have to research this.
Running dim() and str() on the training data set shows that we have 1,460 observations and 80 variables.
Summary statistics for \(X\) and \(Y\) are provided in the table below:
library(pander)
## Warning: package 'pander' was built under R version 3.3.3
# prepare summary table supplemented with standard deviation
sum_tbl <- rbind(c(summary(X), 'Std. Dev.'=round(sd(X), 0)),
c(summary(Y), 'Std. Dev.'=round(sd(Y), 0)))
row.names(sum_tbl) <- c('X', 'Y')
pander(sum_tbl)
|   | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | Std. Dev. |
|---|---|---|---|---|---|---|---|
| X | 334 | 1130 | 1464 | 1515 | 1777 | 5642 | 525 |
| Y | 34900 | 130000 | 163000 | 180900 | 214000 | 755000 | 79443 |
The plots below show the marginal distributions and the joint distribution of the two variables.
library(ggplot2)
library(scales)
# scatter of X & Y
pscat <- ggplot(data.frame(X, Y), aes(X, Y / 1000)) +
geom_point(alpha = 0.25) +
scale_x_continuous('Above Grade Living Area [square feet]') +
scale_y_continuous('Sale Price [thousands]', labels = dollar)
# histogram of X
px <- ggplot(data.frame(X), aes(X, ..count.. / sum(..count..))) +
geom_histogram(alpha = 0.5, col = 'black', binwidth = 200) +
scale_x_continuous(name = NULL, labels = NULL) +
scale_y_continuous(name = '', labels = percent)
# histogram of Y
py <- ggplot(data.frame(Y), aes(Y, ..count.. / sum(..count..))) +
geom_histogram(alpha = 0.5, col = 'black', binwidth = 25000) +
scale_x_continuous(name = NULL, labels = NULL) +
scale_y_continuous(name = '', labels = percent) +
coord_flip()
# combined plots
library(grid)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.3.3
grid.arrange(px, nullGrob(), pscat, py,
nrow=2, widths = c(0.8, 0.2), heights = c(0.2, 0.8),
top = 'Histograms and Scatterplot of X & Y Variables')
Below is a scatterplot of above grade living area (in square feet) versus sale price. I overlaid the regression line; the lighter band around the line is the 95% confidence interval.
# Plot above grade living area versus sale price and overlay a regression line
ggplot(train, aes(x = GrLivArea, y = SalePrice)) + geom_point(color="red") + labs(x="Above Grade Living Area", y = "Sale Price") + ggtitle("Above Grade Living Area vs. Sale Price") + geom_smooth(method = "lm", se = TRUE) + theme_light()
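The prompt also asks for a simultaneous Box-Cox transformation of both variables. MASS::boxcox handles only a single response, so the sketch below assumes the car package, whose powerTransform estimates the two power parameters jointly:
# jointly estimate Box-Cox lambdas for X and Y (assumes the car package is available)
library(car)
bc <- powerTransform(cbind(X, Y) ~ 1)
summary(bc)                              # estimated lambda for each variable
XY_bc <- bcPower(cbind(X, Y), coef(bc))  # apply the estimated transformation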
Provide a 95% CI for the difference in the mean of the variables.
The Welch t-test below shows, with 95% confidence, that the difference in the means of Y and X lies between 175.3K and 183.5K.
# Run the T test
t.test(Y, X)
##
## Welch Two Sample t-test
##
## data: Y and X
## t = 86.288, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 175327.3 183484.2
## sample estimates:
## mean of x mean of y
## 180921.196 1515.464
# Welch's t-test above assumes each group is approximately normal (or n is large) but does not assume equal variances; the paired test below assumes the per-house differences are approximately normal
diffmean <- t.test(X, Y, paired = TRUE)$conf.int
diffmean
## [1] -183465.0 -175346.4
## attr(,"conf.level")
## [1] 0.95
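Because X and Y are paired by house, the paired interval above is identical to a one-sample t-interval on the per-house differences:
# one-sample t-interval on the differences reproduces the paired CI above
t.test(X - Y)$conf.int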
Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval.
# Setup and run correlation test
data <- data.frame(X, Y)
cm <- cor(data)
cm
## X Y
## X 1.0000000 0.7086245
## Y 0.7086245 1.0000000
cor.test(Y, X, conf.level = 0.99)
##
## Pearson's product-moment correlation
##
## data: Y and X
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.6733974 0.7406408
## sample estimates:
## cor
## 0.7086245
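As a cross-check, the 99% interval can be derived by hand with the Fisher z-transformation \(z = \operatorname{atanh}(r)\), which is approximately normal with standard error \(1/\sqrt{n-3}\); this is the same construction cor.test uses:
r <- cor(X, Y)
n <- length(X)
z <- atanh(r)                 # Fisher z-transform of the sample correlation
se <- 1 / sqrt(n - 3)
crit <- qnorm(0.995)          # two-sided 99% critical value
# back-transform to the correlation scale; should match 0.6734 to 0.7406 above
tanh(c(z - crit * se, z + crit * se))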
The Pearson correlation is 0.7086. The Pearson correlation coefficient ranges from -1 to +1: a value of 0 indicates no linear association between the two variables, and a value greater than 0 indicates a positive association, i.e., as one variable increases, so does the other. This result is a strong indication of a correlation between above grade living area and sale price.
Similarly, the t-test for the difference in the means also indicates an association between above grade living area and sale price, because the confidence interval for the true difference in means excludes 0.
Using at least three untransformed variables, build a correlation matrix. Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
Assuming our square correlation matrix is invertible, the process outlined above should yield the identity matrix, since \(AA^{-1} = A^{-1}A = I\).
# Correlation matrix
cm
## X Y
## X 1.0000000 0.7086245
## Y 0.7086245 1.0000000
# Invert the correlation matrix to get the precision matrix
pm <- solve(cm)
pm
## X Y
## X 2.008632 -1.423366
## Y -1.423366 2.008632
# Multiply the correlation matrix by the precision matrix
# A %*% A^-1
AAn1 <- cm%*%pm
AAn1
## X Y
## X 1 0
## Y 0 1
# Multiply the precision matrix by the correlation matrix
# A^-1 %*% A
An1A <- pm %*% cm
An1A
## X Y
## X 1 0
## Y 0 1
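The prompt asks for at least three untransformed variables; the same computation extends directly. A sketch choosing OverallQual as the third variable:
# correlation and precision matrices for three numeric variables
cm3 <- cor(train[, c("GrLivArea", "OverallQual", "SalePrice")])
pm3 <- solve(cm3)         # precision matrix; diagonal entries are the VIFs
round(cm3 %*% pm3, 10)    # identity, up to floating-point error
round(pm3 %*% cm3, 10)    # identity again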
Many times, it makes sense to fit a closed form distribution to data. For your non-transformed independent variable, location shift it (if necessary) so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit a density function of your choice.
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable.
# min(GrLivArea) is 334, already above zero, so no location shift is needed; run fitdistr to fit an exponential probability density function
ep <- fitdistr(train$GrLivArea, "exponential")
ep
## rate
## 6.598640e-04
## (1.726943e-05)
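This fitted rate is simply the maximum-likelihood estimate for an exponential, \(\hat{\lambda} = 1/\bar{x} = 1/1515.46 \approx 6.599 \times 10^{-4}\), matching the mean of GrLivArea reported earlier.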
# Take 1000 samples from the exponential distribution; set.seed keeps the percentiles below reproducible
set.seed(1)
esamp <- rexp(1000, 6.598640e-04)  # rate taken from fitdistr above
# Plot a histogram and compare it with a histogram of your original variable
hist(esamp)
hist(train$GrLivArea)
Using the empirical quantiles (the inverse of the CDF) of the simulated sample to find the 5th and 95th percentiles:
quantile(esamp, probs=c(.05,.95))
## 5% 95%
## 78.87384 4481.40505
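For comparison, the theoretical percentiles of the fitted exponential follow from the closed-form inverse CDF \(F^{-1}(p) = -\ln(1-p)/\lambda\):
# theoretical 5th and 95th percentiles of the fitted exponential
qexp(c(0.05, 0.95), rate = ep$estimate)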
To generate a confidence interval for the mean from the empirical data: CI = sample mean \(\pm z \times\) standard error. Note that qnorm(0.95) = 1.645 below gives a two-sided 90% interval (5% in each tail); the usual two-sided 95% interval would use qnorm(0.975) = 1.96.
lower or left side of the interval:
# The lower or left side of the interval for GrLivArea = mean - error
mean(train$GrLivArea) - qnorm(0.95) * sd(train$GrLivArea) / sqrt(length(train$GrLivArea))
## [1] 1492.843
# The upper or right side of the interval for GrLivArea = mean + error
mean(train$GrLivArea) + qnorm(0.95) * sd(train$GrLivArea) / sqrt(length(train$GrLivArea))
## [1] 1538.084
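A standard two-sided 95% interval for the mean can also be read directly from a one-sample t-test, which uses the t rather than the normal critical value:
# two-sided 95% confidence interval for the mean of GrLivArea
t.test(train$GrLivArea)$conf.int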
# Find the empirical 5th percentile and 95th percentile of the data
quantile(train$GrLivArea, probs=c(.05,.95))
## 5% 95%
## 848.0 2466.1
For above grade living area, the interval for the mean is 1492.8 < mean(X) < 1538.1.
The empirical 5th percentile is 848 and the 95th percentile is 2466.
To predict the sale price of houses from the characteristics of each house, a regression model is constructed and its predictions are submitted to the Kaggle competition.
The model considers only the numeric variables examined in the previous section. The leaps package is used to find the subset of predictors that minimizes the Bayesian Information Criterion (BIC), as shown below.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# first variable is categorical values stored by numeric key; convert to factor
train$MSSubClass <- factor(train$MSSubClass)
# impute LotFrontage and MasVnrArea with zero
train$LotFrontage[is.na(train$LotFrontage)] <- 0
train$MasVnrArea[is.na(train$MasVnrArea)] <- 0
# impute GarageYrBlt as YearBuilt plus the median gap between GarageYrBlt and YearBuilt
train$GarageYrBlt[is.na(train$GarageYrBlt)] <-
train$YearBuilt[is.na(train$GarageYrBlt)] +
median(train$GarageYrBlt - train$YearBuilt, na.rm = TRUE)
# gather all explanatory variables and exclude non-numeric variables
pca_data <- train[, sapply(train, is.numeric)] %>% select(-SalePrice)
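A quick sanity check that the imputations above left no missing values among the numeric predictors:
# should be zero after imputing LotFrontage, MasVnrArea, and GarageYrBlt
sum(is.na(pca_data))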
# combine the numeric explanatory variables with the target variable
model_data <- cbind(pca_data, SalePrice = train$SalePrice)
# use regsubsets to get best subsets per number of variables
library(leaps)
## Warning: package 'leaps' was built under R version 3.3.3
regfit <- regsubsets(SalePrice~., data = model_data, nvmax = 20)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 2 linear dependencies found
## Reordering variables and trying again:
reg.summary <- summary(regfit)
# plot variables vs BIC & BIC vs # variables
par(mfrow = c(1, 2), mar=c(1,1,1,1))
plot(regfit, scale = "bic", main = "Predictor Variables vs. BIC")
plot(reg.summary$bic, xlab = "Number of Predictors", ylab = "BIC",
type = "l", main = "Best Subset Selection Using BIC")
# annotate lowest BIC
minbic <- which.min(reg.summary$bic)
points(minbic, reg.summary$bic[minbic], col = "brown", cex = 2, pch = 20)
The parameters of this model are presented below:
# build the model formula from the 15 predictors in the minimum-BIC subset
usedvars <- names(coef(regfit, minbic))[-1]
BIC_formula <- as.formula(paste("SalePrice ~", paste(usedvars, collapse = " + ")))
# fit the selected model (avoids eval(parse()) on a pasted string)
bestsubset <- glm(BIC_formula, data = model_data)
# print model summary
pander(summary(bestsubset))
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -1066852 | 113116 | -9.431 | 1.553e-20 |
| LotArea | 0.6162 | 0.1077 | 5.721 | 1.286e-08 |
| OverallQual | 24431 | 1123 | 21.75 | 5.986e-91 |
| OverallCond | 5557 | 1007 | 5.517 | 4.085e-08 |
| YearBuilt | 273.4 | 71.76 | 3.81 | 0.0001446 |
| MasVnrArea | 39.46 | 6.257 | 6.307 | 3.779e-10 |
| BsmtFinSF1 | 19.94 | 2.592 | 7.692 | 2.657e-14 |
| X1stFlrSF | 26.85 | 3.475 | 7.726 | 2.06e-14 |
| BsmtHalfBath | -4250 | 4211 | -1.009 | 0.313 |
| FullBath | 6379 | 2583 | 2.47 | 0.01363 |
| TotRmsAbvGrd | 8901 | 828.8 | 10.74 | 6.092e-26 |
| Fireplaces | 8832 | 1820 | 4.853 | 1.346e-06 |
| GarageYrBlt | 203.1 | 72.73 | 2.792 | 0.005306 |
| WoodDeckSF | 32.95 | 8.486 | 3.883 | 0.000108 |
| EnclosedPorch | 25.93 | 17.95 | 1.445 | 0.1488 |
| MiscVal | -1.491 | 2.008 | -0.7423 | 0.458 |
(Dispersion parameter for gaussian family taken to be 1432794491)

Null deviance: 9.208e+12 on 1459 degrees of freedom
Residual deviance: 2.069e+12 on 1444 degrees of freedom
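From these deviances, the selected model explains roughly \(1 - 2.069/9.208 \approx 78\%\) of the variance in sale price.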
# read in test data, using ID column as row names
test <- read.csv('test.csv', row.names = 1)
# apply same imputations to test data as train data
test$MSSubClass <- factor(test$MSSubClass)
test$LotFrontage[is.na(test$LotFrontage)] <- 0
test$MasVnrArea[is.na(test$MasVnrArea)] <- 0
test$GarageYrBlt[is.na(test$GarageYrBlt)] <-
test$YearBuilt[is.na(test$GarageYrBlt)] +
median(test$GarageYrBlt - test$YearBuilt, na.rm = TRUE)
head(test)
## MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1461 20 RH 80 11622 Pave <NA> Reg
## 1462 20 RL 81 14267 Pave <NA> IR1
## 1463 60 RL 74 13830 Pave <NA> IR1
## 1464 60 RL 78 9978 Pave <NA> IR1
## 1465 120 RL 43 5005 Pave <NA> IR1
## 1466 60 RL 75 10000 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1461 Lvl AllPub Inside Gtl NAmes Feedr
## 1462 Lvl AllPub Corner Gtl NAmes Norm
## 1463 Lvl AllPub Inside Gtl Gilbert Norm
## 1464 Lvl AllPub Inside Gtl Gilbert Norm
## 1465 HLS AllPub Inside Gtl StoneBr Norm
## 1466 Lvl AllPub Corner Gtl Gilbert Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1461 Norm 1Fam 1Story 5 6 1961
## 1462 Norm 1Fam 1Story 6 6 1958
## 1463 Norm 1Fam 2Story 5 5 1997
## 1464 Norm 1Fam 2Story 6 6 1998
## 1465 Norm TwnhsE 1Story 8 5 1992
## 1466 Norm 1Fam 2Story 6 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1461 1961 Gable CompShg VinylSd VinylSd None
## 1462 1958 Hip CompShg Wd Sdng Wd Sdng BrkFace
## 1463 1998 Gable CompShg VinylSd VinylSd None
## 1464 1998 Gable CompShg VinylSd VinylSd BrkFace
## 1465 1992 Gable CompShg HdBoard HdBoard None
## 1466 1994 Gable CompShg HdBoard HdBoard None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond
## 1461 0 TA TA CBlock TA TA
## 1462 108 TA TA CBlock TA TA
## 1463 0 TA TA PConc Gd TA
## 1464 20 TA TA PConc TA TA
## 1465 0 Gd TA PConc Gd TA
## 1466 0 TA TA PConc Gd TA
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## 1461 No Rec 468 LwQ 144
## 1462 No ALQ 923 Unf 0
## 1463 No GLQ 791 Unf 0
## 1464 No GLQ 602 Unf 0
## 1465 No ALQ 263 Unf 0
## 1466 No Unf 0 Unf 0
## BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1461 270 882 GasA TA Y SBrkr
## 1462 406 1329 GasA TA Y SBrkr
## 1463 137 928 GasA Gd Y SBrkr
## 1464 324 926 GasA Ex Y SBrkr
## 1465 1017 1280 GasA Ex Y SBrkr
## 1466 763 763 GasA Gd Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## 1461 896 0 0 896 0 0
## 1462 1329 0 0 1329 0 0
## 1463 928 701 0 1629 0 0
## 1464 926 678 0 1604 0 0
## 1465 1280 0 0 1280 0 0
## 1466 763 892 0 1655 0 0
## FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 1461 1 0 2 1 TA 5
## 1462 1 1 3 1 Gd 6
## 1463 2 1 3 1 TA 6
## 1464 2 1 3 1 Gd 7
## 1465 2 0 2 1 Gd 5
## 1466 2 1 3 1 TA 7
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
## 1461 Typ 0 <NA> Attchd 1961 Unf
## 1462 Typ 0 <NA> Attchd 1958 Unf
## 1463 Typ 1 TA Attchd 1997 Fin
## 1464 Typ 1 Gd Attchd 1998 Fin
## 1465 Typ 0 <NA> Attchd 1992 RFn
## 1466 Typ 1 TA Attchd 1993 Fin
## GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF
## 1461 1 730 TA TA Y 140
## 1462 1 312 TA TA Y 393
## 1463 2 482 TA TA Y 212
## 1464 2 470 TA TA Y 360
## 1465 2 506 TA TA Y 0
## 1466 2 440 TA TA Y 157
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1461 0 0 0 120 0 <NA>
## 1462 36 0 0 0 0 <NA>
## 1463 34 0 0 0 0 <NA>
## 1464 36 0 0 0 0 <NA>
## 1465 82 0 0 144 0 <NA>
## 1466 84 0 0 0 0 <NA>
## Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
## 1461 MnPrv <NA> 0 6 2010 WD Normal
## 1462 <NA> Gar2 12500 6 2010 WD Normal
## 1463 MnPrv <NA> 0 3 2010 WD Normal
## 1464 <NA> <NA> 0 6 2010 WD Normal
## 1465 <NA> <NA> 0 1 2010 WD Normal
## 1466 <NA> <NA> 0 4 2010 WD Normal
# apply the model to the test dataset
pred_bic = predict(bestsubset, test)
# impute median for NAs
pred_bic[is.na(pred_bic)] <- median(train$SalePrice)
# shift predictions so the minimum is zero (a crude guard against negative predicted prices; note this lowers every prediction by the same amount)
pred_bic <- pred_bic - min(pred_bic)
# store as data frames
pred_bic <- data.frame(Id = names(pred_bic), SalePrice = pred_bic)
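Before writing the file, a quick structural check that it matches Kaggle's expected two-column Id/SalePrice submission format:
# sanity check: no missing predictions and no negative prices
head(pred_bic)
stopifnot(!any(is.na(pred_bic$SalePrice)), all(pred_bic$SalePrice >= 0))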
The model's predictions are output to a csv file and submitted to the Kaggle competition under the username jim lung; the resulting score is 0.46252.
# store results
write.csv(pred_bic, 'jimlungBIC.csv', row.names = FALSE)