You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition.
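# LOAD PACKAGES AND DATA
library(dplyr)    # %>%, select_if(), mutate()
library(tidyr)    # gather()
library(ggplot2)  # ggplot()
test <- read.csv("./kaggle_data/test.csv", stringsAsFactors = FALSE, header = TRUE)
train <- read.csv("./kaggle_data/train.csv", stringsAsFactors = FALSE, header = TRUE)
submission <- read.csv("./kaggle_data/sample_submission.csv", stringsAsFactors = FALSE, header = TRUE)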
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any THREE quantitative variables in the dataset.
Test the hypotheses that the correlations between each pairwise set of variables are 0 and provide an 80% confidence interval.
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
The train data shape:
dim(train)
## [1] 1460 81
Numeric variables are listed below:
LotFrontage : Linear feet of street connected to property
LotArea : Lot size in square feet
OverallQual : Rates the overall material and finish of the house
OverallCond : Rates the overall condition of the house
YearBuilt : Original construction date
YearRemodAdd : Remodel date (same as construction date if no remodeling or additions)
MasVnrArea : Masonry veneer area in square feet
BsmtFinSF1 : Type 1 finished square feet
BsmtFinSF2 : Type 2 finished square feet
BsmtUnfSF : Unfinished square feet of basement area
TotalBsmtSF : Total square feet of basement area
1stFlrSF : First floor square feet
2ndFlrSF : Second floor square feet
LowQualFinSF : Low quality finished square feet (all floors)
GrLivArea : Above grade (ground) living area square feet
BsmtFullBath : Basement full bathrooms
BsmtHalfBath : Basement half bathrooms
FullBath : Full bathrooms above grade
HalfBath : Half baths above grade
BedroomAbvGr : Bedrooms above grade (does NOT include basement bedrooms)
KitchenAbvGr : Kitchens above grade
TotRmsAbvGrd : Total rooms above grade (does not include bathrooms)
Fireplaces : Number of fireplaces
GarageYrBlt : Year garage was built
GarageCars : Size of garage in car capacity
GarageArea : Size of garage in square feet
WoodDeckSF : Wood deck area in square feet
OpenPorchSF : Open porch area in square feet
EnclosedPorch : Enclosed porch area in square feet
3SsnPorch : Three season porch area in square feet
ScreenPorch : Screen porch area in square feet
PoolArea : Pool area in square feet
MiscVal : $Value of miscellaneous feature
MoSold : Month Sold (MM)
YrSold : Year Sold (YYYY)
SalePrice : Sale price of the house
First, get a summary of the numeric variables in the training data set.
train %>%
select_if(is.numeric) %>% # filter numeric var
summary()
## Id MSSubClass LotFrontage LotArea
## Min. : 1.0 Min. : 20.0 Min. : 21.00 Min. : 1300
## 1st Qu.: 365.8 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554
## Median : 730.5 Median : 50.0 Median : 69.00 Median : 9478
## Mean : 730.5 Mean : 56.9 Mean : 70.05 Mean : 10517
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602
## Max. :1460.0 Max. :190.0 Max. :313.00 Max. :215245
## NA's :259
## OverallQual OverallCond YearBuilt YearRemodAdd
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967
## Median : 6.000 Median :5.000 Median :1973 Median :1994
## Mean : 6.099 Mean :5.575 Mean :1971 Mean :1985
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
##
## MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0
## Median : 0.0 Median : 383.5 Median : 0.00 Median : 477.5
## Mean : 103.7 Mean : 443.6 Mean : 46.55 Mean : 567.2
## 3rd Qu.: 166.0 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0
## Max. :1600.0 Max. :5644.0 Max. :1474.00 Max. :2336.0
## NA's :8
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF
## Min. : 0.0 Min. : 334 Min. : 0 Min. : 0.000
## 1st Qu.: 795.8 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## Median : 991.5 Median :1087 Median : 0 Median : 0.000
## Mean :1057.4 Mean :1163 Mean : 347 Mean : 5.845
## 3rd Qu.:1298.2 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## Max. :6110.0 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000
## Median :0.0000 Median :3.000 Median :1.000 Median : 6.000
## Mean :0.3829 Mean :2.866 Mean :1.047 Mean : 6.518
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :2.0000 Max. :8.000 Max. :3.000 Max. :14.000
##
## Fireplaces GarageYrBlt GarageCars GarageArea
## Min. :0.000 Min. :1900 Min. :0.000 Min. : 0.0
## 1st Qu.:0.000 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5
## Median :1.000 Median :1980 Median :2.000 Median : 480.0
## Mean :0.613 Mean :1979 Mean :1.767 Mean : 473.0
## 3rd Qu.:1.000 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :3.000 Max. :2010 Max. :4.000 Max. :1418.0
## NA's :81
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea MiscVal MoSold
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 1.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 6.000
## Mean : 15.06 Mean : 2.759 Mean : 43.49 Mean : 6.322
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :480.00 Max. :738.000 Max. :15500.00 Max. :12.000
##
## YrSold SalePrice
## Min. :2006 Min. : 34900
## 1st Qu.:2007 1st Qu.:129975
## Median :2008 Median :163000
## Mean :2008 Mean :180921
## 3rd Qu.:2009 3rd Qu.:214000
## Max. :2010 Max. :755000
##
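The selected variables LotArea, GrLivArea, TotalBsmtSF, and SalePrice are heavily right-skewed, so log-transformed values are used for the plots. Houses without a basement have TotalBsmtSF = 0, and log(0) = -Inf; these non-finite values are dropped from the histograms, which is the likely source of the warnings printed after the plot code.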
train %>%
dplyr::select(LotArea, GrLivArea, TotalBsmtSF, SalePrice) %>%
mutate(LotArea = log(LotArea),
GrLivArea = log(GrLivArea),
TotalBsmtSF = log(TotalBsmtSF),
SalePrice = log(SalePrice)) %>%
gather('key', 'value') %>%
ggplot(aes(x = value, y = ..density..)) +
geom_histogram(
fill = 'red',
alpha = 0.5,
bins = 15) +
geom_density() +
facet_wrap(~key, scales ='free_x') +
labs(x = 'log value',
y = '',
title = 'Histogram (log transformed values)')
## Warning: Removed 37 rows containing non-finite values (stat_bin).
## Warning: Removed 37 rows containing non-finite values (stat_density).
train %>%
dplyr::select(LotArea, GrLivArea, TotalBsmtSF, SalePrice) %>%
mutate(LotArea = log(LotArea),
GrLivArea = log(GrLivArea),
TotalBsmtSF = log(TotalBsmtSF),
SalePrice = log(SalePrice)) %>%
pairs(main = 'Scatterplot matrix LotArea, GrLivArea, TotalBsmtSF, SalePrice')
train %>%
dplyr::select(LotArea, GrLivArea, TotalBsmtSF, SalePrice) %>%
cor() %>%
corrplot(method = 'number')
Looking at the correlation plot, GrLivArea is the variable with the highest correlation with SalePrice.
cor.test(train$GrLivArea, train$SalePrice, method = 'pearson', conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train$GrLivArea and train$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
cor.test(train$TotalBsmtSF, train$SalePrice, method = 'pearson', conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train$TotalBsmtSF and train$SalePrice
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5922142 0.6340846
## sample estimates:
## cor
## 0.6135806
cor.test(train$LotArea, train$SalePrice, method = 'pearson', conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
The 80% confidence interval for the correlation between GrLivArea and SalePrice is (0.6915, 0.7249), with a strong positive correlation of 0.7086. The 80% confidence interval for the correlation between TotalBsmtSF and SalePrice is (0.5922, 0.6341), with a strong positive correlation of 0.6136. The 80% confidence interval for the correlation between LotArea and SalePrice is (0.2323, 0.2948), with a weaker positive correlation of 0.2638.
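The intervals reported by cor.test come from the Fisher z-transformation of the sample correlation. The sketch below reproduces that calculation by hand and compares it with cor.test. It uses the built-in mtcars data (an assumption, so that it runs without the Kaggle files), and the helper name fisher_ci is hypothetical.

```r
# Sketch: confidence interval for a Pearson correlation via Fisher's z.
# mtcars stands in for the Kaggle data; fisher_ci is an illustrative helper.
fisher_ci <- function(x, y, conf.level = 0.80) {
  n <- length(x)
  r <- cor(x, y)
  z <- atanh(r)                          # Fisher z-transform of r
  se <- 1 / sqrt(n - 3)                  # approximate standard error of z
  crit <- qnorm((1 + conf.level) / 2)    # two-sided normal quantile
  tanh(z + c(-1, 1) * crit * se)         # back-transform to the r scale
}

manual <- fisher_ci(mtcars$wt, mtcars$mpg, conf.level = 0.80)
builtin <- cor.test(mtcars$wt, mtcars$mpg, conf.level = 0.80)$conf.int
print(manual)
print(as.numeric(builtin))
```

Because the standard error shrinks with \(1/\sqrt{n-3}\), a sample of 1460 houses gives the narrow intervals seen above.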
Each test returns a very small p-value (< 2.2e-16), so at any conventional significance level we reject the null hypothesis that the correlation is 0 for each of the three pairs tested. There is strong evidence that GrLivArea, TotalBsmtSF, and LotArea are each correlated with SalePrice.
The familywise error rate (FWER) is the probability of coming to at least one false conclusion in a series of hypothesis tests; in other words, it is the probability of making at least one Type I error. A typical FWER approach used in the scientific literature is the Bonferroni correction. Here only three tests were run, and even against a Bonferroni-adjusted threshold of 0.20/3 ≈ 0.067 (three tests at the 80% level) every p-value (< 2.2e-16) is far smaller, so it is unlikely that this error has occurred.
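To make the familywise argument concrete, the sketch below computes the FWER for three tests at the 0.20 level and applies a Bonferroni adjustment with p.adjust. Treating the tests as independent is an assumption made only for the illustration (all three share SalePrice), and the p-values are the reported upper bounds rather than exact values.

```r
# Sketch: familywise error rate for m tests and a Bonferroni adjustment.
alpha <- 0.20   # per-test level matching the 80% intervals
m <- 3          # number of correlation tests

# P(at least one Type I error) if all nulls were true and tests independent
fwer <- 1 - (1 - alpha)^m

# Bonferroni: inflate each p-value by m (capped at 1) and compare with alpha
p_raw <- rep(2.2e-16, m)   # reported upper bounds on the p-values
p_adj <- p.adjust(p_raw, method = "bonferroni")

print(fwer)
print(alpha / m)
print(all(p_adj < alpha))
```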
Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
# create 3x3 correlation matrix with GrLivArea, TotalBsmtSF, SalePrice
mat <- train %>%
dplyr::select(GrLivArea, TotalBsmtSF, SalePrice) %>%
cor() %>%
round(4)
mat
## GrLivArea TotalBsmtSF SalePrice
## GrLivArea 1.0000 0.4549 0.7086
## TotalBsmtSF 0.4549 1.0000 0.6136
## SalePrice 0.7086 0.6136 1.0000
prec.mat <- solve(mat) %>%
round(4)
prec.mat
## GrLivArea TotalBsmtSF SalePrice
## GrLivArea 2.0111 -0.0648 -1.3853
## TotalBsmtSF -0.0648 1.6060 -0.9395
## SalePrice -1.3853 -0.9395 2.5581
Since \(Precision = Correlation^{-1}\), both \(Precision \times Correlation\) and \(Correlation \times Precision\) should equal the identity matrix \(I\).
prec.mat %*% mat %>%
round(4)
## GrLivArea TotalBsmtSF SalePrice
## GrLivArea 1 0 0
## TotalBsmtSF 0 1 0
## SalePrice 0 0 1
mat %*% prec.mat %>%
round(4)
## GrLivArea TotalBsmtSF SalePrice
## GrLivArea 1 0 0
## TotalBsmtSF 0 1 0
## SalePrice 0 0 1
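The statement that the precision matrix carries variance inflation factors on its diagonal can be checked directly: each diagonal entry equals \(1/(1 - R^2)\), where \(R^2\) comes from regressing that variable on the other two. A minimal sketch, using the built-in mtcars data as a stand-in (an assumption, so that it runs without the Kaggle files):

```r
# Sketch: diagonal of the inverse correlation matrix equals the VIFs.
vars <- mtcars[, c("mpg", "wt", "hp")]   # stand-in for the three variables
prec <- solve(cor(vars))                 # precision matrix

# VIF of each variable from regressing it on the other two
vif_manual <- sapply(names(vars), function(v) {
  others <- setdiff(names(vars), v)
  fit <- lm(reformulate(others, response = v), data = vars)
  1 / (1 - summary(fit)$r.squared)
})

print(diag(prec))
print(vif_manual)
```

Read this way, the diagonal of prec.mat above (2.0111, 1.6060, 2.5581) indicates only mild collinearity among the three variables.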
ALU <- function(A){
# Factorize A, a square matrix, into Lower Triangular(L) and Upper Triangular(U) matrix
# This function only works for square matrix, thus check whether the matrix is square first.
if(nrow(A) != ncol(A)){
stop("This is not a square matrix. Try again.")
}
# initialize variables
counter <- as.integer(nrow(A))
U <- A
L <- diag(counter)
# m for rows, n for columns
# iterate through columns
for (n in 1:(counter-1)){
# iterate through the rows
m <- n+1
for (m in (n+1):counter){
x <- -U[m,n]/U[n,n] # U[m,n] <- U[m,n]+x*U[n,n]
U[m,n] <- 0
# add x times the pivot row to the remaining entries of the current row
n2 <- n+1
for (n2 in (n+1):counter){
U[m,n2] <- U[m,n2] + x*U[n,n2]
}
# assign the x to L
L[m,n] = -x
}
}
line <- ("=========================================")
print("A")
print(A)
print(line)
print("U")
print(U)
print(line)
print("L")
print(L)
print(line)
print("A = LU")
print(A == (L %*% U))
print(L %*% U)
}
# or simply, lu.decomposition(mat) from the matrixcalc package
ALU(mat)
## [1] "A"
## GrLivArea TotalBsmtSF SalePrice
## GrLivArea 1.0000 0.4549 0.7086
## TotalBsmtSF 0.4549 1.0000 0.6136
## SalePrice 0.7086 0.6136 1.0000
## [1] "========================================="
## [1] "U"
## GrLivArea TotalBsmtSF SalePrice
## GrLivArea 1 0.454900 0.7086000
## TotalBsmtSF 0 0.793066 0.2912579
## SalePrice 0 0.000000 0.3909200
## [1] "========================================="
## [1] "L"
## [,1] [,2] [,3]
## [1,] 1.0000 0.0000000 0
## [2,] 0.4549 1.0000000 0
## [3,] 0.7086 0.3672555 1
## [1] "========================================="
## [1] "A = LU"
## GrLivArea TotalBsmtSF SalePrice
## GrLivArea TRUE TRUE TRUE
## TotalBsmtSF TRUE TRUE TRUE
## SalePrice TRUE TRUE TRUE
## GrLivArea TotalBsmtSF SalePrice
## [1,] 1.0000 0.4549 0.7086
## [2,] 0.4549 1.0000 0.6136
## [3,] 0.7086 0.6136 1.0000
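The same elimination can be written more compactly by updating a whole row at a time. The sketch below is a Doolittle factorization without pivoting, so it assumes every pivot is nonzero; the name lu_doolittle is hypothetical. It checks the product with all.equal rather than ==, since floating-point round-off can make exact equality fail even when the factorization is correct.

```r
# Sketch: Doolittle LU factorization without pivoting (assumes nonzero pivots).
lu_doolittle <- function(A) {
  stopifnot(is.matrix(A), nrow(A) == ncol(A))
  n <- nrow(A)
  U <- A
  L <- diag(n)
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # elimination multiplier
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]   # subtract multiple of pivot row
      U[i, k] <- 0                          # force an exact zero below the pivot
    }
  }
  list(L = L, U = U)
}

# The rounded correlation matrix printed above
A <- matrix(c(1.0000, 0.4549, 0.7086,
              0.4549, 1.0000, 0.6136,
              0.7086, 0.6136, 1.0000), nrow = 3, byrow = TRUE)
res <- lu_doolittle(A)
print(res$L)
print(res$U)
print(isTRUE(all.equal(res$L %*% res$U, A)))
```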
Many times, it makes sense to fit a closed form distribution to data.
Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
Then load the MASS package and run fitdistr to fit an exponential probability density function.
(See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\) )).
Plot a histogram and compare it with a histogram of your original variable.
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
Also generate a 95% confidence interval from the empirical data, assuming normality.
Finally, provide the empirical 5th percentile and 95th percentile of the data.
Discuss.
I selected GrLivArea as it appears to be skewed to the right.
First, check the summary of GrLivArea and plot it to see the distribution.
summary(train$GrLivArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
par(mfrow=c(1,2))
hist(train$GrLivArea, main="Histogram of GrLivArea")
boxplot(train$GrLivArea, main="Boxplot of GrLivArea")
Calculate the optimal lambda for right-skewed GrLivArea using MASS library.
library(MASS)  # provides fitdistr()
lambda <- fitdistr(train$GrLivArea, densfun='exponential')
lambda$estimate
## rate
## 0.000659864
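For the exponential distribution the maximum-likelihood estimate of the rate has a closed form, \(\hat{\lambda} = 1/\bar{x}\), which agrees with the value above (1 / 1515.464 ≈ 0.00066). The sketch below checks this on simulated data; the vector x is an assumption standing in for GrLivArea.

```r
# Sketch: the exponential MLE from fitdistr equals 1 / mean(x).
library(MASS)   # fitdistr()

set.seed(1)
x <- rexp(500, rate = 0.002)   # simulated right-skewed, strictly positive data

fit <- fitdistr(x, densfun = "exponential")
rate_mle <- unname(fit$estimate)
rate_closed_form <- 1 / mean(x)

print(rate_mle)
print(rate_closed_form)
```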
Draw 1000 samples from an exponential distribution with the fitted rate.
set.seed(23)
pdf.dist <- rexp(1000,lambda$estimate)
Plot the results of the exponential distribution.
hist(pdf.dist, freq = FALSE, breaks = 100,
main ="Fitted Exponential PDF with GrLivArea",
xlab = "PDF GrLivArea",
xlim = c(1, quantile(pdf.dist, 0.99)))
curve(dexp(x, rate = lambda$estimate), col = "red", add = TRUE)
Original - GrLivArea
set.seed(23)
samp.GrLivArea <- sample(train$GrLivArea, 1000, replace=TRUE, prob=NULL)
exp.train <- tibble(Expo=rexp(1000, lambda$estimate)) %>%  # tibble() replaces the deprecated data_frame()
mutate(GrLivArea = samp.GrLivArea)
plotdist(exp.train$GrLivArea, histo = TRUE, demp = TRUE)
Sample - Fitted Exponential PDF with GrLivArea
plotdist(exp.train$Expo, histo = TRUE, demp = TRUE)
hist(train$GrLivArea, freq = FALSE, breaks = 100,
main ="Comparison with original GrLivArea",
xlab="Original GrLivArea",
xlim = c(1, quantile(train$GrLivArea, 0.99)))
curve(dexp(x, rate = lambda$estimate), col = "red", add = TRUE)
pdf.5th <- qexp(0.05, rate = lambda$estimate, lower.tail = TRUE, log.p = FALSE)
pdf.5th <- round(pdf.5th, 4)
pdf.95th <- qexp(0.95, rate = lambda$estimate, lower.tail = TRUE, log.p = FALSE)
pdf.95th <- round(pdf.95th, 4)
Under the fitted exponential distribution, the 5th percentile is 77.7331 and the 95th percentile is 4539.9235.
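These percentiles follow from inverting the exponential CDF \(F(x) = 1 - e^{-\lambda x}\), which gives \(x_p = -\ln(1 - p)/\lambda\). A short check that the closed form agrees with qexp, using the rounded rate reported by fitdistr:

```r
# Sketch: exponential percentiles from the inverse CDF versus qexp().
rate <- 0.000659864   # rounded rate reported by fitdistr
p <- c(0.05, 0.95)

closed_form <- -log(1 - p) / rate   # inverse of F(x) = 1 - exp(-rate * x)
via_qexp <- qexp(p, rate = rate)

print(closed_form)
print(via_qexp)
```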
CI(train$GrLivArea, 0.95)
## upper mean lower
## 1542.440 1515.464 1488.487
quantile(train$GrLivArea, c(.05, .95))
## 5% 95%
## 848.0 2466.1
We are 95% confident that the mean of GrLivArea is between 1488.487 and 1542.440. The exponential distribution is not a good fit: its 5th and 95th percentiles (77.73 and 4539.92) are far from the empirical ones (848.0 and 2466.1), and its mass is concentrated near zero, to the left of the empirical data.
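The output format of CI() suggests it is Rmisc::CI (an assumption), which computes a t-based interval for the mean, \(\bar{x} \pm t_{0.975,\,n-1}\, s/\sqrt{n}\). A minimal base-R sketch on simulated data (the vector x is a stand-in, not GrLivArea), cross-checked against t.test:

```r
# Sketch: 95% confidence interval for a mean, assuming normality.
set.seed(42)
x <- rnorm(200, mean = 1500, sd = 500)   # simulated stand-in data

n <- length(x)
half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
ci_manual <- c(lower = mean(x) - half_width, upper = mean(x) + half_width)

ci_ttest <- t.test(x, conf.level = 0.95)$conf.int
print(ci_manual)
print(as.numeric(ci_ttest))
```

This interval describes uncertainty about the mean, which is why it is much narrower than the range between the 5th and 95th percentiles of the data.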
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
#str(train)
#str(test)
#combine data to clean
test$SalePrice <- 0
train$label <- 0
test$label <- 1
df.merge <- rbind(train,test)
# Divide the data into categorical and numerical columns
df.cat <- df.merge[, sapply(df.merge, is.character)]
df.num <- df.merge[, sapply(df.merge, is.numeric)]
# Check NA
library(Amelia)
# visualization
missmap(df.merge,
main = "Missing Map",
col =c("yellow","black"),
legend = FALSE)
NA.check <- function(data){
index <- sapply(data, function(x) sum(is.na(x)))
new.data <- data.frame(index = names(data),
na.values = index)
new.data$perc <- round(new.data$na.values/nrow(data),4)
arrange(new.data[new.data$na.values > 0,], desc(na.values))
}
# table
na.cat <- NA.check(df.cat)
na.num <- NA.check(df.num)
kable(na.cat)
index | na.values | perc |
---|---|---|
PoolQC | 2909 | 0.9966 |
MiscFeature | 2814 | 0.9640 |
Alley | 2721 | 0.9322 |
Fence | 2348 | 0.8044 |
FireplaceQu | 1420 | 0.4865 |
GarageFinish | 159 | 0.0545 |
GarageQual | 159 | 0.0545 |
GarageCond | 159 | 0.0545 |
GarageType | 157 | 0.0538 |
BsmtCond | 82 | 0.0281 |
BsmtExposure | 82 | 0.0281 |
BsmtQual | 81 | 0.0277 |
BsmtFinType2 | 80 | 0.0274 |
BsmtFinType1 | 79 | 0.0271 |
MasVnrType | 24 | 0.0082 |
MSZoning | 4 | 0.0014 |
Utilities | 2 | 0.0007 |
Functional | 2 | 0.0007 |
Exterior1st | 1 | 0.0003 |
Exterior2nd | 1 | 0.0003 |
Electrical | 1 | 0.0003 |
KitchenQual | 1 | 0.0003 |
SaleType | 1 | 0.0003 |
kable(na.num)
index | na.values | perc |
---|---|---|
LotFrontage | 486 | 0.1665 |
GarageYrBlt | 159 | 0.0545 |
MasVnrArea | 23 | 0.0079 |
BsmtFullBath | 2 | 0.0007 |
BsmtHalfBath | 2 | 0.0007 |
BsmtFinSF1 | 1 | 0.0003 |
BsmtFinSF2 | 1 | 0.0003 |
BsmtUnfSF | 1 | 0.0003 |
TotalBsmtSF | 1 | 0.0003 |
GarageCars | 1 | 0.0003 |
GarageArea | 1 | 0.0003 |
For categorical variables, we drop the columns that are mostly missing, encode the remaining ones as integers, and replace NA values with 0. For numerical variables, we impute NA values with the column mean.
# Given the NA table above, we (arbitrarily) keep only the categorical variables with roughly 95% or more non-missing values.
df.cat <- subset(df.cat,
select = -c(PoolQC, MiscFeature, Alley, Fence, FireplaceQu))
# Convert categorical values to factors, then to integer codes; replace NAs with 0
replace.cat <- function(df) {
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)
df[sapply(df, is.factor)] <- lapply(df[sapply(df, is.factor)], as.integer)
df[is.na(df)] <- 0
df
}
df.cat <- replace.cat(df.cat)
NA.check(df.cat)
## [1] index na.values perc
## <0 rows> (or 0-length row.names)
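The encoding in replace.cat maps each category to the position of its level in sorted (alphabetical) order, so the integer codes carry an ordering that need not reflect quality or price. A toy illustration; the vector qual is made up for the example:

```r
# Sketch: what factor -> integer encoding does to category labels.
qual <- c("Gd", "TA", NA, "Ex", "Fa", "TA")   # made-up quality labels

f <- as.factor(qual)
codes <- as.integer(f)
codes[is.na(codes)] <- 0   # same NA handling as replace.cat

print(levels(f))   # levels are sorted alphabetically
print(codes)
```

If "Ex" stands for the best rating, it still receives the lowest nonzero code, which may explain why quality variables such as ExterQual and KitchenQual can end up negatively correlated with SalePrice after this encoding.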
# For numerical missing values, we impute with mean
summary(df.num)
## Id MSSubClass LotFrontage LotArea
## Min. : 1.0 Min. : 20.00 Min. : 21.00 Min. : 1300
## 1st Qu.: 730.5 1st Qu.: 20.00 1st Qu.: 59.00 1st Qu.: 7478
## Median :1460.0 Median : 50.00 Median : 68.00 Median : 9453
## Mean :1460.0 Mean : 57.14 Mean : 69.31 Mean : 10168
## 3rd Qu.:2189.5 3rd Qu.: 70.00 3rd Qu.: 80.00 3rd Qu.: 11570
## Max. :2919.0 Max. :190.00 Max. :313.00 Max. :215245
## NA's :486
## OverallQual OverallCond YearBuilt YearRemodAdd
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1965
## Median : 6.000 Median :5.000 Median :1973 Median :1993
## Mean : 6.089 Mean :5.565 Mean :1971 Mean :1984
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2001 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
##
## MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 220.0
## Median : 0.0 Median : 368.5 Median : 0.00 Median : 467.0
## Mean : 102.2 Mean : 441.4 Mean : 49.58 Mean : 560.8
## 3rd Qu.: 164.0 3rd Qu.: 733.0 3rd Qu.: 0.00 3rd Qu.: 805.5
## Max. :1600.0 Max. :5644.0 Max. :1526.00 Max. :2336.0
## NA's :23 NA's :1 NA's :1 NA's :1
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF
## Min. : 0.0 Min. : 334 Min. : 0.0 Min. : 0.000
## 1st Qu.: 793.0 1st Qu.: 876 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 989.5 Median :1082 Median : 0.0 Median : 0.000
## Mean :1051.8 Mean :1160 Mean : 336.5 Mean : 4.694
## 3rd Qu.:1302.0 3rd Qu.:1388 3rd Qu.: 704.0 3rd Qu.: 0.000
## Max. :6110.0 Max. :5095 Max. :2065.0 Max. :1064.000
## NA's :1
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1126 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1444 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1501 Mean :0.4299 Mean :0.06136 Mean :1.568
## 3rd Qu.:1744 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :4.000
## NA's :2 NA's :2
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## Min. :0.0000 Min. :0.00 Min. :0.000 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:2.00 1st Qu.:1.000 1st Qu.: 5.000
## Median :0.0000 Median :3.00 Median :1.000 Median : 6.000
## Mean :0.3803 Mean :2.86 Mean :1.045 Mean : 6.452
## 3rd Qu.:1.0000 3rd Qu.:3.00 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :2.0000 Max. :8.00 Max. :3.000 Max. :15.000
##
## Fireplaces GarageYrBlt GarageCars GarageArea
## Min. :0.0000 Min. :1895 Min. :0.000 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:1960 1st Qu.:1.000 1st Qu.: 320.0
## Median :1.0000 Median :1979 Median :2.000 Median : 480.0
## Mean :0.5971 Mean :1978 Mean :1.767 Mean : 472.9
## 3rd Qu.:1.0000 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.0000 Max. :2207 Max. :5.000 Max. :1488.0
## NA's :159 NA's :1 NA's :1
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 0.00 Median : 26.00 Median : 0.0 Median : 0.000
## Mean : 93.71 Mean : 47.49 Mean : 23.1 Mean : 2.602
## 3rd Qu.: 168.00 3rd Qu.: 70.00 3rd Qu.: 0.0 3rd Qu.: 0.000
## Max. :1424.00 Max. :742.00 Max. :1012.0 Max. :508.000
##
## ScreenPorch PoolArea MiscVal MoSold
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 1.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 4.000
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 6.000
## Mean : 16.06 Mean : 2.252 Mean : 50.83 Mean : 6.213
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :576.00 Max. :800.000 Max. :17000.00 Max. :12.000
##
## YrSold SalePrice label
## Min. :2006 Min. : 0 Min. :0.0000
## 1st Qu.:2007 1st Qu.: 0 1st Qu.:0.0000
## Median :2008 Median : 34900 Median :0.0000
## Mean :2008 Mean : 90492 Mean :0.4998
## 3rd Qu.:2009 3rd Qu.:163000 3rd Qu.:1.0000
## Max. :2010 Max. :755000 Max. :1.0000
##
df.num <- df.num %>%
mutate_all(~ifelse(is.na(.x),
mean(.x, na.rm = TRUE),
.x))
summary(df.num)
## Id MSSubClass LotFrontage LotArea
## Min. : 1.0 Min. : 20.00 Min. : 21.00 Min. : 1300
## 1st Qu.: 730.5 1st Qu.: 20.00 1st Qu.: 60.00 1st Qu.: 7478
## Median :1460.0 Median : 50.00 Median : 69.31 Median : 9453
## Mean :1460.0 Mean : 57.14 Mean : 69.31 Mean : 10168
## 3rd Qu.:2189.5 3rd Qu.: 70.00 3rd Qu.: 78.00 3rd Qu.: 11570
## Max. :2919.0 Max. :190.00 Max. :313.00 Max. :215245
## OverallQual OverallCond YearBuilt YearRemodAdd
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1965
## Median : 6.000 Median :5.000 Median :1973 Median :1993
## Mean : 6.089 Mean :5.565 Mean :1971 Mean :1984
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2001 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
## MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 220.0
## Median : 0.0 Median : 369.0 Median : 0.00 Median : 467.0
## Mean : 102.2 Mean : 441.4 Mean : 49.58 Mean : 560.8
## 3rd Qu.: 163.5 3rd Qu.: 733.0 3rd Qu.: 0.00 3rd Qu.: 805.0
## Max. :1600.0 Max. :5644.0 Max. :1526.00 Max. :2336.0
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF
## Min. : 0 Min. : 334 Min. : 0.0 Min. : 0.000
## 1st Qu.: 793 1st Qu.: 876 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 990 Median :1082 Median : 0.0 Median : 0.000
## Mean :1052 Mean :1160 Mean : 336.5 Mean : 4.694
## 3rd Qu.:1302 3rd Qu.:1388 3rd Qu.: 704.0 3rd Qu.: 0.000
## Max. :6110 Max. :5095 Max. :2065.0 Max. :1064.000
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1126 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1444 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1501 Mean :0.4299 Mean :0.06136 Mean :1.568
## 3rd Qu.:1744 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :4.000
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## Min. :0.0000 Min. :0.00 Min. :0.000 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:2.00 1st Qu.:1.000 1st Qu.: 5.000
## Median :0.0000 Median :3.00 Median :1.000 Median : 6.000
## Mean :0.3803 Mean :2.86 Mean :1.045 Mean : 6.452
## 3rd Qu.:1.0000 3rd Qu.:3.00 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :2.0000 Max. :8.00 Max. :3.000 Max. :15.000
## Fireplaces GarageYrBlt GarageCars GarageArea
## Min. :0.0000 Min. :1895 Min. :0.000 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:1962 1st Qu.:1.000 1st Qu.: 320.0
## Median :1.0000 Median :1978 Median :2.000 Median : 480.0
## Mean :0.5971 Mean :1978 Mean :1.767 Mean : 472.9
## 3rd Qu.:1.0000 3rd Qu.:2001 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.0000 Max. :2207 Max. :5.000 Max. :1488.0
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 0.00 Median : 26.00 Median : 0.0 Median : 0.000
## Mean : 93.71 Mean : 47.49 Mean : 23.1 Mean : 2.602
## 3rd Qu.: 168.00 3rd Qu.: 70.00 3rd Qu.: 0.0 3rd Qu.: 0.000
## Max. :1424.00 Max. :742.00 Max. :1012.0 Max. :508.000
## ScreenPorch PoolArea MiscVal MoSold
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 1.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 4.000
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 6.000
## Mean : 16.06 Mean : 2.252 Mean : 50.83 Mean : 6.213
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :576.00 Max. :800.000 Max. :17000.00 Max. :12.000
## YrSold SalePrice label
## Min. :2006 Min. : 0 Min. :0.0000
## 1st Qu.:2007 1st Qu.: 0 1st Qu.:0.0000
## Median :2008 Median : 34900 Median :0.0000
## Mean :2008 Mean : 90492 Mean :0.4998
## 3rd Qu.:2009 3rd Qu.:163000 3rd Qu.:1.0000
## Max. :2010 Max. :755000 Max. :1.0000
NA.check(df.num)
## [1] index na.values perc
## <0 rows> (or 0-length row.names)
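The mutate_all step above can also be written in base R. The sketch below applies the same mean imputation to a toy data frame (the data and the helper name impute_mean are made up for illustration):

```r
# Sketch: column-wise mean imputation in base R.
df <- data.frame(a = c(1, NA, 3), b = c(10, 20, NA))   # toy data

impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)   # fill NAs with the observed mean
  x
}
df[] <- lapply(df, impute_mean)   # df[] keeps the data.frame structure
print(df)
```

Mean imputation leaves each column mean unchanged, which matches the two summaries above: the LotFrontage mean is 69.31 both before and after imputation, while its quartiles shift.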
# Assemble data
full.updated <- cbind(df.num, df.cat)
# Split data
train <- full.updated %>%
filter(label == 0) %>%
dplyr::select(-label)
test <- full.updated %>%
filter(label == 1) %>%
dplyr::select(-label, -SalePrice)
# Visualization
train %>%
dplyr::select(-Id) %>%
cor() %>%
ggcorr()
# Table
train %>%
dplyr::select(-Id) %>%
cor() %>%
as.data.frame() %>%
mutate(cor.names = colnames(.)) %>%
dplyr::select(cor.names, SalePrice) %>%
arrange(desc(SalePrice)) %>%
kable(.)
cor.names | SalePrice |
---|---|
SalePrice | 1.0000000 |
OverallQual | 0.7909816 |
GrLivArea | 0.7086245 |
GarageCars | 0.6404092 |
GarageArea | 0.6234314 |
TotalBsmtSF | 0.6135806 |
X1stFlrSF | 0.6058522 |
FullBath | 0.5606638 |
TotRmsAbvGrd | 0.5337232 |
YearBuilt | 0.5228973 |
YearRemodAdd | 0.5071010 |
MasVnrArea | 0.4752097 |
GarageYrBlt | 0.4710619 |
Fireplaces | 0.4669288 |
BsmtFinSF1 | 0.3864198 |
Foundation | 0.3824790 |
LotFrontage | 0.3348202 |
WoodDeckSF | 0.3244134 |
X2ndFlrSF | 0.3193338 |
OpenPorchSF | 0.3158562 |
HalfBath | 0.2841077 |
GarageCond | 0.2757815 |
LotArea | 0.2638434 |
GarageQual | 0.2613470 |
CentralAir | 0.2513282 |
Electrical | 0.2339194 |
PavedDrive | 0.2313570 |
BsmtFullBath | 0.2271222 |
RoofStyle | 0.2224053 |
BsmtUnfSF | 0.2144791 |
SaleCondition | 0.2130920 |
HouseStyle | 0.1801626 |
Neighborhood | 0.1709413 |
BedroomAbvGr | 0.1682132 |
BsmtCond | 0.1473674 |
RoofMatl | 0.1323831 |
BsmtFinType2 | 0.1308142 |
ExterCond | 0.1173027 |
Functional | 0.1153279 |
ScreenPorch | 0.1114466 |
Exterior2nd | 0.1037655 |
Exterior1st | 0.1035510 |
PoolArea | 0.0924035 |
Condition1 | 0.0911549 |
LandSlope | 0.0511522 |
MoSold | 0.0464322 |
X3SsnPorch | 0.0445837 |
Street | 0.0410355 |
LandContour | 0.0154532 |
Condition2 | 0.0075127 |
MasVnrType | -0.0004878 |
BsmtFinSF2 | -0.0113781 |
BsmtFinType1 | -0.0132329 |
Utilities | -0.0143143 |
BsmtHalfBath | -0.0168442 |
MiscVal | -0.0211896 |
LowQualFinSF | -0.0256061 |
YrSold | -0.0289226 |
SaleType | -0.0503695 |
LotConfig | -0.0673960 |
OverallCond | -0.0778559 |
MSSubClass | -0.0842841 |
BldgType | -0.0855906 |
Heating | -0.0988121 |
EnclosedPorch | -0.1285780 |
KitchenAbvGr | -0.1359074 |
MSZoning | -0.1668722 |
BsmtExposure | -0.1930786 |
GarageType | -0.2238185 |
LotShape | -0.2555799 |
GarageFinish | -0.2924833 |
HeatingQC | -0.4001775 |
BsmtQual | -0.4388810 |
KitchenQual | -0.5891888 |
ExterQual | -0.6368837 |
ggplot(train, aes(x = SalePrice, y = ..density..)) +
geom_histogram(fill = 'red',
alpha = 0.5,
bins = 15) +
geom_density() +
labs(title = 'Sales price histogram')
SalePrice details:
describe(train$SalePrice)
## vars n mean sd median trimmed mad min max range
## X1 1 1460 180921.2 79442.5 163000 170783.3 56338.8 34900 755000 720100
## skew kurtosis se
## X1 1.88 6.5 2079.11
round(sd(train$SalePrice)) ^ 2 # approximate variance: rounded sd, squared
## [1] 6311190249
Before building the model, look at SalePrice. The summary above shows that it is skewed to the right: the skewness is 1.88 and the mean (180921.2) exceeds the median (163000). Since all values are positive and the variance is large, a log transformation is a reasonable choice.
fit <- fitdistr(train$SalePrice, densfun="log-normal")
fit
## meanlog sdlog
## 12.024050901 0.399315046
## ( 0.010450552) ( 0.007389656)
SalePrice.log <- log(train$SalePrice)
sample.log <- sample(SalePrice.log, size = 1000, replace = TRUE, prob = NULL)
sample.origin <- sample(train$SalePrice, size = 1000, replace = TRUE, prob = NULL)
par(mfrow=c(1,2))
hist(sample.origin, col ="red")
hist(sample.log, col = "blue")
With the log transformation, the SalePrice distribution appears much closer to normal. Let's add this new column, SalePrice.log, to the dataset in place of SalePrice.
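The effect of the log transformation on skewness can be quantified rather than judged from histograms alone. The sketch below simulates log-normal prices with parameters close to the fitted meanlog and sdlog above (the simulated data and the hand-written skewness helper are assumptions for illustration) and compares skewness before and after taking logs.

```r
# Sketch: skewness of log-normal data before and after a log transform.
set.seed(7)
price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)   # simulated prices

skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5   # moment-based sample skewness
}

skew_raw <- skewness(price)
skew_log <- skewness(log(price))
print(skew_raw)
print(skew_log)
```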
# cbind the Salesprice.log
d.train <- train %>%
cbind(., SalePrice.log) %>%
dplyr::select(-SalePrice)
# multiple regression
reg.lm <- lm(SalePrice.log ~ . , data = d.train)
summary(reg.lm)
##
## Call:
## lm(formula = SalePrice.log ~ ., data = d.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.72503 -0.06237 0.00320 0.06696 0.54303
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.081e+01 5.754e+00 3.616 0.000310 ***
## Id -8.974e-06 8.839e-06 -1.015 0.310149
## MSSubClass -2.239e-04 1.974e-04 -1.134 0.256984
## LotFrontage -4.996e-04 2.181e-04 -2.291 0.022126 *
## LotArea 1.557e-06 4.624e-07 3.368 0.000779 ***
## OverallQual 7.023e-02 5.159e-03 13.614 < 2e-16 ***
## OverallCond 3.968e-02 4.549e-03 8.722 < 2e-16 ***
## YearBuilt 1.825e-03 3.427e-04 5.325 1.18e-07 ***
## YearRemodAdd 7.902e-04 2.954e-04 2.675 0.007570 **
## MasVnrArea 1.379e-05 2.632e-05 0.524 0.600378
## BsmtFinSF1 3.777e-05 2.240e-05 1.686 0.092042 .
## BsmtFinSF2 1.262e-04 3.437e-05 3.672 0.000250 ***
## BsmtUnfSF 2.870e-05 2.193e-05 1.309 0.190718
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 2.092e-04 2.708e-05 7.725 2.13e-14 ***
## X2ndFlrSF 1.654e-04 2.094e-05 7.896 5.80e-15 ***
## LowQualFinSF 1.680e-04 8.160e-05 2.059 0.039660 *
## GrLivArea NA NA NA NA
## BsmtFullBath 5.729e-02 1.063e-02 5.391 8.21e-08 ***
## BsmtHalfBath 2.178e-02 1.671e-02 1.303 0.192741
## FullBath 3.897e-02 1.169e-02 3.333 0.000883 ***
## HalfBath 1.794e-02 1.103e-02 1.626 0.104103
## BedroomAbvGr 7.074e-03 7.244e-03 0.977 0.328919
## KitchenAbvGr -3.421e-02 2.187e-02 -1.564 0.118072
## TotRmsAbvGrd 1.364e-02 5.091e-03 2.679 0.007465 **
## Fireplaces 3.457e-02 7.323e-03 4.720 2.59e-06 ***
## GarageYrBlt -7.378e-04 3.051e-04 -2.418 0.015733 *
## GarageCars 6.276e-02 1.211e-02 5.181 2.53e-07 ***
## GarageArea 4.230e-05 4.224e-05 1.001 0.316860
## WoodDeckSF 1.077e-04 3.262e-05 3.303 0.000982 ***
## OpenPorchSF -3.070e-05 6.215e-05 -0.494 0.621466
## EnclosedPorch 1.518e-04 6.793e-05 2.235 0.025572 *
## X3SsnPorch 1.782e-04 1.264e-04 1.410 0.158910
## ScreenPorch 3.228e-04 6.989e-05 4.619 4.21e-06 ***
## PoolArea -2.753e-04 9.629e-05 -2.859 0.004316 **
## MiscVal -9.565e-07 7.555e-06 -0.127 0.899280
## MoSold 9.798e-05 1.387e-03 0.071 0.943682
## YrSold -7.107e-03 2.842e-03 -2.500 0.012525 *
## MSZoning -1.774e-02 6.584e-03 -2.695 0.007125 **
## Street 1.909e-01 6.088e-02 3.135 0.001752 **
## LotShape -6.274e-03 2.874e-03 -2.183 0.029209 *
## LandContour 1.062e-02 5.844e-03 1.818 0.069328 .
## Utilities -1.750e-01 1.447e-01 -1.209 0.226909
## LotConfig -1.887e-03 2.382e-03 -0.792 0.428423
## LandSlope 3.529e-02 1.665e-02 2.119 0.034227 *
## Neighborhood 8.286e-04 6.842e-04 1.211 0.226096
## Condition1 1.939e-03 4.417e-03 0.439 0.660793
## Condition2 -4.587e-02 1.459e-02 -3.144 0.001704 **
## BldgType -1.178e-02 6.499e-03 -1.813 0.070001 .
## HouseStyle -4.533e-03 2.847e-03 -1.592 0.111560
## RoofStyle 4.944e-03 4.890e-03 1.011 0.312102
## RoofMatl 9.324e-03 6.542e-03 1.425 0.154286
## Exterior1st -3.620e-03 2.270e-03 -1.595 0.110929
## Exterior2nd 3.370e-03 2.053e-03 1.642 0.100831
## MasVnrType 4.663e-03 6.414e-03 0.727 0.467331
## ExterQual -9.145e-03 8.571e-03 -1.067 0.286203
## ExterCond 1.072e-02 5.450e-03 1.968 0.049302 *
## Foundation 1.357e-02 7.326e-03 1.853 0.064124 .
## BsmtQual -1.325e-02 5.900e-03 -2.245 0.024920 *
## BsmtCond 1.250e-02 5.702e-03 2.192 0.028534 *
## BsmtExposure -7.357e-03 3.812e-03 -1.930 0.053836 .
## BsmtFinType1 -6.083e-03 2.707e-03 -2.247 0.024806 *
## BsmtFinType2 1.603e-02 4.881e-03 3.285 0.001045 **
## Heating -3.864e-03 1.399e-02 -0.276 0.782411
## HeatingQC -7.933e-03 2.657e-03 -2.986 0.002878 **
## CentralAir 7.360e-02 1.953e-02 3.769 0.000171 ***
## Electrical -1.220e-03 3.957e-03 -0.308 0.757965
## KitchenQual -2.337e-02 6.256e-03 -3.735 0.000195 ***
## Functional 1.744e-02 4.107e-03 4.246 2.32e-05 ***
## GarageType -4.013e-03 2.716e-03 -1.477 0.139780
## GarageFinish -9.937e-03 6.209e-03 -1.600 0.109740
## GarageQual -5.157e-04 7.270e-03 -0.071 0.943460
## GarageCond 8.531e-03 7.633e-03 1.118 0.263915
## PavedDrive 2.362e-02 8.998e-03 2.625 0.008772 **
## SaleType -1.493e-03 2.491e-03 -0.599 0.548950
## SaleCondition 2.298e-02 3.594e-03 6.393 2.21e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1381 on 1386 degrees of freedom
## Multiple R-squared: 0.8865, Adjusted R-squared: 0.8805
## F-statistic: 148.3 on 73 and 1386 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(reg.lm)
## Warning: not plotting observations with leverage one:
## 945
## Warning: not plotting observations with leverage one:
## 945
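The regression model produces an adjusted R-squared of 0.8805 and an F-statistic with a significant p-value (< 2.2e-16). Several variables, however, do not appear to be significant. TotalBsmtSF and GrLivArea have NA coefficients because of singularities: each is an exact sum of other area variables already in the model.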
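Rather than reading the significant terms off the table by hand, they can be pulled from the coefficient matrix of the summary. This is a minimal sketch on simulated data. The variable names `x1`, `x2`, `x3` and the 0.05 cutoff are assumptions for illustration, and the approach only works as shown when every predictor is numeric, so that coefficient names match column names.

```r
# Simulated data: x1 and x2 drive y, x3 is pure noise (assumption for illustration)
set.seed(2)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 3 * d$x2 + rnorm(n)

fit <- lm(y ~ ., data = d)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients

# Drop the intercept row and keep the terms with p < 0.05
p.values <- coefs[-1, "Pr(>|t|)"]
keep <- names(p.values)[p.values < 0.05]
print(keep)

# Refit using only the retained terms
reduced <- lm(reformulate(keep, response = "y"), data = d)
```

Selecting on p-values from a single fit is a rough filter; a term that looks non-significant next to correlated predictors can become significant once they are removed, and the reverse.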
new.lm <- lm(SalePrice.log~LotFrontage + LotArea + OverallQual + OverallCond + YearBuilt + YearRemodAdd +
BsmtFinSF2 + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath + FullBath + TotRmsAbvGrd +
Fireplaces + GarageYrBlt + GarageCars + WoodDeckSF + EnclosedPorch + ScreenPorch + PoolArea +
YrSold + MSZoning + Street + LotShape + LandSlope + Condition2 + ExterCond + BsmtQual + BsmtCond +
BsmtFinType1 + BsmtFinType2 + HeatingQC + CentralAir + KitchenQual + Functional + PavedDrive + SaleCondition ,
train)
summary(new.lm)
##
## Call:
## lm(formula = SalePrice.log ~ LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + BsmtFinSF2 + X1stFlrSF +
## X2ndFlrSF + LowQualFinSF + BsmtFullBath + FullBath + TotRmsAbvGrd +
## Fireplaces + GarageYrBlt + GarageCars + WoodDeckSF + EnclosedPorch +
## ScreenPorch + PoolArea + YrSold + MSZoning + Street + LotShape +
## LandSlope + Condition2 + ExterCond + BsmtQual + BsmtCond +
## BsmtFinType1 + BsmtFinType2 + HeatingQC + CentralAir + KitchenQual +
## Functional + PavedDrive + SaleCondition, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.82405 -0.06765 0.00430 0.07434 0.61319
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.926e+01 5.673e+00 3.395 0.000705 ***
## LotFrontage 8.707e-05 1.972e-04 0.442 0.658843
## LotArea 1.997e-06 4.579e-07 4.361 1.39e-05 ***
## OverallQual 7.455e-02 4.902e-03 15.207 < 2e-16 ***
## OverallCond 4.068e-02 4.501e-03 9.038 < 2e-16 ***
## YearBuilt 2.242e-03 2.945e-04 7.616 4.76e-14 ***
## YearRemodAdd 7.179e-04 2.879e-04 2.493 0.012773 *
## BsmtFinSF2 9.700e-05 3.076e-05 3.154 0.001644 **
## X1stFlrSF 2.604e-04 1.919e-05 13.566 < 2e-16 ***
## X2ndFlrSF 1.712e-04 1.666e-05 10.278 < 2e-16 ***
## LowQualFinSF 1.656e-04 8.075e-05 2.051 0.040436 *
## BsmtFullBath 5.444e-02 8.579e-03 6.346 2.97e-10 ***
## FullBath 2.215e-02 1.037e-02 2.136 0.032838 *
## TotRmsAbvGrd 1.429e-02 4.250e-03 3.362 0.000794 ***
## Fireplaces 3.498e-02 7.131e-03 4.905 1.04e-06 ***
## GarageYrBlt -6.078e-04 2.773e-04 -2.192 0.028537 *
## GarageCars 7.127e-02 6.942e-03 10.266 < 2e-16 ***
## WoodDeckSF 1.377e-04 3.234e-05 4.259 2.19e-05 ***
## EnclosedPorch 1.754e-04 6.796e-05 2.581 0.009964 **
## ScreenPorch 3.663e-04 6.950e-05 5.271 1.57e-07 ***
## PoolArea -3.358e-04 9.603e-05 -3.497 0.000485 ***
## YrSold -6.930e-03 2.814e-03 -2.463 0.013893 *
## MSZoning -2.304e-02 6.313e-03 -3.650 0.000272 ***
## Street 2.028e-01 6.018e-02 3.370 0.000773 ***
## LotShape -8.841e-03 2.783e-03 -3.177 0.001521 **
## LandSlope 2.655e-02 1.541e-02 1.724 0.085014 .
## Condition2 -4.741e-02 1.440e-02 -3.293 0.001015 **
## ExterCond 8.432e-03 5.426e-03 1.554 0.120377
## BsmtQual -1.351e-02 5.509e-03 -2.453 0.014283 *
## BsmtCond 1.297e-02 5.580e-03 2.325 0.020209 *
## BsmtFinType1 -7.593e-03 2.413e-03 -3.146 0.001687 **
## BsmtFinType2 1.592e-02 4.607e-03 3.455 0.000566 ***
## HeatingQC -9.981e-03 2.572e-03 -3.880 0.000109 ***
## CentralAir 8.385e-02 1.807e-02 4.641 3.78e-06 ***
## KitchenQual -2.762e-02 5.833e-03 -4.736 2.40e-06 ***
## Functional 1.902e-02 4.046e-03 4.702 2.83e-06 ***
## PavedDrive 2.484e-02 8.839e-03 2.810 0.005020 **
## SaleCondition 2.438e-02 3.518e-03 6.928 6.44e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1404 on 1422 degrees of freedom
## Multiple R-squared: 0.8795, Adjusted R-squared: 0.8764
## F-statistic: 280.6 on 37 and 1422 DF, p-value: < 2.2e-16
Eliminating the non-significant variables decreased the adjusted R-squared slightly, from 0.8805 to 0.8764.
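A drop in adjusted R-squared can be backed up with a formal comparison of the two nested models. The sketch below uses `anova()` for a partial F-test on simulated data. The models and variable names are assumptions, and it is not a rerun of the comparison between reg.lm and new.lm.

```r
# Simulated data in which x2 has a real effect on y (assumption for illustration)
set.seed(3)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 3 * d$x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3, data = d)
reduced <- lm(y ~ x1, data = d)  # drops x2 (which matters) and x3 (which does not)

# Partial F-test: do the dropped terms jointly improve the fit?
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared of both models for reference
adj <- c(full    = summary(full)$adj.r.squared,
         reduced = summary(reduced)$adj.r.squared)
print(adj)
```

A small p-value in the second row of the table indicates that the reduced model has lost explanatory power.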
Next, we use the stepAIC() algorithm from the MASS package to select variables automatically.
step.lm <- stepAIC(reg.lm, trace=FALSE)
summary(step.lm)
##
## Call:
## lm(formula = SalePrice.log ~ LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## FullBath + HalfBath + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageYrBlt + GarageCars + WoodDeckSF + EnclosedPorch + X3SsnPorch +
## ScreenPorch + PoolArea + YrSold + MSZoning + Street + LotShape +
## LandContour + LandSlope + Condition2 + BldgType + HouseStyle +
## RoofMatl + Exterior1st + Exterior2nd + ExterCond + Foundation +
## BsmtQual + BsmtCond + BsmtExposure + BsmtFinType1 + BsmtFinType2 +
## HeatingQC + CentralAir + KitchenQual + Functional + GarageType +
## GarageFinish + GarageCond + PavedDrive + SaleCondition, data = d.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75367 -0.06306 0.00364 0.06753 0.55406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.966e+01 5.611e+00 3.505 0.000472 ***
## LotFrontage -4.345e-04 2.133e-04 -2.037 0.041800 *
## LotArea 1.543e-06 4.559e-07 3.385 0.000730 ***
## OverallQual 7.187e-02 4.973e-03 14.452 < 2e-16 ***
## OverallCond 4.014e-02 4.485e-03 8.949 < 2e-16 ***
## YearBuilt 1.903e-03 3.309e-04 5.749 1.10e-08 ***
## YearRemodAdd 7.789e-04 2.879e-04 2.705 0.006904 **
## BsmtFinSF1 5.076e-05 2.131e-05 2.382 0.017369 *
## BsmtFinSF2 1.359e-04 3.356e-05 4.050 5.40e-05 ***
## BsmtUnfSF 3.705e-05 2.110e-05 1.756 0.079373 .
## X1stFlrSF 2.136e-04 2.607e-05 8.196 5.52e-16 ***
## X2ndFlrSF 1.648e-04 1.973e-05 8.357 < 2e-16 ***
## LowQualFinSF 1.723e-04 8.046e-05 2.141 0.032453 *
## BsmtFullBath 5.151e-02 9.993e-03 5.154 2.91e-07 ***
## FullBath 3.539e-02 1.131e-02 3.130 0.001782 **
## HalfBath 1.615e-02 1.085e-02 1.489 0.136823
## KitchenAbvGr -3.630e-02 2.093e-02 -1.734 0.083126 .
## TotRmsAbvGrd 1.587e-02 4.501e-03 3.526 0.000436 ***
## Fireplaces 3.255e-02 7.193e-03 4.525 6.55e-06 ***
## GarageYrBlt -6.903e-04 2.889e-04 -2.389 0.017021 *
## GarageCars 7.302e-02 8.331e-03 8.765 < 2e-16 ***
## WoodDeckSF 1.129e-04 3.222e-05 3.505 0.000471 ***
## EnclosedPorch 1.662e-04 6.705e-05 2.480 0.013272 *
## X3SsnPorch 1.951e-04 1.251e-04 1.559 0.119141
## ScreenPorch 3.214e-04 6.892e-05 4.663 3.42e-06 ***
## PoolArea -2.901e-04 9.513e-05 -3.049 0.002337 **
## YrSold -6.725e-03 2.777e-03 -2.422 0.015569 *
## MSZoning -2.104e-02 6.276e-03 -3.352 0.000824 ***
## Street 1.741e-01 5.958e-02 2.922 0.003529 **
## LotShape -6.832e-03 2.772e-03 -2.464 0.013839 *
## LandContour 1.110e-02 5.775e-03 1.921 0.054877 .
## LandSlope 3.371e-02 1.644e-02 2.051 0.040465 *
## Condition2 -4.441e-02 1.421e-02 -3.126 0.001811 **
## BldgType -1.853e-02 3.865e-03 -4.795 1.80e-06 ***
## HouseStyle -5.611e-03 2.521e-03 -2.225 0.026230 *
## RoofMatl 9.334e-03 6.462e-03 1.445 0.148799
## Exterior1st -3.381e-03 2.231e-03 -1.515 0.129935
## Exterior2nd 2.990e-03 2.011e-03 1.487 0.137317
## ExterCond 9.998e-03 5.342e-03 1.872 0.061452 .
## Foundation 1.238e-02 7.211e-03 1.716 0.086345 .
## BsmtQual -1.534e-02 5.653e-03 -2.713 0.006745 **
## BsmtCond 1.149e-02 5.599e-03 2.052 0.040349 *
## BsmtExposure -5.888e-03 3.703e-03 -1.590 0.112022
## BsmtFinType1 -6.356e-03 2.677e-03 -2.374 0.017718 *
## BsmtFinType2 1.542e-02 4.806e-03 3.208 0.001366 **
## HeatingQC -7.478e-03 2.611e-03 -2.864 0.004241 **
## CentralAir 7.587e-02 1.807e-02 4.198 2.87e-05 ***
## KitchenQual -2.544e-02 5.803e-03 -4.384 1.25e-05 ***
## Functional 1.713e-02 4.051e-03 4.229 2.50e-05 ***
## GarageType -3.795e-03 2.672e-03 -1.420 0.155697
## GarageFinish -9.027e-03 6.081e-03 -1.484 0.137929
## GarageCond 7.841e-03 4.861e-03 1.613 0.106984
## PavedDrive 2.405e-02 8.906e-03 2.701 0.007005 **
## SaleCondition 2.309e-02 3.475e-03 6.645 4.31e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1377 on 1406 degrees of freedom
## Multiple R-squared: 0.8854, Adjusted R-squared: 0.8811
## F-statistic: 205 on 53 and 1406 DF, p-value: < 2.2e-16
dim(reg.lm$model)
## [1] 1460 76
dim(step.lm$model)
## [1] 1460 54
The stepAIC() algorithm eliminated 22 variables from the original model and increased the adjusted R-squared to 0.8811.
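To see which terms a stepwise search removes, the term labels of the two fits can be compared directly. The following is a small self-contained sketch on simulated data; the variables `x1` to `x4` are assumptions. Called without a `scope`, `stepAIC()` runs a backward search.

```r
library(MASS)  # provides stepAIC()

# Simulated data: only x1 and x2 influence y (assumption for illustration)
set.seed(4)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 3 * d$x2 + rnorm(n)

full     <- lm(y ~ ., data = d)
step.fit <- stepAIC(full, trace = FALSE)

# Terms in the full model that the search removed
kept    <- attr(terms(step.fit), "term.labels")
dropped <- setdiff(attr(terms(full), "term.labels"), kept)
print(dropped)

# The search only accepts steps that lower AIC
print(c(full = AIC(full), step = AIC(step.fit)))
```

The same `setdiff()` call on reg.lm and step.lm would list the 22 eliminated variables by name.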
Let’s refit the final model on the variables that stepAIC() retained.
#final model
final.lm <- lm(SalePrice.log ~., data = step.lm$model)
summary(final.lm)
##
## Call:
## lm(formula = SalePrice.log ~ ., data = step.lm$model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75367 -0.06306 0.00364 0.06753 0.55406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.966e+01 5.611e+00 3.505 0.000472 ***
## LotFrontage -4.345e-04 2.133e-04 -2.037 0.041800 *
## LotArea 1.543e-06 4.559e-07 3.385 0.000730 ***
## OverallQual 7.187e-02 4.973e-03 14.452 < 2e-16 ***
## OverallCond 4.014e-02 4.485e-03 8.949 < 2e-16 ***
## YearBuilt 1.903e-03 3.309e-04 5.749 1.10e-08 ***
## YearRemodAdd 7.789e-04 2.879e-04 2.705 0.006904 **
## BsmtFinSF1 5.076e-05 2.131e-05 2.382 0.017369 *
## BsmtFinSF2 1.359e-04 3.356e-05 4.050 5.40e-05 ***
## BsmtUnfSF 3.705e-05 2.110e-05 1.756 0.079373 .
## X1stFlrSF 2.136e-04 2.607e-05 8.196 5.52e-16 ***
## X2ndFlrSF 1.648e-04 1.973e-05 8.357 < 2e-16 ***
## LowQualFinSF 1.723e-04 8.046e-05 2.141 0.032453 *
## BsmtFullBath 5.151e-02 9.993e-03 5.154 2.91e-07 ***
## FullBath 3.539e-02 1.131e-02 3.130 0.001782 **
## HalfBath 1.615e-02 1.085e-02 1.489 0.136823
## KitchenAbvGr -3.630e-02 2.093e-02 -1.734 0.083126 .
## TotRmsAbvGrd 1.587e-02 4.501e-03 3.526 0.000436 ***
## Fireplaces 3.255e-02 7.193e-03 4.525 6.55e-06 ***
## GarageYrBlt -6.903e-04 2.889e-04 -2.389 0.017021 *
## GarageCars 7.302e-02 8.331e-03 8.765 < 2e-16 ***
## WoodDeckSF 1.129e-04 3.222e-05 3.505 0.000471 ***
## EnclosedPorch 1.662e-04 6.705e-05 2.480 0.013272 *
## X3SsnPorch 1.951e-04 1.251e-04 1.559 0.119141
## ScreenPorch 3.214e-04 6.892e-05 4.663 3.42e-06 ***
## PoolArea -2.901e-04 9.513e-05 -3.049 0.002337 **
## YrSold -6.725e-03 2.777e-03 -2.422 0.015569 *
## MSZoning -2.104e-02 6.276e-03 -3.352 0.000824 ***
## Street 1.741e-01 5.958e-02 2.922 0.003529 **
## LotShape -6.832e-03 2.772e-03 -2.464 0.013839 *
## LandContour 1.110e-02 5.775e-03 1.921 0.054877 .
## LandSlope 3.371e-02 1.644e-02 2.051 0.040465 *
## Condition2 -4.441e-02 1.421e-02 -3.126 0.001811 **
## BldgType -1.853e-02 3.865e-03 -4.795 1.80e-06 ***
## HouseStyle -5.611e-03 2.521e-03 -2.225 0.026230 *
## RoofMatl 9.334e-03 6.462e-03 1.445 0.148799
## Exterior1st -3.381e-03 2.231e-03 -1.515 0.129935
## Exterior2nd 2.990e-03 2.011e-03 1.487 0.137317
## ExterCond 9.998e-03 5.342e-03 1.872 0.061452 .
## Foundation 1.238e-02 7.211e-03 1.716 0.086345 .
## BsmtQual -1.534e-02 5.653e-03 -2.713 0.006745 **
## BsmtCond 1.149e-02 5.599e-03 2.052 0.040349 *
## BsmtExposure -5.888e-03 3.703e-03 -1.590 0.112022
## BsmtFinType1 -6.356e-03 2.677e-03 -2.374 0.017718 *
## BsmtFinType2 1.542e-02 4.806e-03 3.208 0.001366 **
## HeatingQC -7.478e-03 2.611e-03 -2.864 0.004241 **
## CentralAir 7.587e-02 1.807e-02 4.198 2.87e-05 ***
## KitchenQual -2.544e-02 5.803e-03 -4.384 1.25e-05 ***
## Functional 1.713e-02 4.051e-03 4.229 2.50e-05 ***
## GarageType -3.795e-03 2.672e-03 -1.420 0.155697
## GarageFinish -9.027e-03 6.081e-03 -1.484 0.137929
## GarageCond 7.841e-03 4.861e-03 1.613 0.106984
## PavedDrive 2.405e-02 8.906e-03 2.701 0.007005 **
## SaleCondition 2.309e-02 3.475e-03 6.645 4.31e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1377 on 1406 degrees of freedom
## Multiple R-squared: 0.8854, Adjusted R-squared: 0.8811
## F-statistic: 205 on 53 and 1406 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(final.lm)
The Residuals vs Fitted plot appears fairly random, with no obvious pattern. The points in the Q-Q plot follow the reference line except toward the tails.
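The residual summary above reports a minimum of -1.75 against a residual standard error of 0.1377, so at least one observation sits far out in the lower tail. One way to find such points is to flag large standardized residuals. This sketch uses simulated data with one planted outlier, and the cutoff of 3 is a common rule of thumb, not a value taken from this analysis.

```r
# Simulated regression data with one planted outlier (assumption for illustration)
set.seed(5)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
y[10] <- y[10] - 8  # observation 10 is pushed far below the line

fit <- lm(y ~ x)

# Standardized residuals; |value| > 3 is a rule-of-thumb flag
r <- rstandard(fit)
flagged <- which(abs(r) > 3)
print(flagged)
```

Flagged rows deserve a look in the raw data before deciding whether to keep them in the fit.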
# Build predictions on the log scale, then back-transform to the price scale
pred <- predict(final.lm, newdata = test)
pred.exp <- exp(pred)  # exp() is vectorized, so sapply() is not needed
#Id-SalePrice
Id <- test$Id
SalePrice <- pred.exp
submission <- data.frame(Id, SalePrice)
head(submission)
## Id SalePrice
## 1 1461 115639.7
## 2 1462 152002.9
## 3 1463 169448.4
## 4 1464 198115.9
## 5 1465 185758.2
## 6 1466 174048.4
write.csv(submission, file = "submission.csv", row.names=FALSE)
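Before uploading, it can help to confirm that the submission has the expected shape and contains no missing or non-positive prices, since `predict()` returns NA for rows with missing predictors. The helper `check_submission` below is hypothetical and written for this sketch. The toy log-scale predictions stand in for the output of `predict(final.lm, newdata = test)`.

```r
# Hypothetical helper: stops with an error if the submission looks malformed
check_submission <- function(sub, expected.ids) {
  stopifnot(
    identical(names(sub), c("Id", "SalePrice")),
    nrow(sub) == length(expected.ids),
    all(sub$Id == expected.ids),
    !anyNA(sub$SalePrice),
    all(sub$SalePrice > 0)
  )
  invisible(TRUE)
}

# Toy log-scale predictions (assumption: stand-ins for real model output)
pred.log <- c(11.66, 11.93, 12.04)
toy <- data.frame(Id = 1461:1463, SalePrice = exp(pred.log))

ok <- check_submission(toy, expected.ids = 1461:1463)
```

In the workflow above, the equivalent call would be `check_submission(submission, test$Id)`.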
First attempt with reg.lm
Second attempt with step.lm