You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.
library(tidyverse)
library(psych)
library(reshape)
library(readxl)
library(scales)
library(corrgram)
library(MASS)
library(knitr)
library(corrplot)
library(fitdistrplus)
library(gmodels)
library(randomForest)
train <- read_csv('https://raw.githubusercontent.com/niteen11/CUNY_DATA_605/master/data_finalexam/kaggle%20data%20set/all/train.csv')
test <- read_csv('https://raw.githubusercontent.com/niteen11/CUNY_DATA_605/master/data_finalexam/kaggle%20data%20set/all/test.csv')
dim(train)
## [1] 1460 81
kable(head(train))
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 60 | RL | 65 | 8450 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NA | Attchd | 2003 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2008 | WD | Normal | 208500 |
2 | 20 | RL | 80 | 9600 | Pave | NA | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 5 | 2007 | WD | Normal | 181500 |
3 | 60 | RL | 68 | 11250 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 9 | 2008 | WD | Normal | 223500 |
4 | 70 | RL | 60 | 9550 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2006 | WD | Abnorml | 140000 |
5 | 60 | RL | 84 | 14260 | Pave | NA | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 12 | 2008 | WD | Normal | 250000 |
6 | 50 | RL | 85 | 14115 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | 1Fam | 1.5Fin | 5 | 5 | 1993 | 1995 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | Wood | Gd | TA | No | GLQ | 732 | Unf | 0 | 64 | 796 | GasA | Ex | Y | SBrkr | 796 | 566 | 0 | 1362 | 1 | 0 | 1 | 1 | 1 | 1 | TA | 5 | Typ | 0 | NA | Attchd | 1993 | Unf | 2 | 480 | TA | TA | Y | 40 | 30 | 0 | 320 | 0 | 0 | NA | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
kable(head(test))
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1461 | 20 | RH | 80 | 11622 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Feedr | Norm | 1Fam | 1Story | 5 | 6 | 1961 | 1961 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | CBlock | TA | TA | No | Rec | 468 | LwQ | 144 | 270 | 882 | GasA | TA | Y | SBrkr | 896 | 0 | 0 | 896 | 0 | 0 | 1 | 0 | 2 | 1 | TA | 5 | Typ | 0 | NA | Attchd | 1961 | Unf | 1 | 730 | TA | TA | Y | 140 | 0 | 0 | 0 | 120 | 0 | NA | MnPrv | NA | 0 | 6 | 2010 | WD | Normal |
1462 | 20 | RL | 81 | 14267 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 6 | 1958 | 1958 | Hip | CompShg | Wd Sdng | Wd Sdng | BrkFace | 108 | TA | TA | CBlock | TA | TA | No | ALQ | 923 | Unf | 0 | 406 | 1329 | GasA | TA | Y | SBrkr | 1329 | 0 | 0 | 1329 | 0 | 0 | 1 | 1 | 3 | 1 | Gd | 6 | Typ | 0 | NA | Attchd | 1958 | Unf | 1 | 312 | TA | TA | Y | 393 | 36 | 0 | 0 | 0 | 0 | NA | NA | Gar2 | 12500 | 6 | 2010 | WD | Normal |
1463 | 60 | RL | 74 | 13830 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 5 | 5 | 1997 | 1998 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | PConc | Gd | TA | No | GLQ | 791 | Unf | 0 | 137 | 928 | GasA | Gd | Y | SBrkr | 928 | 701 | 0 | 1629 | 0 | 0 | 2 | 1 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1997 | Fin | 2 | 482 | TA | TA | Y | 212 | 34 | 0 | 0 | 0 | 0 | NA | MnPrv | NA | 0 | 3 | 2010 | WD | Normal |
1464 | 60 | RL | 78 | 9978 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 6 | 6 | 1998 | 1998 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 20 | TA | TA | PConc | TA | TA | No | GLQ | 602 | Unf | 0 | 324 | 926 | GasA | Ex | Y | SBrkr | 926 | 678 | 0 | 1604 | 0 | 0 | 2 | 1 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Attchd | 1998 | Fin | 2 | 470 | TA | TA | Y | 360 | 36 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 6 | 2010 | WD | Normal |
1465 | 120 | RL | 43 | 5005 | Pave | NA | IR1 | HLS | AllPub | Inside | Gtl | StoneBr | Norm | Norm | TwnhsE | 1Story | 8 | 5 | 1992 | 1992 | Gable | CompShg | HdBoard | HdBoard | None | 0 | Gd | TA | PConc | Gd | TA | No | ALQ | 263 | Unf | 0 | 1017 | 1280 | GasA | Ex | Y | SBrkr | 1280 | 0 | 0 | 1280 | 0 | 0 | 2 | 0 | 2 | 1 | Gd | 5 | Typ | 0 | NA | Attchd | 1992 | RFn | 2 | 506 | TA | TA | Y | 0 | 82 | 0 | 0 | 144 | 0 | NA | NA | NA | 0 | 1 | 2010 | WD | Normal |
1466 | 60 | RL | 75 | 10000 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 6 | 5 | 1993 | 1994 | Gable | CompShg | HdBoard | HdBoard | None | 0 | TA | TA | PConc | Gd | TA | No | Unf | 0 | Unf | 0 | 763 | 763 | GasA | Gd | Y | SBrkr | 763 | 892 | 0 | 1655 | 0 | 0 | 2 | 1 | 3 | 1 | TA | 7 | Typ | 1 | TA | Attchd | 1993 | Fin | 2 | 440 | TA | TA | Y | 157 | 84 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 4 | 2010 | WD | Normal |
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
train %>%
dplyr::select(LotArea, GrLivArea, TotalBsmtSF, SalePrice) %>%
mutate(LotArea = (LotArea),
GrLivArea = (GrLivArea),
TotalBsmtSF = (TotalBsmtSF),
SalePrice = (SalePrice)) %>%
gather('key', 'value') %>%
ggplot(aes(x = value, y = ..density..)) +
geom_histogram(
fill = 'red',
alpha = 0.5,
bins = 15) +
geom_density() +
facet_wrap(~key, scales ='free_x') +
labs(title = 'Histogram PLots')
ggplot(train, aes(train$LotArea, train$SalePrice)) +
geom_point(aes(color=train$TotalBsmtSF)) +
xlab("Lot Area") +
ylab("Sales Price") +
ggtitle("Sales Price vs. Lot Area") +
scale_y_continuous(labels = comma)
cor <- train %>%
dplyr::select(LotArea, GrLivArea, TotalBsmtSF, SalePrice) %>%
cor()
corrplot(cor, method = 'circle')
corrgram(cor, order=TRUE, upper.panel=panel.cor, main= "train")
cor.test(train$GrLivArea, train$SalePrice, method = 'pearson', conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train$GrLivArea and train$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
cor.test(train$TotalBsmtSF, train$SalePrice, method = 'pearson', conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train$TotalBsmtSF and train$SalePrice
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5922142 0.6340846
## sample estimates:
## cor
## 0.6135806
cor.test(train$LotArea, train$SalePrice, method = 'pearson', conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
The p-values for all three variables are less than 0.05. The correlation falls within the 80% CI
The familywise error rate (FWE or FWER) is the probability of a coming to at least one false conclusion in a series of hypothesis tests. In other words, it’s the probability of making at least one Type I Error.
\[1-(1-\alpha )^c\]
1-(.20/3)
## [1] 0.9333333
Now checking corelation for familywise.
cor.test(train$GrLivArea, train$SalePrice, method = 'pearson', conf.level = 0.93)
##
## Pearson's product-moment correlation
##
## data: train$GrLivArea and train$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 93 percent confidence interval:
## 0.6841885 0.7314712
## sample estimates:
## cor
## 0.7086245
cor.test(train$TotalBsmtSF, train$SalePrice, method = 'pearson', conf.level = 0.93)
##
## Pearson's product-moment correlation
##
## data: train$TotalBsmtSF and train$SalePrice
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 93 percent confidence interval:
## 0.5831186 0.6423195
## sample estimates:
## cor
## 0.6135806
cor.test(train$LotArea, train$SalePrice, method = 'pearson', conf.level = 0.93)
##
## Pearson's product-moment correlation
##
## data: train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 93 percent confidence interval:
## 0.219153 0.307429
## sample estimates:
## cor
## 0.2638434
. Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
kable(cor)
LotArea | GrLivArea | TotalBsmtSF | SalePrice | |
---|---|---|---|---|
LotArea | 1.0000000 | 0.2631162 | 0.2608331 | 0.2638434 |
GrLivArea | 0.2631162 | 1.0000000 | 0.4548682 | 0.7086245 |
TotalBsmtSF | 0.2608331 | 0.4548682 | 1.0000000 | 0.6135806 |
SalePrice | 0.2638434 | 0.7086245 | 0.6135806 | 1.0000000 |
precision_matrix <- solve(cor)
kable(precision_matrix)
LotArea | GrLivArea | TotalBsmtSF | SalePrice | |
---|---|---|---|---|
LotArea | 1.1062218 | -0.1623394 | -0.1703170 | -0.0723285 |
GrLivArea | -0.1623394 | 2.0350650 | -0.0397442 | -1.3748784 |
TotalBsmtSF | -0.1703170 | -0.0397442 | 1.6321069 | -0.9283283 |
SalePrice | -0.0723285 | -1.3748784 | -0.9283283 | 2.5629601 |
kable(cor %*% precision_matrix)
LotArea | GrLivArea | TotalBsmtSF | SalePrice | |
---|---|---|---|---|
LotArea | 1 | 0 | 0 | 0 |
GrLivArea | 0 | 1 | 0 | 0 |
TotalBsmtSF | 0 | 0 | 1 | 0 |
SalePrice | 0 | 0 | 0 | 1 |
kable(precision_matrix%*% cor)
LotArea | GrLivArea | TotalBsmtSF | SalePrice | |
---|---|---|---|---|
LotArea | 1 | 0 | 0 | 0 |
GrLivArea | 0 | 1 | 0 | 0 |
TotalBsmtSF | 0 | 0 | 1 | 0 |
SalePrice | 0 | 0 | 0 | 1 |
Other Approach
mat <- train %>%
dplyr::select(GrLivArea, TotalBsmtSF, SalePrice) %>%
cor() %>%
round(4)
kable(mat)
GrLivArea | TotalBsmtSF | SalePrice | |
---|---|---|---|
GrLivArea | 1.0000 | 0.4549 | 0.7086 |
TotalBsmtSF | 0.4549 | 1.0000 | 0.6136 |
SalePrice | 0.7086 | 0.6136 | 1.0000 |
prec.mat <- solve(mat) %>%
round(4)
kable(prec.mat)
GrLivArea | TotalBsmtSF | SalePrice | |
---|---|---|---|
GrLivArea | 2.0111 | -0.0648 | -1.3853 |
TotalBsmtSF | -0.0648 | 1.6060 | -0.9395 |
SalePrice | -1.3853 | -0.9395 | 2.5581 |
kable(prec.mat %*% mat %>%
round(4))
GrLivArea | TotalBsmtSF | SalePrice | |
---|---|---|---|
GrLivArea | 1 | 0 | 0 |
TotalBsmtSF | 0 | 1 | 0 |
SalePrice | 0 | 0 | 1 |
kable(mat %*% prec.mat %>%
round(4))
GrLivArea | TotalBsmtSF | SalePrice | |
---|---|---|---|
GrLivArea | 1 | 0 | 0 |
TotalBsmtSF | 0 | 1 | 0 |
SalePrice | 0 | 0 | 1 |
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., \(rexp(1000,\lambda)\) )
par(mfrow=c(1,2))
hist(train$TotalBsmtSF, main="Histogram of TotalBsmtSF")
boxplot(train$TotalBsmtSF, main="Boxplot of TotalBsmtSF")
lambda <- fitdistr(train$TotalBsmtSF, densfun='exponential')
lambda$estimate
## rate
## 0.0009456896
set.seed(23)
pdf.dist <- rexp(1000,lambda$estimate)
hist(pdf.dist, freq = FALSE, breaks = 100,
main ="Fitted Exponential PDF with TotalBsmtSFTotal ",
xlab = "PDF TotalBsmtSFTotal",
xlim = c(1, quantile(pdf.dist, 0.99)))
curve(dexp(x, rate = lambda$estimate), col = "red", add = TRUE)
Plot a histogram and compare it with a histogram of your original variable.
set.seed(23)
samp.TotalBsmtSF <- sample(train$TotalBsmtSF, 1000, replace=TRUE, prob=NULL)
exp.train <- data_frame(Expo=rexp(1000, lambda$estimate)) %>%
mutate(TotalBsmtSF = samp.TotalBsmtSF)
plotdist(exp.train$TotalBsmtSF, histo = TRUE, demp = TRUE)
plotdist(exp.train$Expo, histo = TRUE, demp = TRUE)
hist(train$TotalBsmtSF, freq = FALSE, breaks = 100,
main ="Comparison with original TotalBsmtSF",
xlab="Original TotalBsmtSF",
xlim = c(1, quantile(train$TotalBsmtSF, 0.99)))
curve(dexp(x, rate = lambda$estimate), col = "red", add = TRUE)
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
pdf_5th <- qexp(0.05, rate = lambda$estimate, lower.tail = TRUE, log.p = FALSE)
pdf_5th <- round(pdf_5th, 4)
pdf_95th <- qexp(0.95, rate = lambda$estimate, lower.tail = TRUE, log.p = FALSE)
pdf_95th <- round(pdf_95th, 4)
Also generate a 95% confidence interval from the empirical data, assuming normality.
ci(train$TotalBsmtSF, 0.95)
## Estimate CI lower CI upper Std. Error
## 1057.42945 1034.90755 1079.95135 11.48144
Finally, provide the empirical 5th percentile and 95th percentile of the data.
quantile(train$TotalBsmtSF, c(.05, .95))
## 5% 95%
## 519.3 1753.0
Cleaning Train dataset
names(train) <- make.names(names(train))
features <- setdiff(colnames(train), c("Id", "SalePrice"))
for (i in features) {
if (any(is.na(train[[i]])))
if (is.character(train[[i]])){
train[[i]][is.na(train[[i]])] <- "Others"
}else{
train[[i]][is.na(train[[i]])] <- -999
}
}
col_class <- lapply(train,class)
col_class <- col_class[col_class != "factor"]
factor_levels <- lapply(train, nlevels)
factor_levels <- factor_levels[factor_levels > 1]
train <- train[,names(train) %in% c(names(factor_levels), names(col_class))]
train <- as.data.frame(unclass(train))
ftrain<-select_if(train, is.numeric)
Clean Test Dataset
names(test) <- make.names(names(test))
features <- setdiff(colnames(test), c("Id", "SalePrice"))
for (j in features) {
if (any(is.na(test[[j]])))
if (is.character(test[[j]])){
test[[j]][is.na(test[[j]])] <- "Others"
}else{
test[[j]][is.na(test[[j]])] <- -999
}
}
col_class <- lapply(test,class)
col_class <- col_class[col_class != "factor"]
factor_levels <- lapply(test, nlevels)
factor_levels <- factor_levels[factor_levels > 1]
train <- test[,names(test) %in% c(names(factor_levels), names(col_class))]
test <- as.data.frame(unclass(test))
ftest<-select_if(test, is.numeric)
model1 <- lm(SalePrice ~ . , data=ftrain)
summary(model1)
##
## Call:
## lm(formula = SalePrice ~ ., data = ftrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -475262 -16290 -2217 14029 295097
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.143e+05 1.403e+06 0.295 0.767771
## Id -1.048e+00 2.171e+00 -0.483 0.629203
## MSSubClass -1.687e+02 2.608e+01 -6.468 1.36e-10 ***
## LotFrontage 3.109e+00 2.299e+00 1.352 0.176463
## LotArea 4.024e-01 1.003e-01 4.013 6.30e-05 ***
## OverallQual 1.736e+04 1.181e+03 14.697 < 2e-16 ***
## OverallCond 5.153e+03 1.025e+03 5.027 5.61e-07 ***
## YearBuilt 3.497e+02 6.067e+01 5.764 1.01e-08 ***
## YearRemodAdd 1.115e+02 6.619e+01 1.684 0.092412 .
## MasVnrArea 2.150e+01 5.193e+00 4.141 3.66e-05 ***
## BsmtFinSF1 1.900e+01 4.628e+00 4.105 4.27e-05 ***
## BsmtFinSF2 8.632e+00 7.011e+00 1.231 0.218432
## BsmtUnfSF 8.368e+00 4.176e+00 2.004 0.045279 *
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 4.812e+01 5.731e+00 8.397 < 2e-16 ***
## X2ndFlrSF 4.889e+01 4.917e+00 9.943 < 2e-16 ***
## LowQualFinSF 1.731e+01 1.970e+01 0.879 0.379726
## GrLivArea NA NA NA NA
## BsmtFullBath 8.486e+03 2.598e+03 3.267 0.001114 **
## BsmtHalfBath 1.862e+03 4.058e+03 0.459 0.646417
## FullBath 3.219e+03 2.803e+03 1.148 0.250967
## HalfBath -1.563e+03 2.644e+03 -0.591 0.554423
## BedroomAbvGr -1.021e+04 1.683e+03 -6.066 1.68e-09 ***
## KitchenAbvGr -1.543e+04 5.200e+03 -2.968 0.003048 **
## TotRmsAbvGrd 4.836e+03 1.230e+03 3.932 8.84e-05 ***
## Fireplaces 4.314e+03 1.766e+03 2.443 0.014694 *
## GarageYrBlt -9.649e+00 1.773e+00 -5.441 6.23e-08 ***
## GarageCars 1.568e+04 2.978e+03 5.266 1.61e-07 ***
## GarageArea 5.190e+00 9.707e+00 0.535 0.592957
## WoodDeckSF 2.568e+01 7.923e+00 3.242 0.001215 **
## OpenPorchSF -5.835e+00 1.509e+01 -0.387 0.698981
## EnclosedPorch 1.222e+01 1.674e+01 0.730 0.465505
## X3SsnPorch 2.017e+01 3.118e+01 0.647 0.517861
## ScreenPorch 5.712e+01 1.707e+01 3.347 0.000837 ***
## PoolArea -3.455e+01 2.346e+01 -1.473 0.140989
## MiscVal -3.759e-01 1.847e+00 -0.204 0.838748
## MoSold -5.100e+01 3.424e+02 -0.149 0.881631
## YrSold -6.839e+02 6.974e+02 -0.981 0.326921
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34500 on 1424 degrees of freedom
## Multiple R-squared: 0.816, Adjusted R-squared: 0.8114
## F-statistic: 180.4 on 35 and 1424 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model1)
sigvars <- data.frame(summary(model1)$coef[summary(model1)$coef[,4] <= .05, 4])
sigvars <- add_rownames(sigvars, "vars")
colist<-dplyr::pull(sigvars, vars)
matchid <- match(colist, names(ftrain))
train2 <- cbind(ftrain[,matchid], ftrain['SalePrice'])
model2<-lm(SalePrice ~ ., data=train2)
summary(model2)
##
## Call:
## lm(formula = SalePrice ~ ., data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -485362 -16106 -2176 14198 277137
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.060e+05 8.784e+04 -9.176 < 2e-16 ***
## MSSubClass -1.648e+02 2.561e+01 -6.436 1.67e-10 ***
## LotArea 4.076e-01 9.914e-02 4.111 4.16e-05 ***
## OverallQual 1.821e+04 1.146e+03 15.884 < 2e-16 ***
## OverallCond 5.692e+03 9.195e+02 6.190 7.84e-10 ***
## YearBuilt 3.809e+02 4.455e+01 8.550 < 2e-16 ***
## MasVnrArea 2.041e+01 5.089e+00 4.010 6.39e-05 ***
## BsmtFinSF1 1.559e+01 3.862e+00 4.036 5.71e-05 ***
## BsmtUnfSF 6.304e+00 3.604e+00 1.749 0.080498 .
## X1stFlrSF 5.211e+01 5.039e+00 10.341 < 2e-16 ***
## X2ndFlrSF 4.860e+01 4.029e+00 12.062 < 2e-16 ***
## BsmtFullBath 8.699e+03 2.379e+03 3.657 0.000264 ***
## BedroomAbvGr -1.037e+04 1.629e+03 -6.363 2.65e-10 ***
## KitchenAbvGr -1.610e+04 5.049e+03 -3.188 0.001463 **
## TotRmsAbvGrd 5.168e+03 1.200e+03 4.307 1.76e-05 ***
## Fireplaces 3.302e+03 1.718e+03 1.922 0.054783 .
## GarageYrBlt -1.041e+01 1.716e+00 -6.067 1.66e-09 ***
## GarageCars 1.766e+04 2.031e+03 8.696 < 2e-16 ***
## WoodDeckSF 2.517e+01 7.819e+00 3.219 0.001317 **
## ScreenPorch 5.360e+01 1.674e+01 3.203 0.001391 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34490 on 1440 degrees of freedom
## Multiple R-squared: 0.814, Adjusted R-squared: 0.8115
## F-statistic: 331.6 on 19 and 1440 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model2)
A random forest, or random decision forests, are an “ensemble learning method for classification, regression, and other tasks.” To use the random forest method in R, the randomForest package must be loaded to create and analyze random forests
y_train <- ftrain$SalePrice
x_train_df <- subset(ftrain, select = -SalePrice)
model3 <- randomForest(x_train_df, y = y_train, ntree = 500, importance = T)
plot(model3)
pred1<-predict(model1, ftest)
kaggle <- as.data.frame(cbind(ftest$Id, pred1))
colnames(kaggle) <- c("Id", "SalePrice")
#write.csv(kaggle, file = "Kaggle_housePrice_Submission1.csv", quote=FALSE, row.names=FALSE)
pred2<-predict(model2, ftest)
#export data for Kaggle
kaggle <- as.data.frame(cbind(ftest$Id, pred2))
colnames(kaggle) <- c("Id", "SalePrice")
#write.csv(kaggle, file = "Kaggle_housePrice_Submission2.csv", quote=FALSE, row.names=FALSE)
pred3<-predict(model3, ftest)
kaggle <- as.data.frame(cbind(ftest$Id, pred3))
colnames(kaggle) <- c("Id", "SalePrice")
#write.csv(kaggle, file = "Kaggle_housePrice_Submission3.csv", quote=FALSE, row.names=FALSE)
Out of the three models created, (Baseline LM, LM with highly significant variables, and RF), the Random Forest made the best predictions.