Compete in the Kaggle.com Advanced Regression Techniques competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques.
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Pick SalePrice as the dependent variable, and define it as Y for the analysis that follows.
Probability. Calculate, at a minimum, the probabilities a through c below. Assume the small letter “x” is estimated as the 4th quartile (this is correct) of the X variable, and the small letter “y” is estimated as the 2nd quartile of the Y variable. Interpret the meaning of all probabilities.
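All of the probabilities below are estimated as empirical proportions of the 1,460 training observations, and part a uses the definition of conditional probability: P(X > x | Y > y) = P(X > x and Y > y) / P(Y > y).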
library(ggplot2)
library(caret)
library(AppliedPredictiveModeling)
library(corrplot)
library(dplyr)
library(forecast)
train <- read.csv('train.csv', stringsAsFactors = FALSE)
test <- read.csv('test.csv', stringsAsFactors = FALSE)
testId <- test$Id
test$SalePrice <- 0
dim(train)
## [1] 1460 81
Y <- train$SalePrice
X <- train$LotArea
# p.x is the 4th element of quantile(X), i.e. the 75th percentile of LotArea;
# p.y is the 2nd element of quantile(Y), i.e. the 25th percentile of SalePrice
p.x <- quantile(X)[4]
p.y <- quantile(Y)[2]
# a. P(X>x | Y>y)
# numerator: P(X > x and Y > y); denominator: P(Y > y)
px <- train %>% filter(LotArea > p.x & SalePrice > p.y) %>%
  tally()/nrow(train)
py <- train %>% filter(SalePrice > p.y) %>%
  tally()/nrow(train)
(px/py)
## n
## 1 0.3123288
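Interpretation: given that a house sold for more than y (p.y), there is roughly a 31% chance that its lot area exceeds x (p.x).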
# b. P(X>x, Y>y)
# the joint probability, estimated directly from the training data
pxy <- train %>% filter(LotArea > p.x & SalePrice > p.y) %>%
  tally()/nrow(train)
(pxy)
## n
## 1 0.2342466
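Interpretation: about 23% of all houses in the training set have both a lot area above x and a sale price above y.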
# c. P(X<x | Y>y)
px <- train %>% filter(LotArea < p.x & SalePrice > p.y) %>%
  tally()/nrow(train)
py <- train %>% filter(SalePrice > p.y) %>%
  tally()/nrow(train)
(px/py)
## n
## 1 0.6876712
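Interpretation: given that a house sold for more than y, there is roughly a 69% chance that its lot area is below x; this is the complement of the probability in part a.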
Does splitting the training data in this fashion make them independent? In other words, does P(XY)=P(X)P(Y) or does P(X|Y) = P(X)? Check mathematically, and then evaluate by running a Chi Square test for association. You might have to research this. A Chi Square test for independence (association) will require you to bin the data into logical groups.
Below I will check if P(X|Y) = P(X)
# use example a. P(X>x | Y>y) as P(X|Y)
Y <- train$SalePrice
X <- train$LotArea
p.x <- quantile(X)[4]
p.y <- quantile(Y)[2]
px <- train %>% filter(LotArea > p.x & SalePrice > p.y) %>%
tally()/nrow(train)
py <- train%>% filter(SalePrice > p.y) %>%
tally()/nrow(train)
p.xy <- px/py
# P(X>x), the unconditional probability
px <- train %>% filter(LotArea > p.x) %>%
tally()/nrow(train)
py <- train %>%
tally()/nrow(train)
p.x <- px/py
(p.xy == p.x)
## n
## [1,] FALSE
P(X|Y) = P(X) would require P(X > x | Y > y) = P(X > x) (and likewise for every other split of Y). Here P(X > x | Y > y) ≈ 0.312 while P(X > x) = 0.25, so the equality fails, and splitting the training data in this fashion does not make the variables independent.
# chi square test
chi.test <- table(train$SalePrice, train$LotArea)
chisq.test(chi.test)
## Warning in chisq.test(chi.test): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: chi.test
## X-squared = 735090, df = 709660, p-value < 2.2e-16
Based on the Chi-square test result (p-value < .05), the null hypothesis that there is no association between SalePrice and LotArea is rejected.
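Note, however, the warning above: the raw SalePrice and LotArea values produce a very large, sparse contingency table in which almost every cell holds a count of 0 or 1, so the chi-squared approximation is not reliable. A version of the test that first bins the variables into logical groups, as the question suggests, is sketched below (this sketch is not part of the original output, and the object names lot.bin and price.bin are illustrative).
# bin both variables at their quartiles, then test for association
lot.bin <- cut(train$LotArea,
               breaks = quantile(train$LotArea, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE)
price.bin <- cut(train$SalePrice,
                 breaks = quantile(train$SalePrice, probs = seq(0, 1, 0.25)),
                 include.lowest = TRUE)
chisq.test(table(lot.bin, price.bin))
A small p-value from this binned test would lead to the same conclusion: LotArea and SalePrice are associated.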
Provide univariate descriptive statistics and appropriate plots for both variables. Provide a scatterplot of X and Y. Transform both variables simultaneously using Box-Cox transformations. You might have to research this. Using the transformed variables, run a correlation analysis and interpret. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
# SalePrice
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
# LotArea
summary(train$LotArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
comb <- data.frame(SalePrice = train$SalePrice, LotArea = train$LotArea)
plot(train$LotArea, train$SalePrice, xlab = "LotArea", ylab = "SalePrice")
yx.lm <- lm(SalePrice ~ LotArea, data = train)
abline(yx.lm, col = "red")
# medians of Y and X
salePrice.median <-median(Y)
lotArea.median <-median(X)
# Create density plots for the SalePrice and LotArea variables, with the median marked.
# density of SalePrice
ggplot(train, aes(x = SalePrice)) +
geom_vline(xintercept = salePrice.median, col = "green", lwd = 1) +
geom_density(adjust = 5, col = 'darkblue') +
labs(title="Density and Median for SalePrice") +
labs(x="SalePrice", y="")
# density of X
ggplot(train, aes(x = LotArea)) +
geom_vline(xintercept = lotArea.median, col = "green", lwd = 1) +
geom_density(adjust = 5, col = 'darkblue') +
labs(title="Density and Median for LotArea") +
labs(x="LotArea", y="")
# transform variables using Box-Cox transformations
SalePrice.lam <- BoxCox.lambda(train$SalePrice)
trans.SalePrice <- BoxCox(train$SalePrice, SalePrice.lam)
hist(trans.SalePrice)
LotArea.lam <- BoxCox.lambda(train$LotArea)
trans.LotArea <- BoxCox(train$LotArea, LotArea.lam)
hist(trans.LotArea)
Hypothesis test using the transformed variables LotArea and SalePrice to show the relationship between them:
(cor(cbind(trans.LotArea, trans.SalePrice)))
## trans.LotArea trans.SalePrice
## trans.LotArea 1.0000000 0.3893308
## trans.SalePrice 0.3893308 1.0000000
(cor.test(trans.LotArea, trans.SalePrice, method = "pearson", conf.level = .99))
##
## Pearson's product-moment correlation
##
## data: trans.LotArea and trans.SalePrice
## t = 16.14, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.3306244 0.4450358
## sample estimates:
## cor
## 0.3893308
(cor.test(train$LotArea, train$SalePrice, method = "pearson", conf.level = .99))
##
## Pearson's product-moment correlation
##
## data: train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.2000196 0.3254375
## sample estimates:
## cor
## 0.2638434
Both correlation tests, before and after the transformation, show a statistically significant positive relationship between LotArea and SalePrice at the 99% confidence level (in each case the null hypothesis of zero correlation is rejected, p < 2.2e-16), and the correlation is stronger after the Box-Cox transformation (0.389 vs. 0.264).
Invert your correlation matrix from the previous section. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
(matr <- cor(cbind(trans.LotArea, trans.SalePrice)))
## trans.LotArea trans.SalePrice
## trans.LotArea 1.0000000 0.3893308
## trans.SalePrice 0.3893308 1.0000000
(invM <- solve(matr))
## trans.LotArea trans.SalePrice
## trans.LotArea 1.1786593 -0.4588883
## trans.SalePrice -0.4588883 1.1786593
(matr %*% invM)
## trans.LotArea trans.SalePrice
## trans.LotArea 1.000000e+00 0
## trans.SalePrice 5.551115e-17 1
(invM %*% matr)
## trans.LotArea trans.SalePrice
## trans.LotArea 1.000000e+00 0
## trans.SalePrice 5.551115e-17 1
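As expected for a matrix multiplied by its inverse, both products return the 2 × 2 identity matrix (the off-diagonal entries of about 5.6e-17 are floating-point zero). Note also that for a 2 × 2 correlation matrix the diagonal of the precision matrix is 1 / (1 − r²) = 1 / (1 − 0.3893²) ≈ 1.179, the variance inflation factor for each variable.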
Many times, it makes sense to fit a closed form distribution to data. For your non-transformed independent variable (X), location shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit a density function of your choice. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html.) Find the optimal value of the parameters for this distribution, and then take 1000 samples from this distribution (e.g., rexp(1000, λ) for an exponential). Plot a histogram and compare it with a histogram of your non-transformed original variable.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# X is already above zero, so no location shift is necessary.
(min(X) > 0)
## [1] TRUE
op <- options(digits = 3)
set.seed(123)
# fit an exponential distribution by maximum likelihood;
# the estimated rate is 1/mean(LotArea)
dstr <- fitdistr(train$LotArea, "exponential")
lamb <- dstr$estimate
# draw 1000 samples from the fitted exponential
lotArea.opt <- rexp(1000, lamb)
Histogram of the 1,000 simulated draws from the fitted exponential:
hist(lotArea.opt)
Histogram of the original, non-transformed LotArea variable:
hist(train$LotArea)
The simulated exponential sample is still right-skewed, as an exponential distribution must be, but its right tail is far less extreme than that of the original LotArea, which contains a handful of very large lots.
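As an additional visual comparison (a sketch that is not part of the original output), the fitted exponential density can be drawn over a density-scaled histogram of the original variable, using the rate lamb estimated by fitdistr above.
# overlay the fitted exponential density on the original LotArea histogram
hist(train$LotArea, breaks = 50, freq = FALSE,
     main = "LotArea with fitted exponential density", xlab = "LotArea")
curve(dexp(x, rate = lamb), from = 0, to = max(train$LotArea),
      add = TRUE, col = "red", lwd = 2)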
# set up train and test datasets
train$IsTrainSet <- TRUE
test$IsTrainSet <- FALSE
df <- rbind(train, test)
# count missing values in each column
Num_NA <- sapply(df, function(y)
length(which(is.na(y) == T)))
NA_Count <- data.frame(Item = colnames(df), Count = Num_NA)
col_rm <- NA_Count$Count > 1500
# remove variables with more than 1500 missing values
df <- df[, !col_rm]
# subset the numeric variables
isnum <- sapply(df, is.numeric)
dfnum <- df[, isnum]
dim(dfnum)
## [1] 2919 38
# convert factor values into numbers
for(i in 1:76){
if(is.factor(df[,i])){
df[,i]<-as.integer(df[,i])
}
}
# replace missing value with 0
df[is.na(df)] <- 0
dfnum[is.na(dfnum)] <- 0
# Descriptive Analysis, identify correlated predictors
descrCor <- cor(dfnum, use = "pairwise.complete.obs")
# corrplot
corrplot(
descrCor,
method = "circle",
type = "lower",
sig.level = 0.01,
insig = "blank"
)
# remove descriptors with absolute correlations > .75
(highlyCorDescr <- findCorrelation(descrCor, cutoff = .75))
## [1] 17 14 27
# SalePrice is kept as the dependent variable; drop column 4 and all but the first of the highly correlated columns
dfnum <- dfnum[, -c(4, highlyCorDescr[-1])]
descrCor2 <- cor(dfnum)
summary(descrCor2[upper.tri(descrCor2)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.740 -0.019 0.023 0.069 0.130 0.808
# corrplot
corrplot(
descrCor2,
method = "circle",
type = "lower",
sig.level = 0.01,
insig = "blank"
)
# plot vars that have strong relationships with SalePrice
pairs(~SalePrice+TotalBsmtSF+GarageArea+BsmtFullBath+TotRmsAbvGrd+BedroomAbvGr,data=dfnum,
main="Scatterplot Matrix")
# remove linear dependencies
lincomb = findLinearCombos(dfnum)
lapply(lincomb$linearCombos, function(x) colnames(dfnum)[x])
## [[1]]
## [1] "TotalBsmtSF" "BsmtFinSF1" "BsmtFinSF2" "BsmtUnfSF"
dfnum <- dfnum[-lincomb$remove]
dim(dfnum)
## [1] 2919 34
# categorical casting
dfchar <- df[, !isnum]
dim(dfchar)
## [1] 2919 40
dfchar[] <- lapply(dfchar, factor)
# Boruta Feature Selection
set.seed(121)
if (!require('Boruta')) install.packages('Boruta')
## Loading required package: Boruta
## Loading required package: ranger
library(Boruta)
sel.var <- cbind(dfnum[, -c(1)], dfchar)
train <- sel.var[sel.var$IsTrainSet == TRUE, ]
test <- sel.var[sel.var$IsTrainSet == FALSE, ]
# drop any rows with missing values before running Boruta
hvo <- na.omit(train)
bor.results <- Boruta(hvo[, -37],
hvo$SalePrice,
# maxRuns=101,
doTrace=0)
boruta.train <- TentativeRoughFix(bor.results)
plot(boruta.train, xlab = "", xaxt = "n")
lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)
boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])
names(lz) <- colnames(boruta.train$ImpHistory)
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels), at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)
keepMe <- getSelectedAttributes(boruta.train, withTentative = F)
myvars_train <- names(train) %in% keepMe
myvars_test <- names(test) %in% keepMe
train <- train[myvars_train]
test <- test[myvars_test]
test$Id <- testId
# predict
# variable Exterior1st has different levels in test and train data set, so use train data set
# remove Id and IsTrainSet
train[] <- sapply(train, as.numeric)
lm_model <- lm(SalePrice ~ ., data = train)
summary(lm_model)
##
## Call:
## lm(formula = SalePrice ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -489752 -14139 -1583 13293 270617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.70e+05 1.63e+05 -2.87 0.0041 **
## MSSubClass -1.10e+02 4.61e+01 -2.38 0.0175 *
## LotFrontage -1.74e+01 2.85e+01 -0.61 0.5406
## OverallQual 1.33e+04 1.19e+03 11.18 < 2e-16 ***
## OverallCond 5.29e+03 1.03e+03 5.11 3.7e-07 ***
## YearBuilt 2.27e+02 7.29e+01 3.11 0.0019 **
## YearRemodAdd 4.20e+01 6.80e+01 0.62 0.5368
## MasVnrArea 3.01e+01 6.15e+00 4.89 1.1e-06 ***
## BsmtFinSF1 1.14e+01 4.69e+00 2.43 0.0152 *
## BsmtUnfSF 6.57e+00 4.67e+00 1.41 0.1600
## X2ndFlrSF 2.60e+00 5.51e+00 0.47 0.6374
## GrLivArea 4.21e+01 5.52e+00 7.63 4.2e-14 ***
## BsmtFullBath 1.03e+04 2.34e+03 4.40 1.1e-05 ***
## FullBath 5.96e+03 2.74e+03 2.18 0.0295 *
## HalfBath 2.20e+03 2.60e+03 0.85 0.3966
## BedroomAbvGr -4.29e+03 1.70e+03 -2.52 0.0117 *
## KitchenAbvGr -1.53e+04 5.10e+03 -3.01 0.0027 **
## TotRmsAbvGrd 3.19e+03 1.19e+03 2.68 0.0074 **
## Fireplaces 1.17e+04 2.66e+03 4.38 1.3e-05 ***
## GarageYrBlt 1.36e-01 5.93e+00 0.02 0.9817
## GarageArea 3.27e+01 6.69e+00 4.89 1.2e-06 ***
## WoodDeckSF 2.08e+01 7.54e+00 2.76 0.0059 **
## OpenPorchSF -1.21e+01 1.46e+01 -0.83 0.4049
## MSZoning -1.01e+03 1.53e+03 -0.66 0.5110
## LotShape -1.33e+03 6.81e+02 -1.96 0.0504 .
## Neighborhood 2.51e+02 1.61e+02 1.56 0.1190
## Condition1 -2.10e+02 1.03e+03 -0.20 0.8393
## BldgType -2.26e+03 1.51e+03 -1.50 0.1336
## HouseStyle -1.42e+03 6.58e+02 -2.15 0.0315 *
## RoofStyle 2.35e+03 1.15e+03 2.05 0.0409 *
## Exterior1st -7.77e+02 5.35e+02 -1.45 0.1465
## Exterior2nd 1.77e+02 4.85e+02 0.37 0.7150
## MasVnrType 3.35e+03 1.52e+03 2.21 0.0275 *
## ExterQual -8.16e+03 1.98e+03 -4.12 4.1e-05 ***
## Foundation -4.45e+02 1.71e+03 -0.26 0.7955
## BsmtQual -6.58e+03 1.33e+03 -4.94 9.0e-07 ***
## BsmtCond 4.19e+03 1.28e+03 3.27 0.0011 **
## BsmtExposure -3.48e+03 8.86e+02 -3.93 9.0e-05 ***
## BsmtFinType1 -6.07e+02 6.35e+02 -0.96 0.3393
## BsmtFinType2 1.14e+03 1.00e+03 1.14 0.2562
## HeatingQC -4.63e+02 6.27e+02 -0.74 0.4599
## CentralAir 2.33e+03 4.37e+03 0.53 0.5947
## Electrical -4.35e+02 9.33e+02 -0.47 0.6411
## KitchenQual -7.90e+03 1.49e+03 -5.30 1.3e-07 ***
## FireplaceQu -1.84e+03 8.09e+02 -2.28 0.0230 *
## GarageType 1.27e+01 6.53e+02 0.02 0.9845
## GarageFinish -1.80e+03 1.52e+03 -1.18 0.2371
## GarageQual -6.77e+02 1.82e+03 -0.37 0.7100
## GarageCond 4.45e+01 2.06e+03 0.02 0.9828
## PavedDrive 2.75e+03 2.13e+03 1.29 0.1969
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32900 on 1410 degrees of freedom
## Multiple R-squared: 0.834, Adjusted R-squared: 0.828
## F-statistic: 144 on 49 and 1410 DF, p-value: <2e-16
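The fitted linear model explains roughly 83% of the variance in SalePrice (adjusted R-squared ≈ 0.83), with a residual standard error of about $32,900; OverallQual, GrLivArea, GarageArea, KitchenQual, and BsmtQual are among the most significant predictors.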
aov(lm_model)
## Call:
## aov(formula = lm_model)
##
## Terms:
## MSSubClass LotFrontage OverallQual OverallCond YearBuilt
## Sum of Squares 6.54e+10 3.54e+11 5.47e+12 7.37e+08 8.37e+10
## Deg. of Freedom 1 1 1 1 1
## YearRemodAdd MasVnrArea BsmtFinSF1 BsmtUnfSF X2ndFlrSF
## Sum of Squares 2.30e+10 2.46e+11 2.40e+11 4.95e+10 3.57e+11
## Deg. of Freedom 1 1 1 1 1
## GrLivArea BsmtFullBath FullBath HalfBath BedroomAbvGr
## Sum of Squares 3.68e+11 2.64e+10 9.54e+08 4.82e+07 4.17e+10
## Deg. of Freedom 1 1 1 1 1
## KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt
## Sum of Squares 4.88e+09 2.43e+10 1.55e+10 1.34e+08
## Deg. of Freedom 1 1 1 1
## GarageArea WoodDeckSF OpenPorchSF MSZoning LotShape
## Sum of Squares 4.98e+10 1.34e+10 2.88e+08 1.73e+09 6.12e+09
## Deg. of Freedom 1 1 1 1 1
## Neighborhood Condition1 BldgType HouseStyle RoofStyle
## Sum of Squares 2.83e+09 9.15e+07 1.03e+09 6.63e+09 6.14e+09
## Deg. of Freedom 1 1 1 1 1
## Exterior1st Exterior2nd MasVnrType ExterQual Foundation
## Sum of Squares 4.84e+09 4.57e+08 2.20e+10 7.03e+10 1.57e+09
## Deg. of Freedom 1 1 1 1 1
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2
## Sum of Squares 3.84e+10 1.20e+10 1.77e+10 6.03e+08 1.40e+09
## Deg. of Freedom 1 1 1 1 1
## HeatingQC CentralAir Electrical KitchenQual FireplaceQu
## Sum of Squares 1.90e+09 6.30e+06 9.84e+08 3.39e+10 5.20e+09
## Deg. of Freedom 1 1 1 1 1
## GarageType GarageFinish GarageQual GarageCond PavedDrive
## Sum of Squares 6.14e+07 1.37e+09 1.34e+08 1.44e+07 1.81e+09
## Deg. of Freedom 1 1 1 1 1
## Residuals
## Sum of Squares 1.53e+12
## Deg. of Freedom 1410
##
## Residual standard error: 32944
## Estimated effects may be unbalanced
test$SalePrice <- 0
test[] <- sapply(test, as.numeric)
myPred <- predict(lm_model, newdata = test[, -51])
submit <- data.frame(Id = test$Id, SalePrice = myPred)
write.csv(submit, file = "SalePricePred.csv", row.names = FALSE)
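This writes the predictions for the 1,459 test houses to SalePricePred.csv in the two-column Id / SalePrice format that Kaggle expects.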
In this final project, I learned a great deal by putting together math, probability, and modeling. Due to time constraints I was not able to explore more models, but I used caret and several other packages for the first time and submitted the prediction results to Kaggle, which was a milestone.
References:
http://topepo.github.io/caret/index.html
https://github.com/wwells/CUNY_DATA_605/blob/master/FinalExam/WWells_FinalExam_pt1.Rmd
http://rstudio-pubs-static.s3.amazonaws.com/256459_5a62c0ca6d5849af92607011bb6c3e1d.html
https://github.com/wwells/CUNY_DATA_605/blob/master/FinalExam/WWells_FinalExam_pt2.Rmd