Final Project

Compete in the Kaggle.com House Prices: Advanced Regression Techniques competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques.

Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Pick SalePrice as the dependent variable, and define it as Y for the analysis that follows.

Probability. Calculate, as a minimum, the probabilities a through c below. Assume the small letter "x" is estimated as the 4th quartile (this is correct) of the X variable, and the small letter "y" is estimated as the 2nd quartile of the Y variable. Interpret the meaning of all probabilities.

a. P(X>x | Y>y)
b. P(X>x, Y>y)
c. P(X<x | Y>y)
library(ggplot2)
library(caret)
library(AppliedPredictiveModeling)
library(corrplot)
library(dplyr)
library(forecast)

train <- read.csv('train.csv', stringsAsFactors = FALSE)
test <- read.csv('test.csv', stringsAsFactors = FALSE)
testId <- test$Id
test$SalePrice <- 0
dim(train)
## [1] 1460   81

Probability

Y <- train$SalePrice
X <- train$LotArea

p.x <- quantile(X)[4]  # value at the 75% quantile of LotArea
p.y <- quantile(Y)[2]  # value at the 25% quantile of SalePrice

# a. P(X>x | Y>y)
px <- train %>% filter(LotArea > p.x & SalePrice > p.y) %>% 
        tally()/nrow(train)
py <- train%>% filter(SalePrice > p.y) %>% 
        tally()/nrow(train)
(px/py) 
##           n
## 1 0.3123288
# b. P(X>x, Y>y) 
px <- train %>% filter(LotArea > p.x) %>% 
        tally()/nrow(train)
py <- train%>% filter(SalePrice > p.y) %>% 
        tally()/nrow(train)
(px * py)
##        n
## 1 0.1875
# c. P(X<x | Y>y)
px <- train %>% filter(LotArea < p.x & SalePrice > p.y) %>% 
        tally()/nrow(train)
py <- train%>% filter(SalePrice > p.y) %>% 
        tally()/nrow(train)
(px/py)
##           n
## 1 0.6876712
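Note that part b above multiplies the two marginal probabilities P(X>x) and P(Y>y), which equals the joint probability P(X>x, Y>y) only if the two events are independent. As a cross-check, the joint probability can also be counted directly from the data (a minimal sketch, reusing the p.x and p.y cutoffs defined above):

# P(X>x, Y>y) counted directly from the joint event rather than as a
# product of marginals; the two agree only under independence
p.joint <- train %>%
  filter(LotArea > p.x & SalePrice > p.y) %>%
  tally() / nrow(train)
p.joint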

Independence

Does splitting the training data in this fashion make them independent? In other words, does P(XY)=P(X)P(Y) or does P(X|Y) = P(X)? Check mathematically, and then evaluate by running a Chi Square test for association. You might have to research this. A Chi Square test for independence (association) will require you to bin the data into logical groups.

Below I will check if P(X|Y) = P(X)

# use example a. P(X>x | Y>y) as P(X|Y)
Y <- train$SalePrice
X <- train$LotArea

p.x <- quantile(X)[4]
p.y <- quantile(Y)[2]

px <- train %>% filter(LotArea > p.x & SalePrice > p.y) %>% 
        tally()/nrow(train)
py <- train%>% filter(SalePrice > p.y) %>% 
        tally()/nrow(train)
p.xy <- px/py

#P(X)
px <- train %>% filter(LotArea > p.x) %>% 
        tally()/nrow(train)
py <- train %>% 
        tally()/nrow(train)
p.x <- px/py

(p.xy == p.x)
##          n
## [1,] FALSE

P(X|Y) = P(X) holds only when P(X | Y = y) = P(X) for every y. In the splits above this equality does not hold (p.xy differs from p.x), so splitting the training data in the fashion of parts a through c does not make the variables independent.

# chi square test
chi.test <- table(train$SalePrice, train$LotArea)
chisq.test(chi.test)
## Warning in chisq.test(chi.test): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  chi.test
## X-squared = 735090, df = 709660, p-value < 2.2e-16

Based on the Chi-square test result (p-value < .05), the null hypothesis that there is no association between SalePrice and LotArea is rejected. The warning, however, indicates the approximation may be unreliable, because the test above was run on a table of raw values rather than on binned groups.
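The prompt suggests binning the data into logical groups before testing; a minimal sketch, using the sample quartiles of each variable as cut points:

# bin both variables into quartile groups and test for association
# on the resulting 4 x 4 contingency table
lot.bins   <- cut(train$LotArea,   quantile(train$LotArea,   probs = seq(0, 1, 0.25)), include.lowest = TRUE)
price.bins <- cut(train$SalePrice, quantile(train$SalePrice, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
chisq.test(table(lot.bins, price.bins))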

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for both variables. Provide a scatterplot of X and Y. Transform both variables simultaneously using Box-Cox transformations. You might have to research this. Using the transformed variables, run a correlation analysis and interpret. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.

# SalePrice
summary(train$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
# LotArea
summary(train$LotArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245
comb <- data.frame(SalePrice = train$SalePrice, LotArea = train$LotArea)
plot(train$LotArea, train$SalePrice, xlab = "LotArea", ylab = "SalePrice")
yx.lm <- lm(SalePrice ~ LotArea, data = train)
abline(yx.lm, col = "red")  # overlay the fitted regression line

# medians of Y and X
salePrice.median <-median(Y)
lotArea.median <-median(X)

#Create density plot for LotArea variable.
# density of SalePrice
ggplot(train, aes(x = SalePrice)) + 
  geom_vline(xintercept = salePrice.median, col = "green", lwd = 1) +
  geom_density(adjust = 5, col = 'darkblue') + 
  labs(title="Density and Median for SalePrice") +
  labs(x="SalePrice", y="")

# density of X
ggplot(train, aes(x = LotArea)) + 
  geom_vline(xintercept = lotArea.median, col = "green", lwd = 1) +
  geom_density(adjust = 5, col = 'darkblue') + 
  labs(title="Density and Median for LotArea") +
  labs(x="LotArea", y="")

# transform variables using Box-Cox transformations
SalePrice.lam <- BoxCox.lambda(train$SalePrice)
trans.SalePrice <- BoxCox(train$SalePrice, SalePrice.lam)
hist(trans.SalePrice)

LotArea.lam <- BoxCox.lambda(train$LotArea)
trans.LotArea <- BoxCox(train$LotArea, LotArea.lam)
hist(trans.LotArea)
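For reference, the estimated Box-Cox parameters can be inspected directly; a lambda close to zero means the transformation behaves much like a log transform (a quick check, not shown in the output above):

# print the estimated Box-Cox lambdas for both variables
c(SalePrice.lambda = SalePrice.lam, LotArea.lambda = LotArea.lam)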

Hypothesis test using the transformed variables LotArea and SalePrice to test for a relationship between them:

(cor(cbind(trans.LotArea, trans.SalePrice)))
##                 trans.LotArea trans.SalePrice
## trans.LotArea       1.0000000       0.3893308
## trans.SalePrice     0.3893308       1.0000000
(cor.test(trans.LotArea, trans.SalePrice, method = "pearson", conf.level = .99))
## 
##  Pearson's product-moment correlation
## 
## data:  trans.LotArea and trans.SalePrice
## t = 16.14, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.3306244 0.4450358
## sample estimates:
##       cor 
## 0.3893308
(cor.test(train$LotArea, train$SalePrice, method = "pearson", conf.level = .99))
## 
##  Pearson's product-moment correlation
## 
## data:  train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.2000196 0.3254375
## sample estimates:
##       cor 
## 0.2638434

The two correlation tests above show a positive relationship between LotArea and SalePrice both before and after the transformation at the 99% confidence level, and the estimated correlation is stronger after the variables are transformed.
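The 99% confidence interval reported by cor.test is based on the Fisher z-transformation of the sample correlation; a minimal sketch reproducing it by hand for the transformed variables:

# reproduce the 99% CI via Fisher's z-transformation: atanh(r) is
# approximately normal with standard error 1 / sqrt(n - 3)
r  <- cor(trans.LotArea, trans.SalePrice)
n  <- length(trans.LotArea)
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
tanh(z + c(-1, 1) * qnorm(0.995) * se)  # should match the interval above closely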

Linear Algebra and Correlation

Invert your correlation matrix from the previous section. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

(matr <- cor(cbind(trans.LotArea, trans.SalePrice)))
##                 trans.LotArea trans.SalePrice
## trans.LotArea       1.0000000       0.3893308
## trans.SalePrice     0.3893308       1.0000000
(invM <- solve(matr))
##                 trans.LotArea trans.SalePrice
## trans.LotArea       1.1786593      -0.4588883
## trans.SalePrice    -0.4588883       1.1786593
(matr %*% invM)
##                 trans.LotArea trans.SalePrice
## trans.LotArea    1.000000e+00               0
## trans.SalePrice  5.551115e-17               1
(invM %*% matr)
##                 trans.LotArea trans.SalePrice
## trans.LotArea    1.000000e+00               0
## trans.SalePrice  5.551115e-17               1
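The diagonal value shown above is not a coincidence: for a 2 x 2 correlation matrix the precision matrix has 1 / (1 - r^2) on the diagonal (the variance inflation factor) and -r / (1 - r^2) off the diagonal. A quick check:

# verify the closed form of the 2 x 2 precision matrix
r <- matr[1, 2]
c(diagonal = 1 / (1 - r^2), off.diagonal = -r / (1 - r^2))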

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. For your non-transformed independent variable (X), location shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit a density function of your choice. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of the parameters for this distribution, and then take 1000 samples from this distribution (e.g., rexp(1000, λ) for an exponential). Plot a histogram and compare it with a histogram of your non-transformed original variable.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# X is above zero, so no location shift is necessary.
(min(X) > 0)
## [1] TRUE
op <- options(digits = 3)
set.seed(123)
dstr <- fitdistr(train$LotArea, "exponential")
lamb <- dstr$estimate

lotArea.opt <- rexp(1000, lamb)
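For an exponential distribution, the maximum-likelihood estimate of the rate parameter is simply the reciprocal of the sample mean, so the fitted value can be sanity-checked directly (a quick check, not part of the original output):

# the exponential MLE of the rate is 1 / mean(x); this should match fitdistr's estimate
c(fitdistr.rate = unname(lamb), reciprocal.mean = 1 / mean(train$LotArea))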

Histogram of 1,000 samples drawn from the fitted exponential distribution:

hist(lotArea.opt)

Histogram of the original (non-transformed) LotArea variable:

hist(train$LotArea)

Compared with the original LotArea histogram, the simulated exponential sample is smoother and lacks the extreme outliers, but it is still strongly right-skewed rather than close to normal: the fitted exponential density is highest near zero, whereas actual lot sizes are concentrated around the median of roughly 9,500 sq ft.
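One way to judge the quality of the exponential fit is to compare a few theoretical quantiles with the empirical quantiles of LotArea; a minimal sketch (large discrepancies are expected in the lower tail, because the exponential density is highest near zero while lot sizes rarely fall below about 1,300 sq ft):

# compare theoretical exponential quantiles with the empirical LotArea quantiles
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
rbind(theoretical = qexp(probs, rate = lamb),
      empirical   = quantile(train$LotArea, probs))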

PreProcess

# set up train and test datasets
train$IsTrainSet <- TRUE
test$IsTrainSet <- FALSE
df <- rbind(train, test)

# checking missing value
Num_NA <- sapply(df, function(y) sum(is.na(y)))
NA_Count <- data.frame(Item = colnames(df), Count = Num_NA)
col_rm <- NA_Count$Count > 1500
# remove variables with more than 1,500 missing values
df <- df[, !col_rm]
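# optional check (not in the original run): list the columns that exceeded
# the 1,500-missing-value threshold and were therefore dropped above
as.character(NA_Count$Item[col_rm])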
# keep the numeric variables (categorical variables are handled below)
isnum <- sapply(df, is.numeric)
dfnum <- df[, isnum]
dim(dfnum)
## [1] 2919   38
# convert any factor columns into integer codes
# (a no-op here, since the data were read with stringsAsFactors = FALSE)
for (i in 1:ncol(df)) {
  if (is.factor(df[, i])) {
    df[, i] <- as.integer(df[, i])
  }
}

# replace missing value with 0
df[is.na(df)] <- 0
dfnum[is.na(dfnum)] <- 0
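# sanity check: confirm that no missing values remain after the replacement
sum(is.na(df))
sum(is.na(dfnum))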

# Descriptive Analysis, identify correlated predictors
descrCor <- cor(dfnum, use = "pairwise.complete.obs")
# corrplot
corrplot(
  descrCor,
  method = "circle",
  type = "lower",
  sig.level = 0.01,
  insig = "blank"
)

# remove descriptors with absolute correlations > .75
(highlyCorDescr <- findCorrelation(descrCor, cutoff = .75))
## [1] 17 14 27
# column 37 is SalePrice, the dependent variable; remove column 4 and the remaining highly correlated predictors
dfnum <- dfnum[, -c(4, highlyCorDescr[-1])]
descrCor2 <- cor(dfnum)
summary(descrCor2[upper.tri(descrCor2)])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.740  -0.019   0.023   0.069   0.130   0.808
# corrplot
corrplot(
  descrCor2,
  method = "circle",
  type = "lower",
  sig.level = 0.01,
  insig = "blank"
)

# plot vars that have strong relationships with SalePrice
pairs(~SalePrice+TotalBsmtSF+GarageArea+BsmtFullBath+TotRmsAbvGrd+BedroomAbvGr,data=dfnum,
      main="Scatterplot Matrix")

# remove linear dependencies
lincomb = findLinearCombos(dfnum)
lapply(lincomb$linearCombos, function(x) colnames(dfnum)[x])
## [[1]]
## [1] "TotalBsmtSF" "BsmtFinSF1"  "BsmtFinSF2"  "BsmtUnfSF"
dfnum <- dfnum[-lincomb$remove]
dim(dfnum)
## [1] 2919   34
# categorical casting
dfchar <- df[, !isnum]
dim(dfchar)
## [1] 2919   40
dfchar[] <- lapply(dfchar, factor) 

# Boruta Feature Selection
set.seed(121)
if (!require('Boruta')) install.packages('Boruta')
## Loading required package: Boruta
## Loading required package: ranger
library(Boruta)
sel.var <- cbind(dfnum[, -c(1)], dfchar)
train <- sel.var[sel.var$IsTrainSet == TRUE, ]
test <- sel.var[sel.var$IsTrainSet == FALSE, ]

hvo <- na.omit(train)
bor.results <- Boruta(hvo[, -37],
                      hvo$SalePrice,
                      # maxRuns=101,
                      doTrace=0)

boruta.train <- TentativeRoughFix(bor.results)

plot(boruta.train, xlab = "", xaxt = "n")
lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)
boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])
names(lz) <- colnames(boruta.train$ImpHistory)
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels), at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)

keepMe <- getSelectedAttributes(boruta.train, withTentative = F)

myvars_train <- names(train) %in% keepMe
myvars_test <- names(test) %in% keepMe

train <- train[myvars_train]
test <- test[myvars_test]
test$Id <- testId

Model Selection and Training

# predict 

# variable Exterior1st has different levels in test and train data set, so use train data set
# remove Id and IsTrainSet
train[] <- sapply(train, as.numeric)
lm_model <- lm(SalePrice ~ ., data = train)

summary(lm_model)
## 
## Call:
## lm(formula = SalePrice ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -489752  -14139   -1583   13293  270617 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.70e+05   1.63e+05   -2.87   0.0041 ** 
## MSSubClass   -1.10e+02   4.61e+01   -2.38   0.0175 *  
## LotFrontage  -1.74e+01   2.85e+01   -0.61   0.5406    
## OverallQual   1.33e+04   1.19e+03   11.18  < 2e-16 ***
## OverallCond   5.29e+03   1.03e+03    5.11  3.7e-07 ***
## YearBuilt     2.27e+02   7.29e+01    3.11   0.0019 ** 
## YearRemodAdd  4.20e+01   6.80e+01    0.62   0.5368    
## MasVnrArea    3.01e+01   6.15e+00    4.89  1.1e-06 ***
## BsmtFinSF1    1.14e+01   4.69e+00    2.43   0.0152 *  
## BsmtUnfSF     6.57e+00   4.67e+00    1.41   0.1600    
## X2ndFlrSF     2.60e+00   5.51e+00    0.47   0.6374    
## GrLivArea     4.21e+01   5.52e+00    7.63  4.2e-14 ***
## BsmtFullBath  1.03e+04   2.34e+03    4.40  1.1e-05 ***
## FullBath      5.96e+03   2.74e+03    2.18   0.0295 *  
## HalfBath      2.20e+03   2.60e+03    0.85   0.3966    
## BedroomAbvGr -4.29e+03   1.70e+03   -2.52   0.0117 *  
## KitchenAbvGr -1.53e+04   5.10e+03   -3.01   0.0027 ** 
## TotRmsAbvGrd  3.19e+03   1.19e+03    2.68   0.0074 ** 
## Fireplaces    1.17e+04   2.66e+03    4.38  1.3e-05 ***
## GarageYrBlt   1.36e-01   5.93e+00    0.02   0.9817    
## GarageArea    3.27e+01   6.69e+00    4.89  1.2e-06 ***
## WoodDeckSF    2.08e+01   7.54e+00    2.76   0.0059 ** 
## OpenPorchSF  -1.21e+01   1.46e+01   -0.83   0.4049    
## MSZoning     -1.01e+03   1.53e+03   -0.66   0.5110    
## LotShape     -1.33e+03   6.81e+02   -1.96   0.0504 .  
## Neighborhood  2.51e+02   1.61e+02    1.56   0.1190    
## Condition1   -2.10e+02   1.03e+03   -0.20   0.8393    
## BldgType     -2.26e+03   1.51e+03   -1.50   0.1336    
## HouseStyle   -1.42e+03   6.58e+02   -2.15   0.0315 *  
## RoofStyle     2.35e+03   1.15e+03    2.05   0.0409 *  
## Exterior1st  -7.77e+02   5.35e+02   -1.45   0.1465    
## Exterior2nd   1.77e+02   4.85e+02    0.37   0.7150    
## MasVnrType    3.35e+03   1.52e+03    2.21   0.0275 *  
## ExterQual    -8.16e+03   1.98e+03   -4.12  4.1e-05 ***
## Foundation   -4.45e+02   1.71e+03   -0.26   0.7955    
## BsmtQual     -6.58e+03   1.33e+03   -4.94  9.0e-07 ***
## BsmtCond      4.19e+03   1.28e+03    3.27   0.0011 ** 
## BsmtExposure -3.48e+03   8.86e+02   -3.93  9.0e-05 ***
## BsmtFinType1 -6.07e+02   6.35e+02   -0.96   0.3393    
## BsmtFinType2  1.14e+03   1.00e+03    1.14   0.2562    
## HeatingQC    -4.63e+02   6.27e+02   -0.74   0.4599    
## CentralAir    2.33e+03   4.37e+03    0.53   0.5947    
## Electrical   -4.35e+02   9.33e+02   -0.47   0.6411    
## KitchenQual  -7.90e+03   1.49e+03   -5.30  1.3e-07 ***
## FireplaceQu  -1.84e+03   8.09e+02   -2.28   0.0230 *  
## GarageType    1.27e+01   6.53e+02    0.02   0.9845    
## GarageFinish -1.80e+03   1.52e+03   -1.18   0.2371    
## GarageQual   -6.77e+02   1.82e+03   -0.37   0.7100    
## GarageCond    4.45e+01   2.06e+03    0.02   0.9828    
## PavedDrive    2.75e+03   2.13e+03    1.29   0.1969    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32900 on 1410 degrees of freedom
## Multiple R-squared:  0.834,  Adjusted R-squared:  0.828 
## F-statistic:  144 on 49 and 1410 DF,  p-value: <2e-16
aov(lm_model)
## Call:
##    aov(formula = lm_model)
## 
## Terms:
##                 MSSubClass LotFrontage OverallQual OverallCond YearBuilt
## Sum of Squares    6.54e+10    3.54e+11    5.47e+12    7.37e+08  8.37e+10
## Deg. of Freedom          1           1           1           1         1
##                 YearRemodAdd MasVnrArea BsmtFinSF1 BsmtUnfSF X2ndFlrSF
## Sum of Squares      2.30e+10   2.46e+11   2.40e+11  4.95e+10  3.57e+11
## Deg. of Freedom            1          1          1         1         1
##                 GrLivArea BsmtFullBath FullBath HalfBath BedroomAbvGr
## Sum of Squares   3.68e+11     2.64e+10 9.54e+08 4.82e+07     4.17e+10
## Deg. of Freedom         1            1        1        1            1
##                 KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt
## Sum of Squares      4.88e+09     2.43e+10   1.55e+10    1.34e+08
## Deg. of Freedom            1            1          1           1
##                 GarageArea WoodDeckSF OpenPorchSF MSZoning LotShape
## Sum of Squares    4.98e+10   1.34e+10    2.88e+08 1.73e+09 6.12e+09
## Deg. of Freedom          1          1           1        1        1
##                 Neighborhood Condition1 BldgType HouseStyle RoofStyle
## Sum of Squares      2.83e+09   9.15e+07 1.03e+09   6.63e+09  6.14e+09
## Deg. of Freedom            1          1        1          1         1
##                 Exterior1st Exterior2nd MasVnrType ExterQual Foundation
## Sum of Squares     4.84e+09    4.57e+08   2.20e+10  7.03e+10   1.57e+09
## Deg. of Freedom           1           1          1         1          1
##                 BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2
## Sum of Squares  3.84e+10 1.20e+10     1.77e+10     6.03e+08     1.40e+09
## Deg. of Freedom        1        1            1            1            1
##                 HeatingQC CentralAir Electrical KitchenQual FireplaceQu
## Sum of Squares   1.90e+09   6.30e+06   9.84e+08    3.39e+10    5.20e+09
## Deg. of Freedom         1          1          1           1           1
##                 GarageType GarageFinish GarageQual GarageCond PavedDrive
## Sum of Squares    6.14e+07     1.37e+09   1.34e+08   1.44e+07   1.81e+09
## Deg. of Freedom          1            1          1          1          1
##                 Residuals
## Sum of Squares   1.53e+12
## Deg. of Freedom      1410
## 
## Residual standard error: 32944
## Estimated effects may be unbalanced
test$SalePrice <- 0
 
test[] <- sapply(test, as.numeric)
myPred <- predict(lm_model, newdata = test[, -51])

submit <- data.frame(Id = test$Id, SalePrice = myPred)
write.csv(submit, file = "SalePricePred.csv",  row.names = FALSE)
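Before relying on the submission score alone, out-of-sample error could also be estimated locally with cross-validation. A minimal sketch using caret (already loaded), assuming the same numeric train data frame used to fit lm above:

# 5-fold cross-validation of the same linear model, as a rough RMSE estimate
# (a sketch; not part of the submitted pipeline)
set.seed(123)
cv_fit <- caret::train(SalePrice ~ ., data = train,
                       method = "lm",
                       trControl = trainControl(method = "cv", number = 5))
cv_fit$results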

Kaggle Submission

[Kaggle submission screenshots]

Conclusion:

In this final project, I learned a great deal by putting together math, probability, and modeling. Due to time constraints, I was not able to explore more models, but I used caret and several other packages for the first time and submitted the prediction results to Kaggle, which was a milestone.