You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition.

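# LOAD LIBRARIES AND DATA
# Packages are inferred from the functions used below; tidyverse is attached last
# so that its dplyr verbs are not masked by MASS or by plyr (attached by Rmisc).
library(MASS); library(Rmisc); library(fitdistrplus)
library(corrplot); library(knitr); library(tidyverse)
test <- read.csv("./kaggle_data/test.csv", stringsAsFactors = FALSE, header = TRUE)
train <- read.csv("./kaggle_data/train.csv", stringsAsFactors = FALSE, header = TRUE)
submission <- read.csv("./kaggle_data/sample_submission.csv", stringsAsFactors = FALSE, header = TRUE)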

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any THREE quantitative variables in the dataset.
Test the hypotheses that the correlations between each pairwise set of variables are 0 and provide an 80% confidence interval.
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

The train data shape:

dim(train)
## [1] 1460   81

Numeric variables are listed below (those marked with * are the ones selected for the analysis that follows):

LotFrontage : Linear feet of street connected to property
* LotArea : Lot size in square feet
OverallQual : Rates the overall material and finish of the house
OverallCond : Rates the overall condition of the house
YearBuilt : Original construction date
YearRemodAdd : Remodel date (same as construction date if no remodeling or additions)
MasVnrArea : Masonry veneer area in square feet
BsmtFinSF1 : Type 1 finished square feet
BsmtFinSF2 : Type 2 finished square feet
BsmtUnfSF : Unfinished square feet of basement area
* TotalBsmtSF : Total square feet of basement area
1stFlrSF : First Floor square feet
2ndFlrSF : Second floor square feet
LowQualFinSF : Low quality finished square feet (all floors)
* GrLivArea : Above grade (ground) living area square feet
BsmtFullBath : Basement full bathrooms
BsmtHalfBath : Basement half bathrooms
FullBath : Full bathrooms above grade
HalfBath : Half baths above grade
BedroomAbvGr : Bedrooms above grade (does NOT include basement bedrooms)
KitchenAbvGr : Kitchens above grade
TotRmsAbvGrd : Total rooms above grade (does not include bathrooms)
Fireplaces : Number of fireplaces
GarageYrBlt : Year garage was built
GarageCars : Size of garage in car capacity
GarageArea : Size of garage in square feet
WoodDeckSF : Wood deck area in square feet
OpenPorchSF : Open porch area in square feet
EnclosedPorch : Enclosed porch area in square feet
3SsnPorch : Three season porch area in square feet
ScreenPorch : Screen porch area in square feet
PoolArea : Pool area in square feet
MiscVal : $Value of miscellaneous feature
MoSold : Month Sold (MM)
YrSold : Year Sold (YYYY)
* SalePrice : Sale Price of the House

* Provide univariate descriptive statistics and appropriate plots for the training data set.

First, summarize the numeric variables in the training data set.

train %>%
  select_if(is.numeric) %>% # keep numeric columns only
  summary()
##        Id           MSSubClass     LotFrontage        LotArea      
##  Min.   :   1.0   Min.   : 20.0   Min.   : 21.00   Min.   :  1300  
##  1st Qu.: 365.8   1st Qu.: 20.0   1st Qu.: 59.00   1st Qu.:  7554  
##  Median : 730.5   Median : 50.0   Median : 69.00   Median :  9478  
##  Mean   : 730.5   Mean   : 56.9   Mean   : 70.05   Mean   : 10517  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   3rd Qu.: 80.00   3rd Qu.: 11602  
##  Max.   :1460.0   Max.   :190.0   Max.   :313.00   Max.   :215245  
##                                   NA's   :259                      
##   OverallQual      OverallCond      YearBuilt     YearRemodAdd 
##  Min.   : 1.000   Min.   :1.000   Min.   :1872   Min.   :1950  
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967  
##  Median : 6.000   Median :5.000   Median :1973   Median :1994  
##  Mean   : 6.099   Mean   :5.575   Mean   :1971   Mean   :1985  
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004  
##  Max.   :10.000   Max.   :9.000   Max.   :2010   Max.   :2010  
##                                                                
##    MasVnrArea       BsmtFinSF1       BsmtFinSF2        BsmtUnfSF     
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.00   Min.   :   0.0  
##  1st Qu.:   0.0   1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.: 223.0  
##  Median :   0.0   Median : 383.5   Median :   0.00   Median : 477.5  
##  Mean   : 103.7   Mean   : 443.6   Mean   :  46.55   Mean   : 567.2  
##  3rd Qu.: 166.0   3rd Qu.: 712.2   3rd Qu.:   0.00   3rd Qu.: 808.0  
##  Max.   :1600.0   Max.   :5644.0   Max.   :1474.00   Max.   :2336.0  
##  NA's   :8                                                           
##   TotalBsmtSF       X1stFlrSF      X2ndFlrSF     LowQualFinSF    
##  Min.   :   0.0   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  1st Qu.: 795.8   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##  Median : 991.5   Median :1087   Median :   0   Median :  0.000  
##  Mean   :1057.4   Mean   :1163   Mean   : 347   Mean   :  5.845  
##  3rd Qu.:1298.2   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##  Max.   :6110.0   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                                  
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr    TotRmsAbvGrd   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   : 2.000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 5.000  
##  Median :0.0000   Median :3.000   Median :1.000   Median : 6.000  
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   Mean   : 6.518  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 7.000  
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000   Max.   :14.000  
##                                                                   
##    Fireplaces     GarageYrBlt     GarageCars      GarageArea    
##  Min.   :0.000   Min.   :1900   Min.   :0.000   Min.   :   0.0  
##  1st Qu.:0.000   1st Qu.:1961   1st Qu.:1.000   1st Qu.: 334.5  
##  Median :1.000   Median :1980   Median :2.000   Median : 480.0  
##  Mean   :0.613   Mean   :1979   Mean   :1.767   Mean   : 473.0  
##  3rd Qu.:1.000   3rd Qu.:2002   3rd Qu.:2.000   3rd Qu.: 576.0  
##  Max.   :3.000   Max.   :2010   Max.   :4.000   Max.   :1418.0  
##                  NA's   :81                                     
##    WoodDeckSF      OpenPorchSF     EnclosedPorch      X3SsnPorch    
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median :  0.00   Median : 25.00   Median :  0.00   Median :  0.00  
##  Mean   : 94.24   Mean   : 46.66   Mean   : 21.95   Mean   :  3.41  
##  3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00   3rd Qu.:  0.00  
##  Max.   :857.00   Max.   :547.00   Max.   :552.00   Max.   :508.00  
##                                                                     
##   ScreenPorch        PoolArea          MiscVal             MoSold      
##  Min.   :  0.00   Min.   :  0.000   Min.   :    0.00   Min.   : 1.000  
##  1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:    0.00   1st Qu.: 5.000  
##  Median :  0.00   Median :  0.000   Median :    0.00   Median : 6.000  
##  Mean   : 15.06   Mean   :  2.759   Mean   :   43.49   Mean   : 6.322  
##  3rd Qu.:  0.00   3rd Qu.:  0.000   3rd Qu.:    0.00   3rd Qu.: 8.000  
##  Max.   :480.00   Max.   :738.000   Max.   :15500.00   Max.   :12.000  
##                                                                        
##      YrSold       SalePrice     
##  Min.   :2006   Min.   : 34900  
##  1st Qu.:2007   1st Qu.:129975  
##  Median :2008   Median :163000  
##  Mean   :2008   Mean   :180921  
##  3rd Qu.:2009   3rd Qu.:214000  
##  Max.   :2010   Max.   :755000  
## 

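The selected variables LotArea, GrLivArea, TotalBsmtSF, and SalePrice are heavily right-skewed, so their log-transformed values are plotted to make the histograms easier to read. TotalBsmtSF is 0 for houses without a basement, and log(0) is -Inf; these are the 37 non-finite rows that ggplot reports dropping in the warnings below.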

train %>%
  dplyr::select(LotArea, GrLivArea, TotalBsmtSF, SalePrice) %>%
  mutate(LotArea = log(LotArea),
         GrLivArea = log(GrLivArea),
         TotalBsmtSF = log(TotalBsmtSF),
         SalePrice = log(SalePrice)) %>%
  gather('key', 'value') %>%
  ggplot(aes(x = value, y = ..density..)) +
  geom_histogram(
                 fill = 'red',
                 alpha = 0.5,
                 bins = 15) +
  geom_density() +
  facet_wrap(~key, scales ='free_x') +
  labs(x = 'log value',
       y = '',
       title = 'Histogram (log transformed values)')
## Warning: Removed 37 rows containing non-finite values (stat_bin).
## Warning: Removed 37 rows containing non-finite values (stat_density).
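The warnings above are what a log transform produces when a variable contains zeros. The following is a minimal sketch on a made-up vector (not the Kaggle data); `log1p()` is shown only as one possible alternative that keeps the zero rows, and is an assumption for illustration rather than what this report uses.

```r
# Hypothetical basement areas; 0 stands for a house without a basement
bsmt_sf <- c(0, 520, 991, 1298)

log_plain <- log(bsmt_sf)     # log(0) is -Inf, which ggplot2 drops with a warning
log_shift <- log1p(bsmt_sf)   # log(1 + x) stays finite at x = 0

sum(!is.finite(log_plain))    # rows a histogram of the log values would lose
sum(!is.finite(log_shift))    # rows lost when using log1p() instead
```

Either choice changes the shape of the left tail, so the transform used for plotting should be stated alongside the plot.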

* Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

train %>%
  dplyr::select(LotArea, GrLivArea, TotalBsmtSF, SalePrice) %>%
  mutate(LotArea = log(LotArea),
         GrLivArea = log(GrLivArea),
         TotalBsmtSF = log(TotalBsmtSF),
         SalePrice = log(SalePrice)) %>%
  pairs(main = 'Scatterplot matrix LotArea, GrLivArea, TotalBsmtSF, SalePrice')

* Derive a correlation matrix for any THREE quantitative variables in the dataset.

train %>%
  dplyr::select(LotArea, GrLivArea, TotalBsmtSF, SalePrice) %>%
  cor() %>%
  corrplot(method = 'number')

Looking at the correlation plot, GrLivArea is the variable with the highest correlation with SalePrice, followed by TotalBsmtSF and then LotArea.

* Test the hypotheses that the correlations between each pairwise set of variables are 0 and provide an 80% confidence interval.

cor.test(train$GrLivArea, train$SalePrice, method = 'pearson', conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$GrLivArea and train$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245
cor.test(train$TotalBsmtSF, train$SalePrice, method = 'pearson', conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$TotalBsmtSF and train$SalePrice
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5922142 0.6340846
## sample estimates:
##       cor 
## 0.6135806
cor.test(train$LotArea, train$SalePrice, method = 'pearson', conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2323391 0.2947946
## sample estimates:
##       cor 
## 0.2638434

The 80% confidence interval for the correlation between GrLivArea and SalePrice is (0.6915, 0.7249), with a strong positive correlation of 0.7086.
The 80% confidence interval for the correlation between TotalBsmtSF and SalePrice is (0.5922, 0.6341), with a moderately strong positive correlation of 0.6136.
The 80% confidence interval for the correlation between LotArea and SalePrice is (0.2323, 0.2948), with a weak positive correlation of 0.2638.
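For reference, the interval that `cor.test()` reports for a Pearson correlation is based on the Fisher z-transform. The sketch below recomputes the GrLivArea interval by hand from the printed estimate and degrees of freedom; the variable names are illustrative, and the result should agree with the output above to about four decimals.

```r
# Values copied from the cor.test() output for GrLivArea vs SalePrice
r <- 0.7086245
n <- 1460            # df + 2
conf_level <- 0.80

z  <- atanh(r)                       # Fisher z-transform of r
se <- 1 / sqrt(n - 3)                # approximate standard error of z
zc <- qnorm((1 + conf_level) / 2)    # two-sided normal critical value

ci <- tanh(z + c(-1, 1) * zc * se)   # back-transform to the correlation scale
round(ci, 4)
```

Because the sample is large, the standard error on the z scale is small, which is why all three intervals are narrow even at a modest confidence level.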

All three tests return a p-value below 2.2e-16, so for each pair tested (each predictor with SalePrice) we reject the null hypothesis that the correlation is 0. None of the 80% confidence intervals contains 0, which is strong evidence that each selected variable is correlated with SalePrice.

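The familywise error rate (FWER) is the probability of coming to at least one false conclusion in a series of hypothesis tests; in other words, it is the probability of making at least one Type I error. A typical way to control the FWER in the scientific literature is the Bonferroni correction, which tests each of m hypotheses at level alpha/m. With three tests at alpha = 0.20 (the level implied by 80% intervals), the FWER could reach 1 - 0.8^3, or about 0.49, if the tests were independent, so in general it would be a concern. Here, however, every p-value is below 2.2e-16, far below even the Bonferroni-adjusted threshold of 0.20/3 (about 0.067), so it is unlikely that any of these rejections is a Type I error.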
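The familywise reasoning can be checked numerically. This is a small sketch, assuming three tests at the per-test level implied by the 80% intervals; since `cor.test()` only prints "p-value < 2.2e-16", that bound is used as a stand-in for the actual p-values.

```r
alpha <- 0.20    # per-test level implied by the 80% confidence intervals
m     <- 3       # number of pairwise tests reported above

# Familywise error rate if the tests were independent and all nulls were true
fwer <- 1 - (1 - alpha)^m

# Bonferroni: test each hypothesis at alpha / m instead
alpha_bonf <- alpha / m

# Stand-in p-values (upper bound printed by cor.test), Bonferroni-adjusted
p_upper <- rep(2.2e-16, m)
p_adj   <- p.adjust(p_upper, method = "bonferroni")

fwer
alpha_bonf
all(p_adj < alpha)   # do all tests still reject after adjustment?
```

Bonferroni is conservative; `p.adjust()` also offers less strict methods such as "holm", which would lead to the same conclusion here.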

Linear Algebra and Correlation

Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

* Create a 3x3 correlation matrix with GrLivArea, TotalBsmtSF, SalePrice

# create a 3x3 correlation matrix with GrLivArea, TotalBsmtSF, SalePrice
mat <- train %>%
  dplyr::select(GrLivArea, TotalBsmtSF, SalePrice) %>%
  cor() %>%
  round(4)

mat
##             GrLivArea TotalBsmtSF SalePrice
## GrLivArea      1.0000      0.4549    0.7086
## TotalBsmtSF    0.4549      1.0000    0.6136
## SalePrice      0.7086      0.6136    1.0000

* Invert the matrix, then multiply the correlation matrix by the precision matrix.

prec.mat <- solve(mat) %>%
  round(4)

prec.mat
##             GrLivArea TotalBsmtSF SalePrice
## GrLivArea      2.0111     -0.0648   -1.3853
## TotalBsmtSF   -0.0648      1.6060   -0.9395
## SalePrice     -1.3853     -0.9395    2.5581

Since \(Precision = Correlation^{-1}\), the product \(Precision \times Correlation\) should equal the identity matrix \(I\), up to the rounding applied to both matrices.

prec.mat %*% mat %>%
  round(4)
##             GrLivArea TotalBsmtSF SalePrice
## GrLivArea           1           0         0
## TotalBsmtSF         0           1         0
## SalePrice           0           0         1

* and then multiply the precision matrix by the correlation matrix.

mat %*% prec.mat %>%
  round(4)
##             GrLivArea TotalBsmtSF SalePrice
## GrLivArea           1           0         0
## TotalBsmtSF         0           1         0
## SalePrice           0           0         1
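The assignment notes that the diagonal of the precision matrix holds the variance inflation factors. A minimal sketch of that check follows, using the rounded correlation matrix printed above; the helper names are illustrative. Each VIF is computed as 1 / (1 - R^2), where R^2 comes from regressing one variable on the other two, expressed through correlations only.

```r
# Correlation matrix copied from the output above (rounded to 4 decimals)
vars <- c("GrLivArea", "TotalBsmtSF", "SalePrice")
mat <- matrix(c(1.0000, 0.4549, 0.7086,
                0.4549, 1.0000, 0.6136,
                0.7086, 0.6136, 1.0000),
              nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# R^2 of variable j on the others: r_j' %*% solve(R_others) %*% r_j
vif <- sapply(seq_along(vars), function(j) {
  r_j  <- mat[-j, j]
  r_sq <- drop(t(r_j) %*% solve(mat[-j, -j]) %*% r_j)
  1 / (1 - r_sq)
})
names(vif) <- vars

round(vif, 4)
round(diag(solve(mat)), 4)   # diagonal of the precision matrix
```

The two printed vectors should agree, which is the sense in which the precision matrix "contains" the VIFs.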

* Conduct LU decomposition

ALU <- function(A){
  # Factorize A, a square matrix, into Lower Triangular(L) and Upper Triangular(U) matrix
  
  # This function only works for square matrix, thus check whether the matrix is square first.
  if(nrow(A) != ncol(A)){
    stop("This is not a square matrix. Try again.")
  }
  
  # initialize variables
  counter <- as.integer(nrow(A))
  U <- A
  L <- diag(counter)
  # m indexes rows, n indexes columns (no pivoting, so U[n,n] must be non-zero)

  # iterate through the pivot columns
  for (n in 1:(counter-1)){

    # eliminate the entries below the pivot U[n,n]
    for (m in (n+1):counter){
      x <- -U[m,n]/U[n,n] # multiplier chosen so that U[m,n] + x*U[n,n] == 0
      U[m,n] <- 0

      # add x times the pivot row to the remaining entries of the current row
      for (n2 in (n+1):counter){
        U[m,n2] <- U[m,n2] + x*U[n,n2]
      }

      # store the negated multiplier in L
      L[m,n] <- -x
    }
  }
  line <- ("=========================================")
  print("A")
  print(A)
  print(line)
  print("U")
  print(U)
  print(line)
  print("L")
  print(L)
  print(line)
  print("A = LU")
  print(A == (L %*% U))
  print(L %*% U)
}

# or simply, matrixcalc::lu.decomposition(mat)

ALU(mat)
## [1] "A"
##             GrLivArea TotalBsmtSF SalePrice
## GrLivArea      1.0000      0.4549    0.7086
## TotalBsmtSF    0.4549      1.0000    0.6136
## SalePrice      0.7086      0.6136    1.0000
## [1] "========================================="
## [1] "U"
##             GrLivArea TotalBsmtSF SalePrice
## GrLivArea           1    0.454900 0.7086000
## TotalBsmtSF         0    0.793066 0.2912579
## SalePrice           0    0.000000 0.3909200
## [1] "========================================="
## [1] "L"
##        [,1]      [,2] [,3]
## [1,] 1.0000 0.0000000    0
## [2,] 0.4549 1.0000000    0
## [3,] 0.7086 0.3672555    1
## [1] "========================================="
## [1] "A = LU"
##             GrLivArea TotalBsmtSF SalePrice
## GrLivArea        TRUE        TRUE      TRUE
## TotalBsmtSF      TRUE        TRUE      TRUE
## SalePrice        TRUE        TRUE      TRUE
##      GrLivArea TotalBsmtSF SalePrice
## [1,]    1.0000      0.4549    0.7086
## [2,]    0.4549      1.0000    0.6136
## [3,]    0.7086      0.6136    1.0000
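As a variation on the function above, the factors can be returned instead of printed, and the reconstruction can be checked with a numerical tolerance rather than exact equality. This is a sketch under the same assumption as before (no pivoting, so no zero pivots); `lu_doolittle` is a hypothetical name.

```r
# Doolittle LU factorization without pivoting (assumes no zero pivots)
lu_doolittle <- function(A) {
  stopifnot(is.matrix(A), nrow(A) == ncol(A))
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (j in seq_len(n - 1)) {
    for (i in (j + 1):n) {
      L[i, j] <- U[i, j] / U[j, j]          # elimination multiplier
      U[i, ] <- U[i, ] - L[i, j] * U[j, ]   # subtract a multiple of the pivot row
      U[i, j] <- 0                          # clear floating-point residue
    }
  }
  list(L = L, U = U)
}

# Correlation matrix copied from the output above
A <- matrix(c(1.0000, 0.4549, 0.7086,
              0.4549, 1.0000, 0.6136,
              0.7086, 0.6136, 1.0000), nrow = 3, byrow = TRUE)

f <- lu_doolittle(A)
isTRUE(all.equal(f$L %*% f$U, A))   # compare within numerical tolerance
```

Comparing with `all.equal()` avoids spurious FALSE entries that an exact `==` comparison can produce when floating-point rounding differs in the last bits.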

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data.

Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

Then load the MASS package and run fitdistr to fit an exponential probability density function.
(See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).

Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\) )).

Plot a histogram and compare it with a histogram of your original variable.

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

Also generate a 95% confidence interval from the empirical data, assuming normality.

Finally, provide the empirical 5th percentile and 95th percentile of the data.

Discuss.

I selected GrLivArea as it appears to be skewed to the right.

First, check the summary of GrLivArea and plot its distribution.

summary(train$GrLivArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1464    1515    1777    5642
par(mfrow=c(1,2))
hist(train$GrLivArea, main="Histogram of GrLivArea")
boxplot(train$GrLivArea, main="Boxplot of GrLivArea")

* Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value.

Calculate the optimal lambda for the right-skewed GrLivArea using fitdistr from the MASS library.

lambda <- fitdistr(train$GrLivArea, densfun='exponential')
lambda$estimate
##        rate 
## 0.000659864
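For the exponential density, the maximum-likelihood rate is simply the reciprocal of the sample mean, which is consistent with the fitted value above (the mean of GrLivArea is about 1515, and 1/1515 is about 0.00066). The sketch below illustrates this on simulated data standing in for GrLivArea; the sample and its parameters are assumptions, not the Kaggle data.

```r
library(MASS)

set.seed(1)
# Simulated right-skewed sample (assumption, not the Kaggle data)
x <- rexp(500, rate = 1 / 1500)

fit <- fitdistr(x, densfun = "exponential")

# Closed-form maximum-likelihood estimate of the rate
rate_closed_form <- 1 / mean(x)

unname(fit$estimate)
rate_closed_form
```

This also explains why the fit cannot adapt to the shape of the data: a single parameter tied to the mean fixes the whole distribution.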

Draw 1000 samples from an exponential distribution with the fitted rate.

set.seed(23)
pdf.dist <- rexp(1000,lambda$estimate)

Plot the results of the exponential distribution.

hist(pdf.dist, freq = FALSE, breaks = 100, 
     main ="Fitted Exponential PDF with GrLivArea",
     xlab = "PDF GrLivArea",
     xlim = c(1, quantile(pdf.dist, 0.99)))
curve(dexp(x, rate = lambda$estimate), col = "red", add = TRUE)

* Plot a histogram and compare it with a histogram of your original variable.

Original - GrLivArea

set.seed(23)
samp.GrLivArea <- sample(train$GrLivArea, 1000, replace=TRUE, prob=NULL)
exp.train <- tibble(Expo=rexp(1000, lambda$estimate)) %>%
  mutate(GrLivArea = samp.GrLivArea)

plotdist(exp.train$GrLivArea, histo = TRUE, demp = TRUE)

Sample - Fitted Exponential PDF with GrLivArea

plotdist(exp.train$Expo, histo = TRUE, demp = TRUE)

hist(train$GrLivArea, freq = FALSE, breaks = 100, 
     main ="Comparison with original GrLivArea",
     xlab="Original GrLivArea",
     xlim = c(1, quantile(train$GrLivArea, 0.99)))
curve(dexp(x, rate = lambda$estimate), col = "red", add = TRUE)

* Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

pdf.5th <- qexp(0.05, rate = lambda$estimate, lower.tail = TRUE, log.p = FALSE)
pdf.5th <- round(pdf.5th, 4)
pdf.95th <- qexp(0.95, rate = lambda$estimate, lower.tail = TRUE, log.p = FALSE)
pdf.95th <- round(pdf.95th, 4)

Under the fitted exponential distribution, the 5th percentile is 77.7331 and the 95th percentile is 4539.9235.
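These percentiles follow directly from inverting the exponential CDF, \(F(x) = 1 - e^{-\lambda x}\), which gives \(x_p = -\ln(1 - p) / \lambda\). A short sketch, using the rate copied from the fitdistr() output (so the last digits may differ slightly from the values above); `exp_quantile` is a hypothetical helper.

```r
lambda_hat <- 0.000659864   # rate copied from the fitdistr() output above

# Invert the CDF F(x) = 1 - exp(-lambda * x):  x_p = -log(1 - p) / lambda
exp_quantile <- function(p, rate) -log(1 - p) / rate

p <- c(0.05, 0.95)
closed_form <- exp_quantile(p, lambda_hat)
built_in    <- qexp(p, rate = lambda_hat)

round(closed_form, 2)
all.equal(closed_form, built_in)
```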

* Also generate a 95% confidence interval from the empirical data, assuming normality.

CI(train$GrLivArea, 0.95)
##    upper     mean    lower 
## 1542.440 1515.464 1488.487
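The interval above is a confidence interval for the mean. Assuming `CI()` here is `Rmisc::CI`, it corresponds to the usual t-based formula, mean plus or minus the t critical value times the standard error. The sketch below computes that formula by hand on simulated data (an assumption, not the Kaggle data) and cross-checks it against `t.test()`.

```r
set.seed(42)
# Simulated sample standing in for GrLivArea (assumption, not the Kaggle data)
x <- rnorm(200, mean = 1500, sd = 500)

n      <- length(x)
se     <- sd(x) / sqrt(n)          # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # two-sided 95% critical value

ci <- c(lower = mean(x) - t_crit * se,
        mean  = mean(x),
        upper = mean(x) + t_crit * se)
ci

# Cross-check against the interval reported by t.test()
t.test(x, conf.level = 0.95)$conf.int
```

Note that this interval narrows as n grows, so it says nothing about where individual houses fall; that is what the empirical percentiles describe.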

* Finally, provide the empirical 5th percentile and 95th percentile of the data.

quantile(train$GrLivArea, c(.05, .95))
##     5%    95% 
##  848.0 2466.1

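We are 95% confident that the mean of GrLivArea is between 1488.487 and 1542.440. This interval describes the mean, not the spread of individual values. The exponential distribution is not a good fit: its 5th and 95th percentiles (77.73 and 4539.92) lie far outside the empirical ones (848.0 and 2466.1), because the fitted density puts most of its mass near zero and has a much heavier right tail, whereas the empirical data are centered around 1,500 square feet.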

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Goal

  • Predict the sales price for each house

Evaluation on kaggle

  • Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
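The metric described above can be sketched as a small helper for local validation. The function name `rmse_log` and the example prices are made up for illustration; this is not Kaggle's own scoring code.

```r
# RMSE between the log of the predictions and the log of the observed prices
rmse_log <- function(predicted, observed) {
  stopifnot(length(predicted) == length(observed),
            all(predicted > 0), all(observed > 0))
  sqrt(mean((log(predicted) - log(observed))^2))
}

# Made-up prices: the same 10% over-prediction on a cheap and an expensive house
observed  <- c(100000, 500000)
predicted <- observed * 1.10

rmse_log(predicted, observed)
log(1.10)   # each house contributes the same log error
```

Because the error is measured on the log scale, a 10% miss costs the same on both houses, which is the point made in the evaluation note.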

The Workflow

  • Check the train and test data and merge them
  • Separate the categorical and numerical variables into independent data frames
  • Check for NA values
  • Impute missing values
  • Assemble the categorical/numerical data and split it into train and test sets
  • Investigate the target, SalePrice, and log-transform it
  • Create a first multiple regression model
  • Eliminate variables according to the model result
  • Apply the stepAIC algorithm
  • Predict
  • Submit to Kaggle
#str(train)
#str(test)

# combine train and test so both are cleaned together; label marks the origin (0 = train, 1 = test)
test$SalePrice <- 0
train$label <- 0
test$label <- 1
df.merge <- rbind(train,test)
# Divide the data into categorical/numerical
df.cat <- df.merge[, sapply(df.merge, is.character)]
df.num <- df.merge[, sapply(df.merge, is.numeric)]

# Check NA
library(Amelia)

# visualization
missmap(df.merge, 
        main = "Missing Map", 
        col =c("yellow","black"),
        legend = FALSE)

NA.check <- function(data){
    index <- sapply(data, function(x) sum(is.na(x)))
    new.data <- data.frame(index = names(data),
                           na.values = index)
    new.data$perc <- round(new.data$na.values/nrow(data),4)
    arrange(new.data[new.data$na.values > 0,], desc(na.values))
}

# table
na.cat <- NA.check(df.cat)
na.num <- NA.check(df.num)
kable(na.cat)
index        na.values    perc
PoolQC            2909  0.9966
MiscFeature       2814  0.9640
Alley             2721  0.9322
Fence             2348  0.8044
FireplaceQu       1420  0.4865
GarageFinish       159  0.0545
GarageQual         159  0.0545
GarageCond         159  0.0545
GarageType         157  0.0538
BsmtCond            82  0.0281
BsmtExposure        82  0.0281
BsmtQual            81  0.0277
BsmtFinType2        80  0.0274
BsmtFinType1        79  0.0271
MasVnrType          24  0.0082
MSZoning             4  0.0014
Utilities            2  0.0007
Functional           2  0.0007
Exterior1st          1  0.0003
Exterior2nd          1  0.0003
Electrical           1  0.0003
KitchenQual          1  0.0003
SaleType             1  0.0003
kable(na.num)
index        na.values    perc
LotFrontage        486  0.1665
GarageYrBlt        159  0.0545
MasVnrArea          23  0.0079
BsmtFullBath         2  0.0007
BsmtHalfBath         2  0.0007
BsmtFinSF1           1  0.0003
BsmtFinSF2           1  0.0003
BsmtUnfSF            1  0.0003
TotalBsmtSF          1  0.0003
GarageCars           1  0.0003
GarageArea           1  0.0003

For categorical variables, we drop the ones that are mostly missing, encode the rest as integers, and replace the remaining NA values with 0. For numerical NA values, we impute the column mean.

# Given the NA table above, we keep the categorical variables that retain roughly 95% or more of their values (an arbitrary cutoff) and drop the five with the most missing values.
df.cat <- subset(df.cat, 
                 select = -c(PoolQC, MiscFeature, Alley, Fence, FireplaceQu))

# Convert the categorical values to factors, then to integer codes, and replace NAs with 0
replace.cat <- function(df) {
    df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)
    df[sapply(df, is.factor)] <- lapply(df[sapply(df, is.factor)], as.integer)
    df[is.na(df)] <- 0
    df
}

df.cat <- replace.cat(df.cat)
NA.check(df.cat)
## [1] index     na.values perc     
## <0 rows> (or 0-length row.names)
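To make the effect of this encoding concrete, here is a small sketch that applies the same helper to a made-up data frame (the values are illustrative, not rows from the Kaggle data).

```r
# Same helper as above, repeated here so the sketch runs on its own
replace.cat <- function(df) {
    df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)
    df[sapply(df, is.factor)] <- lapply(df[sapply(df, is.factor)], as.integer)
    df[is.na(df)] <- 0
    df
}

# Made-up categorical columns with missing values
toy <- data.frame(GarageType = c("Attchd", NA, "Detchd", "Attchd"),
                  BsmtQual   = c("Gd", "TA", NA, "Ex"),
                  stringsAsFactors = FALSE)

encoded <- replace.cat(toy)
encoded
```

The integer codes follow the alphabetical order of the factor levels, and 0 acts as the code for "missing". A linear model will treat these codes as numeric, so the encoding imposes an ordering on the categories that may not be meaningful.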
# For numerical missing values, we impute with mean
summary(df.num)
##        Id           MSSubClass      LotFrontage        LotArea      
##  Min.   :   1.0   Min.   : 20.00   Min.   : 21.00   Min.   :  1300  
##  1st Qu.: 730.5   1st Qu.: 20.00   1st Qu.: 59.00   1st Qu.:  7478  
##  Median :1460.0   Median : 50.00   Median : 68.00   Median :  9453  
##  Mean   :1460.0   Mean   : 57.14   Mean   : 69.31   Mean   : 10168  
##  3rd Qu.:2189.5   3rd Qu.: 70.00   3rd Qu.: 80.00   3rd Qu.: 11570  
##  Max.   :2919.0   Max.   :190.00   Max.   :313.00   Max.   :215245  
##                                    NA's   :486                      
##   OverallQual      OverallCond      YearBuilt     YearRemodAdd 
##  Min.   : 1.000   Min.   :1.000   Min.   :1872   Min.   :1950  
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954   1st Qu.:1965  
##  Median : 6.000   Median :5.000   Median :1973   Median :1993  
##  Mean   : 6.089   Mean   :5.565   Mean   :1971   Mean   :1984  
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2001   3rd Qu.:2004  
##  Max.   :10.000   Max.   :9.000   Max.   :2010   Max.   :2010  
##                                                                
##    MasVnrArea       BsmtFinSF1       BsmtFinSF2        BsmtUnfSF     
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.00   Min.   :   0.0  
##  1st Qu.:   0.0   1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.: 220.0  
##  Median :   0.0   Median : 368.5   Median :   0.00   Median : 467.0  
##  Mean   : 102.2   Mean   : 441.4   Mean   :  49.58   Mean   : 560.8  
##  3rd Qu.: 164.0   3rd Qu.: 733.0   3rd Qu.:   0.00   3rd Qu.: 805.5  
##  Max.   :1600.0   Max.   :5644.0   Max.   :1526.00   Max.   :2336.0  
##  NA's   :23       NA's   :1        NA's   :1         NA's   :1       
##   TotalBsmtSF       X1stFlrSF      X2ndFlrSF       LowQualFinSF     
##  Min.   :   0.0   Min.   : 334   Min.   :   0.0   Min.   :   0.000  
##  1st Qu.: 793.0   1st Qu.: 876   1st Qu.:   0.0   1st Qu.:   0.000  
##  Median : 989.5   Median :1082   Median :   0.0   Median :   0.000  
##  Mean   :1051.8   Mean   :1160   Mean   : 336.5   Mean   :   4.694  
##  3rd Qu.:1302.0   3rd Qu.:1388   3rd Qu.: 704.0   3rd Qu.:   0.000  
##  Max.   :6110.0   Max.   :5095   Max.   :2065.0   Max.   :1064.000  
##  NA's   :1                                                          
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1126   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1444   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1501   Mean   :0.4299   Mean   :0.06136   Mean   :1.568  
##  3rd Qu.:1744   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :4.000  
##                 NA's   :2        NA's   :2                        
##     HalfBath       BedroomAbvGr   KitchenAbvGr    TotRmsAbvGrd   
##  Min.   :0.0000   Min.   :0.00   Min.   :0.000   Min.   : 2.000  
##  1st Qu.:0.0000   1st Qu.:2.00   1st Qu.:1.000   1st Qu.: 5.000  
##  Median :0.0000   Median :3.00   Median :1.000   Median : 6.000  
##  Mean   :0.3803   Mean   :2.86   Mean   :1.045   Mean   : 6.452  
##  3rd Qu.:1.0000   3rd Qu.:3.00   3rd Qu.:1.000   3rd Qu.: 7.000  
##  Max.   :2.0000   Max.   :8.00   Max.   :3.000   Max.   :15.000  
##                                                                  
##    Fireplaces      GarageYrBlt     GarageCars      GarageArea    
##  Min.   :0.0000   Min.   :1895   Min.   :0.000   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:1960   1st Qu.:1.000   1st Qu.: 320.0  
##  Median :1.0000   Median :1979   Median :2.000   Median : 480.0  
##  Mean   :0.5971   Mean   :1978   Mean   :1.767   Mean   : 472.9  
##  3rd Qu.:1.0000   3rd Qu.:2002   3rd Qu.:2.000   3rd Qu.: 576.0  
##  Max.   :4.0000   Max.   :2207   Max.   :5.000   Max.   :1488.0  
##                   NA's   :159    NA's   :1       NA's   :1       
##    WoodDeckSF       OpenPorchSF     EnclosedPorch      X3SsnPorch     
##  Min.   :   0.00   Min.   :  0.00   Min.   :   0.0   Min.   :  0.000  
##  1st Qu.:   0.00   1st Qu.:  0.00   1st Qu.:   0.0   1st Qu.:  0.000  
##  Median :   0.00   Median : 26.00   Median :   0.0   Median :  0.000  
##  Mean   :  93.71   Mean   : 47.49   Mean   :  23.1   Mean   :  2.602  
##  3rd Qu.: 168.00   3rd Qu.: 70.00   3rd Qu.:   0.0   3rd Qu.:  0.000  
##  Max.   :1424.00   Max.   :742.00   Max.   :1012.0   Max.   :508.000  
##                                                                       
##   ScreenPorch        PoolArea          MiscVal             MoSold      
##  Min.   :  0.00   Min.   :  0.000   Min.   :    0.00   Min.   : 1.000  
##  1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:    0.00   1st Qu.: 4.000  
##  Median :  0.00   Median :  0.000   Median :    0.00   Median : 6.000  
##  Mean   : 16.06   Mean   :  2.252   Mean   :   50.83   Mean   : 6.213  
##  3rd Qu.:  0.00   3rd Qu.:  0.000   3rd Qu.:    0.00   3rd Qu.: 8.000  
##  Max.   :576.00   Max.   :800.000   Max.   :17000.00   Max.   :12.000  
##                                                                        
##      YrSold       SalePrice          label       
##  Min.   :2006   Min.   :     0   Min.   :0.0000  
##  1st Qu.:2007   1st Qu.:     0   1st Qu.:0.0000  
##  Median :2008   Median : 34900   Median :0.0000  
##  Mean   :2008   Mean   : 90492   Mean   :0.4998  
##  3rd Qu.:2009   3rd Qu.:163000   3rd Qu.:1.0000  
##  Max.   :2010   Max.   :755000   Max.   :1.0000  
## 
df.num <- df.num %>% 
  mutate_all(~ifelse(is.na(.x), 
                     mean(.x, na.rm = TRUE), 
                     .x))

summary(df.num)
##        Id           MSSubClass      LotFrontage        LotArea      
##  Min.   :   1.0   Min.   : 20.00   Min.   : 21.00   Min.   :  1300  
##  1st Qu.: 730.5   1st Qu.: 20.00   1st Qu.: 60.00   1st Qu.:  7478  
##  Median :1460.0   Median : 50.00   Median : 69.31   Median :  9453  
##  Mean   :1460.0   Mean   : 57.14   Mean   : 69.31   Mean   : 10168  
##  3rd Qu.:2189.5   3rd Qu.: 70.00   3rd Qu.: 78.00   3rd Qu.: 11570  
##  Max.   :2919.0   Max.   :190.00   Max.   :313.00   Max.   :215245  
##   OverallQual      OverallCond      YearBuilt     YearRemodAdd 
##  Min.   : 1.000   Min.   :1.000   Min.   :1872   Min.   :1950  
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954   1st Qu.:1965  
##  Median : 6.000   Median :5.000   Median :1973   Median :1993  
##  Mean   : 6.089   Mean   :5.565   Mean   :1971   Mean   :1984  
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2001   3rd Qu.:2004  
##  Max.   :10.000   Max.   :9.000   Max.   :2010   Max.   :2010  
##    MasVnrArea       BsmtFinSF1       BsmtFinSF2        BsmtUnfSF     
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.00   Min.   :   0.0  
##  1st Qu.:   0.0   1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.: 220.0  
##  Median :   0.0   Median : 369.0   Median :   0.00   Median : 467.0  
##  Mean   : 102.2   Mean   : 441.4   Mean   :  49.58   Mean   : 560.8  
##  3rd Qu.: 163.5   3rd Qu.: 733.0   3rd Qu.:   0.00   3rd Qu.: 805.0  
##  Max.   :1600.0   Max.   :5644.0   Max.   :1526.00   Max.   :2336.0  
##   TotalBsmtSF     X1stFlrSF      X2ndFlrSF       LowQualFinSF     
##  Min.   :   0   Min.   : 334   Min.   :   0.0   Min.   :   0.000  
##  1st Qu.: 793   1st Qu.: 876   1st Qu.:   0.0   1st Qu.:   0.000  
##  Median : 990   Median :1082   Median :   0.0   Median :   0.000  
##  Mean   :1052   Mean   :1160   Mean   : 336.5   Mean   :   4.694  
##  3rd Qu.:1302   3rd Qu.:1388   3rd Qu.: 704.0   3rd Qu.:   0.000  
##  Max.   :6110   Max.   :5095   Max.   :2065.0   Max.   :1064.000  
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1126   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1444   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1501   Mean   :0.4299   Mean   :0.06136   Mean   :1.568  
##  3rd Qu.:1744   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :4.000  
##     HalfBath       BedroomAbvGr   KitchenAbvGr    TotRmsAbvGrd   
##  Min.   :0.0000   Min.   :0.00   Min.   :0.000   Min.   : 2.000  
##  1st Qu.:0.0000   1st Qu.:2.00   1st Qu.:1.000   1st Qu.: 5.000  
##  Median :0.0000   Median :3.00   Median :1.000   Median : 6.000  
##  Mean   :0.3803   Mean   :2.86   Mean   :1.045   Mean   : 6.452  
##  3rd Qu.:1.0000   3rd Qu.:3.00   3rd Qu.:1.000   3rd Qu.: 7.000  
##  Max.   :2.0000   Max.   :8.00   Max.   :3.000   Max.   :15.000  
##    Fireplaces      GarageYrBlt     GarageCars      GarageArea    
##  Min.   :0.0000   Min.   :1895   Min.   :0.000   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:1962   1st Qu.:1.000   1st Qu.: 320.0  
##  Median :1.0000   Median :1978   Median :2.000   Median : 480.0  
##  Mean   :0.5971   Mean   :1978   Mean   :1.767   Mean   : 472.9  
##  3rd Qu.:1.0000   3rd Qu.:2001   3rd Qu.:2.000   3rd Qu.: 576.0  
##  Max.   :4.0000   Max.   :2207   Max.   :5.000   Max.   :1488.0  
##    WoodDeckSF       OpenPorchSF     EnclosedPorch      X3SsnPorch     
##  Min.   :   0.00   Min.   :  0.00   Min.   :   0.0   Min.   :  0.000  
##  1st Qu.:   0.00   1st Qu.:  0.00   1st Qu.:   0.0   1st Qu.:  0.000  
##  Median :   0.00   Median : 26.00   Median :   0.0   Median :  0.000  
##  Mean   :  93.71   Mean   : 47.49   Mean   :  23.1   Mean   :  2.602  
##  3rd Qu.: 168.00   3rd Qu.: 70.00   3rd Qu.:   0.0   3rd Qu.:  0.000  
##  Max.   :1424.00   Max.   :742.00   Max.   :1012.0   Max.   :508.000  
##   ScreenPorch        PoolArea          MiscVal             MoSold      
##  Min.   :  0.00   Min.   :  0.000   Min.   :    0.00   Min.   : 1.000  
##  1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:    0.00   1st Qu.: 4.000  
##  Median :  0.00   Median :  0.000   Median :    0.00   Median : 6.000  
##  Mean   : 16.06   Mean   :  2.252   Mean   :   50.83   Mean   : 6.213  
##  3rd Qu.:  0.00   3rd Qu.:  0.000   3rd Qu.:    0.00   3rd Qu.: 8.000  
##  Max.   :576.00   Max.   :800.000   Max.   :17000.00   Max.   :12.000  
##      YrSold       SalePrice          label       
##  Min.   :2006   Min.   :     0   Min.   :0.0000  
##  1st Qu.:2007   1st Qu.:     0   1st Qu.:0.0000  
##  Median :2008   Median : 34900   Median :0.0000  
##  Mean   :2008   Mean   : 90492   Mean   :0.4998  
##  3rd Qu.:2009   3rd Qu.:163000   3rd Qu.:1.0000  
##  Max.   :2010   Max.   :755000   Max.   :1.0000
NA.check(df.num)
## [1] index     na.values perc     
## <0 rows> (or 0-length row.names)
# Assemble data
full.updated <- cbind(df.num, df.cat)

# Split data
train <- full.updated %>%
  filter(label == 0) %>%
  dplyr::select(-label)

test <- full.updated %>%
  filter(label == 1) %>%
  dplyr::select(-label, -SalePrice)

Correlation

# Visualization
train %>%
  dplyr::select(-Id) %>%
  cor() %>%
  ggcorr()

# Table
train %>%
  dplyr::select(-Id) %>%
  cor() %>%
  as.data.frame() %>%
  mutate(cor.names = colnames(.)) %>%
  dplyr::select(cor.names, SalePrice) %>%
  arrange(desc(SalePrice)) %>%
  kable(.)
cor.names SalePrice
SalePrice 1.0000000
OverallQual 0.7909816
GrLivArea 0.7086245
GarageCars 0.6404092
GarageArea 0.6234314
TotalBsmtSF 0.6135806
X1stFlrSF 0.6058522
FullBath 0.5606638
TotRmsAbvGrd 0.5337232
YearBuilt 0.5228973
YearRemodAdd 0.5071010
MasVnrArea 0.4752097
GarageYrBlt 0.4710619
Fireplaces 0.4669288
BsmtFinSF1 0.3864198
Foundation 0.3824790
LotFrontage 0.3348202
WoodDeckSF 0.3244134
X2ndFlrSF 0.3193338
OpenPorchSF 0.3158562
HalfBath 0.2841077
GarageCond 0.2757815
LotArea 0.2638434
GarageQual 0.2613470
CentralAir 0.2513282
Electrical 0.2339194
PavedDrive 0.2313570
BsmtFullBath 0.2271222
RoofStyle 0.2224053
BsmtUnfSF 0.2144791
SaleCondition 0.2130920
HouseStyle 0.1801626
Neighborhood 0.1709413
BedroomAbvGr 0.1682132
BsmtCond 0.1473674
RoofMatl 0.1323831
BsmtFinType2 0.1308142
ExterCond 0.1173027
Functional 0.1153279
ScreenPorch 0.1114466
Exterior2nd 0.1037655
Exterior1st 0.1035510
PoolArea 0.0924035
Condition1 0.0911549
LandSlope 0.0511522
MoSold 0.0464322
X3SsnPorch 0.0445837
Street 0.0410355
LandContour 0.0154532
Condition2 0.0075127
MasVnrType -0.0004878
BsmtFinSF2 -0.0113781
BsmtFinType1 -0.0132329
Utilities -0.0143143
BsmtHalfBath -0.0168442
MiscVal -0.0211896
LowQualFinSF -0.0256061
YrSold -0.0289226
SaleType -0.0503695
LotConfig -0.0673960
OverallCond -0.0778559
MSSubClass -0.0842841
BldgType -0.0855906
Heating -0.0988121
EnclosedPorch -0.1285780
KitchenAbvGr -0.1359074
MSZoning -0.1668722
BsmtExposure -0.1930786
GarageType -0.2238185
LotShape -0.2555799
GarageFinish -0.2924833
HeatingQC -0.4001775
BsmtQual -0.4388810
KitchenQual -0.5891888
ExterQual -0.6368837
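
The table above gives point estimates only. As a minimal sketch of how one pairwise correlation could be tested against zero with an 80% confidence interval, the block below uses `cor.test()` from base R. It runs on the built-in `mtcars` data as a stand-in (an assumption made so the example is self-contained), not on the housing data.

```r
# Stand-in data: the built-in mtcars set, NOT the Kaggle housing data.
ct <- cor.test(mtcars$mpg, mtcars$wt, conf.level = 0.80)

ct$estimate   # sample Pearson correlation
ct$conf.int   # 80% confidence interval for the correlation
ct$p.value    # two-sided test of H0: correlation = 0

# When several pairwise tests are run, a Bonferroni adjustment is one
# simple guard against familywise error (here assuming 3 tests in total).
p.adjust(ct$p.value, method = "bonferroni", n = 3)
```

On the housing data the same call would look like `cor.test(train$GrLivArea, train$SalePrice, conf.level = 0.80)`, repeated for each pair of interest.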
ggplot(train, aes(x = SalePrice, y = after_stat(density))) + 
  geom_histogram(fill = 'red',
                 alpha = 0.5,
                 bins = 15) +
  geom_density() +
  labs(title = 'Sales price histogram')

SalePrice details:

describe(train$SalePrice)
##    vars    n     mean      sd median  trimmed     mad   min    max  range
## X1    1 1460 180921.2 79442.5 163000 170783.3 56338.8 34900 755000 720100
##    skew kurtosis      se
## X1 1.88      6.5 2079.11
round(sd(train$SalePrice)) ^ 2 # approximate variance: sd is rounded before squaring
## [1] 6311190249

Before building the model, look at SalePrice. The histogram and summary above show that the sale price is skewed to the right (skew of 1.88, with the mean of 180,921 above the median of 163,000). Since all values are positive and the variance is large, the data may be a good candidate for a log transformation.

fit <- fitdistr(train$SalePrice, densfun="log-normal") 
fit
##      meanlog         sdlog    
##   12.024050901    0.399315046 
##  ( 0.010450552) ( 0.007389656)
SalePrice.log <- log(train$SalePrice)

sample.log <- sample(SalePrice.log, size = 1000, replace = TRUE, prob = NULL)
sample.origin <- sample(train$SalePrice, size = 1000, replace = TRUE, prob = NULL)

par(mfrow=c(1,2))
hist(sample.origin, col ="red")
hist(sample.log, col = "blue")

With the log transformation, the SalePrice distribution looks much closer to normal. Let’s add the transformed values to the dataset as a new column, SalePrice.log.
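
The effect of the transformation can also be checked numerically rather than only by eye. The sketch below is illustrative: it simulates prices from a log-normal distribution with roughly the parameters fitted above (an assumption; it does not read the housing data) and uses a small hypothetical helper, `skew()`, to compare skewness before and after taking logs.

```r
set.seed(1)
# Simulated prices (an assumption): log-normal with roughly the fitted parameters
price <- rlnorm(1460, meanlog = 12.02, sdlog = 0.40)

# Moment-based sample skewness (hypothetical helper, base R only)
skew <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew.raw <- skew(price)
skew.log <- skew(log(price))

# Skewness near 0 indicates a roughly symmetric distribution
c(raw = skew.raw, log = skew.log)
```

The same helper could be applied to `train$SalePrice` and `SalePrice.log` directly.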

# Append SalePrice.log and drop the raw SalePrice
d.train <- train %>%
  cbind(., SalePrice.log) %>%
  dplyr::select(-SalePrice)

# multiple regression
reg.lm <- lm(SalePrice.log ~ . , data = d.train)
summary(reg.lm)
## 
## Call:
## lm(formula = SalePrice.log ~ ., data = d.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.72503 -0.06237  0.00320  0.06696  0.54303 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.081e+01  5.754e+00   3.616 0.000310 ***
## Id            -8.974e-06  8.839e-06  -1.015 0.310149    
## MSSubClass    -2.239e-04  1.974e-04  -1.134 0.256984    
## LotFrontage   -4.996e-04  2.181e-04  -2.291 0.022126 *  
## LotArea        1.557e-06  4.624e-07   3.368 0.000779 ***
## OverallQual    7.023e-02  5.159e-03  13.614  < 2e-16 ***
## OverallCond    3.968e-02  4.549e-03   8.722  < 2e-16 ***
## YearBuilt      1.825e-03  3.427e-04   5.325 1.18e-07 ***
## YearRemodAdd   7.902e-04  2.954e-04   2.675 0.007570 ** 
## MasVnrArea     1.379e-05  2.632e-05   0.524 0.600378    
## BsmtFinSF1     3.777e-05  2.240e-05   1.686 0.092042 .  
## BsmtFinSF2     1.262e-04  3.437e-05   3.672 0.000250 ***
## BsmtUnfSF      2.870e-05  2.193e-05   1.309 0.190718    
## TotalBsmtSF           NA         NA      NA       NA    
## X1stFlrSF      2.092e-04  2.708e-05   7.725 2.13e-14 ***
## X2ndFlrSF      1.654e-04  2.094e-05   7.896 5.80e-15 ***
## LowQualFinSF   1.680e-04  8.160e-05   2.059 0.039660 *  
## GrLivArea             NA         NA      NA       NA    
## BsmtFullBath   5.729e-02  1.063e-02   5.391 8.21e-08 ***
## BsmtHalfBath   2.178e-02  1.671e-02   1.303 0.192741    
## FullBath       3.897e-02  1.169e-02   3.333 0.000883 ***
## HalfBath       1.794e-02  1.103e-02   1.626 0.104103    
## BedroomAbvGr   7.074e-03  7.244e-03   0.977 0.328919    
## KitchenAbvGr  -3.421e-02  2.187e-02  -1.564 0.118072    
## TotRmsAbvGrd   1.364e-02  5.091e-03   2.679 0.007465 ** 
## Fireplaces     3.457e-02  7.323e-03   4.720 2.59e-06 ***
## GarageYrBlt   -7.378e-04  3.051e-04  -2.418 0.015733 *  
## GarageCars     6.276e-02  1.211e-02   5.181 2.53e-07 ***
## GarageArea     4.230e-05  4.224e-05   1.001 0.316860    
## WoodDeckSF     1.077e-04  3.262e-05   3.303 0.000982 ***
## OpenPorchSF   -3.070e-05  6.215e-05  -0.494 0.621466    
## EnclosedPorch  1.518e-04  6.793e-05   2.235 0.025572 *  
## X3SsnPorch     1.782e-04  1.264e-04   1.410 0.158910    
## ScreenPorch    3.228e-04  6.989e-05   4.619 4.21e-06 ***
## PoolArea      -2.753e-04  9.629e-05  -2.859 0.004316 ** 
## MiscVal       -9.565e-07  7.555e-06  -0.127 0.899280    
## MoSold         9.798e-05  1.387e-03   0.071 0.943682    
## YrSold        -7.107e-03  2.842e-03  -2.500 0.012525 *  
## MSZoning      -1.774e-02  6.584e-03  -2.695 0.007125 ** 
## Street         1.909e-01  6.088e-02   3.135 0.001752 ** 
## LotShape      -6.274e-03  2.874e-03  -2.183 0.029209 *  
## LandContour    1.062e-02  5.844e-03   1.818 0.069328 .  
## Utilities     -1.750e-01  1.447e-01  -1.209 0.226909    
## LotConfig     -1.887e-03  2.382e-03  -0.792 0.428423    
## LandSlope      3.529e-02  1.665e-02   2.119 0.034227 *  
## Neighborhood   8.286e-04  6.842e-04   1.211 0.226096    
## Condition1     1.939e-03  4.417e-03   0.439 0.660793    
## Condition2    -4.587e-02  1.459e-02  -3.144 0.001704 ** 
## BldgType      -1.178e-02  6.499e-03  -1.813 0.070001 .  
## HouseStyle    -4.533e-03  2.847e-03  -1.592 0.111560    
## RoofStyle      4.944e-03  4.890e-03   1.011 0.312102    
## RoofMatl       9.324e-03  6.542e-03   1.425 0.154286    
## Exterior1st   -3.620e-03  2.270e-03  -1.595 0.110929    
## Exterior2nd    3.370e-03  2.053e-03   1.642 0.100831    
## MasVnrType     4.663e-03  6.414e-03   0.727 0.467331    
## ExterQual     -9.145e-03  8.571e-03  -1.067 0.286203    
## ExterCond      1.072e-02  5.450e-03   1.968 0.049302 *  
## Foundation     1.357e-02  7.326e-03   1.853 0.064124 .  
## BsmtQual      -1.325e-02  5.900e-03  -2.245 0.024920 *  
## BsmtCond       1.250e-02  5.702e-03   2.192 0.028534 *  
## BsmtExposure  -7.357e-03  3.812e-03  -1.930 0.053836 .  
## BsmtFinType1  -6.083e-03  2.707e-03  -2.247 0.024806 *  
## BsmtFinType2   1.603e-02  4.881e-03   3.285 0.001045 ** 
## Heating       -3.864e-03  1.399e-02  -0.276 0.782411    
## HeatingQC     -7.933e-03  2.657e-03  -2.986 0.002878 ** 
## CentralAir     7.360e-02  1.953e-02   3.769 0.000171 ***
## Electrical    -1.220e-03  3.957e-03  -0.308 0.757965    
## KitchenQual   -2.337e-02  6.256e-03  -3.735 0.000195 ***
## Functional     1.744e-02  4.107e-03   4.246 2.32e-05 ***
## GarageType    -4.013e-03  2.716e-03  -1.477 0.139780    
## GarageFinish  -9.937e-03  6.209e-03  -1.600 0.109740    
## GarageQual    -5.157e-04  7.270e-03  -0.071 0.943460    
## GarageCond     8.531e-03  7.633e-03   1.118 0.263915    
## PavedDrive     2.362e-02  8.998e-03   2.625 0.008772 ** 
## SaleType      -1.493e-03  2.491e-03  -0.599 0.548950    
## SaleCondition  2.298e-02  3.594e-03   6.393 2.21e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1381 on 1386 degrees of freedom
## Multiple R-squared:  0.8865, Adjusted R-squared:  0.8805 
## F-statistic: 148.3 on 73 and 1386 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(reg.lm)
## Warning: not plotting observations with leverage one:
##   945


The full regression model has an adjusted R-squared of 0.8805 and an F-statistic of 148.3, which corresponds to a significant p-value (< 2.2e-16). However, many of the individual predictors are not significant at the 0.05 level.
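
Two details in the output above are worth a short illustration: the coefficients reported as NA "because of singularities", and the warning about an observation with leverage one. The sketch below reproduces both on simulated toy data (all names here are hypothetical). A predictor that is an exact sum of other predictors gets an NA coefficient, which is consistent with TotalBsmtSF and GrLivArea being sums of the component area columns. A predictor that is non-zero for only one row gives that row a leverage of 1.

```r
set.seed(42)
n <- 50
a <- rnorm(n)
b <- rnorm(n)
total <- a + b                 # exact linear combination of a and b
flag <- c(1, rep(0, n - 1))    # non-zero for a single observation only
y <- 1 + 2 * a - b + 3 * flag + rnorm(n, sd = 0.1)

toy.lm <- lm(y ~ a + b + total + flag)

coef(toy.lm)            # 'total' is NA: it adds nothing beyond a and b
alias(toy.lm)$Complete  # expresses the aliased term via the other terms
hatvalues(toy.lm)[1]    # leverage of the single flagged row
```

Running `alias(reg.lm)` on the housing model would show which columns the two NA terms depend on.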

Feature selection

new.lm <- lm(SalePrice.log~LotFrontage + LotArea + OverallQual + OverallCond + YearBuilt + YearRemodAdd +
               BsmtFinSF2 + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath + FullBath + TotRmsAbvGrd + 
               Fireplaces + GarageYrBlt + GarageCars + WoodDeckSF + EnclosedPorch + ScreenPorch + PoolArea + 
               YrSold + MSZoning + Street + LotShape + LandSlope + Condition2 + ExterCond + BsmtQual + BsmtCond + 
               BsmtFinType1 + BsmtFinType2 + HeatingQC + CentralAir + KitchenQual + Functional + PavedDrive + SaleCondition ,
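               # Note: SalePrice.log is not a column of train; lm() finds the vector in the global environment
   data = train)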

summary(new.lm)
## 
## Call:
## lm(formula = SalePrice.log ~ LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + BsmtFinSF2 + X1stFlrSF + 
##     X2ndFlrSF + LowQualFinSF + BsmtFullBath + FullBath + TotRmsAbvGrd + 
##     Fireplaces + GarageYrBlt + GarageCars + WoodDeckSF + EnclosedPorch + 
##     ScreenPorch + PoolArea + YrSold + MSZoning + Street + LotShape + 
##     LandSlope + Condition2 + ExterCond + BsmtQual + BsmtCond + 
##     BsmtFinType1 + BsmtFinType2 + HeatingQC + CentralAir + KitchenQual + 
##     Functional + PavedDrive + SaleCondition, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.82405 -0.06765  0.00430  0.07434  0.61319 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.926e+01  5.673e+00   3.395 0.000705 ***
## LotFrontage    8.707e-05  1.972e-04   0.442 0.658843    
## LotArea        1.997e-06  4.579e-07   4.361 1.39e-05 ***
## OverallQual    7.455e-02  4.902e-03  15.207  < 2e-16 ***
## OverallCond    4.068e-02  4.501e-03   9.038  < 2e-16 ***
## YearBuilt      2.242e-03  2.945e-04   7.616 4.76e-14 ***
## YearRemodAdd   7.179e-04  2.879e-04   2.493 0.012773 *  
## BsmtFinSF2     9.700e-05  3.076e-05   3.154 0.001644 ** 
## X1stFlrSF      2.604e-04  1.919e-05  13.566  < 2e-16 ***
## X2ndFlrSF      1.712e-04  1.666e-05  10.278  < 2e-16 ***
## LowQualFinSF   1.656e-04  8.075e-05   2.051 0.040436 *  
## BsmtFullBath   5.444e-02  8.579e-03   6.346 2.97e-10 ***
## FullBath       2.215e-02  1.037e-02   2.136 0.032838 *  
## TotRmsAbvGrd   1.429e-02  4.250e-03   3.362 0.000794 ***
## Fireplaces     3.498e-02  7.131e-03   4.905 1.04e-06 ***
## GarageYrBlt   -6.078e-04  2.773e-04  -2.192 0.028537 *  
## GarageCars     7.127e-02  6.942e-03  10.266  < 2e-16 ***
## WoodDeckSF     1.377e-04  3.234e-05   4.259 2.19e-05 ***
## EnclosedPorch  1.754e-04  6.796e-05   2.581 0.009964 ** 
## ScreenPorch    3.663e-04  6.950e-05   5.271 1.57e-07 ***
## PoolArea      -3.358e-04  9.603e-05  -3.497 0.000485 ***
## YrSold        -6.930e-03  2.814e-03  -2.463 0.013893 *  
## MSZoning      -2.304e-02  6.313e-03  -3.650 0.000272 ***
## Street         2.028e-01  6.018e-02   3.370 0.000773 ***
## LotShape      -8.841e-03  2.783e-03  -3.177 0.001521 ** 
## LandSlope      2.655e-02  1.541e-02   1.724 0.085014 .  
## Condition2    -4.741e-02  1.440e-02  -3.293 0.001015 ** 
## ExterCond      8.432e-03  5.426e-03   1.554 0.120377    
## BsmtQual      -1.351e-02  5.509e-03  -2.453 0.014283 *  
## BsmtCond       1.297e-02  5.580e-03   2.325 0.020209 *  
## BsmtFinType1  -7.593e-03  2.413e-03  -3.146 0.001687 ** 
## BsmtFinType2   1.592e-02  4.607e-03   3.455 0.000566 ***
## HeatingQC     -9.981e-03  2.572e-03  -3.880 0.000109 ***
## CentralAir     8.385e-02  1.807e-02   4.641 3.78e-06 ***
## KitchenQual   -2.762e-02  5.833e-03  -4.736 2.40e-06 ***
## Functional     1.902e-02  4.046e-03   4.702 2.83e-06 ***
## PavedDrive     2.484e-02  8.839e-03   2.810 0.005020 ** 
## SaleCondition  2.438e-02  3.518e-03   6.928 6.44e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1404 on 1422 degrees of freedom
## Multiple R-squared:  0.8795, Adjusted R-squared:  0.8764 
## F-statistic: 280.6 on 37 and 1422 DF,  p-value: < 2.2e-16

Keeping only the 37 predictors that were significant at the 0.05 level in the full model decreased the adjusted R-squared from 0.8805 to 0.8764.
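
A reduced predictor list like the one above can be derived from the coefficient table instead of being typed by hand. The sketch below shows one way to do that. It uses the built-in `mtcars` data as a stand-in and a loose 0.3 cut-off (both assumptions, chosen only so the toy example keeps a few terms), and it relies on every predictor being numeric so that each variable has exactly one coefficient.

```r
# Stand-in data: built-in mtcars, NOT the housing data.
full.lm <- lm(mpg ~ ., data = mtcars)

# Coefficient table: one row per term, with a "Pr(>|t|)" column
coefs <- summary(full.lm)$coefficients
p.vals <- coefs[rownames(coefs) != "(Intercept)", "Pr(>|t|)"]

# Keep terms below the chosen cut-off (0.3 here only because mtcars is tiny)
keep <- names(p.vals)[p.vals < 0.3]

# Rebuild the formula from the retained names and refit
reduced.lm <- lm(reformulate(keep, response = "mpg"), data = mtcars)

c(full = summary(full.lm)$adj.r.squared,
  reduced = summary(reduced.lm)$adj.r.squared)
```

Filtering on p-values from a single fit is a blunt tool, since p-values shift once correlated predictors are removed.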

Next, we use the stepAIC() function from the MASS package, which selects terms stepwise by AIC.

step.lm <- stepAIC(reg.lm, trace=FALSE)
summary(step.lm)
## 
## Call:
## lm(formula = SalePrice.log ~ LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + BsmtFinSF1 + BsmtFinSF2 + 
##     BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath + 
##     FullBath + HalfBath + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + 
##     GarageYrBlt + GarageCars + WoodDeckSF + EnclosedPorch + X3SsnPorch + 
##     ScreenPorch + PoolArea + YrSold + MSZoning + Street + LotShape + 
##     LandContour + LandSlope + Condition2 + BldgType + HouseStyle + 
##     RoofMatl + Exterior1st + Exterior2nd + ExterCond + Foundation + 
##     BsmtQual + BsmtCond + BsmtExposure + BsmtFinType1 + BsmtFinType2 + 
##     HeatingQC + CentralAir + KitchenQual + Functional + GarageType + 
##     GarageFinish + GarageCond + PavedDrive + SaleCondition, data = d.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.75367 -0.06306  0.00364  0.06753  0.55406 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.966e+01  5.611e+00   3.505 0.000472 ***
## LotFrontage   -4.345e-04  2.133e-04  -2.037 0.041800 *  
## LotArea        1.543e-06  4.559e-07   3.385 0.000730 ***
## OverallQual    7.187e-02  4.973e-03  14.452  < 2e-16 ***
## OverallCond    4.014e-02  4.485e-03   8.949  < 2e-16 ***
## YearBuilt      1.903e-03  3.309e-04   5.749 1.10e-08 ***
## YearRemodAdd   7.789e-04  2.879e-04   2.705 0.006904 ** 
## BsmtFinSF1     5.076e-05  2.131e-05   2.382 0.017369 *  
## BsmtFinSF2     1.359e-04  3.356e-05   4.050 5.40e-05 ***
## BsmtUnfSF      3.705e-05  2.110e-05   1.756 0.079373 .  
## X1stFlrSF      2.136e-04  2.607e-05   8.196 5.52e-16 ***
## X2ndFlrSF      1.648e-04  1.973e-05   8.357  < 2e-16 ***
## LowQualFinSF   1.723e-04  8.046e-05   2.141 0.032453 *  
## BsmtFullBath   5.151e-02  9.993e-03   5.154 2.91e-07 ***
## FullBath       3.539e-02  1.131e-02   3.130 0.001782 ** 
## HalfBath       1.615e-02  1.085e-02   1.489 0.136823    
## KitchenAbvGr  -3.630e-02  2.093e-02  -1.734 0.083126 .  
## TotRmsAbvGrd   1.587e-02  4.501e-03   3.526 0.000436 ***
## Fireplaces     3.255e-02  7.193e-03   4.525 6.55e-06 ***
## GarageYrBlt   -6.903e-04  2.889e-04  -2.389 0.017021 *  
## GarageCars     7.302e-02  8.331e-03   8.765  < 2e-16 ***
## WoodDeckSF     1.129e-04  3.222e-05   3.505 0.000471 ***
## EnclosedPorch  1.662e-04  6.705e-05   2.480 0.013272 *  
## X3SsnPorch     1.951e-04  1.251e-04   1.559 0.119141    
## ScreenPorch    3.214e-04  6.892e-05   4.663 3.42e-06 ***
## PoolArea      -2.901e-04  9.513e-05  -3.049 0.002337 ** 
## YrSold        -6.725e-03  2.777e-03  -2.422 0.015569 *  
## MSZoning      -2.104e-02  6.276e-03  -3.352 0.000824 ***
## Street         1.741e-01  5.958e-02   2.922 0.003529 ** 
## LotShape      -6.832e-03  2.772e-03  -2.464 0.013839 *  
## LandContour    1.110e-02  5.775e-03   1.921 0.054877 .  
## LandSlope      3.371e-02  1.644e-02   2.051 0.040465 *  
## Condition2    -4.441e-02  1.421e-02  -3.126 0.001811 ** 
## BldgType      -1.853e-02  3.865e-03  -4.795 1.80e-06 ***
## HouseStyle    -5.611e-03  2.521e-03  -2.225 0.026230 *  
## RoofMatl       9.334e-03  6.462e-03   1.445 0.148799    
## Exterior1st   -3.381e-03  2.231e-03  -1.515 0.129935    
## Exterior2nd    2.990e-03  2.011e-03   1.487 0.137317    
## ExterCond      9.998e-03  5.342e-03   1.872 0.061452 .  
## Foundation     1.238e-02  7.211e-03   1.716 0.086345 .  
## BsmtQual      -1.534e-02  5.653e-03  -2.713 0.006745 ** 
## BsmtCond       1.149e-02  5.599e-03   2.052 0.040349 *  
## BsmtExposure  -5.888e-03  3.703e-03  -1.590 0.112022    
## BsmtFinType1  -6.356e-03  2.677e-03  -2.374 0.017718 *  
## BsmtFinType2   1.542e-02  4.806e-03   3.208 0.001366 ** 
## HeatingQC     -7.478e-03  2.611e-03  -2.864 0.004241 ** 
## CentralAir     7.587e-02  1.807e-02   4.198 2.87e-05 ***
## KitchenQual   -2.544e-02  5.803e-03  -4.384 1.25e-05 ***
## Functional     1.713e-02  4.051e-03   4.229 2.50e-05 ***
## GarageType    -3.795e-03  2.672e-03  -1.420 0.155697    
## GarageFinish  -9.027e-03  6.081e-03  -1.484 0.137929    
## GarageCond     7.841e-03  4.861e-03   1.613 0.106984    
## PavedDrive     2.405e-02  8.906e-03   2.701 0.007005 ** 
## SaleCondition  2.309e-02  3.475e-03   6.645 4.31e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1377 on 1406 degrees of freedom
## Multiple R-squared:  0.8854, Adjusted R-squared:  0.8811 
## F-statistic:   205 on 53 and 1406 DF,  p-value: < 2.2e-16
dim(reg.lm$model)
## [1] 1460   76
dim(step.lm$model)
## [1] 1460   54

The stepAIC() algorithm eliminated 22 variables from the original model (the model frame went from 76 to 54 columns, including the response) and increased the adjusted R-squared slightly, from 0.8805 to 0.8811.
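
For readers unfamiliar with `stepAIC()`, the following is a minimal sketch of the same workflow on the built-in `mtcars` data (a stand-in, not the housing data). It starts from a full model, runs a backward search, and compares the AIC and the retained predictors.

```r
library(MASS)  # provides stepAIC()

# Stand-in data: built-in mtcars, NOT the housing data.
full.lm <- lm(mpg ~ ., data = mtcars)
step.toy <- stepAIC(full.lm, direction = "backward", trace = FALSE)

# AIC penalizes each extra parameter, so the selected model should not
# have a higher AIC than the starting model.
AIC(full.lm)
AIC(step.toy)

# Predictors dropped by the search
setdiff(names(full.lm$model), names(step.toy$model))
```

Because the search optimizes AIC rather than p-values, it can keep terms that are not individually significant, as happens in the housing model above.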

Let’s refit the model on the variables that stepAIC() kept. The resulting fit is identical to step.lm.

#final model
final.lm <- lm(SalePrice.log ~., data = step.lm$model)
summary(final.lm)
## 
## Call:
## lm(formula = SalePrice.log ~ ., data = step.lm$model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.75367 -0.06306  0.00364  0.06753  0.55406 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.966e+01  5.611e+00   3.505 0.000472 ***
## LotFrontage   -4.345e-04  2.133e-04  -2.037 0.041800 *  
## LotArea        1.543e-06  4.559e-07   3.385 0.000730 ***
## OverallQual    7.187e-02  4.973e-03  14.452  < 2e-16 ***
## OverallCond    4.014e-02  4.485e-03   8.949  < 2e-16 ***
## YearBuilt      1.903e-03  3.309e-04   5.749 1.10e-08 ***
## YearRemodAdd   7.789e-04  2.879e-04   2.705 0.006904 ** 
## BsmtFinSF1     5.076e-05  2.131e-05   2.382 0.017369 *  
## BsmtFinSF2     1.359e-04  3.356e-05   4.050 5.40e-05 ***
## BsmtUnfSF      3.705e-05  2.110e-05   1.756 0.079373 .  
## X1stFlrSF      2.136e-04  2.607e-05   8.196 5.52e-16 ***
## X2ndFlrSF      1.648e-04  1.973e-05   8.357  < 2e-16 ***
## LowQualFinSF   1.723e-04  8.046e-05   2.141 0.032453 *  
## BsmtFullBath   5.151e-02  9.993e-03   5.154 2.91e-07 ***
## FullBath       3.539e-02  1.131e-02   3.130 0.001782 ** 
## HalfBath       1.615e-02  1.085e-02   1.489 0.136823    
## KitchenAbvGr  -3.630e-02  2.093e-02  -1.734 0.083126 .  
## TotRmsAbvGrd   1.587e-02  4.501e-03   3.526 0.000436 ***
## Fireplaces     3.255e-02  7.193e-03   4.525 6.55e-06 ***
## GarageYrBlt   -6.903e-04  2.889e-04  -2.389 0.017021 *  
## GarageCars     7.302e-02  8.331e-03   8.765  < 2e-16 ***
## WoodDeckSF     1.129e-04  3.222e-05   3.505 0.000471 ***
## EnclosedPorch  1.662e-04  6.705e-05   2.480 0.013272 *  
## X3SsnPorch     1.951e-04  1.251e-04   1.559 0.119141    
## ScreenPorch    3.214e-04  6.892e-05   4.663 3.42e-06 ***
## PoolArea      -2.901e-04  9.513e-05  -3.049 0.002337 ** 
## YrSold        -6.725e-03  2.777e-03  -2.422 0.015569 *  
## MSZoning      -2.104e-02  6.276e-03  -3.352 0.000824 ***
## Street         1.741e-01  5.958e-02   2.922 0.003529 ** 
## LotShape      -6.832e-03  2.772e-03  -2.464 0.013839 *  
## LandContour    1.110e-02  5.775e-03   1.921 0.054877 .  
## LandSlope      3.371e-02  1.644e-02   2.051 0.040465 *  
## Condition2    -4.441e-02  1.421e-02  -3.126 0.001811 ** 
## BldgType      -1.853e-02  3.865e-03  -4.795 1.80e-06 ***
## HouseStyle    -5.611e-03  2.521e-03  -2.225 0.026230 *  
## RoofMatl       9.334e-03  6.462e-03   1.445 0.148799    
## Exterior1st   -3.381e-03  2.231e-03  -1.515 0.129935    
## Exterior2nd    2.990e-03  2.011e-03   1.487 0.137317    
## ExterCond      9.998e-03  5.342e-03   1.872 0.061452 .  
## Foundation     1.238e-02  7.211e-03   1.716 0.086345 .  
## BsmtQual      -1.534e-02  5.653e-03  -2.713 0.006745 ** 
## BsmtCond       1.149e-02  5.599e-03   2.052 0.040349 *  
## BsmtExposure  -5.888e-03  3.703e-03  -1.590 0.112022    
## BsmtFinType1  -6.356e-03  2.677e-03  -2.374 0.017718 *  
## BsmtFinType2   1.542e-02  4.806e-03   3.208 0.001366 ** 
## HeatingQC     -7.478e-03  2.611e-03  -2.864 0.004241 ** 
## CentralAir     7.587e-02  1.807e-02   4.198 2.87e-05 ***
## KitchenQual   -2.544e-02  5.803e-03  -4.384 1.25e-05 ***
## Functional     1.713e-02  4.051e-03   4.229 2.50e-05 ***
## GarageType    -3.795e-03  2.672e-03  -1.420 0.155697    
## GarageFinish  -9.027e-03  6.081e-03  -1.484 0.137929    
## GarageCond     7.841e-03  4.861e-03   1.613 0.106984    
## PavedDrive     2.405e-02  8.906e-03   2.701 0.007005 ** 
## SaleCondition  2.309e-02  3.475e-03   6.645 4.31e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1377 on 1406 degrees of freedom
## Multiple R-squared:  0.8854, Adjusted R-squared:  0.8811 
## F-statistic:   205 on 53 and 1406 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(final.lm)

The Residuals vs Fitted plot shows no obvious pattern, so the residuals appear fairly random. In the Q-Q plot the points follow the reference line, with departures in the tails.
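
Departures in the tails of a Q-Q plot usually come from a handful of observations with large residuals. The sketch below shows one common way to list them, using standardized residuals and Cook's distance from base R. The data are simulated with one injected outlier (an assumption for illustration); on the housing model the same calls would be applied to `final.lm`.

```r
set.seed(7)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[50] <- y[50] + 10          # inject one gross outlier (simulated)

toy.lm <- lm(y ~ x)

# Standardized residuals; |value| > 3 is a common rule of thumb
std.res <- rstandard(toy.lm)
flagged <- which(abs(std.res) > 3)
flagged

# Cook's distance combines residual size and leverage
head(sort(cooks.distance(toy.lm), decreasing = TRUE), 3)
```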

Prediction

# Predict on the log scale, then back-transform to the original price scale
pred <- predict(final.lm, newdata = test)
pred.exp <- exp(pred)
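
Before submitting, the model can be scored locally on a hold-out split. The competition is evaluated on the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price, so the sketch below computes that quantity. It uses simulated data with hypothetical `area` and `price` columns (assumptions), not the housing data.

```r
set.seed(123)
# Simulated stand-in data (an assumption): price depends log-linearly on area
n <- 500
area <- runif(n, 500, 3000)
price <- exp(11 + 0.0004 * area + rnorm(n, sd = 0.15))
dat <- data.frame(area, price)

# Hold out 100 of the 500 rows for validation
idx <- sample(n, size = 400)
fit <- lm(log(price) ~ area, data = dat[idx, ])

# Predict on the log scale, then back-transform
pred.log <- predict(fit, newdata = dat[-idx, ])
pred.price <- exp(pred.log)

# Root mean squared error between log predictions and log actuals
rmsle <- sqrt(mean((log(pred.price) - log(dat$price[-idx]))^2))
rmsle
```

A hold-out score like this gives a rough preview of the leaderboard score and makes it possible to compare reg.lm and step.lm without spending a submission.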

Submission

#Id-SalePrice
Id <- test$Id
SalePrice <- pred.exp
submission <- data.frame(Id, SalePrice)
head(submission)
##     Id SalePrice
## 1 1461  115639.7
## 2 1462  152002.9
## 3 1463  169448.4
## 4 1464  198115.9
## 5 1465  185758.2
## 6 1466  174048.4
write.csv(submission, file = "submission.csv", row.names=FALSE)

Kaggle report

First attempt with reg.lm

Second attempt with step.lm