House Prices: Advanced Regression Techniques Competition

Introduction

The goal is to use Kaggle’s competition to apply advanced regression techniques to a data set of house prices.

Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right!

Independent variable - GrLivArea: Above grade (ground) living area square feet

Since there is a lower limit to how small a house can be but no upper limit to how large it can be, the distribution of above-ground square footage is skewed to the right.

Pick the dependent variable and define it as Y.

The dependent variable is the sale price.

Probability

The small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable.
(a) P(X>x | Y>y)

x <- quantile(X)[2]
y <- quantile(Y)[2]
probxgiveny <- filter(house_data, SalePrice > y) %>%
  count(GrLivArea>x)
probxgiveny <- probxgiveny$n[2]/sum(probxgiveny$n)
probxgiveny
## [1] 0.8712329

Of the houses whose sale price is above the 1st quartile (the top 75% of prices), the probability that the above-ground square footage is also above its 1st quartile is 0.871.

(b) P(X>x, Y>y) Joint Distribution
jointdist <- filter(house_data, SalePrice > y, GrLivArea>x)
jointdist <- nrow(jointdist)/nrow(house_data)
jointdist
## [1] 0.6534247

The probability that both the square footage and the sale price are in the top 75% is 0.653.

(c) P(X<x | Y>y)
probxlowgiveny <- filter(house_data, SalePrice > y) %>%
  count(GrLivArea<x)
probxlowgiveny <- probxlowgiveny$n[2]/sum(probxlowgiveny$n)
probxlowgiveny
## [1] 0.1287671

Of the houses whose sale price is in the top 75%, the probability that the square footage is in the lowest 25% is 0.129. This makes sense: given that the sale price is in the top 75%, the probability that the square footage is in the lowest 25% and the probability that it is in the top 75% must add to 1, and they do (0.129 + 0.871 = 1).
The unconditional probability that X > x is 75%. However, the probability that square footage is in the top 75% given that sale price is in the top 75% is 87%. Because conditioning on sale price changes the probability, sale price and square footage are not independent.

                  Yless1stQuartile  Ygreater1stQuartile  Total
X<=1st quartile                224                  141    365
X>1st quartile                 141                  954   1095
Total                          365                 1095   1460

Of the houses in the bottom 25% in price, about 61% (224 of 365) are in the bottom quarter of square footage. If price and square footage were independent, this share would be about 25%. Since it is not, it can be concluded that house price and square footage are dependent.
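As a cross-check, the three probabilities above can be recomputed directly from the counts in the contingency table. This is a minimal sketch: the matrix name `tbl`, its dimension labels and the variable names are illustrative and do not come from the original analysis.

```r
# Counts from the contingency table (rows: X, columns: Y); all names are illustrative
tbl <- matrix(c(224, 141,
                141, 954),
              nrow = 2, byrow = TRUE,
              dimnames = list(X = c("low", "high"), Y = c("low", "high")))
n <- sum(tbl)                                    # 1460 houses

p_joint  <- tbl["high", "high"] / n              # P(X>x, Y>y)
p_y_high <- sum(tbl[, "high"]) / n               # P(Y>y)
p_x_high_given_y <- p_joint / p_y_high           # P(X>x | Y>y) = P(X>x, Y>y) / P(Y>y)
p_x_low_given_y  <- tbl["low", "high"] / sum(tbl[, "high"])   # P(X<x | Y>y)

round(c(p_x_high_given_y, p_joint, p_x_low_given_y), 4)
```

The rounded results, 0.8712, 0.6534 and 0.1288, agree with the values computed from the data frame above.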

Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.

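# h, f, d and tot come from the contingency table above (defined in a chunk not shown):
# h = count with X > x (1095), f = count with Y > y (1095),
# d = count with both X > x and Y > y (954), tot = total count (1460)
A <- h
B <- f
PAB <- d/tot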
PAB
## [1] 0.6534247
PAPB <- (A/tot)*(B/tot)
PAPB
## [1] 0.5625
chisq.test(df)
## 
##  Pearson's Chi-squared test
## 
## data:  df
## X-squared = 343.33, df = 4, p-value < 2.2e-16

P(AB) = 0.6534
P(A)P(B) = 0.5625
\(P(AB) \neq P(A)P(B)\)
Since the p-value for the Chi Square test is less than 0.05, we reject the null hypothesis that house price is independent of square footage.
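A 2 x 2 table has (2-1)(2-1) = 1 degree of freedom, so the df = 4 in the output above suggests that the Total row and column were part of the table passed to chisq.test. The following sketch runs the test on the four interior counts only. The name `counts` is illustrative. The conclusion does not change: independence is still rejected.

```r
# Interior cells of the contingency table, without the Total row and column
counts <- matrix(c(224, 141,
                   141, 954), nrow = 2, byrow = TRUE)

result <- chisq.test(counts)   # Yates' continuity correction is applied by default for 2 x 2 tables
result$parameter               # degrees of freedom
result$p.value
```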

Descriptive and Inferential Statistics.

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y.

## X 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      861        1     1515    563.1      848      912 
##      .25      .50      .75      .90      .95 
##     1130     1464     1777     2158     2466 
## 
## lowest :  334  438  480  520  605, highest: 3627 4316 4476 4676 5642
## Y 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      663        1   180921    81086    88000   106475 
##      .25      .50      .75      .90      .95 
##   129975   163000   214000   278000   326100 
## 
## lowest :  34900  35311  37900  39300  40000, highest: 582933 611657 625000 745000 755000

The mean square footage is slightly greater than the median. This indicates that it is slightly skewed to the right.
The mean house price is greater than the median house price, which indicates that house price too is skewed to the right.

Derive a correlation matrix for any THREE quantitative variables in the dataset.

I chose to create a correlation matrix for Year Built, Lot Area and Garage Area.

mtrx <- matrix(c(house_data$YearBuilt,house_data$LotArea,house_data$GarageArea),ncol=3)
colnames(mtrx) <- c("Year Built","Lot Area", "Garage Area")
cor <- rcorr(mtrx)
cor
##             Year Built Lot Area Garage Area
## Year Built        1.00     0.01        0.48
## Lot Area          0.01     1.00        0.18
## Garage Area       0.48     0.18        1.00
## 
## n= 1460 
## 
## 
## P
##             Year Built Lot Area Garage Area
## Year Built             0.587    0.000      
## Lot Area    0.587               0.000      
## Garage Area 0.000      0.000

The correlation between lot area and year built is 0.01, so lot area is essentially uncorrelated with the year a house was built. The p-value is 0.587. We fail to reject the null hypothesis that the correlation between lot area and year built is zero.
The correlation between garage area and the year built is 0.48. This means that there is a positive correlation between the year the house was built and the area of the garage. (The later the house is built, the larger the garage area.) The p-value to 3 decimal places is zero. We therefore reject the null hypothesis that garage area is not correlated with the year a house was built.
The correlation between lot area and garage area is 0.18. There is a positive correlation between garage area and lot area. However, since the correlation coefficient is fairly low, the relationship is weak. The p-value to 3 decimal places is zero. We therefore reject the null hypothesis that garage area is not correlated with lot area.
Familywise error is the chance of making at least one type 1 error across the three tests. A type 1 error occurs when you reject a null hypothesis that is in fact true, so it is a concern only for the two relationships where we rejected the null hypothesis: garage area and year built, and lot area and garage area. However, since both p-values are zero to three decimal places, I am not particularly concerned about this.
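To put a number on the familywise concern, the sketch below computes the chance of at least one type 1 error across three tests at the 0.05 level, along with the Bonferroni-adjusted per-test level. It assumes the three tests are independent, which is a simplification here because the correlations share variables. The variable names are illustrative.

```r
alpha <- 0.05   # per-test significance level
m     <- 3      # number of pairwise correlation tests

fwer_unadjusted  <- 1 - (1 - alpha)^m   # chance of at least one type 1 error (independence assumed)
alpha_bonferroni <- alpha / m           # per-test level that keeps the familywise rate at or below alpha
c(fwer_unadjusted, alpha_bonferroni)
```

Both rejected hypotheses have p-values below 0.0005, well under the Bonferroni level of about 0.0167, so the conclusions would survive the adjustment.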

Linear Algebra and Correlation.

Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

##              Year Built    Lot Area Garage Area
## Year Built   1.30681632  0.09749499  -0.6434930
## Lot Area     0.09749499  1.04091358  -0.2344793
## Garage Area -0.64349303 -0.23447928   1.3505042

Multiply the correlation matrix by the precision matrix.

##               Year Built     Lot Area Garage Area
## Year Built  1.000000e+00 1.387779e-17           0
## Lot Area    0.000000e+00 1.000000e+00           0
## Garage Area 1.110223e-16 5.551115e-17           1

This gives the identity matrix.

Then multiply the precision matrix by the correlation matrix.

##               Year Built Lot Area  Garage Area
## Year Built  1.000000e+00        0 1.110223e-16
## Lot Area    1.387779e-17        1 5.551115e-17
## Garage Area 0.000000e+00        0 1.000000e+00

This gives the identity matrix.
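The code for the inversion and the two products is not shown, so here is a minimal sketch of one way to do it with base R. It rebuilds the correlation matrix from the two-decimal values printed earlier, so its results will differ slightly from the output above, which was computed from unrounded correlations. The names `R` and `precision` are illustrative.

```r
vars <- c("Year Built", "Lot Area", "Garage Area")
# Correlations rounded to two decimals, as printed in the correlation matrix
R <- matrix(c(1.00, 0.01, 0.48,
              0.01, 1.00, 0.18,
              0.48, 0.18, 1.00),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

precision <- solve(R)          # matrix inverse (the precision matrix)
diag(precision)                # variance inflation factors
round(R %*% precision, 10)     # identity matrix, up to floating-point error
round(precision %*% R, 10)     # identity matrix again: a matrix commutes with its inverse
```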

Conduct LU decomposition on the matrix.

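# LU decomposition of a 3 x 3 matrix by Gaussian elimination (no pivoting)
factorize <- function(A){
  U <- A
  # start L as unit lower triangular; the entries below the diagonal are overwritten
  L <- matrix(c(1,1,1,0,1,1,0,0,1),3,3)

  for (colnum in 1:2){
    for (rownum in (colnum+1):3){
      comparison_position <- U[colnum,colnum]   # pivot
      value <- U[rownum,colnum]                 # entry to eliminate
      factor <- -1*value/comparison_position  
      U[rownum,] <- U[rownum,] + factor*U[colnum,]
      L[rownum,colnum] <- -1*factor             # store the multiplier in L
      }
    }

# return U and L flattened into one vector of 18 values
c(U,L)
}


# reshape so that columns 1-3 hold U and columns 4-6 hold L
UL <- matrix(factorize(cor$r),3,6)

U <- UL[,1:3]
L <- UL[,4:6]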

The lower triangular matrix:

##            [,1]      [,2] [,3]
## [1,] 1.00000000 0.0000000    0
## [2,] 0.01422765 1.0000000    0
## [3,] 0.47895382 0.1736235    1

The upper triangular matrix:

##      [,1]       [,2]      [,3]
## [1,]    1 0.01422765 0.4789538
## [2,]    0 0.99979757 0.1735884
## [3,]    0 0.00000000 0.7404642

The product of L and U equals the correlation matrix.

##             Year Built Lot Area Garage Area
## Year Built        TRUE     TRUE        TRUE
## Lot Area          TRUE     TRUE        TRUE
## Garage Area       TRUE     TRUE        TRUE

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. For the first variable that you selected which is skewed to the right, shift it so that the minimum value is above zero as necessary.
The minimum value is already above zero because it is a measure of above ground square footage.

Then load the MASS package and run fitdistr to fit an exponential probability density function.

##        rate    
##   6.598640e-04 
##  (1.726943e-05)

Find the optimal value of \(\lambda\) for this distribution.

\(\lambda\) = 0.000660
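For an exponential distribution, the maximum-likelihood estimate of the rate is the reciprocal of the sample mean, which gives a quick check on the fitdistr result. The sketch below uses the sample mean of X reported in this document (1515.464). The variable names are illustrative.

```r
x_bar <- 1515.464          # sample mean of GrLivArea as reported in this document
lambda_hat <- 1 / x_bar    # maximum-likelihood estimate of the exponential rate
lambda_hat
```

This reproduces the fitted rate of about 0.000660.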

Take 1000 samples from this exponential distribution using this value.

Plot a histogram and compare it with a histogram of your original variable.
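The sampling and plotting code is not shown, so the following is a sketch of one way to do it rather than the code behind the original plots. The seed, the number of breaks and the name `exp_samples` are assumptions.

```r
set.seed(605)                           # arbitrary seed, only for reproducibility
rate <- 6.598640e-04                    # fitted rate from fitdistr
exp_samples <- rexp(1000, rate = rate)  # 1000 draws from the fitted exponential

hist(exp_samples, breaks = 30,
     main = "Simulated exponential sample", xlab = "Square feet")
# With the data loaded, the same call on the observed values gives the comparison plot:
# hist(house_data$GrLivArea, breaks = 30)
```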

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

##       5% 
## 79.68066
##      95% 
## 4728.365

The 5th percentile of above-ground square footage from the exponential distribution is 79.7 sqft.
The 95th percentile of above-ground square footage from the exponential distribution is 4728.4 sqft.
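The percentiles can also be obtained in closed form by inverting the exponential CDF, F(x) = 1 - exp(-λx), which gives x_p = -ln(1 - p)/λ. The sketch below compares that formula with R's built-in qexp, using the fitted rate. The variable names are illustrative.

```r
rate <- 6.598640e-04
p <- c(0.05, 0.95)

closed_form <- -log(1 - p) / rate     # inverse of the CDF F(x) = 1 - exp(-rate * x)
built_in    <- qexp(p, rate = rate)   # the same quantiles from R's quantile function
rbind(closed_form, built_in)
```

Both rows come out at roughly 78 and 4,540 sqft. The figures in the output above differ a little from these, which would be consistent with their being quantiles of a random sample rather than of the fitted distribution itself. That reading is an inference, since the code is not shown.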
Also generate a 95% confidence interval from the empirical data, assuming normality.

## 
##  One Sample t-test
## 
## data:  X
## t = 110.2, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1488.487 1542.440
## sample estimates:
## mean of x 
##  1515.464
## [1] 1488.509
## [1] 1542.419

We can be 95% confident that the mean above-ground square footage is between 1488.5 sqft and 1542.4 sqft. This is an interval for the mean, not a range that contains 95% of individual houses.
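The second pair of numbers in the output above appears to be a normal (z-based) interval for the mean. A minimal sketch of that calculation follows. The function name `normal_ci` is hypothetical, and the input vector is toy data, not the house data.

```r
# Normal-approximation confidence interval for the mean of x
normal_ci <- function(x, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)    # about 1.96 for a 95% interval
  se <- sd(x) / sqrt(length(x))       # standard error of the mean
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# Toy input (not house data), only to show the call
normal_ci(c(1200, 1500, 1800, 1350, 1650))
```

On the real data the call would be normal_ci(house_data$GrLivArea). With 1460 observations the z-based and t-based intervals nearly coincide, as the output above shows.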

Finally, provide the empirical 5th percentile and 95th percentile of the data.

##  5% 
## 848
##    95% 
## 2466.1

The 5th percentile of square footage from the data is 848 sqft.
The 95th percentile of square footage from the data is 2466.1 sqft.

Based on the comparison between the exponential distribution and the data, I conclude that house square footage is not well approximated by an exponential distribution. The exponential’s 5th percentile is far below the empirical value of 848 sqft, and its 95th percentile is far above the empirical value of 2466.1 sqft. I expect that this is because, while above-ground square footage is skewed right, there is a sizeable gap between zero and the areas of the smallest houses. Most houses fall in a middle range, whereas the exponential distribution puts most of its mass near zero and has a long right tail. The 95% confidence interval under normality is much narrower, but it is an interval for the mean rather than for individual houses, so it is not directly comparable to these percentiles.

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

I shuffled the data set and split it into a training set and a testing set.
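The splitting code is not shown. The model summaries below report 876 training observations (for example, 864 residual degrees of freedom plus 12 coefficients), which is 60% of the 1460 rows, so the sketch assumes a 60/40 split. That fraction is an inference, and the seed, the toy data frame and the names `train_toy` and `test_toy` are illustrative.

```r
set.seed(1)                                    # arbitrary seed for a reproducible shuffle
toy <- data.frame(Id = 1:10,
                  SalePrice = seq(100000, 190000, by = 10000))   # stand-in for house_data

shuffled  <- toy[sample(nrow(toy)), ]          # shuffle the rows
n_train   <- floor(0.6 * nrow(shuffled))       # 60% training fraction (inferred, see above)
train_toy <- shuffled[seq_len(n_train), ]
test_toy  <- shuffled[-seq_len(n_train), ]
c(train = nrow(train_toy), test = nrow(test_toy))
```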

Backward Elimination - Linear Regression Model

## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + OverallQual + 
##     OverallCond + YearBuilt + MasVnrArea + X1stFlrSF + X2ndFlrSF + 
##     BsmtFullBath + BedroomAbvGr + GarageCars + WoodDeckSF + ScreenPorch + 
##     PoolArea, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -346787  -18564   -2213   14969  263760 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.719e+05  1.127e+05  -6.847 1.44e-11 ***
## MSSubClass   -1.714e+02  3.167e+01  -5.410 8.18e-08 ***
## LotArea       4.175e-01  1.123e-01   3.718 0.000214 ***
## OverallQual   2.142e+04  1.393e+03  15.370  < 2e-16 ***
## OverallCond   5.225e+03  1.190e+03   4.389 1.28e-05 ***
## YearBuilt     3.553e+02  5.768e+01   6.160 1.12e-09 ***
## MasVnrArea    2.861e+01  7.860e+00   3.640 0.000289 ***
## X1stFlrSF     5.030e+01  4.781e+00  10.522  < 2e-16 ***
## X2ndFlrSF     4.421e+01  4.349e+00  10.165  < 2e-16 ***
## BsmtFullBath  1.479e+04  2.513e+03   5.886 5.67e-09 ***
## BedroomAbvGr -4.547e+03  1.880e+03  -2.419 0.015771 *  
## GarageCars    1.295e+04  2.165e+03   5.981 3.26e-09 ***
## WoodDeckSF    3.245e+01  1.055e+01   3.075 0.002169 ** 
## ScreenPorch   7.432e+01  1.977e+01   3.760 0.000182 ***
## PoolArea     -8.536e+01  2.554e+01  -3.342 0.000866 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35620 on 856 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.7713, Adjusted R-squared:  0.7675 
## F-statistic: 206.2 on 14 and 856 DF,  p-value: < 2.2e-16

The residuals show a trend upward at the beginning and the end. I will therefore try a different model in which I take the log of Sales Price and try to build a linear model that way.

Backward Elimination - Linear Regression with log(Sale Price) as the dependent variable

## 
## Call:
## lm(formula = log(SalePrice) ~ MSSubClass + LotArea + OverallQual + 
##     OverallCond + YearBuilt + X1stFlrSF + GrLivArea + BsmtFullBath + 
##     Fireplaces + GarageCars + ScreenPorch, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.79881 -0.08093  0.00330  0.09401  0.51482 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.822e+00  4.944e-01   7.731 2.96e-14 ***
## MSSubClass   -6.857e-04  1.386e-04  -4.947 9.07e-07 ***
## LotArea       1.617e-06  5.046e-07   3.205   0.0014 ** 
## OverallQual   9.459e-02  6.161e-03  15.352  < 2e-16 ***
## OverallCond   5.105e-02  5.244e-03   9.735  < 2e-16 ***
## YearBuilt     3.449e-03  2.539e-04  13.586  < 2e-16 ***
## X1stFlrSF     1.827e-05  1.977e-05   0.924   0.3558    
## GrLivArea     2.135e-04  1.553e-05  13.744  < 2e-16 ***
## BsmtFullBath  7.418e-02  1.107e-02   6.701 3.73e-11 ***
## Fireplaces    5.508e-02  1.025e-02   5.372 1.00e-07 ***
## GarageCars    8.346e-02  9.585e-03   8.707  < 2e-16 ***
## ScreenPorch   3.535e-04  8.897e-05   3.974 7.67e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1589 on 864 degrees of freedom
## Multiple R-squared:  0.8279, Adjusted R-squared:  0.8257 
## F-statistic: 377.8 on 11 and 864 DF,  p-value: < 2.2e-16
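The elimination steps behind the two models are not shown. One way to automate backward elimination in base R is step(), sketched below on the built-in mtcars data as a stand-in. Note that step() drops terms by AIC rather than by p-value, so it is a related technique and not necessarily the procedure used for the models above.

```r
full <- lm(mpg ~ ., data = mtcars)                         # start from all predictors
reduced <- step(full, direction = "backward", trace = 0)   # drop terms while AIC improves
formula(reduced)
summary(reduced)$adj.r.squared
```

For the house data, the starting model would be fitted on the training set with log(SalePrice) as the response, and the retained terms would then be checked against the p-values in the summary.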

Residual Analysis

plot(fitted(price_lm),resid(price_lm))

qqnorm(resid(price_lm))
qqline(resid(price_lm))

The residuals do not show a pattern.

Prediction

predictprice <- predict(price_lm, newdata=test, type="response")
error <- (exp(predictprice)-test$SalePrice)/test$SalePrice
error <- mean(error)
error
## [1] 0.009332353

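This gives the mean signed relative error between the predicted sale price and the actual sale price, about 0.9%. Because positive and negative errors offset each other, it measures bias rather than the typical size of the error.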
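Kaggle scores this competition by the root mean squared error between the log of the predicted price and the log of the observed price, so that metric is a closer local proxy for the leaderboard score than the signed error. A minimal sketch follows. The function names `rmsle` and `mape` and the toy vectors are illustrative.

```r
# Root mean squared error between log predictions and log actual prices
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted) - log(actual))^2))
}
# Mean absolute percentage error: typical size of the relative error, ignoring sign
mape <- function(predicted, actual) {
  mean(abs(predicted - actual) / actual)
}

# Toy values (not from the house data)
toy_actual    <- c(100000, 200000, 150000)
toy_predicted <- c(110000, 190000, 150000)
rmsle(toy_predicted, toy_actual)
mape(toy_predicted, toy_actual)
```

On the held-out set the call would be rmsle(exp(predictprice), test$SalePrice), since the model predicts log prices.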

Apply Model to Test Data from Kaggle

kaggle_test <- read.csv('C:/Users/Swigo/Desktop/Sarah/DATA 605/kaggle_house_test_data.csv')
predict_kaggle_price <- predict(price_lm, newdata=kaggle_test, type="response")
predict_kaggle_price <- exp(predict_kaggle_price)
head(predict_kaggle_price)
##        1        2        3        4        5        6 
## 120991.8 140073.3 163719.7 187850.2 188111.4 176848.6
predict_kaggle_price[is.na(predict_kaggle_price)] <- mean(train$SalePrice)
submission <- data.frame(list("Id"=kaggle_test$Id, "SalePrice"=predict_kaggle_price), stringsAsFactors = FALSE)
head(submission)
##     Id SalePrice
## 1 1461  120991.8
## 2 1462  140073.3
## 3 1463  163719.7
## 4 1464  187850.2
## 5 1465  188111.4
## 6 1466  176848.6
# write.csv always writes a header row and uses a comma separator, so col.names and sep are not passed
write.csv(submission, file="house_submission.csv", row.names=FALSE)

Kaggle username: Sarah Wigodsky
Kaggle score: 0.14173