House Prices: Advanced Regression Techniques Competition
Introduction
The goal is to use Kaggle’s competition to apply advanced regression techniques to a data set of house prices.
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right!
Independent variable - GrLivArea: Above grade (ground) living area square feet
Since there is a limit to how small a house can be but no limit to how large it can be, the data for above ground square footage are skewed to the right.
Pick the dependent variable and define it as Y.
The dependent variable is the sale price.
Probability
The small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable.
(a) P(X>x | Y>y)
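library(dplyr)
X <- house_data$GrLivArea   # independent variable
Y <- house_data$SalePrice   # dependent variable
x <- quantile(X)[2]         # 1st quartile of X
y <- quantile(Y)[2]         # 1st quartile of Y
# among houses with SalePrice > y, count GrLivArea > x (FALSE row first, TRUE row second)
probxgiveny <- filter(house_data, SalePrice > y) %>%
  count(GrLivArea > x)
probxgiveny <- probxgiveny$n[2] / sum(probxgiveny$n)
probxgiveny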
## [1] 0.8712329
Of the houses whose sale price is in the top 75%, the probability that the above ground square footage is also in the top 75% is 0.871.
- P(X>x, Y>y) Joint Distribution
jointdist <- filter(house_data, SalePrice > y, GrLivArea>x)
jointdist <- nrow(jointdist)/nrow(house_data)
jointdist
## [1] 0.6534247
The probability that the square footage is in the top 75% and the sale price is in the top 75% is 0.653.
- P(X<x | Y>y)
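# among houses with SalePrice > y, count GrLivArea < x (FALSE row first, TRUE row second)
probxlowgiveny <- filter(house_data, SalePrice > y) %>%
  count(GrLivArea < x)
probxlowgiveny <- probxlowgiveny$n[2] / sum(probxlowgiveny$n)
probxlowgiveny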
## [1] 0.1287671
Of the houses whose sale price is in the top 75%, the probability that the square footage is in the lowest 25% is 0.129. This makes sense: given that the sale price is in the top 75%, the probability that the square footage is in the lowest 25% and the probability that it is in the top 75% should add to 1, and they do (0.129 + 0.871 = 1).
The probability that X > x is 75%. However, the probability that square footage is in the top 75% given that sale price is in the top 75% is 87%. If X and Y were independent, these two values would be equal, so this indicates that sale price and square footage are dependent.
Square footage (X) | Y<=1st quartile | Y>1st quartile | Total |
---|---|---|---|
X<=1st quartile | 224 | 141 | 365 |
X>1st quartile | 141 | 954 | 1095 |
Total | 365 | 1095 | 1460 |
Of the houses in the bottom 25% in price, about 61% (224 of 365) are in the bottom quarter of square footage. If price and square footage were independent, this share would be 25%, so it can be concluded that house price and square footage are dependent.
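The three probabilities reported above can also be recovered directly from the counts in the table. The following is a minimal sketch that types the four cell counts in by hand; the object names (`counts`, `p_joint`, and so on) are illustrative and do not come from the original analysis.

```r
# Counts copied from the 2 x 2 table above (rows: X, columns: Y)
counts <- matrix(c(224, 141,
                   141, 954),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(X = c("<=Q1", ">Q1"), Y = c("<=Q1", ">Q1")))
total <- sum(counts)

# Joint probability: both X and Y above their 1st quartiles
p_joint <- counts[">Q1", ">Q1"] / total
# Conditional probabilities: restrict to the Y > 1st quartile column
p_x_given_y <- counts[">Q1", ">Q1"] / sum(counts[, ">Q1"])
p_xlow_given_y <- counts["<=Q1", ">Q1"] / sum(counts[, ">Q1"])

round(c(joint = p_joint, x_given_y = p_x_given_y, xlow_given_y = p_xlow_given_y), 4)
```

Conditioning on Y > y amounts to dividing by the column total (1095) instead of the grand total (1460).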
Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
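# h, f, d and tot are presumably defined from the contingency table above (not shown):
# h = count with X above its 1st quartile, f = count with Y above its 1st quartile,
# d = count with both, tot = total number of houses
A <- h
B <- f
PAB <- d/tot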
PAB
## [1] 0.6534247
PAPB <- (A/tot)*(B/tot)
PAPB
## [1] 0.5625
chisq.test(df)
##
## Pearson's Chi-squared test
##
## data: df
## X-squared = 343.33, df = 4, p-value < 2.2e-16
P(AB) = 0.6534
P(A)P(B) = 0.5625
P(AB) \(\neq\) P(A)P(B)
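Since the p-value for the Chi square test is less than 0.05, we reject the null hypothesis that house price is independent of square footage. Note that the reported df = 4 corresponds to a 3 x 3 table, which suggests the Total row and column were passed to the test along with the counts. The 2 x 2 table of counts alone has 1 degree of freedom, and the conclusion is the same.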
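As a cross-check, the test can be run on the four cell counts alone, without the marginal totals. This is a sketch under the assumption that the counts in the table above are the ones intended for the test; `counts` and `result` are illustrative names.

```r
# 2 x 2 table of counts only (no Total row or column)
counts <- matrix(c(224, 141,
                   141, 954),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(X = c("<=Q1", ">Q1"), Y = c("<=Q1", ">Q1")))

# correct = FALSE turns off the Yates continuity correction that
# chisq.test() applies to 2 x 2 tables by default
result <- chisq.test(counts, correct = FALSE)

result$statistic   # Pearson X-squared
result$parameter   # degrees of freedom: (2 - 1) * (2 - 1) = 1
result$p.value
```

Without the continuity correction the statistic agrees with the 343.33 reported above, because the marginal cells contribute nothing to the sum; only the degrees of freedom change.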
Descriptive and Inferential Statistics.
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y.
## X
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 861 1 1515 563.1 848 912
## .25 .50 .75 .90 .95
## 1130 1464 1777 2158 2466
##
## lowest : 334 438 480 520 605, highest: 3627 4316 4476 4676 5642
## Y
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 663 1 180921 81086 88000 106475
## .25 .50 .75 .90 .95
## 129975 163000 214000 278000 326100
##
## lowest : 34900 35311 37900 39300 40000, highest: 582933 611657 625000 745000 755000
The mean square footage (1515) is slightly greater than the median (1464). This indicates that square footage is slightly skewed to the right.
The mean house price (180,921) is greater than the median house price (163,000), which indicates that house price is also skewed to the right.
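Comparing the mean with the median is a quick check for skew; a moment-based skewness coefficient gives a number to go with it. The sketch below uses a simulated right-skewed sample as a stand-in for the real data, and `sample_skewness` is a hypothetical helper, not a function used in the original analysis.

```r
# Moment-based sample skewness; positive values indicate a longer right tail
sample_skewness <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

set.seed(605)  # arbitrary seed so the sketch is reproducible
# Simulated stand-in for a right-skewed variable (not the real GrLivArea values)
sim <- rlnorm(1460, meanlog = log(1464), sdlog = 0.33)

c(mean = mean(sim), median = median(sim), skewness = sample_skewness(sim))
```

On the real data the same call would be `sample_skewness(house_data$GrLivArea)`.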
Derive a correlation matrix for any THREE quantitative variables in the dataset.
I chose to create a correlation matrix for Year Built, Lot Area and Garage Area.
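library(Hmisc)  # provides rcorr()
mtrx <- matrix(c(house_data$YearBuilt,house_data$LotArea,house_data$GarageArea),ncol=3)
colnames(mtrx) <- c("Year Built","Lot Area", "Garage Area")
# rcorr() returns the correlations ($r), sample sizes ($n) and p-values ($P)
cor <- rcorr(mtrx)
cor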
## Year Built Lot Area Garage Area
## Year Built 1.00 0.01 0.48
## Lot Area 0.01 1.00 0.18
## Garage Area 0.48 0.18 1.00
##
## n= 1460
##
##
## P
## Year Built Lot Area Garage Area
## Year Built 0.587 0.000
## Lot Area 0.587 0.000
## Garage Area 0.000 0.000
The correlation between lot area and year built is 0.01. This means that lot area is not correlated to the year a house was built. The p-value is 0.587. We fail to reject the null hypothesis that lot area is not correlated to the year the house was built.
The correlation between garage area and the year built is 0.48. This means that there is a positive correlation between the year the house was built and the area of the garage. (The later the house is built, the larger the garage area.) The p-value to 3 decimal places is zero. We therefore reject the null hypothesis that garage area is not correlated with the year a house was built.
The correlation between lot area and garage area is 0.18. There is a positive correlation between garage area and lot area. However, since the correlation coefficient is fairly low, the relationship is weak: lot area accounts for only about 3% of the variance in garage area (0.18^2 is about 0.03). The p-value to 3 decimal places is zero. We therefore reject the null hypothesis that garage area is not correlated with lot area.
I would be concerned about familywise error, that is, the chance of at least one type 1 error across the three tests, only for the relationships between garage area and year built, and garage area and lot area. For those two relationships, we rejected the null hypothesis. A type 1 error occurs when you reject the null hypothesis, but it is in fact true. However, since both p-values are zero to three decimal places, I am not particularly concerned about this.
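The size of the familywise error rate can be made concrete. The sketch below assumes the three tests are independent, which is a simplification since they share variables, and applies a Bonferroni adjustment to the p-values as printed above (rounded to three decimals). The vector names are illustrative.

```r
alpha <- 0.05
n_tests <- 3

# Chance of at least one type 1 error across three independent tests at alpha = 0.05
fwer <- 1 - (1 - alpha)^n_tests
fwer

# p-values as printed in the rcorr output above (rounded to three decimals)
p_values <- c(year_lot = 0.587, year_garage = 0.000, lot_garage = 0.000)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
p_adjusted <- p.adjust(p_values, method = "bonferroni")
p_adjusted
```

With unadjusted tests the familywise rate is roughly 14%, but the two significant p-values remain far below 0.05 after adjustment.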
Linear Algebra and Correlation.
Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
## Year Built Lot Area Garage Area
## Year Built 1.30681632 0.09749499 -0.6434930
## Lot Area 0.09749499 1.04091358 -0.2344793
## Garage Area -0.64349303 -0.23447928 1.3505042
Multiply the correlation matrix by the precision matrix.
## Year Built Lot Area Garage Area
## Year Built 1.000000e+00 1.387779e-17 0
## Lot Area 0.000000e+00 1.000000e+00 0
## Garage Area 1.110223e-16 5.551115e-17 1
This gives the identity matrix.
Then multiply the precision matrix by the correlation matrix.
## Year Built Lot Area Garage Area
## Year Built 1.000000e+00 0 1.110223e-16
## Lot Area 1.387779e-17 1 5.551115e-17
## Garage Area 0.000000e+00 0 1.000000e+00
This gives the identity matrix.
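The inversion and the two products can be sketched with `solve()` and `%*%`. The matrix below uses the correlations rounded to two decimals as printed earlier, so the values differ slightly from the output above; `R` and `precision` are illustrative names.

```r
vars <- c("Year Built", "Lot Area", "Garage Area")
# Correlation matrix with the rounded values printed earlier (an approximation)
R <- matrix(c(1.00, 0.01, 0.48,
              0.01, 1.00, 0.18,
              0.48, 0.18, 1.00),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

precision <- solve(R)   # inverse of the correlation matrix
diag(precision)         # variance inflation factors

# Both products equal the identity up to floating-point rounding
round(R %*% precision, 10)
round(precision %*% R, 10)
```

The off-diagonal entries of order 1e-17 in the output above are rounding error of this kind, not genuine non-zero values.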
Conduct LU decomposition on the matrix.
factorize <- function(A){
  # LU decomposition of a 3 x 3 matrix by Gaussian elimination without pivoting
  U <- unname(A)
  L <- diag(3)   # unit lower triangular; multipliers are filled in below
  for (colnum in 1:2){
    for (rownum in (colnum+1):3){
      # multiplier that zeroes U[rownum, colnum]
      multiplier <- U[rownum,colnum]/U[colnum,colnum]
      U[rownum,] <- U[rownum,] - multiplier*U[colnum,]
      L[rownum,colnum] <- multiplier
    }
  }
  list(L = L, U = U)
}
LU <- factorize(cor$r)
U <- LU$U
L <- LU$L
The lower triangular matrix:
## [,1] [,2] [,3]
## [1,] 1.00000000 0.0000000 0
## [2,] 0.01422765 1.0000000 0
## [3,] 0.47895382 0.1736235 1
The upper triangular matrix:
## [,1] [,2] [,3]
## [1,] 1 0.01422765 0.4789538
## [2,] 0 0.99979757 0.1735884
## [3,] 0 0.00000000 0.7404642
The product of L and U equals the correlation matrix.
## Year Built Lot Area Garage Area
## Year Built TRUE TRUE TRUE
## Lot Area TRUE TRUE TRUE
## Garage Area TRUE TRUE TRUE
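The same elimination extends to any square matrix whose pivots are non-zero, and the check is safer with a tolerance than with exact equality. The sketch below is a hypothetical generalization (`lu_nopivot` is not part of the original code) and again uses the rounded correlations.

```r
# LU decomposition without pivoting for an n x n matrix
lu_nopivot <- function(A) {
  n <- nrow(A)
  U <- unname(A)
  L <- diag(n)
  for (k in seq_len(n - 1)) {
    if (U[k, k] == 0) stop("zero pivot: row exchanges would be needed")
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]           # elimination multiplier
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]    # zero out the entry below the pivot
    }
  }
  list(L = L, U = U)
}

# Rounded correlation matrix from earlier (an approximation of cor$r)
R <- matrix(c(1.00, 0.01, 0.48,
              0.01, 1.00, 0.18,
              0.48, 0.18, 1.00), nrow = 3, byrow = TRUE)
dec <- lu_nopivot(R)

# all.equal() compares with a tolerance, which avoids spurious FALSE
# results from floating-point rounding
isTRUE(all.equal(dec$L %*% dec$U, R))
```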
Calculus-Based Probability & Statistics.
Many times, it makes sense to fit a closed form distribution to data. For the first variable that you selected which is skewed to the right, shift it so that the minimum value is above zero as necessary.
The minimum value is already above zero because it is a measure of above ground square footage.
Then load the MASS package and run fitdistr to fit an exponential probability density function.
## rate
## 6.598640e-04
## (1.726943e-05)
Find the optimal value of \(\lambda\) for this distribution.
\(\lambda\) = 0.000660
Take 1000 samples from this exponential distribution using this value.
Plot a histogram and compare it with a histogram of your original variable.
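A minimal sketch of the sampling step follows. It uses the rate printed by fitdistr above; the seed and the number of breaks are arbitrary choices, not values from the original analysis.

```r
set.seed(605)             # arbitrary seed so the sketch is reproducible
lambda <- 6.598640e-04    # rate reported by fitdistr above

# 1000 draws from the fitted exponential distribution
samples <- rexp(1000, rate = lambda)

# plot = FALSE returns the histogram object without drawing it;
# plot(h) draws it, and hist(X) gives the histogram of the original variable
h <- hist(samples, breaks = 30, plot = FALSE)

c(sample_mean = mean(samples), theoretical_mean = 1 / lambda)
```

The exponential histogram is tallest in its first bin and decays from there, whereas the original variable has no values near zero.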
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
## 5%
## 79.68066
## 95%
## 4728.365
The 5th percentile of above ground square footage from the exponential distribution, as printed above, is 79.7 sqft.
The 95th percentile of above ground square footage from the exponential distribution, as printed above, is 4728.4 sqft.
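The percentiles of an exponential distribution follow in closed form from its CDF, F(q) = 1 - exp(-lambda q), so no simulation is needed. The sketch below evaluates that formula and compares it with `qexp()`; the rate is the one printed by fitdistr. The printed values above appear to be quantiles of simulated draws, which vary from run to run, whereas the closed form gives about 77.7 and 4,540 sqft.

```r
lambda <- 6.598640e-04    # rate reported by fitdistr above
p <- c(0.05, 0.95)

# Solve 1 - exp(-lambda * q) = p for q
q_closed <- -log(1 - p) / lambda

# qexp() is the built-in quantile function of the exponential distribution
q_builtin <- qexp(p, rate = lambda)

rbind(q_closed, q_builtin)
```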
Also generate a 95% confidence interval from the empirical data, assuming normality.
##
## One Sample t-test
##
## data: X
## t = 110.2, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 1488.487 1542.440
## sample estimates:
## mean of x
## 1515.464
## [1] 1488.509
## [1] 1542.419
We can be 95% confident that the mean above ground square footage is between 1488.5 sqft and 1542.4 sqft.
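The output above prints two intervals: the one from t.test and a slightly narrower pair that is consistent with using the normal quantile (about 1.96) in place of the t quantile. The sketch below computes both by hand on simulated data; `mean_ci` is a hypothetical helper and the simulated mean and standard deviation are arbitrary stand-ins for the real variable.

```r
# Confidence interval for a mean using t and normal critical values
mean_ci <- function(v, level = 0.95) {
  n <- length(v)
  se <- sd(v) / sqrt(n)          # standard error of the mean
  alpha <- 1 - level
  list(
    t = mean(v) + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * se,
    z = mean(v) + c(-1, 1) * qnorm(1 - alpha / 2) * se
  )
}

set.seed(605)  # arbitrary seed
# Simulated stand-in with roughly the scale of GrLivArea (not the real data)
sim <- rnorm(1460, mean = 1515, sd = 525)

ci <- mean_ci(sim)
ci$t                               # matches t.test(sim)$conf.int
as.numeric(t.test(sim)$conf.int)
ci$z                               # slightly narrower normal-quantile interval
```

With 1,459 degrees of freedom the two critical values are nearly identical, which is why the two printed intervals differ only in the second decimal place.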
Finally, provide the empirical 5th percentile and 95th percentile of the data.
## 5%
## 848
## 95%
## 2466.1
The 5th percentile of square footage from the data is 848 sqft.
The 95th percentile of square footage from the data is 2466.1 sqft.
Based on the comparison between the exponential distribution and the data, I conclude that the values for house square footage are not well approximated by an exponential distribution. The exponential's 5th percentile is too low and its 95th percentile is too high. I expect that this is because, while above ground square footage is skewed right, there is a sizeable gap between zero and the areas of the smallest houses, whereas the exponential density is highest at zero. Most houses fall in a middle range, so the exponential distribution also places the 95th percentile too high. The 95% confidence interval computed under the normality assumption lies within the central range of the data, but it is an interval for the mean and not for individual house sizes, so it is not directly comparable to these percentiles.
Modeling
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
I shuffled the data set and created a training set and a testing set.
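A shuffle and split of this kind can be sketched with `sample()`. The residual degrees of freedom in the model summaries below are consistent with 876 training rows, which is 60% of 1,460, but the exact proportion and seed used are assumptions here, and `house_demo` is a made-up stand-in for the real data.

```r
set.seed(605)  # arbitrary seed so the split is reproducible

# Stand-in data frame; in the report this would be the Kaggle training data
house_demo <- data.frame(Id = 1:1460,
                         SalePrice = round(runif(1460, 35000, 755000)))

# Shuffle the rows, then cut at 60% (assumed proportion)
shuffled <- house_demo[sample(nrow(house_demo)), ]
n_train <- round(0.6 * nrow(shuffled))
train_demo <- shuffled[seq_len(n_train), ]
test_demo <- shuffled[(n_train + 1):nrow(shuffled), ]

c(train = nrow(train_demo), test = nrow(test_demo))
```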
Backward Elimination - Linear Regression Model
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + MasVnrArea + X1stFlrSF + X2ndFlrSF +
## BsmtFullBath + BedroomAbvGr + GarageCars + WoodDeckSF + ScreenPorch +
## PoolArea, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -346787 -18564 -2213 14969 263760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.719e+05 1.127e+05 -6.847 1.44e-11 ***
## MSSubClass -1.714e+02 3.167e+01 -5.410 8.18e-08 ***
## LotArea 4.175e-01 1.123e-01 3.718 0.000214 ***
## OverallQual 2.142e+04 1.393e+03 15.370 < 2e-16 ***
## OverallCond 5.225e+03 1.190e+03 4.389 1.28e-05 ***
## YearBuilt 3.553e+02 5.768e+01 6.160 1.12e-09 ***
## MasVnrArea 2.861e+01 7.860e+00 3.640 0.000289 ***
## X1stFlrSF 5.030e+01 4.781e+00 10.522 < 2e-16 ***
## X2ndFlrSF 4.421e+01 4.349e+00 10.165 < 2e-16 ***
## BsmtFullBath 1.479e+04 2.513e+03 5.886 5.67e-09 ***
## BedroomAbvGr -4.547e+03 1.880e+03 -2.419 0.015771 *
## GarageCars 1.295e+04 2.165e+03 5.981 3.26e-09 ***
## WoodDeckSF 3.245e+01 1.055e+01 3.075 0.002169 **
## ScreenPorch 7.432e+01 1.977e+01 3.760 0.000182 ***
## PoolArea -8.536e+01 2.554e+01 -3.342 0.000866 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35620 on 856 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.7713, Adjusted R-squared: 0.7675
## F-statistic: 206.2 on 14 and 856 DF, p-value: < 2.2e-16
The residuals show a trend upward at the beginning and the end. I will therefore try a different model in which I take the log of Sales Price and try to build a linear model that way.
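Backward elimination starts from a full model and removes predictors one at a time. The sketch below uses `step()` on the built-in `mtcars` data as a stand-in; `step()` selects by AIC, which may differ from the p-value-based elimination that may have been used for the model above.

```r
# Full model on a built-in data set (stand-in for the house price data)
full_model <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)

# step() drops one term at a time for as long as doing so lowers the AIC
reduced_model <- step(full_model, direction = "backward", trace = 0)

formula(reduced_model)
summary(reduced_model)$adj.r.squared
```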
Backward Elimination - Linear Regression with log(Sale Price) as the dependent variable
##
## Call:
## lm(formula = log(SalePrice) ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + X1stFlrSF + GrLivArea + BsmtFullBath +
## Fireplaces + GarageCars + ScreenPorch, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.79881 -0.08093 0.00330 0.09401 0.51482
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.822e+00 4.944e-01 7.731 2.96e-14 ***
## MSSubClass -6.857e-04 1.386e-04 -4.947 9.07e-07 ***
## LotArea 1.617e-06 5.046e-07 3.205 0.0014 **
## OverallQual 9.459e-02 6.161e-03 15.352 < 2e-16 ***
## OverallCond 5.105e-02 5.244e-03 9.735 < 2e-16 ***
## YearBuilt 3.449e-03 2.539e-04 13.586 < 2e-16 ***
## X1stFlrSF 1.827e-05 1.977e-05 0.924 0.3558
## GrLivArea 2.135e-04 1.553e-05 13.744 < 2e-16 ***
## BsmtFullBath 7.418e-02 1.107e-02 6.701 3.73e-11 ***
## Fireplaces 5.508e-02 1.025e-02 5.372 1.00e-07 ***
## GarageCars 8.346e-02 9.585e-03 8.707 < 2e-16 ***
## ScreenPorch 3.535e-04 8.897e-05 3.974 7.67e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1589 on 864 degrees of freedom
## Multiple R-squared: 0.8279, Adjusted R-squared: 0.8257
## F-statistic: 377.8 on 11 and 864 DF, p-value: < 2.2e-16
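Because the response is log(SalePrice), each coefficient is a change in log price, and exponentiating converts it to a percent change in price with the other predictors held fixed. The sketch below copies three estimates from the summary above; `coefs` and `pct_change` are illustrative names.

```r
# Estimates copied from the log(SalePrice) model summary above
coefs <- c(OverallQual = 9.459e-02, GarageCars = 8.346e-02, GrLivArea = 2.135e-04)

# Percent change in SalePrice for a one-unit increase in each predictor
pct_change <- 100 * (exp(coefs) - 1)
round(pct_change, 2)

# Percent change for 100 additional square feet of above ground living area
pct_per_100sqft <- 100 * (exp(100 * coefs["GrLivArea"]) - 1)
round(pct_per_100sqft, 2)
```

For example, one additional point of overall quality corresponds to a price increase of roughly 10% under this model.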
Residual Analysis
plot(fitted(price_lm),resid(price_lm))
qqnorm(resid(price_lm))
qqline(resid(price_lm))
The residuals do not show a pattern.
Prediction
predictprice <- predict(price_lm, newdata=test, type="response")
error <- (exp(predictprice)-test$SalePrice)/test$SalePrice
error <- mean(error)
error
## [1] 0.009332353
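This gives the mean signed relative error between the predicted and actual sale prices on the held-out test set: about 0.009, or 0.9%. Because the errors are signed, overestimates and underestimates partly cancel.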
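Metrics that do not let errors cancel give a stricter picture. The sketch below compares the signed mean with the mean absolute percentage error and with the root mean squared error of log prices, which is the metric Kaggle uses to score this competition. The three toy prices and the helper names `mape` and `rmsle` are made up for illustration.

```r
# Mean absolute percentage error: errors cannot cancel
mape <- function(predicted, actual) mean(abs(predicted - actual) / actual)

# Root mean squared error between log predicted and log actual prices
rmsle <- function(predicted, actual) sqrt(mean((log(predicted) - log(actual))^2))

# Toy values (not from the data): one 10% overestimate, one 10% underestimate
actual <- c(100000, 200000, 300000)
predicted <- c(110000, 180000, 300000)

signed_error <- mean((predicted - actual) / actual)   # the two errors cancel
c(signed = signed_error,
  mape = mape(predicted, actual),
  rmsle = rmsle(predicted, actual))
```

On the held-out set the equivalent call would be `rmsle(exp(predictprice), test$SalePrice)`, which is on the same scale as the Kaggle score.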
Apply Model to Test Data from Kaggle
kaggle_test <- read.csv('C:/Users/Swigo/Desktop/Sarah/DATA 605/kaggle_house_test_data.csv')
predict_kaggle_price <- predict(price_lm, newdata=kaggle_test, type="response")
predict_kaggle_price <- exp(predict_kaggle_price)
head(predict_kaggle_price)
## 1 2 3 4 5 6
## 120991.8 140073.3 163719.7 187850.2 188111.4 176848.6
predict_kaggle_price[is.na(predict_kaggle_price)] <- mean(train$SalePrice)
submission <- data.frame(list("Id"=kaggle_test$Id, "SalePrice"=predict_kaggle_price), stringsAsFactors = FALSE)
head(submission)
## Id SalePrice
## 1 1461 120991.8
## 2 1462 140073.3
## 3 1463 163719.7
## 4 1464 187850.2
## 5 1465 188111.4
## 6 1466 176848.6
write.csv(submission, file="house_submission.csv", row.names=FALSE, col.names=TRUE,sep='\t')
## Warning in write.csv(submission, file = "house_submission.csv", row.names =
## FALSE, : attempt to set 'col.names' ignored
## Warning in write.csv(submission, file = "house_submission.csv", row.names =
## FALSE, : attempt to set 'sep' ignored
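The two warnings arise because `write.csv()` fixes the separator and the header itself and ignores attempts to set `sep` or `col.names`. A call without those arguments writes the same comma-separated file without warnings. The sketch below uses the first three predictions shown above and writes to a temporary file; `submission_demo` and `out_file` are illustrative names.

```r
# First three rows of the submission shown above
submission_demo <- data.frame(Id = 1461:1463,
                              SalePrice = c(120991.8, 140073.3, 163719.7))

out_file <- tempfile(fileext = ".csv")

# write.csv() always uses a comma separator and writes the column names
write.csv(submission_demo, file = out_file, row.names = FALSE)

readLines(out_file)
```

If a tab-separated file were actually wanted, `write.table()` accepts `sep = "\t"`.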
Kaggle username: Sarah Wigodsky
Kaggle score: 0.14173