Your final is due by the end of day on 5/20/2018. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
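library(dplyr)   # provides %>% and select_if()
library(Hmisc)   # provides describe()
setwd('/Users/gaboston/Documents/bethany/ds605_final')
train <- read.csv('train.csv')
train_names <- colnames(train)
# Keep only the numeric columns for the descriptive statistics
numeric_train <- train %>%
  select_if(is.numeric)
(describe(numeric_train))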
## numeric_train
##
## 38 Variables 1460 Observations
## ---------------------------------------------------------------------------
## Id
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 1460 1 730.5 487 73.95 146.90
## .25 .50 .75 .90 .95
## 365.75 730.50 1095.25 1314.10 1387.05
##
## lowest : 1 2 3 4 5, highest: 1456 1457 1458 1459 1460
## ---------------------------------------------------------------------------
## MSSubClass
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 15 0.94 56.9 43.19 20 20
## .25 .50 .75 .90 .95
## 20 50 70 120 160
##
## Value 20 30 40 45 50 60 70 75 80 85
## Frequency 536 69 4 12 144 299 60 16 58 20
## Proportion 0.367 0.047 0.003 0.008 0.099 0.205 0.041 0.011 0.040 0.014
##
## Value 90 120 160 180 190
## Frequency 52 87 63 10 30
## Proportion 0.036 0.060 0.043 0.007 0.021
## ---------------------------------------------------------------------------
## LotFrontage
## n missing distinct Info Mean Gmd .05 .10
## 1201 259 110 0.998 70.05 24.61 34 44
## .25 .50 .75 .90 .95
## 59 69 80 96 107
##
## lowest : 21 24 30 32 33, highest: 160 168 174 182 313
## ---------------------------------------------------------------------------
## LotArea
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 1073 1 10517 5718 3312 5000
## .25 .50 .75 .90 .95
## 7554 9478 11602 14382 17401
##
## lowest : 1300 1477 1491 1526 1533, highest: 70761 115149 159000 164660 215245
## ---------------------------------------------------------------------------
## OverallQual
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 10 0.951 6.099 1.522 4 5
## .25 .50 .75 .90 .95
## 5 6 7 8 8
##
## Value 1 2 3 4 5 6 7 8 9 10
## Frequency 2 3 20 116 397 374 319 168 43 18
## Proportion 0.001 0.002 0.014 0.079 0.272 0.256 0.218 0.115 0.029 0.012
## ---------------------------------------------------------------------------
## OverallCond
## n missing distinct Info Mean Gmd
## 1460 0 9 0.814 5.575 1.111
##
## Value 1 2 3 4 5 6 7 8 9
## Frequency 1 5 25 57 821 252 205 72 22
## Proportion 0.001 0.003 0.017 0.039 0.562 0.173 0.140 0.049 0.015
## ---------------------------------------------------------------------------
## YearBuilt
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 112 1 1971 33.88 1916 1925
## .25 .50 .75 .90 .95
## 1954 1973 2000 2006 2007
##
## lowest : 1872 1875 1880 1882 1885, highest: 2006 2007 2008 2009 2010
## ---------------------------------------------------------------------------
## YearRemodAdd
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 61 0.997 1985 23.05 1950 1950
## .25 .50 .75 .90 .95
## 1967 1994 2004 2006 2007
##
## lowest : 1950 1951 1952 1953 1954, highest: 2006 2007 2008 2009 2010
## ---------------------------------------------------------------------------
## MasVnrArea
## n missing distinct Info Mean Gmd .05 .10
## 1452 8 327 0.791 103.7 156.9 0 0
## .25 .50 .75 .90 .95
## 0 0 166 335 456
##
## lowest : 0 1 11 14 16, highest: 1115 1129 1170 1378 1600
## ---------------------------------------------------------------------------
## BsmtFinSF1
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 637 0.967 443.6 484.5 0.0 0.0
## .25 .50 .75 .90 .95
## 0.0 383.5 712.2 1065.5 1274.0
##
## lowest : 0 2 16 20 24, highest: 1904 2096 2188 2260 5644
## ---------------------------------------------------------------------------
## BsmtFinSF2
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 144 0.305 46.55 86.58 0.0 0.0
## .25 .50 .75 .90 .95
## 0.0 0.0 0.0 117.2 396.2
##
## lowest : 0 28 32 35 40, highest: 1080 1085 1120 1127 1474
## ---------------------------------------------------------------------------
## BsmtUnfSF
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 780 0.999 567.2 486.6 0.0 74.9
## .25 .50 .75 .90 .95
## 223.0 477.5 808.0 1232.0 1468.0
##
## lowest : 0 14 15 23 26, highest: 2042 2046 2121 2153 2336
## ---------------------------------------------------------------------------
## TotalBsmtSF
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 721 1 1057 459.5 519.3 636.9
## .25 .50 .75 .90 .95
## 795.8 991.5 1298.2 1602.2 1753.0
##
## lowest : 0 105 190 264 270, highest: 3094 3138 3200 3206 6110
## ---------------------------------------------------------------------------
## X1stFlrSF
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 753 1 1163 416.4 673.0 756.9
## .25 .50 .75 .90 .95
## 882.0 1087.0 1391.2 1680.0 1831.2
##
## lowest : 334 372 438 480 483, highest: 2633 2898 3138 3228 4692
## ---------------------------------------------------------------------------
## X2ndFlrSF
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 417 0.817 347 450.2 0.0 0.0
## .25 .50 .75 .90 .95
## 0.0 0.0 728.0 954.2 1141.0
##
## lowest : 0 110 167 192 208, highest: 1611 1796 1818 1872 2065
## ---------------------------------------------------------------------------
## LowQualFinSF
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 24 0.052 5.845 11.55 0 0
## .25 .50 .75 .90 .95
## 0 0 0 0 0
##
## lowest : 0 53 80 120 144, highest: 513 514 515 528 572
## ---------------------------------------------------------------------------
## GrLivArea
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 861 1 1515 563.1 848 912
## .25 .50 .75 .90 .95
## 1130 1464 1777 2158 2466
##
## lowest : 334 438 480 520 605, highest: 3627 4316 4476 4676 5642
## ---------------------------------------------------------------------------
## BsmtFullBath
## n missing distinct Info Mean Gmd
## 1460 0 4 0.733 0.4253 0.5085
##
## Value 0 1 2 3
## Frequency 856 588 15 1
## Proportion 0.586 0.403 0.010 0.001
## ---------------------------------------------------------------------------
## BsmtHalfBath
## n missing distinct Info Mean Gmd
## 1460 0 3 0.159 0.05753 0.1088
##
## Value 0 1 2
## Frequency 1378 80 2
## Proportion 0.944 0.055 0.001
## ---------------------------------------------------------------------------
## FullBath
## n missing distinct Info Mean Gmd
## 1460 0 4 0.766 1.565 0.5521
##
## Value 0 1 2 3
## Frequency 9 650 768 33
## Proportion 0.006 0.445 0.526 0.023
## ---------------------------------------------------------------------------
## HalfBath
## n missing distinct Info Mean Gmd
## 1460 0 3 0.706 0.3829 0.4852
##
## Value 0 1 2
## Frequency 913 535 12
## Proportion 0.625 0.366 0.008
## ---------------------------------------------------------------------------
## BedroomAbvGr
## n missing distinct Info Mean Gmd
## 1460 0 8 0.815 2.866 0.818
##
## Value 0 1 2 3 4 5 6 8
## Frequency 6 50 358 804 213 21 7 1
## Proportion 0.004 0.034 0.245 0.551 0.146 0.014 0.005 0.001
## ---------------------------------------------------------------------------
## KitchenAbvGr
## n missing distinct Info Mean Gmd
## 1460 0 4 0.133 1.047 0.09174
##
## Value 0 1 2 3
## Frequency 1 1392 65 2
## Proportion 0.001 0.953 0.045 0.001
## ---------------------------------------------------------------------------
## TotRmsAbvGrd
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 12 0.958 6.518 1.762 4 5
## .25 .50 .75 .90 .95
## 5 6 7 9 10
##
## Value 2 3 4 5 6 7 8 9 10 11
## Frequency 1 17 97 275 402 329 187 75 47 18
## Proportion 0.001 0.012 0.066 0.188 0.275 0.225 0.128 0.051 0.032 0.012
##
## Value 12 14
## Frequency 11 1
## Proportion 0.008 0.001
## ---------------------------------------------------------------------------
## Fireplaces
## n missing distinct Info Mean Gmd
## 1460 0 4 0.806 0.613 0.6566
##
## Value 0 1 2 3
## Frequency 690 650 115 5
## Proportion 0.473 0.445 0.079 0.003
## ---------------------------------------------------------------------------
## GarageYrBlt
## n missing distinct Info Mean Gmd .05 .10
## 1379 81 97 1 1979 27.63 1930 1945
## .25 .50 .75 .90 .95
## 1961 1980 2002 2006 2007
##
## lowest : 1900 1906 1908 1910 1914, highest: 2006 2007 2008 2009 2010
## ---------------------------------------------------------------------------
## GarageCars
## n missing distinct Info Mean Gmd
## 1460 0 5 0.802 1.767 0.7609
##
## Value 0 1 2 3 4
## Frequency 81 369 824 181 5
## Proportion 0.055 0.253 0.564 0.124 0.003
## ---------------------------------------------------------------------------
## GarageArea
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 441 1 473 234.9 0.0 240.0
## .25 .50 .75 .90 .95
## 334.5 480.0 576.0 757.1 850.1
##
## lowest : 0 160 164 180 186, highest: 1220 1248 1356 1390 1418
## ---------------------------------------------------------------------------
## WoodDeckSF
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 274 0.858 94.24 125 0 0
## .25 .50 .75 .90 .95
## 0 0 168 262 335
##
## lowest : 0 12 24 26 28, highest: 668 670 728 736 857
## ---------------------------------------------------------------------------
## OpenPorchSF
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 202 0.909 46.66 62.43 0 0
## .25 .50 .75 .90 .95
## 0 25 68 130 175
##
## lowest : 0 4 8 10 11, highest: 406 418 502 523 547
## ---------------------------------------------------------------------------
## EnclosedPorch
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 120 0.369 21.95 39.39 0.0 0.0
## .25 .50 .75 .90 .95
## 0.0 0.0 0.0 112.0 180.1
##
## lowest : 0 19 20 24 30, highest: 301 318 330 386 552
## ---------------------------------------------------------------------------
## X3SsnPorch
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 20 0.049 3.41 6.739 0 0
## .25 .50 .75 .90 .95
## 0 0 0 0 0
##
## Value 0 23 96 130 140 144 153 162 168 180
## Frequency 1436 1 1 1 1 2 1 1 3 2
## Proportion 0.984 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.002 0.001
##
## Value 182 196 216 238 245 290 304 320 407 508
## Frequency 1 1 2 1 1 1 1 1 1 1
## Proportion 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
## ---------------------------------------------------------------------------
## ScreenPorch
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 76 0.22 15.06 28.27 0 0
## .25 .50 .75 .90 .95
## 0 0 0 0 160
##
## lowest : 0 40 53 60 63, highest: 385 396 410 440 480
## ---------------------------------------------------------------------------
## PoolArea
## n missing distinct Info Mean Gmd
## 1460 0 8 0.014 2.759 5.497
##
## Value 0 480 512 519 555 576 648 738
## Frequency 1453 1 1 1 1 1 1 1
## Proportion 0.995 0.001 0.001 0.001 0.001 0.001 0.001 0.001
## ---------------------------------------------------------------------------
## MiscVal
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 21 0.103 43.49 85.67 0 0
## .25 .50 .75 .90 .95
## 0 0 0 0 0
##
## Value 0 50 350 400 450 500 550 600 700 800
## Frequency 1408 1 1 11 4 10 1 5 5 1
## Proportion 0.964 0.001 0.001 0.008 0.003 0.007 0.001 0.003 0.003 0.001
##
## Value 1150 1200 1300 1400 2000 2500 3500 8300 15500
## Frequency 1 2 1 1 4 1 1 1 1
## Proportion 0.001 0.001 0.001 0.001 0.003 0.001 0.001 0.001 0.001
## ---------------------------------------------------------------------------
## MoSold
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 12 0.985 6.322 3.041 2 3
## .25 .50 .75 .90 .95
## 5 6 8 10 11
##
## Value 1 2 3 4 5 6 7 8 9 10
## Frequency 58 52 106 141 204 253 234 122 63 89
## Proportion 0.040 0.036 0.073 0.097 0.140 0.173 0.160 0.084 0.043 0.061
##
## Value 11 12
## Frequency 79 59
## Proportion 0.054 0.040
## ---------------------------------------------------------------------------
## YrSold
## n missing distinct Info Mean Gmd
## 1460 0 5 0.955 2008 1.498
##
## Value 2006 2007 2008 2009 2010
## Frequency 314 329 304 338 175
## Proportion 0.215 0.225 0.208 0.232 0.120
## ---------------------------------------------------------------------------
## SalePrice
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 663 1 180921 81086 88000 106475
## .25 .50 .75 .90 .95
## 129975 163000 214000 278000 326100
##
## lowest : 34900 35311 37900 39300 40000, highest: 582933 611657 625000 745000 755000
## ---------------------------------------------------------------------------
# The Following is The Basic Statistics of the Numeric Variables
### Outcome Variable
Y <- as.numeric(numeric_train[,'SalePrice'])
# The following are Statistical Summaries for the Outcome Variable, Sales Price
(summary(Y))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
Using positive skew greater than 2 as the screening criterion for the variables described above, I chose LotArea as my X for this exploration. A large lot might be seen as valuable, and lot sizes vary widely (from 1,300 to 215,245 square feet), so there may be some connection to the outcome variable, SalePrice. LotArea is also right-skewed, as requested.
X = as.numeric(numeric_train[,'LotArea'])
hist(X, breaks=30, main = "Histogram of Lot Area", xlab = 'Lot Area in Square Feet', ylab = 'Frequency', col= 'magenta')
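The skew screen described above can be sketched in a few lines. This is a minimal illustration on simulated data, assuming the moment-based skewness formula; `skewness_moment` and `toy_lot_area` are hypothetical names and are not part of the original analysis.

```r
# Moment-based sample skewness: third central moment over the
# second central moment raised to the power 3/2.
skewness_moment <- function(v) {
  v <- v[!is.na(v)]
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(605)
# Simulated right-skewed stand-in for a variable such as LotArea (assumption)
toy_lot_area <- rlnorm(1000, meanlog = 9, sdlog = 0.8)
toy_skew <- skewness_moment(toy_lot_area)

# A symmetric vector has zero skew by construction
sym_skew <- skewness_moment(c(-2, -1, 0, 1, 2))
```

A positive value indicates a longer right tail; applying the same function to each numeric column would give one way to rank candidate X variables.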
Calculate as a minimum the below probabilities a through c.
Assume the small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable.
Interpret the meaning of all probabilities.
\(P(X>x | Y>y)\)
\(P(X>x,Y>y)\)
\(P(X<x|Y>y)\)
In addition, make a table of counts as shown below.
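x <- quantile(X)[2]    # 1st quartile of X
y <- quantile(Y)[2]    # 1st quartile of Y
x_2 <- quantile(X)[3]  # median of X
y_2 <- quantile(Y)[3]  # median of Y
x_3 <- quantile(X)[4]  # 3rd quartile of X
y_3 <- quantile(Y)[4]  # 3rd quartile of Y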
# Combine X and Y into one matrix for simplicity, then compute the denominators:
# denXY counts all observations; denY counts only those with Y above its 25th percentile.
XY <-cbind(X,Y)
denXY <- nrow(XY) #All of XY
denY <-nrow(subset(XY, Y>y)) # all of XY where Y>y
Because we only consider X values above the 25th percentile among observations already known to have Y above its 25th percentile, the numerator counts the rows of XY where both X and Y exceed their 25th percentiles, and the denominator is the number of rows with Y above its 25th percentile.
a <- nrow(subset(XY, (X>x & Y>y)))/denY
In this case the numerator is the same as above, but because the probability is not conditioned on Y being above its 25th percentile, the denominator is all observations in the set.
b <- nrow(subset(XY, (X>x & Y>y)))/denXY
c <- nrow(subset(XY, (X<x & Y>y)))/denY
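The three probabilities can also be computed from logical vectors, since the mean of a logical vector is the proportion of TRUE values. The sketch below uses simulated data rather than the Kaggle file; `toy_X`, `toy_Y`, `A`, and `B` are hypothetical names chosen for illustration.

```r
set.seed(605)
# Simulated stand-ins for X (lot area) and Y (sale price); not the Kaggle data
toy_X <- rlnorm(500, meanlog = 9, sdlog = 0.5)
toy_Y <- 20 * toy_X + rnorm(500, sd = 50000)

qx <- quantile(toy_X, 0.25)  # 1st quartile of X
qy <- quantile(toy_Y, 0.25)  # 1st quartile of Y

A <- toy_X > qx  # event X > x
B <- toy_Y > qy  # event Y > y

p_joint   <- mean(A & B)              # P(X > x, Y > y)
p_a_given <- mean(A & B) / mean(B)    # P(X > x | Y > y)
# P(X <= x | Y > y); equals P(X < x | Y > y) when no value ties with x
p_c_given <- mean(!A & B) / mean(B)
```

The two conditional probabilities share the same condition, so they sum to one, which gives a quick sanity check on the counts.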
# Row 1: lots at or below the 3rd quartile of X, split at the median of Y
X_LE_3_Y_EL_2 <- nrow(subset(XY, (X <= x_3 & Y <= y_2)))
X_LE_3_Y_GT_2 <- nrow(subset(XY, (X <= x_3 & Y > y_2)))
X_LE_3_TOT <- X_LE_3_Y_EL_2 + X_LE_3_Y_GT_2
Lots_Less_Than <- data.frame(Price_At_Below_163k = X_LE_3_Y_EL_2, Price_Greater_163k = X_LE_3_Y_GT_2, Total = X_LE_3_TOT)
# Row 2: lots above the 3rd quartile of X, split at the median of Y
XY_GT_3_Y_EL_2 <- nrow(subset(XY, (X > x_3 & Y <= y_2)))
XY_GT_3_Y_GT_2 <- nrow(subset(XY, (X > x_3 & Y > y_2)))
X_GT_3_TOT <- XY_GT_3_Y_EL_2 + XY_GT_3_Y_GT_2
Lots_Greater_Than <- data.frame(Price_At_Below_163k = XY_GT_3_Y_EL_2, Price_Greater_163k = XY_GT_3_Y_GT_2, Total = X_GT_3_TOT)
# Row 3: column totals
Totals <- data.frame(Price_At_Below_163k = (X_LE_3_Y_EL_2 + XY_GT_3_Y_EL_2), Price_Greater_163k = (X_LE_3_Y_GT_2 + XY_GT_3_Y_GT_2), Total = (X_LE_3_TOT + X_GT_3_TOT))
table_probs <- data.frame(rbind(Lots_Less_Than, Lots_Greater_Than, Totals))
# Label the rows with the cutoff actually used (x_3) so the labels cannot drift from the code
row.names(table_probs) <- c(paste0("Lots_At_Or_Below_", round(x_3)), paste0("Lots_Greater_Than_", round(x_3)), "Totals")
DT::datatable(table_probs)
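A table of counts with margins can also be built with base R's table() and addmargins(). This is a sketch under stated assumptions: the data are simulated, and the cut points (third quartile of X, median of Y) and the names `toy_X`, `toy_Y`, `x_cut`, and `y_cut` are illustrative only.

```r
set.seed(605)
# Simulated stand-ins for lot area and sale price; not the Kaggle data
toy_X <- rlnorm(500, meanlog = 9, sdlog = 0.5)
toy_Y <- 20 * toy_X + rnorm(500, sd = 50000)

x_cut <- quantile(toy_X, 0.75)  # third quartile of X
y_cut <- quantile(toy_Y, 0.50)  # median of Y

# Cross-tabulate the two splits
counts <- table(
  X = ifelse(toy_X <= x_cut, "<= Q3", "> Q3"),
  Y = ifelse(toy_Y <= y_cut, "<= Q2", "> Q2")
)

# Append row and column totals (labelled "Sum")
counts_with_totals <- addmargins(counts)
```

This avoids tracking each cell in its own variable, at the cost of less control over the row and column labels.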
(where \(P(A | B)\) is problem a. from above)
p_ab <-nrow(subset(XY, (X>x & Y>y)))/denY
p_b <-nrow(subset(XY, Y>y))/nrow(XY)
pAB <- p_ab * p_b
p_a <-nrow(subset(XY, X>x))/nrow(XY)
P_A_B <- p_a*p_b
answer =pAB==P_A_B
FALSE
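Comparing two floating-point products with == can report FALSE for values that differ only by rounding error, so a tolerance-based comparison is often safer. The sketch below is an assumption-laden illustration on constructed vectors; `is_independent` is a hypothetical helper, not a function from the original analysis or from any package.

```r
# Check P(A and B) == P(A) * P(B) up to a numeric tolerance.
is_independent <- function(A, B, tol = 1e-8) {
  p_joint <- mean(A & B)
  isTRUE(all.equal(p_joint, mean(A) * mean(B), tolerance = tol))
}

# Constructed example: 100 observations, P(A) = 0.5
A <- rep(c(TRUE, FALSE), each = 50)

# B_indep is TRUE for half of the A rows and half of the non-A rows,
# so P(A and B) = 0.25 = P(A) * P(B).
B_indep <- rep(c(TRUE, FALSE, TRUE, FALSE), each = 25)

# B_dep copies A, so P(A and B) = 0.5, not 0.25.
B_dep <- A

res_indep <- is_independent(A, B_indep)
res_dep <- is_independent(A, B_dep)
```

Exact equality of sample proportions is a strict criterion; with real data, a statistical test is the more usual way to judge independence.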
A<- subset(X, X>x)
B<-subset(Y, Y>y)
tab_ab <- table(A, B)
ch_sq <-chisq.test(tab_ab)
Statistic: 428917.745478
p-value: 0.705902
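Here the null hypothesis is that the two variables are independent, and with a p-value of 0.706 we fail to reject it. This result should be read with caution, however. table(A, B) cross-tabulates the raw LotArea and SalePrice values rather than the indicators X > x and Y > y, so nearly every cell holds a count of 0 or 1 and the chi-square approximation is unreliable. The result also disagrees with the mathematical check above, where P(A|B)P(B) did not equal P(A)P(B). A 2 x 2 table of the two indicators, built as table(X > x, Y > y), is the appropriate input for testing whether this split produces independent events.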
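A chi-square test of association for a quartile split is usually run on a 2 x 2 table of indicator variables. The following is a minimal sketch on simulated data, not a re-run of the analysis above; `toy_X`, `toy_Y`, `tab_2x2`, and `chi_2x2` are hypothetical names.

```r
set.seed(605)
# Simulated stand-ins for lot area and sale price; not the Kaggle data
toy_X <- rlnorm(500, meanlog = 9, sdlog = 0.5)
toy_Y <- 20 * toy_X + rnorm(500, sd = 50000)

# Indicators for "above the 1st quartile"
A <- toy_X > quantile(toy_X, 0.25)
B <- toy_Y > quantile(toy_Y, 0.25)

# 2 x 2 table of counts: rows are A (FALSE/TRUE), columns are B (FALSE/TRUE)
tab_2x2 <- table(A, B)

# For 2 x 2 tables chisq.test() applies Yates' continuity correction by default
chi_2x2 <- chisq.test(tab_2x2)
```

The test then has one degree of freedom, and each cell count is large enough for the chi-square approximation to be reasonable.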
corrplot(cor(as.matrix(numeric_train),use = "complete.obs"), type="lower", insig = "n", tl.srt = 45,number.cex = .30, order = "hclust",tl.col = 'black',tl.cex=.75)
hist(numeric_train$GrLivArea, main="Above-Grade Living Area", xlab="Square Feet of Finished Living Area", col='navy')
hist(numeric_train$TotRmsAbvGrd, main="Total Number of Above-Ground Rooms", xlab='Rooms Above Ground', col='navy')
hist(numeric_train$YearRemodAdd, main="Year of Remodel or Addition", xlab='Year', col='navy')
hist(numeric_train$FullBath, main="Number of Full Baths", xlab='Full Baths', col='navy')
hist(numeric_train$GarageCars, main="Number of Car Bays in Garage", xlab="Car Bays", col='navy')
hist(numeric_train$GarageArea, main = "Square Footage of Garage", xlab = 'Square Feet', col ='navy')
hist(numeric_train$TotalBsmtSF, main = 'Square Footage of Basement', xlab='Square Feet', col = 'navy')
hist(numeric_train$X1stFlrSF, main='First Floor Square Footage', xlab ='Square Feet', col ='navy')
data <- na.omit(train[,c('LotArea','SalePrice')])
plot(x =data$LotArea, y= data$SalePrice, main = "Sale Price Vs. Square Feet of Lot", xlab = 'Square Feet' , ylab =' Sale Price', col = 'maroon')
I chose three variables that each had a relatively high correlation with SalePrice in the correlation plot above, so that there would be some interaction among them to evaluate in the following tests.
three_variables<- data.frame( cbind(train$GrLivArea, train$GarageArea, train$TotalBsmtSF))
colnames(three_variables)<-c('GrLivArea', 'GarageArea', 'TotalBsmtSF')
three_cor <- cor(three_variables, method = 'pearson', use = 'complete.obs')
| | GrLivArea | GarageArea | TotalBsmtSF |
|---|---|---|---|
| GrLivArea | 1.0000000 | 0.4689975 | 0.4548682 |
| GarageArea | 0.4689975 | 1.0000000 | 0.4866655 |
| TotalBsmtSF | 0.4548682 | 0.4866655 | 1.0000000 |
GrLivArea_GarageArea <-cor.test(train$GrLivArea, train$GarageArea,
alternative = "two.sided",
method = "pearson",
exact = NULL, conf.level = 0.92)
GrLivArea_TotalBsmtSF <-cor.test(train$GrLivArea, train$TotalBsmtSF,
alternative = "two.sided",
method = "pearson",
exact = NULL, conf.level = 0.92)
GarageArea_TotalBsmtSF <-cor.test(train$GarageArea, train$TotalBsmtSF,
alternative = "two.sided",
method = "pearson",
exact = NULL, conf.level = 0.92)
Residential Area & Garage Area:
Estimate: 0.4689975
p-value: 0
Interval: 0.4324608 0.5039965
Garage Area & Total Basement Area:
Estimate: 0.4866655
p-value: 0
Interval: 0.4508901 0.5208797
Residential Area & Total Basement Area:
Estimate: 0.4548682
p-value: 0
Interval: 0.4177447 0.4904754
All three correlation coefficients (estimates) indicate a moderate positive association, and the extremely small p-values let us reject the null hypothesis of zero correlation in each case. The variables therefore tend to change together to some extent, or are each tied to another variable with which they trend together.
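For a Pearson correlation, the interval reported by cor.test() comes from the Fisher z-transform. The sketch below reproduces that calculation by hand on simulated data so the steps are visible; `u`, `v`, and `ci_manual` are hypothetical names, and the 0.92 level is used only to mirror the calls above.

```r
set.seed(605)
# Simulated pair of correlated variables; not the Kaggle data
u <- rnorm(200)
v <- 0.5 * u + rnorm(200)

r <- cor(u, v)
n <- length(u)
conf_level <- 0.92

z <- atanh(r)                             # Fisher z-transform of r
se <- 1 / sqrt(n - 3)                     # approximate standard error of z
crit <- qnorm(1 - (1 - conf_level) / 2)   # two-sided normal critical value
ci_manual <- tanh(z + c(-1, 1) * crit * se)  # back-transform to the r scale

# Interval from cor.test() for comparison
ci_cor_test <- cor.test(u, v, conf.level = conf_level)$conf.int
```

Because the standard error depends only on n, raising the confidence level widens the interval without changing the estimate or the p-value.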
Familywise error is the risk of at least one false positive when several tests are run together. Our results are significant by such a wide margin that this is unlikely to be an issue. To be sure, however, we can re-run the analysis with the alpha value (1 minus the confidence level) divided by the number of tests, in this case three. Here the baseline is \(\alpha = .05\), which is stricter than the 0.08 implied by the 92% intervals above.
\(n = 3\)
\(\alpha_n = \frac{\alpha}{n} = \frac{.05}{3} \approx .0167\)
\(1-\alpha_n = .9833\)
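The same adjustment can be written out in R. This is a minimal sketch of a Bonferroni correction; the raw p-values in `p_raw` are made up for illustration and do not come from the tests in this report.

```r
alpha <- 0.05
n_tests <- 3

alpha_adj <- alpha / n_tests   # per-test significance level, about 0.0167
conf_adj <- 1 - alpha_adj      # matching confidence level, about 0.9833

# Equivalent view: inflate the p-values instead of shrinking alpha.
# These raw p-values are invented for illustration only.
p_raw <- c(0.001, 0.020, 0.040)
p_bonf <- p.adjust(p_raw, method = "bonferroni")  # each p-value times n_tests, capped at 1
```

Either form leads to the same decisions: a test is significant if its raw p-value is below `alpha_adj`, or equivalently if its adjusted p-value is below `alpha`.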
GrLivArea_GarageArea2 <-cor.test(train$GrLivArea, train$GarageArea,
alternative = "two.sided",
method = "pearson",
exact = NULL, conf.level = 0.9833)
GrLivArea_TotalBsmtSF2 <-cor.test(train$GrLivArea, train$TotalBsmtSF,
alternative = "two.sided",
method = "pearson",
exact = NULL, conf.level = 0.9833)
GarageArea_TotalBsmtSF2 <-cor.test(train$GarageArea, train$TotalBsmtSF,
alternative = "two.sided",
method = "pearson",
exact = NULL, conf.level = 0.9833)
Residential Area & Garage Area:
Estimate: 0.4689975
p-value: 0
Interval: 0.4186762 0.5164475
Garage Area & Total Basement Area:
Estimate: 0.4866655
p-value: 0
Interval: 0.4373772 0.5330386
Residential Area & Total Basement Area:
Estimate: 0.4548682
p-value: 0
Interval: 0.4037514 0.5031538
From the above calculations, you can see that despite the stricter \(\alpha\) value, we still have many zeros between our p-values and the risk of a Type I error (despite having slightly wider intervals now). It is very likely a safe bet that these results are not an artifact of familywise error.
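The adjustment can also be applied to the p-values themselves rather than to the confidence level. The following is a minimal sketch on simulated data (the data frame `sim` and its columns are illustrative assumptions, not the Ames variables); it uses `p.adjust()` from base R's stats package.

```r
# Sketch with simulated data; `sim` and its columns a, b, c are
# illustrative assumptions, not the Ames data.
set.seed(42)
n_obs <- 500
base  <- rnorm(n_obs)
sim <- data.frame(
  a = base + rnorm(n_obs),
  b = base + rnorm(n_obs),
  c = base + rnorm(n_obs)
)

alpha      <- 0.05
pairs      <- combn(names(sim), 2, simplify = FALSE)  # the three pairwise tests
n_tests    <- length(pairs)
conf_level <- 1 - alpha / n_tests                     # Bonferroni-adjusted level

results <- lapply(pairs, function(p) {
  cor.test(sim[[p[1]]], sim[[p[2]]], method = "pearson", conf.level = conf_level)
})

p_raw <- sapply(results, function(r) r$p.value)
p_adj <- p.adjust(p_raw, method = "bonferroni")  # multiplies by n_tests, capped at 1
```

Comparing `p_adj` against the original \(\alpha\) is equivalent to comparing `p_raw` against \(\alpha / n\).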
##Invert Three-Way Correlation
require(Matrix)
three_precise <- solve(three_cor)
print(three_precise)
## GrLivArea GarageArea TotalBsmtSF
## GrLivArea 1.4030275 -0.4552539 -0.4166363
## GarageArea -0.4552539 1.4580675 -0.5025106
## TotalBsmtSF -0.4166363 -0.5025106 1.4340691
cor_prec <- three_cor%*%three_precise
print(cor_prec)
## GrLivArea GarageArea
## GrLivArea 1.00000000000000000000000 -0.00000000000000002775558
## GarageArea 0.00000000000000002775558 0.99999999999999977795540
## TotalBsmtSF 0.00000000000000005551115 -0.00000000000000011102230
## TotalBsmtSF
## GrLivArea 0.0000000000000000000000
## GarageArea -0.0000000000000001110223
## TotalBsmtSF 0.9999999999999997779554
prec_cor<- three_precise%*%three_cor
print(prec_cor)
## GrLivArea GarageArea
## GrLivArea 1.00000000000000000000000 0.00000000000000002775558
## GarageArea -0.00000000000000002775558 0.99999999999999977795540
## TotalBsmtSF 0.00000000000000000000000 -0.00000000000000011102230
## TotalBsmtSF
## GrLivArea 0.00000000000000005551115
## GarageArea -0.00000000000000011102230
## TotalBsmtSF 0.99999999999999977795540
From these values we can see that both products, in either order, give the same result: the diagonal values are 1 and the off-diagonal values are so small (on the order of 1e-16, i.e. floating-point error) that rounding brings them to 0. Both are therefore the identity matrix, which confirms that the precision matrix is the inverse of the correlation matrix.
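Rather than reading the tiny off-diagonal values by eye, the same check can be done with a tolerance. This sketch rebuilds the correlation matrix from the three estimates printed earlier (the names `three_cor_check` and `precision_check` are introduced here only for illustration) and compares the product with the identity matrix.

```r
# Correlation matrix rebuilt from the pairwise estimates reported above
vars <- c("GrLivArea", "GarageArea", "TotalBsmtSF")
three_cor_check <- matrix(
  c(1,         0.4689975, 0.4548682,
    0.4689975, 1,         0.4866655,
    0.4548682, 0.4866655, 1),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# The precision matrix is the inverse of the correlation matrix
precision_check <- solve(three_cor_check)

# all.equal() compares within a numeric tolerance, so floating-point
# noise around 1e-16 does not count as a difference
identity_ok <- isTRUE(all.equal(three_cor_check %*% precision_check,
                                diag(3), check.attributes = FALSE))
```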
lu_decomp <-lu(three_cor)
expand(lu_decomp)
## $L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.0000000 . .
## [2,] 0.4689975 1.0000000 .
## [3,] 0.4548682 0.3504089 1.0000000
##
## $U
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.0000000 0.4689975 0.4548682
## [2,] . 0.7800414 0.2733334
## [3,] . . 0.6973165
##
## $P
## 3 x 3 sparse Matrix of class "pMatrix"
##
## [1,] | . .
## [2,] . | .
## [3,] . . |
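As a quick sanity check on the decomposition, multiplying the printed factors back together should reproduce the correlation matrix to within rounding. The sketch below types in the L and U values shown above by hand (P is the identity here, so it is omitted).

```r
# L and U copied from the expand(lu_decomp) output above
L <- matrix(c(1,         0,         0,
              0.4689975, 1,         0,
              0.4548682, 0.3504089, 1), nrow = 3, byrow = TRUE)
U <- matrix(c(1, 0.4689975, 0.4548682,
              0, 0.7800414, 0.2733334,
              0, 0,         0.6973165), nrow = 3, byrow = TRUE)

rebuilt <- L %*% U                            # should equal the correlation matrix
max_diag_error <- max(abs(diag(rebuilt) - 1)) # a correlation matrix has 1s on the diagonal
det_from_u <- prod(diag(U))                   # determinant of L %*% U, since L is unitriangular
```

Because L has ones on its diagonal, the determinant of the correlation matrix is just the product of the diagonal of U.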
Many times, it makes sense to fit a closed form distribution to data. For the first variable that you selected which is skewed to the right, shift it so that the minimum value is above zero as necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. See: MASS.
(summary(X))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
The minimum is greater than zero, so we can proceed with the rest of this process as-is.
x_dist<-fitdistr(X, densfun = 'exponential')
lambda <- x_dist$estimate
re_distibution <- rexp(1000, lambda)
(summary(re_distibution))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.24 3344.42 7692.38 10571.80 14703.51 70187.17
par(mfrow=c(1,2))
hist(numeric_train$LotArea, main = "Ames Iowa Lot Areas ", xlab='Square Feet', col ='navy')
hist(re_distibution, main = "Simulated Lot Areas", xlab="Simulated Values Square Feet", col = "darkorange")
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
p_05_exp <- round(log(1-0.05)/-lambda,2)
p_95_exp <- round(log(1-0.95)/-lambda,2)
Exponential Distribution:
5th Percentile: 539.44
95th Percentile: 31505.6
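The hand-written formula comes from inverting the exponential CDF, \(F(x) = 1 - e^{-\lambda x}\), which gives \(x_p = -\ln(1-p)/\lambda\). Base R's `qexp()` computes the same quantity. The sketch below assumes a rate of 1/10517, taken from the rounded mean printed by `summary(X)`, so it matches the reported percentiles only approximately.

```r
# Assumed rate: 1 / (rounded mean lot area from the summary above)
rate_assumed <- 1 / 10517
probs <- c(0.05, 0.95)

by_formula <- -log(1 - probs) / rate_assumed    # inverse CDF written out by hand
by_qexp    <- qexp(probs, rate = rate_assumed)  # built-in exponential quantile function
```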
empirical_interval <- CI(train$LotArea, ci=0.95)
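Empirical 95% Confidence Interval (on mean Lot Size): the two values printed are the upper bound 11029 and the mean 10516, since CI() returns upper, mean and lower in that order; by symmetry the lower bound is roughly 10004.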
p_05_X <-quantile(train$LotArea, 0.05)
p_95_X <- quantile(train$LotArea, 0.95)
Distribution of Lot Size:
5th Percentile: 3311.7
95th Percentile: 17401.15
dtable<- data.frame(cbind( c('Simulated_Data' , 'Lot_Area'), c(p_05_exp, p_05_X), c(p_95_exp,p_95_X)))
colnames(dtable) <- c('Data Set', 'Fifth Percentile', 'Ninety-Fifth Percentile')
Data Set | Fifth Percentile | Ninety-Fifth Percentile |
---|---|---|
Simulated_Data | 539.44 | 31505.6 |
Lot_Area | 3311.7 | 17401.15 |
The variable Lot Size is definitely skewed, is not bounded closely by zero, and trends to the smaller side, with a sharp decrease in frequency over 10,000 square feet. Because lot size is in many cases limited by the space between streets (with homes back to back) or by rows of streets with alleys in between, the relatively compact range of sizes makes perfect sense for homes within the city proper, where things are divided up in blocks.
It also makes sense that a developer would create as many lots as possible from a contiguous parcel, so that lots tend to converge on a size at or near the city minimum. Because of this, despite the appearance of an exponential distribution, the data definitely does not follow one, nor is the distribution normal. In fact, with the extremes removed it almost approaches a uniform distribution, or a few compact uniform distributions.
This is why the exponential distribution, sampled using the single smoothed lambda estimated from our Lot Size variable, produced more extreme percentiles than the wild data shows.
These values are an artifact of the smoothing that comes from sampling heavily across the entire exponential range as opposed to sampling from the actual values.
The same smoothing effect contributes to the unrealistic 5th percentile value of about 540 square feet, a size much smaller than any home could occupy or the city would allow.
Although the distribution is heavily skewed right, the exponential distribution is clearly not the right choice for simulating it.
A better choice might be to draw samples from a normal distribution centered at zero (selected using weighted probabilities), take their absolute values, and then shift them up by the minimum of the empirical distribution. This would stack the values up in the lower range and taper off to the right more appropriately, with some compensation for the plateaus in the weights.
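A rough sketch of that idea follows. It is a simplification: it leaves out the weighted sampling step, and the spread value is a hypothetical number chosen only for illustration, not something estimated from the data. Only the minimum of 1300 comes from the summary above.

```r
set.seed(605)
lot_min <- 1300   # minimum LotArea reported by summary(X)
spread  <- 6000   # hypothetical standard deviation, for illustration only
n_sim   <- 1000

# Fold a zero-centred normal onto the positive side, then shift by the minimum
folded_sim <- lot_min + abs(rnorm(n_sim, mean = 0, sd = spread))

# Percentiles to set against the empirical 5th and 95th percentiles
quantile(folded_sim, c(0.05, 0.95))
```

By construction no simulated lot falls below the observed minimum, which avoids the unrealistically small lots the exponential fit produced.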
This data set was acquired from the City of Ames, Iowa by Professor Dean De Cock of Truman State University, with the express purpose of becoming a go-to open source regression modeling tool. Having read his publication, ‘Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project’, it became clear to me that this was never meant to become the festival of Ridge-Lasso-ElasticNet-and-Crossvalidation that Kaggle has reduced it to, but was instead intended to be a data set rich with useful predictive indicators suggestive of sale prices, groomed for easy access.
With that in mind I am starting as suggested by the author:
- removing homes with footage greater than 4000 square feet
- targeting the predictions to the squared sale price
- using neighborhood, Gross Living Area + finished basement as the core variables
- scaling data
- using a few discrete ordinal variables
- including a categorical or two
All the Ames Variables:
types <- sapply(train, class)
data_view<- data.frame(types)
types | |
---|---|
Id | integer |
MSSubClass | integer |
MSZoning | factor |
LotFrontage | integer |
LotArea | integer |
Street | factor |
Alley | factor |
LotShape | factor |
LandContour | factor |
Utilities | factor |
LotConfig | factor |
LandSlope | factor |
Neighborhood | factor |
Condition1 | factor |
Condition2 | factor |
BldgType | factor |
HouseStyle | factor |
OverallQual | integer |
OverallCond | integer |
YearBuilt | integer |
YearRemodAdd | integer |
RoofStyle | factor |
RoofMatl | factor |
Exterior1st | factor |
Exterior2nd | factor |
MasVnrType | factor |
MasVnrArea | integer |
ExterQual | factor |
ExterCond | factor |
Foundation | factor |
BsmtQual | factor |
BsmtCond | factor |
BsmtExposure | factor |
BsmtFinType1 | factor |
BsmtFinSF1 | integer |
BsmtFinType2 | factor |
BsmtFinSF2 | integer |
BsmtUnfSF | integer |
TotalBsmtSF | integer |
Heating | factor |
HeatingQC | factor |
CentralAir | factor |
Electrical | factor |
X1stFlrSF | integer |
X2ndFlrSF | integer |
LowQualFinSF | integer |
GrLivArea | integer |
BsmtFullBath | integer |
BsmtHalfBath | integer |
FullBath | integer |
HalfBath | integer |
BedroomAbvGr | integer |
KitchenAbvGr | integer |
KitchenQual | factor |
TotRmsAbvGrd | integer |
Functional | factor |
Fireplaces | integer |
FireplaceQu | factor |
GarageType | factor |
GarageYrBlt | integer |
GarageFinish | factor |
GarageCars | integer |
GarageArea | integer |
GarageQual | factor |
GarageCond | factor |
PavedDrive | factor |
WoodDeckSF | integer |
OpenPorchSF | integer |
EnclosedPorch | integer |
X3SsnPorch | integer |
ScreenPorch | integer |
PoolArea | integer |
PoolQC | factor |
Fence | factor |
MiscFeature | factor |
MiscVal | integer |
MoSold | integer |
YrSold | integer |
SaleType | factor |
SaleCondition | factor |
SalePrice | integer |
I am also trying to keep the number of variables to 10 or fewer, as complexity is the enemy of comprehension and consistency. Picking from the above, I used the correlation matrix below to choose variables that had a high correlation to SalePrice.
LotArea
Neighborhood
YearBuilt
YearRemodAdd
TotalBsmtSF
GarageCars
FullBath
TotRmsAbvGrd
GrLivArea
corrplot(cor(as.matrix(numeric_train),use = "complete.obs"), type="lower", insig = "n", tl.srt = 45,number.cex = .30, order = "hclust",tl.col = 'black',tl.cex=.75)
#sub_train <- train[ ,c('LotArea','GrLivArea','Neighborhood', 'YearBuilt', 'YearRemodAdd', 'TotalBsmtSF', 'GarageCars', 'FullBath', 'TotRmsAbvGrd', 'YrSold','GarageArea', 'SalePrice')]
sapply(train, function(x) sum(is.na(x)))
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 0 259 0
## Street Alley LotShape LandContour Utilities
## 0 1369 0 0 0
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 0 0
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 8 8 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 37 37 38 37 0
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 38 0 0 0 0
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## 0 0 1 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 0 0 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 0 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 0 0 690 81 81
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 81 0 0 81 81
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 1453 1179 1406
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 0 0
## SalePrice
## 0
With no NA values in our chosen fields, I want to convert Neighborhood to a factor (read.csv has already stored it as one, as the type listing shows).
**To Numeric**
Before turning the integer measurements into continuous numerics or factors, I decided to combine two variables into something that might be more meaningful, given that this data was collected over four years. YearRemodAdd on its own loses value as the sale date moves away from it: a 2005 remodel is less impressive in 2010 than it was in 2005. So I create YearsSinceRemod, the difference between the sale year and the remodel year, which should tighten up the value attribution a little bit.
train[,c('LotArea','TotalBsmtSF','GrLivArea', 'GarageArea')]<-lapply(train[,c('LotArea','TotalBsmtSF','GrLivArea', 'GarageArea')], as.numeric)
train['YearsSinceRemod'] <- train['YrSold'] - train['YearRemodAdd']
train<- train[which(train['GrLivArea']<=4000),]
temp<-data.frame(sapply(train, class))
sapply.train..class. | |
---|---|
Id | integer |
MSSubClass | integer |
MSZoning | factor |
LotFrontage | integer |
LotArea | numeric |
Street | factor |
Alley | factor |
LotShape | factor |
LandContour | factor |
Utilities | factor |
LotConfig | factor |
LandSlope | factor |
Neighborhood | factor |
Condition1 | factor |
Condition2 | factor |
BldgType | factor |
HouseStyle | factor |
OverallQual | integer |
OverallCond | integer |
YearBuilt | integer |
YearRemodAdd | integer |
RoofStyle | factor |
RoofMatl | factor |
Exterior1st | factor |
Exterior2nd | factor |
MasVnrType | factor |
MasVnrArea | integer |
ExterQual | factor |
ExterCond | factor |
Foundation | factor |
BsmtQual | factor |
BsmtCond | factor |
BsmtExposure | factor |
BsmtFinType1 | factor |
BsmtFinSF1 | integer |
BsmtFinType2 | factor |
BsmtFinSF2 | integer |
BsmtUnfSF | integer |
TotalBsmtSF | numeric |
Heating | factor |
HeatingQC | factor |
CentralAir | factor |
Electrical | factor |
X1stFlrSF | integer |
X2ndFlrSF | integer |
LowQualFinSF | integer |
GrLivArea | numeric |
BsmtFullBath | integer |
BsmtHalfBath | integer |
FullBath | integer |
HalfBath | integer |
BedroomAbvGr | integer |
KitchenAbvGr | integer |
KitchenQual | factor |
TotRmsAbvGrd | integer |
Functional | factor |
Fireplaces | integer |
FireplaceQu | factor |
GarageType | factor |
GarageYrBlt | integer |
GarageFinish | factor |
GarageCars | integer |
GarageArea | numeric |
GarageQual | factor |
GarageCond | factor |
PavedDrive | factor |
WoodDeckSF | integer |
OpenPorchSF | integer |
EnclosedPorch | integer |
X3SsnPorch | integer |
ScreenPorch | integer |
PoolArea | integer |
PoolQC | factor |
Fence | factor |
MiscFeature | factor |
MiscVal | integer |
MoSold | integer |
YrSold | integer |
SaleType | factor |
SaleCondition | factor |
SalePrice | integer |
YearsSinceRemod | integer |
Since the goal is to make useful predictions that I can also explain easily, I decided to keep it simple: raise the sale price to the second power, add an interaction between the GrLivArea and TotalBsmtSF variables, and square LotArea to help pick up value on the uber-large lots, which might be more valuable.
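For reference, this is how those transformations can be written directly in an `lm()` formula. The sketch uses a small simulated data frame (`toy` and its coefficients are made up for illustration); only the formula syntax is the point.

```r
set.seed(1)
n <- 200
# Simulated stand-in for the housing data; values are arbitrary
toy <- data.frame(
  GrLivArea   = runif(n, 600, 3000),
  TotalBsmtSF = runif(n, 0, 2000),
  LotArea     = runif(n, 2000, 20000)
)
toy$SalePrice <- 50000 + 60 * toy$GrLivArea + 40 * toy$TotalBsmtSF +
  0.5 * toy$LotArea + rnorm(n, sd = 10000)

# I() protects arithmetic inside a formula;
# a * b expands to both main effects plus the a:b interaction
toy_fit <- lm(I(SalePrice^2) ~ GrLivArea * TotalBsmtSF + I(LotArea^2), data = toy)
names(coef(toy_fit))
```

With a squared response, predictions come back on the squared scale and need a square root to return to dollars.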
I will reserve some of the training data to validate on prior to transforming the test data and submitting it.
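set.seed(1)
# sample.split() (from caTools) expects a vector of outcomes, not a data frame
sample <- sample.split(train$SalePrice, SplitRatio = .75)
# Compare with ==; `sample=TRUE` is passed as an unused argument, so subset()
# keeps every row (which is how the model summary below came to use all rows)
train_2 = subset(train, sample == TRUE)
validation = subset(train, sample == FALSE)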
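It is worth confirming that a split really produced two disjoint sets before fitting. Here is a minimal base-R sketch on a toy data frame (all names are hypothetical) that splits 75/25 by row index and checks the result.

```r
set.seed(1)
toy <- data.frame(id = 1:100, y = rnorm(100))

# Draw 75% of the row indices for training; the rest form the validation set
train_idx <- sample(nrow(toy), size = floor(0.75 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_valid <- toy[-train_idx, ]

# The two sets should share no rows and together cover the data
shared_ids <- intersect(toy_train$id, toy_valid$id)
c(train = nrow(toy_train), valid = nrow(toy_valid), shared = length(shared_ids))
```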
fit <- lm(SalePrice ~ GrLivArea * TotalBsmtSF+ LotArea + GarageArea + Neighborhood + YearBuilt + YearsSinceRemod + KitchenQual +WoodDeckSF + GarageCars + ExterQual + Fireplaces + BsmtQual + MasVnrArea ,data= train_2)
summary(fit)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea * TotalBsmtSF + LotArea +
## GarageArea + Neighborhood + YearBuilt + YearsSinceRemod +
## KitchenQual + WoodDeckSF + GarageCars + ExterQual + Fireplaces +
## BsmtQual + MasVnrArea, data = train_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -236625 -14310 710 13892 282211
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -377992.372327 142282.686136 -2.657
## GrLivArea 71.878798 3.201212 22.454
## TotalBsmtSF 57.628181 4.712041 12.230
## LotArea 0.609556 0.096510 6.316
## GarageArea 16.418702 9.073949 1.809
## NeighborhoodBlueste -1109.512065 23497.567838 -0.047
## NeighborhoodBrDale -958.312115 11735.038930 -0.082
## NeighborhoodBrkSide 17583.950101 9913.298649 1.774
## NeighborhoodClearCr 11277.380097 10320.706445 1.093
## NeighborhoodCollgCr 16935.426841 8176.931521 2.071
## NeighborhoodCrawfor 37786.341599 9672.330203 3.907
## NeighborhoodEdwards -1171.472465 9085.007958 -0.129
## NeighborhoodGilbert 10866.643262 8596.286851 1.264
## NeighborhoodIDOTRR 3009.245320 10585.359257 0.284
## NeighborhoodMeadowV -6652.978812 11336.259114 -0.587
## NeighborhoodMitchel 578.385179 9288.420842 0.062
## NeighborhoodNAmes 6741.502866 8694.219399 0.775
## NeighborhoodNoRidge 75695.510049 9462.053591 8.000
## NeighborhoodNPkVill 6919.906488 13175.760595 0.525
## NeighborhoodNridgHt 32654.885648 8825.311821 3.700
## NeighborhoodNWAmes 5523.006891 8909.184759 0.620
## NeighborhoodOldTown -305.338087 9650.436864 -0.032
## NeighborhoodSawyer 5620.883903 9131.484560 0.616
## NeighborhoodSawyerW 14027.805029 8865.825699 1.582
## NeighborhoodSomerst 24053.968481 8515.308006 2.825
## NeighborhoodStoneBr 63115.035219 9960.659785 6.336
## NeighborhoodSWISU -384.255014 11129.971822 -0.035
## NeighborhoodTimber 14956.938404 9395.312184 1.592
## NeighborhoodVeenker 38961.777696 12198.305181 3.194
## YearBuilt 246.820249 70.697294 3.491
## YearsSinceRemod -300.245411 60.659574 -4.950
## KitchenQualFa -42573.026487 7531.793003 -5.652
## KitchenQualGd -34238.042327 4309.920063 -7.944
## KitchenQualTA -40596.605374 4843.115388 -8.382
## WoodDeckSF 23.534965 7.151939 3.291
## GarageCars 4735.900002 2711.904799 1.746
## ExterQualFa -46709.359725 11584.465977 -4.032
## ExterQualGd -32065.276924 5946.729001 -5.392
## ExterQualTA -35524.482013 6503.823563 -5.462
## Fireplaces 8907.003576 1590.723691 5.599
## BsmtQualFa -38510.801345 7542.480578 -5.106
## BsmtQualGd -37107.519633 4071.244774 -9.115
## BsmtQualTA -36703.494372 4929.067806 -7.446
## MasVnrArea 12.533570 5.794510 2.163
## GrLivArea:TotalBsmtSF -0.019667 0.001656 -11.874
## Pr(>|t|)
## (Intercept) 0.007984 **
## GrLivArea < 0.0000000000000002 ***
## TotalBsmtSF < 0.0000000000000002 ***
## LotArea 0.00000000036194846 ***
## GarageArea 0.070603 .
## NeighborhoodBlueste 0.962346
## NeighborhoodBrDale 0.934927
## NeighborhoodBrkSide 0.076323 .
## NeighborhoodClearCr 0.274720
## NeighborhoodCollgCr 0.038534 *
## NeighborhoodCrawfor 0.00009815094442356 ***
## NeighborhoodEdwards 0.897420
## NeighborhoodGilbert 0.206406
## NeighborhoodIDOTRR 0.776236
## NeighborhoodMeadowV 0.557384
## NeighborhoodMitchel 0.950357
## NeighborhoodNAmes 0.438237
## NeighborhoodNoRidge 0.00000000000000263 ***
## NeighborhoodNPkVill 0.599529
## NeighborhoodNridgHt 0.000224 ***
## NeighborhoodNWAmes 0.535412
## NeighborhoodOldTown 0.974764
## NeighborhoodSawyer 0.538294
## NeighborhoodSawyerW 0.113827
## NeighborhoodSomerst 0.004800 **
## NeighborhoodStoneBr 0.00000000031831850 ***
## NeighborhoodSWISU 0.972464
## NeighborhoodTimber 0.111625
## NeighborhoodVeenker 0.001435 **
## YearBuilt 0.000496 ***
## YearsSinceRemod 0.00000083559708956 ***
## KitchenQualFa 0.00000001922066606 ***
## KitchenQualGd 0.00000000000000405 ***
## KitchenQualTA < 0.0000000000000002 ***
## WoodDeckSF 0.001025 **
## GarageCars 0.080977 .
## ExterQualFa 0.00005833417842443 ***
## ExterQualGd 0.00000008193012827 ***
## ExterQualTA 0.00000005581621460 ***
## Fireplaces 0.00000002596352784 ***
## BsmtQualFa 0.00000037568023232 ***
## BsmtQualGd < 0.0000000000000002 ***
## BsmtQualTA 0.00000000000016902 ***
## MasVnrArea 0.030713 *
## GrLivArea:TotalBsmtSF < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31050 on 1370 degrees of freedom
## (45 observations deleted due to missingness)
## Multiple R-squared: 0.8512, Adjusted R-squared: 0.8464
## F-statistic: 178.2 on 44 and 1370 DF, p-value: < 0.00000000000000022
We started with just a few variables (from GrLivArea to YearsSinceRemod), but the performance was too low, down around 62% according to the \(R^2\). So I added and removed variables until nearly all the p-values showed significance (preferably extreme significance, with ***). In the case of Neighborhood, only some neighborhoods were significant, but dropping it cost about 3% overall, so I decided to keep it in and add more quantitative variables instead of dropping it. This model is a bit larger than I had hoped initially, and it is not fit against the squared home price as suggested by the author; however, it gives a solid performance. So let's look at some plots and see if it is in fact doing as well as I hoped it would.
plot(fitted(fit), resid(fit))
qqnorm(resid(fit))
qqline(resid(fit))
From the charts above, there is clearly a lack of adherence at the upper end of home prices, and the residuals against fitted values are less consistent at the high end as well. But they seem to be solid in the most common price range, which leads me to believe there might be two mechanisms at work here. I will settle in with this particular model, as adding new variables is not improving the base performance enough to justify the added complexity, which is already pretty high.
preds<- predict(fit, validation)
difference <- preds - validation$SalePrice
plot(difference)
From this plot you can see a little bit of wildness, but the core of the predictions is in a tight spot. At this point I think it is time to run the test set, submit to Kaggle and hope for the best!
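To put a number on the validation plot, one option is the metric the Kaggle leaderboard uses for this competition: the root mean squared error between the log of the predicted price and the log of the observed price. The helper `rmsle()` below is a hypothetical name written for this sketch, and the prices are toy values.

```r
# Root mean squared error on the log scale (requires positive prices)
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

actual_toy    <- c(120000, 185000, 250000)
predicted_toy <- actual_toy * exp(0.2)   # every prediction off by 0.2 in log terms

rmsle(predicted_toy, actual_toy)         # 0.2, up to floating-point error
```

Applied to this report it would be called as `rmsle(preds, validation$SalePrice)`, after dropping any rows where `predict()` returned `NA`. Because the error is measured in logs, it penalizes relative rather than absolute misses.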
test <- read.csv('test.csv')
test[,c('LotArea','TotalBsmtSF','GrLivArea', 'GarageArea')]<-lapply(test[,c('LotArea','TotalBsmtSF','GrLivArea', 'GarageArea')], as.numeric)
test['YearsSinceRemod'] <- test['YrSold'] - test['YearRemodAdd']
preds_2<- predict(fit, test)
kaggle<- data.frame(test$Id, preds_2)
colnames(kaggle)<- c('Id', 'SalePrice')
write.csv(kaggle, 'submission.csv', row.names = FALSE)
Kaggle Results: Number 519 with a score of .20458, I have NO idea what that means…I am guessing that means I still have work to do!
Thanks for everything, Doc!
Bethany