I chose \(X_1\) as my independent variable and \(Y_1\) as my dependent variable. Reading the dataset into R:
FinalData<-read.csv("605 Final Data Set P1.csv")
FinalData
## Y1 X1
## 1 20.3 9.3
## 2 19.1 4.1
## 3 19.3 22.4
## 4 20.9 9.1
## 5 22.0 15.8
## 6 23.5 7.1
## 7 13.8 15.9
## 8 18.8 6.9
## 9 20.9 16.0
## 10 18.6 6.7
## 11 22.3 8.2
## 12 17.6 16.0
## 13 20.8 6.4
## 14 28.7 11.8
## 15 15.2 3.5
## 16 20.9 21.7
## 17 18.4 12.2
## 18 10.3 9.3
## 19 26.3 8.0
## 20 28.1 6.2
summary(FinalData)
## Y1 X1
## Min. :10.30 Min. : 3.50
## 1st Qu.:18.55 1st Qu.: 6.85
## Median :20.55 Median : 9.20
## Mean :20.29 Mean :10.83
## 3rd Qu.:22.07 3rd Qu.:15.82
## Max. :28.70 Max. :22.40
We have \(x=15.82\), the third quartile of \(X_1\), and \(y=18.55\), the first quartile of \(Y_1\).
nrow(FinalData)
## [1] 20
a<-subset(FinalData,FinalData$X1>15.82)
nrow(a)
## [1] 5
b<-subset(FinalData,FinalData$Y1>18.55)
nrow(b)
## [1] 15
c<-subset(FinalData,FinalData$X1<15.82)
nrow(c)
## [1] 15
d<-subset(FinalData,FinalData$Y1<18.55)
Given this, \(\text{P}(Y>y) = \frac{15}{20} =\frac{3}{4}\). Also, when evaluated independently, \(\text{P}(X>x) =\frac{5}{20}=\frac{1}{4}\), both of which are to be expected.
a.) \(\text{P}(X>x|Y>y)\).
subset(b,b$X1>15.82)
## Y1 X1
## 3 19.3 22.4
## 9 20.9 16.0
## 16 20.9 21.7
So, \(\text{P}(X>x|Y>y)=\frac{3}{15}=\frac{1}{5}\)
b.) \(\text{P}(X>x,Y>y)\). This one is easiest done using the information above. There are 3 cases that fit these criteria, so the probability is \(\frac{3}{20}\).
c.) \(\text{P}(X<x|Y>y)=\frac{12}{15}=\frac{4}{5}\)
nrow(subset(b,b$X1<15.82))
## [1] 12
subset(c,c$Y1<=18.55)
## Y1 X1
## 15 15.2 3.5
## 17 18.4 12.2
## 18 10.3 9.3
subset(c,c$Y1>18.55)
## Y1 X1
## 1 20.3 9.3
## 2 19.1 4.1
## 4 20.9 9.1
## 5 22.0 15.8
## 6 23.5 7.1
## 8 18.8 6.9
## 10 18.6 6.7
## 11 22.3 8.2
## 13 20.8 6.4
## 14 28.7 11.8
## 19 26.3 8.0
## 20 28.1 6.2
subset(a,a$Y1<=18.55)
## Y1 X1
## 7 13.8 15.9
## 12 17.6 16.0
subset(a,a$Y1>18.55)
## Y1 X1
## 3 19.3 22.4
## 9 20.9 16.0
## 16 20.9 21.7
| \(y\) (rows) / \(x\) (columns) | \(x \leq 3^{\text{rd}}\text{ quartile}\) | \(x > 3^{\text{rd}}\text{ quartile}\) | Total |
|---|---|---|---|
| \(y \leq 1^{\text{st}}\text{ quartile}\) | 3 | 2 | 5 |
| \(y > 1^{\text{st}}\text{ quartile}\) | 12 | 3 | 15 |
| Total | 15 | 5 | 20 |
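The counts in this table, and the probabilities in parts a.) to c.), can also be computed directly from logical comparisons rather than by counting subset rows. The following is a minimal sketch that re-enters the 20 observations from the listing above; the names `dat`, `x`, `y`, `tab`, `p_a`, `p_b` and `p_c` are illustrative and not part of the original analysis.

```r
# The 20 observations from the data set, re-entered by hand.
dat <- data.frame(
  Y1 = c(20.3, 19.1, 19.3, 20.9, 22.0, 23.5, 13.8, 18.8, 20.9, 18.6,
         22.3, 17.6, 20.8, 28.7, 15.2, 20.9, 18.4, 10.3, 26.3, 28.1),
  X1 = c(9.3, 4.1, 22.4, 9.1, 15.8, 7.1, 15.9, 6.9, 16.0, 6.7,
         8.2, 16.0, 6.4, 11.8, 3.5, 21.7, 12.2, 9.3, 8.0, 6.2)
)
x <- 15.82  # third quartile of X1, as printed by summary()
y <- 18.55  # first quartile of Y1, as printed by summary()

# 2x2 table of counts: rows split on Y1 > y, columns split on X1 > x.
tab <- table(Y_above = dat$Y1 > y, X_above = dat$X1 > x)
addmargins(tab)

# The mean of a logical vector is the proportion of TRUE values.
p_a <- mean(dat$X1[dat$Y1 > y] > x)    # P(X > x | Y > y)
p_b <- mean(dat$X1 > x & dat$Y1 > y)   # P(X > x, Y > y)
p_c <- mean(dat$X1[dat$Y1 > y] < x)    # P(X < x | Y > y)
c(a = p_a, b = p_b, c = p_c)
```

Because no observation falls exactly on either cutoff, strict and non-strict comparisons give the same counts here.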
Does splitting the data this way make them independent?
A<-subset(FinalData,FinalData$X1>6.85)
B<-subset(FinalData,FinalData$Y1>18.55)
nrow(A)
## [1] 15
nrow(B)
## [1] 15
AB<-subset(A,A$Y1>18.55)
nrow(AB)
## [1] 11
\(\text{P}(A)=\frac{15}{20}=\frac{3}{4}\) and \(\text{P}(B)=\frac{15}{20}=\frac{3}{4}\), so \(\text{P}(A)\text{P}(B)=\frac{9}{16}=0.5625\), while \(\text{P}(AB)=\frac{11}{20}=0.55\). Hence \(\text{P}(A)\text{P}(B)\neq\text{P}(AB)\).
They are not independent when split this way.
ABTest<-data.frame(c(11,4),c(4,5))
chisq.test(ABTest)
## Warning in chisq.test(ABTest): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: ABTest
## X-squared = 0.96, df = 1, p-value = 0.3272
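The p-value is 0.33, so we fail to reject the null hypothesis that the two are independent. The difference between \(\text{P}(A)\text{P}(B)\) and \(\text{P}(AB)\) is therefore not statistically significant. The warning appears because some expected cell counts are below 5, so the chi-squared approximation should be read with caution.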
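When a 2x2 table is entered by hand, it is worth checking that the four cells add up to the sample size. The sketch below derives the cells from the counts already computed above (20 rows, 15 in A, 15 in B, 11 in both) and inspects the expected counts that trigger the warning. The names `cells` and `expected` are illustrative. `fisher.test` is shown only as a commonly used exact alternative when expected counts are small; it is not part of the original analysis.

```r
n   <- 20  # total observations
nA  <- 15  # rows with X1 > 6.85
nB  <- 15  # rows with Y1 > 18.55
nAB <- 11  # rows satisfying both

# Rows: A yes / no. Columns: B yes / no.
cells <- matrix(
  c(nAB,      nA - nAB,
    nB - nAB, n - nA - nB + nAB),
  nrow = 2, byrow = TRUE,
  dimnames = list(A = c("yes", "no"), B = c("yes", "no"))
)
cells
sum(cells)  # should equal n

# Expected counts under independence: (row total * column total) / n.
expected <- outer(rowSums(cells), colSums(cells)) / n
expected

# Exact test, which does not rely on the chi-squared approximation.
fisher.test(cells)$p.value
```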
Part 2
train<-read.csv("train.csv")
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 C (all): 10 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 FV : 65 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 RH : 16 Median : 69.00
## Mean : 730.5 Mean : 56.9 RL :1151 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 RM : 218 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape LandContour
## Min. : 1300 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63
## 1st Qu.: 7554 Pave:1454 Pave: 41 IR2: 41 HLS: 50
## Median : 9478 NA's:1369 IR3: 10 Low: 36
## Mean : 10517 Reg:925 Lvl:1311
## 3rd Qu.: 11602
## Max. :215245
##
## Utilities LotConfig LandSlope Neighborhood Condition1
## AllPub:1459 Corner : 263 Gtl:1382 NAmes :225 Norm :1260
## NoSeWa: 1 CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48
## FR3 : 4 Edwards:100 RRAn : 26
## Inside :1052 Somerst: 86 PosN : 19
## Gilbert: 79 RRAe : 11
## (Other):707 (Other): 15
## Condition2 BldgType HouseStyle OverallQual
## Norm :1445 1Fam :1220 1Story :726 Min. : 1.000
## Feedr : 6 2fmCon: 31 2Story :445 1st Qu.: 5.000
## Artery : 2 Duplex: 52 1.5Fin :154 Median : 6.000
## PosN : 2 Twnhs : 43 SLvl : 65 Mean : 6.099
## RRNn : 2 TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000
## PosA : 1 1.5Unf : 14 Max. :10.000
## (Other): 2 (Other): 19
## OverallCond YearBuilt YearRemodAdd RoofStyle
## Min. :1.000 Min. :1872 Min. :1950 Flat : 13
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 Gable :1141
## Median :5.000 Median :1973 Median :1994 Gambrel: 11
## Mean :5.575 Mean :1971 Mean :1985 Hip : 286
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 Mansard: 7
## Max. :9.000 Max. :2010 Max. :2010 Shed : 2
##
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea
## CompShg:1434 VinylSd:515 VinylSd:504 BrkCmn : 15 Min. : 0.0
## Tar&Grv: 11 HdBoard:222 MetalSd:214 BrkFace:445 1st Qu.: 0.0
## WdShngl: 6 MetalSd:220 HdBoard:207 None :864 Median : 0.0
## WdShake: 5 Wd Sdng:206 Wd Sdng:197 Stone :128 Mean : 103.7
## ClyTile: 1 Plywood:108 Plywood:142 NA's : 8 3rd Qu.: 166.0
## Membran: 1 CemntBd: 61 CmentBd: 60 Max. :1600.0
## (Other): 2 (Other):128 (Other):136 NA's :8
## ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## Ex: 52 Ex: 3 BrkTil:146 Ex :121 Fa : 45 Av :221
## Fa: 14 Fa: 28 CBlock:634 Fa : 35 Gd : 65 Gd :134
## Gd:488 Gd: 146 PConc :647 Gd :618 Po : 2 Mn :114
## TA:906 Po: 1 Slab : 24 TA :649 TA :1311 No :953
## TA:1282 Stone : 6 NA's: 37 NA's: 37 NA's: 38
## Wood : 3
##
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## ALQ :220 Min. : 0.0 ALQ : 19 Min. : 0.00
## BLQ :148 1st Qu.: 0.0 BLQ : 33 1st Qu.: 0.00
## GLQ :418 Median : 383.5 GLQ : 14 Median : 0.00
## LwQ : 74 Mean : 443.6 LwQ : 46 Mean : 46.55
## Rec :133 3rd Qu.: 712.2 Rec : 54 3rd Qu.: 0.00
## Unf :430 Max. :5644.0 Unf :1256 Max. :1474.00
## NA's: 37 NA's: 38
## BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## Min. : 0.0 Min. : 0.0 Floor: 1 Ex:741 N: 95
## 1st Qu.: 223.0 1st Qu.: 795.8 GasA :1428 Fa: 49 Y:1365
## Median : 477.5 Median : 991.5 GasW : 18 Gd:241
## Mean : 567.2 Mean :1057.4 Grav : 7 Po: 1
## 3rd Qu.: 808.0 3rd Qu.:1298.2 OthW : 2 TA:428
## Max. :2336.0 Max. :6110.0 Wall : 4
##
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## FuseA: 94 Min. : 334 Min. : 0 Min. : 0.000
## FuseF: 27 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## FuseP: 3 Median :1087 Median : 0 Median : 0.000
## Mix : 1 Mean :1163 Mean : 347 Mean : 5.845
## SBrkr:1334 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## NA's : 1 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## Min. :0.0000 Min. :0.000 Min. :0.000 Ex:100
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Fa: 39
## Median :0.0000 Median :3.000 Median :1.000 Gd:586
## Mean :0.3829 Mean :2.866 Mean :1.047 TA:735
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :2.0000 Max. :8.000 Max. :3.000
##
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType
## Min. : 2.000 Maj1: 14 Min. :0.000 Ex : 24 2Types : 6
## 1st Qu.: 5.000 Maj2: 5 1st Qu.:0.000 Fa : 33 Attchd :870
## Median : 6.000 Min1: 31 Median :1.000 Gd :380 Basment: 19
## Mean : 6.518 Min2: 34 Mean :0.613 Po : 20 BuiltIn: 88
## 3rd Qu.: 7.000 Mod : 15 3rd Qu.:1.000 TA :313 CarPort: 9
## Max. :14.000 Sev : 1 Max. :3.000 NA's:690 Detchd :387
## Typ :1360 NA's : 81
## GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## Min. :1900 Fin :352 Min. :0.000 Min. : 0.0 Ex : 3
## 1st Qu.:1961 RFn :422 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48
## Median :1980 Unf :605 Median :2.000 Median : 480.0 Gd : 14
## Mean :1979 NA's: 81 Mean :1.767 Mean : 473.0 Po : 3
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 TA :1311
## Max. :2010 Max. :4.000 Max. :1418.0 NA's: 81
## NA's :81
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch
## Ex : 2 N: 90 Min. : 0.00 Min. : 0.00 Min. : 0.00
## Fa : 35 P: 30 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Gd : 9 Y:1340 Median : 0.00 Median : 25.00 Median : 0.00
## Po : 7 Mean : 94.24 Mean : 46.66 Mean : 21.95
## TA :1326 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00
## NA's: 81 Max. :857.00 Max. :547.00 Max. :552.00
##
## X3SsnPorch ScreenPorch PoolArea PoolQC
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Ex : 2
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2
## Median : 0.00 Median : 0.00 Median : 0.000 Gd : 3
## Mean : 3.41 Mean : 15.06 Mean : 2.759 NA's:1453
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :508.00 Max. :480.00 Max. :738.000
##
## Fence MiscFeature MiscVal MoSold
## GdPrv: 59 Gar2: 2 Min. : 0.00 Min. : 1.000
## GdWo : 54 Othr: 2 1st Qu.: 0.00 1st Qu.: 5.000
## MnPrv: 157 Shed: 49 Median : 0.00 Median : 6.000
## MnWw : 11 TenC: 1 Mean : 43.49 Mean : 6.322
## NA's :1179 NA's:1406 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :15500.00 Max. :12.000
##
## YrSold SaleType SaleCondition SalePrice
## Min. :2006 WD :1267 Abnorml: 101 Min. : 34900
## 1st Qu.:2007 New : 122 AdjLand: 4 1st Qu.:129975
## Median :2008 COD : 43 Alloca : 12 Median :163000
## Mean :2008 ConLD : 9 Family : 20 Mean :180921
## 3rd Qu.:2009 ConLI : 5 Normal :1198 3rd Qu.:214000
## Max. :2010 ConLw : 5 Partial: 125 Max. :755000
## (Other): 9
I chose to create histograms for four of the variables. The spikes in Garage Area are interesting. Although the data is a bit noisy when it is plotted against the number of cars, I suspect that these spikes are the standard one-, two-, and three-car garage sizes. Above Ground Living Area (GrLivArea) is the closest to a skewed bell curve. The others either have spikes (GarageArea) or have a huge tail at 0 (YearRemodAdd). I have included the plot of garage area against the number of cars in the garage because the second depends on the first (you have to have a minimum square footage per car). GarageCars is not the dependent variable for the entire data set, but it is interesting that the data set contains more than one dependent variable.
#removing a few of the outliers that are either huge or godawful expensive. When I clicked on the link on Kaggle, I found the suggestion to use 4000 ft^2 as the limiting factor in the training data, especially since it only removes 4 outliers
train2<-subset(train,train$GrLivArea<4000)
train3<-train2[,c(21,47,63,81)]
summary(train3) #train3 has the year remodeled, which has a nice, but not great, correlation with SalePrice. I intend to try it out in my model
## YearRemodAdd GrLivArea GarageArea SalePrice
## Min. :1950 Min. : 334 Min. : 0.0 Min. : 34900
## 1st Qu.:1967 1st Qu.:1128 1st Qu.: 329.5 1st Qu.:129900
## Median :1994 Median :1458 Median : 478.5 Median :163000
## Mean :1985 Mean :1507 Mean : 471.6 Mean :180151
## 3rd Qu.:2004 3rd Qu.:1775 3rd Qu.: 576.0 3rd Qu.:214000
## Max. :2010 Max. :3627 Max. :1390.0 Max. :625000
train4<-train3[,c(2:4)]
library(ggplot2) # provides ggplot and geom_histogram
ggplot(data=train3,aes(x=YearRemodAdd))+geom_histogram(binwidth = 1)
ggplot(data=train3,aes(x=GrLivArea))+geom_histogram(bins = 100)
ggplot(data=train3,aes(x=GarageArea))+geom_histogram(bins = 100)
ggplot(data=train3,aes(x=SalePrice))+geom_histogram(bins = 100)
plot(train2$GarageArea,train2$GarageCars)
SaleCor<-cor(train4)
SaleCor
## GrLivArea GarageArea SalePrice
## GrLivArea 1.0000000 0.4545117 0.7205163
## GarageArea 0.4545117 1.0000000 0.6369636
## SalePrice 0.7205163 0.6369636 1.0000000
plot(train4$GrLivArea,train4$SalePrice)
plot(train4$GarageArea,train4$SalePrice)
Checking pairwise correlations:
cor.test(train4$GrLivArea,train4$SalePrice,conf.level = .8)
##
## Pearson's product-moment correlation
##
## data: train4$GrLivArea and train4$SalePrice
## t = 39.62, df = 1454, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7039548 0.7362947
## sample estimates:
## cor
## 0.7205163
cor.test(train4$GarageArea,train4$SalePrice,conf.level = .8)
##
## Pearson's product-moment correlation
##
## data: train4$GarageArea and train4$SalePrice
## t = 31.507, df = 1454, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6165544 0.6565173
## sample estimates:
## cor
## 0.6369636
cor.test(train4$GarageArea,train4$GrLivArea,conf.level = .8)
##
## Pearson's product-moment correlation
##
## data: train4$GarageArea and train4$GrLivArea
## t = 19.457, df = 1454, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4274330 0.4807755
## sample estimates:
## cor
## 0.4545117
The correlation between any two of these variables is non-zero. The p-values are low enough that I would not worry about familywise error. Even using a Bonferroni or similar correction, the p-values would still be less than \(1\times 10^{-14}\), which is good enough for just about anything.
The correlations indicate that the given variables are related to each other, moderately for GarageArea and GrLivArea (0.45) and strongly for each of them with SalePrice (0.64 and 0.72), especially given the absence of any familywise error risk.
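To make the Bonferroni remark concrete: `cor.test` reports each p-value only as an upper bound (`< 2.2e-16`), so the sketch below adjusts that bound for the three pairwise tests with base R's `p.adjust`. Treating the bound as if it were the p-value is an assumption made for illustration; the true p-values are smaller.

```r
# Upper bound on each of the three pairwise p-values, as printed by cor.test.
p_bound <- rep(2.2e-16, 3)
names(p_bound) <- c("GrLivArea-SalePrice",
                    "GarageArea-SalePrice",
                    "GarageArea-GrLivArea")

# Bonferroni multiplies each p-value by the number of tests (capped at 1).
p_bonf <- p.adjust(p_bound, method = "bonferroni")
p_bonf

# Still far below the 1e-14 threshold mentioned in the text.
all(p_bonf < 1e-14)
```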
SalePresc<-solve(SaleCor)
SalePresc
## GrLivArea GarageArea SalePrice
## GrLivArea 2.07976643 0.01550685 -1.508383
## GarageArea 0.01550685 1.68283152 -1.083075
## SalePrice -1.50838292 -1.08307535 2.776694
SalePresc%*%SaleCor
## GrLivArea GarageArea SalePrice
## GrLivArea 1 1.110223e-16 0
## GarageArea 0 1.000000e+00 0
## SalePrice 0 0.000000e+00 1
SaleCor%*%SalePresc
## GrLivArea GarageArea SalePrice
## GrLivArea 1.000000e+00 0 0
## GarageArea 1.110223e-16 1 0
## SalePrice 0.000000e+00 0 1
We have the inverse matrix, which multiplies out to the identity matrix (up to floating-point error, hence the 1.110223e-16 entry), as is expected. To do LU-Decomposition, I use the function I wrote for HW2:
LUme <- function(A){ # A is the original square matrix; "decompose" already exists in R, hence the function name
  size <- nrow(A)
  lower <- diag(size) # starts as the identity; the multipliers are filled in below the diagonal
  usethis <- size-1
  ## This should be doable via lapply, but what needs to be done to the lower matrix is so different from the upper
  ## that I ended up using two nested loops, so that this can handle non-singular matrices of arbitrary size.
  ## No row exchanges (pivoting) are performed, so every pivot A[i,i] must be non-zero.
  for (i in seq_len(usethis)){ # i is the pivot row
    for (j in i:size){
      if(j<size){
        container<-rUeL(A,lower,i,j+1) # zero out column i of row j+1
        A<-container[[1]]
        lower<-container[[2]]
      }
    }
  }
  answerlist<-list(lower,A,lower%*%A)
  names(answerlist)<-c("Lower Matrix","Upper Matrix","Multiplied")
  return(answerlist)
}
rUeL <- function(m1,m2,r1,r2){ # m1 is the upper matrix, m2 is the lower, r1 is the pivot row above r2
  coef<-(m1[r2,r1]/m1[r1,r1]) # multiplier that zeroes out element [r2,r1] when a multiple of r1 is subtracted from r2
  m1[r2,]<-m1[r2,]-(m1[r1,]*m1[r2,r1]/m1[r1,r1]) # row operation on the upper matrix
  m2[r2,r1]<-coef # record the multiplier in the lower matrix
  return(list(m1,m2))
}
Using said function on both the correlation matrix and the precision matrix, we get:
LUme(SaleCor)
## $`Lower Matrix`
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.4545117 1.0000000 0
## [3,] 0.7205163 0.3900593 1
##
## $`Upper Matrix`
## GrLivArea GarageArea SalePrice
## GrLivArea 1 0.4545117 0.7205163
## GarageArea 0 0.7934191 0.3094805
## SalePrice 0 0.0000000 0.3601405
##
## $Multiplied
## GrLivArea GarageArea SalePrice
## [1,] 1.0000000 0.4545117 0.7205163
## [2,] 0.4545117 1.0000000 0.6369636
## [3,] 0.7205163 0.6369636 1.0000000
LUme(SalePresc)
## $`Lower Matrix`
## [,1] [,2] [,3]
## [1,] 1.000000000 0.0000000 0
## [2,] 0.007456056 1.0000000 0
## [3,] -0.725265537 -0.6369636 1
##
## $`Upper Matrix`
## GrLivArea GarageArea SalePrice
## GrLivArea 2.079766e+00 0.01550685 -1.508383
## GarageArea 1.734723e-18 1.68271590 -1.071829
## SalePrice 1.104956e-18 0.00000000 1.000000
##
## $Multiplied
## GrLivArea GarageArea SalePrice
## [1,] 2.07976643 0.01550685 -1.508383
## [2,] 0.01550685 1.68283152 -1.083075
## [3,] -1.50838292 -1.08307535 2.776694
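As an independent check on the factors above, here is a compact Doolittle-style elimination without pivoting, applied to the correlation matrix as printed earlier (rounded to seven decimals). The function name `lu_nopivot` is hypothetical. It is a sketch of the same idea as `LUme`, not a replacement for it, and it assumes every pivot is non-zero.

```r
# Doolittle LU without row exchanges: M = L %*% U, with a unit diagonal on L.
lu_nopivot <- function(M) {
  n <- nrow(M)
  L <- diag(n)
  U <- M
  for (k in seq_len(n - 1)) {      # pivot column
    for (i in (k + 1):n) {         # rows below the pivot
      L[i, k] <- U[i, k] / U[k, k] # elimination multiplier
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]
    }
  }
  list(L = L, U = U)
}

# Correlation matrix as printed above (values rounded in the output).
SaleCorPrinted <- matrix(
  c(1.0000000, 0.4545117, 0.7205163,
    0.4545117, 1.0000000, 0.6369636,
    0.7205163, 0.6369636, 1.0000000),
  nrow = 3, byrow = TRUE
)

f <- lu_nopivot(SaleCorPrinted)
max(abs(f$L %*% f$U - SaleCorPrinted))  # reconstruction error

# Since L has a unit diagonal, det(M) is the product of the pivots of U.
c(prod(diag(f$U)), det(SaleCorPrinted))
```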
I did this using BsmtUnfSF, both with and without the 0s. The 0s do not cause any issues for fitdistr, and the same value is received either way (both fits below are run on train2$BsmtUnfSF), but dropping them makes the histogram neater.
hist(train2$BsmtUnfSF)
#this looks great, but when I shrink the bin size, we get this
hist(train2$BsmtUnfSF,breaks=100,freq = FALSE)
library(MASS) # provides fitdistr
Unfinished<-fitdistr(train2$BsmtUnfSF,"exponential")
Unfinished
## rate
## 1.763698e-03
## (4.622146e-05)
curve(dexp(x,1.763698e-03),add=TRUE)
set.seed(2001)
fitsamp<-rexp(1000,1.763698e-03)
hist(fitsamp,breaks=100,freq=FALSE)
trainBase<-subset(train2,train2$BsmtUnfSF>0)
hist(trainBase$BsmtUnfSF,breaks=100,freq = FALSE)
AllUnfinished<-fitdistr(train2$BsmtUnfSF,"exponential")
AllUnfinished
## rate
## 1.763698e-03
## (4.622146e-05)
curve(dexp(x,1.763698e-03),add=TRUE)
The cumulative distribution function for an exponential distribution is \(F(x,\lambda)=1-e^{-\lambda x}\). We want \(F=.05\) and \(F=.95\), using the fitted rate \(\lambda=.001763698\). For the first value:
\[ 1-e^{-\lambda x}=.05\\ e^{-\lambda x}=.95\\ x=-\frac{\ln(.95)}{\lambda}\\ x=-\frac{-0.05129329}{.001763698}\approx29.08 \]
For the second value:
\[ 1-e^{-\lambda x}=.95\\ e^{-\lambda x}=.05\\ x=-\frac{\ln(.05)}{\lambda}\\ x=-\frac{-2.995732}{.001763698}\approx1698.55 \]
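The same percentiles can be obtained from base R's exponential quantile function `qexp`, which inverts the CDF. This is a minimal sketch, assuming the rate printed by `fitdistr` above; `rate_hat`, `q_closed` and `q_builtin` are illustrative names.

```r
rate_hat <- 1.763698e-03  # rate printed by fitdistr above

# Closed form from inverting F(x) = 1 - exp(-lambda * x): x = -ln(1 - p) / lambda
q_closed <- -log(1 - c(0.05, 0.95)) / rate_hat

# Built-in quantile function for the exponential distribution.
q_builtin <- qexp(c(0.05, 0.95), rate = rate_hat)

rbind(q_closed, q_builtin)
```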
Finding the 95% confidence interval for the mean of the non-zero values, using the std.error function from plotrix. A two-sided 95% interval uses the 0.975 normal quantile:
library(plotrix) # provides std.error
qnorm(.975)
## [1] 1.959964
std.error(trainBase$BsmtUnfSF)
## [1] 11.6604
mean(trainBase$BsmtUnfSF)
## [1] 616.994
So the 95% confidence interval is \(617\pm(1.960)(11.66)=617\pm 22.85\)
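The interval can also be computed without `plotrix`, since the standard error of the mean is `sd(x) / sqrt(n)`. Below is a sketch with a hypothetical helper, `mean_ci`. It relies on a normal approximation, an assumption that is more reasonable for large samples. The last line plugs in the mean and standard error reported above.

```r
# Normal-approximation confidence interval for a mean (hypothetical helper).
mean_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]
  se <- sd(x) / sqrt(length(x))    # standard error of the mean
  z <- qnorm(1 - (1 - level) / 2)  # 0.975 quantile when level = 0.95
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# Small illustrative vector, not taken from the data set.
ci_demo <- mean_ci(c(2, 4, 6, 8))
ci_demo

# From the summary statistics printed above (mean 616.994, standard error 11.6604).
ci_summary <- 616.994 + c(-1, 1) * qnorm(0.975) * 11.6604
ci_summary
```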
quantile(trainBase$BsmtUnfSF,c(.05,.95))
## 5% 95%
## 100.00 1489.75
quantile(fitsamp,c(.05,.95))
## 5% 95%
## 28.42716 1649.27524
mean(fitsamp)
## [1] 557.9193
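The theoretical and simulated quantiles agree closely: about 29 and 28 at the 5% level, and about 1700 and 1650 at the 95% level. The empirical quantiles of the non-zero data (100.00 and 1489.75) are further off, mostly at the lower end, although the histograms have a similar overall shape. The simulated mean (557.9) is not within the confidence interval. However, the confidence interval was built from the non-zero data under a normality assumption, while the sample was drawn from the fitted exponential, so this does not detract from the model.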
trainumbs<-unlist(lapply(train,is.numeric))
trainnumonly<-train2[,trainumbs]
cor(trainnumonly)
## Id MSSubClass LotFrontage LotArea
## Id 1.0000000000 0.011076005 NA -0.038040539
## MSSubClass 0.0110760048 1.000000000 NA -0.142191843
## LotFrontage NA NA 1 NA
## LotArea -0.0380405391 -0.142191843 NA 1.000000000
## OverallQual -0.0323233071 0.032415616 NA 0.088718768
## OverallCond 0.0133374790 -0.059276572 NA -0.002832285
## YearBuilt -0.0140335892 0.027689352 NA 0.006590226
## YearRemodAdd -0.0230755086 0.040458748 NA 0.006930318
## MasVnrArea NA NA NA NA
## BsmtFinSF1 -0.0178206510 -0.075268440 NA 0.173426158
## BsmtFinSF2 -0.0056094902 -0.065598386 NA 0.114691227
## BsmtUnfSF -0.0070000331 -0.140890171 NA -0.003774031
## TotalBsmtSF -0.0283118194 -0.255441005 NA 0.221939938
## X1stFlrSF 0.0016663035 -0.265000693 NA 0.267643644
## X2ndFlrSF 0.0025784921 0.311293638 NA 0.037276582
## LowQualFinSF -0.0441275123 0.046499262 NA 0.005675275
## GrLivArea -0.0008462866 0.077955528 NA 0.231886955
## BsmtFullBath -0.0010194236 0.003281653 NA 0.147594611
## BsmtHalfBath -0.0197198713 -0.002508698 NA 0.047390546
## FullBath 0.0040051155 0.132131037 NA 0.117335855
## HalfBath 0.0052481188 0.177476000 NA 0.005980504
## BedroomAbvGr 0.0367743224 -0.023626587 NA 0.118959513
## KitchenAbvGr 0.0032223876 0.281783056 NA -0.016565309
## TotRmsAbvGrd 0.0238544181 0.040246635 NA 0.173629285
## Fireplaces -0.0246730773 -0.046376588 NA 0.259700916
## GarageYrBlt NA NA NA NA
## GarageCars 0.0157828536 -0.040490374 NA 0.150977421
## GarageArea 0.0132664527 -0.100144776 NA 0.162182789
## WoodDeckSF -0.0306440156 -0.012852609 NA 0.167040055
## OpenPorchSF -0.0024713384 -0.006686882 NA 0.061679261
## EnclosedPorch 0.0033478948 -0.011966400 NA -0.016108446
## X3SsnPorch -0.0465399309 -0.043802236 NA 0.021505371
## ScreenPorch 0.0016739186 -0.025978506 NA 0.045620158
## PoolArea 0.0408707136 0.007956894 NA 0.033875227
## MiscVal -0.0061383622 -0.007665753 NA 0.039192359
## MoSold 0.0232451338 -0.013512341 NA 0.007188288
## YrSold 0.0007934422 -0.021329726 NA -0.013014088
## SalePrice -0.0274548863 -0.088160149 NA 0.269866484
## OverallQual OverallCond YearBuilt YearRemodAdd
## Id -0.03232331 0.013337479 -0.014033589 -0.023075509
## MSSubClass 0.03241562 -0.059276572 0.027689352 0.040458748
## LotFrontage NA NA NA NA
## LotArea 0.08871877 -0.002832285 0.006590226 0.006930318
## OverallQual 1.00000000 -0.090691730 0.571711832 0.550970612
## OverallCond -0.09069173 1.000000000 -0.375691114 0.074702591
## YearBuilt 0.57171183 -0.375691114 1.000000000 0.591906136
## YearRemodAdd 0.55097061 0.074702591 0.591906136 1.000000000
## MasVnrArea NA NA NA NA
## BsmtFinSF1 0.21307936 -0.042542236 0.248272491 0.121689609
## BsmtFinSF2 -0.05752025 0.040014833 -0.048393371 -0.067187713
## BsmtUnfSF 0.31016404 -0.137266510 0.148810145 0.180972421
## TotalBsmtSF 0.53266599 -0.176000436 0.399866607 0.294866090
## X1stFlrSF 0.46204182 -0.145612855 0.279928714 0.238304489
## X2ndFlrSF 0.27974502 0.031296654 0.002953154 0.136103360
## LowQualFinSF -0.02982579 0.025406473 -0.183719954 -0.062215160
## GrLivArea 0.58351920 -0.078567355 0.192644951 0.289264278
## BsmtFullBath 0.10409198 -0.053106837 0.185009254 0.116765047
## BsmtHalfBath -0.04717213 0.117206818 -0.039945271 -0.013296723
## FullBath 0.54379093 -0.194167121 0.466710000 0.438211507
## HalfBath 0.26743138 -0.059926614 0.240143681 0.181135981
## BedroomAbvGr 0.09684815 0.013248892 -0.072623278 -0.041918503
## KitchenAbvGr -0.18428060 -0.087204458 -0.174481002 -0.149287577
## TotRmsAbvGrd 0.41583390 -0.055766348 0.089206884 0.187520258
## Fireplaces 0.38742490 -0.022277117 0.143162318 0.108731621
## GarageYrBlt NA NA NA NA
## GarageCars 0.59873889 -0.185493758 0.536748656 0.419572822
## GarageArea 0.55490469 -0.150679146 0.477311363 0.369589852
## WoodDeckSF 0.23281894 -0.003063120 0.222690462 0.204019691
## OpenPorchSF 0.29780274 -0.029648925 0.183905184 0.222648772
## EnclosedPorch -0.11240732 0.070102698 -0.386903576 -0.193348040
## X3SsnPorch 0.03162059 0.025418921 0.031717375 0.045595823
## ScreenPorch 0.06773223 0.054616656 -0.049702593 -0.038176464
## PoolArea 0.01812125 0.008078797 -0.014372825 -0.009490188
## MiscVal -0.03106845 0.068729421 -0.034192666 -0.010099838
## MoSold 0.07641430 -0.003135262 0.013880837 0.022628793
## YrSold -0.02432058 0.043754812 -0.012593267 0.036597493
## SalePrice 0.80085836 -0.080201802 0.535279432 0.521427960
## MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## Id NA -0.017820651 -0.005609490 -0.007000033
## MSSubClass NA -0.075268440 -0.065598386 -0.140890171
## LotFrontage NA NA NA NA
## LotArea NA 0.173426158 0.114691227 -0.003774031
## OverallQual NA 0.213079362 -0.057520249 0.310164044
## OverallCond NA -0.042542236 0.040014833 -0.137266510
## YearBuilt NA 0.248272491 -0.048393371 0.148810145
## YearRemodAdd NA 0.121689609 -0.067187713 0.180972421
## MasVnrArea 1 NA NA NA
## BsmtFinSF1 NA 1.000000000 -0.048738073 -0.526140244
## BsmtFinSF2 NA -0.048738073 1.000000000 -0.209285937
## BsmtUnfSF NA -0.526140244 -0.209285937 1.000000000
## TotalBsmtSF NA 0.460323665 0.116477633 0.441625132
## X1stFlrSF NA 0.386453075 0.106133963 0.331573791
## X2ndFlrSF NA -0.183357567 -0.098241420 0.002749242
## LowQualFinSF NA -0.066610894 0.014713622 0.028252980
## GrLivArea NA 0.121479030 -0.004994960 0.251631936
## BsmtFullBath NA 0.661932650 0.160254329 -0.424026185
## BsmtHalfBath NA 0.068868916 0.071985752 -0.099007488
## FullBath NA 0.037158794 -0.075290707 0.289399490
## HalfBath NA -0.014507635 -0.031242668 -0.041925429
## BedroomAbvGr NA -0.121893063 -0.015134114 0.166583946
## KitchenAbvGr NA -0.082722357 -0.040926117 0.030226318
## TotRmsAbvGrd NA 0.001876651 -0.033490456 0.251935602
## Fireplaces NA 0.236218676 0.049027376 0.051796777
## GarageYrBlt NA NA NA NA
## GarageCars NA 0.224043217 -0.037330543 0.213772291
## GarageArea NA 0.268650796 -0.016484860 0.184562127
## WoodDeckSF NA 0.201462192 0.069027662 -0.006875562
## OpenPorchSF NA 0.071850658 0.005083074 0.129147583
## EnclosedPorch NA -0.103053229 0.036268978 -0.002336340
## X3SsnPorch NA 0.029879169 -0.030089659 0.020843281
## ScreenPorch NA 0.070025842 0.088676018 -0.012435350
## PoolArea NA 0.016379573 0.053177619 -0.031243573
## MiscVal NA 0.005148597 0.004870854 -0.023802148
## MoSold NA -0.001773449 -0.015725934 0.035455949
## YrSold NA 0.018506484 0.031383706 -0.040834117
## SalePrice NA 0.395923108 -0.008899911 0.220677828
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF
## Id -0.028311819 0.001666304 0.002578492 -4.412751e-02
## MSSubClass -0.255441005 -0.265000693 0.311293638 4.649926e-02
## LotFrontage NA NA NA NA
## LotArea 0.221939938 0.267643644 0.037276582 5.675275e-03
## OverallQual 0.532665986 0.462041822 0.279745021 -2.982579e-02
## OverallCond -0.176000436 -0.145612855 0.031296654 2.540647e-02
## YearBuilt 0.399866607 0.279928714 0.002953154 -1.837200e-01
## YearRemodAdd 0.294866090 0.238304489 0.136103360 -6.221516e-02
## MasVnrArea NA NA NA NA
## BsmtFinSF1 0.460323665 0.386453075 -0.183357567 -6.661089e-02
## BsmtFinSF2 0.116477633 0.106133963 -0.098241420 1.471362e-02
## BsmtUnfSF 0.441625132 0.331573791 0.002749242 2.825298e-02
## TotalBsmtSF 1.000000000 0.800758989 -0.226960337 -3.345752e-02
## X1stFlrSF 0.800758989 1.000000000 -0.252296704 -1.312801e-02
## X2ndFlrSF -0.226960337 -0.252296704 1.000000000 6.514187e-02
## LowQualFinSF -0.033457516 -0.013128013 0.065141871 1.000000e+00
## GrLivArea 0.394829176 0.522920244 0.687429564 1.448249e-01
## BsmtFullBath 0.298870883 0.232826186 -0.178520522 -4.697760e-02
## BsmtHalfBath -0.006119831 -0.004382854 -0.032587094 -5.605597e-03
## FullBath 0.319777839 0.374630683 0.410642097 1.239123e-06
## HalfBath -0.072369928 -0.144372621 0.609022074 -2.673046e-02
## BedroomAbvGr 0.045549161 0.125474298 0.502450076 1.060079e-01
## KitchenAbvGr -0.069964342 0.074554578 0.061777147 7.452500e-03
## TotRmsAbvGrd 0.259133114 0.390639219 0.610793572 1.333444e-01
## Fireplaces 0.321377773 0.396829341 0.182722299 -2.072834e-02
## GarageYrBlt NA NA NA NA
## GarageCars 0.448605965 0.445861364 0.174846717 -9.431482e-02
## GarageArea 0.472002530 0.474245802 0.125023360 -6.747432e-02
## WoodDeckSF 0.229984154 0.230864886 0.083669986 -2.511374e-02
## OpenPorchSF 0.215558908 0.179049159 0.198406685 1.933851e-02
## EnclosedPorch -0.095871638 -0.063073606 0.065690250 6.097457e-02
## X3SsnPorch 0.041761766 0.060552703 -0.023739859 -4.334208e-03
## ScreenPorch 0.094511019 0.097092939 0.043307602 2.671336e-02
## PoolArea 0.004418110 0.032087836 0.038375042 7.307037e-02
## MiscVal -0.018253492 -0.020800889 0.017111398 -3.821953e-03
## MoSold 0.030026019 0.045080784 0.039163439 -2.244100e-02
## YrSold -0.012192121 -0.010013637 -0.024874105 -2.907371e-02
## SalePrice 0.646584498 0.625234719 0.297301302 -2.535064e-02
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Id -0.0008462866 -0.001019424 -0.019719871 4.005115e-03
## MSSubClass 0.0779555283 0.003281653 -0.002508698 1.321310e-01
## LotFrontage NA NA NA NA
## LotArea 0.2318869552 0.147594611 0.047390546 1.173359e-01
## OverallQual 0.5835191995 0.104091983 -0.047172132 5.437909e-01
## OverallCond -0.0785673546 -0.053106837 0.117206818 -1.941671e-01
## YearBuilt 0.1926449508 0.185009254 -0.039945271 4.667100e-01
## YearRemodAdd 0.2892642784 0.116765047 -0.013296723 4.382115e-01
## MasVnrArea NA NA NA NA
## BsmtFinSF1 0.1214790300 0.661932650 0.068868916 3.715879e-02
## BsmtFinSF2 -0.0049949602 0.160254329 0.071985752 -7.529071e-02
## BsmtUnfSF 0.2516319360 -0.424026185 -0.099007488 2.893995e-01
## TotalBsmtSF 0.3948291761 0.298870883 -0.006119831 3.197778e-01
## X1stFlrSF 0.5229202439 0.232826186 -0.004382854 3.746307e-01
## X2ndFlrSF 0.6874295638 -0.178520522 -0.032587094 4.106421e-01
## LowQualFinSF 0.1448248926 -0.046977597 -0.005605597 1.239123e-06
## GrLivArea 1.0000000000 0.013406111 -0.032112178 6.351612e-01
## BsmtFullBath 0.0134061114 1.000000000 -0.146201453 -6.945738e-02
## BsmtHalfBath -0.0321121779 -0.146201453 1.000000000 -6.137911e-02
## FullBath 0.6351612085 -0.069457376 -0.061379110 1.000000e+00
## HalfBath 0.4190516237 -0.034860782 -0.015172957 1.303351e-01
## BedroomAbvGr 0.5400833656 -0.152267699 0.043330861 3.609899e-01
## KitchenAbvGr 0.1098094664 -0.041035509 -0.037682320 1.353522e-01
## TotRmsAbvGrd 0.8339786219 -0.063714744 -0.028715371 5.496252e-01
## Fireplaces 0.4516621451 0.130932663 0.024536785 2.364775e-01
## GarageYrBlt NA NA NA NA
## GarageCars 0.4740577872 0.130567676 -0.024971992 4.653247e-01
## GarageArea 0.4545116864 0.170653430 -0.028212797 4.007804e-01
## WoodDeckSF 0.2418269598 0.174636052 0.034626278 1.821310e-01
## OpenPorchSF 0.3073253482 0.056250603 -0.024384947 2.529112e-01
## EnclosedPorch 0.0161478136 -0.049033639 -0.007802929 -1.138121e-01
## X3SsnPorch 0.0239668378 0.000249040 0.035564678 3.630391e-02
## ScreenPorch 0.1124084791 0.024074572 0.032901535 -6.556984e-03
## PoolArea 0.0643456922 0.037039385 0.027886165 2.150946e-02
## MiscVal -0.0009740931 -0.022877398 -0.007211400 -1.387200e-02
## MoSold 0.0653284782 -0.023770135 0.038478449 5.819696e-02
## YrSold -0.0318983131 0.067665051 -0.045302641 -1.657378e-02
## SalePrice 0.7205163007 0.235696782 -0.036792474 5.590482e-01
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## Id 0.005248119 0.036774322 0.003222388 0.023854418
## MSSubClass 0.177476000 -0.023626587 0.281783056 0.040246635
## LotFrontage NA NA NA NA
## LotArea 0.005980504 0.118959513 -0.016565309 0.173629285
## OverallQual 0.267431383 0.096848146 -0.184280603 0.415833902
## OverallCond -0.059926614 0.013248892 -0.087204458 -0.055766348
## YearBuilt 0.240143681 -0.072623278 -0.174481002 0.089206884
## YearRemodAdd 0.181135981 -0.041918503 -0.149287577 0.187520258
## MasVnrArea NA NA NA NA
## BsmtFinSF1 -0.014507635 -0.121893063 -0.082722357 0.001876651
## BsmtFinSF2 -0.031242668 -0.015134114 -0.040926117 -0.033490456
## BsmtUnfSF -0.041925429 0.166583946 0.030226318 0.251935602
## TotalBsmtSF -0.072369928 0.045549161 -0.069964342 0.259133114
## X1stFlrSF -0.144372621 0.125474298 0.074554578 0.390639219
## X2ndFlrSF 0.609022074 0.502450076 0.061777147 0.610793572
## LowQualFinSF -0.026730455 0.106007883 0.007452500 0.133344353
## GrLivArea 0.419051624 0.540083366 0.109809466 0.833978622
## BsmtFullBath -0.034860782 -0.152267699 -0.041035509 -0.063714744
## BsmtHalfBath -0.015172957 0.043330861 -0.037682320 -0.028715371
## FullBath 0.130335116 0.360989894 0.135352207 0.549625234
## HalfBath 1.000000000 0.224798930 -0.067693846 0.338617884
## BedroomAbvGr 0.224798930 1.000000000 0.199328385 0.679346237
## KitchenAbvGr -0.067693846 0.199328385 1.000000000 0.260103377
## TotRmsAbvGrd 0.338617884 0.679346237 0.260103377 1.000000000
## Fireplaces 0.198393911 0.103951004 -0.123688494 0.315643170
## GarageYrBlt NA NA NA NA
## GarageCars 0.215800298 0.083083778 -0.050014806 0.358068680
## GarageArea 0.157317275 0.062108286 -0.063668606 0.325466799
## WoodDeckSF 0.104537626 0.044038842 -0.089670237 0.159720485
## OpenPorchSF 0.194921319 0.093803316 -0.069738368 0.219969071
## EnclosedPorch -0.094316841 0.042401896 0.037112509 0.006789744
## X3SsnPorch -0.004589731 -0.024262571 -0.024669916 -0.005908304
## ScreenPorch 0.073496772 0.044941354 -0.051778708 0.061924394
## PoolArea 0.001009749 0.064117862 -0.012306310 0.041587857
## MiscVal 0.001589181 0.007964924 0.062294269 0.025639926
## MoSold -0.007126783 0.048477204 0.026340058 0.041965802
## YrSold -0.008853373 -0.034848689 0.031454021 -0.032189520
## SalePrice 0.282924892 0.160541722 -0.138848617 0.537461767
## Fireplaces GarageYrBlt GarageCars GarageArea
## Id -0.024673077 NA 0.015782854 0.01326645
## MSSubClass -0.046376588 NA -0.040490374 -0.10014478
## LotFrontage NA NA NA NA
## LotArea 0.259700916 NA 0.150977421 0.16218279
## OverallQual 0.387424903 NA 0.598738891 0.55490469
## OverallCond -0.022277117 NA -0.185493758 -0.15067915
## YearBuilt 0.143162318 NA 0.536748656 0.47731136
## YearRemodAdd 0.108731621 NA 0.419572822 0.36958985
## MasVnrArea NA NA NA NA
## BsmtFinSF1 0.236218676 NA 0.224043217 0.26865080
## BsmtFinSF2 0.049027376 NA -0.037330543 -0.01648486
## BsmtUnfSF 0.051796777 NA 0.213772291 0.18456213
## TotalBsmtSF 0.321377773 NA 0.448605965 0.47200253
## X1stFlrSF 0.396829341 NA 0.445861364 0.47424580
## X2ndFlrSF 0.182722299 NA 0.174846717 0.12502336
## LowQualFinSF -0.020728345 NA -0.094314819 -0.06747432
## GrLivArea 0.451662145 NA 0.474057787 0.45451169
## BsmtFullBath 0.130932663 NA 0.130567676 0.17065343
## BsmtHalfBath 0.024536785 NA -0.024971992 -0.02821280
## FullBath 0.236477476 NA 0.465324740 0.40078044
## HalfBath 0.198393911 NA 0.215800298 0.15731727
## BedroomAbvGr 0.103951004 NA 0.083083778 0.06210829
## KitchenAbvGr -0.123688494 NA -0.050014806 -0.06366861
## TotRmsAbvGrd 0.315643170 NA 0.358068680 0.32546680
## Fireplaces 1.000000000 NA 0.297666003 0.25685254
## GarageYrBlt NA 1 NA NA
## GarageCars 0.297666003 NA 1.000000000 0.88688169
## GarageArea 0.256852545 NA 0.886881692 1.00000000
## WoodDeckSF 0.194972138 NA 0.223009904 0.21996742
## OpenPorchSF 0.160646855 NA 0.209762044 0.22808912
## EnclosedPorch -0.022885417 NA -0.150590002 -0.12061485
## X3SsnPorch 0.012042209 NA 0.036289595 0.03621290
## ScreenPorch 0.187656148 NA 0.051622319 0.05373159
## PoolArea 0.051221254 NA 0.003359958 0.01163730
## MiscVal 0.001942724 NA -0.042885515 -0.02708835
## MoSold 0.053946716 NA 0.041607745 0.03460232
## YrSold -0.022566883 NA -0.037179042 -0.02587008
## SalePrice 0.466765283 NA 0.649256334 0.63696359
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## Id -0.030644016 -0.002471338 0.003347895 -0.0465399309
## MSSubClass -0.012852609 -0.006686882 -0.011966400 -0.0438022359
## LotFrontage NA NA NA NA
## LotArea 0.167040055 0.061679261 -0.016108446 0.0215053707
## OverallQual 0.232818939 0.297802744 -0.112407321 0.0316205860
## OverallCond -0.003063120 -0.029648925 0.070102698 0.0254189208
## YearBuilt 0.222690462 0.183905184 -0.386903576 0.0317173746
## YearRemodAdd 0.204019691 0.222648772 -0.193348040 0.0455958229
## MasVnrArea NA NA NA NA
## BsmtFinSF1 0.201462192 0.071850658 -0.103053229 0.0298791692
## BsmtFinSF2 0.069027662 0.005083074 0.036268978 -0.0300896587
## BsmtUnfSF -0.006875562 0.129147583 -0.002336340 0.0208432809
## TotalBsmtSF 0.229984154 0.215558908 -0.095871638 0.0417617657
## X1stFlrSF 0.230864886 0.179049159 -0.063073606 0.0605527028
## X2ndFlrSF 0.083669986 0.198406685 0.065690250 -0.0237398590
## LowQualFinSF -0.025113736 0.019338513 0.060974566 -0.0043342079
## GrLivArea 0.241826960 0.307325348 0.016147814 0.0239668378
## BsmtFullBath 0.174636052 0.056250603 -0.049033639 0.0002490400
## BsmtHalfBath 0.034626278 -0.024384947 -0.007802929 0.0355646780
## FullBath 0.182131041 0.252911183 -0.113812140 0.0363039057
## HalfBath 0.104537626 0.194921319 -0.094316841 -0.0045897314
## BedroomAbvGr 0.044038842 0.093803316 0.042401896 -0.0242625708
## KitchenAbvGr -0.089670237 -0.069738368 0.037112509 -0.0246699160
## TotRmsAbvGrd 0.159720485 0.219969071 0.006789744 -0.0059083044
## Fireplaces 0.194972138 0.160646855 -0.022885417 0.0120422088
## GarageYrBlt NA NA NA NA
## GarageCars 0.223009904 0.209762044 -0.150590002 0.0362895946
## GarageArea 0.219967415 0.228089122 -0.120614848 0.0362129011
## WoodDeckSF 1.000000000 0.053498193 -0.125150835 -0.0324722989
## OpenPorchSF 0.053498193 1.000000000 -0.092093705 -0.0051484572
## EnclosedPorch -0.125150835 -0.092093705 1.000000000 -0.0374274621
## X3SsnPorch -0.032472299 -0.005148457 -0.037427462 1.0000000000
## ScreenPorch -0.073489496 0.077261261 -0.083154070 -0.0315259573
## PoolArea 0.068308638 0.030359975 0.068796316 -0.0067704980
## MiscVal -0.009287455 -0.018276553 0.018277473 0.0003259486
## MoSold 0.024595095 0.072515014 -0.029564738 0.0293859621
## YrSold 0.023859822 -0.056326427 -0.010342520 0.0185163895
## SalePrice 0.322537864 0.330360776 -0.129773817 0.0474141193
## ScreenPorch PoolArea MiscVal MoSold
## Id 0.001673919 0.040870714 -0.0061383622 0.023245134
## MSSubClass -0.025978506 0.007956894 -0.0076657529 -0.013512341
## LotFrontage NA NA NA NA
## LotArea 0.045620158 0.033875227 0.0391923589 0.007188288
## OverallQual 0.067732233 0.018121246 -0.0310684533 0.076414295
## OverallCond 0.054616656 0.008078797 0.0687294213 -0.003135262
## YearBuilt -0.049702593 -0.014372825 -0.0341926665 0.013880837
## YearRemodAdd -0.038176464 -0.009490188 -0.0100998380 0.022628793
## MasVnrArea NA NA NA NA
## BsmtFinSF1 0.070025842 0.016379573 0.0051485974 -0.001773449
## BsmtFinSF2 0.088676018 0.053177619 0.0048708535 -0.015725934
## BsmtUnfSF -0.012435350 -0.031243573 -0.0238021485 0.035455949
## TotalBsmtSF 0.094511019 0.004418110 -0.0182534921 0.030026019
## X1stFlrSF 0.097092939 0.032087836 -0.0208008890 0.045080784
## X2ndFlrSF 0.043307602 0.038375042 0.0171113981 0.039163439
## LowQualFinSF 0.026713364 0.073070372 -0.0038219534 -0.022440998
## GrLivArea 0.112408479 0.064345692 -0.0009740931 0.065328478
## BsmtFullBath 0.024074572 0.037039385 -0.0228773978 -0.023770135
## BsmtHalfBath 0.032901535 0.027886165 -0.0072114001 0.038478449
## FullBath -0.006556984 0.021509463 -0.0138719953 0.058196963
## HalfBath 0.073496772 0.001009749 0.0015891807 -0.007126783
## BedroomAbvGr 0.044941354 0.064117862 0.0079649243 0.048477204
## KitchenAbvGr -0.051778708 -0.012306310 0.0622942688 0.026340058
## TotRmsAbvGrd 0.061924394 0.041587857 0.0256399258 0.041965802
## Fireplaces 0.187656148 0.051221254 0.0019427237 0.053946716
## GarageYrBlt NA NA NA NA
## GarageCars 0.051622319 0.003359958 -0.0428855151 0.041607745
## GarageArea 0.053731588 0.011637301 -0.0270883487 0.034602322
## WoodDeckSF -0.073489496 0.068308638 -0.0092874553 0.024595095
## OpenPorchSF 0.077261261 0.030359975 -0.0182765535 0.072515014
## EnclosedPorch -0.083154070 0.068796316 0.0182774735 -0.029564738
## X3SsnPorch -0.031525957 -0.006770498 0.0003259486 0.029385962
## ScreenPorch 1.000000000 0.063724323 0.0318842020 0.022863399
## PoolArea 0.063724323 1.000000000 0.0354804284 -0.022901046
## MiscVal 0.031884202 0.035480428 1.0000000000 -0.006656820
## MoSold 0.022863399 -0.022901046 -0.0066568201 1.000000000
## YrSold 0.010382918 -0.062640205 0.0048055564 -0.146229332
## SalePrice 0.118324340 0.032819025 -0.0210967705 0.056796504
## YrSold SalePrice
## Id 0.0007934422 -0.027454886
## MSSubClass -0.0213297258 -0.088160149
## LotFrontage NA NA
## LotArea -0.0130140877 0.269866484
## OverallQual -0.0243205816 0.800858356
## OverallCond 0.0437548122 -0.080201802
## YearBuilt -0.0125932671 0.535279432
## YearRemodAdd 0.0365974929 0.521427960
## MasVnrArea NA NA
## BsmtFinSF1 0.0185064844 0.395923108
## BsmtFinSF2 0.0313837059 -0.008899911
## BsmtUnfSF -0.0408341174 0.220677828
## TotalBsmtSF -0.0121921213 0.646584498
## X1stFlrSF -0.0100136366 0.625234719
## X2ndFlrSF -0.0248741055 0.297301302
## LowQualFinSF -0.0290737148 -0.025350636
## GrLivArea -0.0318983131 0.720516301
## BsmtFullBath 0.0676650512 0.235696782
## BsmtHalfBath -0.0453026411 -0.036792474
## FullBath -0.0165737812 0.559048238
## HalfBath -0.0088533728 0.282924892
## BedroomAbvGr -0.0348486895 0.160541722
## KitchenAbvGr 0.0314540208 -0.138848617
## TotRmsAbvGrd -0.0321895196 0.537461767
## Fireplaces -0.0225668829 0.466765283
## GarageYrBlt NA NA
## GarageCars -0.0371790418 0.649256334
## GarageArea -0.0258700820 0.636963593
## WoodDeckSF 0.0238598225 0.322537864
## OpenPorchSF -0.0563264274 0.330360776
## EnclosedPorch -0.0103425199 -0.129773817
## X3SsnPorch 0.0185163895 0.047414119
## ScreenPorch 0.0103829179 0.118324340
## PoolArea -0.0626402045 0.032819025
## MiscVal 0.0048055564 -0.021096770
## MoSold -0.1462293319 0.056796504
## YrSold 1.0000000000 -0.023693833
## SalePrice -0.0236938330 1.000000000
The highest correlations with SalePrice are for OverallQual, YearBuilt, YearRemodAdd, TotalBsmtSF, X1stFlrSF, GrLivArea, FullBath, TotRmsAbvGrd, GarageCars, and GarageArea.
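Rather than reading the SalePrice column off the full matrix by eye, the ranking can be pulled out in code. This is only a minimal sketch on a made-up data frame: `toy`, its values, and the 0.5 cutoff are assumptions for illustration, not part of the analysis above. It also uses `use = "pairwise.complete.obs"`, which avoids the NA entries that columns with missing values (such as LotFrontage) produce in the matrix above.

```r
# Toy stand-in for trainnumonly; column values are synthetic.
set.seed(1)
n <- 200
toy <- data.frame(
  OverallQual = sample(1:10, n, replace = TRUE),
  GrLivArea   = round(runif(n, 600, 3000)),
  MoSold      = sample(1:12, n, replace = TRUE),
  LotFrontage = round(runif(n, 20, 120))
)
toy$SalePrice <- 15000 * toy$OverallQual + 60 * toy$GrLivArea +
  rnorm(n, sd = 15000)
toy$LotFrontage[1:10] <- NA   # some missing values, as in the real data

# Pairwise-complete correlations: no NA rows or columns in the result.
cm <- cor(toy, use = "pairwise.complete.obs")

# Correlation of every predictor with SalePrice, strongest first.
with_price <- cm[rownames(cm) != "SalePrice", "SalePrice"]
ranked <- sort(abs(with_price), decreasing = TRUE)
print(ranked)

# Keep the predictors above an (arbitrary) cutoff.
strong <- names(ranked)[ranked > 0.5]
print(strong)
```

The names in `strong` could then be used to subset columns instead of hard-coding column numbers.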
trainhigh<-trainnumonly[,c(5,7,8,13,14,17,20,24,27,28,38)]
cor(trainhigh)
## OverallQual YearBuilt YearRemodAdd TotalBsmtSF X1stFlrSF
## OverallQual 1.0000000 0.57171183 0.5509706 0.5326660 0.4620418
## YearBuilt 0.5717118 1.00000000 0.5919061 0.3998666 0.2799287
## YearRemodAdd 0.5509706 0.59190614 1.0000000 0.2948661 0.2383045
## TotalBsmtSF 0.5326660 0.39986661 0.2948661 1.0000000 0.8007590
## X1stFlrSF 0.4620418 0.27992871 0.2383045 0.8007590 1.0000000
## GrLivArea 0.5835192 0.19264495 0.2892643 0.3948292 0.5229202
## FullBath 0.5437909 0.46671000 0.4382115 0.3197778 0.3746307
## TotRmsAbvGrd 0.4158339 0.08920688 0.1875203 0.2591331 0.3906392
## GarageCars 0.5987389 0.53674866 0.4195728 0.4486060 0.4458614
## GarageArea 0.5549047 0.47731136 0.3695899 0.4720025 0.4742458
## SalePrice 0.8008584 0.53527943 0.5214280 0.6465845 0.6252347
## GrLivArea FullBath TotRmsAbvGrd GarageCars GarageArea
## OverallQual 0.5835192 0.5437909 0.41583390 0.5987389 0.5549047
## YearBuilt 0.1926450 0.4667100 0.08920688 0.5367487 0.4773114
## YearRemodAdd 0.2892643 0.4382115 0.18752026 0.4195728 0.3695899
## TotalBsmtSF 0.3948292 0.3197778 0.25913311 0.4486060 0.4720025
## X1stFlrSF 0.5229202 0.3746307 0.39063922 0.4458614 0.4742458
## GrLivArea 1.0000000 0.6351612 0.83397862 0.4740578 0.4545117
## FullBath 0.6351612 1.0000000 0.54962523 0.4653247 0.4007804
## TotRmsAbvGrd 0.8339786 0.5496252 1.00000000 0.3580687 0.3254668
## GarageCars 0.4740578 0.4653247 0.35806868 1.0000000 0.8868817
## GarageArea 0.4545117 0.4007804 0.32546680 0.8868817 1.0000000
## SalePrice 0.7205163 0.5590482 0.53746177 0.6492563 0.6369636
## SalePrice
## OverallQual 0.8008584
## YearBuilt 0.5352794
## YearRemodAdd 0.5214280
## TotalBsmtSF 0.6465845
## X1stFlrSF 0.6252347
## GrLivArea 0.7205163
## FullBath 0.5590482
## TotRmsAbvGrd 0.5374618
## GarageCars 0.6492563
## GarageArea 0.6369636
## SalePrice 1.0000000
Closely related pairs: GarageCars and GarageArea; TotRmsAbvGrd and GrLivArea; X1stFlrSF and TotalBsmtSF.
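The closely related pairs can also be listed mechanically. The sketch below rebuilds a small piece of the correlation matrix by hand from the values printed above (four of the variables only) and flags every pair above a cutoff. The 0.8 cutoff and the name `high_pairs` are my own choices for illustration.

```r
# Part of the correlation matrix above, typed in by hand.
vars <- c("TotalBsmtSF", "X1stFlrSF", "GarageCars", "GarageArea")
cm <- matrix(c(
  1.0000000, 0.8007590, 0.4486060, 0.4720025,
  0.8007590, 1.0000000, 0.4458614, 0.4742458,
  0.4486060, 0.4458614, 1.0000000, 0.8868817,
  0.4720025, 0.4742458, 0.8868817, 1.0000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Look only above the diagonal so each pair is reported once.
idx <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```

Run on the full `cor(trainhigh)` matrix, the same lines would presumably also flag GrLivArea and TotRmsAbvGrd (0.834).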
Now removing some outliers I missed earlier.
plot(trainhigh)
plot(trainhigh$SalePrice,trainhigh$X1stFlrSF)
trainhigh<-subset(trainhigh,trainhigh$X1stFlrSF<2500)
plot(trainhigh$SalePrice,trainhigh$X1stFlrSF)
plot(trainhigh)
plot(trainhigh$SalePrice,trainhigh$TotalBsmtSF)
trainhigh<-subset(trainhigh,trainhigh$TotalBsmtSF<2500)
plot(trainhigh$SalePrice,trainhigh$TotalBsmtSF)
cor(trainhigh)
## OverallQual YearBuilt YearRemodAdd TotalBsmtSF X1stFlrSF
## OverallQual 1.0000000 0.57082722 0.5516664 0.5260381 0.4541321
## YearBuilt 0.5708272 1.00000000 0.5910612 0.4034064 0.2819163
## YearRemodAdd 0.5516664 0.59106119 1.0000000 0.2979495 0.2438063
## TotalBsmtSF 0.5260381 0.40340637 0.2979495 1.0000000 0.7960045
## X1stFlrSF 0.4541321 0.28191632 0.2438063 0.7960045 1.0000000
## GrLivArea 0.5794769 0.19034522 0.2912482 0.3782374 0.5050480
## FullBath 0.5436030 0.46769096 0.4401974 0.3097467 0.3645304
## TotRmsAbvGrd 0.4100516 0.08613077 0.1873652 0.2434663 0.3764859
## GarageCars 0.5964638 0.53527734 0.4181702 0.4491701 0.4508248
## GarageArea 0.5527161 0.47578264 0.3684276 0.4750226 0.4805533
## SalePrice 0.7984023 0.53661391 0.5248672 0.6355049 0.6162173
## GrLivArea FullBath TotRmsAbvGrd GarageCars GarageArea
## OverallQual 0.5794769 0.5436030 0.41005155 0.5964638 0.5527161
## YearBuilt 0.1903452 0.4676910 0.08613077 0.5352773 0.4757826
## YearRemodAdd 0.2912482 0.4401974 0.18736518 0.4181702 0.3684276
## TotalBsmtSF 0.3782374 0.3097467 0.24346627 0.4491701 0.4750226
## X1stFlrSF 0.5050480 0.3645304 0.37648592 0.4508248 0.4805533
## GrLivArea 1.0000000 0.6318396 0.83227643 0.4742825 0.4544235
## FullBath 0.6318396 1.0000000 0.54589573 0.4651898 0.4003586
## TotRmsAbvGrd 0.8322764 0.5458957 1.00000000 0.3560526 0.3234478
## GarageCars 0.4742825 0.4651898 0.35605261 1.0000000 0.8865733
## GarageArea 0.4544235 0.4003586 0.32344775 0.8865733 1.0000000
## SalePrice 0.7181285 0.5593858 0.53244421 0.6506223 0.6397290
## SalePrice
## OverallQual 0.7984023
## YearBuilt 0.5366139
## YearRemodAdd 0.5248672
## TotalBsmtSF 0.6355049
## X1stFlrSF 0.6162173
## GrLivArea 0.7181285
## FullBath 0.5593858
## TotRmsAbvGrd 0.5324442
## GarageCars 0.6506223
## GarageArea 0.6397290
## SalePrice 1.0000000
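Several of the correlation coefficients with SalePrice have dropped slightly here (for example, TotalBsmtSF went from 0.647 to 0.636). I’m going to run the linear models with and without the outliers to see what I get.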
summary(lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea + GarageCars, data=train2))
##
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + TotalBsmtSF +
## GrLivArea + GarageCars, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -141345 -19786 -1804 16447 249419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.147e+05 7.391e+04 -9.669 < 2e-16 ***
## OverallQual 1.850e+04 1.014e+03 18.257 < 2e-16 ***
## YearBuilt 3.214e+02 3.883e+01 8.278 2.80e-16 ***
## TotalBsmtSF 4.243e+01 2.582e+00 16.432 < 2e-16 ***
## GrLivArea 5.575e+01 2.322e+00 24.015 < 2e-16 ***
## GarageCars 1.130e+04 1.604e+03 7.045 2.86e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33360 on 1450 degrees of freedom
## Multiple R-squared: 0.8114, Adjusted R-squared: 0.8108
## F-statistic: 1248 on 5 and 1450 DF, p-value: < 2.2e-16
summary(lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea + GarageCars, data=trainhigh))
##
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + TotalBsmtSF +
## GrLivArea + GarageCars, data = trainhigh)
##
## Residuals:
## Min 1Q Median 3Q Max
## -140803 -19499 -1648 16122 250173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.193e+05 7.330e+04 -9.813 < 2e-16 ***
## OverallQual 1.823e+04 1.007e+03 18.095 < 2e-16 ***
## YearBuilt 3.245e+02 3.852e+01 8.425 < 2e-16 ***
## TotalBsmtSF 4.211e+01 2.653e+00 15.868 < 2e-16 ***
## GrLivArea 5.616e+01 2.319e+00 24.211 < 2e-16 ***
## GarageCars 1.131e+04 1.595e+03 7.091 2.08e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33060 on 1443 degrees of freedom
## Multiple R-squared: 0.8087, Adjusted R-squared: 0.808
## F-statistic: 1220 on 5 and 1443 DF, p-value: < 2.2e-16
summary(lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea + GarageCars + FullBath, data=train2))
##
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + TotalBsmtSF +
## GrLivArea + GarageCars + FullBath, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -140731 -19122 -1714 16763 246694
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.346e+05 7.865e+04 -10.611 < 2e-16 ***
## OverallQual 1.867e+04 1.008e+03 18.515 < 2e-16 ***
## YearBuilt 3.853e+02 4.139e+01 9.309 < 2e-16 ***
## TotalBsmtSF 4.145e+01 2.577e+00 16.085 < 2e-16 ***
## GrLivArea 6.191e+01 2.721e+00 22.752 < 2e-16 ***
## GarageCars 1.141e+04 1.595e+03 7.155 1.32e-12 ***
## FullBath -9.902e+03 2.317e+03 -4.274 2.05e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33160 on 1449 degrees of freedom
## Multiple R-squared: 0.8138, Adjusted R-squared: 0.813
## F-statistic: 1055 on 6 and 1449 DF, p-value: < 2.2e-16
summary(lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea + GarageCars + FullBath, data=trainhigh))
##
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + TotalBsmtSF +
## GrLivArea + GarageCars + FullBath, data = trainhigh)
##
## Residuals:
## Min 1Q Median 3Q Max
## -140200 -18958 -1685 16689 247634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.332e+05 7.807e+04 -10.673 < 2e-16 ***
## OverallQual 1.841e+04 1.003e+03 18.355 < 2e-16 ***
## YearBuilt 3.852e+02 4.109e+01 9.374 < 2e-16 ***
## TotalBsmtSF 4.110e+01 2.651e+00 15.506 < 2e-16 ***
## GrLivArea 6.199e+01 2.714e+00 22.844 < 2e-16 ***
## GarageCars 1.142e+04 1.586e+03 7.197 9.89e-13 ***
## FullBath -9.412e+03 2.306e+03 -4.081 4.74e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32880 on 1442 degrees of freedom
## Multiple R-squared: 0.8109, Adjusted R-squared: 0.8101
## F-statistic: 1030 on 6 and 1442 DF, p-value: < 2.2e-16
ThisModel<-lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea + GarageCars, data=train2)
ThisModelPlus<-lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea + GarageCars + FullBath, data=train2)
Even though FullBath does not have that high a correlation with any of the other variables I am using, it adds very little to the model: on train2 the adjusted R\(^2\) only goes from 0.8108 to 0.813. I am discarding it. Also, the drop in correlation coefficients has not made any appreciable difference in the adjusted R\(^2\) values or the p-values.
Trying the model out on the data it was based on is not terribly useful, but I want to see how well it does when we throw the outliers back in.
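A more formal way to judge whether an extra term is worth keeping is to compare the two nested models directly. The sketch below does this on synthetic data where FullBath is unrelated to price by construction. The data frame `toy` and the names `base_model` and `plus_model` are made up for illustration.

```r
set.seed(42)
n <- 300
toy <- data.frame(
  GrLivArea = runif(n, 600, 3000),
  FullBath  = sample(1:3, n, replace = TRUE)
)
# Price depends on GrLivArea only; FullBath is pure noise here.
toy$SalePrice <- 50000 + 70 * toy$GrLivArea + rnorm(n, sd = 25000)

base_model <- lm(SalePrice ~ GrLivArea, data = toy)
plus_model <- lm(SalePrice ~ GrLivArea + FullBath, data = toy)

# F-test for the added term.
cmp <- anova(base_model, plus_model)
print(cmp)

# Adjusted R-squared and AIC side by side.
print(c(base = summary(base_model)$adj.r.squared,
        plus = summary(plus_model)$adj.r.squared))
print(AIC(base_model, plus_model))
```

On the real data the equivalent call would be `anova(ThisModel, ThisModelPlus)`. With a single added term the F-test gives the same p-value as the t-test on FullBath in the summary above, so the case for dropping it rests on the small gain in adjusted R\(^2\), not on significance.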
practicepred<-predict(ThisModel,train)
summary(practicepred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -51999 127449 177226 181194 223406 712271
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
Now without those 5 outliers:
practicepred<-predict(ThisModel,train2)
summary(practicepred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -51999 127281 176962 180151 222941 429684
summary(train2$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129900 163000 180151 214000 625000
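These are both lousy: the smallest prediction is negative (-51999), and without the outliers the largest prediction (429684) is well below the largest actual price (625000). Adding in some quality measures: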
Some quick aggregation on what I think are likely hits:
Q1<-count(train2,"SaleCondition")
A1<-aggregate(train2$SalePrice,by=list(SaleCondition=train2$SaleCondition),FUN=sum)
cbind(Q1$SaleCondition,A1/Q1)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## Q1$SaleCondition SaleCondition x
## 1 Abnorml NA 140541.9
## 2 AdjLand NA 104125.0
## 3 Alloca NA 167377.4
## 4 Family NA 149600.0
## 5 Normal NA 174717.8
## 6 Partial NA 273916.4
Q1
## SaleCondition freq
## 1 Abnorml 100
## 2 AdjLand 4
## 3 Alloca 12
## 4 Family 20
## 5 Normal 1197
## 6 Partial 123
#might work
Q2<-count(train2,"Exterior1st")
A2<-aggregate(train2$SalePrice,by=list(Exterior1st=train2$Exterior1st),FUN=sum)
cbind(Q2$Exterior1st,A2/Q2)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## Q2$Exterior1st Exterior1st x
## 1 AsbShng NA 107385.6
## 2 AsphShn NA 100000.0
## 3 BrkComm NA 71000.0
## 4 BrkFace NA 194573.0
## 5 CBlock NA 105000.0
## 6 CemntBd NA 232473.0
## 7 HdBoard NA 160399.1
## 8 ImStucc NA 262000.0
## 9 MetalSd NA 149422.2
## 10 Plywood NA 175942.4
## 11 Stone NA 258500.0
## 12 Stucco NA 163114.6
## 13 VinylSd NA 213732.9
## 14 Wd Sdng NA 146938.4
## 15 WdShing NA 150655.1
Q2 #not worthwhile
## Exterior1st freq
## 1 AsbShng 20
## 2 AsphShn 1
## 3 BrkComm 2
## 4 BrkFace 50
## 5 CBlock 1
## 6 CemntBd 60
## 7 HdBoard 221
## 8 ImStucc 1
## 9 MetalSd 220
## 10 Plywood 108
## 11 Stone 2
## 12 Stucco 24
## 13 VinylSd 515
## 14 Wd Sdng 205
## 15 WdShing 26
Q3<-count(train2,"BsmtCond")
Q3<-Q3[-5,]
A3<-aggregate(train2$SalePrice,by=list(BsmtCond=train2$BsmtCond),FUN=sum)
Q3
## BsmtCond freq
## 1 Fa 45
## 2 Gd 65
## 3 Po 2
## 4 TA 1307
cbind(Q3$BsmtCond,A3/Q3)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## Q3$BsmtCond BsmtCond x
## 1 Fa NA 121809.5
## 2 Gd NA 213599.9
## 3 Po NA 64000.0
## 4 TA NA 182783.2
Q4<-count(train2,"ExterQual")
A4<-aggregate(train2$SalePrice,by=list(ExterQual=train2$ExterQual),FUN=sum)
cbind(Q4$ExterQual,A4/Q4)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## Q4$ExterQual ExterQual x
## 1 Ex NA 367408.57
## 2 Fa NA 87985.21
## 3 Gd NA 230579.37
## 4 TA NA 144341.31
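The count / aggregate / divide pattern above works, but dividing one data frame by another also tries to divide the factor column, which is where the warning and the NA column come from. Group means can be computed directly instead. This is a small sketch on made-up prices (the `toy` data frame is hypothetical):

```r
toy <- data.frame(
  SaleCondition = c("Normal", "Normal", "Partial", "Abnorml", "Partial", "Normal"),
  SalePrice     = c(150000, 170000, 280000, 120000, 260000, 190000)
)

# Mean price per group in one call; no division of data frames needed.
means <- aggregate(SalePrice ~ SaleCondition, data = toy, FUN = mean)

# Attach the group sizes so small groups are easy to spot.
counts <- table(toy$SaleCondition)
means$n <- as.vector(counts[as.character(means$SaleCondition)])
print(means)
```

Keeping the counts next to the means matters here, since several of the groups above (AdjLand, ImStucc, Po) have only a handful of houses.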
Ran out of time, so I didn’t get to show as much as I wanted. Sorry.
QualModel<-lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea + GarageCars + ExterQual , data=train2)
QualModel2<-lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea + GarageCars +SaleCondition, data=train2)
QualModel3<-lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea + GarageCars +SaleCondition+ExterQual, data=train2)
summary(QualModel)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + TotalBsmtSF +
## GrLivArea + GarageCars + ExterQual, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -131680 -17368 -1381 15259 214170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.732e+05 7.602e+04 -7.540 8.26e-14 ***
## OverallQual 1.408e+04 1.052e+03 13.379 < 2e-16 ***
## YearBuilt 3.028e+02 3.895e+01 7.776 1.42e-14 ***
## TotalBsmtSF 3.752e+01 2.446e+00 15.342 < 2e-16 ***
## GrLivArea 5.582e+01 2.185e+00 25.543 < 2e-16 ***
## GarageCars 1.068e+04 1.506e+03 7.091 2.07e-12 ***
## ExterQualFa -7.862e+04 1.054e+04 -7.461 1.48e-13 ***
## ExterQualGd -6.785e+04 4.953e+03 -13.699 < 2e-16 ***
## ExterQualTA -7.753e+04 5.564e+03 -13.935 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31280 on 1447 degrees of freedom
## Multiple R-squared: 0.8346, Adjusted R-squared: 0.8337
## F-statistic: 912.8 on 8 and 1447 DF, p-value: < 2.2e-16
summary(QualModel2)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + TotalBsmtSF +
## GrLivArea + GarageCars + SaleCondition, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -161262 -19084 -1069 16430 230614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.080e+05 7.282e+04 -8.349 < 2e-16 ***
## OverallQual 1.744e+04 9.928e+02 17.566 < 2e-16 ***
## YearBuilt 2.635e+02 3.820e+01 6.897 7.89e-12 ***
## TotalBsmtSF 4.185e+01 2.521e+00 16.604 < 2e-16 ***
## GrLivArea 5.652e+01 2.260e+00 25.005 < 2e-16 ***
## GarageCars 1.036e+04 1.570e+03 6.598 5.84e-11 ***
## SaleConditionAdjLand 2.227e+04 1.659e+04 1.343 0.180
## SaleConditionAlloca 1.069e+04 9.963e+03 1.073 0.284
## SaleConditionFamily -9.177e+03 7.939e+03 -1.156 0.248
## SaleConditionNormal 1.426e+04 3.395e+03 4.199 2.84e-05 ***
## SaleConditionPartial 4.051e+04 4.645e+03 8.721 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32370 on 1445 degrees of freedom
## Multiple R-squared: 0.8231, Adjusted R-squared: 0.8218
## F-statistic: 672.1 on 10 and 1445 DF, p-value: < 2.2e-16
summary(QualModel3)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + TotalBsmtSF +
## GrLivArea + GarageCars + SaleCondition + ExterQual, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145885 -17639 -929 15289 216453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.242e+05 7.516e+04 -6.974 4.67e-12 ***
## OverallQual 1.381e+04 1.036e+03 13.337 < 2e-16 ***
## YearBuilt 2.681e+02 3.860e+01 6.946 5.67e-12 ***
## TotalBsmtSF 3.778e+01 2.412e+00 15.662 < 2e-16 ***
## GrLivArea 5.643e+01 2.153e+00 26.210 < 2e-16 ***
## GarageCars 1.005e+04 1.491e+03 6.740 2.29e-11 ***
## SaleConditionAdjLand 1.992e+04 1.574e+04 1.266 0.206
## SaleConditionAlloca 1.014e+04 9.470e+03 1.071 0.284
## SaleConditionFamily -7.207e+03 7.537e+03 -0.956 0.339
## SaleConditionNormal 1.383e+04 3.221e+03 4.295 1.86e-05 ***
## SaleConditionPartial 3.133e+04 4.467e+03 7.013 3.56e-12 ***
## ExterQualFa -7.140e+04 1.042e+04 -6.851 1.09e-11 ***
## ExterQualGd -6.187e+04 4.960e+03 -12.475 < 2e-16 ***
## ExterQualTA -6.974e+04 5.603e+03 -12.447 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30710 on 1442 degrees of freedom
## Multiple R-squared: 0.8411, Adjusted R-squared: 0.8397
## F-statistic: 587.3 on 13 and 1442 DF, p-value: < 2.2e-16
summary(predict(QualModel,train2))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -29768 127349 173034 180151 221257 457921
summary(predict(QualModel2,train2))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -44660 126949 175649 180151 222670 423185
summary(predict(QualModel3,train2))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -27336 127553 173714 180151 220662 449958
summary(train2$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129900 163000 180151 214000 625000
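Still lousy (every model still predicts a negative minimum price), but hopefully better: the adjusted R\(^2\) has gone from 0.8108 for the base model to 0.8397 for QualModel3.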
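One likely reason for the negative minimum predictions is that a linear model on the raw price has no lower bound. A common alternative, which is not what was done above, is to model log(SalePrice) and exponentiate the predictions, which are then always positive. The sketch below tries this on synthetic data; `toy`, `raw_model`, `log_model` and the `rmse` helper are all made up for illustration.

```r
set.seed(7)
n <- 500
toy <- data.frame(GrLivArea = runif(n, 400, 4000))
# Multiplicative noise, so the prices are right-skewed.
toy$SalePrice <- exp(10.5 + 0.0006 * toy$GrLivArea + rnorm(n, sd = 0.2))

raw_model <- lm(SalePrice ~ GrLivArea, data = toy)
log_model <- lm(log(SalePrice) ~ GrLivArea, data = toy)

raw_pred <- predict(raw_model, toy)
log_pred <- exp(predict(log_model, toy))   # back-transform to price units

# Root mean squared error on the original price scale.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

print(summary(raw_pred))
print(summary(log_pred))
print(c(raw = rmse(toy$SalePrice, raw_pred),
        log = rmse(toy$SalePrice, log_pred)))
```

An RMSE like this is a more direct measure of fit than comparing the two `summary()` outputs, although on training data it will still be optimistic.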
Now loading the Kaggle test data:
Ktest<-read.csv("test.csv",stringsAsFactors = FALSE)
Ktest[is.na(Ktest)]<-0
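One caveat with the blanket replacement: it also writes the string "0" into character columns. If such a column is used as a factor in the model, a "0" level that the training data never saw would typically make `predict()` stop with a new-levels error. The sketch below shows the difference on a tiny made-up data frame (`toy`, `blanket` and `targeted` are hypothetical names) and fills only the numeric columns.

```r
toy <- data.frame(
  TotalBsmtSF = c(856, NA, 920),
  GarageCars  = c(2, 1, NA),
  ExterQual   = c("Gd", NA, "TA"),
  stringsAsFactors = FALSE
)

# Blanket replacement: the character column receives the string "0".
blanket <- toy
blanket[is.na(blanket)] <- 0
print(blanket$ExterQual)

# Targeted replacement: only numeric columns are filled with 0.
targeted <- toy
num_cols <- vapply(targeted, is.numeric, logical(1))
targeted[num_cols] <- lapply(targeted[num_cols], function(col) {
  col[is.na(col)] <- 0
  col
})
print(targeted)
```

Whether 0 is a sensible fill value at all depends on the column; for areas and counts it reads as "none", which may or may not be what the NA meant.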
And getting the predictions.
Kpred<-predict(QualModel,Ktest)
Kpred2<-predict(QualModel2,Ktest)
Kpred3<-predict(QualModel3,Ktest)
Ksub<-cbind(Ktest$Id,Kpred)
Ksub2<-cbind(Ktest$Id,Kpred2)
Ksub3<-cbind(Ktest$Id,Kpred3)
colnames(Ksub)<-c("ID","SalePrice")
colnames(Ksub2)<-c("ID","SalePrice")
colnames(Ksub3)<-c("ID","SalePrice")
write.csv(Ksub,"SubmitThisQ.csv", row.names = FALSE)
write.csv(Ksub2,"SubmitThisQ2.csv", row.names = FALSE)
write.csv(Ksub3,"SubmitThisQ3.csv", row.names = FALSE)
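With three near-identical submissions it is easy to mix up which predictions go into which file. A loop over a named list keeps each prediction vector paired with its file name. This is a sketch with made-up IDs and prices: `ids`, `preds` and `out_dir` are stand-ins, and it writes to a temporary directory instead of the working directory.

```r
# Hypothetical stand-ins for Ktest$Id and the three prediction vectors.
ids <- 1461:1463
preds <- list(
  SubmitThisQ  = c(120000, 185000, 210000),
  SubmitThisQ2 = c(118000, 190000, 205000),
  SubmitThisQ3 = c(121000, 187000, 208000)
)

out_dir <- tempdir()   # disposable location for the demo
for (name in names(preds)) {
  submission <- data.frame(ID = ids, SalePrice = preds[[name]])
  # Basic sanity checks before writing anything.
  stopifnot(!anyNA(submission$SalePrice), all(submission$SalePrice > 0))
  write.csv(submission, file.path(out_dir, paste0(name, ".csv")),
            row.names = FALSE)
}

# Confirm that two of the files really contain different predictions.
q1 <- read.csv(file.path(out_dir, "SubmitThisQ.csv"))
q2 <- read.csv(file.path(out_dir, "SubmitThisQ2.csv"))
print(identical(q1$SalePrice, q2$SalePrice))
```

The positivity check would also catch negative predictions like the minimums seen in the summaries earlier.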