1.

I chose \(X_1\) as my independent variable and \(Y_1\) as my dependent variable. Reading the dataset into R:

FinalData<-read.csv("605 Final Data Set P1.csv")
FinalData
##      Y1   X1
## 1  20.3  9.3
## 2  19.1  4.1
## 3  19.3 22.4
## 4  20.9  9.1
## 5  22.0 15.8
## 6  23.5  7.1
## 7  13.8 15.9
## 8  18.8  6.9
## 9  20.9 16.0
## 10 18.6  6.7
## 11 22.3  8.2
## 12 17.6 16.0
## 13 20.8  6.4
## 14 28.7 11.8
## 15 15.2  3.5
## 16 20.9 21.7
## 17 18.4 12.2
## 18 10.3  9.3
## 19 26.3  8.0
## 20 28.1  6.2
summary(FinalData)
##        Y1              X1       
##  Min.   :10.30   Min.   : 3.50  
##  1st Qu.:18.55   1st Qu.: 6.85  
##  Median :20.55   Median : 9.20  
##  Mean   :20.29   Mean   :10.83  
##  3rd Qu.:22.07   3rd Qu.:15.82  
##  Max.   :28.70   Max.   :22.40

We have \(x=15.82\), the third quartile of \(X_1\), and \(y=18.55\), the first quartile of \(Y_1\).

nrow(FinalData)
## [1] 20
a<-subset(FinalData,FinalData$X1>15.82)
nrow(a)
## [1] 5
b<-subset(FinalData,FinalData$Y1>18.55)
nrow(b)
## [1] 15
c<-subset(FinalData,FinalData$X1<15.82)
nrow(c)
## [1] 15
d<-subset(FinalData,FinalData$Y1<18.55)

Given this, \(\text{P}(Y>y) = \frac{15}{20} =\frac{3}{4}\). Also, when evaluated independently, \(\text{P}(X>x) =\frac{5}{20}=\frac{1}{4}\). Both are to be expected, since three quarters of the data lie above a first quartile and one quarter lies above a third quartile.

a.) \(\text{P}(X>x|Y>y)\).

subset(b,b$X1>15.82)
##      Y1   X1
## 3  19.3 22.4
## 9  20.9 16.0
## 16 20.9 21.7

So, \(\text{P}(X>x|Y>y)=\frac{3}{15}=\frac{1}{5}\).

b.) \(\text{P}(X>x,Y>y)\). This one is easiest to find from the subset above: 3 of the 20 cases meet both criteria, so the probability is \(\frac{3}{20}\).

c.) \(\text{P}(X<x|Y>y)=\frac{12}{15}=\frac{4}{5}\)

nrow(subset(b,b$X1<15.82))
## [1] 12
subset(c,c$Y1<=18.55)
##      Y1   X1
## 15 15.2  3.5
## 17 18.4 12.2
## 18 10.3  9.3
subset(c,c$Y1>18.55)
##      Y1   X1
## 1  20.3  9.3
## 2  19.1  4.1
## 4  20.9  9.1
## 5  22.0 15.8
## 6  23.5  7.1
## 8  18.8  6.9
## 10 18.6  6.7
## 11 22.3  8.2
## 13 20.8  6.4
## 14 28.7 11.8
## 19 26.3  8.0
## 20 28.1  6.2
subset(a,a$Y1<=18.55)
##      Y1   X1
## 7  13.8 15.9
## 12 17.6 16.0
subset(a,a$Y1>18.55)
##      Y1   X1
## 3  19.3 22.4
## 9  20.9 16.0
## 16 20.9 21.7

| \(x/y\) | \(\leq 3^{\text{rd}}\) quartile | \(> 3^{\text{rd}}\) quartile | Total |
|---|---|---|---|
| \(\leq 1^{\text{st}}\) quartile | 3 | 2 | 5 |
| \(> 1^{\text{st}}\) quartile | 12 | 3 | 15 |
| Total | 15 | 5 | 20 |
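
The counts and probabilities for parts a through c can also be computed from logical vectors, without building intermediate subsets. This is only a sketch: the data frame is retyped from the printout of FinalData rather than read from the CSV, and the names `dat`, `x_gt`, and `y_gt` are introduced here for illustration.

```r
# Data retyped from the printed FinalData (assumption: no transcription slips)
dat <- data.frame(
  Y1 = c(20.3, 19.1, 19.3, 20.9, 22.0, 23.5, 13.8, 18.8, 20.9, 18.6,
         22.3, 17.6, 20.8, 28.7, 15.2, 20.9, 18.4, 10.3, 26.3, 28.1),
  X1 = c(9.3, 4.1, 22.4, 9.1, 15.8, 7.1, 15.9, 6.9, 16.0, 6.7,
         8.2, 16.0, 6.4, 11.8, 3.5, 21.7, 12.2, 9.3, 8.0, 6.2)
)

x_gt <- dat$X1 > 15.82   # X above its third quartile
y_gt <- dat$Y1 > 18.55   # Y above its first quartile

# Conditional probabilities: restrict to the rows where Y > y first
p_a <- mean(x_gt[y_gt])    # P(X > x | Y > y)
p_c <- mean(!x_gt[y_gt])   # P(X < x | Y > y); no X1 equals 15.82 exactly
# Joint probability: both events, over all rows
p_b <- mean(x_gt & y_gt)   # P(X > x, Y > y)
print(c(a = p_a, b = p_b, c = p_c))

# Two-way table of counts with row and column totals
tab <- table(y_gt = y_gt, x_gt = x_gt)
print(addmargins(tab))
```

Taking `mean()` of a logical vector gives the proportion of `TRUE` values, which is why no explicit division by the row count is needed.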

Does splitting the data this way make them independent?

A<-subset(FinalData,FinalData$X1>6.85)
B<-subset(FinalData,FinalData$Y1>18.55)
nrow(A)
## [1] 15
nrow(B)
## [1] 15
AB<-subset(A,A$Y1>18.55)
nrow(AB)
## [1] 11

\(\text{P}(A)=\frac{15}{20}=\frac{3}{4}\) and \(\text{P}(B)=\frac{3}{4}\), so \(\text{P}(A)\text{P}(B)=\frac{9}{16}\), while \(\text{P}(AB)=\frac{11}{20}\). Therefore \(\text{P}(A)\text{P}(B)\neq\text{P}(AB)\).

They are not independent when split this way.

ABTest<-data.frame(c(11,4),c(4,5))
chisq.test(ABTest)
## Warning in chisq.test(ABTest): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  ABTest
## X-squared = 0.96, df = 1, p-value = 0.3272

The p-value is 0.327, so we fail to reject the null hypothesis that the two are independent.
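
As a cross-check, the 2x2 table can be built directly from the data with `table()`, which keeps every cell tied to the same 20 rows. The sketch below assumes the data frame retyped from the FinalData printout; the names `dat`, `tab`, `p_A`, `p_B`, and `p_AB` are illustrative. With cell counts this small, Fisher's exact test is a common alternative to the chi-squared approximation that triggered the warning.

```r
# Data retyped from the printed FinalData (assumption: no transcription slips)
dat <- data.frame(
  Y1 = c(20.3, 19.1, 19.3, 20.9, 22.0, 23.5, 13.8, 18.8, 20.9, 18.6,
         22.3, 17.6, 20.8, 28.7, 15.2, 20.9, 18.4, 10.3, 26.3, 28.1),
  X1 = c(9.3, 4.1, 22.4, 9.1, 15.8, 7.1, 15.9, 6.9, 16.0, 6.7,
         8.2, 16.0, 6.4, 11.8, 3.5, 21.7, 12.2, 9.3, 8.0, 6.2)
)

A <- dat$X1 > 6.85    # X above its first quartile
B <- dat$Y1 > 18.55   # Y above its first quartile

# Independence would require P(A) * P(B) to equal P(A and B)
p_A  <- mean(A)
p_B  <- mean(B)
p_AB <- mean(A & B)
print(c(product = p_A * p_B, joint = p_AB))

# Cell counts taken straight from the data
tab <- table(A = A, B = B)
print(tab)

# Exact test of independence for a small 2x2 table
print(fisher.test(tab))
```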

Part 2

train<-read.csv("train.csv")
summary(train)
##        Id           MSSubClass       MSZoning     LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   C (all):  10   Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   FV     :  65   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   RH     :  16   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9   RL     :1151   Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   RM     : 218   3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                  Max.   :313.00  
##                                                  NA's   :259     
##     LotArea        Street      Alley      LotShape  LandContour
##  Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63   
##  1st Qu.:  7554   Pave:1454   Pave:  41   IR2: 41   HLS:  50   
##  Median :  9478               NA's:1369   IR3: 10   Low:  36   
##  Mean   : 10517                           Reg:925   Lvl:1311   
##  3rd Qu.: 11602                                                
##  Max.   :215245                                                
##                                                                
##   Utilities      LotConfig    LandSlope   Neighborhood   Condition1  
##  AllPub:1459   Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260  
##  NoSeWa:   1   CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81  
##                FR2    :  47   Sev:  13   OldTown:113   Artery :  48  
##                FR3    :   4              Edwards:100   RRAn   :  26  
##                Inside :1052              Somerst: 86   PosN   :  19  
##                                          Gilbert: 79   RRAe   :  11  
##                                          (Other):707   (Other):  15  
##    Condition2     BldgType      HouseStyle   OverallQual    
##  Norm   :1445   1Fam  :1220   1Story :726   Min.   : 1.000  
##  Feedr  :   6   2fmCon:  31   2Story :445   1st Qu.: 5.000  
##  Artery :   2   Duplex:  52   1.5Fin :154   Median : 6.000  
##  PosN   :   2   Twnhs :  43   SLvl   : 65   Mean   : 6.099  
##  RRNn   :   2   TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.000  
##  PosA   :   1                 1.5Unf : 14   Max.   :10.000  
##  (Other):   2                 (Other): 19                   
##   OverallCond      YearBuilt     YearRemodAdd    RoofStyle   
##  Min.   :1.000   Min.   :1872   Min.   :1950   Flat   :  13  
##  1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967   Gable  :1141  
##  Median :5.000   Median :1973   Median :1994   Gambrel:  11  
##  Mean   :5.575   Mean   :1971   Mean   :1985   Hip    : 286  
##  3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004   Mansard:   7  
##  Max.   :9.000   Max.   :2010   Max.   :2010   Shed   :   2  
##                                                              
##     RoofMatl     Exterior1st   Exterior2nd    MasVnrType    MasVnrArea    
##  CompShg:1434   VinylSd:515   VinylSd:504   BrkCmn : 15   Min.   :   0.0  
##  Tar&Grv:  11   HdBoard:222   MetalSd:214   BrkFace:445   1st Qu.:   0.0  
##  WdShngl:   6   MetalSd:220   HdBoard:207   None   :864   Median :   0.0  
##  WdShake:   5   Wd Sdng:206   Wd Sdng:197   Stone  :128   Mean   : 103.7  
##  ClyTile:   1   Plywood:108   Plywood:142   NA's   :  8   3rd Qu.: 166.0  
##  Membran:   1   CemntBd: 61   CmentBd: 60                 Max.   :1600.0  
##  (Other):   2   (Other):128   (Other):136                 NA's   :8       
##  ExterQual ExterCond  Foundation  BsmtQual   BsmtCond    BsmtExposure
##  Ex: 52    Ex:   3   BrkTil:146   Ex  :121   Fa  :  45   Av  :221    
##  Fa: 14    Fa:  28   CBlock:634   Fa  : 35   Gd  :  65   Gd  :134    
##  Gd:488    Gd: 146   PConc :647   Gd  :618   Po  :   2   Mn  :114    
##  TA:906    Po:   1   Slab  : 24   TA  :649   TA  :1311   No  :953    
##            TA:1282   Stone :  6   NA's: 37   NA's:  37   NA's: 38    
##                      Wood  :  3                                      
##                                                                      
##  BsmtFinType1   BsmtFinSF1     BsmtFinType2   BsmtFinSF2     
##  ALQ :220     Min.   :   0.0   ALQ :  19    Min.   :   0.00  
##  BLQ :148     1st Qu.:   0.0   BLQ :  33    1st Qu.:   0.00  
##  GLQ :418     Median : 383.5   GLQ :  14    Median :   0.00  
##  LwQ : 74     Mean   : 443.6   LwQ :  46    Mean   :  46.55  
##  Rec :133     3rd Qu.: 712.2   Rec :  54    3rd Qu.:   0.00  
##  Unf :430     Max.   :5644.0   Unf :1256    Max.   :1474.00  
##  NA's: 37                      NA's:  38                     
##    BsmtUnfSF       TotalBsmtSF      Heating     HeatingQC CentralAir
##  Min.   :   0.0   Min.   :   0.0   Floor:   1   Ex:741    N:  95    
##  1st Qu.: 223.0   1st Qu.: 795.8   GasA :1428   Fa: 49    Y:1365    
##  Median : 477.5   Median : 991.5   GasW :  18   Gd:241              
##  Mean   : 567.2   Mean   :1057.4   Grav :   7   Po:  1              
##  3rd Qu.: 808.0   3rd Qu.:1298.2   OthW :   2   TA:428              
##  Max.   :2336.0   Max.   :6110.0   Wall :   4                       
##                                                                     
##  Electrical     X1stFlrSF      X2ndFlrSF     LowQualFinSF    
##  FuseA:  94   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  FuseF:  27   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##  FuseP:   3   Median :1087   Median :   0   Median :  0.000  
##  Mix  :   1   Mean   :1163   Mean   : 347   Mean   :  5.845  
##  SBrkr:1334   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##  NA's :   1   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                              
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex:100     
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa: 39     
##  Median :0.0000   Median :3.000   Median :1.000   Gd:586     
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   TA:735     
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000              
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000              
##                                                              
##   TotRmsAbvGrd    Functional    Fireplaces    FireplaceQu   GarageType 
##  Min.   : 2.000   Maj1:  14   Min.   :0.000   Ex  : 24    2Types :  6  
##  1st Qu.: 5.000   Maj2:   5   1st Qu.:0.000   Fa  : 33    Attchd :870  
##  Median : 6.000   Min1:  31   Median :1.000   Gd  :380    Basment: 19  
##  Mean   : 6.518   Min2:  34   Mean   :0.613   Po  : 20    BuiltIn: 88  
##  3rd Qu.: 7.000   Mod :  15   3rd Qu.:1.000   TA  :313    CarPort:  9  
##  Max.   :14.000   Sev :   1   Max.   :3.000   NA's:690    Detchd :387  
##                   Typ :1360                               NA's   : 81  
##   GarageYrBlt   GarageFinish   GarageCars      GarageArea     GarageQual 
##  Min.   :1900   Fin :352     Min.   :0.000   Min.   :   0.0   Ex  :   3  
##  1st Qu.:1961   RFn :422     1st Qu.:1.000   1st Qu.: 334.5   Fa  :  48  
##  Median :1980   Unf :605     Median :2.000   Median : 480.0   Gd  :  14  
##  Mean   :1979   NA's: 81     Mean   :1.767   Mean   : 473.0   Po  :   3  
##  3rd Qu.:2002                3rd Qu.:2.000   3rd Qu.: 576.0   TA  :1311  
##  Max.   :2010                Max.   :4.000   Max.   :1418.0   NA's:  81  
##  NA's   :81                                                              
##  GarageCond  PavedDrive   WoodDeckSF      OpenPorchSF     EnclosedPorch   
##  Ex  :   2   N:  90     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  Fa  :  35   P:  30     1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Gd  :   9   Y:1340     Median :  0.00   Median : 25.00   Median :  0.00  
##  Po  :   7              Mean   : 94.24   Mean   : 46.66   Mean   : 21.95  
##  TA  :1326              3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00  
##  NA's:  81              Max.   :857.00   Max.   :547.00   Max.   :552.00  
##                                                                           
##    X3SsnPorch      ScreenPorch        PoolArea        PoolQC    
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Ex  :   2  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000   Fa  :   2  
##  Median :  0.00   Median :  0.00   Median :  0.000   Gd  :   3  
##  Mean   :  3.41   Mean   : 15.06   Mean   :  2.759   NA's:1453  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000              
##  Max.   :508.00   Max.   :480.00   Max.   :738.000              
##                                                                 
##    Fence      MiscFeature    MiscVal             MoSold      
##  GdPrv:  59   Gar2:   2   Min.   :    0.00   Min.   : 1.000  
##  GdWo :  54   Othr:   2   1st Qu.:    0.00   1st Qu.: 5.000  
##  MnPrv: 157   Shed:  49   Median :    0.00   Median : 6.000  
##  MnWw :  11   TenC:   1   Mean   :   43.49   Mean   : 6.322  
##  NA's :1179   NA's:1406   3rd Qu.:    0.00   3rd Qu.: 8.000  
##                           Max.   :15500.00   Max.   :12.000  
##                                                              
##      YrSold        SaleType    SaleCondition    SalePrice     
##  Min.   :2006   WD     :1267   Abnorml: 101   Min.   : 34900  
##  1st Qu.:2007   New    : 122   AdjLand:   4   1st Qu.:129975  
##  Median :2008   COD    :  43   Alloca :  12   Median :163000  
##  Mean   :2008   ConLD  :   9   Family :  20   Mean   :180921  
##  3rd Qu.:2009   ConLI  :   5   Normal :1198   3rd Qu.:214000  
##  Max.   :2010   ConLw  :   5   Partial: 125   Max.   :755000  
##                 (Other):   9

I chose to create histograms for four of the variables. The spikes in GarageArea are interesting: although the data are a bit noisy when plotted against the number of cars, I suspect that these spikes are the standard one-, two-, and three-car garage sizes. Above-ground living area (GrLivArea) is the closest to a skewed bell curve. The others either have spikes (GarageArea) or have a huge tail at the low end (YearRemodAdd). I have included the plot of garage area against the number of cars in the garage because the second depends on the first (there is a minimum square footage per car). This is not the dependent variable for the entire data set, but it is interesting that there is more than one dependent relationship in it.

#removing a few of the outliers that are either huge or godawful expensive.  When I clicked on the link on Kaggle, I found the suggestion to use 4000 ft^2 as the limiting factor in the training data, especially since it only removes 4 outliers
train2<-subset(train,train$GrLivArea<4000)
train3<-train2[,c(21,47,63,81)]
summary(train3) #train3 has the year remodeled, which has a nice, but not great, correlation with SalePrice.  I intend to try it out in my model
##   YearRemodAdd    GrLivArea      GarageArea       SalePrice     
##  Min.   :1950   Min.   : 334   Min.   :   0.0   Min.   : 34900  
##  1st Qu.:1967   1st Qu.:1128   1st Qu.: 329.5   1st Qu.:129900  
##  Median :1994   Median :1458   Median : 478.5   Median :163000  
##  Mean   :1985   Mean   :1507   Mean   : 471.6   Mean   :180151  
##  3rd Qu.:2004   3rd Qu.:1775   3rd Qu.: 576.0   3rd Qu.:214000  
##  Max.   :2010   Max.   :3627   Max.   :1390.0   Max.   :625000
train4<-train3[,c(2:4)]
library(ggplot2) #provides ggplot() and geom_histogram()
ggplot(data=train3,aes(x=YearRemodAdd))+geom_histogram(binwidth = 1)

ggplot(data=train3,aes(x=GrLivArea))+geom_histogram(bins = 100)

ggplot(data=train3,aes(x=GarageArea))+geom_histogram(bins = 100)

ggplot(data=train3,aes(x=SalePrice))+geom_histogram(bins = 100)

plot(train2$GarageArea,train2$GarageCars)

SaleCor<-cor(train4)
SaleCor
##            GrLivArea GarageArea SalePrice
## GrLivArea  1.0000000  0.4545117 0.7205163
## GarageArea 0.4545117  1.0000000 0.6369636
## SalePrice  0.7205163  0.6369636 1.0000000
plot(train4$GrLivArea,train4$SalePrice)

plot(train4$GarageArea,train4$SalePrice)

Checking pairwise correlations:

cor.test(train4$GrLivArea,train4$SalePrice,conf.level = .8)
## 
##  Pearson's product-moment correlation
## 
## data:  train4$GrLivArea and train4$SalePrice
## t = 39.62, df = 1454, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7039548 0.7362947
## sample estimates:
##       cor 
## 0.7205163
cor.test(train4$GarageArea,train4$SalePrice,conf.level = .8)
## 
##  Pearson's product-moment correlation
## 
## data:  train4$GarageArea and train4$SalePrice
## t = 31.507, df = 1454, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6165544 0.6565173
## sample estimates:
##       cor 
## 0.6369636
cor.test(train4$GarageArea,train4$GrLivArea,conf.level = .8)
## 
##  Pearson's product-moment correlation
## 
## data:  train4$GarageArea and train4$GrLivArea
## t = 19.457, df = 1454, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4274330 0.4807755
## sample estimates:
##       cor 
## 0.4545117

The correlation between any two of these variables is non-zero. The p-values are low enough that I would not worry about familywise error: even with a Bonferroni or similar correction for the three tests, the adjusted p-values would still be less than \(1\times 10^{-14}\), which is good enough for just about anything.

The correlations indicate that GrLivArea and GarageArea are both strongly related to SalePrice (0.72 and 0.64) and moderately related to each other (0.45), and the familywise error risk is negligible.
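
The Bonferroni argument can be checked with `p.adjust()` from base R. This is a minimal sketch: since `cor.test` only reports `p-value < 2.2e-16`, the value `2.2e-16` is used as an upper bound for each of the three tests, and the vector name `p_upper` is illustrative.

```r
# Upper bounds on the three p-values (R prints "< 2.2e-16", so the true
# values are at most this large)
p_upper <- c(GrLivArea_SalePrice  = 2.2e-16,
             GarageArea_SalePrice = 2.2e-16,
             GarageArea_GrLivArea = 2.2e-16)

# Bonferroni multiplies each p-value by the number of tests (capped at 1)
p_bonf <- p.adjust(p_upper, method = "bonferroni")
print(p_bonf)

# Holm's step-down method is never more conservative than Bonferroni
p_holm <- p.adjust(p_upper, method = "holm")
print(p_holm)
```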

Linear Algebra

SalePresc<-solve(SaleCor)
SalePresc
##              GrLivArea  GarageArea SalePrice
## GrLivArea   2.07976643  0.01550685 -1.508383
## GarageArea  0.01550685  1.68283152 -1.083075
## SalePrice  -1.50838292 -1.08307535  2.776694
SalePresc%*%SaleCor
##            GrLivArea   GarageArea SalePrice
## GrLivArea          1 1.110223e-16         0
## GarageArea         0 1.000000e+00         0
## SalePrice          0 0.000000e+00         1
SaleCor%*%SalePresc
##               GrLivArea GarageArea SalePrice
## GrLivArea  1.000000e+00          0         0
## GarageArea 1.110223e-16          1         0
## SalePrice  0.000000e+00          0         1

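The precision matrix is the inverse of the correlation matrix, and multiplying the two in either order gives the identity matrix (up to floating-point error on the order of \(10^{-16}\)), as expected. To do the LU decomposition, I use the function I wrote for HW2: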

LUme <- function(A){ #A is the original matrix; decompose is apparently already kicking around, thus the function name
  size = nrow(A)
  lower<-diag(size)
  usethis <- size-1
  ##this should be doable via lapply, but what needs to be done to the lower matrix is so different from the upper
  ##that I ended up just doing two nested loops, so that this can handle arbitrarily sized non-singular matrices
  ##(there is no row swapping, so every pivot must be nonzero)
  for (i in 1:usethis){
    for (j in i:size){
      if(j<size){
        container<-rUeL(A,lower,i,j+1)
        A<-container[[1]]
        lower<-container[[2]]
      }
    }
  }
  answerlist<-list(lower,A,lower%*%A)
  names(answerlist)<-c("Lower Matrix","Upper Matrix","Multiplied")
  return(answerlist)
  }

rUeL <- function(m1,m2,r1,r2){ #m1 is the upper matrix, m2 is the lower, r1 is the row above r2
  coef<-(m1[r2,r1]/m1[r1,r1]) #coefficient for subtraction of r1 and r2 to zero out the appropriate element of r2
  m1[r2,]<-m1[r2,]-(m1[r1,]*m1[r2,r1]/m1[r1,r1])
  m2[r2,r1]<-coef
  return(list(m1,m2))
}

Using said function on both the correlation matrix and the precision matrix, we get:

LUme(SaleCor)
## $`Lower Matrix`
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.4545117 1.0000000    0
## [3,] 0.7205163 0.3900593    1
## 
## $`Upper Matrix`
##            GrLivArea GarageArea SalePrice
## GrLivArea          1  0.4545117 0.7205163
## GarageArea         0  0.7934191 0.3094805
## SalePrice          0  0.0000000 0.3601405
## 
## $Multiplied
##      GrLivArea GarageArea SalePrice
## [1,] 1.0000000  0.4545117 0.7205163
## [2,] 0.4545117  1.0000000 0.6369636
## [3,] 0.7205163  0.6369636 1.0000000
LUme(SalePresc)
## $`Lower Matrix`
##              [,1]       [,2] [,3]
## [1,]  1.000000000  0.0000000    0
## [2,]  0.007456056  1.0000000    0
## [3,] -0.725265537 -0.6369636    1
## 
## $`Upper Matrix`
##               GrLivArea GarageArea SalePrice
## GrLivArea  2.079766e+00 0.01550685 -1.508383
## GarageArea 1.734723e-18 1.68271590 -1.071829
## SalePrice  1.104956e-18 0.00000000  1.000000
## 
## $Multiplied
##        GrLivArea  GarageArea SalePrice
## [1,]  2.07976643  0.01550685 -1.508383
## [2,]  0.01550685  1.68283152 -1.083075
## [3,] -1.50838292 -1.08307535  2.776694
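
A quick way to sanity-check a decomposition like this is to confirm that the factors are triangular and multiply back to the input. The sketch below is not the HW2 function: `lu_nopivot` is a hypothetical, compact Doolittle elimination without row swaps, written here only so the example is self-contained, and the matrix `S` is retyped from the SaleCor printout.

```r
# Correlation matrix retyped from the SaleCor printout
S <- matrix(c(1.0000000, 0.4545117, 0.7205163,
              0.4545117, 1.0000000, 0.6369636,
              0.7205163, 0.6369636, 1.0000000),
            nrow = 3, byrow = TRUE)

# Hypothetical helper: Doolittle LU without row swaps (needs nonzero pivots)
lu_nopivot <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      L[j, i] <- U[j, i] / U[i, i]          # multiplier that zeroes U[j, i]
      U[j, ] <- U[j, ] - L[j, i] * U[i, ]   # eliminate row j using pivot row i
    }
  }
  list(L = L, U = U)
}

res <- lu_nopivot(S)

# The product of the factors should reproduce S
print(all.equal(res$L %*% res$U, S))
# Largest below-diagonal entry left in U (should be at rounding level)
print(max(abs(res$U[lower.tri(res$U)])))
```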

Calculus based Statistics

I did this using BsmtUnfSF, both with and without the 0s. Including the 0s does not cause any issues for fitdistr, which returns the same rate in both runs below, but the histogram is neater without them.

hist(train2$BsmtUnfSF)

#this looks great, but when I shrink the bin size, we get this
hist(train2$BsmtUnfSF,breaks=100,freq = FALSE)
library(MASS) #provides fitdistr()
Unfinished<-fitdistr(train2$BsmtUnfSF,"exponential")
Unfinished
##        rate    
##   1.763698e-03 
##  (4.622146e-05)
curve(dexp(x,1.763698e-03),add=TRUE)

set.seed(2001)
fitsamp<-rexp(1000,1.763698e-03)
hist(fitsamp,breaks=100,freq=FALSE)

trainBase<-subset(train2,train2$BsmtUnfSF>0)
hist(trainBase$BsmtUnfSF,breaks=100,freq = FALSE)
AllUnfinished<-fitdistr(train2$BsmtUnfSF,"exponential")
AllUnfinished
##        rate    
##   1.763698e-03 
##  (4.622146e-05)
curve(dexp(x,1.763698e-03),add=TRUE)

The cumulative distribution function for an exponential distribution is \(F(x,\lambda)=1-e^{-\lambda x}\). We want \(F=.05\) and \(F=.95\). For the first value:

\[ 1-e^{-\lambda x}=.05\\ -e^{-\lambda x}=-.95\\ x=-\frac{\ln(.95)}{\lambda}\\ x=-\frac{-0.05129329}{.001765255}\approx29.06 \]

For the second value:

\[ 1-e^{-\lambda x}=.95\\ -e^{-\lambda x}=-.05\\ x=-\frac{\ln(.05)}{\lambda}\\ x=-\frac{-2.995732}{.001765255}\approx1697.05 \]
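
The two hand calculations can be cross-checked against R's built-in exponential quantile function `qexp()`. In this sketch `lambda` is set to the rate used in the derivation above, and the names `p`, `x_closed`, and `x_qexp` are illustrative.

```r
# Rate as used in the derivation (assumption: this is the lambda of interest)
lambda <- 0.001765255
p <- c(0.05, 0.95)

# Closed form from inverting the CDF: x = -log(1 - p) / lambda
x_closed <- -log(1 - p) / lambda

# Built-in exponential quantile function
x_qexp <- qexp(p, rate = lambda)

print(rbind(closed_form = x_closed, qexp = x_qexp))
```

Both rows should agree, since `qexp()` evaluates the same inverse of \(F(x,\lambda)=1-e^{-\lambda x}\).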

Finding the 95% confidence interval, using the std.error function from plotrix. A two-sided 95% interval uses the 0.975 quantile of the standard normal distribution:

library(plotrix) #provides std.error()
qnorm(.975)
## [1] 1.959964
std.error(trainBase$BsmtUnfSF)
## [1] 11.6604
mean(trainBase$BsmtUnfSF)
## [1] 616.994

So the 95% confidence interval is \(617\pm(1.960)(11.66)=617\pm 22.85\)

quantile(trainBase$BsmtUnfSF,c(.05,.95))
##      5%     95% 
##  100.00 1489.75
quantile(fitsamp,c(.05,.95))
##         5%        95% 
##   28.42716 1649.27524
mean(fitsamp)
## [1] 557.9193

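The simulated quantiles (28.43 and 1649.28) are remarkably close to the theoretical ones (29.06 and 1697.05), and the histograms are very similar. The empirical quantiles with the 0s removed (100.00 and 1489.75) are comparable at the upper end but higher at the lower end, as expected once the 0s are dropped. The mean of the simulated sample (557.92) is not within the confidence interval. However, the confidence interval was created assuming normality, while the sample was drawn from the fitted exponential, so this does not detract from the model.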

trainumbs<-unlist(lapply(train,is.numeric))
trainnumonly<-train2[,trainumbs]
cor(trainnumonly)
##                          Id   MSSubClass LotFrontage      LotArea
## Id             1.0000000000  0.011076005          NA -0.038040539
## MSSubClass     0.0110760048  1.000000000          NA -0.142191843
## LotFrontage              NA           NA           1           NA
## LotArea       -0.0380405391 -0.142191843          NA  1.000000000
## OverallQual   -0.0323233071  0.032415616          NA  0.088718768
## OverallCond    0.0133374790 -0.059276572          NA -0.002832285
## YearBuilt     -0.0140335892  0.027689352          NA  0.006590226
## YearRemodAdd  -0.0230755086  0.040458748          NA  0.006930318
## MasVnrArea               NA           NA          NA           NA
## BsmtFinSF1    -0.0178206510 -0.075268440          NA  0.173426158
## BsmtFinSF2    -0.0056094902 -0.065598386          NA  0.114691227
## BsmtUnfSF     -0.0070000331 -0.140890171          NA -0.003774031
## TotalBsmtSF   -0.0283118194 -0.255441005          NA  0.221939938
## X1stFlrSF      0.0016663035 -0.265000693          NA  0.267643644
## X2ndFlrSF      0.0025784921  0.311293638          NA  0.037276582
## LowQualFinSF  -0.0441275123  0.046499262          NA  0.005675275
## GrLivArea     -0.0008462866  0.077955528          NA  0.231886955
## BsmtFullBath  -0.0010194236  0.003281653          NA  0.147594611
## BsmtHalfBath  -0.0197198713 -0.002508698          NA  0.047390546
## FullBath       0.0040051155  0.132131037          NA  0.117335855
## HalfBath       0.0052481188  0.177476000          NA  0.005980504
## BedroomAbvGr   0.0367743224 -0.023626587          NA  0.118959513
## KitchenAbvGr   0.0032223876  0.281783056          NA -0.016565309
## TotRmsAbvGrd   0.0238544181  0.040246635          NA  0.173629285
## Fireplaces    -0.0246730773 -0.046376588          NA  0.259700916
## GarageYrBlt              NA           NA          NA           NA
## GarageCars     0.0157828536 -0.040490374          NA  0.150977421
## GarageArea     0.0132664527 -0.100144776          NA  0.162182789
## WoodDeckSF    -0.0306440156 -0.012852609          NA  0.167040055
## OpenPorchSF   -0.0024713384 -0.006686882          NA  0.061679261
## EnclosedPorch  0.0033478948 -0.011966400          NA -0.016108446
## X3SsnPorch    -0.0465399309 -0.043802236          NA  0.021505371
## ScreenPorch    0.0016739186 -0.025978506          NA  0.045620158
## PoolArea       0.0408707136  0.007956894          NA  0.033875227
## MiscVal       -0.0061383622 -0.007665753          NA  0.039192359
## MoSold         0.0232451338 -0.013512341          NA  0.007188288
## YrSold         0.0007934422 -0.021329726          NA -0.013014088
## SalePrice     -0.0274548863 -0.088160149          NA  0.269866484
##               OverallQual  OverallCond    YearBuilt YearRemodAdd
## Id            -0.03232331  0.013337479 -0.014033589 -0.023075509
## MSSubClass     0.03241562 -0.059276572  0.027689352  0.040458748
## LotFrontage            NA           NA           NA           NA
## LotArea        0.08871877 -0.002832285  0.006590226  0.006930318
## OverallQual    1.00000000 -0.090691730  0.571711832  0.550970612
## OverallCond   -0.09069173  1.000000000 -0.375691114  0.074702591
## YearBuilt      0.57171183 -0.375691114  1.000000000  0.591906136
## YearRemodAdd   0.55097061  0.074702591  0.591906136  1.000000000
## MasVnrArea             NA           NA           NA           NA
## BsmtFinSF1     0.21307936 -0.042542236  0.248272491  0.121689609
## BsmtFinSF2    -0.05752025  0.040014833 -0.048393371 -0.067187713
## BsmtUnfSF      0.31016404 -0.137266510  0.148810145  0.180972421
## TotalBsmtSF    0.53266599 -0.176000436  0.399866607  0.294866090
## X1stFlrSF      0.46204182 -0.145612855  0.279928714  0.238304489
## X2ndFlrSF      0.27974502  0.031296654  0.002953154  0.136103360
## LowQualFinSF  -0.02982579  0.025406473 -0.183719954 -0.062215160
## GrLivArea      0.58351920 -0.078567355  0.192644951  0.289264278
## BsmtFullBath   0.10409198 -0.053106837  0.185009254  0.116765047
## BsmtHalfBath  -0.04717213  0.117206818 -0.039945271 -0.013296723
## FullBath       0.54379093 -0.194167121  0.466710000  0.438211507
## HalfBath       0.26743138 -0.059926614  0.240143681  0.181135981
## BedroomAbvGr   0.09684815  0.013248892 -0.072623278 -0.041918503
## KitchenAbvGr  -0.18428060 -0.087204458 -0.174481002 -0.149287577
## TotRmsAbvGrd   0.41583390 -0.055766348  0.089206884  0.187520258
## Fireplaces     0.38742490 -0.022277117  0.143162318  0.108731621
## GarageYrBlt            NA           NA           NA           NA
## GarageCars     0.59873889 -0.185493758  0.536748656  0.419572822
## GarageArea     0.55490469 -0.150679146  0.477311363  0.369589852
## WoodDeckSF     0.23281894 -0.003063120  0.222690462  0.204019691
## OpenPorchSF    0.29780274 -0.029648925  0.183905184  0.222648772
## EnclosedPorch -0.11240732  0.070102698 -0.386903576 -0.193348040
## X3SsnPorch     0.03162059  0.025418921  0.031717375  0.045595823
## ScreenPorch    0.06773223  0.054616656 -0.049702593 -0.038176464
## PoolArea       0.01812125  0.008078797 -0.014372825 -0.009490188
## MiscVal       -0.03106845  0.068729421 -0.034192666 -0.010099838
## MoSold         0.07641430 -0.003135262  0.013880837  0.022628793
## YrSold        -0.02432058  0.043754812 -0.012593267  0.036597493
## SalePrice      0.80085836 -0.080201802  0.535279432  0.521427960
##               MasVnrArea   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF
## Id                    NA -0.017820651 -0.005609490 -0.007000033
## MSSubClass            NA -0.075268440 -0.065598386 -0.140890171
## LotFrontage           NA           NA           NA           NA
## LotArea               NA  0.173426158  0.114691227 -0.003774031
## OverallQual           NA  0.213079362 -0.057520249  0.310164044
## OverallCond           NA -0.042542236  0.040014833 -0.137266510
## YearBuilt             NA  0.248272491 -0.048393371  0.148810145
## YearRemodAdd          NA  0.121689609 -0.067187713  0.180972421
## MasVnrArea             1           NA           NA           NA
## BsmtFinSF1            NA  1.000000000 -0.048738073 -0.526140244
## BsmtFinSF2            NA -0.048738073  1.000000000 -0.209285937
## BsmtUnfSF             NA -0.526140244 -0.209285937  1.000000000
## TotalBsmtSF           NA  0.460323665  0.116477633  0.441625132
## X1stFlrSF             NA  0.386453075  0.106133963  0.331573791
## X2ndFlrSF             NA -0.183357567 -0.098241420  0.002749242
## LowQualFinSF          NA -0.066610894  0.014713622  0.028252980
## GrLivArea             NA  0.121479030 -0.004994960  0.251631936
## BsmtFullBath          NA  0.661932650  0.160254329 -0.424026185
## BsmtHalfBath          NA  0.068868916  0.071985752 -0.099007488
## FullBath              NA  0.037158794 -0.075290707  0.289399490
## HalfBath              NA -0.014507635 -0.031242668 -0.041925429
## BedroomAbvGr          NA -0.121893063 -0.015134114  0.166583946
## KitchenAbvGr          NA -0.082722357 -0.040926117  0.030226318
## TotRmsAbvGrd          NA  0.001876651 -0.033490456  0.251935602
## Fireplaces            NA  0.236218676  0.049027376  0.051796777
## GarageYrBlt           NA           NA           NA           NA
## GarageCars            NA  0.224043217 -0.037330543  0.213772291
## GarageArea            NA  0.268650796 -0.016484860  0.184562127
## WoodDeckSF            NA  0.201462192  0.069027662 -0.006875562
## OpenPorchSF           NA  0.071850658  0.005083074  0.129147583
## EnclosedPorch         NA -0.103053229  0.036268978 -0.002336340
## X3SsnPorch            NA  0.029879169 -0.030089659  0.020843281
## ScreenPorch           NA  0.070025842  0.088676018 -0.012435350
## PoolArea              NA  0.016379573  0.053177619 -0.031243573
## MiscVal               NA  0.005148597  0.004870854 -0.023802148
## MoSold                NA -0.001773449 -0.015725934  0.035455949
## YrSold                NA  0.018506484  0.031383706 -0.040834117
## SalePrice             NA  0.395923108 -0.008899911  0.220677828
##                TotalBsmtSF    X1stFlrSF    X2ndFlrSF  LowQualFinSF
## Id            -0.028311819  0.001666304  0.002578492 -4.412751e-02
## MSSubClass    -0.255441005 -0.265000693  0.311293638  4.649926e-02
## LotFrontage             NA           NA           NA            NA
## LotArea        0.221939938  0.267643644  0.037276582  5.675275e-03
## OverallQual    0.532665986  0.462041822  0.279745021 -2.982579e-02
## OverallCond   -0.176000436 -0.145612855  0.031296654  2.540647e-02
## YearBuilt      0.399866607  0.279928714  0.002953154 -1.837200e-01
## YearRemodAdd   0.294866090  0.238304489  0.136103360 -6.221516e-02
## MasVnrArea              NA           NA           NA            NA
## BsmtFinSF1     0.460323665  0.386453075 -0.183357567 -6.661089e-02
## BsmtFinSF2     0.116477633  0.106133963 -0.098241420  1.471362e-02
## BsmtUnfSF      0.441625132  0.331573791  0.002749242  2.825298e-02
## TotalBsmtSF    1.000000000  0.800758989 -0.226960337 -3.345752e-02
## X1stFlrSF      0.800758989  1.000000000 -0.252296704 -1.312801e-02
## X2ndFlrSF     -0.226960337 -0.252296704  1.000000000  6.514187e-02
## LowQualFinSF  -0.033457516 -0.013128013  0.065141871  1.000000e+00
## GrLivArea      0.394829176  0.522920244  0.687429564  1.448249e-01
## BsmtFullBath   0.298870883  0.232826186 -0.178520522 -4.697760e-02
## BsmtHalfBath  -0.006119831 -0.004382854 -0.032587094 -5.605597e-03
## FullBath       0.319777839  0.374630683  0.410642097  1.239123e-06
## HalfBath      -0.072369928 -0.144372621  0.609022074 -2.673046e-02
## BedroomAbvGr   0.045549161  0.125474298  0.502450076  1.060079e-01
## KitchenAbvGr  -0.069964342  0.074554578  0.061777147  7.452500e-03
## TotRmsAbvGrd   0.259133114  0.390639219  0.610793572  1.333444e-01
## Fireplaces     0.321377773  0.396829341  0.182722299 -2.072834e-02
## GarageYrBlt             NA           NA           NA            NA
## GarageCars     0.448605965  0.445861364  0.174846717 -9.431482e-02
## GarageArea     0.472002530  0.474245802  0.125023360 -6.747432e-02
## WoodDeckSF     0.229984154  0.230864886  0.083669986 -2.511374e-02
## OpenPorchSF    0.215558908  0.179049159  0.198406685  1.933851e-02
## EnclosedPorch -0.095871638 -0.063073606  0.065690250  6.097457e-02
## X3SsnPorch     0.041761766  0.060552703 -0.023739859 -4.334208e-03
## ScreenPorch    0.094511019  0.097092939  0.043307602  2.671336e-02
## PoolArea       0.004418110  0.032087836  0.038375042  7.307037e-02
## MiscVal       -0.018253492 -0.020800889  0.017111398 -3.821953e-03
## MoSold         0.030026019  0.045080784  0.039163439 -2.244100e-02
## YrSold        -0.012192121 -0.010013637 -0.024874105 -2.907371e-02
## SalePrice      0.646584498  0.625234719  0.297301302 -2.535064e-02
##                   GrLivArea BsmtFullBath BsmtHalfBath      FullBath
## Id            -0.0008462866 -0.001019424 -0.019719871  4.005115e-03
## MSSubClass     0.0779555283  0.003281653 -0.002508698  1.321310e-01
## LotFrontage              NA           NA           NA            NA
## LotArea        0.2318869552  0.147594611  0.047390546  1.173359e-01
## OverallQual    0.5835191995  0.104091983 -0.047172132  5.437909e-01
## OverallCond   -0.0785673546 -0.053106837  0.117206818 -1.941671e-01
## YearBuilt      0.1926449508  0.185009254 -0.039945271  4.667100e-01
## YearRemodAdd   0.2892642784  0.116765047 -0.013296723  4.382115e-01
## MasVnrArea               NA           NA           NA            NA
## BsmtFinSF1     0.1214790300  0.661932650  0.068868916  3.715879e-02
## BsmtFinSF2    -0.0049949602  0.160254329  0.071985752 -7.529071e-02
## BsmtUnfSF      0.2516319360 -0.424026185 -0.099007488  2.893995e-01
## TotalBsmtSF    0.3948291761  0.298870883 -0.006119831  3.197778e-01
## X1stFlrSF      0.5229202439  0.232826186 -0.004382854  3.746307e-01
## X2ndFlrSF      0.6874295638 -0.178520522 -0.032587094  4.106421e-01
## LowQualFinSF   0.1448248926 -0.046977597 -0.005605597  1.239123e-06
## GrLivArea      1.0000000000  0.013406111 -0.032112178  6.351612e-01
## BsmtFullBath   0.0134061114  1.000000000 -0.146201453 -6.945738e-02
## BsmtHalfBath  -0.0321121779 -0.146201453  1.000000000 -6.137911e-02
## FullBath       0.6351612085 -0.069457376 -0.061379110  1.000000e+00
## HalfBath       0.4190516237 -0.034860782 -0.015172957  1.303351e-01
## BedroomAbvGr   0.5400833656 -0.152267699  0.043330861  3.609899e-01
## KitchenAbvGr   0.1098094664 -0.041035509 -0.037682320  1.353522e-01
## TotRmsAbvGrd   0.8339786219 -0.063714744 -0.028715371  5.496252e-01
## Fireplaces     0.4516621451  0.130932663  0.024536785  2.364775e-01
## GarageYrBlt              NA           NA           NA            NA
## GarageCars     0.4740577872  0.130567676 -0.024971992  4.653247e-01
## GarageArea     0.4545116864  0.170653430 -0.028212797  4.007804e-01
## WoodDeckSF     0.2418269598  0.174636052  0.034626278  1.821310e-01
## OpenPorchSF    0.3073253482  0.056250603 -0.024384947  2.529112e-01
## EnclosedPorch  0.0161478136 -0.049033639 -0.007802929 -1.138121e-01
## X3SsnPorch     0.0239668378  0.000249040  0.035564678  3.630391e-02
## ScreenPorch    0.1124084791  0.024074572  0.032901535 -6.556984e-03
## PoolArea       0.0643456922  0.037039385  0.027886165  2.150946e-02
## MiscVal       -0.0009740931 -0.022877398 -0.007211400 -1.387200e-02
## MoSold         0.0653284782 -0.023770135  0.038478449  5.819696e-02
## YrSold        -0.0318983131  0.067665051 -0.045302641 -1.657378e-02
## SalePrice      0.7205163007  0.235696782 -0.036792474  5.590482e-01
##                   HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## Id             0.005248119  0.036774322  0.003222388  0.023854418
## MSSubClass     0.177476000 -0.023626587  0.281783056  0.040246635
## LotFrontage             NA           NA           NA           NA
## LotArea        0.005980504  0.118959513 -0.016565309  0.173629285
## OverallQual    0.267431383  0.096848146 -0.184280603  0.415833902
## OverallCond   -0.059926614  0.013248892 -0.087204458 -0.055766348
## YearBuilt      0.240143681 -0.072623278 -0.174481002  0.089206884
## YearRemodAdd   0.181135981 -0.041918503 -0.149287577  0.187520258
## MasVnrArea              NA           NA           NA           NA
## BsmtFinSF1    -0.014507635 -0.121893063 -0.082722357  0.001876651
## BsmtFinSF2    -0.031242668 -0.015134114 -0.040926117 -0.033490456
## BsmtUnfSF     -0.041925429  0.166583946  0.030226318  0.251935602
## TotalBsmtSF   -0.072369928  0.045549161 -0.069964342  0.259133114
## X1stFlrSF     -0.144372621  0.125474298  0.074554578  0.390639219
## X2ndFlrSF      0.609022074  0.502450076  0.061777147  0.610793572
## LowQualFinSF  -0.026730455  0.106007883  0.007452500  0.133344353
## GrLivArea      0.419051624  0.540083366  0.109809466  0.833978622
## BsmtFullBath  -0.034860782 -0.152267699 -0.041035509 -0.063714744
## BsmtHalfBath  -0.015172957  0.043330861 -0.037682320 -0.028715371
## FullBath       0.130335116  0.360989894  0.135352207  0.549625234
## HalfBath       1.000000000  0.224798930 -0.067693846  0.338617884
## BedroomAbvGr   0.224798930  1.000000000  0.199328385  0.679346237
## KitchenAbvGr  -0.067693846  0.199328385  1.000000000  0.260103377
## TotRmsAbvGrd   0.338617884  0.679346237  0.260103377  1.000000000
## Fireplaces     0.198393911  0.103951004 -0.123688494  0.315643170
## GarageYrBlt             NA           NA           NA           NA
## GarageCars     0.215800298  0.083083778 -0.050014806  0.358068680
## GarageArea     0.157317275  0.062108286 -0.063668606  0.325466799
## WoodDeckSF     0.104537626  0.044038842 -0.089670237  0.159720485
## OpenPorchSF    0.194921319  0.093803316 -0.069738368  0.219969071
## EnclosedPorch -0.094316841  0.042401896  0.037112509  0.006789744
## X3SsnPorch    -0.004589731 -0.024262571 -0.024669916 -0.005908304
## ScreenPorch    0.073496772  0.044941354 -0.051778708  0.061924394
## PoolArea       0.001009749  0.064117862 -0.012306310  0.041587857
## MiscVal        0.001589181  0.007964924  0.062294269  0.025639926
## MoSold        -0.007126783  0.048477204  0.026340058  0.041965802
## YrSold        -0.008853373 -0.034848689  0.031454021 -0.032189520
## SalePrice      0.282924892  0.160541722 -0.138848617  0.537461767
##                 Fireplaces GarageYrBlt   GarageCars  GarageArea
## Id            -0.024673077          NA  0.015782854  0.01326645
## MSSubClass    -0.046376588          NA -0.040490374 -0.10014478
## LotFrontage             NA          NA           NA          NA
## LotArea        0.259700916          NA  0.150977421  0.16218279
## OverallQual    0.387424903          NA  0.598738891  0.55490469
## OverallCond   -0.022277117          NA -0.185493758 -0.15067915
## YearBuilt      0.143162318          NA  0.536748656  0.47731136
## YearRemodAdd   0.108731621          NA  0.419572822  0.36958985
## MasVnrArea              NA          NA           NA          NA
## BsmtFinSF1     0.236218676          NA  0.224043217  0.26865080
## BsmtFinSF2     0.049027376          NA -0.037330543 -0.01648486
## BsmtUnfSF      0.051796777          NA  0.213772291  0.18456213
## TotalBsmtSF    0.321377773          NA  0.448605965  0.47200253
## X1stFlrSF      0.396829341          NA  0.445861364  0.47424580
## X2ndFlrSF      0.182722299          NA  0.174846717  0.12502336
## LowQualFinSF  -0.020728345          NA -0.094314819 -0.06747432
## GrLivArea      0.451662145          NA  0.474057787  0.45451169
## BsmtFullBath   0.130932663          NA  0.130567676  0.17065343
## BsmtHalfBath   0.024536785          NA -0.024971992 -0.02821280
## FullBath       0.236477476          NA  0.465324740  0.40078044
## HalfBath       0.198393911          NA  0.215800298  0.15731727
## BedroomAbvGr   0.103951004          NA  0.083083778  0.06210829
## KitchenAbvGr  -0.123688494          NA -0.050014806 -0.06366861
## TotRmsAbvGrd   0.315643170          NA  0.358068680  0.32546680
## Fireplaces     1.000000000          NA  0.297666003  0.25685254
## GarageYrBlt             NA           1           NA          NA
## GarageCars     0.297666003          NA  1.000000000  0.88688169
## GarageArea     0.256852545          NA  0.886881692  1.00000000
## WoodDeckSF     0.194972138          NA  0.223009904  0.21996742
## OpenPorchSF    0.160646855          NA  0.209762044  0.22808912
## EnclosedPorch -0.022885417          NA -0.150590002 -0.12061485
## X3SsnPorch     0.012042209          NA  0.036289595  0.03621290
## ScreenPorch    0.187656148          NA  0.051622319  0.05373159
## PoolArea       0.051221254          NA  0.003359958  0.01163730
## MiscVal        0.001942724          NA -0.042885515 -0.02708835
## MoSold         0.053946716          NA  0.041607745  0.03460232
## YrSold        -0.022566883          NA -0.037179042 -0.02587008
## SalePrice      0.466765283          NA  0.649256334  0.63696359
##                 WoodDeckSF  OpenPorchSF EnclosedPorch    X3SsnPorch
## Id            -0.030644016 -0.002471338   0.003347895 -0.0465399309
## MSSubClass    -0.012852609 -0.006686882  -0.011966400 -0.0438022359
## LotFrontage             NA           NA            NA            NA
## LotArea        0.167040055  0.061679261  -0.016108446  0.0215053707
## OverallQual    0.232818939  0.297802744  -0.112407321  0.0316205860
## OverallCond   -0.003063120 -0.029648925   0.070102698  0.0254189208
## YearBuilt      0.222690462  0.183905184  -0.386903576  0.0317173746
## YearRemodAdd   0.204019691  0.222648772  -0.193348040  0.0455958229
## MasVnrArea              NA           NA            NA            NA
## BsmtFinSF1     0.201462192  0.071850658  -0.103053229  0.0298791692
## BsmtFinSF2     0.069027662  0.005083074   0.036268978 -0.0300896587
## BsmtUnfSF     -0.006875562  0.129147583  -0.002336340  0.0208432809
## TotalBsmtSF    0.229984154  0.215558908  -0.095871638  0.0417617657
## X1stFlrSF      0.230864886  0.179049159  -0.063073606  0.0605527028
## X2ndFlrSF      0.083669986  0.198406685   0.065690250 -0.0237398590
## LowQualFinSF  -0.025113736  0.019338513   0.060974566 -0.0043342079
## GrLivArea      0.241826960  0.307325348   0.016147814  0.0239668378
## BsmtFullBath   0.174636052  0.056250603  -0.049033639  0.0002490400
## BsmtHalfBath   0.034626278 -0.024384947  -0.007802929  0.0355646780
## FullBath       0.182131041  0.252911183  -0.113812140  0.0363039057
## HalfBath       0.104537626  0.194921319  -0.094316841 -0.0045897314
## BedroomAbvGr   0.044038842  0.093803316   0.042401896 -0.0242625708
## KitchenAbvGr  -0.089670237 -0.069738368   0.037112509 -0.0246699160
## TotRmsAbvGrd   0.159720485  0.219969071   0.006789744 -0.0059083044
## Fireplaces     0.194972138  0.160646855  -0.022885417  0.0120422088
## GarageYrBlt             NA           NA            NA            NA
## GarageCars     0.223009904  0.209762044  -0.150590002  0.0362895946
## GarageArea     0.219967415  0.228089122  -0.120614848  0.0362129011
## WoodDeckSF     1.000000000  0.053498193  -0.125150835 -0.0324722989
## OpenPorchSF    0.053498193  1.000000000  -0.092093705 -0.0051484572
## EnclosedPorch -0.125150835 -0.092093705   1.000000000 -0.0374274621
## X3SsnPorch    -0.032472299 -0.005148457  -0.037427462  1.0000000000
## ScreenPorch   -0.073489496  0.077261261  -0.083154070 -0.0315259573
## PoolArea       0.068308638  0.030359975   0.068796316 -0.0067704980
## MiscVal       -0.009287455 -0.018276553   0.018277473  0.0003259486
## MoSold         0.024595095  0.072515014  -0.029564738  0.0293859621
## YrSold         0.023859822 -0.056326427  -0.010342520  0.0185163895
## SalePrice      0.322537864  0.330360776  -0.129773817  0.0474141193
##                ScreenPorch     PoolArea       MiscVal       MoSold
## Id             0.001673919  0.040870714 -0.0061383622  0.023245134
## MSSubClass    -0.025978506  0.007956894 -0.0076657529 -0.013512341
## LotFrontage             NA           NA            NA           NA
## LotArea        0.045620158  0.033875227  0.0391923589  0.007188288
## OverallQual    0.067732233  0.018121246 -0.0310684533  0.076414295
## OverallCond    0.054616656  0.008078797  0.0687294213 -0.003135262
## YearBuilt     -0.049702593 -0.014372825 -0.0341926665  0.013880837
## YearRemodAdd  -0.038176464 -0.009490188 -0.0100998380  0.022628793
## MasVnrArea              NA           NA            NA           NA
## BsmtFinSF1     0.070025842  0.016379573  0.0051485974 -0.001773449
## BsmtFinSF2     0.088676018  0.053177619  0.0048708535 -0.015725934
## BsmtUnfSF     -0.012435350 -0.031243573 -0.0238021485  0.035455949
## TotalBsmtSF    0.094511019  0.004418110 -0.0182534921  0.030026019
## X1stFlrSF      0.097092939  0.032087836 -0.0208008890  0.045080784
## X2ndFlrSF      0.043307602  0.038375042  0.0171113981  0.039163439
## LowQualFinSF   0.026713364  0.073070372 -0.0038219534 -0.022440998
## GrLivArea      0.112408479  0.064345692 -0.0009740931  0.065328478
## BsmtFullBath   0.024074572  0.037039385 -0.0228773978 -0.023770135
## BsmtHalfBath   0.032901535  0.027886165 -0.0072114001  0.038478449
## FullBath      -0.006556984  0.021509463 -0.0138719953  0.058196963
## HalfBath       0.073496772  0.001009749  0.0015891807 -0.007126783
## BedroomAbvGr   0.044941354  0.064117862  0.0079649243  0.048477204
## KitchenAbvGr  -0.051778708 -0.012306310  0.0622942688  0.026340058
## TotRmsAbvGrd   0.061924394  0.041587857  0.0256399258  0.041965802
## Fireplaces     0.187656148  0.051221254  0.0019427237  0.053946716
## GarageYrBlt             NA           NA            NA           NA
## GarageCars     0.051622319  0.003359958 -0.0428855151  0.041607745
## GarageArea     0.053731588  0.011637301 -0.0270883487  0.034602322
## WoodDeckSF    -0.073489496  0.068308638 -0.0092874553  0.024595095
## OpenPorchSF    0.077261261  0.030359975 -0.0182765535  0.072515014
## EnclosedPorch -0.083154070  0.068796316  0.0182774735 -0.029564738
## X3SsnPorch    -0.031525957 -0.006770498  0.0003259486  0.029385962
## ScreenPorch    1.000000000  0.063724323  0.0318842020  0.022863399
## PoolArea       0.063724323  1.000000000  0.0354804284 -0.022901046
## MiscVal        0.031884202  0.035480428  1.0000000000 -0.006656820
## MoSold         0.022863399 -0.022901046 -0.0066568201  1.000000000
## YrSold         0.010382918 -0.062640205  0.0048055564 -0.146229332
## SalePrice      0.118324340  0.032819025 -0.0210967705  0.056796504
##                      YrSold    SalePrice
## Id             0.0007934422 -0.027454886
## MSSubClass    -0.0213297258 -0.088160149
## LotFrontage              NA           NA
## LotArea       -0.0130140877  0.269866484
## OverallQual   -0.0243205816  0.800858356
## OverallCond    0.0437548122 -0.080201802
## YearBuilt     -0.0125932671  0.535279432
## YearRemodAdd   0.0365974929  0.521427960
## MasVnrArea               NA           NA
## BsmtFinSF1     0.0185064844  0.395923108
## BsmtFinSF2     0.0313837059 -0.008899911
## BsmtUnfSF     -0.0408341174  0.220677828
## TotalBsmtSF   -0.0121921213  0.646584498
## X1stFlrSF     -0.0100136366  0.625234719
## X2ndFlrSF     -0.0248741055  0.297301302
## LowQualFinSF  -0.0290737148 -0.025350636
## GrLivArea     -0.0318983131  0.720516301
## BsmtFullBath   0.0676650512  0.235696782
## BsmtHalfBath  -0.0453026411 -0.036792474
## FullBath      -0.0165737812  0.559048238
## HalfBath      -0.0088533728  0.282924892
## BedroomAbvGr  -0.0348486895  0.160541722
## KitchenAbvGr   0.0314540208 -0.138848617
## TotRmsAbvGrd  -0.0321895196  0.537461767
## Fireplaces    -0.0225668829  0.466765283
## GarageYrBlt              NA           NA
## GarageCars    -0.0371790418  0.649256334
## GarageArea    -0.0258700820  0.636963593
## WoodDeckSF     0.0238598225  0.322537864
## OpenPorchSF   -0.0563264274  0.330360776
## EnclosedPorch -0.0103425199 -0.129773817
## X3SsnPorch     0.0185163895  0.047414119
## ScreenPorch    0.0103829179  0.118324340
## PoolArea      -0.0626402045  0.032819025
## MiscVal        0.0048055564 -0.021096770
## MoSold        -0.1462293319  0.056796504
## YrSold         1.0000000000 -0.023693833
## SalePrice     -0.0236938330  1.000000000

The variables most strongly correlated with SalePrice (all above 0.5) are OverallQual, YearBuilt, YearRemodAdd, TotalBsmtSF, X1stFlrSF, GrLivArea, FullBath, TotRmsAbvGrd, GarageCars, and GarageArea.
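The NA rows and columns in the matrix above (LotFrontage, MasVnrArea, GarageYrBlt) presumably come from missing values in those columns. Here is a minimal sketch of pulling out the high-correlation variables programmatically while ignoring missing pairs. The data frame `toy` and the 0.5 cutoff are assumptions for illustration, not the real `trainnumonly`.

```r
# Toy stand-in for trainnumonly; all values are simulated for illustration
set.seed(1)
n <- 200
toy <- data.frame(
  GrLivArea   = rnorm(n, 1500, 400),
  MoSold      = sample(1:12, n, replace = TRUE),
  LotFrontage = rnorm(n, 70, 20)
)
toy$SalePrice <- 100 * toy$GrLivArea + rnorm(n, 0, 20000)
toy$LotFrontage[c(3, 10)] <- NA  # missing values make a plain cor() return NA

# Correlation of every column with SalePrice, using only complete pairs
r <- cor(toy, use = "pairwise.complete.obs")[, "SalePrice"]

# Keep predictors whose absolute correlation exceeds the (assumed) 0.5 cutoff
high <- names(r)[abs(r) > 0.5 & names(r) != "SalePrice"]
high
```

Selecting by name this way avoids hard-coding column positions, which break if the column order ever changes.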

trainhigh<-trainnumonly[,c(5,7,8,13,14,17,20,24,27,28,38)]
cor(trainhigh)
##              OverallQual  YearBuilt YearRemodAdd TotalBsmtSF X1stFlrSF
## OverallQual    1.0000000 0.57171183    0.5509706   0.5326660 0.4620418
## YearBuilt      0.5717118 1.00000000    0.5919061   0.3998666 0.2799287
## YearRemodAdd   0.5509706 0.59190614    1.0000000   0.2948661 0.2383045
## TotalBsmtSF    0.5326660 0.39986661    0.2948661   1.0000000 0.8007590
## X1stFlrSF      0.4620418 0.27992871    0.2383045   0.8007590 1.0000000
## GrLivArea      0.5835192 0.19264495    0.2892643   0.3948292 0.5229202
## FullBath       0.5437909 0.46671000    0.4382115   0.3197778 0.3746307
## TotRmsAbvGrd   0.4158339 0.08920688    0.1875203   0.2591331 0.3906392
## GarageCars     0.5987389 0.53674866    0.4195728   0.4486060 0.4458614
## GarageArea     0.5549047 0.47731136    0.3695899   0.4720025 0.4742458
## SalePrice      0.8008584 0.53527943    0.5214280   0.6465845 0.6252347
##              GrLivArea  FullBath TotRmsAbvGrd GarageCars GarageArea
## OverallQual  0.5835192 0.5437909   0.41583390  0.5987389  0.5549047
## YearBuilt    0.1926450 0.4667100   0.08920688  0.5367487  0.4773114
## YearRemodAdd 0.2892643 0.4382115   0.18752026  0.4195728  0.3695899
## TotalBsmtSF  0.3948292 0.3197778   0.25913311  0.4486060  0.4720025
## X1stFlrSF    0.5229202 0.3746307   0.39063922  0.4458614  0.4742458
## GrLivArea    1.0000000 0.6351612   0.83397862  0.4740578  0.4545117
## FullBath     0.6351612 1.0000000   0.54962523  0.4653247  0.4007804
## TotRmsAbvGrd 0.8339786 0.5496252   1.00000000  0.3580687  0.3254668
## GarageCars   0.4740578 0.4653247   0.35806868  1.0000000  0.8868817
## GarageArea   0.4545117 0.4007804   0.32546680  0.8868817  1.0000000
## SalePrice    0.7205163 0.5590482   0.53746177  0.6492563  0.6369636
##              SalePrice
## OverallQual  0.8008584
## YearBuilt    0.5352794
## YearRemodAdd 0.5214280
## TotalBsmtSF  0.6465845
## X1stFlrSF    0.6252347
## GrLivArea    0.7205163
## FullBath     0.5590482
## TotRmsAbvGrd 0.5374618
## GarageCars   0.6492563
## GarageArea   0.6369636
## SalePrice    1.0000000

Several of these predictors are closely related to each other: GarageCars and GarageArea (0.89), TotRmsAbvGrd and GrLivArea (0.83), and X1stFlrSF and TotalBsmtSF (0.80).
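The closely related pairs can also be listed mechanically rather than read off the matrix by eye. This is a sketch that re-enters four of the correlations reported above by hand; the 0.8 cutoff and the name `close_pairs` are my own choices for illustration.

```r
# Correlations copied from the cor(trainhigh) output above (four variables only)
vars <- c("GarageCars", "GarageArea", "GrLivArea", "TotRmsAbvGrd")
cm <- matrix(c(
  1.0000000, 0.8868817, 0.4740578, 0.3580687,
  0.8868817, 1.0000000, 0.4545117, 0.3254668,
  0.4740578, 0.4545117, 1.0000000, 0.8339786,
  0.3580687, 0.3254668, 0.8339786, 1.0000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Look only above the diagonal so each pair is reported once
idx <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)

close_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
close_pairs
```

Only one variable from each such pair is likely to be worth keeping in a linear model, since the second carries mostly the same information.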

Now removing some outliers I missed earlier: houses with X1stFlrSF or TotalBsmtSF at or above 2500 square feet.

plot(trainhigh)

plot(trainhigh$SalePrice,trainhigh$X1stFlrSF)

trainhigh<-subset(trainhigh,trainhigh$X1stFlrSF<2500)
plot(trainhigh$SalePrice,trainhigh$X1stFlrSF)

plot(trainhigh)

plot(trainhigh$SalePrice,trainhigh$TotalBsmtSF)

trainhigh<-subset(trainhigh,trainhigh$TotalBsmtSF<2500)
plot(trainhigh$SalePrice,trainhigh$TotalBsmtSF)
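The two filters can be combined into one step, and it is worth counting how many rows are dropped. A small sketch with a made-up data frame (`toy` and its values are hypothetical, not rows from the real data):

```r
# Hypothetical toy data; the real trainhigh is not reproduced here
toy <- data.frame(
  SalePrice   = c(120000, 185000, 160000, 410000, 230000),
  X1stFlrSF   = c(900, 1400, 4692, 2100, 1250),
  TotalBsmtSF = c(850, 1350, 6110, 3206, 1200)
)

# Apply both cutoffs at once; subset() evaluates column names inside the data frame
kept <- subset(toy, X1stFlrSF < 2500 & TotalBsmtSF < 2500)

# Number of rows removed by the filter
nrow(toy) - nrow(kept)
```

Keeping the count visible makes it easier to say later exactly how many observations a model was fitted without.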

cor(trainhigh)
##              OverallQual  YearBuilt YearRemodAdd TotalBsmtSF X1stFlrSF
## OverallQual    1.0000000 0.57082722    0.5516664   0.5260381 0.4541321
## YearBuilt      0.5708272 1.00000000    0.5910612   0.4034064 0.2819163
## YearRemodAdd   0.5516664 0.59106119    1.0000000   0.2979495 0.2438063
## TotalBsmtSF    0.5260381 0.40340637    0.2979495   1.0000000 0.7960045
## X1stFlrSF      0.4541321 0.28191632    0.2438063   0.7960045 1.0000000
## GrLivArea      0.5794769 0.19034522    0.2912482   0.3782374 0.5050480
## FullBath       0.5436030 0.46769096    0.4401974   0.3097467 0.3645304
## TotRmsAbvGrd   0.4100516 0.08613077    0.1873652   0.2434663 0.3764859
## GarageCars     0.5964638 0.53527734    0.4181702   0.4491701 0.4508248
## GarageArea     0.5527161 0.47578264    0.3684276   0.4750226 0.4805533
## SalePrice      0.7984023 0.53661391    0.5248672   0.6355049 0.6162173
##              GrLivArea  FullBath TotRmsAbvGrd GarageCars GarageArea
## OverallQual  0.5794769 0.5436030   0.41005155  0.5964638  0.5527161
## YearBuilt    0.1903452 0.4676910   0.08613077  0.5352773  0.4757826
## YearRemodAdd 0.2912482 0.4401974   0.18736518  0.4181702  0.3684276
## TotalBsmtSF  0.3782374 0.3097467   0.24346627  0.4491701  0.4750226
## X1stFlrSF    0.5050480 0.3645304   0.37648592  0.4508248  0.4805533
## GrLivArea    1.0000000 0.6318396   0.83227643  0.4742825  0.4544235
## FullBath     0.6318396 1.0000000   0.54589573  0.4651898  0.4003586
## TotRmsAbvGrd 0.8322764 0.5458957   1.00000000  0.3560526  0.3234478
## GarageCars   0.4742825 0.4651898   0.35605261  1.0000000  0.8865733
## GarageArea   0.4544235 0.4003586   0.32344775  0.8865733  1.0000000
## SalePrice    0.7181285 0.5593858   0.53244421  0.6506223  0.6397290
##              SalePrice
## OverallQual  0.7984023
## YearBuilt    0.5366139
## YearRemodAdd 0.5248672
## TotalBsmtSF  0.6355049
## X1stFlrSF    0.6162173
## GrLivArea    0.7181285
## FullBath     0.5593858
## TotRmsAbvGrd 0.5324442
## GarageCars   0.6506223
## GarageArea   0.6397290
## SalePrice    1.0000000

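Some of the correlations with SalePrice have dropped slightly (TotalBsmtSF, for example, from 0.647 to 0.636). I’m going to run the linear models with and without these additional outliers (train2 versus trainhigh) to see what I get.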

summary(lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea  + GarageCars, data=train2))
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + TotalBsmtSF + 
##     GrLivArea + GarageCars, data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -141345  -19786   -1804   16447  249419 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.147e+05  7.391e+04  -9.669  < 2e-16 ***
## OverallQual  1.850e+04  1.014e+03  18.257  < 2e-16 ***
## YearBuilt    3.214e+02  3.883e+01   8.278 2.80e-16 ***
## TotalBsmtSF  4.243e+01  2.582e+00  16.432  < 2e-16 ***
## GrLivArea    5.575e+01  2.322e+00  24.015  < 2e-16 ***
## GarageCars   1.130e+04  1.604e+03   7.045 2.86e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33360 on 1450 degrees of freedom
## Multiple R-squared:  0.8114, Adjusted R-squared:  0.8108 
## F-statistic:  1248 on 5 and 1450 DF,  p-value: < 2.2e-16
summary(lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea  + GarageCars, data=trainhigh))
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + TotalBsmtSF + 
##     GrLivArea + GarageCars, data = trainhigh)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -140803  -19499   -1648   16122  250173 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.193e+05  7.330e+04  -9.813  < 2e-16 ***
## OverallQual  1.823e+04  1.007e+03  18.095  < 2e-16 ***
## YearBuilt    3.245e+02  3.852e+01   8.425  < 2e-16 ***
## TotalBsmtSF  4.211e+01  2.653e+00  15.868  < 2e-16 ***
## GrLivArea    5.616e+01  2.319e+00  24.211  < 2e-16 ***
## GarageCars   1.131e+04  1.595e+03   7.091 2.08e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33060 on 1443 degrees of freedom
## Multiple R-squared:  0.8087, Adjusted R-squared:  0.808 
## F-statistic:  1220 on 5 and 1443 DF,  p-value: < 2.2e-16
summary(lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea  + GarageCars + FullBath, data=train2))
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + TotalBsmtSF + 
##     GrLivArea + GarageCars + FullBath, data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -140731  -19122   -1714   16763  246694 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.346e+05  7.865e+04 -10.611  < 2e-16 ***
## OverallQual  1.867e+04  1.008e+03  18.515  < 2e-16 ***
## YearBuilt    3.853e+02  4.139e+01   9.309  < 2e-16 ***
## TotalBsmtSF  4.145e+01  2.577e+00  16.085  < 2e-16 ***
## GrLivArea    6.191e+01  2.721e+00  22.752  < 2e-16 ***
## GarageCars   1.141e+04  1.595e+03   7.155 1.32e-12 ***
## FullBath    -9.902e+03  2.317e+03  -4.274 2.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33160 on 1449 degrees of freedom
## Multiple R-squared:  0.8138, Adjusted R-squared:  0.813 
## F-statistic:  1055 on 6 and 1449 DF,  p-value: < 2.2e-16
summary(lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea  + GarageCars + FullBath, data=trainhigh))
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + TotalBsmtSF + 
##     GrLivArea + GarageCars + FullBath, data = trainhigh)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -140200  -18958   -1685   16689  247634 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.332e+05  7.807e+04 -10.673  < 2e-16 ***
## OverallQual  1.841e+04  1.003e+03  18.355  < 2e-16 ***
## YearBuilt    3.852e+02  4.109e+01   9.374  < 2e-16 ***
## TotalBsmtSF  4.110e+01  2.651e+00  15.506  < 2e-16 ***
## GrLivArea    6.199e+01  2.714e+00  22.844  < 2e-16 ***
## GarageCars   1.142e+04  1.586e+03   7.197 9.89e-13 ***
## FullBath    -9.412e+03  2.306e+03  -4.081 4.74e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32880 on 1442 degrees of freedom
## Multiple R-squared:  0.8109, Adjusted R-squared:  0.8101 
## F-statistic:  1030 on 6 and 1442 DF,  p-value: < 2.2e-16
ThisModel<-lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea  + GarageCars, data=train2)
ThisModelPlus<-lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea  + GarageCars + FullBath, data=train2)

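FullBath is statistically significant, but it adds almost nothing to the model: on train2 the adjusted R\(^2\) only moves from 0.8108 to 0.813, and its coefficient is negative, which likely reflects its overlap with GrLivArea (correlation 0.64) rather than a real effect. I am discarding it. Also, the drop in correlation coefficients after removing the extra outliers has not made any appreciable difference in the adjusted R\(^2\) values or the p-values.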
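One way to make "does this extra variable help?" concrete is to compare the two nested models directly. The sketch below uses simulated data, not the housing set, and adds an `anova()` F-test, which I did not run above; all variable names are made up for illustration.

```r
# Simulated data: x2 is correlated with x1 but has no effect of its own on y
set.seed(7)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.8)
y  <- 2 * x1 + rnorm(n)
toy <- data.frame(y, x1, x2)

base_fit <- lm(y ~ x1, data = toy)        # analogous to ThisModel
plus_fit <- lm(y ~ x1 + x2, data = toy)   # analogous to ThisModelPlus

# Adjusted R-squared penalizes the extra term, so a useless predictor barely moves it
adj_base <- summary(base_fit)$adj.r.squared
adj_plus <- summary(plus_fit)$adj.r.squared
c(base = adj_base, plus = adj_plus)

# F-test for whether the larger model fits significantly better
anova(base_fit, plus_fit)
```

With a large sample, a predictor can be statistically significant and still change the adjusted R\(^2\) by a negligible amount, so it helps to look at both.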

Trying the model out on the data it was based on is not terribly useful, but I want to see how well it does when we throw the outliers back in.

practicepred<-predict(ThisModel,train)
summary(practicepred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -51999  127449  177226  181194  223406  712271
summary(train$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

Now without those 5 outliers:

practicepred<-predict(ThisModel,train2)
summary(practicepred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -51999  127281  176962  180151  222941  429684
summary(train2$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129900  163000  180151  214000  625000
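Comparing `summary()` quantiles only shows how the ranges line up. A per-row error measure such as RMSE, plus a count of impossible negative predictions, gives a more direct picture. This is a sketch on simulated data; `toy`, `fit`, and the numbers in it are assumptions, not the real train2 or ThisModel.

```r
# Simulated stand-in for the housing data (values are made up)
set.seed(42)
n <- 300
toy <- data.frame(GrLivArea = runif(n, 400, 4000))
toy$SalePrice <- pmax(30000, 100 * toy$GrLivArea - 20000 + rnorm(n, 0, 30000))

fit  <- lm(SalePrice ~ GrLivArea, data = toy)
pred <- predict(fit, newdata = toy)

# Root mean squared error: typical size of a prediction miss, in price units
rmse <- sqrt(mean((toy$SalePrice - pred)^2))

# A linear model is unbounded below, so it can predict negative prices
n_negative <- sum(pred < 0)

c(rmse = rmse, negative_predictions = n_negative)
```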

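Both are poor: the smallest prediction is negative (-51999), and without the outliers the largest prediction is 429684 against an actual maximum of 625000. Adding in some quality measures: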

Some quick aggregation on what I think are likely hits:

Q1<-count(train2,"SaleCondition")
A1<-aggregate(train2$SalePrice,by=list(SaleCondition=train2$SaleCondition),FUN=sum)
cbind(Q1$SaleCondition,A1/Q1)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   Q1$SaleCondition SaleCondition        x
## 1          Abnorml            NA 140541.9
## 2          AdjLand            NA 104125.0
## 3           Alloca            NA 167377.4
## 4           Family            NA 149600.0
## 5           Normal            NA 174717.8
## 6          Partial            NA 273916.4
Q1
##   SaleCondition freq
## 1       Abnorml  100
## 2       AdjLand    4
## 3        Alloca   12
## 4        Family   20
## 5        Normal 1197
## 6       Partial  123
#might work
Q2<-count(train2,"Exterior1st")
A2<-aggregate(train2$SalePrice,by=list(Exterior1st=train2$Exterior1st),FUN=sum)
cbind(Q2$Exterior1st,A2/Q2)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##    Q2$Exterior1st Exterior1st        x
## 1         AsbShng          NA 107385.6
## 2         AsphShn          NA 100000.0
## 3         BrkComm          NA  71000.0
## 4         BrkFace          NA 194573.0
## 5          CBlock          NA 105000.0
## 6         CemntBd          NA 232473.0
## 7         HdBoard          NA 160399.1
## 8         ImStucc          NA 262000.0
## 9         MetalSd          NA 149422.2
## 10        Plywood          NA 175942.4
## 11          Stone          NA 258500.0
## 12         Stucco          NA 163114.6
## 13        VinylSd          NA 213732.9
## 14        Wd Sdng          NA 146938.4
## 15        WdShing          NA 150655.1
Q2 #not worthwhile
##    Exterior1st freq
## 1      AsbShng   20
## 2      AsphShn    1
## 3      BrkComm    2
## 4      BrkFace   50
## 5       CBlock    1
## 6      CemntBd   60
## 7      HdBoard  221
## 8      ImStucc    1
## 9      MetalSd  220
## 10     Plywood  108
## 11       Stone    2
## 12      Stucco   24
## 13     VinylSd  515
## 14     Wd Sdng  205
## 15     WdShing   26
Q3<-count(train2,"BsmtCond")
Q3<-Q3[-5,]
A3<-aggregate(train2$SalePrice,by=list(Exterior1st=train2$BsmtCond),FUN=sum)
Q3
##   BsmtCond freq
## 1       Fa   45
## 2       Gd   65
## 3       Po    2
## 4       TA 1307
cbind(Q3$BsmtCond,A3/Q3)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   Q3$BsmtCond Exterior1st        x
## 1          Fa          NA 121809.5
## 2          Gd          NA 213599.9
## 3          Po          NA  64000.0
## 4          TA          NA 182783.2
Q4<-count(train2,"ExterQual")
A4<-aggregate(train2$SalePrice,by=list(Exterior1st=train2$ExterQual),FUN=sum)
cbind(Q4$ExterQual,A4/Q4)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   Q4$ExterQual Exterior1st         x
## 1           Ex          NA 367408.57
## 2           Fa          NA  87985.21
## 3           Gd          NA 230579.37
## 4           TA          NA 144341.31
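The warnings above come from dividing one data frame by another when both contain a factor column. A sketch of getting the same group means and counts without the warning is below; it uses base R only, and `toy` is a hypothetical data frame rather than train2.

```r
# Hypothetical toy data standing in for train2
toy <- data.frame(
  SaleCondition = factor(c("Normal", "Normal", "Partial",
                           "Abnorml", "Partial", "Normal")),
  SalePrice     = c(150000, 170000, 280000, 120000, 260000, 160000)
)

# Mean price and row count per level; both follow the factor's level order
means  <- tapply(toy$SalePrice, toy$SaleCondition, mean)
counts <- table(toy$SaleCondition)

summary_tbl <- data.frame(
  SaleCondition = names(means),
  n             = as.vector(counts),
  mean_price    = as.vector(means)
)
summary_tbl
```

Having the count next to the mean in one table makes it easier to spot levels with too few houses to be trusted, like the one- and two-house exterior types above.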

Ran out of time, so I didn’t get to show as much as I wanted. Sorry.

QualModel<-lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea  + GarageCars + ExterQual , data=train2)
QualModel2<-lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea  + GarageCars +SaleCondition, data=train2)
QualModel3<-lm(SalePrice ~ OverallQual +YearBuilt + TotalBsmtSF +GrLivArea  + GarageCars +SaleCondition+ExterQual, data=train2)
summary(QualModel)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + TotalBsmtSF + 
##     GrLivArea + GarageCars + ExterQual, data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -131680  -17368   -1381   15259  214170 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.732e+05  7.602e+04  -7.540 8.26e-14 ***
## OverallQual  1.408e+04  1.052e+03  13.379  < 2e-16 ***
## YearBuilt    3.028e+02  3.895e+01   7.776 1.42e-14 ***
## TotalBsmtSF  3.752e+01  2.446e+00  15.342  < 2e-16 ***
## GrLivArea    5.582e+01  2.185e+00  25.543  < 2e-16 ***
## GarageCars   1.068e+04  1.506e+03   7.091 2.07e-12 ***
## ExterQualFa -7.862e+04  1.054e+04  -7.461 1.48e-13 ***
## ExterQualGd -6.785e+04  4.953e+03 -13.699  < 2e-16 ***
## ExterQualTA -7.753e+04  5.564e+03 -13.935  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31280 on 1447 degrees of freedom
## Multiple R-squared:  0.8346, Adjusted R-squared:  0.8337 
## F-statistic: 912.8 on 8 and 1447 DF,  p-value: < 2.2e-16
summary(QualModel2)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + TotalBsmtSF + 
##     GrLivArea + GarageCars + SaleCondition, data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -161262  -19084   -1069   16430  230614 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -6.080e+05  7.282e+04  -8.349  < 2e-16 ***
## OverallQual           1.744e+04  9.928e+02  17.566  < 2e-16 ***
## YearBuilt             2.635e+02  3.820e+01   6.897 7.89e-12 ***
## TotalBsmtSF           4.185e+01  2.521e+00  16.604  < 2e-16 ***
## GrLivArea             5.652e+01  2.260e+00  25.005  < 2e-16 ***
## GarageCars            1.036e+04  1.570e+03   6.598 5.84e-11 ***
## SaleConditionAdjLand  2.227e+04  1.659e+04   1.343    0.180    
## SaleConditionAlloca   1.069e+04  9.963e+03   1.073    0.284    
## SaleConditionFamily  -9.177e+03  7.939e+03  -1.156    0.248    
## SaleConditionNormal   1.426e+04  3.395e+03   4.199 2.84e-05 ***
## SaleConditionPartial  4.051e+04  4.645e+03   8.721  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32370 on 1445 degrees of freedom
## Multiple R-squared:  0.8231, Adjusted R-squared:  0.8218 
## F-statistic: 672.1 on 10 and 1445 DF,  p-value: < 2.2e-16
summary(QualModel3)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + TotalBsmtSF + 
##     GrLivArea + GarageCars + SaleCondition + ExterQual, data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -145885  -17639    -929   15289  216453 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5.242e+05  7.516e+04  -6.974 4.67e-12 ***
## OverallQual           1.381e+04  1.036e+03  13.337  < 2e-16 ***
## YearBuilt             2.681e+02  3.860e+01   6.946 5.67e-12 ***
## TotalBsmtSF           3.778e+01  2.412e+00  15.662  < 2e-16 ***
## GrLivArea             5.643e+01  2.153e+00  26.210  < 2e-16 ***
## GarageCars            1.005e+04  1.491e+03   6.740 2.29e-11 ***
## SaleConditionAdjLand  1.992e+04  1.574e+04   1.266    0.206    
## SaleConditionAlloca   1.014e+04  9.470e+03   1.071    0.284    
## SaleConditionFamily  -7.207e+03  7.537e+03  -0.956    0.339    
## SaleConditionNormal   1.383e+04  3.221e+03   4.295 1.86e-05 ***
## SaleConditionPartial  3.133e+04  4.467e+03   7.013 3.56e-12 ***
## ExterQualFa          -7.140e+04  1.042e+04  -6.851 1.09e-11 ***
## ExterQualGd          -6.187e+04  4.960e+03 -12.475  < 2e-16 ***
## ExterQualTA          -6.974e+04  5.603e+03 -12.447  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30710 on 1442 degrees of freedom
## Multiple R-squared:  0.8411, Adjusted R-squared:  0.8397 
## F-statistic: 587.3 on 13 and 1442 DF,  p-value: < 2.2e-16
summary(predict(QualModel,train2))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -29768  127349  173034  180151  221257  457921
summary(predict(QualModel2,train2))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -44660  126949  175649  180151  222670  423185
summary(predict(QualModel3,train2))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -27336  127553  173714  180151  220662  449958
summary(train2$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129900  163000  180151  214000  625000

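Still lousy, but hopefully better. QualModel3 has the highest adjusted R-squared of the three (0.8397, against 0.8337 for QualModel and 0.8218 for QualModel2) and the smallest residual standard error (30710). All three models still predict a negative sale price for at least one house in the training data, while the lowest actual sale price is 34900.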
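
One way to put numbers on a comparison like this is to line the models up on a single error measure. The sketch below is only an illustration: it uses the built-in mtcars data with two stand-in models (m1 and m2), and rmse is a helper written here for the example, not something from the analysis above. The same helper could be pointed at QualModel, QualModel2 and QualModel3 with train2 and "SalePrice".

```r
# Hypothetical helper: root-mean-squared error of a fitted lm on a data frame.
rmse <- function(model, data, response) {
  sqrt(mean((data[[response]] - predict(model, data))^2))
}

# Stand-in models on the built-in mtcars data (not the housing data).
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

# One row per model: adjusted R-squared and in-sample RMSE.
comparison <- data.frame(
  model  = c("m1", "m2"),
  adj_r2 = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  rmse   = c(rmse(m1, mtcars, "mpg"), rmse(m2, mtcars, "mpg"))
)
comparison
```

Because the error is measured on the same rows the models were fitted to, adding predictors to a model can never make this RMSE worse, so it flatters the larger model. Adjusted R-squared, or an error computed on rows held out from fitting, is a fairer basis for choosing between them.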

Now loading the Kaggle test data:

Ktest<-read.csv("test.csv",stringsAsFactors = FALSE)
Ktest[is.na(Ktest)]<-0
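
Replacing every NA with 0 also touches the character columns, where, as far as I can tell, the missing value becomes the string "0". If such a column is used in a model, predict would see a level that was never in the training data. A more cautious variant fills only the numeric columns and then reports what is still missing. The sketch below assumes a toy data frame standing in for Ktest; the column names are borrowed from the models above, but the values are made up.

```r
# Toy stand-in for the test data: one numeric and one character column,
# each with a missing value.
toy <- data.frame(
  GarageCars = c(2, NA, 1),
  ExterQual  = c("Gd", NA, "TA"),
  stringsAsFactors = FALSE
)

# Fill NAs in numeric columns only; character columns are left untouched.
num_cols <- vapply(toy, is.numeric, logical(1))
toy[num_cols] <- lapply(toy[num_cols], function(col) {
  col[is.na(col)] <- 0
  col
})

# Count the NAs that remain in each column.
colSums(is.na(toy))
```

Any character column that still shows missing values here would need its own decision (for example, the most common level) before it goes into predict.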

And getting the predictions from each model, then writing one submission file per model.

Kpred<-predict(QualModel,Ktest)
Kpred2<-predict(QualModel2,Ktest)
Kpred3<-predict(QualModel3,Ktest)
Ksub<-cbind(Ktest$Id,Kpred)
Ksub2<-cbind(Ktest$Id,Kpred2)
Ksub3<-cbind(Ktest$Id,Kpred3)
colnames(Ksub)<-c("Id","SalePrice")
colnames(Ksub2)<-c("Id","SalePrice")
colnames(Ksub3)<-c("Id","SalePrice")
write.csv(Ksub,"SubmitThisQ.csv", row.names = FALSE)
write.csv(Ksub2,"SubmitThisQ2.csv", row.names = FALSE)
write.csv(Ksub3,"SubmitThisQ3.csv", row.names = FALSE)
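
Since the models produced negative predictions on the training data, it seems worth checking a submission before uploading it. The sketch below is a minimal example under stated assumptions: ids and preds are made-up stand-ins for Ktest$Id and the output of predict, and the columns are named Id and SalePrice, which is what I understand the competition's sample submission to use. It builds the table with data.frame rather than cbind, flags predictions that are missing or not positive (a score computed on log prices, which I believe is how this competition is scored, could not handle those), and reads the file back to confirm its layout.

```r
# Hypothetical stand-ins for Ktest$Id and a vector of predictions.
ids   <- 1461:1465
preds <- c(125000.4, -2300.0, 183250.9, 210400.2, 97500.0)

# data.frame names the columns in one step and keeps Id as an integer.
submission <- data.frame(Id = ids, SalePrice = preds)

# Row positions whose prediction is missing or not positive.
bad <- which(is.na(submission$SalePrice) | submission$SalePrice <= 0)
bad

# Write to a temporary file and read it back to confirm the layout.
out_file <- tempfile(fileext = ".csv")
write.csv(submission, out_file, row.names = FALSE)
check <- read.csv(out_file)
str(check)
```

What to do with a flagged row is a modelling choice that this sketch leaves open; replacing it with the smallest sale price seen in training is one simple option.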