Computational Mathematics

Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.

Load Libraries

library(dplyr)
library(kableExtra)
library(corrplot)
library(MASS)
library(ggplot2)

Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu = \sigma=(N+1)/2\).

Generate random variable X:

N <- round(runif(1, 10, 100))
n <- 10000
X <- runif(n, min = 1, max = N)
hist(X)

Generate random variable Y:

# Reuse the same N and n as X; redrawing N here would give Y a different N
Y <- rnorm(n, mean = (N + 1) / 2, sd = (N + 1) / 2)
hist(Y)

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

5 points. a. P(X>x | X>y) b. P(X>x, Y>y) c. P(X<x | X>y)

x <- median(X)
y <- quantile(Y, 0.25)
  1. \(P(X>x|X>y)= \frac{P(X>x \&\ X>y)}{P(X>y)}\)
    Probability that X is greater than its median given that X is greater than the first quartile of Y

    p <- (length(X[X>x & X>y])/length(X)) /(length(X[X>y])/length(X))
    round(p,2)
    ## [1] 0.61
  2. \(P(X>x, Y>y) = P(X>x)*P(Y>y)\), which holds because X and Y are generated independently
    Probability that X is greater than its median and Y is greater than the first quartile of Y

    p <- (length(X[X>x]) / length(X)) * (length(Y[Y>y]) / length(Y))
    round(p,2)
    ## [1] 0.38
  3. \(P(X<x | X>y) = \frac{P(X<x \&\ X>y)}{P(X>y)}\)
    Probability that X is less than its median given that X is greater than the first quartile of Y

    p = (length(X[X<x & X>y])/length(X)) / (length(X[X>y])/length(X))
    round(p,2)
    ## [1] 0.39
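
The three calculations above share one pattern: count the cases where both events hold and divide by the count where the conditioning event holds. A minimal sketch of that pattern follows, using the mean of a logical vector as the empirical probability. The helper name `cond_prob`, the seed, and the fixed `N = 50` are assumptions made for illustration and are not part of the original solution.

```r
set.seed(42)                       # assumed seed, only for reproducibility
N <- 50                            # assumed value; any N >= 6 works
n <- 10000
X <- runif(n, min = 1, max = N)
Y <- rnorm(n, mean = (N + 1) / 2, sd = (N + 1) / 2)

x <- median(X)
y <- quantile(Y, 0.25, names = FALSE)

# Empirical P(A | B): share of the cases in B that are also in A
cond_prob <- function(a, b) mean(a & b) / mean(b)

p_a <- cond_prob(X > x, X > y)     # P(X > x | X > y)
p_b <- mean(X > x & Y > y)         # P(X > x, Y > y), counted directly
p_c <- cond_prob(X < x, X > y)     # P(X < x | X > y)

round(c(a = p_a, b = p_b, c = p_c), 2)
```

Because X is continuous, probabilities a and c are complements given X>y, so they should sum to 1. That gives a quick check on the arithmetic.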

5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

Xgx_Ygy <- length(X[X>x & Y>y])
Xgx_Yly <- length(X[X>x & Y<y])

Xlx_Ygy <- length(X[X<x & Y>y])
Xlx_Yly <- length(X[X<x & Y<y])

matrix <- matrix(c(Xgx_Ygy, Xgx_Yly, Xlx_Ygy, Xlx_Yly), nrow = 2, ncol = 2)

colnames(matrix) <- c("X>x","X<x")
# Row 1 holds the Y>y counts and row 2 the Y<y counts (the matrix fills column by column)
rownames(matrix) <- c("Y>y","Y<y")

table <- as.data.frame(matrix)

kable(table) %>%
  kable_styling("striped", full_width = FALSE,bootstrap_options = "bordered")
      X>x   X<x
Y>y  3776  3724
Y<y  1224  1276

Evaluate P(X>x and Y>y) using table:

table[1,1]/n
## [1] 0.3776

Evaluate P(X>x)P(Y>y) using table:

((table[1,1]/n) + (table[2,1]/n)) * ((table[1,1]/n) + (table[1,2]/n))
## [1] 0.375

The joint probability (0.3776) and the product of the marginals (0.375) differ by less than 0.003, which is consistent with P(X>x and Y>y)=P(X>x)P(Y>y), that is, with the two events being independent.
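
One way to lay out the joint and marginal probabilities in a single table is with `prop.table()` and `addmargins()`. The sketch below hard-codes the four reported cell counts rather than regenerating X and Y, and the object names are illustrative.

```r
# Reported cell counts. The row with 7,500 observations is Y > y,
# since y is the first quartile of Y.
counts <- matrix(c(3776, 1224, 3724, 1276), nrow = 2,
                 dimnames = list(c("Y>y", "Y<y"), c("X>x", "X<x")))

joint <- prop.table(counts)        # joint probabilities, cells sum to 1
with_margins <- addmargins(joint)  # adds row and column totals (the marginals)
with_margins

# Product of the marginals, which the joint cells should match under independence
expected <- outer(rowSums(joint), colSums(joint))
round(joint - expected, 4)
```

Small differences in every cell of the last table point the same way as the single comparison of 0.3776 with 0.375.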

5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

    Null Hypothesis: X>x and Y>y are independent events  
    Alternate Hypothesis: X>x and Y>y are dependent events
fisher.test(matrix)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix
## p-value = 0.2389
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.964496 1.158476
## sample estimates:
## odds ratio 
##   1.057068
chisq.test(matrix)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  matrix
## X-squared = 1.3872, df = 1, p-value = 0.2389

Both tests give a p-value of about 0.24, which is greater than 0.05, so we fail to reject the null hypothesis that the events X>x and Y>y are independent. Fisher’s Exact Test computes an exact p-value from the hypergeometric distribution and is the usual choice when cell counts are small (expected counts below 5). The Chi Square Test relies on a large-sample approximation and is appropriate when cell counts are large. Every cell here holds more than 1,000 observations, so the Chi Square Test is appropriate; Fisher’s Exact Test remains valid for a 2×2 contingency table and leads to the same conclusion.
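
A common rule of thumb for the chi-square approximation is that every expected cell count should be at least 5. The sketch below checks that against the reported counts; the object names are illustrative.

```r
counts <- matrix(c(3776, 1224, 3724, 1276), nrow = 2)

chi <- chisq.test(counts)          # Yates' correction is applied by default for 2x2 tables
chi$expected                       # expected counts under independence
min(chi$expected) >= 5             # rule-of-thumb check for the approximation

fisher.test(counts)$p.value        # exact test, for comparison
```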

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

Descriptive and Inferential Statistics.

Provide univariate descriptive statistics and appropriate plots for the training data set.

train <- read.csv("train.csv", header = TRUE)

head(train)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1  1         60       RL          65    8450   Pave  <NA>      Reg
## 2  2         20       RL          80    9600   Pave  <NA>      Reg
## 3  3         60       RL          68   11250   Pave  <NA>      IR1
## 4  4         70       RL          60    9550   Pave  <NA>      IR1
## 5  5         60       RL          84   14260   Pave  <NA>      IR1
## 6  6         50       RL          85   14115   Pave  <NA>      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 2         Lvl    AllPub       FR2       Gtl      Veenker      Feedr
## 3         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 4         Lvl    AllPub    Corner       Gtl      Crawfor       Norm
## 5         Lvl    AllPub       FR2       Gtl      NoRidge       Norm
## 6         Lvl    AllPub    Inside       Gtl      Mitchel       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm     1Fam     2Story           7           5      2003
## 2       Norm     1Fam     1Story           6           8      1976
## 3       Norm     1Fam     2Story           7           5      2001
## 4       Norm     1Fam     2Story           7           5      1915
## 5       Norm     1Fam     2Story           8           5      2000
## 6       Norm     1Fam     1.5Fin           5           5      1993
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         2003     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 2         1976     Gable  CompShg     MetalSd     MetalSd       None
## 3         2002     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 4         1970     Gable  CompShg     Wd Sdng     Wd Shng       None
## 5         2000     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 6         1995     Gable  CompShg     VinylSd     VinylSd       None
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1        196        Gd        TA      PConc       Gd       TA           No
## 2          0        TA        TA     CBlock       Gd       TA           Gd
## 3        162        Gd        TA      PConc       Gd       TA           Mn
## 4          0        TA        TA     BrkTil       TA       Gd           No
## 5        350        Gd        TA      PConc       Gd       TA           Av
## 6          0        TA        TA       Wood       Gd       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1          GLQ        706          Unf          0       150         856
## 2          ALQ        978          Unf          0       284        1262
## 3          GLQ        486          Unf          0       434         920
## 4          ALQ        216          Unf          0       540         756
## 5          GLQ        655          Unf          0       490        1145
## 6          GLQ        732          Unf          0        64         796
##   Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1    GasA        Ex          Y      SBrkr       856       854            0
## 2    GasA        Ex          Y      SBrkr      1262         0            0
## 3    GasA        Ex          Y      SBrkr       920       866            0
## 4    GasA        Gd          Y      SBrkr       961       756            0
## 5    GasA        Ex          Y      SBrkr      1145      1053            0
## 6    GasA        Ex          Y      SBrkr       796       566            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1      1710            1            0        2        1            3
## 2      1262            0            1        2        0            3
## 3      1786            1            0        2        1            3
## 4      1717            1            0        1        0            3
## 5      2198            1            0        2        1            4
## 6      1362            1            0        1        1            1
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            1          Gd            8        Typ          0        <NA>
## 2            1          TA            6        Typ          1          TA
## 3            1          Gd            6        Typ          1          TA
## 4            1          Gd            7        Typ          1          Gd
## 5            1          Gd            9        Typ          1          TA
## 6            1          TA            5        Typ          0        <NA>
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1     Attchd        2003          RFn          2        548         TA
## 2     Attchd        1976          RFn          2        460         TA
## 3     Attchd        2001          RFn          2        608         TA
## 4     Detchd        1998          Unf          3        642         TA
## 5     Attchd        2000          RFn          3        836         TA
## 6     Attchd        1993          Unf          2        480         TA
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1         TA          Y          0          61             0          0
## 2         TA          Y        298           0             0          0
## 3         TA          Y          0          42             0          0
## 4         TA          Y          0          35           272          0
## 5         TA          Y        192          84             0          0
## 6         TA          Y         40          30             0        320
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1           0        0   <NA>  <NA>        <NA>       0      2   2008
## 2           0        0   <NA>  <NA>        <NA>       0      5   2007
## 3           0        0   <NA>  <NA>        <NA>       0      9   2008
## 4           0        0   <NA>  <NA>        <NA>       0      2   2006
## 5           0        0   <NA>  <NA>        <NA>       0     12   2008
## 6           0        0   <NA> MnPrv        Shed     700     10   2009
##   SaleType SaleCondition SalePrice
## 1       WD        Normal    208500
## 2       WD        Normal    181500
## 3       WD        Normal    223500
## 4       WD       Abnorml    140000
## 5       WD        Normal    250000
## 6       WD        Normal    143000
summary(train)
##        Id           MSSubClass       MSZoning     LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   C (all):  10   Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   FV     :  65   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   RH     :  16   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9   RL     :1151   Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   RM     : 218   3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                  Max.   :313.00  
##                                                  NA's   :259     
##     LotArea        Street      Alley      LotShape  LandContour
##  Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63   
##  1st Qu.:  7554   Pave:1454   Pave:  41   IR2: 41   HLS:  50   
##  Median :  9478               NA's:1369   IR3: 10   Low:  36   
##  Mean   : 10517                           Reg:925   Lvl:1311   
##  3rd Qu.: 11602                                                
##  Max.   :215245                                                
##                                                                
##   Utilities      LotConfig    LandSlope   Neighborhood   Condition1  
##  AllPub:1459   Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260  
##  NoSeWa:   1   CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81  
##                FR2    :  47   Sev:  13   OldTown:113   Artery :  48  
##                FR3    :   4              Edwards:100   RRAn   :  26  
##                Inside :1052              Somerst: 86   PosN   :  19  
##                                          Gilbert: 79   RRAe   :  11  
##                                          (Other):707   (Other):  15  
##    Condition2     BldgType      HouseStyle   OverallQual    
##  Norm   :1445   1Fam  :1220   1Story :726   Min.   : 1.000  
##  Feedr  :   6   2fmCon:  31   2Story :445   1st Qu.: 5.000  
##  Artery :   2   Duplex:  52   1.5Fin :154   Median : 6.000  
##  PosN   :   2   Twnhs :  43   SLvl   : 65   Mean   : 6.099  
##  RRNn   :   2   TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.000  
##  PosA   :   1                 1.5Unf : 14   Max.   :10.000  
##  (Other):   2                 (Other): 19                   
##   OverallCond      YearBuilt     YearRemodAdd    RoofStyle   
##  Min.   :1.000   Min.   :1872   Min.   :1950   Flat   :  13  
##  1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967   Gable  :1141  
##  Median :5.000   Median :1973   Median :1994   Gambrel:  11  
##  Mean   :5.575   Mean   :1971   Mean   :1985   Hip    : 286  
##  3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004   Mansard:   7  
##  Max.   :9.000   Max.   :2010   Max.   :2010   Shed   :   2  
##                                                              
##     RoofMatl     Exterior1st   Exterior2nd    MasVnrType    MasVnrArea    
##  CompShg:1434   VinylSd:515   VinylSd:504   BrkCmn : 15   Min.   :   0.0  
##  Tar&Grv:  11   HdBoard:222   MetalSd:214   BrkFace:445   1st Qu.:   0.0  
##  WdShngl:   6   MetalSd:220   HdBoard:207   None   :864   Median :   0.0  
##  WdShake:   5   Wd Sdng:206   Wd Sdng:197   Stone  :128   Mean   : 103.7  
##  ClyTile:   1   Plywood:108   Plywood:142   NA's   :  8   3rd Qu.: 166.0  
##  Membran:   1   CemntBd: 61   CmentBd: 60                 Max.   :1600.0  
##  (Other):   2   (Other):128   (Other):136                 NA's   :8       
##  ExterQual ExterCond  Foundation  BsmtQual   BsmtCond    BsmtExposure
##  Ex: 52    Ex:   3   BrkTil:146   Ex  :121   Fa  :  45   Av  :221    
##  Fa: 14    Fa:  28   CBlock:634   Fa  : 35   Gd  :  65   Gd  :134    
##  Gd:488    Gd: 146   PConc :647   Gd  :618   Po  :   2   Mn  :114    
##  TA:906    Po:   1   Slab  : 24   TA  :649   TA  :1311   No  :953    
##            TA:1282   Stone :  6   NA's: 37   NA's:  37   NA's: 38    
##                      Wood  :  3                                      
##                                                                      
##  BsmtFinType1   BsmtFinSF1     BsmtFinType2   BsmtFinSF2     
##  ALQ :220     Min.   :   0.0   ALQ :  19    Min.   :   0.00  
##  BLQ :148     1st Qu.:   0.0   BLQ :  33    1st Qu.:   0.00  
##  GLQ :418     Median : 383.5   GLQ :  14    Median :   0.00  
##  LwQ : 74     Mean   : 443.6   LwQ :  46    Mean   :  46.55  
##  Rec :133     3rd Qu.: 712.2   Rec :  54    3rd Qu.:   0.00  
##  Unf :430     Max.   :5644.0   Unf :1256    Max.   :1474.00  
##  NA's: 37                      NA's:  38                     
##    BsmtUnfSF       TotalBsmtSF      Heating     HeatingQC CentralAir
##  Min.   :   0.0   Min.   :   0.0   Floor:   1   Ex:741    N:  95    
##  1st Qu.: 223.0   1st Qu.: 795.8   GasA :1428   Fa: 49    Y:1365    
##  Median : 477.5   Median : 991.5   GasW :  18   Gd:241              
##  Mean   : 567.2   Mean   :1057.4   Grav :   7   Po:  1              
##  3rd Qu.: 808.0   3rd Qu.:1298.2   OthW :   2   TA:428              
##  Max.   :2336.0   Max.   :6110.0   Wall :   4                       
##                                                                     
##  Electrical     X1stFlrSF      X2ndFlrSF     LowQualFinSF    
##  FuseA:  94   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  FuseF:  27   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##  FuseP:   3   Median :1087   Median :   0   Median :  0.000  
##  Mix  :   1   Mean   :1163   Mean   : 347   Mean   :  5.845  
##  SBrkr:1334   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##  NA's :   1   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                              
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex:100     
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa: 39     
##  Median :0.0000   Median :3.000   Median :1.000   Gd:586     
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   TA:735     
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000              
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000              
##                                                              
##   TotRmsAbvGrd    Functional    Fireplaces    FireplaceQu   GarageType 
##  Min.   : 2.000   Maj1:  14   Min.   :0.000   Ex  : 24    2Types :  6  
##  1st Qu.: 5.000   Maj2:   5   1st Qu.:0.000   Fa  : 33    Attchd :870  
##  Median : 6.000   Min1:  31   Median :1.000   Gd  :380    Basment: 19  
##  Mean   : 6.518   Min2:  34   Mean   :0.613   Po  : 20    BuiltIn: 88  
##  3rd Qu.: 7.000   Mod :  15   3rd Qu.:1.000   TA  :313    CarPort:  9  
##  Max.   :14.000   Sev :   1   Max.   :3.000   NA's:690    Detchd :387  
##                   Typ :1360                               NA's   : 81  
##   GarageYrBlt   GarageFinish   GarageCars      GarageArea     GarageQual 
##  Min.   :1900   Fin :352     Min.   :0.000   Min.   :   0.0   Ex  :   3  
##  1st Qu.:1961   RFn :422     1st Qu.:1.000   1st Qu.: 334.5   Fa  :  48  
##  Median :1980   Unf :605     Median :2.000   Median : 480.0   Gd  :  14  
##  Mean   :1979   NA's: 81     Mean   :1.767   Mean   : 473.0   Po  :   3  
##  3rd Qu.:2002                3rd Qu.:2.000   3rd Qu.: 576.0   TA  :1311  
##  Max.   :2010                Max.   :4.000   Max.   :1418.0   NA's:  81  
##  NA's   :81                                                              
##  GarageCond  PavedDrive   WoodDeckSF      OpenPorchSF     EnclosedPorch   
##  Ex  :   2   N:  90     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  Fa  :  35   P:  30     1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Gd  :   9   Y:1340     Median :  0.00   Median : 25.00   Median :  0.00  
##  Po  :   7              Mean   : 94.24   Mean   : 46.66   Mean   : 21.95  
##  TA  :1326              3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00  
##  NA's:  81              Max.   :857.00   Max.   :547.00   Max.   :552.00  
##                                                                           
##    X3SsnPorch      ScreenPorch        PoolArea        PoolQC    
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Ex  :   2  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000   Fa  :   2  
##  Median :  0.00   Median :  0.00   Median :  0.000   Gd  :   3  
##  Mean   :  3.41   Mean   : 15.06   Mean   :  2.759   NA's:1453  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000              
##  Max.   :508.00   Max.   :480.00   Max.   :738.000              
##                                                                 
##    Fence      MiscFeature    MiscVal             MoSold      
##  GdPrv:  59   Gar2:   2   Min.   :    0.00   Min.   : 1.000  
##  GdWo :  54   Othr:   2   1st Qu.:    0.00   1st Qu.: 5.000  
##  MnPrv: 157   Shed:  49   Median :    0.00   Median : 6.000  
##  MnWw :  11   TenC:   1   Mean   :   43.49   Mean   : 6.322  
##  NA's :1179   NA's:1406   3rd Qu.:    0.00   3rd Qu.: 8.000  
##                           Max.   :15500.00   Max.   :12.000  
##                                                              
##      YrSold        SaleType    SaleCondition    SalePrice     
##  Min.   :2006   WD     :1267   Abnorml: 101   Min.   : 34900  
##  1st Qu.:2007   New    : 122   AdjLand:   4   1st Qu.:129975  
##  Median :2008   COD    :  43   Alloca :  12   Median :163000  
##  Mean   :2008   ConLD  :   9   Family :  20   Mean   :180921  
##  3rd Qu.:2009   ConLI  :   5   Normal :1198   3rd Qu.:214000  
##  Max.   :2010   ConLw  :   5   Partial: 125   Max.   :755000  
##                 (Other):   9
#Sample box plots
ggplot(train, aes(x=YearBuilt, y=SalePrice, fill=YearBuilt, group=YearBuilt)) + geom_boxplot()

train$OverallQual_factor <- factor(train$OverallQual)
ggplot(train, aes(x=OverallQual_factor, y=SalePrice, fill=OverallQual_factor)) + geom_boxplot()

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

ggplot(train, aes(x = YearBuilt, y = SalePrice)) +
    geom_point() +
    labs(
        title = "Year Built vs Sale Price"
         )

ggplot(train, aes(x = TotalBsmtSF, y = SalePrice)) +
    geom_point() +
    labs(
        title = "Total Bsmt SF vs Sale Price"
         )
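
The two plots above show each predictor against SalePrice separately. A scatterplot matrix puts every pairwise combination in one grid, which base R's `pairs()` does directly. The sketch below uses a small simulated data frame as a stand-in so that it runs without `train.csv`; the simulated values are assumptions and carry no meaning. With the real data, the equivalent call would be `pairs(train[, c("YearBuilt", "TotalBsmtSF", "SalePrice")])`.

```r
set.seed(1)                                    # assumed seed for the stand-in data
demo <- data.frame(
  YearBuilt   = sample(1900:2010, 200, replace = TRUE),
  TotalBsmtSF = round(runif(200, 0, 2000))
)
# Hypothetical price relationship, used only to give the plot some structure
demo$SalePrice <- 50000 + 500 * (demo$YearBuilt - 1900) +
  60 * demo$TotalBsmtSF + rnorm(200, sd = 20000)

# One panel per pair of columns
pairs(demo, main = "Scatterplot matrix (simulated stand-in data)")
```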

Derive a correlation matrix for any three quantitative variables in the dataset.

var <- dplyr::select(train, YearBuilt, TotalBsmtSF, OverallCond, SalePrice)

corr <- cor(var, method = "pearson", use = "complete.obs")
corr
##              YearBuilt TotalBsmtSF OverallCond   SalePrice
## YearBuilt    1.0000000   0.3914520 -0.37598320  0.52289733
## TotalBsmtSF  0.3914520   1.0000000 -0.17109751  0.61358055
## OverallCond -0.3759832  -0.1710975  1.00000000 -0.07785589
## SalePrice    0.5228973   0.6135806 -0.07785589  1.00000000
corrplot(corr,method ="color")

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.

cor.test(var$SalePrice,var$YearBuilt, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  var$SalePrice and var$YearBuilt
## t = 23.424, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4980766 0.5468619
## sample estimates:
##       cor 
## 0.5228973
cor.test(var$SalePrice,var$TotalBsmtSF, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  var$SalePrice and var$TotalBsmtSF
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5922142 0.6340846
## sample estimates:
##       cor 
## 0.6135806
cor.test(var$SalePrice,var$OverallCond, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  var$SalePrice and var$OverallCond
## t = -2.9819, df = 1458, p-value = 0.002912
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.1111272 -0.0444103
## sample estimates:
##         cor 
## -0.07785589

Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

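Every p-value is below 0.05 (and well below the 0.20 level implied by the 80% confidence intervals), and none of the intervals contains 0, so we reject the null hypothesis in each test and conclude that the true correlation is not equal to 0. SalePrice is moderately positively correlated with YearBuilt and TotalBsmtSF and very weakly negatively correlated with OverallCond.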

Familywise error is the probability of making at least one Type I error (a false positive) when performing multiple hypothesis tests. With three tests the risk is real in principle, but I would not be worried about it in this case because the p-values are so small: even the largest, 0.0029, stays below a Bonferroni-adjusted threshold of 0.05/3 ≈ 0.0167.
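
To put numbers on the familywise error question, the sketch below computes the chance of at least one false positive across three tests at the 0.20 level implied by the 80% intervals, then applies a Bonferroni adjustment to the reported p-values. Treating the tests as independent is an assumption. The two p-values printed as `< 2.2e-16` are entered as `2.2e-16`, which is an upper bound rather than the exact value.

```r
alpha <- 0.20                       # level implied by the 80% confidence intervals
k <- 3                              # number of correlation tests

# P(at least one Type I error), assuming the tests are independent
fwer <- 1 - (1 - alpha)^k
fwer

# p-values from the three cor.test() outputs (the first two are upper bounds)
p <- c(YearBuilt = 2.2e-16, TotalBsmtSF = 2.2e-16, OverallCond = 0.002912)
p_adj <- p.adjust(p, method = "bonferroni")
p_adj
p_adj < 0.05
```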

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

precisionMatrix<-solve(corr)
round(precisionMatrix,2)
##             YearBuilt TotalBsmtSF OverallCond SalePrice
## YearBuilt        1.63       -0.08        0.54     -0.76
## TotalBsmtSF     -0.08        1.65        0.18     -0.96
## OverallCond      0.54        0.18        1.21     -0.30
## SalePrice       -0.76       -0.96       -0.30      1.96
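
The statement that the diagonal of the precision matrix holds the variance inflation factors can be checked numerically. For variable j, the VIF is 1/(1 - R_j^2), where R_j^2 comes from regressing variable j on the others; with a correlation matrix this is R_j^2 = r_j' R_{-j}^{-1} r_j. The sketch below re-enters the correlation values printed earlier and uses illustrative object names.

```r
vars <- c("YearBuilt", "TotalBsmtSF", "OverallCond", "SalePrice")
corr <- matrix(c(
   1.0000000,  0.3914520, -0.37598320,  0.52289733,
   0.3914520,  1.0000000, -0.17109751,  0.61358055,
  -0.3759832, -0.1710975,  1.00000000, -0.07785589,
   0.5228973,  0.6135806, -0.07785589,  1.00000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# VIF for each variable from the R^2 of regressing it on the other three
vif_from_r2 <- sapply(seq_along(vars), function(j) {
  r  <- corr[-j, j, drop = FALSE]            # correlations of variable j with the rest
  r2 <- t(r) %*% solve(corr[-j, -j]) %*% r   # R^2 of variable j on the rest
  1 / (1 - as.numeric(r2))
})
names(vif_from_r2) <- vars

# The two rows should agree
round(rbind(diagonal = diag(solve(corr)), from_r2 = vif_from_r2), 2)
```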

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

round(corr %*% precisionMatrix, 2)
##             YearBuilt TotalBsmtSF OverallCond SalePrice
## YearBuilt           1           0           0         0
## TotalBsmtSF         0           1           0         0
## OverallCond         0           0           1         0
## SalePrice           0           0           0         1
round(precisionMatrix %*% corr, 2)
##             YearBuilt TotalBsmtSF OverallCond SalePrice
## YearBuilt           1           0           0         0
## TotalBsmtSF         0           1           0         0
## OverallCond         0           0           1         0
## SalePrice           0           0           0         1
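
The prompt also asks for an LU decomposition. A minimal base-R sketch using Doolittle elimination without pivoting follows. `lu_doolittle` is a hypothetical helper written for this illustration, and skipping pivoting is an assumption that is safe here because a positive-definite correlation matrix has nonzero pivots. The matrix values are re-entered from the printed correlation matrix.

```r
A <- matrix(c(
   1.0000000,  0.3914520, -0.37598320,  0.52289733,
   0.3914520,  1.0000000, -0.17109751,  0.61358055,
  -0.3759832, -0.1710975,  1.00000000, -0.07785589,
   0.5228973,  0.6135806, -0.07785589,  1.00000000
), nrow = 4, byrow = TRUE)

# Doolittle LU: L is unit lower triangular, U is upper triangular, A = L %*% U
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # multiplier that eliminates U[i, k]
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]   # row operation
      U[i, k] <- 0                          # remove floating-point residue
    }
  }
  list(L = L, U = U)
}

lu <- lu_doolittle(A)
round(lu$L, 2)
round(lu$U, 2)
max(abs(lu$L %*% lu$U - A))   # reconstruction error, should be near zero
```

For larger or ill-conditioned matrices, a pivoted routine from a linear algebra package would be the safer choice.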

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

Right-skewed data typically has a mean greater than the median; LotArea (mean 10,517, median 9,478) is one such variable.

train.num <- dplyr::select_if(train, is.numeric)

summary(train.num)
##        Id           MSSubClass     LotFrontage        LotArea      
##  Min.   :   1.0   Min.   : 20.0   Min.   : 21.00   Min.   :  1300  
##  1st Qu.: 365.8   1st Qu.: 20.0   1st Qu.: 59.00   1st Qu.:  7554  
##  Median : 730.5   Median : 50.0   Median : 69.00   Median :  9478  
##  Mean   : 730.5   Mean   : 56.9   Mean   : 70.05   Mean   : 10517  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   3rd Qu.: 80.00   3rd Qu.: 11602  
##  Max.   :1460.0   Max.   :190.0   Max.   :313.00   Max.   :215245  
##                                   NA's   :259                      
##   OverallQual      OverallCond      YearBuilt     YearRemodAdd 
##  Min.   : 1.000   Min.   :1.000   Min.   :1872   Min.   :1950  
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967  
##  Median : 6.000   Median :5.000   Median :1973   Median :1994  
##  Mean   : 6.099   Mean   :5.575   Mean   :1971   Mean   :1985  
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004  
##  Max.   :10.000   Max.   :9.000   Max.   :2010   Max.   :2010  
##                                                                
##    MasVnrArea       BsmtFinSF1       BsmtFinSF2        BsmtUnfSF     
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.00   Min.   :   0.0  
##  1st Qu.:   0.0   1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.: 223.0  
##  Median :   0.0   Median : 383.5   Median :   0.00   Median : 477.5  
##  Mean   : 103.7   Mean   : 443.6   Mean   :  46.55   Mean   : 567.2  
##  3rd Qu.: 166.0   3rd Qu.: 712.2   3rd Qu.:   0.00   3rd Qu.: 808.0  
##  Max.   :1600.0   Max.   :5644.0   Max.   :1474.00   Max.   :2336.0  
##  NA's   :8                                                           
##   TotalBsmtSF       X1stFlrSF      X2ndFlrSF     LowQualFinSF    
##  Min.   :   0.0   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  1st Qu.: 795.8   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##  Median : 991.5   Median :1087   Median :   0   Median :  0.000  
##  Mean   :1057.4   Mean   :1163   Mean   : 347   Mean   :  5.845  
##  3rd Qu.:1298.2   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##  Max.   :6110.0   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                                  
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr    TotRmsAbvGrd   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   : 2.000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 5.000  
##  Median :0.0000   Median :3.000   Median :1.000   Median : 6.000  
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   Mean   : 6.518  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 7.000  
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000   Max.   :14.000  
##                                                                   
##    Fireplaces     GarageYrBlt     GarageCars      GarageArea    
##  Min.   :0.000   Min.   :1900   Min.   :0.000   Min.   :   0.0  
##  1st Qu.:0.000   1st Qu.:1961   1st Qu.:1.000   1st Qu.: 334.5  
##  Median :1.000   Median :1980   Median :2.000   Median : 480.0  
##  Mean   :0.613   Mean   :1979   Mean   :1.767   Mean   : 473.0  
##  3rd Qu.:1.000   3rd Qu.:2002   3rd Qu.:2.000   3rd Qu.: 576.0  
##  Max.   :3.000   Max.   :2010   Max.   :4.000   Max.   :1418.0  
##                  NA's   :81                                     
##    WoodDeckSF      OpenPorchSF     EnclosedPorch      X3SsnPorch    
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median :  0.00   Median : 25.00   Median :  0.00   Median :  0.00  
##  Mean   : 94.24   Mean   : 46.66   Mean   : 21.95   Mean   :  3.41  
##  3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00   3rd Qu.:  0.00  
##  Max.   :857.00   Max.   :547.00   Max.   :552.00   Max.   :508.00  
##                                                                     
##   ScreenPorch        PoolArea          MiscVal             MoSold      
##  Min.   :  0.00   Min.   :  0.000   Min.   :    0.00   Min.   : 1.000  
##  1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:    0.00   1st Qu.: 5.000  
##  Median :  0.00   Median :  0.000   Median :    0.00   Median : 6.000  
##  Mean   : 15.06   Mean   :  2.759   Mean   :   43.49   Mean   : 6.322  
##  3rd Qu.:  0.00   3rd Qu.:  0.000   3rd Qu.:    0.00   3rd Qu.: 8.000  
##  Max.   :480.00   Max.   :738.000   Max.   :15500.00   Max.   :12.000  
##                                                                        
##      YrSold       SalePrice     
##  Min.   :2006   Min.   : 34900  
##  1st Qu.:2007   1st Qu.:129975  
##  Median :2008   Median :163000  
##  Mean   :2008   Mean   :180921  
##  3rd Qu.:2009   3rd Qu.:214000  
##  Max.   :2010   Max.   :755000  
## 
hist(train.num$LotArea, col='blue', breaks=20)

Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).

fit <- fitdistr(train.num$LotArea, densfun = "exponential")

Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).

lamda <- fit$estimate
lamda
##        rate 
## 9.50857e-05
sample <- rexp(1000, lamda)
summary(sample)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     7.04  3143.47  7098.76 10424.38 14398.37 64966.61

Plot a histogram and compare it with a histogram of your original variable.

hist(sample, col = 'blue', breaks = 20)

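The simulated sample has the shape of an exponential distribution by construction, with its highest density near zero. The original LotArea values instead start at 1,300 and cluster around the median of about 9,500 with a long right tail, so the exponential is only a rough fit.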

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

qexp(c(0.05,0.95), lamda)
## [1]   539.4428 31505.6013
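
The percentiles returned by qexp can also be derived by hand: inverting the exponential CDF \(F(x) = 1 - e^{-\lambda x}\) gives \(x_p = -\ln(1-p)/\lambda\). A minimal sketch, assuming the rate is copied from the rounded value printed above (so the last digits may differ slightly from the qexp output):

```r
# Rate as printed by fitdistr above (rounded)
rate <- 9.50857e-05
p <- c(0.05, 0.95)

# Inverse of the exponential CDF, F(x) = 1 - exp(-rate * x)
manual  <- -log(1 - p) / rate
builtin <- qexp(p, rate)

manual
builtin
```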

Also generate a 95% confidence interval from the empirical data, assuming normality.

qnorm(c(0.025, 0.975), mean = mean(train.num$LotArea), sd = sd(train.num$LotArea))
## [1] -9046.092 30079.748
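
The qnorm call above gives the range that would contain the central 95% of individual values under a normal model. If a confidence interval for the mean is intended instead, the standard error \(s/\sqrt{n}\) replaces the standard deviation. The sketch below contrasts the two on simulated data; `area` is a hypothetical stand-in for train.num$LotArea, and the choice of interpretation is an assumption about the question.

```r
set.seed(1)
area <- rexp(1460, rate = 1e-4)    # toy stand-in for train.num$LotArea

m <- mean(area)
s <- sd(area)
n <- length(area)

# Central 95% of a normal distribution with the sample mean and sd
spread_interval <- qnorm(c(0.025, 0.975), mean = m, sd = s)

# 95% confidence interval for the mean, based on the standard error
mean_ci <- m + qnorm(c(0.025, 0.975)) * s / sqrt(n)

spread_interval
mean_ci
```

The interval for the mean is much narrower because the standard error shrinks with the square root of the sample size.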

Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

quantile(train.num$LotArea, c(0.05, 0.95))
##       5%      95% 
##  3311.70 17401.15

The three results disagree markedly. The exponential model places the 5th and 95th percentiles at about 539 and 31,506, while the empirical percentiles are 3,312 and 17,401, so the observed lot areas are far more concentrated than an exponential distribution with the same mean. The normal-based interval has a lower bound of about -9,046, which is impossible for an area. Neither parametric model describes LotArea well, and the empirical percentiles are the most trustworthy summary here.

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

cor_mat <- cor(train.num, method = "pearson", use = "complete.obs")
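
One way to use a correlation matrix when choosing predictors is to rank variables by their absolute correlation with the response. This is a sketch only: the built-in mtcars data stands in for train.num, mpg stands in for SalePrice, and the object names are hypothetical.

```r
# mtcars stands in for train.num, and mpg for SalePrice
toy_cor <- cor(mtcars, method = "pearson", use = "complete.obs")

# Correlation of every variable with the response, excluding the response itself
cor_with_response <- toy_cor[, "mpg"]
cor_with_response <- cor_with_response[names(cor_with_response) != "mpg"]

# Rank by absolute correlation, strongest first
ranked <- sort(abs(cor_with_response), decreasing = TRUE)
head(ranked, 5)
```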

lm0 <- lm(SalePrice ~ OverallQual + YearBuilt + YearRemodAdd + MasVnrArea + X1stFlrSF + TotalBsmtSF + GrLivArea + FullBath + TotRmsAbvGrd + GarageCars + GarageArea + BsmtFinSF1 + LotArea + Fireplaces + BedroomAbvGr, data = train.num)

summary(lm0)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + YearRemodAdd + 
##     MasVnrArea + X1stFlrSF + TotalBsmtSF + GrLivArea + FullBath + 
##     TotRmsAbvGrd + GarageCars + GarageArea + BsmtFinSF1 + LotArea + 
##     Fireplaces + BedroomAbvGr, data = train.num)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -531792  -16783   -1389   14782  286886 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.130e+06  1.250e+05  -9.040  < 2e-16 ***
## OverallQual   1.841e+04  1.180e+03  15.605  < 2e-16 ***
## YearBuilt     2.015e+02  4.894e+01   4.116 4.07e-05 ***
## YearRemodAdd  3.407e+02  6.180e+01   5.513 4.17e-08 ***
## MasVnrArea    2.956e+01  6.079e+00   4.863 1.28e-06 ***
## X1stFlrSF     5.553e+00  4.794e+00   1.158 0.246885    
## TotalBsmtSF   1.165e+01  4.250e+00   2.740 0.006218 ** 
## GrLivArea     4.006e+01  4.225e+00   9.484  < 2e-16 ***
## FullBath     -6.172e+02  2.616e+03  -0.236 0.813537    
## TotRmsAbvGrd  4.036e+03  1.219e+03   3.312 0.000948 ***
## GarageCars    9.847e+03  2.921e+03   3.372 0.000767 ***
## GarageArea    7.820e+00  9.929e+00   0.788 0.431071    
## BsmtFinSF1    1.625e+01  2.575e+00   6.312 3.67e-10 ***
## LotArea       5.423e-01  1.027e-01   5.281 1.48e-07 ***
## Fireplaces    6.478e+03  1.784e+03   3.632 0.000292 ***
## BedroomAbvGr -7.114e+03  1.710e+03  -4.162 3.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36020 on 1436 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.7958, Adjusted R-squared:  0.7936 
## F-statistic:   373 on 15 and 1436 DF,  p-value: < 2.2e-16
lm1 <- lm(SalePrice ~  OverallQual+ YearBuilt+YearRemodAdd+MasVnrArea+TotalBsmtSF+ GrLivArea+TotRmsAbvGrd+GarageCars+ BsmtFinSF1+ LotArea+Fireplaces+ BedroomAbvGr,  data = train.num)

summary(lm1)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + YearRemodAdd + 
##     MasVnrArea + TotalBsmtSF + GrLivArea + TotRmsAbvGrd + GarageCars + 
##     BsmtFinSF1 + LotArea + Fireplaces + BedroomAbvGr, data = train.num)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -530612  -17143   -1214   14877  285552 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.109e+06  1.173e+05  -9.452  < 2e-16 ***
## OverallQual   1.821e+04  1.167e+03  15.601  < 2e-16 ***
## YearBuilt     1.944e+02  4.633e+01   4.196 2.88e-05 ***
## YearRemodAdd  3.376e+02  6.142e+01   5.497 4.56e-08 ***
## MasVnrArea    2.977e+01  6.066e+00   4.907 1.03e-06 ***
## TotalBsmtSF   1.543e+01  3.029e+00   5.095 3.96e-07 ***
## GrLivArea     4.104e+01  3.957e+00  10.371  < 2e-16 ***
## TotRmsAbvGrd  4.093e+03  1.213e+03   3.375 0.000759 ***
## GarageCars    1.187e+04  1.734e+03   6.844 1.14e-11 ***
## BsmtFinSF1    1.662e+01  2.538e+00   6.549 8.06e-11 ***
## LotArea       5.512e-01  1.024e-01   5.380 8.67e-08 ***
## Fireplaces    6.581e+03  1.755e+03   3.750 0.000184 ***
## BedroomAbvGr -7.451e+03  1.686e+03  -4.418 1.07e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36010 on 1439 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.7955, Adjusted R-squared:  0.7938 
## F-statistic: 466.4 on 12 and 1439 DF,  p-value: < 2.2e-16

Let’s check whether data transformations can improve the model.

hist(train.num$YearBuilt, breaks=30) #left skewed

hist(train.num$MasVnrArea, breaks=30) #right skewed

hist(train.num$X1stFlrSF, breaks=30) #right skewed

hist(train.num$GrLivArea, breaks=30) #right skewed

yrblt2 <- (train.num$YearBuilt)^2
logMasVnrArea <- log1p(train.num$MasVnrArea) # log1p(x) = log(1 + x) stays finite when the area is 0; not used in the models below
log1stFlr <- log(train.num$X1stFlrSF)
logGrLivArea <- log(train.num$GrLivArea)

lm2 <- lm(train.num$SalePrice ~  train.num$OverallQual+ yrblt2+train.num$YearRemodAdd+train.num$MasVnrArea+log1stFlr+train.num$TotalBsmtSF+ logGrLivArea+ train.num$FullBath+train.num$GarageCars+train.num$BsmtFinSF1+ train.num$LotArea+train.num$Fireplaces+ train.num$BedroomAbvGr)

summary(lm2)
## 
## Call:
## lm(formula = train.num$SalePrice ~ train.num$OverallQual + yrblt2 + 
##     train.num$YearRemodAdd + train.num$MasVnrArea + log1stFlr + 
##     train.num$TotalBsmtSF + logGrLivArea + train.num$FullBath + 
##     train.num$GarageCars + train.num$BsmtFinSF1 + train.num$LotArea + 
##     train.num$Fireplaces + train.num$BedroomAbvGr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -434893  -18372   -2398   15465  335138 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -1.362e+06  1.252e+05 -10.876  < 2e-16 ***
## train.num$OverallQual   1.954e+04  1.225e+03  15.956  < 2e-16 ***
## yrblt2                  2.855e-02  1.261e-02   2.264 0.023701 *  
## train.num$YearRemodAdd  3.649e+02  6.352e+01   5.744 1.13e-08 ***
## train.num$MasVnrArea    3.792e+01  6.192e+00   6.124 1.18e-09 ***
## log1stFlr               1.550e+04  5.408e+03   2.866 0.004220 ** 
## train.num$TotalBsmtSF   9.617e+00  4.091e+00   2.351 0.018875 *  
## logGrLivArea            5.943e+04  5.876e+03  10.114  < 2e-16 ***
## train.num$FullBath      2.632e+03  2.674e+03   0.984 0.325241    
## train.num$GarageCars    1.161e+04  1.806e+03   6.428 1.75e-10 ***
## train.num$BsmtFinSF1    1.858e+01  2.612e+00   7.114 1.78e-12 ***
## train.num$LotArea       5.800e-01  1.052e-01   5.513 4.17e-08 ***
## train.num$Fireplaces    6.457e+03  1.837e+03   3.514 0.000455 ***
## train.num$BedroomAbvGr -3.159e+03  1.582e+03  -1.997 0.045989 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37010 on 1438 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.784,  Adjusted R-squared:  0.7821 
## F-statistic: 401.6 on 13 and 1438 DF,  p-value: < 2.2e-16
lm3 <- lm(train.num$SalePrice ~  train.num$OverallQual+ yrblt2+train.num$YearRemodAdd+train.num$MasVnrArea+log1stFlr+train.num$TotalBsmtSF+ logGrLivArea+train.num$GarageCars+train.num$BsmtFinSF1+ train.num$LotArea+train.num$Fireplace)

summary(lm3)
## 
## Call:
## lm(formula = train.num$SalePrice ~ train.num$OverallQual + yrblt2 + 
##     train.num$YearRemodAdd + train.num$MasVnrArea + log1stFlr + 
##     train.num$TotalBsmtSF + logGrLivArea + train.num$GarageCars + 
##     train.num$BsmtFinSF1 + train.num$LotArea + train.num$Fireplace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -433535  -18574   -2246   15207  336037 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -1.401e+06  1.183e+05 -11.845  < 2e-16 ***
## train.num$OverallQual   1.998e+04  1.208e+03  16.538  < 2e-16 ***
## yrblt2                  3.150e-02  1.203e-02   2.618 0.008928 ** 
## train.num$YearRemodAdd  3.868e+02  6.269e+01   6.169 8.89e-10 ***
## train.num$MasVnrArea    3.777e+01  6.196e+00   6.096 1.40e-09 ***
## log1stFlr               1.623e+04  5.402e+03   3.004 0.002707 ** 
## train.num$TotalBsmtSF   9.428e+00  4.093e+00   2.304 0.021386 *  
## logGrLivArea            5.539e+04  4.375e+03  12.659  < 2e-16 ***
## train.num$GarageCars    1.193e+04  1.802e+03   6.620 5.04e-11 ***
## train.num$BsmtFinSF1    1.889e+01  2.577e+00   7.333 3.74e-13 ***
## train.num$LotArea       5.762e-01  1.052e-01   5.475 5.15e-08 ***
## train.num$Fireplace     6.788e+03  1.817e+03   3.736 0.000194 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37040 on 1440 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.7833, Adjusted R-squared:  0.7817 
## F-statistic: 473.3 on 11 and 1440 DF,  p-value: < 2.2e-16
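
lm2 and lm3 are fitted on standalone vectors such as yrblt2 and logGrLivArea, so predict() cannot rebuild those terms from a new data frame. An alternative sketch writes the transformations inside the formula and passes `data =`; the mtcars data and the variable choices here are illustrative assumptions, not the housing model.

```r
# mtcars stands in for the housing data; variables are illustrative only
train_toy <- mtcars[1:24, ]
new_toy   <- mtcars[25:32, ]

# log() and I(x^2) are evaluated inside the formula, so predict() can
# recompute them from the raw columns of any new data frame
fit_toy <- lm(mpg ~ log(disp) + I(hp^2) + wt, data = train_toy)

pred_toy <- predict(fit_toy, newdata = new_toy)
length(pred_toy)
```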

I chose lm1 as the best model because it had the highest adjusted \(R^2\) (0.7938) while using fewer predictor variables than the full model.
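
Multiple \(R^2\) never decreases when predictors are added, so adjusted \(R^2\) or an information criterion such as AIC is a fairer basis for comparing models of different size. A minimal sketch of such a comparison, using the built-in mtcars data and two hypothetical models rather than the housing models above:

```r
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Collect the fit statistics of both models side by side
comparison <- data.frame(
  model      = c("full", "reduced"),
  predictors = c(length(coef(full)) - 1, length(coef(reduced)) - 1),
  r2         = c(summary(full)$r.squared, summary(reduced)$r.squared),
  adj_r2     = c(summary(full)$adj.r.squared, summary(reduced)$adj.r.squared),
  aic        = c(AIC(full), AIC(reduced))
)
comparison
```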

Now let’s evaluate the quality of the model.

plot(lm1$fitted.values, lm1$residuals, xlab='Fitted Values', ylab='Residuals')
abline(0,0)

hist(lm1$residuals)

qqnorm(lm1$residuals)
qqline(lm1$residuals)

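The residuals appear to be roughly normally distributed around 0, except for some deviation in the upper tail. The residual summary for lm1 (a minimum of -530,612 against a maximum of 285,552) also points to at least one large negative outlier.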

To improve this model I would evaluate and remove outlier data points if appropriate by looking at Cook’s distance.
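
The Cook's distance check could look roughly like the sketch below. It uses the built-in mtcars data and a hypothetical model; the 4/n cutoff is a common rule of thumb and an assumption here, not a value taken from this analysis.

```r
fit_toy <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation in the fit
cooks <- cooks.distance(fit_toy)
threshold <- 4 / nrow(mtcars)      # rule-of-thumb cutoff (assumption)

influential <- names(cooks)[cooks > threshold]
influential

# Refit without the flagged rows and compare the coefficients
keep <- !(rownames(mtcars) %in% influential)
fit_trimmed <- lm(mpg ~ wt + hp, data = mtcars[keep, ])
rbind(all_rows = coef(fit_toy), trimmed = coef(fit_trimmed))
```

Flagged points deserve inspection before removal; a large Cook's distance marks an influential observation, not necessarily an erroneous one.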

Predict

test <- read.csv("test.csv", header = TRUE)

head(test)
##     Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1461         20       RH          80   11622   Pave  <NA>      Reg
## 2 1462         20       RL          81   14267   Pave  <NA>      IR1
## 3 1463         60       RL          74   13830   Pave  <NA>      IR1
## 4 1464         60       RL          78    9978   Pave  <NA>      IR1
## 5 1465        120       RL          43    5005   Pave  <NA>      IR1
## 6 1466         60       RL          75   10000   Pave  <NA>      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Inside       Gtl        NAmes      Feedr
## 2         Lvl    AllPub    Corner       Gtl        NAmes       Norm
## 3         Lvl    AllPub    Inside       Gtl      Gilbert       Norm
## 4         Lvl    AllPub    Inside       Gtl      Gilbert       Norm
## 5         HLS    AllPub    Inside       Gtl      StoneBr       Norm
## 6         Lvl    AllPub    Corner       Gtl      Gilbert       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm     1Fam     1Story           5           6      1961
## 2       Norm     1Fam     1Story           6           6      1958
## 3       Norm     1Fam     2Story           5           5      1997
## 4       Norm     1Fam     2Story           6           6      1998
## 5       Norm   TwnhsE     1Story           8           5      1992
## 6       Norm     1Fam     2Story           6           5      1993
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         1961     Gable  CompShg     VinylSd     VinylSd       None
## 2         1958       Hip  CompShg     Wd Sdng     Wd Sdng    BrkFace
## 3         1998     Gable  CompShg     VinylSd     VinylSd       None
## 4         1998     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 5         1992     Gable  CompShg     HdBoard     HdBoard       None
## 6         1994     Gable  CompShg     HdBoard     HdBoard       None
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1          0        TA        TA     CBlock       TA       TA           No
## 2        108        TA        TA     CBlock       TA       TA           No
## 3          0        TA        TA      PConc       Gd       TA           No
## 4         20        TA        TA      PConc       TA       TA           No
## 5          0        Gd        TA      PConc       Gd       TA           No
## 6          0        TA        TA      PConc       Gd       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1          Rec        468          LwQ        144       270         882
## 2          ALQ        923          Unf          0       406        1329
## 3          GLQ        791          Unf          0       137         928
## 4          GLQ        602          Unf          0       324         926
## 5          ALQ        263          Unf          0      1017        1280
## 6          Unf          0          Unf          0       763         763
##   Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1    GasA        TA          Y      SBrkr       896         0            0
## 2    GasA        TA          Y      SBrkr      1329         0            0
## 3    GasA        Gd          Y      SBrkr       928       701            0
## 4    GasA        Ex          Y      SBrkr       926       678            0
## 5    GasA        Ex          Y      SBrkr      1280         0            0
## 6    GasA        Gd          Y      SBrkr       763       892            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1       896            0            0        1        0            2
## 2      1329            0            0        1        1            3
## 3      1629            0            0        2        1            3
## 4      1604            0            0        2        1            3
## 5      1280            0            0        2        0            2
## 6      1655            0            0        2        1            3
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            1          TA            5        Typ          0        <NA>
## 2            1          Gd            6        Typ          0        <NA>
## 3            1          TA            6        Typ          1          TA
## 4            1          Gd            7        Typ          1          Gd
## 5            1          Gd            5        Typ          0        <NA>
## 6            1          TA            7        Typ          1          TA
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1     Attchd        1961          Unf          1        730         TA
## 2     Attchd        1958          Unf          1        312         TA
## 3     Attchd        1997          Fin          2        482         TA
## 4     Attchd        1998          Fin          2        470         TA
## 5     Attchd        1992          RFn          2        506         TA
## 6     Attchd        1993          Fin          2        440         TA
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1         TA          Y        140           0             0          0
## 2         TA          Y        393          36             0          0
## 3         TA          Y        212          34             0          0
## 4         TA          Y        360          36             0          0
## 5         TA          Y          0          82             0          0
## 6         TA          Y        157          84             0          0
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1         120        0   <NA> MnPrv        <NA>       0      6   2010
## 2           0        0   <NA>  <NA>        Gar2   12500      6   2010
## 3           0        0   <NA> MnPrv        <NA>       0      3   2010
## 4           0        0   <NA>  <NA>        <NA>       0      6   2010
## 5         144        0   <NA>  <NA>        <NA>       0      1   2010
## 6           0        0   <NA>  <NA>        <NA>       0      4   2010
##   SaleType SaleCondition
## 1       WD        Normal
## 2       WD        Normal
## 3       WD        Normal
## 4       WD        Normal
## 5       WD        Normal
## 6       WD        Normal
# Clean data: replace NA with 0 (numeric column) or "0" (categorical columns).
# Indexing on the left-hand side overwrites only the missing entries.
test$LotFrontage[is.na(test$LotFrontage)] <- 0
for (col in c("Alley", "FireplaceQu", "PoolQC", "Fence", "MiscFeature")) {
  test[[col]] <- as.character(test[[col]])
  test[[col]][is.na(test[[col]])] <- "0"
}

head(test)
##     Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1461         20       RH          80   11622   Pave     0      Reg
## 2 1462         20       RL          81   14267   Pave     0      IR1
## 3 1463         60       RL          74   13830   Pave     0      IR1
## 4 1464         60       RL          78    9978   Pave     0      IR1
## 5 1465        120       RL          43    5005   Pave     0      IR1
## 6 1466         60       RL          75   10000   Pave     0      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Inside       Gtl        NAmes      Feedr
## 2         Lvl    AllPub    Corner       Gtl        NAmes       Norm
## 3         Lvl    AllPub    Inside       Gtl      Gilbert       Norm
## 4         Lvl    AllPub    Inside       Gtl      Gilbert       Norm
## 5         HLS    AllPub    Inside       Gtl      StoneBr       Norm
## 6         Lvl    AllPub    Corner       Gtl      Gilbert       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm     1Fam     1Story           5           6      1961
## 2       Norm     1Fam     1Story           6           6      1958
## 3       Norm     1Fam     2Story           5           5      1997
## 4       Norm     1Fam     2Story           6           6      1998
## 5       Norm   TwnhsE     1Story           8           5      1992
## 6       Norm     1Fam     2Story           6           5      1993
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         1961     Gable  CompShg     VinylSd     VinylSd       None
## 2         1958       Hip  CompShg     Wd Sdng     Wd Sdng    BrkFace
## 3         1998     Gable  CompShg     VinylSd     VinylSd       None
## 4         1998     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 5         1992     Gable  CompShg     HdBoard     HdBoard       None
## 6         1994     Gable  CompShg     HdBoard     HdBoard       None
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1          0        TA        TA     CBlock       TA       TA           No
## 2        108        TA        TA     CBlock       TA       TA           No
## 3          0        TA        TA      PConc       Gd       TA           No
## 4         20        TA        TA      PConc       TA       TA           No
## 5          0        Gd        TA      PConc       Gd       TA           No
## 6          0        TA        TA      PConc       Gd       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1          Rec        468          LwQ        144       270         882
## 2          ALQ        923          Unf          0       406        1329
## 3          GLQ        791          Unf          0       137         928
## 4          GLQ        602          Unf          0       324         926
## 5          ALQ        263          Unf          0      1017        1280
## 6          Unf          0          Unf          0       763         763
##   Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1    GasA        TA          Y      SBrkr       896         0            0
## 2    GasA        TA          Y      SBrkr      1329         0            0
## 3    GasA        Gd          Y      SBrkr       928       701            0
## 4    GasA        Ex          Y      SBrkr       926       678            0
## 5    GasA        Ex          Y      SBrkr      1280         0            0
## 6    GasA        Gd          Y      SBrkr       763       892            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1       896            0            0        1        0            2
## 2      1329            0            0        1        1            3
## 3      1629            0            0        2        1            3
## 4      1604            0            0        2        1            3
## 5      1280            0            0        2        0            2
## 6      1655            0            0        2        1            3
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            1          TA            5        Typ          0           0
## 2            1          Gd            6        Typ          0           0
## 3            1          TA            6        Typ          1          TA
## 4            1          Gd            7        Typ          1          Gd
## 5            1          Gd            5        Typ          0           0
## 6            1          TA            7        Typ          1          TA
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1     Attchd        1961          Unf          1        730         TA
## 2     Attchd        1958          Unf          1        312         TA
## 3     Attchd        1997          Fin          2        482         TA
## 4     Attchd        1998          Fin          2        470         TA
## 5     Attchd        1992          RFn          2        506         TA
## 6     Attchd        1993          Fin          2        440         TA
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1         TA          Y        140           0             0          0
## 2         TA          Y        393          36             0          0
## 3         TA          Y        212          34             0          0
## 4         TA          Y        360          36             0          0
## 5         TA          Y          0          82             0          0
## 6         TA          Y        157          84             0          0
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1         120        0      0 MnPrv           0       0      6   2010
## 2           0        0      0     0        Gar2   12500      6   2010
## 3           0        0      0 MnPrv           0       0      3   2010
## 4           0        0      0     0           0       0      6   2010
## 5         144        0      0     0           0       0      1   2010
## 6           0        0      0     0           0       0      4   2010
##   SaleType SaleCondition
## 1       WD        Normal
## 2       WD        Normal
## 3       WD        Normal
## 4       WD        Normal
## 5       WD        Normal
## 6       WD        Normal
results <- predict(lm1, test)
resultsDf <- data.frame(Id = test$Id, SalePrice = results)

head(resultsDf, 10)
##      Id SalePrice
## 1  1461  107770.3
## 2  1462  157928.8
## 3  1463  179728.0
## 4  1464  196500.4
## 5  1465  205609.9
## 6  1466  183168.0
## 7  1467  178157.0
## 8  1468  177213.7
## 9  1469  204793.9
## 10 1470  105355.7
#write.csv(resultsDf, file="kaggle_submission.csv", row.names=FALSE, na="0")
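
predict() returns NA for any row in which a predictor used by the model is missing, and the commented write.csv call would write such predictions as 0 because of `na="0"`. If any lm1 predictor is missing in test.csv, filling it before predicting avoids that. The sketch below uses the built-in mtcars data with hypothetical names, and median imputation is an assumed strategy, not the method used in this report.

```r
train_toy <- mtcars[1:24, ]
test_toy  <- mtcars[25:32, ]
test_toy$wt[2] <- NA               # simulate a missing predictor

fit_toy <- lm(mpg ~ wt + hp, data = train_toy)

# Rows with a missing predictor produce NA predictions
raw_pred <- predict(fit_toy, newdata = test_toy)
sum(is.na(raw_pred))

# Fill each missing predictor with the training median (assumed strategy)
for (col in c("wt", "hp")) {
  missing <- is.na(test_toy[[col]])
  test_toy[[col]][missing] <- median(train_toy[[col]], na.rm = TRUE)
}

filled_pred <- predict(fit_toy, newdata = test_toy)
sum(is.na(filled_pred))
```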

Kaggle username: everska
Score: 1.13434

YouTube presentation:
https://youtu.be/3G8eTTgMvJY