Kaggle Competition: House Prices: Advanced Regression Techniques

username: Emahayz

Score: 0.6294

0.2 Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu = \sigma = (N+1)/2\).

R offers several ways to generate random numbers. One simple option is the runif function, which draws a stated number of values between two end points (but not the end points themselves). It samples from the continuous uniform distribution, so every sub-interval of a given length between the two end points is equally likely to be hit.
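A minimal sketch of how X and Y might be generated. The seed is my own assumption, and N = 6 is inferred from the summaries in this report (the maximum of X is just under 6); neither is stated in the original code.

```r
set.seed(42)   # assumed seed, only so the sketch is reproducible
N <- 6         # assumed choice; the task allows any N >= 6
n <- 10000

# X: continuous uniform on (1, N)
X <- runif(n, min = 1, max = N)

# Y: normal with mean and standard deviation both equal to (N + 1) / 2
Y <- rnorm(n, mean = (N + 1) / 2, sd = (N + 1) / 2)

summary(X)
summary(Y)
```

With a different seed the summaries will differ slightly from the ones reported here.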

Obtain Summaries for X and Y

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.001   2.254   3.487   3.505   4.760   5.999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -9.703   1.153   3.453   3.517   5.899  19.148

0.2.1 Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

a. P(X>x | X>y)

This is a conditional probability: the probability that event A occurs given that event B has occurred, \(P(A \mid B) = P(A \cap B) / P(B)\).

## [1] 0.5153577

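Therefore, P(X>x | X>y) ≈ 0.5154. Given that X exceeds the first quartile of Y, there is about a 51.5% chance that X also exceeds its own median.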

b. P(X>x, Y>y)

This is a joint probability: the probability that events A and B both occur.

## [1] 0.3778

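Therefore, P(X>x, Y>y) = 0.3778. About 37.8% of the simulated pairs have X above its median and Y above its first quartile at the same time.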

c. P(X<x | X>y)

This is also a conditional probability: the probability that A occurs given that B has occurred.

## [1] 0.4846423

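Therefore, P(X<x | X>y) ≈ 0.4846. Given that X exceeds the first quartile of Y, there is about a 48.5% chance that X falls below its own median.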
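The three probabilities can be estimated from relative frequencies. The following is a sketch under assumptions (seed and N = 6 are mine, and the variable names `p_a`, `p_b`, `p_c` are illustrative), not the code that produced the numbers above.

```r
set.seed(42)                       # assumed seed
N <- 6                             # assumed N
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)

x <- median(X)                     # small x: median of X
y <- unname(quantile(Y, 0.25))     # small y: first quartile of Y

# P(A | B) = P(A and B) / P(B), estimated by counting
p_a <- sum(X > x & X > y) / sum(X > y)   # a. P(X > x | X > y)
p_b <- mean(X > x & Y > y)               # b. P(X > x, Y > y)
p_c <- sum(X < x & X > y) / sum(X > y)   # c. P(X < x | X > y)

round(c(a = p_a, b = p_b, c = p_c), 4)
```

Because a and c condition on the same event and split it into two complementary parts, they sum to 1, just as 0.5153577 + 0.4846423 = 1 in the output above.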

5 points. Investigate whether P(X>x and Y>y) = P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

##          P(X>x) P(X<x) Marginal
## P(Y>y)    0.375  0.375     0.75
## P(Y<y)    0.125  0.125     0.25
## Marginal  0.500  0.500     1.00

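P(X>x) = 0.50 and P(Y>y) = 0.75, so P(X>x)*P(Y>y) = 0.50 * 0.75 = 0.375, which equals the joint probability P(X>x and Y>y) = 0.375 in the table. The simulated joint probability from part b (0.3778) is close to this value.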

Therefore, P(X>x and Y>y) = P(X>x)P(Y>y)
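A sketch of building the joint and marginal table directly from the simulated counts, then comparing each cell with the product of its marginals. The seed, N = 6, and the row and column labels are assumptions for illustration.

```r
set.seed(42)                       # assumed seed
N <- 6                             # assumed N
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)
x <- median(X)
y <- unname(quantile(Y, 0.25))

# 2 x 2 table of counts, then joint probabilities
counts <- table(Y = ifelse(Y > y, "Y>y", "Y<=y"),
                X = ifelse(X > x, "X>x", "X<=x"))
joint <- prop.table(counts)
addmargins(joint)                  # appends the marginal row and column

# Under independence each cell equals the product of its marginals
expected <- outer(rowSums(joint), colSums(joint))
round(joint - expected, 4)         # differences should be near zero
```

Small non-zero differences are expected from sampling noise alone.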

5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

## Warning in fisher.test(indp): 'x' has been rounded to integer: Mean
## relative difference: 0.75
## 
##  Fisher's Exact Test for Count Data
## 
## data:  indp
## p-value = 1
## alternative hypothesis: two.sided
## Warning in chisq.test(indp): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  indp
## X-squared = 0, df = 2, p-value = 1

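Both tests return a p-value of 1, so neither rejects the null hypothesis that the two events are independent. Fisher's exact test computes an exact p-value and is suited to small cell counts; the chi-square test relies on a large-sample approximation and is the usual choice with 10,000 observations. The warnings above indicate that the tests were run on a table of proportions rather than integer counts, so these p-values should be confirmed on the count table.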
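A sketch of running both tests on integer counts instead of proportions, which avoids the rounding warning shown in the output. The seed and N = 6 are assumptions, and `fisher_res` and `chisq_res` are illustrative names.

```r
set.seed(42)                       # assumed seed
N <- 6                             # assumed N
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)
x <- median(X)
y <- unname(quantile(Y, 0.25))

counts <- table(X > x, Y > y)      # 2 x 2 table of integer counts

fisher_res <- fisher.test(counts)                  # exact test
chisq_res  <- chisq.test(counts, correct = FALSE)  # large-sample approximation

c(fisher = fisher_res$p.value, chisq = chisq_res$p.value)
```

With cell counts in the thousands the two p-values should be close to each other; a large p-value means independence is not rejected.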

0.3 Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

0.3.1 About the Data

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.

0.4 Load the training and test data sets

I will use the training data to fit the model and the test data to generate the predictions that are submitted to Kaggle.

0.5 Data Preparation

In this section, I will prepare the dataset for multiple regression modeling. I will consider the assumptions of linear regression based on ordinary least squares algorithms – particularly regarding linearity, normality, homoscedasticity, and measurement level.

0.5.1 Missing Values

Looking for missing values

##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##             0             0             0           259             0 
##        Street         Alley      LotShape   LandContour     Utilities 
##             0          1369             0             0             0 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##             0             0             0             0             0 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##             0             0             0             0             0 
##  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
##             0             0             0             0             0 
##    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
##             8             8             0             0             0 
##      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
##            37            37            38            37             0 
##  BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##            38             0             0             0             0 
##     HeatingQC    CentralAir    Electrical     X1stFlrSF     X2ndFlrSF 
##             0             0             1             0             0 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##             0             0             0             0             0 
##      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
##             0             0             0             0             0 
##    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
##             0             0           690            81            81 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##            81             0             0            81            81 
##    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch 
##             0             0             0             0             0 
##   ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
##             0             0          1453          1179          1406 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##             0             0             0             0             0 
##     SalePrice 
##             0
##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##             0             0             4           227             0 
##        Street         Alley      LotShape   LandContour     Utilities 
##             0          1352             0             0             2 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##             0             0             0             0             0 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##             0             0             0             0             0 
##  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
##             0             0             0             1             1 
##    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
##            16            15             0             0             0 
##      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
##            44            45            44            42             1 
##  BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##            42             1             1             1             0 
##     HeatingQC    CentralAir    Electrical     X1stFlrSF     X2ndFlrSF 
##             0             0             0             0             0 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##             0             0             2             2             0 
##      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
##             0             0             0             1             0 
##    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
##             2             0           730            76            78 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##            78             1             1            78            78 
##    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch 
##             0             0             0             0             0 
##   ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
##             0             0          1456          1169          1408 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##             0             0             0             1             0

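Both data sets contain missing values. The largest counts are in PoolQC, MiscFeature, Alley, Fence, FireplaceQu and LotFrontage, in both the training and the test data.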
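Per-column counts like the ones above can be obtained with `colSums(is.na(...))`. The sketch below uses a tiny hypothetical data frame (`ames_demo`, with made-up rows) in place of the real training data.

```r
# Hypothetical stand-in for the training data, with a few NAs
ames_demo <- data.frame(
  LotFrontage = c(65, NA, 68, 60, NA),
  Alley       = c(NA, NA, "Grvl", NA, NA),
  LotArea     = c(8450, 9600, 11250, 9550, 14260)
)

na_counts <- colSums(is.na(ames_demo))       # number of NAs per column
na_counts[na_counts > 0]                     # only the columns with gaps
round(100 * colMeans(is.na(ames_demo)), 1)   # same, as a percentage of rows
```

Looking at the percentage as well as the count helps decide whether a column is worth imputing or should be dropped.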

5 points. Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set.

##     Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1461         20       RH          80   11622   Pave  <NA>      Reg
## 2 1462         20       RL          81   14267   Pave  <NA>      IR1
## 3 1463         60       RL          74   13830   Pave  <NA>      IR1
## 4 1464         60       RL          78    9978   Pave  <NA>      IR1
## 5 1465        120       RL          43    5005   Pave  <NA>      IR1
## 6 1466         60       RL          75   10000   Pave  <NA>      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Inside       Gtl        NAmes      Feedr
## 2         Lvl    AllPub    Corner       Gtl        NAmes       Norm
## 3         Lvl    AllPub    Inside       Gtl      Gilbert       Norm
## 4         Lvl    AllPub    Inside       Gtl      Gilbert       Norm
## 5         HLS    AllPub    Inside       Gtl      StoneBr       Norm
## 6         Lvl    AllPub    Corner       Gtl      Gilbert       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm     1Fam     1Story           5           6      1961
## 2       Norm     1Fam     1Story           6           6      1958
## 3       Norm     1Fam     2Story           5           5      1997
## 4       Norm     1Fam     2Story           6           6      1998
## 5       Norm   TwnhsE     1Story           8           5      1992
## 6       Norm     1Fam     2Story           6           5      1993
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         1961     Gable  CompShg     VinylSd     VinylSd       None
## 2         1958       Hip  CompShg     Wd Sdng     Wd Sdng    BrkFace
## 3         1998     Gable  CompShg     VinylSd     VinylSd       None
## 4         1998     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 5         1992     Gable  CompShg     HdBoard     HdBoard       None
## 6         1994     Gable  CompShg     HdBoard     HdBoard       None
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1          0        TA        TA     CBlock       TA       TA           No
## 2        108        TA        TA     CBlock       TA       TA           No
## 3          0        TA        TA      PConc       Gd       TA           No
## 4         20        TA        TA      PConc       TA       TA           No
## 5          0        Gd        TA      PConc       Gd       TA           No
## 6          0        TA        TA      PConc       Gd       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1          Rec        468          LwQ        144       270         882
## 2          ALQ        923          Unf          0       406        1329
## 3          GLQ        791          Unf          0       137         928
## 4          GLQ        602          Unf          0       324         926
## 5          ALQ        263          Unf          0      1017        1280
## 6          Unf          0          Unf          0       763         763
##   Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1    GasA        TA          Y      SBrkr       896         0            0
## 2    GasA        TA          Y      SBrkr      1329         0            0
## 3    GasA        Gd          Y      SBrkr       928       701            0
## 4    GasA        Ex          Y      SBrkr       926       678            0
## 5    GasA        Ex          Y      SBrkr      1280         0            0
## 6    GasA        Gd          Y      SBrkr       763       892            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1       896            0            0        1        0            2
## 2      1329            0            0        1        1            3
## 3      1629            0            0        2        1            3
## 4      1604            0            0        2        1            3
## 5      1280            0            0        2        0            2
## 6      1655            0            0        2        1            3
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            1          TA            5        Typ          0        <NA>
## 2            1          Gd            6        Typ          0        <NA>
## 3            1          TA            6        Typ          1          TA
## 4            1          Gd            7        Typ          1          Gd
## 5            1          Gd            5        Typ          0        <NA>
## 6            1          TA            7        Typ          1          TA
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1     Attchd        1961          Unf          1        730         TA
## 2     Attchd        1958          Unf          1        312         TA
## 3     Attchd        1997          Fin          2        482         TA
## 4     Attchd        1998          Fin          2        470         TA
## 5     Attchd        1992          RFn          2        506         TA
## 6     Attchd        1993          Fin          2        440         TA
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1         TA          Y        140           0             0          0
## 2         TA          Y        393          36             0          0
## 3         TA          Y        212          34             0          0
## 4         TA          Y        360          36             0          0
## 5         TA          Y          0          82             0          0
## 6         TA          Y        157          84             0          0
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1         120        0   <NA> MnPrv        <NA>       0      6   2010
## 2           0        0   <NA>  <NA>        Gar2   12500      6   2010
## 3           0        0   <NA> MnPrv        <NA>       0      3   2010
## 4           0        0   <NA>  <NA>        <NA>       0      6   2010
## 5         144        0   <NA>  <NA>        <NA>       0      1   2010
## 6           0        0   <NA>  <NA>        <NA>       0      4   2010
##   SaleType SaleCondition
## 1       WD        Normal
## 2       WD        Normal
## 3       WD        Normal
## 4       WD        Normal
## 5       WD        Normal
## 6       WD        Normal
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
##  $ LotFrontage  : num  65 80 68 60 84 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
##  $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
##  $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
##  $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
##  $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
##  $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
##  $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
##  $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
##  $ MasVnrArea   : num  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
##  $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
##  $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
##  $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
##  $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
##  $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
##  $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
##  $ GarageYrBlt  : num  2003 1976 2001 1998 2000 ...
##  $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
##  $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
##  $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   795.8   991.5  1057.4  1298.2  6110.0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1872    1954    1973    1971    2000    2010
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   334.5   480.0   473.0   576.0  1418.0

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
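A scatterplot matrix can be drawn with base R's `pairs()`. The sketch below uses simulated stand-in data (`ames_demo` and its coefficients are assumptions, not the Ames data) so that it runs on its own; with the real data the same call would take the corresponding columns of the training set.

```r
set.seed(1)                        # assumed seed
n <- 200
qual <- sample(1:10, n, replace = TRUE)

# Simulated stand-in: two predictors and the response
ames_demo <- data.frame(
  SalePrice   = 40000 + 25000 * qual + rnorm(n, sd = 30000),
  TotalBsmtSF = pmax(0, 400 + 90 * qual + rnorm(n, sd = 250)),
  OverallQual = qual
)

# One panel per pair of variables
pairs(ames_demo[, c("SalePrice", "TotalBsmtSF", "OverallQual")],
      main = "Scatterplot matrix (simulated stand-in data)")
```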

Derive a correlation matrix for any three quantitative variables in the dataset.

##                        ames_train.TotalBsmtSF ames_train.GarageArea
## ames_train.TotalBsmtSF                   1.00                  0.49
## ames_train.GarageArea                    0.49                  1.00
## ames_train.OverallQual                   0.54                  0.56
##                        ames_train.OverallQual
## ames_train.TotalBsmtSF                   0.54
## ames_train.GarageArea                    0.56
## ames_train.OverallQual                   1.00
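A correlation matrix of this shape comes from calling `cor()` on a data frame of numeric columns. The sketch uses simulated stand-in data (`ames_demo` is hypothetical), so its values will not match the ones printed above.

```r
set.seed(1)                        # assumed seed
n <- 200
qual <- sample(1:10, n, replace = TRUE)

# Simulated stand-in for three quantitative variables
ames_demo <- data.frame(
  TotalBsmtSF = 400 + 90 * qual + rnorm(n, sd = 250),
  GarageArea  = 150 + 50 * qual + rnorm(n, sd = 150),
  OverallQual = qual
)

cor_mat <- cor(ames_demo, method = "pearson")
round(cor_mat, 2)                  # symmetric, with ones on the diagonal
```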

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis.

Would you be worried about familywise error? Why or why not?

## 
##  Pearson's product-moment correlation
## 
## data:  ames_train$SalePrice and ames_train$TotalBsmtSF
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5922142 0.6340846
## sample estimates:
##       cor 
## 0.6135806
## 
##  Pearson's product-moment correlation
## 
## data:  ames_train$SalePrice and ames_train$GarageArea
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6024756 0.6435283
## sample estimates:
##       cor 
## 0.6234314
## 
##  Pearson's product-moment correlation
## 
## data:  ames_train$SalePrice and ames_train$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7780752 0.8032204
## sample estimates:
##       cor 
## 0.7909816
## 
##  Pearson's product-moment correlation
## 
## data:  ames_train$SalePrice and ames_train$LotArea
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2323391 0.2947946
## sample estimates:
##       cor 
## 0.2638434
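A sketch of testing every pair of variables with an 80% confidence interval and then adjusting the p-values for multiple comparisons. The data are simulated stand-ins and the Bonferroni adjustment is my own addition for illustration; the report's output does not show one.

```r
set.seed(1)                        # assumed seed
n <- 200
qual <- sample(1:10, n, replace = TRUE)
ames_demo <- data.frame(           # hypothetical stand-in data
  SalePrice   = 40000 + 25000 * qual + rnorm(n, sd = 30000),
  TotalBsmtSF = 400 + 90 * qual + rnorm(n, sd = 250),
  GarageArea  = 150 + 50 * qual + rnorm(n, sd = 150)
)

# All pairwise combinations of column names
pairs_to_test <- combn(names(ames_demo), 2, simplify = FALSE)

results <- do.call(rbind, lapply(pairs_to_test, function(p) {
  ct <- cor.test(ames_demo[[p[1]]], ames_demo[[p[2]]], conf.level = 0.80)
  data.frame(var1 = p[1], var2 = p[2],
             cor = unname(ct$estimate),
             lower = ct$conf.int[1], upper = ct$conf.int[2],
             p.value = ct$p.value)
}))

# Bonferroni adjustment controls the familywise error rate
results$p.bonferroni <- p.adjust(results$p.value, method = "bonferroni")
results
```

On familywise error: if three independent tests were each run at the 20% level implied by an 80% interval, the chance of at least one false positive would be 1 - 0.8^3, roughly 0.49. With p-values as small as those reported above, an adjustment would not change the conclusions.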

5 points. Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Precision Matrix

##                        ames_train.TotalBsmtSF ames_train.GarageArea
## ames_train.TotalBsmtSF                   1.52                 -0.42
## ames_train.GarageArea                   -0.42                  1.57
## ames_train.OverallQual                  -0.59                 -0.65
##                        ames_train.OverallQual
## ames_train.TotalBsmtSF                  -0.59
## ames_train.GarageArea                   -0.65
## ames_train.OverallQual                   1.68

Multiply correlation matrix by the precision matrix

##                        ames_train.TotalBsmtSF ames_train.GarageArea
## ames_train.TotalBsmtSF                   1.00                     0
## ames_train.GarageArea                   -0.01                     1
## ames_train.OverallQual                   0.00                     0
##                        ames_train.OverallQual
## ames_train.TotalBsmtSF                      0
## ames_train.GarageArea                       0
## ames_train.OverallQual                      1

Multiply precision matrix by the correlation matrix

##                        ames_train.TotalBsmtSF ames_train.GarageArea
## ames_train.TotalBsmtSF                      1                 -0.01
## ames_train.GarageArea                       0                  1.00
## ames_train.OverallQual                      0                  0.00
##                        ames_train.OverallQual
## ames_train.TotalBsmtSF                      0
## ames_train.GarageArea                       0
## ames_train.OverallQual                      1
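The inversion and both products can be reproduced from the rounded correlation matrix printed earlier. This is a sketch; the short variable labels and the names `cor_mat` and `prec_mat` are my own.

```r
vars <- c("TotalBsmtSF", "GarageArea", "OverallQual")

# Correlation matrix, using the two-decimal values reported above
cor_mat <- matrix(c(1.00, 0.49, 0.54,
                    0.49, 1.00, 0.56,
                    0.54, 0.56, 1.00),
                  nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

prec_mat <- solve(cor_mat)         # precision matrix = inverse
round(prec_mat, 2)

# Both products should be the identity matrix
round(cor_mat %*% prec_mat, 10)
round(prec_mat %*% cor_mat, 10)

diag(prec_mat)                     # variance inflation factors
```

The -0.01 entries in the printed products are consistent with multiplying matrices that had already been rounded to two decimals; with unrounded inputs the off-diagonal entries are zero up to floating-point error.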

LU decomposition

## $L
##      [,1]      [,2] [,3]
## [1,] 1.00 0.0000000    0
## [2,] 0.49 1.0000000    0
## [3,] 0.54 0.3887354    1
## 
## $U
##      [,1]   [,2]      [,3]
## [1,]    1 0.4900 0.5400000
## [2,]    0 0.7599 0.2954000
## [3,]    0 0.0000 0.5935676
##                        ames_train.TotalBsmtSF ames_train.GarageArea
## ames_train.TotalBsmtSF                   TRUE                  TRUE
## ames_train.GarageArea                    TRUE                  TRUE
## ames_train.OverallQual                   TRUE                  TRUE
##                        ames_train.OverallQual
## ames_train.TotalBsmtSF                   TRUE
## ames_train.GarageArea                    TRUE
## ames_train.OverallQual                   TRUE

Multiplying L by U should reproduce the correlation matrix. The element-wise comparison above returns TRUE for every entry, which confirms this.
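For reference, a hand-written Doolittle factorisation without pivoting gives the same L and U. This is a sketch only: `lu_doolittle` is a hypothetical helper, not the function used to produce the output above, and it assumes no zero pivots, which holds for this positive-definite matrix.

```r
# Doolittle LU: L has a unit diagonal, U is upper triangular
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- matrix(0, n, n)
  for (i in seq_len(n)) {
    k <- seq_len(i - 1)                       # indices already processed
    for (j in i:n) {                          # fill row i of U
      U[i, j] <- A[i, j] - sum(L[i, k] * U[k, j])
    }
    if (i < n) for (j in (i + 1):n) {         # fill column i of L
      L[j, i] <- (A[j, i] - sum(L[j, k] * U[k, i])) / U[i, i]
    }
  }
  list(L = L, U = U)
}

cor_mat <- matrix(c(1.00, 0.49, 0.54,
                    0.49, 1.00, 0.56,
                    0.54, 0.56, 1.00), nrow = 3, byrow = TRUE)

lu <- lu_doolittle(cor_mat)
lu$L
lu$U
all.equal(lu$L %*% lu$U, cor_mat)             # L %*% U rebuilds the input
```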

5 points. Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

Total basement square footage (TotalBsmtSF): the minimum value of this variable is 0, so I shift the values upward by adding 0.05 to each, which moves the minimum to 0.05 and the maximum to 6110.05.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   795.8   991.5  1057.4  1298.2  6110.0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.05  776.05  980.05 1038.61 1248.05 6110.05

Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).

##        rate    
##   9.628262e-04 
##  (2.806465e-05)
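A sketch of the shift-and-fit step, assuming simulated stand-in data (`bsmt_demo` is hypothetical, drawn from a gamma distribution only to get a right-skewed variable with some zeros). It also shows that the exponential maximum-likelihood rate equals the reciprocal of the sample mean.

```r
library(MASS)

set.seed(7)                                   # assumed seed
bsmt_demo <- round(rgamma(1000, shape = 4, scale = 260))  # stand-in values
bsmt_demo[1:5] <- 0                           # a few houses with no basement

shifted <- bsmt_demo + 0.05                   # minimum is now above zero

fit <- fitdistr(shifted, densfun = "exponential")
lambda <- unname(fit$estimate)

# The exponential MLE has a closed form: rate = 1 / mean
c(fitdistr = lambda, closed_form = 1 / mean(shifted))

# Draw from the fitted distribution
samples <- rexp(1000, rate = lambda)
summary(samples)
```

The same relationship is visible in the reported output: 1 / 9.628262e-04 is about 1038.61, the mean of the shifted variable.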

Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).

##         rate 
## 0.0009628262
##    [1] 1164.030014  473.083531 1734.518460 1885.914296  176.808906
##    [6] 1375.058750 2274.311250  479.805898  207.997341  219.074635
##   [11]  138.490080 1103.712174  806.322618  992.543874  363.963895
##   [16]   99.165308 3249.643186  709.027571 1585.689398 2437.461409
##   [21] 2201.470050 1317.416387  471.905873 2188.139443  838.108450
##   [26] 2994.399718 1409.434020  263.749754 1552.992388 1261.283324
##   [31]  180.625732 1218.331546  555.465123   61.972702  496.523336
##   [36]   36.462263   12.131241 2098.520436  253.739401  766.981615
##   [41] 1256.535027  537.481836  301.280870 3151.928358 3193.363037
##   [46]  310.186137  486.256323  943.333397   65.807723 4485.353495
##   [51]  437.927497 1247.149850  648.695238  338.166563 2239.021661
##   [56]  101.038380  169.212871  844.923887  854.264831  280.214024
##   [61] 3631.991670  960.768031  139.509752  788.272210  507.455789
##   [66]  711.617264 1972.228959 3043.367833 1364.217375  231.376652
##   [71]  909.608430 2145.931484  694.524938  922.895508  849.032208
##   [76]  216.340771  285.264169  454.267218  316.514297 1862.911525
##   [81] 1447.199131  904.133802  419.336042 5134.595134  774.574764
##   [86]  356.698451  230.788271 1519.608649  428.005893  811.813490
##   [91] 4630.977858  687.099207  139.245800  406.595563 1257.979505
##   [96] 2676.175953  600.522066  218.150702  830.937972   27.331716
##  [101] 1300.703023  379.417909 2378.334608  743.147101   62.460259
##  [106]  224.930109  258.324597  148.720263  157.725925 3289.133541
##  [111]  485.285823  572.913143  616.194026 1940.232998  223.979597
##  [116] 1103.210551 2011.680811 2965.452698  498.126358  464.990019
##  [121] 4118.183069 2076.736916 1101.263225  132.874511  604.125952
##  [126]   57.947859 1871.829253   13.919136  355.051328 1810.945468
##  [131] 2403.466997 4317.514571  521.671251 1058.418644   83.854896
##  [136]  497.689766 1434.457806 1003.013433  564.563282  222.831149
##  [141]  471.492924  957.159627 2677.877383 1155.017789  620.052270
##  [146]  985.897260 1079.604570 1084.967278  196.418979  175.309635
##  [151]  162.041090  108.170617 1862.360367   36.434622 1943.762373
##  [156] 1157.180838  150.237337  722.298166  185.467673 2586.717259
##  [161]  113.128584 1911.796207   75.423340  134.363758  472.959478
##  [166] 3453.792629  539.443262  576.987961 1922.624106  817.487803
##  [171] 1277.282419  471.176535  595.990974 1006.375572 4029.487265
##  [176] 3057.632232  246.948595  367.784224  492.290360  233.829736
##  [181]  715.534752 1976.595344  324.222208 2650.419688 3147.253246
##  [186]  101.405281  574.651195  868.099130  324.630175  126.184590
##  [191] 1454.953134 2126.391171   93.476747  910.801307   88.191517
##  [196] 1555.960917  468.682362 1786.819881 1194.771905  185.229699
##  [201] 1048.850542  803.886709  894.023209  914.109750   11.441039
##  [206]  199.473531  614.334919 2685.136445  143.710164  178.186960
##  [211]  201.337848 1920.157375  283.205049 1041.469659  280.049805
##  [216]  372.868128 2514.939232 2244.731606  918.483438  378.443667
##  [221] 1312.926393   90.588159  254.134377  319.092214  582.686488
##  [226]  555.768556    6.236745  321.075661  763.348074  338.930264
##  [231]   13.002116  489.013583  693.826186  290.126869 1481.298462
##  [236]  175.975297  998.454270 1179.376379  599.077101  459.518684
##  [241]   77.272435 1096.588228  533.095136  748.225508 1717.562066
##  [246]  502.349170  755.359552  269.085535  182.569486 1071.238298
##  [251] 2961.273645  490.688459 1214.815799 1026.693920  495.807860
##  [256] 1017.259403  413.418578 2227.308915  362.720006 2626.812410
##  [261] 1603.471128 4190.912531 1516.161549 2232.484651 1204.238699
##  [266]   35.115204 2380.050502 1515.207969 1130.597908  151.322248
##  [271]  540.581909 2762.038057  832.458857  252.866543  382.514708
##  [276]  442.737044  716.605363 2125.177182 3150.763227  349.288264
##  [281] 1073.039004  847.592758 2992.877006 1047.533282 1020.332892
##  [286]  683.168107 2583.897223  764.162567 3215.710998 3256.315233
##  [291] 2659.876097  340.473557 1491.691543  661.830171  869.946449
##  [296] 1739.112090 1348.586235  389.583728   99.809596 1259.043871
##  [301]  872.175039 2596.664736 3620.923770  162.711594 1076.942031
##  [306]  994.154547  805.960112  502.572467  521.820087  498.579078
##  [311]  868.153636  851.677064  771.268273 1439.637614 4115.676344
##  [316] 1012.768600  644.653544  182.694140  864.752616  502.795621
##  [321]  650.474109   34.374116  294.859975 1447.648040  967.802994
##  [326]  485.468476 2108.750308 1542.738843  102.190934 1019.982306
##  [331]  363.538797  879.123033  413.806492 2946.254365  194.884591
##  [336]  613.758054   16.987603  359.694291 1142.699497    5.929488
##  [341] 2530.204222  414.769610  594.612597 1804.141702  669.354986
##  [346] 1813.475913  571.176149  108.178164  502.127551  779.867585
##  [351] 1735.256342  193.184183 3208.155289 1543.908399  579.711254
##  [356] 1254.249050  411.545657  603.920136 2570.296337 1314.079356
##  [361]  246.623966 1895.712667 1143.634244 1872.325418  112.077332
##  [366] 1297.443438  848.948756  948.842429 4407.158130  308.018420
##  [371] 1059.450174 1449.317573 3419.150562 3651.714337 1379.294185
##  [376] 3210.692594  954.125566 1205.185529  536.517602  476.619712
##  [381]  442.425729  641.885206 1495.871779  385.853861 1908.547486
##  [386]  152.869685  729.960606 1800.586856  613.117126 1052.911977
##  [391]  243.885509 1641.074227 1187.867486  601.199982 1727.523909
##  [396]  691.772684 1213.010450 1113.136270  128.573733  606.692996
##  [401]  641.055320  386.188571 1315.826222  415.888859 2745.735712
##  [406] 1419.936523  836.155902 1865.118957 2774.359111  101.634993
##  [411]   15.146464  671.279227  570.836822   64.277455 1876.972713
##  [416]  424.179488  634.556926 1022.765794  130.743309  166.987397
##  [421] 2235.272689  394.255248 2830.749925    7.829786 3466.600167
##  [426]  965.617117  769.071198 4960.775103 1223.440157 1177.322399
##  [431]  518.588968 1795.496769  139.399962 1849.289512 1641.877393
##  [436]  987.952879 5252.497737 2680.546674 3076.213289  646.257744
##  [441]  452.076399  716.562180  269.163140  516.615830  590.325441
##  [446]  468.029498    2.709556 1245.346891 2769.334784 2809.228412
##  [451] 1204.275718  533.714456  805.526935  105.965643  397.037804
##  [456]  439.625633  747.702914 1221.322378  343.043446  209.863500
##  [461]  322.858774 1844.937289 1925.033336  255.593331 3739.133781
##  [466] 1496.786492  155.425263  459.412196  204.567422  690.332708
##  [471]  227.539530  549.371260  905.373827  279.259145  166.673487
##  [476]  469.938284 1250.101344 2487.964852   81.128605 2549.999661
##  [481] 1794.818733  367.557195  658.213094  529.898855 3568.797778
##  [486]  881.096235  417.725696   22.017242  275.186483   53.726003
##  [491]  147.776363  361.980240   52.369731  674.831088  364.196355
##  [496] 2604.991628  509.112451  234.666229 1874.738369  188.459040
##  [501]  617.257092  798.140473  552.021632  188.553533  490.497543
##  [506]  534.407515 1007.532079  391.784344  135.364841 1520.791840
##  [511]  745.444489 3528.006231  614.300479  302.776140   74.215362
##  [516]  211.875656  549.206193 1766.659686   41.093638  537.091367
##  [521]  903.501727 1829.732020  257.794354 1381.272345  293.232476
##  [526] 1175.012253 2568.939034  711.309221 1184.552075    6.028655
##  [531]   97.755660 1331.315997  308.688213  131.824678  146.365230
##  [536]  781.410902  455.508114  379.826824  134.618136  695.152755
##  [541] 1213.276931  532.410430  133.440591 2001.983321  320.404219
##  [546] 1170.250051 2114.070397  684.930973 1124.618914   14.136345
##  [551]  438.703737  124.806837   83.428616    4.996434  866.291304
##  [556]  852.583846   21.091665  324.651706  515.202769 1400.718513
##  [561]   75.582941  474.625933  658.144517 2893.491263 1472.177751
##  [566]  192.182436 2138.151513 1433.221373 1000.057274 1019.809187
##  [571]  508.329448  495.890608  348.470704 1623.210097 2242.110563
##  [576]  520.933937  322.983721 1460.348413  637.030423 1307.875150
##  [581] 1955.974481  443.830517 1093.425626 2761.306971 1679.944527
##  [586]  773.322347 2333.112850 1773.178226   71.365088  901.405176
##  [591]  669.761958 3311.668159  423.970767  126.043473 1204.298516
##  [596] 3758.975342 1004.143271 6329.395185  530.595827  488.144963
##  [601] 2624.274874 1767.452632  660.883535  933.553919  443.954031
##  [606]  825.408897  291.575153  769.211852  579.813244   54.685450
##  [611]  668.480658  400.075017  904.270746 2449.460180  305.695163
##  [616] 1666.805777  301.186655  508.994847  243.238856 1010.774793
##  [621] 2614.069609 1189.945310  183.245895  407.654469  798.554091
##  [626]  428.677157  332.598491  241.625069  110.785092  413.476079
##  [631] 1743.257446  611.276321  759.149751   77.914622  274.617138
##  [636]  730.512527  878.776275  287.576882 4616.798276  363.623083
##  [641] 1501.415535  790.470771 1021.304295 1077.533873  789.143486
##  [646]  503.109558   97.085078   61.554852   69.226595  299.264283
##  [651] 1058.193515 4140.787893 1746.133816 2094.500530  995.209842
##  [656]  582.159667   46.517959    3.062376  494.028720  169.027709
##  [661] 1231.541288  122.155521 3078.218801  148.671122  548.276686
##  [666] 1024.716314  466.432748  803.908634 1284.697710 1510.614348
##  [671]  267.395396  588.584514  141.409321   59.949127  300.459879
##  [676]  129.248310  503.609348  750.504940 2934.327405  945.260709
##  [681]  677.895728 1241.819802   62.569126  418.136920  227.511104
##  [686] 1402.898390  490.632295 1979.523537  490.255911 2326.926769
##  [691] 2931.626405   94.074324  299.778191  840.983081 2895.780080
##  [696]  564.493851  933.601004  717.386430 1303.387271  538.258131
##  [701]  643.308548 1581.823946  427.106128 1172.108242 1298.264288
##  [706]  874.492184  448.781427  524.260449  871.703002 1303.975164
##  [711] 1276.613373 1135.871720  375.877846  184.480558  314.512660
##  [716]  685.330748  384.828349  624.468053  802.944603 1219.584149
##  [721]   89.532431  424.069284 1447.806231  512.587531   33.887431
##  [726] 1877.028057  257.975766 1229.575746  265.617213 2395.657700
##  [731] 1077.080024   12.199835  316.828845  349.113570  896.602031
##  [736] 1346.737723 1357.178606  940.667984 4460.991207   51.492421
##  [741] 1231.302512  748.679345  131.634575 1750.192657  804.597404
##  [746]  563.586466 1428.696593 1570.122562 1146.335772  844.540812
##  [751]  216.656077 2118.486746  792.479468  432.951722   66.728168
##  [756] 1759.389321 2430.506266  441.811062  289.869254 1808.086974
##  [761] 1629.932306 2153.266469  612.353967   89.681275 1140.243092
##  [766]  485.450221  242.312961 3961.129957  640.256936 1172.601415
##  [771]  613.195182  873.487945  610.459662  657.266295 1942.151159
##  [776]  577.239406 1959.773887  673.333865  187.870959  455.283718
##  [781] 1510.953594  514.084782 1449.128793  289.532718   18.954552
##  [786] 1043.291894  137.213476  267.481480 1078.199251  584.075609
##  [791]  199.433034  713.474388  273.081952  880.309158  152.248331
##  [796] 4793.373067  662.414213 2727.459362  790.611687  481.148146
##  [801]  349.356319  460.626638  814.789354   28.036447 1895.149387
##  [806] 1061.664929  156.678800  331.054175  866.835766   59.945978
##  [811] 2426.824349  289.982203 3034.092615   71.842285  633.732436
##  [816] 1026.455443 1212.459334  218.424117  427.708476 1282.553031
##  [821]  111.663898 1188.969092 1367.325439  889.683458 1812.080871
##  [826]  312.538848 1570.618464 1232.442461  495.047009  973.806097
##  [831] 2845.799282  157.385979 1087.006554  240.904320  556.045572
##  [836]  207.279993  458.459990 1484.103444 2211.373307 1411.397293
##  [841]  392.859748  596.200354  231.888649  274.883743  731.252073
##  [846]  463.966609 2732.645306  277.323967   99.057260  268.815436
##  [851] 1424.147859 1909.222002 1001.454115  912.421325  650.195414
##  [856] 1268.173041  345.098678  322.167581  254.013815  991.168024
##  [861]  270.448161  573.602752 2199.359445  280.478811  966.706459
##  [866]   25.590866 2120.358276   26.596344  667.027259 1005.681518
##  [871]  481.281703  563.233363 1604.710494  918.574493  671.615231
##  [876]  916.342299  411.540404  794.917576  973.419330  193.096411
##  [881]  328.024576   56.184145  583.530971  604.358729   92.518111
##  [886]  585.110440 1088.821961  837.076543  728.567421  296.302010
##  [891]  791.009248 1909.895937  313.182673 1236.238426  285.762102
##  [896]   35.910066  163.623480 1997.648711 2624.436186 4205.653771
##  [901]  647.293877 3935.100697 1775.153017  409.380429 2329.177544
##  [906]  239.379441  158.012775 2044.806972 1019.789001  204.226792
##  [911] 1368.183551 1921.213196  706.169253  947.007529  224.895800
##  [916] 1888.086757 1383.502758  909.449046  237.509935  253.734312
##  [921] 1082.594779  238.526501  399.586274  536.102202 1132.248701
##  [926]  406.574551  285.811300 1026.955612  150.607524  198.809950
##  [931]  277.184799 1289.324321  386.360445  458.823898  403.853667
##  [936]  945.436488 1439.138886  351.837311 1733.442901  695.886444
##  [941]  221.127297  110.113167 1674.940643 3182.401448   86.290249
##  [946]  254.031445 1455.589412 2205.059491  474.416758   39.430823
##  [951]  226.983798 1310.964746   95.555971  371.952517  382.423173
##  [956] 2004.724871 3932.676813  233.901112 1069.302632  143.817907
##  [961]   61.810901 1967.666767 1887.551232  599.866539 2414.830088
##  [966]    9.940301  400.920815  853.069228  190.924911  845.887877
##  [971]   35.519694  162.896755  877.728322 1168.463498  323.549886
##  [976] 1788.579785  360.189372 1164.298247  260.123163 1109.061155
##  [981] 3110.758118   38.132411 2549.114873 2373.207455 2397.144076
##  [986] 2100.687756  665.622277  916.416888 1972.688506  398.936647
##  [991]  336.151156 1522.735783 1592.556400 1095.115284 1730.120959
##  [996] 1737.159803  210.488797  188.552135  882.524874  616.501849

Plot a histogram and compare it with a histogram of your original variable.
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

## [1] 53.27368
## [1] 3111.395
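
The two values above come from inverting the exponential CDF. For rate \(\lambda\), \(F(x) = 1 - e^{-\lambda x}\), so the p-th percentile is \(x_p = -\ln(1-p)/\lambda\). A minimal sketch in R follows. The rate is an assumption: it is set to the reciprocal of the mean shown in the confidence-interval output, purely for illustration. The rate actually fitted in this analysis is not shown, so the numbers need not match the output above.

```r
# Assumed rate: reciprocal of an illustrative mean (not the fitted rate from the analysis)
lambda <- 1 / 1057.429

# Inverse of the exponential CDF F(x) = 1 - exp(-lambda * x)
exp_percentile <- function(p, rate) -log(1 - p) / rate

p <- c(0.05, 0.95)
by_hand  <- exp_percentile(p, lambda)
built_in <- qexp(p, rate = lambda)   # base R quantile function for the exponential

print(by_hand)
print(built_in)
```

The hand-written inverse and `qexp` should agree, which is a quick check that the percentiles really are read off the CDF.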

Confidence Interval

##    upper     mean    lower 
## 1079.951 1057.429 1034.908
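
An interval of this kind can be sketched as mean ± z · sd/√n. The helper name `normal_ci` and the simulated vector below are hypothetical stand-ins, not the code or data behind the output above. A t quantile could be used in place of the normal quantile, and with a sample this large the difference would be small.

```r
# Normal-theory confidence interval for the mean of a numeric vector
normal_ci <- function(x, level = 0.95) {
  x  <- x[!is.na(x)]
  se <- sd(x) / sqrt(length(x))          # standard error of the mean
  z  <- qnorm(1 - (1 - level) / 2)       # about 1.96 for a 95% interval
  c(lower = mean(x) - z * se, mean = mean(x), upper = mean(x) + z * se)
}

# Simulated stand-in for the real column (hypothetical data)
set.seed(1)
x  <- rexp(1000, rate = 1 / 1000)
ci <- normal_ci(x)
print(ci)
```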

Empirical Percentile

##    5% 
## 519.3
##  95% 
## 1753
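
Empirical percentiles are read directly from the sorted data, with no distributional assumption. A short sketch using base R's `quantile` on simulated stand-in data (hypothetical, not the Ames column):

```r
set.seed(1)
x <- rexp(1000, rate = 1 / 1000)   # hypothetical stand-in data

# 5th and 95th empirical percentiles
emp <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE)
print(emp)
```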

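The empirical 5th and 95th percentiles of the original data are 519.3 and 1753, respectively, whereas the exponential distribution gives 53.27 and 3111.40. The exponential model therefore places far more mass in both tails than the data show, which suggests it describes TotalBsmtSF poorly. The 95% confidence interval for the mean of TotalBsmtSF, assuming normality, is (1034.91, 1079.95). It is much narrower than the percentile range because it measures uncertainty in the mean, not the spread of individual values.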

0.5.2 Modeling Method - Multilinear Regression

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Multilinear Regression Model - Stepwise All variables

##   ID MSSubClass GrLivArea BldgType GarageArea LotArea TotalBsmtSF
## 1  1         60      1710     1Fam        548    8450         856
## 2  2         20      1262     1Fam        460    9600        1262
## 3  3         60      1786     1Fam        608   11250         920
## 4  4         70      1717     1Fam        642    9550         756
## 5  5         60      2198     1Fam        836   14260        1145
## 6  6         50      1362     1Fam        480   14115         796
##   BsmtFinSF1 Age OverallQuality OverallCondition SalePrice
## 1        706   5              7                5    208500
## 2        978  31              6                8    181500
## 3        486   7              7                5    223500
## 4        216  91              7                5    140000
## 5        655   8              8                5    250000
## 6        732  16              5                5    143000
## 
## Call:
## lm(formula = SalePrice ~ ., data = ames, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -541628  -16397   -3502   13125  275070 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -7.013e+04  7.905e+03  -8.871  < 2e-16 ***
## ID               -1.813e+00  2.259e+00  -0.802   0.4224    
## MSSubClass       -2.259e+02  5.717e+01  -3.951 8.15e-05 ***
## GrLivArea         5.620e+01  2.862e+00  19.639  < 2e-16 ***
## BldgType2fmCon    2.182e+04  1.033e+04   2.112   0.0349 *  
## BldgTypeDuplex   -8.960e+03  5.938e+03  -1.509   0.1315    
## BldgTypeTwnhs    -1.414e+03  8.675e+03  -0.163   0.8705    
## BldgTypeTwnhsE    1.020e+04  6.722e+03   1.518   0.1292    
## GarageArea        3.528e+01  5.926e+00   5.953 3.30e-09 ***
## LotArea           4.644e-01  1.035e-01   4.486 7.83e-06 ***
## TotalBsmtSF       8.185e+00  3.564e+00   2.297   0.0218 *  
## BsmtFinSF1        1.841e+01  2.499e+00   7.367 2.92e-13 ***
## Age              -4.740e+02  4.645e+01 -10.203  < 2e-16 ***
## OverallQuality    2.101e+04  1.164e+03  18.055  < 2e-16 ***
## OverallCondition  5.372e+03  9.582e+02   5.606 2.47e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36250 on 1445 degrees of freedom
## Multiple R-squared:  0.7938, Adjusted R-squared:  0.7918 
## F-statistic: 397.3 on 14 and 1445 DF,  p-value: < 2.2e-16

Metrics 1

##          R2     RMSE      MAE
## 1 0.7938015 36061.76 21645.03
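
The three metrics can be computed from any fitted `lm` object. The sketch below uses a hypothetical helper, `fit_metrics`, and the built-in `mtcars` data as a stand-in for the Ames data. It is not necessarily how the table above was produced. RMSE here divides the squared residuals by n, while the residual standard error in the model summary divides by the residual degrees of freedom, which is consistent with the RMSE above (36061.76) being slightly smaller than the residual standard error (36250).

```r
# Fit metrics for a fitted lm object: R^2, RMSE and MAE on the training data
fit_metrics <- function(model) {
  res <- residuals(model)
  res <- res[!is.na(res)]            # na.exclude pads residuals with NA
  data.frame(
    R2   = summary(model)$r.squared,
    RMSE = sqrt(mean(res^2)),
    MAE  = mean(abs(res))
  )
}

# mtcars (built into R) stands in for the Ames data here
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
m <- fit_metrics(demo_fit)
print(m)
```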

Multilinear Regression Model - Eight Significant Variables

## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + GrLivArea + GarageArea + 
##     LotArea + BsmtFinSF1 + Age + OverallQuality + OverallCondition, 
##     data = ames, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -526259  -17263   -3378   14118  278763 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -7.251e+04  7.467e+03  -9.711  < 2e-16 ***
## MSSubClass       -1.826e+02  2.333e+01  -7.830 9.35e-15 ***
## GrLivArea         5.499e+01  2.499e+00  22.008  < 2e-16 ***
## GarageArea        3.751e+01  5.919e+00   6.338 3.10e-10 ***
## LotArea           5.238e-01  1.025e-01   5.109 3.66e-07 ***
## BsmtFinSF1        2.193e+01  2.257e+00   9.717  < 2e-16 ***
## Age              -4.589e+02  4.536e+01 -10.118  < 2e-16 ***
## OverallQuality    2.245e+04  1.095e+03  20.502  < 2e-16 ***
## OverallCondition  4.906e+03  9.497e+02   5.166 2.73e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36490 on 1451 degrees of freedom
## Multiple R-squared:  0.7902, Adjusted R-squared:  0.789 
## F-statistic:   683 on 8 and 1451 DF,  p-value: < 2.2e-16

Metrics 2

##          R2     RMSE      MAE
## 1 0.7901543 36379.28 22198.86

Multilinear Regression Model - Six Positively Significant Variables

## 
## Call:
## lm(formula = SalePrice ~ GrLivArea + GarageArea + LotArea + BsmtFinSF1 + 
##     OverallQuality + OverallCondition, data = ames, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -534778  -18236    -564   14485  284868 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.110e+05  7.142e+03 -15.544  < 2e-16 ***
## GrLivArea         4.529e+01  2.491e+00  18.182  < 2e-16 ***
## GarageArea        5.846e+01  5.965e+00   9.801  < 2e-16 ***
## LotArea           5.906e-01  1.066e-01   5.541 3.57e-08 ***
## BsmtFinSF1        2.551e+01  2.351e+00  10.850  < 2e-16 ***
## OverallQuality    2.780e+04  9.920e+02  28.023  < 2e-16 ***
## OverallCondition  1.537e+03  9.137e+02   1.683   0.0927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38370 on 1453 degrees of freedom
## Multiple R-squared:  0.7676, Adjusted R-squared:  0.7667 
## F-statistic: 800.1 on 6 and 1453 DF,  p-value: < 2.2e-16

Metrics 3

##          R2     RMSE     MAE
## 1 0.7676475 38280.52 23848.4

0.6 Model Performance

0.6.1 Evaluating the Selected Model

Comparing the OLS Model Fit

##   ModelName Model_RSquared Model_RMSE Model_FStatistic
## 1    Model1            79%   36061.76            397.3
## 2    Model2            79%   36379.28              683
## 3    Model3            77%   38280.52            800.1

R-squared measures the proportion of the variance in the dependent variable that the model explains. The F-test of overall significance is the hypothesis test for this relationship: if the overall F-test is significant, we can conclude that R-squared is not zero and that the model explains a statistically significant share of the variation in the dependent variable.
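
The link between the two statistics can be made explicit: for a model with k predictor terms and residual degrees of freedom df, \(F = \dfrac{R^2/k}{(1-R^2)/\mathrm{df}}\). The sketch below recomputes F for the three models from the R-squared values and degrees of freedom printed in the summaries above. The helper name `f_from_r2` is hypothetical.

```r
# Overall F-statistic from R^2, number of predictor terms k, and residual df
f_from_r2 <- function(r2, k, df_resid) (r2 / k) / ((1 - r2) / df_resid)

models <- data.frame(
  model    = c("Model1", "Model2", "Model3"),
  r2       = c(0.7938015, 0.7901543, 0.7676475),  # from the Metrics tables
  k        = c(14, 8, 6),                          # numerator df in each summary
  df_resid = c(1445, 1451, 1453)                   # residual df in each summary
)
models$F <- with(models, f_from_r2(r2, k, df_resid))
print(models)
```

The recomputed values should agree with the reported F-statistics up to rounding. Because k sits in the denominator of the numerator term, F rises as predictors are dropped even when R-squared falls, so part of the difference in F between these models reflects model size rather than fit.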

The F-statistics for all three models are significant. As noted by Sohrabi (2015), the higher the F value, the better the model. Although Model3 has the highest F-statistic (800.1), its \(R^2\) is lower than that of Model1 and Model2.

The \(R^2\) values for Model1 and Model2 are nearly the same (0.7938 and 0.7902, both about 79%), but Model2 has a higher F-statistic than Model1 (683 versus 397.3). The \(R^2\) of Model2 shows that 79% of the variation in SalePrice is explained by this model.

Therefore, I select Model2, the model with eight variables, to predict housing sale price for this task.

0.6.2 Visualize Model Prediction

The prediction plot does not look bad, although these are not the best predictions I would expect from this model given the significance of its variables.

0.8 Conclusion

The data were not a good fit for a multiple linear regression model: the \(R^2\) did not exceed 79% for any of the three models I built. One alternative is to introduce a penalty, as in ridge regression, which shrinks the coefficients and can improve predictive performance.
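
As a rough illustration of the penalty idea, ridge regression replaces the least-squares solution with \((X^\top X + \lambda I)^{-1} X^\top y\). The sketch below implements that closed form in base R on standardized predictors. The helper `ridge_coef`, the `mtcars` stand-in data, and the value of \(\lambda\) are all assumptions made for illustration, and no ridge model was fitted in this analysis.

```r
# Ridge regression via the closed form (X'X + lambda * I)^(-1) X'y
# on standardized predictors and a centered response (intercept left unpenalized)
ridge_coef <- function(X, y, lambda) {
  Xs <- scale(X)                     # center and scale each predictor
  yc <- y - mean(y)                  # center the response
  p  <- ncol(Xs)
  drop(solve(crossprod(Xs) + lambda * diag(p), crossprod(Xs, yc)))
}

# mtcars (built into R) stands in for the Ames data; lambda = 10 is arbitrary
X <- as.matrix(mtcars[, c("wt", "hp", "disp")])
y <- mtcars$mpg
b_ols   <- ridge_coef(X, y, lambda = 0)    # lambda = 0 reduces to least squares
b_ridge <- ridge_coef(X, y, lambda = 10)
print(rbind(ols = b_ols, ridge = b_ridge))
```

In practice \(\lambda\) would be chosen by cross-validation rather than fixed by hand.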

I selected the second model, with eight independent variables, to predict housing sale price. Two of the eight variables (MSSubClass and Age) have negative coefficients. Dropping these two variables did not improve the model: the \(R^2\) fell from about 0.790 to 0.768, as seen in Model3.

Using the test data provided, the model predicted new sale prices from the significant variables identified above. The predicted sale prices are saved as results in a CSV file and can be viewed directly.
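
The predict-and-save step can be sketched as follows. The built-in `mtcars` data stands in for the Ames train and test sets, and the column names and output file name are assumptions, so this is not the exact code used for the submission.

```r
# mtcars (built into R) stands in for the Ames train/test split
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

fit  <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)   # one prediction per test row

# Two-column layout: row identifier and predicted value (column names assumed)
results  <- data.frame(Id = rownames(test), Prediction = unname(pred))
out_file <- file.path(tempdir(), "results.csv")
write.csv(results, out_file, row.names = FALSE)
```

With `predict` on an `lm` fit, test rows that have a missing predictor value yield `NA` by default, so any such rows would need to be handled before the file is submitted.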

Reference

Sohrabi, H. (2015). Re: Can I use the F-test value to determine the best model in regression? Retrieved from https://www.researchgate.net/post/Can_I_use_the_F-test_value_to_determine_the_best_model_in_regression/54b417aad4c11849278b4578/citation/download

