Your final is due by the end of day on 12/16/2018. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.
Problem 1 Pick one of the quantitative independent variables (Xi) from the data set below, and define that variable as X. Also, pick one of the dependent variables (Yi) below, and define that as Y.
Let's pick Y1 as the dependent variable Y and X4 as the independent variable X.
p1data = read.csv("https://raw.githubusercontent.com/vindication09/DATA-605/master/data605_data.csv", header = TRUE)
p1<-subset(p1data, select=c(Y1, X4))
summary(p1)
## Y1 X4
## Min. :10.30 Min. :-1.000
## 1st Qu.:18.55 1st Qu.: 4.350
## Median :20.55 Median : 8.500
## Mean :20.29 Mean : 8.595
## 3rd Qu.:22.07 3rd Qu.:12.525
## Max. :28.70 Max. :19.900
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
Before we compute any probabilities, we need to count the observations that fall on either side of the required quartile cutoffs. These counts will also be used to fill in the table of counts later in this problem.
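For reference, a quick sketch of where those cutoff values come from (they match the summary above: 18.55 is the 1st quartile of Y1 and 12.525 is the 3rd quartile of X4):
# Sketch: the quartile cutoffs used in the subsets below
y_cut <- quantile(p1$Y1, 0.25)  # 1st quartile of Y1
x_cut <- quantile(p1$X4, 0.75)  # 3rd quartile of X4
c(y_cut, x_cut)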
Y3q_greater = subset(p1, Y1 > 18.55)
Y3q_lesser = subset(p1, Y1 <= 18.55)
X4q_greater_given_yq3_greater = nrow(subset(Y3q_greater, X4 > 12.525))
X4q_lesser_given_Y3q_greater = nrow(subset(Y3q_greater, X4 <= 12.525))
X4q_greater_given_Y3q_lesser = nrow(subset(Y3q_lesser, X4 > 12.525))
X4q_lesser_given_Y3q_lesser = nrow(subset(Y3q_lesser, X4 <= 12.525))
a. P(X>x | Y>y) - the probability that X exceeds its 3rd quartile, given that Y exceeds its 1st quartile.
nrow(subset(Y3q_greater, X4 > 12.525)) / nrow(Y3q_greater)
## [1] 0.2666667
Given that Y is above its 1st quartile, there is roughly a 27% chance that X is above its 3rd quartile.
b. P(X>x, Y>y) - the joint probability that X exceeds its 3rd quartile and Y exceeds its 1st quartile.
nrow(subset(Y3q_greater, X4 > 12.525)) / nrow(p1)
## [1] 0.2
20% of all observations have X above its 3rd quartile and Y above its 1st quartile.
c. P(X<x | Y>y) - the probability that X is below its 3rd quartile, given that Y exceeds its 1st quartile.
nrow(subset(Y3q_greater, X4 < 12.525)) / nrow(Y3q_greater)
## [1] 0.7333333
Given that Y is above its 1st quartile, there is roughly a 73% chance that X is below its 3rd quartile.
In addition, make a table of counts as shown below.
Let's take the counts we computed and put them into a data frame.
p2 = data.frame(
X4q_lesser_given_Y3q_lesser,
X4q_greater_given_Y3q_lesser,
X4q_lesser_given_Y3q_lesser + X4q_greater_given_Y3q_lesser,
X4q_lesser_given_Y3q_greater,
X4q_greater_given_yq3_greater,
X4q_greater_given_yq3_greater + X4q_lesser_given_Y3q_greater,
X4q_lesser_given_Y3q_lesser + X4q_lesser_given_Y3q_greater,
X4q_greater_given_Y3q_lesser + X4q_greater_given_yq3_greater,
nrow(p1))
p2 = matrix(p2, nrow = 3, ncol = 3)
p2 = data.frame(p2)
rownames(p2) <- c('<=3rd q for x', '>3rd q for x', 'Total')
colnames(p2) <- c('<=1st q for y', '>1st q for y', 'Total')
print(p2)
##               <=1st q for y >1st q for y Total
## <=3rd q for x             4           11    15
## >3rd q for x              1            4     5
## Total                     5           15    20
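The same table can be produced more directly with table() and addmargins(); a sketch using the same cutoffs:
# Sketch: contingency table of the quartile indicators with marginal totals
x_ind <- factor(p1$X4 > 12.525, levels = c(FALSE, TRUE), labels = c("<=3rd q for x", ">3rd q for x"))
y_ind <- factor(p1$Y1 > 18.55, levels = c(FALSE, TRUE), labels = c("<=1st q for y", ">1st q for y"))
addmargins(table(x_ind, y_ind))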
5 points. Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
x4q_greater = nrow(subset(p1, X4 > 4.350))
Y1q_greater = nrow(subset(p1, Y1 > 18.55))
X4q_greater_Y1q_greater = nrow(subset(p1, X4 > 4.350 & Y1 > 18.55))
prob_AintersectB = X4q_greater_Y1q_greater/nrow(p1)
probA = x4q_greater / nrow(p1)
ProbB = Y1q_greater / nrow(p1)
prob_AB = probA * ProbB
print(paste0(prob_AintersectB, " Not equal to ", prob_AB, " Hence not independent"))
## [1] "0.55 Not equal to 0.5625 Hence not independent"
Run a chi-squared test
X4q_greater = subset(p1, X4 > 4.350)
Y1q_greater = subset(p1, Y1 > 18.55)
chisq.test(X4q_greater$X4, Y1q_greater$Y1)
## Warning in chisq.test(X4q_greater$X4, Y1q_greater$Y1): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: X4q_greater$X4 and Y1q_greater$Y1
## X-squared = 155, df = 144, p-value = 0.251
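Note that cross-tabulating the raw continuous values, as above, is not quite what a test of association between the two indicator variables calls for. Here is a minimal sketch that runs the chi-square test on the 2 x 2 contingency table of A (X above its 1st quartile) and B (Y above its 1st quartile), using the same cutoffs as before; with only 20 observations, expect the same small-sample warning:
# Sketch: chi-square test of association between the two quartile indicators
A <- p1$X4 > 4.350
B <- p1$Y1 > 18.55
table(A, B)
chisq.test(table(A, B))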
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
Read in the data
train = read.csv("https://raw.githubusercontent.com/vindication09/DATA-605/master/train.csv", header = TRUE)
test= read.csv("https://raw.githubusercontent.com/vindication09/DATA-605/master/test.csv", header = TRUE)
head(train, 10)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## 7 7 20 RL 75 10084 Pave <NA> Reg
## 8 8 60 RL NA 10382 Pave <NA> IR1
## 9 9 50 RM 51 6120 Pave <NA> Reg
## 10 10 190 RL 50 7420 Pave <NA> Reg
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## 7 Lvl AllPub Inside Gtl Somerst Norm
## 8 Lvl AllPub Corner Gtl NWAmes PosN
## 9 Lvl AllPub Inside Gtl OldTown Artery
## 10 Lvl AllPub Corner Gtl BrkSide Artery
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## 7 Norm 1Fam 1Story 8 5 2004
## 8 Norm 1Fam 2Story 7 6 1973
## 9 Norm 1Fam 1.5Fin 7 5 1931
## 10 Artery 2fmCon 1.5Unf 5 6 1939
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## 7 2005 Gable CompShg VinylSd VinylSd Stone
## 8 1973 Gable CompShg HdBoard HdBoard Stone
## 9 1950 Gable CompShg BrkFace Wd Shng None
## 10 1950 Gable CompShg MetalSd MetalSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond
## 1 196 Gd TA PConc Gd TA
## 2 0 TA TA CBlock Gd TA
## 3 162 Gd TA PConc Gd TA
## 4 0 TA TA BrkTil TA Gd
## 5 350 Gd TA PConc Gd TA
## 6 0 TA TA Wood Gd TA
## 7 186 Gd TA PConc Ex TA
## 8 240 TA TA CBlock Gd TA
## 9 0 TA TA BrkTil TA TA
## 10 0 TA TA BrkTil TA TA
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF
## 1 No GLQ 706 Unf 0 150
## 2 Gd ALQ 978 Unf 0 284
## 3 Mn GLQ 486 Unf 0 434
## 4 No ALQ 216 Unf 0 540
## 5 Av GLQ 655 Unf 0 490
## 6 No GLQ 732 Unf 0 64
## 7 Av GLQ 1369 Unf 0 317
## 8 Mn ALQ 859 BLQ 32 216
## 9 No Unf 0 Unf 0 952
## 10 No GLQ 851 Unf 0 140
## TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## 1 856 GasA Ex Y SBrkr 856 854
## 2 1262 GasA Ex Y SBrkr 1262 0
## 3 920 GasA Ex Y SBrkr 920 866
## 4 756 GasA Gd Y SBrkr 961 756
## 5 1145 GasA Ex Y SBrkr 1145 1053
## 6 796 GasA Ex Y SBrkr 796 566
## 7 1686 GasA Ex Y SBrkr 1694 0
## 8 1107 GasA Ex Y SBrkr 1107 983
## 9 952 GasA Gd Y FuseF 1022 752
## 10 991 GasA Ex Y SBrkr 1077 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
## 1 0 1710 1 0 2 1
## 2 0 1262 0 1 2 0
## 3 0 1786 1 0 2 1
## 4 0 1717 1 0 1 0
## 5 0 2198 1 0 2 1
## 6 0 1362 1 0 1 1
## 7 0 1694 1 0 2 0
## 8 0 2090 1 0 2 1
## 9 0 1774 0 0 2 0
## 10 0 1077 1 0 1 0
## BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 3 1 Gd 8 Typ
## 2 3 1 TA 6 Typ
## 3 3 1 Gd 6 Typ
## 4 3 1 Gd 7 Typ
## 5 4 1 Gd 9 Typ
## 6 1 1 TA 5 Typ
## 7 3 1 Gd 7 Typ
## 8 3 1 TA 7 Typ
## 9 2 2 TA 8 Min1
## 10 2 2 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## 7 1 Gd Attchd 2004 RFn 2
## 8 2 TA Attchd 1973 RFn 2
## 9 2 TA Detchd 1931 Unf 2
## 10 2 TA Attchd 1939 RFn 1
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## 7 636 TA TA Y 255 57
## 8 484 TA TA Y 235 204
## 9 468 Fa TA Y 90 0
## 10 205 Gd TA Y 0 4
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## 7 0 0 0 0 <NA> <NA> <NA>
## 8 228 0 0 0 <NA> <NA> Shed
## 9 205 0 0 0 <NA> <NA> <NA>
## 10 0 0 0 0 <NA> <NA> <NA>
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
## 7 0 8 2007 WD Normal 307000
## 8 350 11 2009 WD Normal 200000
## 9 0 4 2008 WD Abnorml 129900
## 10 0 1 2008 WD Normal 118000
Basic EDA: view the data types for each of the variables.
str(train)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
## $ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
## $ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
## $ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
## $ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
## $ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
## $ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
## $ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
## $ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
## $ ExterCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
## $ BsmtQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
## $ BsmtCond : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
## $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
## $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ HeatingQC : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
## $ GarageType : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
## $ GarageCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
## $ MiscFeature : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
It seems we have many variables in this dataset, many of them factors with multiple levels. I am almost certain that we do not need all, or even most, of these variables; we can systematically determine which variables to keep.
Check missing data
colSums(sapply(train, is.na))
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 0 259 0
## Street Alley LotShape LandContour Utilities
## 0 1369 0 0 0
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 0 0
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 8 8 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 37 37 38 37 0
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 38 0 0 0 0
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## 0 0 1 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 0 0 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 0 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 0 0 690 81 81
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 81 0 0 81 81
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 1453 1179 1406
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 0 0
## SalePrice
## 0
Right off the bat, PoolQC, MiscFeature, Alley, and Fence can be eliminated. There are only 1460 rows, and each of these variables is missing more than half of its values.
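A minimal sketch of that cleanup step, using a 50 percent missingness threshold (the threshold is an assumption; the flagged columns follow from the counts above):
# Sketch: flag and drop any column missing more than half of its values
missing_share <- colMeans(is.na(train))
names(train)[missing_share > 0.5]  # Alley, PoolQC, Fence, MiscFeature
train_reduced <- train[, missing_share <= 0.5]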
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Descriptives
nums <- unlist(lapply(train, is.numeric))
trainb<-train[ , nums]
trainc<-subset(trainb, select=-c(Id))
summary(trainc)
## MSSubClass LotFrontage LotArea OverallQual
## Min. : 20.0 Min. : 21.00 Min. : 1300 Min. : 1.000
## 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554 1st Qu.: 5.000
## Median : 50.0 Median : 69.00 Median : 9478 Median : 6.000
## Mean : 56.9 Mean : 70.05 Mean : 10517 Mean : 6.099
## 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602 3rd Qu.: 7.000
## Max. :190.0 Max. :313.00 Max. :215245 Max. :10.000
## NA's :259
## OverallCond YearBuilt YearRemodAdd MasVnrArea
## Min. :1.000 Min. :1872 Min. :1950 Min. : 0.0
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 1st Qu.: 0.0
## Median :5.000 Median :1973 Median :1994 Median : 0.0
## Mean :5.575 Mean :1971 Mean :1985 Mean : 103.7
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 3rd Qu.: 166.0
## Max. :9.000 Max. :2010 Max. :2010 Max. :1600.0
## NA's :8
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8
## Median : 383.5 Median : 0.00 Median : 477.5 Median : 991.5
## Mean : 443.6 Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :5644.0 Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## Min. : 334 Min. : 0 Min. : 0.000 Min. : 334
## 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130
## Median :1087 Median : 0 Median : 0.000 Median :1464
## Mean :1163 Mean : 347 Mean : 5.845 Mean :1515
## 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777
## Max. :4692 Max. :2065 Max. :572.000 Max. :5642
##
## BsmtFullBath BsmtHalfBath FullBath HalfBath
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000
## Mean :0.4253 Mean :0.05753 Mean :1.565 Mean :0.3829
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.00000 Max. :3.000 Max. :2.0000
##
## BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## Min. :0.000 Min. :0.000 Min. : 2.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.:0.000
## Median :3.000 Median :1.000 Median : 6.000 Median :1.000
## Mean :2.866 Mean :1.047 Mean : 6.518 Mean :0.613
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000 3rd Qu.:1.000
## Max. :8.000 Max. :3.000 Max. :14.000 Max. :3.000
##
## GarageYrBlt GarageCars GarageArea WoodDeckSF
## Min. :1900 Min. :0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5 1st Qu.: 0.00
## Median :1980 Median :2.000 Median : 480.0 Median : 0.00
## Mean :1979 Mean :1.767 Mean : 473.0 Mean : 94.24
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 3rd Qu.:168.00
## Max. :2010 Max. :4.000 Max. :1418.0 Max. :857.00
## NA's :81
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 25.00 Median : 0.00 Median : 0.00 Median : 0.00
## Mean : 46.66 Mean : 21.95 Mean : 3.41 Mean : 15.06
## 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :547.00 Max. :552.00 Max. :508.00 Max. :480.00
##
## PoolArea MiscVal MoSold YrSold
## Min. : 0.000 Min. : 0.00 Min. : 1.000 Min. :2006
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007
## Median : 0.000 Median : 0.00 Median : 6.000 Median :2008
## Mean : 2.759 Mean : 43.49 Mean : 6.322 Mean :2008
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :738.000 Max. :15500.00 Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
Data documentation: http://jse.amstat.org/v19n3/decock.pdf
Plots
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 3.4.4
plot_missing(trainc)[1]
## feature
## 1 MSSubClass
## 2 LotFrontage
## 3 LotArea
## 4 OverallQual
## 5 OverallCond
## 6 YearBuilt
## 7 YearRemodAdd
## 8 MasVnrArea
## 9 BsmtFinSF1
## 10 BsmtFinSF2
## 11 BsmtUnfSF
## 12 TotalBsmtSF
## 13 X1stFlrSF
## 14 X2ndFlrSF
## 15 LowQualFinSF
## 16 GrLivArea
## 17 BsmtFullBath
## 18 BsmtHalfBath
## 19 FullBath
## 20 HalfBath
## 21 BedroomAbvGr
## 22 KitchenAbvGr
## 23 TotRmsAbvGrd
## 24 Fireplaces
## 25 GarageYrBlt
## 26 GarageCars
## 27 GarageArea
## 28 WoodDeckSF
## 29 OpenPorchSF
## 30 EnclosedPorch
## 31 X3SsnPorch
## 32 ScreenPorch
## 33 PoolArea
## 34 MiscVal
## 35 MoSold
## 36 YrSold
## 37 SalePrice
All of the selected numerical variables have less than 20% missing data. I would try to impute the missing values with the mean or mode, but that effort is probably outside the scope of DATA 605 and better suited to DATA 621.
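For reference, a minimal sketch of simple mean imputation on the numeric columns (not applied in the rest of this analysis, per the note above):
# Sketch: fill missing values in each numeric column with that column's mean
train_imputed <- trainc
for (col in names(train_imputed)) {
  idx <- is.na(train_imputed[[col]])
  train_imputed[[col]][idx] <- mean(train_imputed[[col]], na.rm = TRUE)
}
sum(is.na(train_imputed))  # should now be zero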
plot_histogram(trainc)
Full correlation matrix with all numerical variables
correlation_matrix <- round(cor(trainc),2)
# Get lower triangle of the correlation matrix
get_lower_tri<-function(correlation_matrix){
correlation_matrix[upper.tri(correlation_matrix)] <- NA
return(correlation_matrix)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(correlation_matrix){
correlation_matrix[lower.tri(correlation_matrix)]<- NA
return(correlation_matrix)
}
upper_tri <- get_upper_tri(correlation_matrix)
library(reshape2)
# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 15, hjust = 1))+
coord_fixed()
#add nice labels
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x=element_text(size=rel(0.8), angle=90),
axis.text.y=element_text(size=rel(0.8)),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 2,
title.position = "top", title.hjust = 0.5))
The numbers are hard to read, but we can use the heat map to gauge the correlation level for each pair of variables. We will reduce this to a handful of variables a little later on.
Let's produce scatterplots of two of the predictor variables against the response, overlaying a fitted regression line and the correlation coefficient to show the relationship between each predictor and the response.
library(ggpubr)
## Loading required package: magrittr
ggscatter(trainc, x = "GrLivArea", y = "SalePrice",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Above Grade Living Area", ylab = "Sales Price");
ggscatter(trainc, x = "LotArea", y = "SalePrice",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Lot Area", ylab = "Sales Price")
Derive a correlation matrix for any THREE variables. Let's pick two of the predictors (X1stFlrSF and LotArea) along with the response variable.
corr_data<-subset(trainc,select=c("X1stFlrSF","LotArea", "SalePrice"))
correlation_matrix <- round(cor(corr_data),2)
# Get lower triangle of the correlation matrix
get_lower_tri<-function(correlation_matrix){
correlation_matrix[upper.tri(correlation_matrix)] <- NA
return(correlation_matrix)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(correlation_matrix){
correlation_matrix[lower.tri(correlation_matrix)]<- NA
return(correlation_matrix)
}
upper_tri <- get_upper_tri(correlation_matrix)
library(reshape2)
# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 15, hjust = 1))+
coord_fixed()
#add nice labels
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x=element_text(size=rel(0.8), angle=90),
axis.text.y=element_text(size=rel(0.8)),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 80% confidence interval.
cor.test(corr_data$X1stFlrSF, corr_data$SalePrice, method = c("pearson", "kendall", "spearman"), conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: corr_data$X1stFlrSF and corr_data$SalePrice
## t = 29.078, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5841687 0.6266715
## sample estimates:
## cor
## 0.6058522
cor.test(corr_data$LotArea, corr_data$SalePrice, method = c("pearson", "kendall", "spearman"), conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: corr_data$LotArea and corr_data$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
cor.test(corr_data$X1stFlrSF, corr_data$LotArea, method = c("pearson", "kendall", "spearman"), conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: corr_data$X1stFlrSF and corr_data$LotArea
## t = 11.985, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2686127 0.3297222
## sample estimates:
## cor
## 0.2994746
In all three instances we have generated an 80 percent confidence interval, and in each case the p-value is extremely small. For all three tests we can therefore reject the null hypothesis and conclude that the true correlation between the selected variables is not 0.
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
What is familywise error? It is the probability of making at least one Type I error (a false rejection of a true null hypothesis) when performing several hypothesis tests on the same data. With three correlation tests each run at alpha = 0.05, the familywise error rate is 1 - (1 - 0.05)^3, about 14 percent, so some concern is reasonable. In this case, however, all three p-values are far below even a Bonferroni-adjusted threshold of 0.05/3, so the conclusions are unchanged.
n = 3
alpha = 0.05
fwer = 1 - (1 - alpha)^n
print(paste0("Familywise error rate is ", fwer))
## [1] "Familywise error rate is 0.142625"
5 points. Linear Algebra and Correlation. Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Correlation matrix
print(correlation_matrix)
## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.00 0.30 0.61
## LotArea 0.30 1.00 0.26
## SalePrice 0.61 0.26 1.00
Invert correlation matrix
require(Matrix)
## Loading required package: Matrix
my_mat <- solve(correlation_matrix)
print(my_mat)
## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.6489230 -0.2500619 -0.9408269
## LotArea -0.2500619 1.1104234 -0.1361723
## SalePrice -0.9408269 -0.1361723 1.6093092
Multiply the correlation matrix by the precision matrix
p_mat <- correlation_matrix%*%my_mat
print(p_mat)
## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.000000e+00 1.387779e-17 0.000000e+00
## LotArea -2.775558e-17 1.000000e+00 -5.551115e-17
## SalePrice -1.110223e-16 0.000000e+00 1.000000e+00
Multiply the result by the correlation matrix. Since the previous product is the identity, this recovers the original correlation matrix.
x_mat <- p_mat%*%correlation_matrix
print("We have derived our original correlation matrix")
## [1] "We have derived our original correlation matrix"
print( x_mat)
## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.00 0.30 0.61
## LotArea 0.30 1.00 0.26
## SalePrice 0.61 0.26 1.00
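The prompt also asks for the precision matrix multiplied by the correlation matrix directly; a quick sketch confirms that this product is (numerically) the 3 x 3 identity matrix as well:
# Sketch: precision matrix times correlation matrix, rounded to suppress floating point noise
round(my_mat %*% correlation_matrix, 10)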
Conduct LU decomposition on the matrix.
lu_mat<-lu(correlation_matrix)
lu_mat2<-expand(lu_mat)
print(lu_mat2$L %*% lu_mat2$U)
## 3 x 3 Matrix of class "dgeMatrix"
## [,1] [,2] [,3]
## [1,] 1.00 0.30 0.61
## [2,] 0.30 1.00 0.26
## [3,] 0.61 0.26 1.00
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
Gr Living Area is a variable with a right skew
plot_histogram(trainc$GrLivArea);
summary(trainc$GrLivArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
The minimum of GrLivArea is 334, which is already above zero, so we do not need to shift the variable.
Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable.
library(MASS)
dist<-fitdistr(trainc$GrLivArea, densfun = 'exponential')
lambda <- dist$estimate
exp_distribution <- rexp(1000, lambda)
summary(exp_distribution);
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.69 423.94 1050.27 1530.28 2152.03 11314.29
plot_histogram(trainc$GrLivArea);
hist(exp_distribution, main = "Simulated Above Grade Living Area", xlab="", col = "blue")
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
quantile(exp_distribution, c(.05, .95))
## 5% 95%
## 82.85466 4385.03358
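Strictly speaking, these percentiles can also be taken from the fitted exponential distribution itself through its inverse CDF, rather than from the simulated sample; a sketch, which lands near 78 and 4540:
# Sketch: theoretical 5th and 95th percentiles of the fitted exponential,
# using the rate (lambda) estimated by fitdistr above
qexp(c(0.05, 0.95), rate = lambda)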
Also generate a 95% confidence interval from the empirical data, assuming normality.
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
CI(trainc$GrLivArea, ci=0.95)
## upper mean lower
## 1542.440 1515.464 1488.487
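For transparency, a sketch of the same interval computed by hand under the normality assumption (using the standard normal quantile; the result should essentially match the Rmisc output above):
# Sketch: 95% confidence interval for the mean of GrLivArea, assuming normality
xbar <- mean(trainc$GrLivArea)
s <- sd(trainc$GrLivArea)
n <- length(trainc$GrLivArea)
xbar + c(-1, 1) * qnorm(0.975) * s / sqrt(n)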
Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
quantile(trainc$GrLivArea, c(.05, .95))
## 5% 95%
## 848.0 2466.1
The exponential fit is a poor match for this variable. Its 5th percentile (roughly 80) sits far below the empirical 5th percentile of 848, and its 95th percentile (roughly 4400-4500) sits well above the empirical 2466. The exponential distribution puts too much mass near zero and has too heavy a right tail; GrLivArea is right-skewed, but not exponentially distributed, so the empirical percentiles and the normal-theory interval for the mean describe the data better than the fitted exponential.
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
When it comes to modeling, there are numerous approaches, some more involved than others. I could use PCA to reduce the dimension of the data, or create multiple dummy variables at the cost of degrees of freedom. For the sake of this course, I will keep the model simple and limit it to variables that have a decent correlation with SalePrice.
Let's see how close our response variable is to a normal distribution.
ggqqplot(trainc$SalePrice);
x <- trainc$SalePrice
h<-hist(x, breaks=10, col="red", xlab="Sales Price",
main="Histogram with Normal Curve")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
The response variable is somewhat right-skewed, but it is not obvious that a major transformation is required; we will confirm with model diagnostics.
We examined a heat map based on the correlation matrix, and we can use that to our advantage: we can systematically identify which predictors have strong correlations with the response variable.
cor(trainc[-37], trainc$SalePrice)
## [,1]
## MSSubClass -0.08428414
## LotFrontage NA
## LotArea 0.26384335
## OverallQual 0.79098160
## OverallCond -0.07785589
## YearBuilt 0.52289733
## YearRemodAdd 0.50710097
## MasVnrArea NA
## BsmtFinSF1 0.38641981
## BsmtFinSF2 -0.01137812
## BsmtUnfSF 0.21447911
## TotalBsmtSF 0.61358055
## X1stFlrSF 0.60585218
## X2ndFlrSF 0.31933380
## LowQualFinSF -0.02560613
## GrLivArea 0.70862448
## BsmtFullBath 0.22712223
## BsmtHalfBath -0.01684415
## FullBath 0.56066376
## HalfBath 0.28410768
## BedroomAbvGr 0.16821315
## KitchenAbvGr -0.13590737
## TotRmsAbvGrd 0.53372316
## Fireplaces 0.46692884
## GarageYrBlt NA
## GarageCars 0.64040920
## GarageArea 0.62343144
## WoodDeckSF 0.32441344
## OpenPorchSF 0.31585623
## EnclosedPorch -0.12857796
## X3SsnPorch 0.04458367
## ScreenPorch 0.11144657
## PoolArea 0.09240355
## MiscVal -0.02118958
## MoSold 0.04643225
## YrSold -0.02892259
We will take variables with strong positive correlations greater than .5.
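A sketch of pulling those predictor names programmatically from the correlation vector computed above (the 0.5 cutoff mirrors the statement above):
# Sketch: numeric predictors whose correlation with SalePrice exceeds 0.5
cors <- cor(trainc[-37], trainc$SalePrice)
rownames(cors)[which(cors[, 1] > 0.5)]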
mod <- lm(SalePrice~GarageArea+GarageCars+TotRmsAbvGrd+FullBath+GrLivArea+X1stFlrSF+TotalBsmtSF+YearRemodAdd+YearBuilt+OverallQual, data=trainc)
summary(mod)
##
## Call:
## lm(formula = SalePrice ~ GarageArea + GarageCars + TotRmsAbvGrd +
## FullBath + GrLivArea + X1stFlrSF + TotalBsmtSF + YearRemodAdd +
## YearBuilt + OverallQual, data = trainc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -489958 -19316 -1948 16020 290558
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.186e+06 1.291e+05 -9.187 < 2e-16 ***
## GarageArea 1.495e+01 1.031e+01 1.450 0.147384
## GarageCars 1.042e+04 3.044e+03 3.422 0.000639 ***
## TotRmsAbvGrd 3.310e+01 1.119e+03 0.030 0.976404
## FullBath -6.791e+03 2.682e+03 -2.532 0.011457 *
## GrLivArea 5.130e+01 4.233e+00 12.119 < 2e-16 ***
## X1stFlrSF 1.417e+01 4.930e+00 2.875 0.004097 **
## TotalBsmtSF 1.986e+01 4.295e+00 4.625 4.09e-06 ***
## YearRemodAdd 2.965e+02 6.363e+01 4.659 3.47e-06 ***
## YearBuilt 2.682e+02 5.035e+01 5.328 1.15e-07 ***
## OverallQual 1.960e+04 1.190e+03 16.472 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37920 on 1449 degrees of freedom
## Multiple R-squared: 0.7737, Adjusted R-squared: 0.7721
## F-statistic: 495.4 on 10 and 1449 DF, p-value: < 2.2e-16
GarageArea and TotRmsAbvGrd do not appear to be significant. We have an adjusted R-squared of .77, meaning about 77% of the variability in SalePrice is accounted for by the model.
Remove the non-significant predictors and refit the model.
mod <- lm(SalePrice~GarageCars+FullBath+GrLivArea+X1stFlrSF+TotalBsmtSF+YearRemodAdd+YearBuilt+OverallQual, data=trainc)
summary(mod)
##
## Call:
## lm(formula = SalePrice ~ GarageCars + FullBath + GrLivArea +
## X1stFlrSF + TotalBsmtSF + YearRemodAdd + YearBuilt + OverallQual,
## data = trainc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -482525 -19191 -1801 16208 289639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.188e+06 1.284e+05 -9.255 < 2e-16 ***
## GarageCars 1.395e+04 1.817e+03 7.680 2.92e-14 ***
## FullBath -7.184e+03 2.644e+03 -2.717 0.00666 **
## GrLivArea 5.177e+01 3.097e+00 16.714 < 2e-16 ***
## X1stFlrSF 1.465e+01 4.919e+00 2.979 0.00294 **
## TotalBsmtSF 2.039e+01 4.269e+00 4.775 1.98e-06 ***
## YearRemodAdd 2.957e+02 6.362e+01 4.649 3.64e-06 ***
## YearBuilt 2.699e+02 5.017e+01 5.380 8.69e-08 ***
## OverallQual 1.959e+04 1.188e+03 16.486 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37920 on 1451 degrees of freedom
## Multiple R-squared: 0.7734, Adjusted R-squared: 0.7721
## F-statistic: 618.9 on 8 and 1451 DF, p-value: < 2.2e-16
We retain an almost identical adjusted R-squared with fewer predictors.
Diagnostics
hist(mod$residuals);
qqnorm(mod$residuals);
qqline(mod$residuals)
The residuals appear roughly normally distributed. We still need to check for constant variance.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
##
## cement
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(mod)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------------
## Response : SalePrice
## Variables: fitted values of SalePrice
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 2921.5080
## Prob > Chi2 = 0.0000
The small p-value in the Breusch-Pagan test for heteroskedasticity is strong evidence against the null hypothesis of constant variance, so the constant-variance assumption is not met. Let us also check visually.
plot(fitted(mod), residuals(mod), xlab="fitted", ylab="residuals")
abline(h=0);
plot(fitted(mod), sqrt(abs(residuals(mod))), xlab="fitted", ylab=expression(sqrt(hat(epsilon))))
If we look at the residuals, we can see a parabolic shape, indicating that some transform of the response variable is needed. If we look at the square root of the absolute residuals, the parabolic pattern becomes even more prominent.
We can apply the Box-Cox transform. This workflow is described in detail in Julian Faraway's Linear Models with R.
library(MASS)
boxcox(mod, plotit=T, lambda=seq(0, 0.2, by=0.01))
According to the Box-Cox plot, the maximum log-likelihood occurs around -2700, with the confidence interval for lambda spanning roughly (0.07, 0.17). Taking a value near the center of that interval, our power transform will be 0.13.
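A sketch of extracting that value programmatically instead of reading it off the plot (re-running boxcox on the fitted model from above without plotting):
# Sketch: lambda that maximizes the Box-Cox profile log-likelihood
bc <- boxcox(mod, lambda = seq(0, 0.2, by = 0.01), plotit = FALSE)
bc$x[which.max(bc$y)]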
traind<-trainc
mod2 <- lm(SalePrice^(0.13)~GarageCars+FullBath+GrLivArea+X1stFlrSF+TotalBsmtSF+YearRemodAdd+YearBuilt+OverallQual, data=trainc)
summary(mod2)
##
## Call:
## lm(formula = SalePrice^(0.13) ~ GarageCars + FullBath + GrLivArea +
## X1stFlrSF + TotalBsmtSF + YearRemodAdd + YearBuilt + OverallQual,
## data = trainc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.38745 -0.04679 0.00302 0.05812 0.34836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.760e-01 3.544e-01 -2.471 0.0136 *
## GarageCars 5.209e-02 5.016e-03 10.383 < 2e-16 ***
## FullBath -1.291e-02 7.301e-03 -1.769 0.0771 .
## GrLivArea 1.500e-04 8.552e-06 17.542 < 2e-16 ***
## X1stFlrSF 3.827e-05 1.358e-05 2.818 0.0049 **
## TotalBsmtSF 5.718e-05 1.179e-05 4.850 1.36e-06 ***
## YearRemodAdd 1.331e-03 1.757e-04 7.579 6.19e-14 ***
## YearBuilt 1.139e-03 1.385e-04 8.219 4.50e-16 ***
## OverallQual 5.982e-02 3.281e-03 18.234 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1047 on 1451 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.8236
## F-statistic: 852.4 on 8 and 1451 DF, p-value: < 2.2e-16
The adjusted R-squared has increased from .77 to .82. (The residual standard error is now on the transformed scale, so it is not directly comparable to the earlier model.)
hist(mod2$residuals);
qqnorm(mod2$residuals);
qqline(mod2$residuals)
A slight skew has been introduced into the residuals, but it does not appear to be severe. Let's visually check for constant variance.
plot(fitted(mod2), resid(mod2), col = "dodgerblue",
pch = 20, cex = 1.5, xlab = "Fitted", ylab = "Residuals")
abline(h = 0, lty = 2, col = "darkorange", lwd = 2)
Let's examine outliers at a high level.
library(car)
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4
outlierTest(mod2)
## rstudent unadjusted p-value Bonferonni p
## 1299 -15.538912 1.7562e-50 2.5640e-47
## 524 -9.634839 2.4441e-21 3.5683e-18
## 633 -4.748411 2.2531e-06 3.2895e-03
## 31 -4.617930 4.2190e-06 6.1597e-03
## 1325 -4.171794 3.2015e-05 4.6742e-02
We have identified several outliers; the very small Bonferroni-adjusted p-values (especially for observations 1299 and 524) indicate that these points are significant outliers and would be worth investigating before finalizing the model.
Can we reduce our predictors even further by using variance inflation factors?
vif(mod2)
## GarageCars FullBath GrLivArea X1stFlrSF TotalBsmtSF
## 1.869688 2.152343 2.687048 3.668313 3.558638
## YearRemodAdd YearBuilt OverallQual
## 1.750023 2.329177 2.738752
The VIF values indicate that we do not need to remove additional predictors, since no VIF is unusually large (all are well below the common rule-of-thumb cutoff of 5).
Before we conclude modeling, let's examine all possible subsets of predictors and compare their performance on criteria such as adjusted R-squared and Mallows' Cp.
k<-ols_step_all_possible(mod2)
plot(k)
At a high level, using all 8 predictors yields the best adjusted R-squared and the lowest AIC.
Feature selection at a more detailed level
h<-ols_step_best_subset(mod2)
h
## Best Subsets Regression
## -----------------------------------------------------------------------------------------------------
## Model Index Predictors
## -----------------------------------------------------------------------------------------------------
## 1 OverallQual
## 2 GrLivArea OverallQual
## 3 GrLivArea YearBuilt OverallQual
## 4 GarageCars GrLivArea YearBuilt OverallQual
## 5 GarageCars GrLivArea TotalBsmtSF YearBuilt OverallQual
## 6 GarageCars GrLivArea TotalBsmtSF YearRemodAdd YearBuilt OverallQual
## 7 GarageCars GrLivArea X1stFlrSF TotalBsmtSF YearRemodAdd YearBuilt OverallQual
## 8 GarageCars FullBath GrLivArea X1stFlrSF TotalBsmtSF YearRemodAdd YearBuilt OverallQual
## -----------------------------------------------------------------------------------------------------
##
## Subsets Regression Summary
## ----------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## ----------------------------------------------------------------------------------------------------------------------------------------
## 1 0.6714 0.6712 0.6704 1261.4740 -1532.4749 -5678.0546 -1516.6163 0.0205 0.0205 0.0000 0.3295
## 2 0.7464 0.7461 0.7435 643.2617 -1908.7174 -6054.0257 -1887.5727 0.0158 0.0158 0.0000 0.2546
## 3 0.7871 0.7867 0.7839 308.7680 -2162.0458 -6306.7712 -2135.6149 0.0133 0.0133 0.0000 0.2141
## 4 0.8037 0.8032 0.8003 173.3529 -2278.6801 -6423.0072 -2246.9630 0.0123 0.0123 0.0000 0.1976
## 5 0.8166 0.8160 0.8073 68.6192 -2375.9748 -6519.7261 -2338.9715 0.0115 0.0115 0.0000 0.1849
## 6 0.8233 0.8225 0.8137 15.6517 -2427.8729 -6571.1893 -2385.5834 0.0111 0.0111 0.0000 0.1784
## 7 0.8242 0.8233 0.8144 10.1290 -2433.4066 -6576.6425 -2385.8308 0.0110 0.0110 0.0000 0.1778
## 8 0.8246 0.8236 0.8134 9.0000 -2434.5516 -6577.7405 -2381.6897 0.0110 0.0110 0.0000 0.1776
## ----------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
Models 4 through 8 are fairly close in adjusted R-squared, but removing any more predictors produces a sharp drop-off.
plot(h)
It would be easy to keep searching for ways to optimize the model; I would even go as far as saying a GLM should be considered here, but that is outside the scope of the class.
Let's apply the model to the test data and make some predictions.
test_results <- predict(mod2, test)
prediction <- data.frame(Id = test[,"Id"], SalePrice = test_results)
prediction[prediction<0] <- 0
prediction <- replace(prediction,is.na(prediction),0)
prediction$SalePrice <- prediction$SalePrice^(1/.13)
head(test_results)
## 1 2 3 4 5 6
## 4.525119 4.684621 4.768366 4.825383 4.915384 4.806456
write.csv(prediction, "salepricepredictions.csv")
Kaggle results: rank 4558, posted under the user name Vinicio Haro; 1 submission, scored at 0.47026.