Load required R Packages

## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## Warning: package 'corrplot' was built under R version 3.4.2

Load train.csv as the training data set and confirm that the data loaded correctly.

##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1  1         60       RL          65    8450   Pave  <NA>      Reg
## 2  2         20       RL          80    9600   Pave  <NA>      Reg
## 3  3         60       RL          68   11250   Pave  <NA>      IR1
## 4  4         70       RL          60    9550   Pave  <NA>      IR1
## 5  5         60       RL          84   14260   Pave  <NA>      IR1
## 6  6         50       RL          85   14115   Pave  <NA>      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 2         Lvl    AllPub       FR2       Gtl      Veenker      Feedr
## 3         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 4         Lvl    AllPub    Corner       Gtl      Crawfor       Norm
## 5         Lvl    AllPub       FR2       Gtl      NoRidge       Norm
## 6         Lvl    AllPub    Inside       Gtl      Mitchel       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm     1Fam     2Story           7           5      2003
## 2       Norm     1Fam     1Story           6           8      1976
## 3       Norm     1Fam     2Story           7           5      2001
## 4       Norm     1Fam     2Story           7           5      1915
## 5       Norm     1Fam     2Story           8           5      2000
## 6       Norm     1Fam     1.5Fin           5           5      1993
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         2003     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 2         1976     Gable  CompShg     MetalSd     MetalSd       None
## 3         2002     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 4         1970     Gable  CompShg     Wd Sdng     Wd Shng       None
## 5         2000     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 6         1995     Gable  CompShg     VinylSd     VinylSd       None
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1        196        Gd        TA      PConc       Gd       TA           No
## 2          0        TA        TA     CBlock       Gd       TA           Gd
## 3        162        Gd        TA      PConc       Gd       TA           Mn
## 4          0        TA        TA     BrkTil       TA       Gd           No
## 5        350        Gd        TA      PConc       Gd       TA           Av
## 6          0        TA        TA       Wood       Gd       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1          GLQ        706          Unf          0       150         856
## 2          ALQ        978          Unf          0       284        1262
## 3          GLQ        486          Unf          0       434         920
## 4          ALQ        216          Unf          0       540         756
## 5          GLQ        655          Unf          0       490        1145
## 6          GLQ        732          Unf          0        64         796
##   Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1    GasA        Ex          Y      SBrkr       856       854            0
## 2    GasA        Ex          Y      SBrkr      1262         0            0
## 3    GasA        Ex          Y      SBrkr       920       866            0
## 4    GasA        Gd          Y      SBrkr       961       756            0
## 5    GasA        Ex          Y      SBrkr      1145      1053            0
## 6    GasA        Ex          Y      SBrkr       796       566            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1      1710            1            0        2        1            3
## 2      1262            0            1        2        0            3
## 3      1786            1            0        2        1            3
## 4      1717            1            0        1        0            3
## 5      2198            1            0        2        1            4
## 6      1362            1            0        1        1            1
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            1          Gd            8        Typ          0        <NA>
## 2            1          TA            6        Typ          1          TA
## 3            1          Gd            6        Typ          1          TA
## 4            1          Gd            7        Typ          1          Gd
## 5            1          Gd            9        Typ          1          TA
## 6            1          TA            5        Typ          0        <NA>
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1     Attchd        2003          RFn          2        548         TA
## 2     Attchd        1976          RFn          2        460         TA
## 3     Attchd        2001          RFn          2        608         TA
## 4     Detchd        1998          Unf          3        642         TA
## 5     Attchd        2000          RFn          3        836         TA
## 6     Attchd        1993          Unf          2        480         TA
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1         TA          Y          0          61             0          0
## 2         TA          Y        298           0             0          0
## 3         TA          Y          0          42             0          0
## 4         TA          Y          0          35           272          0
## 5         TA          Y        192          84             0          0
## 6         TA          Y         40          30             0        320
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1           0        0   <NA>  <NA>        <NA>       0      2   2008
## 2           0        0   <NA>  <NA>        <NA>       0      5   2007
## 3           0        0   <NA>  <NA>        <NA>       0      9   2008
## 4           0        0   <NA>  <NA>        <NA>       0      2   2006
## 5           0        0   <NA>  <NA>        <NA>       0     12   2008
## 6           0        0   <NA> MnPrv        Shed     700     10   2009
##   SaleType SaleCondition SalePrice
## 1       WD        Normal    208500
## 2       WD        Normal    181500
## 3       WD        Normal    223500
## 4       WD       Abnorml    140000
## 5       WD        Normal    250000
## 6       WD        Normal    143000

Set Up:

Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Pick the dependent variable and define it as Y.

For this project, I have chosen train$GrLivArea (above ground living area, in square feet) as my quantitative independent variable.

Check the skewness of train$GrLivArea to confirm that it is skewed to the right.

## [1] 1.365156

The positive value indicates that the variable is skewed to the right.
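As a rough illustration of what the skewness check measures, the sketch below computes a moment-based sample skewness in base R. The helper name `sample_skewness` and the toy vector are assumptions made for illustration only; the value reported above may come from a package function that applies a slightly different small-sample correction.

```r
# Moment-based sample skewness: third central moment divided by the
# 1.5 power of the second central moment.
# Hypothetical helper, not necessarily the function used for the output above.
sample_skewness <- function(v) {
  v <- v[!is.na(v)]
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy data with a long right tail (illustrative values, not from train.csv).
toy_area <- c(900, 1100, 1200, 1300, 1500, 1600, 2400, 4800)
skew_value <- sample_skewness(toy_area)

# A value above zero means the long tail is on the right.
print(skew_value > 0)
```

On the real data the same helper would be called as `sample_skewness(train$GrLivArea)`, assuming `train` has been loaded.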

My dependent variable for this project is train$SalePrice.

Summary of train$GrLivArea (X)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1464    1515    1777    5642

Summary for train$SalePrice (Y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.

Define \(x\) and \(y\)

First quartile of Above Ground Living Space \(x = 1129.5\)

##    25% 
## 1129.5

First quartile of Sale Price \(y = 129975\)

##    25% 
## 129975
# Rows where X > x
X_g_x <- subset(train, GrLivArea > x)
# Rows where X <= x
X_le_x <- subset(train, GrLivArea <= x)
# Rows where Y > y
Y_g_y <- subset(train, SalePrice > y)
# Rows where Y <= y
Y_le_y <- subset(train, SalePrice <= y)

Calculating the joint/conditional probability assuming that the events are dependent.

  1. \(P(X > x | Y > y)\)

This is the probability that the Above Ground Living Space (X) is greater than the 1st quartile of Above Ground Living Space, given that the Sale Price (Y) is greater than the 1st quartile of Sale Price.

\(P(X > x | Y > y)\) is the probability of \(X > x\) given \(Y > y\). This is the area represented by the intersection of both events, divided by the total area of the given event.

Calculate the number of rows where \(Y > y\)

## [1] 1095

Calculate the total number of rows for train

## [1] 1460

Probability of \(P(X > x | Y > y)\)

## [1] 0.8712329

Given that a home's sale price is above the 1st quartile of Sale Price, the probability that its Above Ground Living Space is also above its 1st quartile is 87.12%.

  2. \(P(X > x, Y > y)\)

The probability that the Above Ground Living Space (X) is greater than the 1st quartile for Above Ground Living Space and the Sale Price (Y) is greater than the 1st quartile for Sale Price.

\(P(X > x, Y > y)\) is the number of rows where both \(X > x\) and \(Y > y\), divided by the total number of rows.

## [1] 0.6534247

The probability is about 65.34%.

  3. \(P(X < x | Y > y)\)

The probability that the Above Ground Living Space (X) is less than the 1st quartile for Above Ground Living Space given that the Sale Price (Y) is greater than the 1st quartile for Sale Price.

\(P(X < x | Y > y)\) is the number of rows where both \(X < x\) and \(Y > y\), divided by the number of rows where \(Y > y\).

## [1] 0.1287671

The probability is about 12.88%.

Table of Counts

# Cell counts for the 2 x 2 table, split at the first quartiles x and y
a <- sum(train$GrLivArea <= x & train$SalePrice <= y)
b <- sum(train$GrLivArea <= x & train$SalePrice > y)
c <- sum(train$GrLivArea > x & train$SalePrice <= y)
d <- sum(train$GrLivArea > x & train$SalePrice > y)
##                <=1st quartile >1st quartile Total
## <=1st quartile            224           141   365
## >1st quartile             141           954  1095
## Total                     365          1095  1460
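The three probabilities reported earlier can be recovered directly from the table of counts. The sketch below hard-codes the four cell counts from the table rather than reading train.csv, and the variable names (`n_low_low` and so on) are made up for this illustration.

```r
# Cell counts copied from the table of counts above.
n_low_low   <- 224  # X <= x and Y <= y
n_low_high  <- 141  # X <= x and Y >  y
n_high_low  <- 141  # X >  x and Y <= y
n_high_high <- 954  # X >  x and Y >  y
n_total <- n_low_low + n_low_high + n_high_low + n_high_high

# Number of rows with Y > y, the conditioning event.
n_y_high <- n_low_high + n_high_high

p_joint         <- n_high_high / n_total   # P(X > x, Y > y)
p_high_given_y  <- n_high_high / n_y_high  # P(X > x | Y > y)
p_low_given_y   <- n_low_high / n_y_high   # P(X <= x | Y > y)

print(round(c(joint = p_joint,
              high_given_y = p_high_given_y,
              low_given_y = p_low_given_y), 4))
```

The two conditional probabilities share the same denominator (1095 rows with Y > y), so they sum to 1.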

Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.

\(P(A) = 1095/1460 = 0.75\)

\(P(B) = 1095/1460 = 0.75\)

P(A)P(B) = 0.75 * 0.75 = 0.5625

P(AB) = 954/1460 = 0.6534

Since P(AB) = 0.6534 is not equal to P(A)P(B) = 0.5625, A and B are not independent.

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  matrix(c(a, b, c, d), ncol = 2)
## X-squared = 340.75, df = 1, p-value < 2.2e-16

Because the p-value (< 2.2e-16) is below the 0.05 significance level, we reject the null hypothesis that GrLivArea is independent of SalePrice.
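The independence check and the Chi Square test can both be reproduced from the table of counts alone. This is a minimal sketch that hard-codes the four counts; the object names (`counts`, `p_a`, `p_b`, `p_ab`) and the dimension labels are illustrative choices, not names taken from the original analysis.

```r
# 2 x 2 table of counts from the text (filled column by column).
counts <- matrix(c(224, 141, 141, 954), ncol = 2,
                 dimnames = list(X = c("<=Q1", ">Q1"),
                                 Y = c("<=Q1", ">Q1")))
n <- sum(counts)

# A: X above its 1st quartile; B: Y above its 1st quartile.
p_a  <- sum(counts[">Q1", ]) / n
p_b  <- sum(counts[, ">Q1"]) / n
p_ab <- counts[">Q1", ">Q1"] / n

# Under independence these two numbers would be equal.
print(c(product = p_a * p_b, joint = p_ab))

# chisq.test applies Yates' continuity correction to 2 x 2 tables by default.
chi_result <- chisq.test(counts)
print(chi_result$statistic)
print(chi_result$p.value)
```

Comparing `p_a * p_b` with `p_ab` is the mathematical check; the test then asks whether a gap of that size could plausibly arise by chance.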

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y. Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 92% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Scatterplot

ggplot(train, aes(x=GrLivArea, y=SalePrice)) + 
  geom_point()+
  geom_smooth(method=lm)

Histograms

Q-Q Plots: Above Ground Living Space, Sale Price

Box Plots: Above Ground Living Space, Sale Price

boxplot(train$SalePrice)

Density Plot

ggplot(data = train, aes(x = GrLivArea) )+ 
  geom_density(alpha = .2, fill = "gold1")+
  ggtitle("Above Ground Living Space in SQFT")

ggplot(data = train, aes(x = SalePrice)) +
  geom_density(alpha = .2, fill = "gold1") +
  ggtitle("Sale Price")

##               vars    n      mean       sd   median   trimmed      mad
## MSSubClass       1 1460     56.90    42.30     50.0     49.15    44.48
## LotFrontage      2 1201     70.05    24.28     69.0     68.94    16.31
## LotArea          3 1460  10516.83  9981.26   9478.5   9563.28  2962.23
## OverallQual      4 1460      6.10     1.38      6.0      6.08     1.48
## OverallCond      5 1460      5.58     1.11      5.0      5.48     0.00
## YearBuilt        6 1460   1971.27    30.20   1973.0   1974.13    37.06
## YearRemodAdd     7 1460   1984.87    20.65   1994.0   1986.37    19.27
## MasVnrArea       8 1452    103.69   181.07      0.0     63.15     0.00
## BsmtFinSF1       9 1460    443.64   456.10    383.5    386.08   568.58
## BsmtFinSF2      10 1460     46.55   161.32      0.0      1.38     0.00
## BsmtUnfSF       11 1460    567.24   441.87    477.5    519.29   426.99
## TotalBsmtSF     12 1460   1057.43   438.71    991.5   1036.70   347.67
## X1stFlrSF       13 1460   1162.63   386.59   1087.0   1129.99   347.67
## X2ndFlrSF       14 1460    346.99   436.53      0.0    285.36     0.00
## LowQualFinSF    15 1460      5.84    48.62      0.0      0.00     0.00
## GrLivArea       16 1460   1515.46   525.48   1464.0   1467.67   483.33
## BsmtFullBath    17 1460      0.43     0.52      0.0      0.39     0.00
## BsmtHalfBath    18 1460      0.06     0.24      0.0      0.00     0.00
## FullBath        19 1460      1.57     0.55      2.0      1.56     0.00
## HalfBath        20 1460      0.38     0.50      0.0      0.34     0.00
## BedroomAbvGr    21 1460      2.87     0.82      3.0      2.85     0.00
## KitchenAbvGr    22 1460      1.05     0.22      1.0      1.00     0.00
## TotRmsAbvGrd    23 1460      6.52     1.63      6.0      6.41     1.48
## Fireplaces      24 1460      0.61     0.64      1.0      0.53     1.48
## GarageYrBlt     25 1379   1978.51    24.69   1980.0   1981.07    31.13
## GarageCars      26 1460      1.77     0.75      2.0      1.77     0.00
## GarageArea      27 1460    472.98   213.80    480.0    469.81   177.91
## WoodDeckSF      28 1460     94.24   125.34      0.0     71.76     0.00
## OpenPorchSF     29 1460     46.66    66.26     25.0     33.23    37.06
## EnclosedPorch   30 1460     21.95    61.12      0.0      3.87     0.00
## X3SsnPorch      31 1460      3.41    29.32      0.0      0.00     0.00
## ScreenPorch     32 1460     15.06    55.76      0.0      0.00     0.00
## PoolArea        33 1460      2.76    40.18      0.0      0.00     0.00
## MiscVal         34 1460     43.49   496.12      0.0      0.00     0.00
## MoSold          35 1460      6.32     2.70      6.0      6.25     2.97
## YrSold          36 1460   2007.82     1.33   2008.0   2007.77     1.48
## SalePrice       37 1460 180921.20 79442.50 163000.0 170783.29 56338.80
##                 min    max  range  skew kurtosis      se
## MSSubClass       20    190    170  1.40     1.56    1.11
## LotFrontage      21    313    292  2.16    17.34    0.70
## LotArea        1300 215245 213945 12.18   202.26  261.22
## OverallQual       1     10      9  0.22     0.09    0.04
## OverallCond       1      9      8  0.69     1.09    0.03
## YearBuilt      1872   2010    138 -0.61    -0.45    0.79
## YearRemodAdd   1950   2010     60 -0.50    -1.27    0.54
## MasVnrArea        0   1600   1600  2.66    10.03    4.75
## BsmtFinSF1        0   5644   5644  1.68    11.06   11.94
## BsmtFinSF2        0   1474   1474  4.25    20.01    4.22
## BsmtUnfSF         0   2336   2336  0.92     0.46   11.56
## TotalBsmtSF       0   6110   6110  1.52    13.18   11.48
## X1stFlrSF       334   4692   4358  1.37     5.71   10.12
## X2ndFlrSF         0   2065   2065  0.81    -0.56   11.42
## LowQualFinSF      0    572    572  8.99    82.83    1.27
## GrLivArea       334   5642   5308  1.36     4.86   13.75
## BsmtFullBath      0      3      3  0.59    -0.84    0.01
## BsmtHalfBath      0      2      2  4.09    16.31    0.01
## FullBath          0      3      3  0.04    -0.86    0.01
## HalfBath          0      2      2  0.67    -1.08    0.01
## BedroomAbvGr      0      8      8  0.21     2.21    0.02
## KitchenAbvGr      0      3      3  4.48    21.42    0.01
## TotRmsAbvGrd      2     14     12  0.67     0.87    0.04
## Fireplaces        0      3      3  0.65    -0.22    0.02
## GarageYrBlt    1900   2010    110 -0.65    -0.42    0.66
## GarageCars        0      4      4 -0.34     0.21    0.02
## GarageArea        0   1418   1418  0.18     0.90    5.60
## WoodDeckSF        0    857    857  1.54     2.97    3.28
## OpenPorchSF       0    547    547  2.36     8.44    1.73
## EnclosedPorch     0    552    552  3.08    10.37    1.60
## X3SsnPorch        0    508    508 10.28   123.06    0.77
## ScreenPorch       0    480    480  4.11    18.34    1.46
## PoolArea          0    738    738 14.80   222.19    1.05
## MiscVal           0  15500  15500 24.43   697.64   12.98
## MoSold            1     12     11  0.21    -0.41    0.07
## YrSold         2006   2010      4  0.10    -1.19    0.03
## SalePrice     34900 755000 720100  1.88     6.50 2079.11

There are 37 numerical variables in the train data set.

Histograms of the 36 numerical variables

##     MSZoning     Street      Alley      LotShape  LandContour
##  C (all):  10   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63   
##  FV     :  65   Pave:1454   Pave:  41   IR2: 41   HLS:  50   
##  RH     :  16               NA's:1369   IR3: 10   Low:  36   
##  RL     :1151                           Reg:925   Lvl:1311   
##  RM     : 218                                                
##                                                              
##                                                              
##   Utilities      LotConfig    LandSlope   Neighborhood   Condition1  
##  AllPub:1459   Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260  
##  NoSeWa:   1   CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81  
##                FR2    :  47   Sev:  13   OldTown:113   Artery :  48  
##                FR3    :   4              Edwards:100   RRAn   :  26  
##                Inside :1052              Somerst: 86   PosN   :  19  
##                                          Gilbert: 79   RRAe   :  11  
##                                          (Other):707   (Other):  15  
##    Condition2     BldgType      HouseStyle    RoofStyle       RoofMatl   
##  Norm   :1445   1Fam  :1220   1Story :726   Flat   :  13   CompShg:1434  
##  Feedr  :   6   2fmCon:  31   2Story :445   Gable  :1141   Tar&Grv:  11  
##  Artery :   2   Duplex:  52   1.5Fin :154   Gambrel:  11   WdShngl:   6  
##  PosN   :   2   Twnhs :  43   SLvl   : 65   Hip    : 286   WdShake:   5  
##  RRNn   :   2   TwnhsE: 114   SFoyer : 37   Mansard:   7   ClyTile:   1  
##  PosA   :   1                 1.5Unf : 14   Shed   :   2   Membran:   1  
##  (Other):   2                 (Other): 19                  (Other):   2  
##   Exterior1st   Exterior2nd    MasVnrType  ExterQual ExterCond
##  VinylSd:515   VinylSd:504   BrkCmn : 15   Ex: 52    Ex:   3  
##  HdBoard:222   MetalSd:214   BrkFace:445   Fa: 14    Fa:  28  
##  MetalSd:220   HdBoard:207   None   :864   Gd:488    Gd: 146  
##  Wd Sdng:206   Wd Sdng:197   Stone  :128   TA:906    Po:   1  
##  Plywood:108   Plywood:142   NA's   :  8             TA:1282  
##  CemntBd: 61   CmentBd: 60                                    
##  (Other):128   (Other):136                                    
##   Foundation  BsmtQual   BsmtCond    BsmtExposure BsmtFinType1
##  BrkTil:146   Ex  :121   Fa  :  45   Av  :221     ALQ :220    
##  CBlock:634   Fa  : 35   Gd  :  65   Gd  :134     BLQ :148    
##  PConc :647   Gd  :618   Po  :   2   Mn  :114     GLQ :418    
##  Slab  : 24   TA  :649   TA  :1311   No  :953     LwQ : 74    
##  Stone :  6   NA's: 37   NA's:  37   NA's: 38     Rec :133    
##  Wood  :  3                                       Unf :430    
##                                                   NA's: 37    
##  BsmtFinType2  Heating     HeatingQC CentralAir Electrical   KitchenQual
##  ALQ :  19    Floor:   1   Ex:741    N:  95     FuseA:  94   Ex:100     
##  BLQ :  33    GasA :1428   Fa: 49    Y:1365     FuseF:  27   Fa: 39     
##  GLQ :  14    GasW :  18   Gd:241               FuseP:   3   Gd:586     
##  LwQ :  46    Grav :   7   Po:  1               Mix  :   1   TA:735     
##  Rec :  54    OthW :   2   TA:428               SBrkr:1334              
##  Unf :1256    Wall :   4                        NA's :   1              
##  NA's:  38                                                              
##  Functional  FireplaceQu   GarageType  GarageFinish GarageQual 
##  Maj1:  14   Ex  : 24    2Types :  6   Fin :352     Ex  :   3  
##  Maj2:   5   Fa  : 33    Attchd :870   RFn :422     Fa  :  48  
##  Min1:  31   Gd  :380    Basment: 19   Unf :605     Gd  :  14  
##  Min2:  34   Po  : 20    BuiltIn: 88   NA's: 81     Po  :   3  
##  Mod :  15   TA  :313    CarPort:  9                TA  :1311  
##  Sev :   1   NA's:690    Detchd :387                NA's:  81  
##  Typ :1360               NA's   : 81                           
##  GarageCond  PavedDrive  PoolQC       Fence      MiscFeature
##  Ex  :   2   N:  90     Ex  :   2   GdPrv:  59   Gar2:   2  
##  Fa  :  35   P:  30     Fa  :   2   GdWo :  54   Othr:   2  
##  Gd  :   9   Y:1340     Gd  :   3   MnPrv: 157   Shed:  49  
##  Po  :   7              NA's:1453   MnWw :  11   TenC:   1  
##  TA  :1326                          NA's :1179   NA's:1406  
##  NA's:  81                                                  
##                                                             
##     SaleType    SaleCondition 
##  WD     :1267   Abnorml: 101  
##  New    : 122   AdjLand:   4  
##  COD    :  43   Alloca :  12  
##  ConLD  :   9   Family :  20  
##  ConLI  :   5   Normal :1198  
##  ConLw  :   5   Partial: 125  
##  (Other):   9

Bar charts of the 43 categorical variables

Correlation matrix for any THREE quantitative variables

##               LotArea OverallQual GrLivArea SalePrice
## LotArea     1.0000000   0.1058057 0.2631162 0.2638434
## OverallQual 0.1058057   1.0000000 0.5930074 0.7909816
## GrLivArea   0.2631162   0.5930074 1.0000000 0.7086245
## SalePrice   0.2638434   0.7909816 0.7086245 1.0000000

These results show a very weak positive correlation between OverallQual and LotArea (0.11), a weak positive correlation between LotArea and GrLivArea (0.26), and a moderate positive correlation between GrLivArea and OverallQual (0.59).

There is a weak correlation between SalePrice and LotArea (0.26), but a strong correlation between SalePrice and both GrLivArea (0.71) and OverallQual (0.79).

corrplot(correlations, method="square")

## 
##  Welch Two Sample t-test
## 
## data:  correlation$LotArea and correlation$SalePrice
## t = -81.321, df = 1505.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 92 percent confidence interval:
##  -174075.3 -166733.4
## sample estimates:
## mean of x mean of y 
##  10516.83 180921.20

In the house training dataset, the mean Lot Area is 10516.83 and the mean sale price of a house is 180921.196. The 92% confidence interval for the difference in means (Lot Area minus Sale Price) is between -174075.30 and -166733.40.

The p-value (< 2.2e-16) is far below 0.05, which leads us to reject the null hypothesis that the two means are equal. Note that this test compares the means of two variables measured in different units; it does not by itself test whether the correlation between Lot Area and Sale Price is 0.

## 
##  Welch Two Sample t-test
## 
## data:  correlation$GrLivArea and correlation$SalePrice
## t = -86.288, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 92 percent confidence interval:
##  -183048.2 -175763.3
## sample estimates:
##  mean of x  mean of y 
##   1515.464 180921.196

In the house training dataset, the mean Above Ground Living Area is 1515.464 and the mean sale price of a house is 180921.196. The 92% confidence interval for the difference in means (Above Ground Living Area minus Sale Price) is between -183048.20 and -175763.30.

The p-value (< 2.2e-16) is far below 0.05, so we reject the null hypothesis of equal means. A difference in means between square feet and dollars is expected and does not on its own establish a relationship between above ground living area and sale price.

## 
##  Welch Two Sample t-test
## 
## data:  correlation$OverallQual and correlation$SalePrice
## t = -87.016, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 92 percent confidence interval:
##  -184557.5 -177272.7
## sample estimates:
##    mean of x    mean of y 
## 6.099315e+00 1.809212e+05

In the house training dataset, the mean Overall Quality is 6.10 and the mean sale price of a house is 180921.196. The 92% confidence interval for the difference in means (Overall Quality minus Sale Price) is between -184557.50 and -177272.70.

The p-value (< 2.2e-16) is far below 0.05, so we again reject the null hypothesis of equal means. As a comparison of a 1 to 10 rating with a dollar amount, this result says little about how overall quality and sale price move together.
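A test of whether a correlation is 0, with a 92% confidence interval, is usually run with `cor.test` rather than `t.test`. The sketch below uses simulated data, not train.csv, so the variable names `area` and `price` and every number they contain are assumptions for illustration. It also shows a simple familywise error calculation for three tests, under the added assumption that the tests are independent.

```r
set.seed(42)

# Simulated stand-ins for two quantitative variables (not the real data).
n_obs <- 200
area  <- rnorm(n_obs, mean = 1500, sd = 500)
price <- 50000 + 100 * area + rnorm(n_obs, sd = 40000)

# H0: the true Pearson correlation is 0; report a 92% confidence interval.
ct <- cor.test(area, price, method = "pearson", conf.level = 0.92)
print(ct$estimate)
print(ct$conf.int)
print(ct$p.value)

# Familywise error for k tests at level alpha, assuming independent tests.
alpha <- 0.08
k <- 3
fwer <- 1 - (1 - alpha)^k
bonferroni_alpha <- alpha / k  # per-test level under a Bonferroni correction
print(c(fwer = fwer, bonferroni_alpha = bonferroni_alpha))
```

With the real data the call would look like `cor.test(train$GrLivArea, train$SalePrice, conf.level = 0.92)`, repeated for each pair of variables.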

Linear Algebra and Correlation

Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

##               LotArea OverallQual GrLivArea
## LotArea     1.0000000   0.1058057 0.2631162
## OverallQual 0.1058057   1.0000000 0.5930074
## GrLivArea   0.2631162   0.5930074 1.0000000
##                LotArea OverallQual  GrLivArea
## LotArea      1.0788892   0.0835766 -0.3334347
## OverallQual  0.0835766   1.5488697 -0.9404816
## GrLivArea   -0.3334347  -0.9404816  1.6454446
##                   LotArea  OverallQual    GrLivArea
## LotArea      1.000000e+00 0.000000e+00 0.000000e+00
## OverallQual -2.775558e-17 1.000000e+00 1.110223e-16
## GrLivArea    5.551115e-17 1.110223e-16 1.000000e+00
precision %*% cor2
##                   LotArea  OverallQual    GrLivArea
## LotArea      1.000000e+00 2.775558e-17 1.110223e-16
## OverallQual  0.000000e+00 1.000000e+00 1.110223e-16
## GrLivArea   -5.551115e-17 1.110223e-16 1.000000e+00
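The inversion and the two products can be reproduced from the printed correlation matrix. This sketch re-enters the correlations by hand (rounded to the seven digits shown above), so its results agree with the output only up to that rounding. The names `cor2` and `precision` follow the snippet above; `vif` is an illustrative name.

```r
vars <- c("LotArea", "OverallQual", "GrLivArea")

# Correlation matrix typed in from the output above.
cor2 <- matrix(c(1.0000000, 0.1058057, 0.2631162,
                 0.1058057, 1.0000000, 0.5930074,
                 0.2631162, 0.5930074, 1.0000000),
               nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# The precision matrix is the inverse of the correlation matrix.
precision <- solve(cor2)

# Its diagonal holds the variance inflation factors.
vif <- diag(precision)
print(round(vif, 4))

# Both products should equal the identity up to floating point error.
print(max(abs(cor2 %*% precision - diag(3))))
print(max(abs(precision %*% cor2 - diag(3))))
```

The tiny off-diagonal values such as 1.110223e-16 in the output above are floating point noise, not real departures from the identity matrix.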

Conduct LU decomposition on the matrix
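One way to carry out this step in base R is a small Doolittle-style elimination without pivoting, which is adequate for a well-conditioned correlation matrix like this one. The function name `lu_doolittle` is hypothetical and this is a sketch, not necessarily the routine used in the original analysis; packages such as Matrix offer pivoted alternatives.

```r
vars <- c("LotArea", "OverallQual", "GrLivArea")

# Correlation matrix typed in from the output above.
cor2 <- matrix(c(1.0000000, 0.1058057, 0.2631162,
                 0.1058057, 1.0000000, 0.5930074,
                 0.2631162, 0.5930074, 1.0000000),
               nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# Doolittle LU decomposition without pivoting: A = L %*% U, where L is
# unit lower triangular and U is upper triangular.
# Assumes every pivot U[k, k] is nonzero.
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # elimination multiplier
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]   # zero out the entry below the pivot
    }
  }
  dimnames(L) <- dimnames(A)
  list(L = L, U = U)
}

lu_parts <- lu_doolittle(cor2)
print(round(lu_parts$L, 4))
print(round(lu_parts$U, 4))

# Reconstruction error: should be at the level of floating point noise.
print(max(abs(lu_parts$L %*% lu_parts$U - cor2)))
```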

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. For the first variable that you selected which is skewed to the right, shift it so that the minimum value is above zero as necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

## [1] 334
##        rate 
## 0.000659864

5th and 95th percentiles using the cumulative distribution function (CDF)

##         5%        95% 
##   80.02991 4694.18765
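For reference, the percentiles of a fitted exponential distribution can also be read straight from its inverse CDF. The sketch below starts from the reported mean of GrLivArea (1515.464) and uses the fact that the maximum likelihood rate of an exponential is the reciprocal of the sample mean, which matches the rate printed above. The closed form gives roughly 77.7 and 4539.9, slightly different from the figures above, so those may instead be quantiles of the 1000 simulated draws; that reading is an assumption.

```r
# Maximum likelihood rate for an exponential fit: 1 / sample mean.
mean_grlivarea <- 1515.464  # reported mean of GrLivArea
rate_hat <- 1 / mean_grlivarea
print(rate_hat)

# 5th and 95th percentiles from the exponential quantile function.
probs <- c(0.05, 0.95)
exp_percentiles <- qexp(probs, rate = rate_hat)

# The same values from inverting the CDF F(q) = 1 - exp(-rate * q) by hand.
manual_percentiles <- -log(1 - probs) / rate_hat

print(round(exp_percentiles, 2))
print(all.equal(exp_percentiles, manual_percentiles))
```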

95% confidence interval from the empirical data, assuming normality

##    upper     mean    lower 
## 1542.440 1515.464 1488.487

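With 95% confidence, the mean of GrLivArea is between 1488.487 and 1542.440.

The exponential fit places its 5th percentile at about 80 square feet, well below the observed minimum of 334, so the exponential distribution is a poor description of GrLivArea. The empirical data are the better guide in this case.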
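The interval above can be reproduced from the summary statistics alone. This sketch takes n, the mean, and the standard deviation of GrLivArea from the descriptive table earlier in the report and assumes a t-based interval was used; with n = 1460 a normal quantile would give almost the same bounds.

```r
# Summary statistics for GrLivArea as reported in the descriptive table.
n_obs   <- 1460
x_bar   <- 1515.464
s_dev   <- 525.48

# Standard error of the mean and the 95% margin of error (t quantile).
std_err <- s_dev / sqrt(n_obs)
margin  <- qt(0.975, df = n_obs - 1) * std_err

ci <- c(lower = x_bar - margin, mean = x_bar, upper = x_bar + margin)
print(round(ci, 3))
```

This is an interval for the mean, so it is much narrower than the spread between the 5th and 95th percentiles of individual homes.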

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

I’m going to build a multiple regression model.

For this multiple linear regression example, I’ll use more than one predictor. The response variable will continue to be SalePrice, but now I will include LotArea, TotalBsmtSF, Above Ground Living Space (GrLivArea) and GarageArea as the predictor variables.

##     LotArea        TotalBsmtSF       GrLivArea      GarageArea    
##  Min.   :  1300   Min.   :   0.0   Min.   : 334   Min.   :   0.0  
##  1st Qu.:  7554   1st Qu.: 795.8   1st Qu.:1130   1st Qu.: 334.5  
##  Median :  9478   Median : 991.5   Median :1464   Median : 480.0  
##  Mean   : 10517   Mean   :1057.4   Mean   :1515   Mean   : 473.0  
##  3rd Qu.: 11602   3rd Qu.:1298.2   3rd Qu.:1777   3rd Qu.: 576.0  
##  Max.   :215245   Max.   :6110.0   Max.   :5642   Max.   :1418.0  
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000

The new dataset contains the five variables to be used in the model. The matrix plot above allows us to visualize the relationship among all variables in one single image. For example, we can see how Total Basement SQFT and Above Ground Living Space are related (see third column, second row graph).

I’ll start by fitting a linear regression on this dataset and see how well it models the observed data. I’ll add all other predictors and give each of them a separate slope coefficient.

For our multiple linear regression example, we want to solve the following equation:

SalePrice = B0 + B1 * LotArea + B2 * TotalBsmtSF + B3 * GrLivArea + B4 * GarageArea

The model will estimate the value of the intercept (B0) and each predictor’s slope: (B1) for LotArea, (B2) for TotalBsmtSF, (B3) for GrLivArea, and (B4) for GarageArea. The intercept is the expected Sale Price when every predictor equals zero. We want the model to fit a line across the observed relationship in a way that the line created is as close as possible to all data points.

## 
## Call:
## lm(formula = SalePrice ~ ., data = lmdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -654432  -19542      30   19320  277149 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.430e+04  4.045e+03  -6.006 2.39e-09 ***
## LotArea      2.042e-01  1.273e-01   1.604    0.109    
## TotalBsmtSF  4.834e+01  3.340e+00  14.474  < 2e-16 ***
## GrLivArea    6.806e+01  2.760e+00  24.656  < 2e-16 ***
## GarageArea   1.032e+02  6.831e+00  15.108  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46200 on 1455 degrees of freedom
## Multiple R-squared:  0.6627, Adjusted R-squared:  0.6618 
## F-statistic: 714.8 on 4 and 1455 DF,  p-value: < 2.2e-16

SalePrice <- -2.430e+04 + 2.042e-01 * LotArea + 4.834e+01 * TotalBsmtSF + 6.806e+01 * GrLivArea + 1.032e+02 * GarageArea
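To see the fitted equation in action, the sketch below plugs the first row of the training data into it, using the rounded coefficients printed in the model summary. The names `coefs` and `predict_price` are invented for this illustration; with the fitted model object available, `predict()` would be the usual route.

```r
# Coefficients copied (rounded) from the model summary above.
coefs <- c(intercept   = -2.430e+04,
           LotArea     =  2.042e-01,
           TotalBsmtSF =  4.834e+01,
           GrLivArea   =  6.806e+01,
           GarageArea  =  1.032e+02)

# Hypothetical helper that evaluates the fitted equation by hand.
predict_price <- function(LotArea, TotalBsmtSF, GrLivArea, GarageArea) {
  coefs[["intercept"]] +
    coefs[["LotArea"]]     * LotArea +
    coefs[["TotalBsmtSF"]] * TotalBsmtSF +
    coefs[["GrLivArea"]]   * GrLivArea +
    coefs[["GarageArea"]]  * GarageArea
}

# First row of the training data as shown at the top of the report.
first_house <- predict_price(LotArea = 8450, TotalBsmtSF = 856,
                             GrLivArea = 1710, GarageArea = 548)
actual_price <- 208500

print(first_house)
print(actual_price - first_house)  # residual for this house
```

The hand-computed prediction differs slightly from what `predict()` would return because the coefficients here are rounded to four significant digits.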

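All four slope estimates are positive, so holding the other predictors fixed, an increase in any one of them is associated with a higher SalePrice. For example, each additional square foot of GrLivArea is associated with about $68 more in SalePrice.

In this model, the multiple R-squared is 0.6627 and the adjusted R-squared is 0.6618.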

corr2 = cor(lmdata)
corrplot(corr2, method = "number")

Notice that the correlation between LotArea and GarageArea is very low at 0.18, so Garage Area does not simply track Lot Area. Lot Area’s high p-value (0.109) means that, once the other predictors are in the model, there is no evidence that Lot Area is associated with Sale Price, while the other three predictors remain strongly related to it.

The F-Statistic value from our model is 714.8 on 4 and 1455 degrees of freedom. So assuming that the number of data points is appropriate and given that the p-value is very low (< 2.2e-16), we have strong evidence that at least one of the predictors is associated with SalePrice.

plot(model1, pch=16, which=1)

Given that we have indications that at least one of the predictors is associated with SalePrice, and based on the fact that LotArea here has a high p-value, we can consider removing LotArea from the model and see how the model fit changes (we are not going to run a variable selection procedure such as forward, backward or mixed selection in this example):

## 
## Call:
## lm(formula = SalePrice ~ TotalBsmtSF + GrLivArea + GarageArea, 
##     data = lmdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -650577  -19502    -128   19408  276301 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24105.471   4045.540  -5.959 3.19e-09 ***
## TotalBsmtSF     49.146      3.303  14.878  < 2e-16 ***
## GrLivArea       68.751      2.728  25.203  < 2e-16 ***
## GarageArea     103.321      6.835  15.117  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46220 on 1456 degrees of freedom
## Multiple R-squared:  0.6621, Adjusted R-squared:  0.6614 
## F-statistic: 951.1 on 3 and 1456 DF,  p-value: < 2.2e-16

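Excluding LotArea raises the F-statistic from 714.8 to 951.1, but the fit itself is essentially unchanged: the adjusted R-squared moves from 0.6618 to 0.6614 and the residual standard error from 46200 to 46220. The limited fit is possibly due to the presence of outlier points in the data.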
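A partial F-test is one common way to check formally whether dropping a predictor changes the fit. The sketch below uses simulated data, since `lmdata` is not reproduced here; the column names `x1`, `x2`, `noise_var` and all numeric values are made up for illustration.

```r
set.seed(42)

# Simulated data: y depends on x1 and x2 but not on noise_var
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise_var = rnorm(n))
sim$y <- 3 + 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + noise_var, data = sim)
reduced <- lm(y ~ x1 + x2, data = sim)

# Partial F-test comparing the nested models
cmp <- anova(reduced, full)
cmp

# When a single term is dropped, the partial F equals the squared
# t value of that term in the full model
t_noise <- summary(full)$coefficients["noise_var", "t value"]
all.equal(cmp$F[2], t_noise^2)
```

By the same identity, the t value of 1.604 for LotArea in the first housing model corresponds to a partial F of about 2.57 for dropping it, with the same p-value of 0.109.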

plot(model2, pch=16, which=1)

Note that the residuals plot for this last model still shows some points lying far from zero, including the minimum residual of -650577 reported in the summary.
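Points that sit far from the rest in a residuals plot can also be listed numerically. The following is a minimal sketch on simulated data with one planted outlier; the cutoff of 3 standardized residuals is a common rule of thumb rather than something the original analysis used, and the data frame `sim` is hypothetical.

```r
set.seed(1)

# Simulated house data with a linear price trend
n <- 100
sim <- data.frame(area = runif(n, 500, 3000))
sim$price <- 20000 + 80 * sim$area + rnorm(n, sd = 15000)

# Plant one artificial outlier: a house sold far below the trend
sim$price[1] <- sim$price[1] - 300000

fit <- lm(price ~ area, data = sim)

# Standardized residuals put every point on a comparable scale
std_res <- rstandard(fit)
flagged <- which(abs(std_res) > 3)

# Rows that deserve a closer look before refitting
sim[flagged, ]
```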

Let’s apply a logarithmic transformation to the SalePrice variable with the log function (log uses the natural logarithm; if base 10 is desired, log10 is the function to use). I’ll apply this transformation directly in the model formula and see what happens with both the model fit and the model accuracy.

## 
## Call:
## lm(formula = log(SalePrice) ~ TotalBsmtSF + GrLivArea + GarageArea, 
##     data = lmdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.15585 -0.09757  0.03202  0.13550  0.74485 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.099e+01  2.007e-02  547.64   <2e-16 ***
## TotalBsmtSF 2.356e-04  1.639e-05   14.37   <2e-16 ***
## GrLivArea   3.284e-04  1.353e-05   24.27   <2e-16 ***
## GarageArea  6.022e-04  3.391e-05   17.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2293 on 1456 degrees of freedom
## Multiple R-squared:  0.671,  Adjusted R-squared:  0.6703 
## F-statistic:   990 on 3 and 1456 DF,  p-value: < 2.2e-16

SalePrice <- exp(1.099e+01 + 2.356e-04 * TotalBsmtSF + 3.284e-04 * GrLivArea + 6.022e-04 * GarageArea)
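Because this model is fit to log(SalePrice), the linear part of the equation yields a log price, and `exp()` converts it back to dollars. The sketch below does this for the first training row (TotalBsmtSF 856, GrLivArea 1710, GarageArea 548) using the rounded coefficients from the summary; the variable names are illustrative only.

```r
# Rounded coefficients from the log(SalePrice) model summary
log_coefs <- c(intercept   = 1.099e+01,
               TotalBsmtSF = 2.356e-04,
               GrLivArea   = 3.284e-04,
               GarageArea  = 6.022e-04)

# Linear predictor on the log scale for the first training row
log_pred <- unname(log_coefs["intercept"] +
                     log_coefs["TotalBsmtSF"] * 856 +
                     log_coefs["GrLivArea"]   * 1710 +
                     log_coefs["GarageArea"]  * 548)

# Back-transform to the original price scale
price_pred <- exp(log_pred)
price_pred

# In a log-linear model each slope is a multiplicative effect:
# approximate percentage change in price per extra square foot
pct_per_sqft <- (exp(log_coefs[-1]) - 1) * 100
pct_per_sqft
```

Forgetting the `exp()` step leaves predictions on the log scale, around 12 rather than in the hundreds of thousands.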

plot(model3, pch=16, which=1)

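The F-statistic of 990 is the highest of the three models. A large F value with a p-value this small means the data do not support the null hypothesis that all slope coefficients are zero. In other words, at least one predictor is associated with log(SalePrice).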

Load the test data and try out the equation from model 1 on it.

##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1  1         60       RL          65    8450   Pave  <NA>      Reg
## 2  2         20       RL          80    9600   Pave  <NA>      Reg
## 3  3         60       RL          68   11250   Pave  <NA>      IR1
## 4  4         70       RL          60    9550   Pave  <NA>      IR1
## 5  5         60       RL          84   14260   Pave  <NA>      IR1
## 6  6         50       RL          85   14115   Pave  <NA>      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 2         Lvl    AllPub       FR2       Gtl      Veenker      Feedr
## 3         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 4         Lvl    AllPub    Corner       Gtl      Crawfor       Norm
## 5         Lvl    AllPub       FR2       Gtl      NoRidge       Norm
## 6         Lvl    AllPub    Inside       Gtl      Mitchel       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm     1Fam     2Story           7           5      2003
## 2       Norm     1Fam     1Story           6           8      1976
## 3       Norm     1Fam     2Story           7           5      2001
## 4       Norm     1Fam     2Story           7           5      1915
## 5       Norm     1Fam     2Story           8           5      2000
## 6       Norm     1Fam     1.5Fin           5           5      1993
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         2003     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 2         1976     Gable  CompShg     MetalSd     MetalSd       None
## 3         2002     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 4         1970     Gable  CompShg     Wd Sdng     Wd Shng       None
## 5         2000     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 6         1995     Gable  CompShg     VinylSd     VinylSd       None
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1        196        Gd        TA      PConc       Gd       TA           No
## 2          0        TA        TA     CBlock       Gd       TA           Gd
## 3        162        Gd        TA      PConc       Gd       TA           Mn
## 4          0        TA        TA     BrkTil       TA       Gd           No
## 5        350        Gd        TA      PConc       Gd       TA           Av
## 6          0        TA        TA       Wood       Gd       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1          GLQ        706          Unf          0       150         856
## 2          ALQ        978          Unf          0       284        1262
## 3          GLQ        486          Unf          0       434         920
## 4          ALQ        216          Unf          0       540         756
## 5          GLQ        655          Unf          0       490        1145
## 6          GLQ        732          Unf          0        64         796
##   Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1    GasA        Ex          Y      SBrkr       856       854            0
## 2    GasA        Ex          Y      SBrkr      1262         0            0
## 3    GasA        Ex          Y      SBrkr       920       866            0
## 4    GasA        Gd          Y      SBrkr       961       756            0
## 5    GasA        Ex          Y      SBrkr      1145      1053            0
## 6    GasA        Ex          Y      SBrkr       796       566            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1      1710            1            0        2        1            3
## 2      1262            0            1        2        0            3
## 3      1786            1            0        2        1            3
## 4      1717            1            0        1        0            3
## 5      2198            1            0        2        1            4
## 6      1362            1            0        1        1            1
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            1          Gd            8        Typ          0        <NA>
## 2            1          TA            6        Typ          1          TA
## 3            1          Gd            6        Typ          1          TA
## 4            1          Gd            7        Typ          1          Gd
## 5            1          Gd            9        Typ          1          TA
## 6            1          TA            5        Typ          0        <NA>
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1     Attchd        2003          RFn          2        548         TA
## 2     Attchd        1976          RFn          2        460         TA
## 3     Attchd        2001          RFn          2        608         TA
## 4     Detchd        1998          Unf          3        642         TA
## 5     Attchd        2000          RFn          3        836         TA
## 6     Attchd        1993          Unf          2        480         TA
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1         TA          Y          0          61             0          0
## 2         TA          Y        298           0             0          0
## 3         TA          Y          0          42             0          0
## 4         TA          Y          0          35           272          0
## 5         TA          Y        192          84             0          0
## 6         TA          Y         40          30             0        320
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1           0        0   <NA>  <NA>        <NA>       0      2   2008
## 2           0        0   <NA>  <NA>        <NA>       0      5   2007
## 3           0        0   <NA>  <NA>        <NA>       0      9   2008
## 4           0        0   <NA>  <NA>        <NA>       0      2   2006
## 5           0        0   <NA>  <NA>        <NA>       0     12   2008
## 6           0        0   <NA> MnPrv        Shed     700     10   2009
##   SaleType SaleCondition SalePrice
## 1       WD        Normal    208500
## 2       WD        Normal    181500
## 3       WD        Normal    223500
## 4       WD       Abnorml    140000
## 5       WD        Normal    250000
## 6       WD        Normal    143000
## logical(0)
## [1] NA
## logical(0)
## [1] NA
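One possible cause of `NA` results like those above is missing predictor values in the new data; this is an assumption, since the code that produced them is not shown. The sketch below illustrates the general pattern on simulated data: fit on a training set, predict on new rows, fill missing predictors with the training median, and back-transform log predictions. The data frames `train_sim` and `test_sim`, and every value in them, are hypothetical.

```r
set.seed(7)

# Simulated training data with a log-linear price relationship
train_sim <- data.frame(GrLivArea  = runif(50, 800, 3000),
                        GarageArea = runif(50, 0, 900))
train_sim$SalePrice <- exp(11 + 3e-4 * train_sim$GrLivArea +
                             6e-4 * train_sim$GarageArea +
                             rnorm(50, sd = 0.1))

# Simulated new data with one missing predictor value
test_sim <- data.frame(Id         = 1:3,
                       GrLivArea  = c(1500, 2000, 1200),
                       GarageArea = c(400, NA, 600))

fit <- lm(log(SalePrice) ~ GrLivArea + GarageArea, data = train_sim)

# predict() returns NA for any row with a missing predictor
raw_pred <- predict(fit, newdata = test_sim)

# Simple fallback: replace missing values with the training median
filled <- test_sim
filled$GarageArea[is.na(filled$GarageArea)] <- median(train_sim$GarageArea)

# Back-transform from the log scale before building the submission
submission <- data.frame(Id        = filled$Id,
                         SalePrice = exp(predict(fit, newdata = filled)))
submission
```

Median imputation is only one simple choice; checking `colSums(is.na(test_sim))` first shows which columns need attention.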

Score from Kaggle: 9.45827 Screen Name: nataliemollaghan