library(ggplot2)
library(corrplot)
## corrplot 0.84 loaded
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(matrixcalc)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(tidyverse)
## -- Attaching packages ------------------------ tidyverse 1.3.0 --
## v tibble  3.0.1     v purrr   0.3.4
## v tidyr   1.0.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts --------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x MASS::select()  masks dplyr::select()

Computational Mathematics

Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.

Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of μ = σ = (N+1)/2.

Probability.

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

a. P(X>x | X>y) b. P(X>x, Y>y) c. P(X<x | X>y)
## [1] 40
##      25% 
## 13.05264
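
The code behind these two values is not shown in the report. A minimal sketch of the setup could look like the following, assuming N = 80 (a guess suggested by the median of 40 above, not stated anywhere in the report) and an arbitrary seed:

```r
set.seed(123)   # arbitrary seed, for reproducibility only
N <- 80         # assumed value; any N >= 6 satisfies the problem statement
n <- 10000

# X: uniform on [1, N]; Y: normal with mu = sigma = (N + 1) / 2
X <- runif(n, min = 1, max = N)
Y <- rnorm(n, mean = (N + 1) / 2, sd = (N + 1) / 2)

x <- median(X)                    # small x: median of X
y <- unname(quantile(Y, 0.25))    # small y: first quartile of Y

c(x = x, y = y)
```

With a different seed or N the two thresholds will differ from the values printed above.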

a. P(X>x | X>y)

## [1] 0.59

Given that X exceeds the first quartile of Y (y ≈ 13.05), the probability that X also exceeds its own median (x = 40) is about 0.59.

b. P(X>x, Y>y)
## [1] 0.37

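The joint probability that X exceeds its median and Y exceeds its first quartile is about 0.37, close to the 0.5 × 0.75 = 0.375 expected if the two variables are independent.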

c. P(X<x | X>y)
## [1] 0.39

Given that X exceeds the first quartile of Y, the probability that X falls below its own median is about 0.39.
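
Probabilities a through c can be estimated as relative frequencies in the simulated sample. The sketch below assumes a hypothetical N = 80 and an arbitrary seed, so its results will only approximate the values reported above:

```r
set.seed(123)   # arbitrary seed
N <- 80         # assumed value, not given in the report
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)
y <- unname(quantile(Y, 0.25))

# a. P(X > x | X > y) = P(X > x and X > y) / P(X > y)
p_a <- mean(X > x & X > y) / mean(X > y)

# b. P(X > x, Y > y): both events occur together
p_b <- mean(X > x & Y > y)

# c. P(X < x | X > y) = P(X < x and X > y) / P(X > y)
p_c <- mean(X < x & X > y) / mean(X > y)

round(c(a = p_a, b = p_b, c = p_c), 2)
```

Because `mean()` of a logical vector is the share of `TRUE` values, each conditional probability is just a ratio of two such shares.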

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

##             X>x       X<x     Total
## Y<y   0.1293653 0.1207612 0.2501265
## Y>y   0.3751392 0.3747343 0.7498735
## Total 0.5045045 0.4954955 1.0000000

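The Total row and column contain the marginal probabilities.

The joint probabilities are the four inner cells of the table: P(X>x, Y<y), P(X>x, Y>y), P(X<x, Y<y), and P(X<x, Y>y). For example, P(X>x)P(Y>y) = 0.5045 × 0.7499 ≈ 0.378, which is close to the joint value P(X>x, Y>y) ≈ 0.375 and therefore consistent with independence.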
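
A table of joint and marginal probabilities like the one above can be built with base R's `table()`, `prop.table()` and `addmargins()`. This is a sketch on freshly simulated data (N = 80 and the seed are assumptions), not the report's own code:

```r
set.seed(123)   # arbitrary seed
N <- 80         # assumed value, not given in the report
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)
y <- unname(quantile(Y, 0.25))

# Cross-tabulate the two events and convert counts to proportions
counts <- table(Y = ifelse(Y > y, "Y>y", "Y<=y"),
                X = ifelse(X > x, "X>x", "X<=x"))
joint <- prop.table(counts)          # joint probabilities (inner cells)
with_margins <- addmargins(joint)    # adds the marginal "Sum" row and column

# Independence would imply joint = product of the two marginals
p_joint   <- joint["Y>y", "X>x"]
p_product <- sum(joint[, "X>x"]) * sum(joint["Y>y", ])
c(joint = p_joint, product = p_product)
```

If the two numbers in the last line are close, the sample gives no reason to doubt independence; a formal test is still needed to quantify that.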

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Fisher’s Exact Test computes an exact p-value and is typically used when cell counts are small (expected counts below 5). The Chi-Square Test relies on a large-sample approximation and suits tables with large cell counts, so it is the more appropriate test in this scenario.

## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  matrix_df
## p-value = 0.6937
## alternative hypothesis: two.sided
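
Both tests can be run on the same 2 x 2 table of raw counts. The sketch below uses simulated data with an assumed N = 80 and an arbitrary seed; the object `counts` is a hypothetical name and is not the `matrix_df` object shown in the output above:

```r
set.seed(123)   # arbitrary seed
N <- 80         # assumed value, not given in the report
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)

# 2 x 2 table of raw counts (the tests need counts, not proportions or totals)
counts <- table(X > median(X), Y > quantile(Y, 0.25))

# Exact p-value based on the hypergeometric distribution
fisher_res <- fisher.test(counts)

# Large-sample approximation (Yates continuity correction by default for 2 x 2)
chisq_res <- chisq.test(counts)

c(fisher = fisher_res$p.value, chisq = chisq_res$p.value)
```

A p-value above the chosen significance level means the null hypothesis of independence is not rejected. With thousands of observations per cell the two tests usually lead to the same conclusion.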

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

Descriptive and Inferential Statistics.

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000
##     Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1461         20       RH          80   11622   Pave  <NA>      Reg
## 2 1462         20       RL          81   14267   Pave  <NA>      IR1
## 3 1463         60       RL          74   13830   Pave  <NA>      IR1
## 4 1464         60       RL          78    9978   Pave  <NA>      IR1
## 5 1465        120       RL          43    5005   Pave  <NA>      IR1
## 6 1466         60       RL          75   10000   Pave  <NA>      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
## 1         Lvl    AllPub    Inside       Gtl        NAmes      Feedr       Norm
## 2         Lvl    AllPub    Corner       Gtl        NAmes       Norm       Norm
## 3         Lvl    AllPub    Inside       Gtl      Gilbert       Norm       Norm
## 4         Lvl    AllPub    Inside       Gtl      Gilbert       Norm       Norm
## 5         HLS    AllPub    Inside       Gtl      StoneBr       Norm       Norm
## 6         Lvl    AllPub    Corner       Gtl      Gilbert       Norm       Norm
##   BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
## 1     1Fam     1Story           5           6      1961         1961     Gable
## 2     1Fam     1Story           6           6      1958         1958       Hip
## 3     1Fam     2Story           5           5      1997         1998     Gable
## 4     1Fam     2Story           6           6      1998         1998     Gable
## 5   TwnhsE     1Story           8           5      1992         1992     Gable
## 6     1Fam     2Story           6           5      1993         1994     Gable
##   RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
## 1  CompShg     VinylSd     VinylSd       None          0        TA        TA
## 2  CompShg     Wd Sdng     Wd Sdng    BrkFace        108        TA        TA
## 3  CompShg     VinylSd     VinylSd       None          0        TA        TA
## 4  CompShg     VinylSd     VinylSd    BrkFace         20        TA        TA
## 5  CompShg     HdBoard     HdBoard       None          0        Gd        TA
## 6  CompShg     HdBoard     HdBoard       None          0        TA        TA
##   Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 1     CBlock       TA       TA           No          Rec        468
## 2     CBlock       TA       TA           No          ALQ        923
## 3      PConc       Gd       TA           No          GLQ        791
## 4      PConc       TA       TA           No          GLQ        602
## 5      PConc       Gd       TA           No          ALQ        263
## 6      PConc       Gd       TA           No          Unf          0
##   BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## 1          LwQ        144       270         882    GasA        TA          Y
## 2          Unf          0       406        1329    GasA        TA          Y
## 3          Unf          0       137         928    GasA        Gd          Y
## 4          Unf          0       324         926    GasA        Ex          Y
## 5          Unf          0      1017        1280    GasA        Ex          Y
## 6          Unf          0       763         763    GasA        Gd          Y
##   Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## 1      SBrkr       896         0            0       896            0
## 2      SBrkr      1329         0            0      1329            0
## 3      SBrkr       928       701            0      1629            0
## 4      SBrkr       926       678            0      1604            0
## 5      SBrkr      1280         0            0      1280            0
## 6      SBrkr       763       892            0      1655            0
##   BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## 1            0        1        0            2            1          TA
## 2            0        1        1            3            1          Gd
## 3            0        2        1            3            1          TA
## 4            0        2        1            3            1          Gd
## 5            0        2        0            2            1          Gd
## 6            0        2        1            3            1          TA
##   TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 1            5        Typ          0        <NA>     Attchd        1961
## 2            6        Typ          0        <NA>     Attchd        1958
## 3            6        Typ          1          TA     Attchd        1997
## 4            7        Typ          1          Gd     Attchd        1998
## 5            5        Typ          0        <NA>     Attchd        1992
## 6            7        Typ          1          TA     Attchd        1993
##   GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
## 1          Unf          1        730         TA         TA          Y
## 2          Unf          1        312         TA         TA          Y
## 3          Fin          2        482         TA         TA          Y
## 4          Fin          2        470         TA         TA          Y
## 5          RFn          2        506         TA         TA          Y
## 6          Fin          2        440         TA         TA          Y
##   WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1        140           0             0          0         120        0   <NA>
## 2        393          36             0          0           0        0   <NA>
## 3        212          34             0          0           0        0   <NA>
## 4        360          36             0          0           0        0   <NA>
## 5          0          82             0          0         144        0   <NA>
## 6        157          84             0          0           0        0   <NA>
##   Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
## 1 MnPrv        <NA>       0      6   2010       WD        Normal
## 2  <NA>        Gar2   12500      6   2010       WD        Normal
## 3 MnPrv        <NA>       0      3   2010       WD        Normal
## 4  <NA>        <NA>       0      6   2010       WD        Normal
## 5  <NA>        <NA>       0      1   2010       WD        Normal
## 6  <NA>        <NA>       0      4   2010       WD        Normal
## [1] 180921.2
## [1] 163000
## [1] 79442.5

## 
## Call:
## lm(formula = training_dataset$SalePrice ~ training_dataset$BldgType)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -150864  -50764  -14701   31861  569236 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       185764       2238  83.009  < 2e-16 ***
## training_dataset$BldgType2fmCon   -57332      14216  -4.033 5.80e-05 ***
## training_dataset$BldgTypeDuplex   -52223      11068  -4.718 2.61e-06 ***
## training_dataset$BldgTypeTwnhs    -49852      12128  -4.110 4.17e-05 ***
## training_dataset$BldgTypeTwnhsE    -3804       7655  -0.497    0.619    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78170 on 1455 degrees of freedom
## Multiple R-squared:  0.03453,    Adjusted R-squared:  0.03188 
## F-statistic: 13.01 on 4 and 1455 DF,  p-value: 2.057e-10

##             SalePrice GrLivArea LotArea YearBuilt GarageArea FullBath
## SalePrice      1.0000    0.7086  0.2638    0.5229     0.6234   0.5607
## GrLivArea      0.7086    1.0000  0.2631    0.1990     0.4690   0.6300
## LotArea        0.2638    0.2631  1.0000    0.0142     0.1804   0.1260
## YearBuilt      0.5229    0.1990  0.0142    1.0000     0.4790   0.4683
## GarageArea     0.6234    0.4690  0.1804    0.4790     1.0000   0.4057
## FullBath       0.5607    0.6300  0.1260    0.4683     0.4057   1.0000
## OverallQual    0.7910    0.5930  0.1058    0.5723     0.5620   0.5506
##             OverallQual
## SalePrice        0.7910
## GrLivArea        0.5930
## LotArea          0.1058
## YearBuilt        0.5723
## GarageArea       0.5620
## FullBath         0.5506
## OverallQual      1.0000

## 
##  Pearson's product-moment correlation
## 
## data:  correlation_data$SalePrice and correlation_data$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245

We are 80% confident that the correlation between these two variables is between 0.6915087 and 0.7249450.

## 
##  Pearson's product-moment correlation
## 
## data:  correlation_data$SalePrice and correlation_data$YearBuilt
## t = 23.424, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4980766 0.5468619
## sample estimates:
##       cor 
## 0.5228973

We are 80% confident that the correlation between these two variables is between 0.4980766 and 0.5468619.

## 
##  Pearson's product-moment correlation
## 
## data:  correlation_data$SalePrice and correlation_data$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7780752 0.8032204
## sample estimates:
##       cor 
## 0.7909816

We are 80% confident that the correlation between these two variables is between 0.7780752 and 0.8032204.

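Testing three pairs at the 0.20 level implied by the 80% confidence intervals would ordinarily raise the familywise error rate (about 1 − 0.8^3 ≈ 0.49 if the tests were independent). Here, however, all three p-values are below 2.2e-16 and would remain significant even after a Bonferroni correction, so we would not be worried about familywise error.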
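
The pairwise tests and a familywise adjustment can be sketched as follows. The data frame `demo` and its columns are a synthetic stand-in invented for illustration; they are not the Kaggle data, and the numbers will not match the output above:

```r
set.seed(42)   # arbitrary seed
# Synthetic stand-in for three correlated numeric columns (hypothetical names)
n <- 500
quality <- rnorm(n)
area    <- 0.6 * quality + rnorm(n, sd = 0.8)
price   <- 0.5 * quality + 0.4 * area + rnorm(n, sd = 0.5)
demo <- data.frame(price, area, quality)

# Every unordered pair of columns
pairs_to_test <- combn(names(demo), 2, simplify = FALSE)

# Pearson test with an 80% confidence interval for each pair
results <- do.call(rbind, lapply(pairs_to_test, function(p) {
  ct <- cor.test(demo[[p[1]]], demo[[p[2]]], conf.level = 0.80)
  data.frame(var1 = p[1], var2 = p[2],
             r = unname(ct$estimate),
             lower = ct$conf.int[1], upper = ct$conf.int[2],
             p_value = ct$p.value)
}))

# Bonferroni adjustment controls the familywise error rate across the tests
results$p_bonferroni <- p.adjust(results$p_value, method = "bonferroni")
results
```

If the adjusted p-values stay far below the significance level, the conclusions do not depend on whether a correction is applied.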

Linear Algebra and Correlation.

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

##             SalePrice GrLivArea LotArea YearBuilt GarageArea FullBath
## SalePrice      4.2107   -1.5406 -0.4289   -0.6971    -0.5858   0.1904
## GrLivArea     -1.5406    3.0440 -0.1666    1.1284    -0.2560  -1.2455
## LotArea       -0.4289   -0.1666  1.1362    0.1012    -0.0829   0.0284
## YearBuilt     -0.6971    1.1284  0.1012    2.0790    -0.4289  -0.7739
## GarageArea    -0.5858   -0.2560 -0.0829   -0.4289     1.7684   0.0748
## FullBath       0.1904   -1.2455  0.0284   -0.7739     0.0748   2.1003
## OverallQual   -1.7484   -0.3850  0.2909   -0.6512    -0.1656  -0.1706
##             OverallQual
## SalePrice       -1.7484
## GrLivArea       -0.3850
## LotArea          0.2909
## YearBuilt       -0.6512
## GarageArea      -0.1656
## FullBath        -0.1706
## OverallQual      3.1402
##             SalePrice GrLivArea LotArea YearBuilt GarageArea FullBath
## SalePrice           1         0       0         0          0        0
## GrLivArea           0         1       0         0          0        0
## LotArea             0         0       1         0          0        0
## YearBuilt           0         0       0         1          0        0
## GarageArea          0         0       0         0          1        0
## FullBath            0         0       0         0          0        1
## OverallQual         0         0       0         0          0        0
##             OverallQual
## SalePrice             0
## GrLivArea             0
## LotArea               0
## YearBuilt             0
## GarageArea            0
## FullBath              0
## OverallQual           1
##             SalePrice GrLivArea LotArea YearBuilt GarageArea FullBath
## SalePrice           1         0       0         0          0        0
## GrLivArea           0         1       0         0          0        0
## LotArea             0         0       1         0          0        0
## YearBuilt           0         0       0         1          0        0
## GarageArea          0         0       0         0          1        0
## FullBath            0         0       0         0          0        1
## OverallQual         0         0       0         0          0        0
##             OverallQual
## SalePrice             0
## GrLivArea             0
## LotArea               0
## YearBuilt             0
## GarageArea            0
## FullBath              0
## OverallQual           1
##        [,1]        [,2]        [,3]      [,4]        [,5]       [,6] [,7]
## [1,] 1.0000  0.00000000  0.00000000 0.0000000  0.00000000 0.00000000    0
## [2,] 0.7086  1.00000000  0.00000000 0.0000000  0.00000000 0.00000000    0
## [3,] 0.2638  0.15298947  1.00000000 0.0000000  0.00000000 0.00000000    0
## [4,] 0.5229 -0.34451044 -0.10612087 1.0000000  0.00000000 0.00000000    0
## [5,] 0.6234  0.05474899  0.01281817 0.2490577  1.00000000 0.00000000    0
## [6,] 0.5607  0.46735189 -0.06259710 0.3791760 -0.03146122 1.00000000    0
## [7,] 0.7910  0.06527076 -0.11737343 0.2411038  0.05102839 0.05433795    1
##      [,1]     [,2]          [,3]       [,4]       [,5]        [,6]        [,7]
## [1,]    1 0.708600  2.638000e-01  0.5229000 0.62340000  0.56070000  0.79100000
## [2,]    0 0.497886  7.617132e-02 -0.1715269 0.02725876  0.23268798  0.03249740
## [3,]    0 0.000000  9.187562e-01 -0.0974992 0.01177678 -0.05751147 -0.10783756
## [4,]    0 0.000000  0.000000e+00  0.6571361 0.16366483  0.24917024  0.15843798
## [5,]    0 0.000000  0.000000e+00  0.0000000 0.56896710 -0.01790040  0.02903347
## [6,]    0 0.000000 -6.938894e-18  0.0000000 0.00000000  0.47822574  0.02598581
## [7,]    0 0.000000  3.770453e-19  0.0000000 0.00000000  0.00000000  0.31844707
##        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]
## [1,] 1.0000 0.7086 0.2638 0.5229 0.6234 0.5607 0.7910
## [2,] 0.7086 1.0000 0.2631 0.1990 0.4690 0.6300 0.5930
## [3,] 0.2638 0.2631 1.0000 0.0142 0.1804 0.1260 0.1058
## [4,] 0.5229 0.1990 0.0142 1.0000 0.4790 0.4683 0.5723
## [5,] 0.6234 0.4690 0.1804 0.4790 1.0000 0.4057 0.5620
## [6,] 0.5607 0.6300 0.1260 0.4683 0.4057 1.0000 0.5506
## [7,] 0.7910 0.5930 0.1058 0.5723 0.5620 0.5506 1.0000
##             SalePrice GrLivArea LotArea YearBuilt GarageArea FullBath
## SalePrice      1.0000    0.7086  0.2638    0.5229     0.6234   0.5607
## GrLivArea      0.7086    1.0000  0.2631    0.1990     0.4690   0.6300
## LotArea        0.2638    0.2631  1.0000    0.0142     0.1804   0.1260
## YearBuilt      0.5229    0.1990  0.0142    1.0000     0.4790   0.4683
## GarageArea     0.6234    0.4690  0.1804    0.4790     1.0000   0.4057
## FullBath       0.5607    0.6300  0.1260    0.4683     0.4057   1.0000
## OverallQual    0.7910    0.5930  0.1058    0.5723     0.5620   0.5506
##             OverallQual
## SalePrice        0.7910
## GrLivArea        0.5930
## LotArea          0.1058
## YearBuilt        0.5723
## GarageArea       0.5620
## FullBath         0.5506
## OverallQual      1.0000
##             SalePrice GrLivArea LotArea YearBuilt GarageArea FullBath
## SalePrice        TRUE      TRUE    TRUE      TRUE       TRUE     TRUE
## GrLivArea        TRUE      TRUE    TRUE      TRUE       TRUE     TRUE
## LotArea          TRUE      TRUE    TRUE      TRUE       TRUE     TRUE
## YearBuilt        TRUE      TRUE    TRUE      TRUE       TRUE     TRUE
## GarageArea       TRUE      TRUE    TRUE      TRUE       TRUE     TRUE
## FullBath         TRUE      TRUE    TRUE      TRUE       TRUE     TRUE
## OverallQual      TRUE      TRUE    TRUE      TRUE       TRUE     TRUE
##             OverallQual
## SalePrice          TRUE
## GrLivArea          TRUE
## LotArea            TRUE
## YearBuilt          TRUE
## GarageArea         TRUE
## FullBath           TRUE
## OverallQual        TRUE
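
The steps in this section can be sketched on a 3 x 3 sub-matrix of the correlation matrix reported above. The report's own code is not shown; the function `lu_doolittle` below is a hypothetical base-R helper written for illustration (the matrixcalc package loaded at the top provides `lu.decomposition()` for the same purpose):

```r
# 3 x 3 sub-matrix of the correlation matrix reported above
vars <- c("SalePrice", "GrLivArea", "YearBuilt")
R <- matrix(c(1.0000, 0.7086, 0.5229,
              0.7086, 1.0000, 0.1990,
              0.5229, 0.1990, 1.0000),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

P <- solve(R)         # precision matrix; its diagonal holds the VIFs
round(R %*% P, 10)    # identity, up to floating-point error
round(P %*% R, 10)    # identity again

# Doolittle LU factorization without pivoting: A = L %*% U
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)        # unit lower-triangular factor
  U <- A              # reduced in place to upper-triangular form
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # multiplier that zeroes U[i, k]
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]   # row elimination
    }
  }
  list(L = L, U = U)
}

lu <- lu_doolittle(unname(R))
all.equal(lu$L %*% lu$U, unname(R))   # checks that L %*% U reproduces R
```

Skipping pivoting is safe here only because a correlation matrix of non-collinear variables is positive definite, so every pivot `U[k, k]` is strictly positive.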

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

## $breaks
##  [1]    0  500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000
## 
## $counts
##  [1]   3 228 554 461 144  52  12   2   2   1   0   1
## 
## $density
##  [1] 4.109589e-06 3.123288e-04 7.589041e-04 6.315068e-04 1.972603e-04
##  [6] 7.123288e-05 1.643836e-05 2.739726e-06 2.739726e-06 1.369863e-06
## [11] 0.000000e+00 1.369863e-06
## 
## $mids
##  [1]  250  750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750
## 
## $xname
## [1] "training_dataset$GrLivArea"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

##         5%        95% 
##   72.18889 4289.87669
##     5%    95% 
##  848.0 2466.1
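
The exponential fit and the three kinds of interval can be sketched as below. The vector `living_area` is a synthetic right-skewed stand-in, not the Kaggle variable, so the numbers will not match the output above. The rate is computed in base R as 1 / mean, which is the maximum-likelihood estimate that `MASS::fitdistr(living_area, "exponential")` also reports:

```r
set.seed(7)   # arbitrary seed
# Synthetic right-skewed stand-in for the chosen variable (hypothetical data)
living_area <- rgamma(1460, shape = 9, rate = 0.006)

# Maximum-likelihood estimate of the exponential rate
lambda <- 1 / mean(living_area)

# 1000 draws from the fitted exponential distribution
samples <- rexp(1000, rate = lambda)

# 5th and 95th percentiles from the exponential CDF, F(q) = 1 - exp(-lambda * q)
exp_pct <- qexp(c(0.05, 0.95), rate = lambda)

# 95% confidence interval for the mean, assuming normality
n <- length(living_area)
ci_mean <- mean(living_area) + c(-1, 1) * qnorm(0.975) * sd(living_area) / sqrt(n)

# Empirical 5th and 95th percentiles of the data
emp_pct <- quantile(living_area, c(0.05, 0.95))

list(lambda = lambda, exponential = exp_pct, ci_mean = ci_mean, empirical = emp_pct)
```

Calling `hist(living_area)` and `hist(samples)` side by side gives the visual comparison. A large gap between the exponential and empirical percentiles suggests the exponential shape fits the variable poorly.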

Modeling.

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

We will analyze a dataset to see which factors play a part in predicting the cost of real estate. The factors are the transaction date, house age, distance to the MRT (metro), number of convenience stores, and house price per unit area.

##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000
## 
## Call:
## lm(formula = SalePrice ~ GrLivArea + LotArea + OverallQual + 
##     Fireplaces + YearBuilt + BedroomAbvGr + Neighborhood_Average, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -363910  -18585   -1056   16636  274616 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5.132e+05  8.212e+04  -6.249 5.42e-10 ***
## GrLivArea             5.697e+01  3.006e+00  18.949  < 2e-16 ***
## LotArea               6.547e-01  1.010e-01   6.485 1.22e-10 ***
## OverallQual           1.755e+04  1.144e+03  15.342  < 2e-16 ***
## Fireplaces            7.106e+03  1.731e+03   4.105 4.28e-05 ***
## YearBuilt             2.248e+02  4.323e+01   5.200 2.28e-07 ***
## BedroomAbvGr         -7.434e+03  1.453e+03  -5.116 3.53e-07 ***
## Neighborhood_Average  3.739e-01  2.483e-02  15.060  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36220 on 1452 degrees of freedom
## Multiple R-squared:  0.7931, Adjusted R-squared:  0.7921 
## F-statistic: 795.3 on 7 and 1452 DF,  p-value: < 2.2e-16

At this point, the p-values of all the coefficients are less than 0.05.
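The same check can be done programmatically instead of by reading the summary table. The following is a minimal sketch, not the code used for this report: it fits a stand-in model on the built-in `mtcars` data because the housing data frame `df` is not reproduced here, and the formula and the `alpha` threshold are illustrative assumptions.

```r
# Stand-in model (assumption): the report's model is fitted on df with SalePrice
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients

# p-values of the predictors, dropping the intercept row
p_values <- coefs[-1, "Pr(>|t|)"]

# TRUE only if every predictor is significant at the chosen level
alpha <- 0.05
all_significant <- all(p_values < alpha)

print(p_values)
print(all_significant)
```

Replacing the formula and `data` argument with the housing model would give the equivalent check for the fit shown above.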

Data Analysis

Most of the residuals are distributed evenly around zero.

Most of the points follow the reference line, although there are some outliers at both ends. The residuals show a slight right skew.
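The two observations above describe standard residual diagnostics. A minimal sketch of how such plots can be produced with base R follows; it again uses a stand-in model on `mtcars` (an assumption, since `df` is not available here), so the shapes will differ from the housing model.

```r
# Stand-in model (assumption): substitute the housing model to reproduce the report
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Residuals vs fitted values: points should scatter evenly around the zero line
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest roughly normal residuals
qqnorm(res)
qqline(res)
```

Points that bend away from the line at either end of the Q-Q plot indicate heavier tails or skew than a normal distribution would produce.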

Conclusion

The p-value for each variable is less than 0.05, so no backward elimination is needed.

The adjusted R-squared is 0.7921 on 1452 residual degrees of freedom.
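These two figures can be read directly from the fitted model object rather than copied from the printed summary. The sketch below shows one way to do that and recomputes the adjusted R-squared by hand as a cross-check; the `mtcars` model is a stand-in and an assumption, not the housing model.

```r
# Stand-in model (assumption): replace with the housing model fitted on df
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

# Values reported at the bottom of the summary output
adj_r2 <- s$adj.r.squared
resid_df <- df.residual(fit)

# Manual check: adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n <- nrow(mtcars)          # number of observations
p <- length(coef(fit)) - 1 # number of predictors, excluding the intercept
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

print(c(adj_r2 = adj_r2, adj_manual = adj_manual, resid_df = resid_df))
```

The residual degrees of freedom equal n - p - 1, which is consistent with the "7 and 1452 DF" line in the summary above for a model with 7 predictors.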

As someone who has bought a property, I can attest that the year of purchase, the distance from metro stations, the age of the house, and the number of convenience stores nearby (accessibility) all affect real estate prices. Another factor that affects sale prices but is not in the dataset is the overall condition of the home.

Kaggle.com username: tony #2 score : 0.58291