Data 605 - Final Project

Hazal Gunduz

Problem 1.

Probability Density 1: X~Gamma. Using R, generate a random variable X that has 10,000 random Gamma pdf values. A Gamma pdf is completely described by n (a size parameter) and lambda (λ, a shape parameter). Choose any n greater than 3 and an expected value (λ) between 2 and 10 (you choose).

set.seed(0)

n <- 4
lambda <- 2
a <- 10000

X <- rgamma(a, n, lambda)
summary(X)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07235 1.26099 1.83317 1.99745 2.56044 7.60100
hist(X, col = 'skyblue')

Probability Density 2: Y~Sum of Exponentials. Then generate 10,000 observations from the sum of n exponential pdfs with rate/shape parameter (λ). The n and λ must be the same as in the previous case. (e.g., mysum=rexp(10000, λ) + rexp(10000, λ) + …)

set.seed(0)

Y <- rexp(a, lambda) + rexp(a, lambda) + rexp(a, lambda) + rexp(a, lambda)
hist(Y, col = 'orange')
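
A sum of n independent Exp(λ) draws follows a Gamma distribution with shape n and rate λ, so the X and Y samples should look alike. The following is a minimal sketch of that comparison, assuming the same n = 4 and λ = 2 as above; the seed and the names `gamma_draws`, `sum_draws`, and `quantile_check` are illustrative and not part of the original analysis.

```r
set.seed(605)  # illustrative seed, not the one used in the report

n <- 4
lambda <- 2
draws <- 10000

# Direct Gamma draws: shape n, rate lambda
gamma_draws <- rgamma(draws, shape = n, rate = lambda)

# Sum of n independent exponentials, one column per summand
sum_draws <- rowSums(matrix(rexp(draws * n, rate = lambda), ncol = n))

# Compare a few quantiles of the two samples side by side
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
quantile_check <- rbind(
  gamma = quantile(gamma_draws, probs),
  sum_of_exp = quantile(sum_draws, probs)
)
print(round(quantile_check, 3))
```

If the two rows agree to within sampling noise, the simulation is consistent with Y following the same Gamma distribution as X.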

Probability Density 3: Z~Exponential. Then generate 10,000 observations from a single exponential pdf with rate/shape parameter (λ).

set.seed(0)

Z <- rexp(a, lambda)
hist(Z, col = 'green')

1a. Calculate the empirical expected value (means) and variances of all three pdfs.

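n <- 4
lambda <- 2

# Closed-form moments. Gamma(n, lambda) and the sum of n exponentials share
# mean n / lambda and variance n / lambda^2; a single exponential has
# mean 1 / lambda and variance 1 / lambda^2.
meanX <- n / lambda
meanX
## [1] 2
varX <- n / lambda^2
varX
## [1] 1
meanY <- n / lambda
meanY
## [1] 2
varY <- n / lambda^2
varY
## [1] 1
meanZ <- 1 / lambda
meanZ
## [1] 0.5
varZ <- 1 / lambda^2
varZ
## [1] 0.25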
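
The empirical moments come directly from the simulated samples via `mean()` and `var()`, and can be set beside the closed-form values n/λ and n/λ² for the Gamma and the sum, and 1/λ and 1/λ² for the single exponential. The following is a sketch only: it regenerates the samples with an illustrative seed, the `empirical` table name is an assumption, and no output is shown because the exact figures depend on the random draws.

```r
set.seed(605)  # illustrative seed
n <- 4
lambda <- 2
draws <- 10000

X <- rgamma(draws, shape = n, rate = lambda)
Y <- rowSums(matrix(rexp(draws * n, rate = lambda), ncol = n))
Z <- rexp(draws, rate = lambda)

# Sample moments next to their theoretical counterparts
empirical <- data.frame(
  sample = c("X", "Y", "Z"),
  mean = c(mean(X), mean(Y), mean(Z)),
  variance = c(var(X), var(Y), var(Z)),
  theoretical_mean = c(n / lambda, n / lambda, 1 / lambda),
  theoretical_variance = c(n / lambda^2, n / lambda^2, 1 / lambda^2)
)
print(empirical)
```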

1b. Using calculus, calculate the expected value and variance of the Gamma pdf (X). Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z) and the sum of exponentials (Y)

λ = 2

Exponential density: \(f(x) = λe^{-λx}\), \(x \ge 0\)

\(M(t) = \int_{0}^{\infty} λe^{-λx} e^{tx}\,dx = \frac{λ}{λ - t}\), for \(t < λ\)

\(M'(t) = \frac{λ}{(λ - t)^2}\)

\(M'(0) = \frac{1}{λ}\)

Expected Value : \(\frac{1}{λ} = \frac{1}{2}\)

\(M''(t) = \frac{2λ}{(λ - t)^3}\)

\(M''(0) = \frac{2}{λ^2}\)

Variance : \(M''(0) - M'(0)^2 = \frac{2}{λ^2} - \frac{1}{λ^2} = \frac{1}{λ^2} = \frac{1}{4}\)
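
The same approach extends to the other two variables the question asks about. The derivation below is a sketch under the assumption that Y is a sum of n independent Exp(λ) variables and that X has the Gamma density with shape n and rate λ, matching the `rgamma(a, n, lambda)` call above.

```latex
% Sum of n independent exponentials: the MGF is the product of n identical MGFs
\begin{align*}
M_Y(t)   &= \left(\frac{\lambda}{\lambda - t}\right)^{n}, \qquad t < \lambda \\
M_Y'(0)  &= \frac{n}{\lambda}, \qquad
M_Y''(0)  = \frac{n(n+1)}{\lambda^{2}} \\
E[Y]     &= \frac{n}{\lambda} = \frac{4}{2} = 2, \qquad
\operatorname{Var}(Y) = \frac{n(n+1)}{\lambda^{2}} - \frac{n^{2}}{\lambda^{2}}
          = \frac{n}{\lambda^{2}} = 1
\end{align*}

% Gamma by direct integration, using \int_0^\infty x^{k-1} e^{-\lambda x}\,dx = \Gamma(k)/\lambda^{k}
\begin{align*}
E[X]     &= \int_{0}^{\infty} x \,\frac{\lambda^{n} x^{n-1} e^{-\lambda x}}{\Gamma(n)}\,dx
          = \frac{\Gamma(n+1)}{\lambda\,\Gamma(n)} = \frac{n}{\lambda} = 2 \\
E[X^{2}] &= \int_{0}^{\infty} x^{2} \,\frac{\lambda^{n} x^{n-1} e^{-\lambda x}}{\Gamma(n)}\,dx
          = \frac{\Gamma(n+2)}{\lambda^{2}\,\Gamma(n)} = \frac{n(n+1)}{\lambda^{2}} \\
\operatorname{Var}(X) &= E[X^{2}] - E[X]^{2} = \frac{n}{\lambda^{2}} = 1
\end{align*}
```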

1c-e. Probability. For pdf Z (the exponential), calculate empirically probabilities a through c. Then evaluate through calculus whether the memoryless property holds.

a. P(Z>λ | Z>λ/2) b. P(Z>2λ | Z>λ) c. P(Z>3λ | Z>λ)

\(P(A|B) = \frac{P(A∩B)}{P(B)}\)

a <- mean(Z > lambda & Z > lambda / 2) / (mean(Z > lambda / 2))
a 
## [1] 0.1543478
b <- mean(Z > 2 * lambda & Z > lambda) / (mean(Z > lambda ))
b
## [1] 0.01408451
c <- mean(Z > 3 * lambda & Z > lambda) / (mean(Z > lambda ))
c
## [1] 0
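
For an exponential variable, P(Z > s + t | Z > s) = P(Z > t) = e^{-λt}, so each conditional probability above has a closed form to compare against. The following is a sketch using `pexp` with λ = 2; the helper name `cond_tail` and the vectors `theory` and `unconditional` are illustrative.

```r
lambda <- 2

# P(Z > upper | Z > lower) for Z ~ Exp(rate), valid when upper >= lower
cond_tail <- function(upper, lower, rate) {
  pexp(upper, rate = rate, lower.tail = FALSE) /
    pexp(lower, rate = rate, lower.tail = FALSE)
}

theory <- c(
  a = cond_tail(lambda, lambda / 2, lambda),   # P(Z > lambda | Z > lambda / 2)
  b = cond_tail(2 * lambda, lambda, lambda),   # P(Z > 2 lambda | Z > lambda)
  c = cond_tail(3 * lambda, lambda, lambda)    # P(Z > 3 lambda | Z > lambda)
)

# Memoryless property: each conditional tail equals the unconditional tail
# of the extra waiting time (lambda / 2, lambda, and 2 * lambda respectively)
unconditional <- c(
  a = exp(-lambda * (lambda / 2)),
  b = exp(-lambda * lambda),
  c = exp(-lambda * 2 * lambda)
)
print(rbind(theory, unconditional))
```

With λ = 2 these evaluate to e^{-2} ≈ 0.135, e^{-4} ≈ 0.018, and e^{-8} ≈ 0.0003. The empirical values 0.154, 0.014, and 0 are in the same range; the gaps are plausible given how few of the 10,000 draws exceed λ.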

Loosely investigate whether P(YZ) = P(Y) P(Z) by building a table with quartiles and evaluating the marginal and joint probabilities.

quantY <- quantile(Y, probs = c(.25,.5,.75,1))
quantY <- as.matrix(t(quantY))
colnames(quantY) = c("25% Y", "50% Y", "75% Y", "100% Y")

quantZ = quantile(Z, probs = c(.25,.5,.75,1))
quantZ = as.matrix(quantZ)
rownames(quantZ) = c("25% Z", "50% Z", "75% Z", "100% Z")

P_YZ <- quantZ %*% quantY

P_Sum <- sum(P_YZ)
prob_YZ <- matrix(nrow = 4, ncol = 4)
for (row in 1:4) {
    for (col in 1:4) {
        prob_YZ[row,col] = P_YZ[row,col]/P_Sum
    }
}
colnames(prob_YZ) = c("25% Y", "50% Y", "75% Y", "100% Y")
rownames(prob_YZ) = c("25% Z", "50% Z", "75% Z", "100% Z")
prob_YZ
##              25% Y       50% Y       75% Y     100% Y
## 25% Z  0.002357123 0.003480398 0.004869254 0.01399171
## 50% Z  0.005767742 0.008516330 0.011914779 0.03423690
## 75% Z  0.011407022 0.016842981 0.023564187 0.06771126
## 100% Z 0.075904041 0.112075726 0.156799645 0.45056090
RSum <- rowSums(prob_YZ)
CSum <- colSums(prob_YZ)

Result <- cbind(prob_YZ, RSum)
# Append the grand total so the CSum row has one value per column
Result <- rbind(Result, CSum = c(CSum, sum(CSum)))
Result[4,4] 
## [1] 0.4505609
Result
##              25% Y       50% Y       75% Y     100% Y       RSum
## 25% Z  0.002357123 0.003480398 0.004869254 0.01399171 0.02469849
## 50% Z  0.005767742 0.008516330 0.011914779 0.03423690 0.06043575
## 75% Z  0.011407022 0.016842981 0.023564187 0.06771126 0.11952545
## 100% Z 0.075904041 0.112075726 0.156799645 0.45056090 0.79534032
## CSum   0.095435928 0.140915434 0.197147864 0.56650077 1.00000000

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

FisherTest <- prob_YZ * 10000

fisher.test(FisherTest, simulate.p.value = T)
## Warning in fisher.test(FisherTest, simulate.p.value = T): 'x' has been rounded
## to integer: Mean relative difference: 0.000366621
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  FisherTest
## p-value = 1
## alternative hypothesis: two.sided
chisq.test(FisherTest)
## 
##  Pearson's Chi-squared test
## 
## data:  FisherTest
## X-squared = 4.0282e-29, df = 9, p-value = 1

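Fisher’s Exact Test and the Chi-Square Test both return a p-value of 1, so neither rejects the null hypothesis that “Y” and “Z” are independent. This outcome is expected by construction: the table was built as a product of the Y and Z quantiles, so every cell factors exactly into its row and column margins. As for the difference between the two, Fisher’s test computes an exact p-value and is preferred when cell counts are small, while the Chi-Square test relies on a large-sample approximation. With 10,000 observations and large cell counts, the Chi-Square test is the more appropriate choice.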
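
A count-based alternative bins each sample at its quartiles and cross-tabulates the pairs, which gives the tests observed joint frequencies to work with. The following is a sketch, assuming Y and Z are drawn separately so that they are independent by construction; the seed and the `quartile_bin` helper are illustrative. Note that calling `set.seed(0)` before each sample, as in the chunks above, makes Z identical to the first summand of Y, in which case the two samples are dependent.

```r
set.seed(605)  # illustrative seed
n <- 4
lambda <- 2
draws <- 10000

Y <- rowSums(matrix(rexp(draws * n, rate = lambda), ncol = n))
Z <- rexp(draws, rate = lambda)  # drawn after Y, so independent of it

# Assign each observation to its quartile bin (Q1..Q4)
quartile_bin <- function(v) {
  cut(v, breaks = quantile(v, probs = seq(0, 1, 0.25)),
      include.lowest = TRUE, labels = c("Q1", "Q2", "Q3", "Q4"))
}

counts <- table(Z = quartile_bin(Z), Y = quartile_bin(Y))
joint <- prop.table(counts)                                # joint probabilities
marginal_product <- outer(rowSums(joint), colSums(joint))  # P(Z) * P(Y)

print(round(joint, 3))
print(round(marginal_product, 3))

# Chi-square suits the large counts here; Fisher's exact test on a 4x4
# table of this size is run with a simulated p-value to stay tractable
chi <- chisq.test(counts)
fis <- fisher.test(counts, simulate.p.value = TRUE, B = 2000)
print(chi$p.value)
print(fis$p.value)
```

Under independence each joint cell should sit near 1/16 = 0.0625, matching the product of the marginals.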

Problem 2.

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables are 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

train <- read.csv(file = "~/Downloads/train.csv", header=TRUE)
head(train, 10)
##    Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1   1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2   2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3   3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4   4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5   5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6   6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
## 7   7         20       RL          75   10084   Pave  <NA>      Reg         Lvl
## 8   8         60       RL          NA   10382   Pave  <NA>      IR1         Lvl
## 9   9         50       RM          51    6120   Pave  <NA>      Reg         Lvl
## 10 10        190       RL          50    7420   Pave  <NA>      Reg         Lvl
##    Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1     AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2     AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3     AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4     AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5     AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6     AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
## 7     AllPub    Inside       Gtl      Somerst       Norm       Norm     1Fam
## 8     AllPub    Corner       Gtl       NWAmes       PosN       Norm     1Fam
## 9     AllPub    Inside       Gtl      OldTown     Artery       Norm     1Fam
## 10    AllPub    Corner       Gtl      BrkSide     Artery     Artery   2fmCon
##    HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1      2Story           7           5      2003         2003     Gable  CompShg
## 2      1Story           6           8      1976         1976     Gable  CompShg
## 3      2Story           7           5      2001         2002     Gable  CompShg
## 4      2Story           7           5      1915         1970     Gable  CompShg
## 5      2Story           8           5      2000         2000     Gable  CompShg
## 6      1.5Fin           5           5      1993         1995     Gable  CompShg
## 7      1Story           8           5      2004         2005     Gable  CompShg
## 8      2Story           7           6      1973         1973     Gable  CompShg
## 9      1.5Fin           7           5      1931         1950     Gable  CompShg
## 10     1.5Unf           5           6      1939         1950     Gable  CompShg
##    Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1      VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2      MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3      VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4      Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5      VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6      VinylSd     VinylSd       None          0        TA        TA       Wood
## 7      VinylSd     VinylSd      Stone        186        Gd        TA      PConc
## 8      HdBoard     HdBoard      Stone        240        TA        TA     CBlock
## 9      BrkFace     Wd Shng       None          0        TA        TA     BrkTil
## 10     MetalSd     MetalSd       None          0        TA        TA     BrkTil
##    BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1        Gd       TA           No          GLQ        706          Unf
## 2        Gd       TA           Gd          ALQ        978          Unf
## 3        Gd       TA           Mn          GLQ        486          Unf
## 4        TA       Gd           No          ALQ        216          Unf
## 5        Gd       TA           Av          GLQ        655          Unf
## 6        Gd       TA           No          GLQ        732          Unf
## 7        Ex       TA           Av          GLQ       1369          Unf
## 8        Gd       TA           Mn          ALQ        859          BLQ
## 9        TA       TA           No          Unf          0          Unf
## 10       TA       TA           No          GLQ        851          Unf
##    BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1           0       150         856    GasA        Ex          Y      SBrkr
## 2           0       284        1262    GasA        Ex          Y      SBrkr
## 3           0       434         920    GasA        Ex          Y      SBrkr
## 4           0       540         756    GasA        Gd          Y      SBrkr
## 5           0       490        1145    GasA        Ex          Y      SBrkr
## 6           0        64         796    GasA        Ex          Y      SBrkr
## 7           0       317        1686    GasA        Ex          Y      SBrkr
## 8          32       216        1107    GasA        Ex          Y      SBrkr
## 9           0       952         952    GasA        Gd          Y      FuseF
## 10          0       140         991    GasA        Ex          Y      SBrkr
##    X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## 1        856       854            0      1710            1            0
## 2       1262         0            0      1262            0            1
## 3        920       866            0      1786            1            0
## 4        961       756            0      1717            1            0
## 5       1145      1053            0      2198            1            0
## 6        796       566            0      1362            1            0
## 7       1694         0            0      1694            1            0
## 8       1107       983            0      2090            1            0
## 9       1022       752            0      1774            0            0
## 10      1077         0            0      1077            1            0
##    FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 1         2        1            3            1          Gd            8
## 2         2        0            3            1          TA            6
## 3         2        1            3            1          Gd            6
## 4         1        0            3            1          Gd            7
## 5         2        1            4            1          Gd            9
## 6         1        1            1            1          TA            5
## 7         2        0            3            1          Gd            7
## 8         2        1            3            1          TA            7
## 9         2        0            2            2          TA            8
## 10        1        0            2            2          TA            5
##    Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
## 1         Typ          0        <NA>     Attchd        2003          RFn
## 2         Typ          1          TA     Attchd        1976          RFn
## 3         Typ          1          TA     Attchd        2001          RFn
## 4         Typ          1          Gd     Detchd        1998          Unf
## 5         Typ          1          TA     Attchd        2000          RFn
## 6         Typ          0        <NA>     Attchd        1993          Unf
## 7         Typ          1          Gd     Attchd        2004          RFn
## 8         Typ          2          TA     Attchd        1973          RFn
## 9        Min1          2          TA     Detchd        1931          Unf
## 10        Typ          2          TA     Attchd        1939          RFn
##    GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF
## 1           2        548         TA         TA          Y          0
## 2           2        460         TA         TA          Y        298
## 3           2        608         TA         TA          Y          0
## 4           3        642         TA         TA          Y          0
## 5           3        836         TA         TA          Y        192
## 6           2        480         TA         TA          Y         40
## 7           2        636         TA         TA          Y        255
## 8           2        484         TA         TA          Y        235
## 9           2        468         Fa         TA          Y         90
## 10          1        205         Gd         TA          Y          0
##    OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence
## 1           61             0          0           0        0   <NA>  <NA>
## 2            0             0          0           0        0   <NA>  <NA>
## 3           42             0          0           0        0   <NA>  <NA>
## 4           35           272          0           0        0   <NA>  <NA>
## 5           84             0          0           0        0   <NA>  <NA>
## 6           30             0        320           0        0   <NA> MnPrv
## 7           57             0          0           0        0   <NA>  <NA>
## 8          204           228          0           0        0   <NA>  <NA>
## 9            0           205          0           0        0   <NA>  <NA>
## 10           4             0          0           0        0   <NA>  <NA>
##    MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1         <NA>       0      2   2008       WD        Normal    208500
## 2         <NA>       0      5   2007       WD        Normal    181500
## 3         <NA>       0      9   2008       WD        Normal    223500
## 4         <NA>       0      2   2006       WD       Abnorml    140000
## 5         <NA>       0     12   2008       WD        Normal    250000
## 6         Shed     700     10   2009       WD        Normal    143000
## 7         <NA>       0      8   2007       WD        Normal    307000
## 8         Shed     350     11   2009       WD        Normal    200000
## 9         <NA>       0      4   2008       WD       Abnorml    129900
## 10        <NA>       0      1   2008       WD        Normal    118000
test <- read.csv(file = "~/Downloads/test.csv")
head(test, 10)
##      Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1  1461         20       RH          80   11622   Pave  <NA>      Reg
## 2  1462         20       RL          81   14267   Pave  <NA>      IR1
## 3  1463         60       RL          74   13830   Pave  <NA>      IR1
## 4  1464         60       RL          78    9978   Pave  <NA>      IR1
## 5  1465        120       RL          43    5005   Pave  <NA>      IR1
## 6  1466         60       RL          75   10000   Pave  <NA>      IR1
## 7  1467         20       RL          NA    7980   Pave  <NA>      IR1
## 8  1468         60       RL          63    8402   Pave  <NA>      IR1
## 9  1469         20       RL          85   10176   Pave  <NA>      Reg
## 10 1470         20       RL          70    8400   Pave  <NA>      Reg
##    LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
## 1          Lvl    AllPub    Inside       Gtl        NAmes      Feedr       Norm
## 2          Lvl    AllPub    Corner       Gtl        NAmes       Norm       Norm
## 3          Lvl    AllPub    Inside       Gtl      Gilbert       Norm       Norm
## 4          Lvl    AllPub    Inside       Gtl      Gilbert       Norm       Norm
## 5          HLS    AllPub    Inside       Gtl      StoneBr       Norm       Norm
## 6          Lvl    AllPub    Corner       Gtl      Gilbert       Norm       Norm
## 7          Lvl    AllPub    Inside       Gtl      Gilbert       Norm       Norm
## 8          Lvl    AllPub    Inside       Gtl      Gilbert       Norm       Norm
## 9          Lvl    AllPub    Inside       Gtl      Gilbert       Norm       Norm
## 10         Lvl    AllPub    Corner       Gtl        NAmes       Norm       Norm
##    BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
## 1      1Fam     1Story           5           6      1961         1961     Gable
## 2      1Fam     1Story           6           6      1958         1958       Hip
## 3      1Fam     2Story           5           5      1997         1998     Gable
## 4      1Fam     2Story           6           6      1998         1998     Gable
## 5    TwnhsE     1Story           8           5      1992         1992     Gable
## 6      1Fam     2Story           6           5      1993         1994     Gable
## 7      1Fam     1Story           6           7      1992         2007     Gable
## 8      1Fam     2Story           6           5      1998         1998     Gable
## 9      1Fam     1Story           7           5      1990         1990     Gable
## 10     1Fam     1Story           4           5      1970         1970     Gable
##    RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
## 1   CompShg     VinylSd     VinylSd       None          0        TA        TA
## 2   CompShg     Wd Sdng     Wd Sdng    BrkFace        108        TA        TA
## 3   CompShg     VinylSd     VinylSd       None          0        TA        TA
## 4   CompShg     VinylSd     VinylSd    BrkFace         20        TA        TA
## 5   CompShg     HdBoard     HdBoard       None          0        Gd        TA
## 6   CompShg     HdBoard     HdBoard       None          0        TA        TA
## 7   CompShg     HdBoard     HdBoard       None          0        TA        Gd
## 8   CompShg     VinylSd     VinylSd       None          0        TA        TA
## 9   CompShg     HdBoard     HdBoard       None          0        TA        TA
## 10  CompShg     Plywood     Plywood       None          0        TA        TA
##    Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 1      CBlock       TA       TA           No          Rec        468
## 2      CBlock       TA       TA           No          ALQ        923
## 3       PConc       Gd       TA           No          GLQ        791
## 4       PConc       TA       TA           No          GLQ        602
## 5       PConc       Gd       TA           No          ALQ        263
## 6       PConc       Gd       TA           No          Unf          0
## 7       PConc       Gd       TA           No          ALQ        935
## 8       PConc       Gd       TA           No          Unf          0
## 9       PConc       Gd       TA           Gd          GLQ        637
## 10     CBlock       TA       TA           No          ALQ        804
##    BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## 1           LwQ        144       270         882    GasA        TA          Y
## 2           Unf          0       406        1329    GasA        TA          Y
## 3           Unf          0       137         928    GasA        Gd          Y
## 4           Unf          0       324         926    GasA        Ex          Y
## 5           Unf          0      1017        1280    GasA        Ex          Y
## 6           Unf          0       763         763    GasA        Gd          Y
## 7           Unf          0       233        1168    GasA        Ex          Y
## 8           Unf          0       789         789    GasA        Gd          Y
## 9           Unf          0       663        1300    GasA        Gd          Y
## 10          Rec         78         0         882    GasA        TA          Y
##    Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## 1       SBrkr       896         0            0       896            0
## 2       SBrkr      1329         0            0      1329            0
## 3       SBrkr       928       701            0      1629            0
## 4       SBrkr       926       678            0      1604            0
## 5       SBrkr      1280         0            0      1280            0
## 6       SBrkr       763       892            0      1655            0
## 7       SBrkr      1187         0            0      1187            1
## 8       SBrkr       789       676            0      1465            0
## 9       SBrkr      1341         0            0      1341            1
## 10      SBrkr       882         0            0       882            1
##    BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## 1             0        1        0            2            1          TA
## 2             0        1        1            3            1          Gd
## 3             0        2        1            3            1          TA
## 4             0        2        1            3            1          Gd
## 5             0        2        0            2            1          Gd
## 6             0        2        1            3            1          TA
## 7             0        2        0            3            1          TA
## 8             0        2        1            3            1          TA
## 9             0        1        1            2            1          Gd
## 10            0        1        0            2            1          TA
##    TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 1             5        Typ          0        <NA>     Attchd        1961
## 2             6        Typ          0        <NA>     Attchd        1958
## 3             6        Typ          1          TA     Attchd        1997
## 4             7        Typ          1          Gd     Attchd        1998
## 5             5        Typ          0        <NA>     Attchd        1992
## 6             7        Typ          1          TA     Attchd        1993
## 7             6        Typ          0        <NA>     Attchd        1992
## 8             7        Typ          1          Gd     Attchd        1998
## 9             5        Typ          1          Po     Attchd        1990
## 10            4        Typ          0        <NA>     Attchd        1970
##    GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
## 1           Unf          1        730         TA         TA          Y
## 2           Unf          1        312         TA         TA          Y
## 3           Fin          2        482         TA         TA          Y
## 4           Fin          2        470         TA         TA          Y
## 5           RFn          2        506         TA         TA          Y
## 6           Fin          2        440         TA         TA          Y
## 7           Fin          2        420         TA         TA          Y
## 8           Fin          2        393         TA         TA          Y
## 9           Unf          2        506         TA         TA          Y
## 10          Fin          2        525         TA         TA          Y
##    WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1         140           0             0          0         120        0   <NA>
## 2         393          36             0          0           0        0   <NA>
## 3         212          34             0          0           0        0   <NA>
## 4         360          36             0          0           0        0   <NA>
## 5           0          82             0          0         144        0   <NA>
## 6         157          84             0          0           0        0   <NA>
## 7         483          21             0          0           0        0   <NA>
## 8           0          75             0          0           0        0   <NA>
## 9         192           0             0          0           0        0   <NA>
## 10        240           0             0          0           0        0   <NA>
##    Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
## 1  MnPrv        <NA>       0      6   2010       WD        Normal
## 2   <NA>        Gar2   12500      6   2010       WD        Normal
## 3  MnPrv        <NA>       0      3   2010       WD        Normal
## 4   <NA>        <NA>       0      6   2010       WD        Normal
## 5   <NA>        <NA>       0      1   2010       WD        Normal
## 6   <NA>        <NA>       0      4   2010       WD        Normal
## 7  GdPrv        Shed     500      3   2010       WD        Normal
## 8   <NA>        <NA>       0      5   2010       WD        Normal
## 9   <NA>        <NA>       0      2   2010       WD        Normal
## 10 MnPrv        <NA>       0      4   2010       WD        Normal
summary(train)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
## 

Univariate descriptive statistics describe a single variable in a dataset. We use summary() and sd() to get a set of summary statistics for the SalePrice variable.

summary(train$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
sd(train$SalePrice)
## [1] 79442.5
hist(train$SalePrice, col = 'orange')

dotchart(train$SalePrice, col = 'green')

We use a density plot to show the shape of the SalePrice distribution.

plot(density(train$SalePrice))

We plot log(SalePrice) and compare the charts with the original histogram and dot chart above to see whether the log transformation brings SalePrice closer to a normal distribution.

hist(log(train$SalePrice))

dotchart(log(train$SalePrice))
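
The effect of a log transformation on skewness can also be checked numerically. The sketch below is only an illustration: it uses a simulated right-skewed vector (sale_price is a stand-in, not the Kaggle column) and a hand-written helper skew() that follows the usual moment-based definition.

```r
# Stand-in data: a simulated right-skewed vector, NOT the Kaggle SalePrice column.
set.seed(1)
sale_price <- rlnorm(1460, meanlog = 12, sdlog = 0.4)

# Moment-based sample skewness: third central moment over variance^(3/2).
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skew(sale_price)       # clearly positive for right-skewed data
skew_log <- skew(log(sale_price))  # close to zero if the log is roughly normal
```

On the real data the same comparison would be skew(train$SalePrice) against skew(log(train$SalePrice)); a normal Q-Q plot via qqnorm() and qqline() gives a visual version of the same check.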

Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Correlation

A correlation matrix is like a numerical version of a scatterplot matrix and shows the pairwise correlations between multiple variables. We first show the correlation matrix for five variables because it is instructive, then narrow it to SalePrice and two predictors: OverallQual, the variable most strongly correlated with SalePrice, and OverallCond, the variable least correlated with it.

cor(train[c("SalePrice", "LotArea", "OverallQual", "OverallCond", "YearBuilt")])
##               SalePrice     LotArea OverallQual OverallCond   YearBuilt
## SalePrice    1.00000000  0.26384335  0.79098160 -0.07785589  0.52289733
## LotArea      0.26384335  1.00000000  0.10580574 -0.00563627  0.01422765
## OverallQual  0.79098160  0.10580574  1.00000000 -0.09193234  0.57232277
## OverallCond -0.07785589 -0.00563627 -0.09193234  1.00000000 -0.37598320
## YearBuilt    0.52289733  0.01422765  0.57232277 -0.37598320  1.00000000
CorrelationMatrix <- cor(train[c("SalePrice", "OverallQual", "OverallCond")])
CorrelationMatrix
##               SalePrice OverallQual OverallCond
## SalePrice    1.00000000  0.79098160 -0.07785589
## OverallQual  0.79098160  1.00000000 -0.09193234
## OverallCond -0.07785589 -0.09193234  1.00000000

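cor.test() tests the null hypothesis that a correlation is zero. If the p-value is below 5%, we reject that hypothesis; if it is above 5%, we cannot reject it. Here both p-values are below 5% (p < 2.2e-16 for OverallQual and p = 0.002912 for OverallCond), so both correlations differ significantly from zero. Note that cor.test() takes its confidence level through the conf.level argument; confidence = 0.80 in the calls below is not a recognized argument, so the output reports the default 95% interval rather than an 80% one.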

cor.test(train$SalePrice, train$OverallQual, confidence = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$SalePrice and train$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7709644 0.8094376
## sample estimates:
##       cor 
## 0.7909816
cor.test(train$SalePrice, train$OverallCond, confidence = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$SalePrice and train$OverallCond
## t = -2.9819, df = 1458, p-value = 0.002912
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.12864437 -0.02666008
## sample estimates:
##         cor 
## -0.07785589
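
For a Pearson correlation, the interval reported by cor.test() comes from the Fisher z transformation, so it can be reproduced from r and n alone. The sketch below uses the values printed above for SalePrice and OverallQual; fisher_ci() is a hypothetical helper written for this illustration, and the 80% level is the one the assignment asks for (in cor.test() itself it would be requested with conf.level = 0.80).

```r
# Values copied from the cor.test() output above.
r <- 0.7909816   # sample correlation, SalePrice vs OverallQual
n <- 1458 + 2    # df = n - 2

# Hypothetical helper: Fisher z confidence interval for a Pearson correlation.
fisher_ci <- function(r, n, conf.level = 0.95) {
  z  <- atanh(r)                     # Fisher transform of r
  se <- 1 / sqrt(n - 3)              # approximate standard error of z
  q  <- qnorm((1 + conf.level) / 2)  # two-sided normal quantile
  tanh(z + c(-1, 1) * q * se)        # back-transform to the correlation scale
}

ci95 <- fisher_ci(r, n)                     # should match the printed 95% interval
ci80 <- fisher_ci(r, n, conf.level = 0.80)  # narrower interval at the 80% level
```

The 80% interval sits strictly inside the 95% one, since a lower confidence level uses a smaller normal quantile.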

Precision Matrix

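We invert the correlation matrix with solve() to obtain the precision matrix, then multiply the correlation matrix by it. The product is the identity matrix up to floating-point rounding error.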

PrecisionMatrix <- solve(CorrelationMatrix)
PrecisionMatrix
##               SalePrice OverallQual OverallCond
## SalePrice    2.67150050 -2.11183483  0.01384614
## OverallQual -2.11183483  2.67793985  0.08177049
## OverallCond  0.01384614  0.08177049  1.00859536
CorrelationMatrix %*% PrecisionMatrix
##                 SalePrice  OverallQual OverallCond
## SalePrice    1.000000e+00 7.112366e-17           0
## OverallQual -1.994932e-17 1.000000e+00           0
## OverallCond -1.734723e-18 2.775558e-17           1
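
The assignment also asks for the product in the other order and notes that the diagonal of the precision matrix holds the variance inflation factors. A minimal sketch of both checks follows, rebuilding the correlation matrix from the values printed above; the names C, P, left, right, r2_sale and vif_sale are chosen here for illustration.

```r
# Correlation matrix copied from the output above.
vars <- c("SalePrice", "OverallQual", "OverallCond")
C <- matrix(c( 1.00000000,  0.79098160, -0.07785589,
               0.79098160,  1.00000000, -0.09193234,
              -0.07785589, -0.09193234,  1.00000000),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

P <- solve(C)      # precision matrix

left  <- C %*% P   # correlation times precision
right <- P %*% C   # precision times correlation

# Both products should be the 3x3 identity up to rounding error.
both_identity <- isTRUE(all.equal(unname(left),  diag(3))) &&
                 isTRUE(all.equal(unname(right), diag(3)))

# VIF of SalePrice from its R^2 on the other two variables: 1 / (1 - R^2).
r2_sale  <- drop(C[1, -1] %*% solve(C[-1, -1]) %*% C[-1, 1])
vif_sale <- 1 / (1 - r2_sale)   # compare with diag(P)["SalePrice"]
```

Because a matrix commutes with its own inverse, both products give the identity; vif_sale agreeing with the first diagonal entry of P is what "variance inflation factors on the diagonal" means.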

LU decomposition factors an n×n matrix into two triangular matrices: a lower triangular matrix L and an upper triangular matrix U. L has zeros above the diagonal and U has zeros below the diagonal, so that multiplying L by U gives back the original n×n matrix.

library(pracma)
LU <- lu(PrecisionMatrix)
LU$L
##                SalePrice OverallQual OverallCond
## SalePrice    1.000000000  0.00000000           0
## OverallQual -0.790505124  1.00000000           0
## OverallCond  0.005182906  0.09193234           1
LU$U
##             SalePrice OverallQual OverallCond
## SalePrice    2.671501   -2.111835  0.01384614
## OverallQual  0.000000    1.008524  0.09271594
## OverallCond  0.000000    0.000000  1.00000000
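
The defining property L %*% U = A can be checked directly. The sketch below does not use pracma; it relies on lu_doolittle(), a hypothetical base-R helper written for this illustration that performs Doolittle elimination without pivoting and assumes all pivots are non-zero. The precision matrix is rebuilt from the correlation values printed earlier.

```r
# Hypothetical helper: Doolittle LU factorisation without pivoting.
# Assumes a square matrix whose pivots are all non-zero.
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # elimination multiplier
      U[i, ]  <- U[i, ] - L[i, k] * U[k, ]  # clear the entry below the pivot
    }
  }
  list(L = L, U = U)
}

# Precision matrix rebuilt from the correlation matrix printed earlier.
C <- matrix(c( 1.00000000,  0.79098160, -0.07785589,
               0.79098160,  1.00000000, -0.09193234,
              -0.07785589, -0.09193234,  1.00000000),
            nrow = 3, byrow = TRUE)
P <- solve(C)

f <- lu_doolittle(P)
max_error <- max(abs(f$L %*% f$U - P))  # should be near machine precision
```

The multipliers stored in L (for example about -0.7905 in row 2, column 1) agree with the LU$L output above, which suggests the same unpivoted scheme was applied there.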

Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

library(moments)

skewness(train$SalePrice)
## [1] 1.880941
skewness(train$OverallQual)
## [1] 0.216721
skewness(train$OverallCond)
## [1] 0.6923552
skewness(train$LotArea)
## [1] 12.19514
skewness(train$YearBuilt)
## [1] -0.6128307
range(train$SalePrice)
## [1]  34900 755000

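SalePrice is skewed to the right (skewness 1.88), and its lowest value is $34,900, which is already above zero, so no shift is needed before fitting.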

library(MASS)

ExpProbability <- fitdistr(train$SalePrice, "exponential")
ExpProbability
##        rate    
##   5.527268e-06 
##  (1.446552e-07)
lambda <- ExpProbability$estimate
set.seed(1000)
GenerationData <- rexp(1000, rate = lambda)
lambda
##         rate 
## 5.527268e-06
hist(train$SalePrice, main = "Original Data", col = "lightgreen")

hist(GenerationData, main = "Generated Data", col = "lightblue")
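
For an exponential model the maximum-likelihood rate has a closed form, the reciprocal of the sample mean, which is why the fitted rate is so close to 1 / 180921. The sketch below checks this with the rounded mean printed by summary() and, separately, on simulated data (x is a stand-in sample, not the Kaggle data); the names mean_price, loglik and rate_opt are illustrative.

```r
# Sample mean of SalePrice as printed by summary() above (rounded).
mean_price <- 180921

# The exponential log-likelihood n * log(rate) - rate * sum(x)
# is maximised at rate = 1 / mean(x).
rate_hat <- 1 / mean_price

# Numerical check on simulated data (a stand-in, not the Kaggle data).
set.seed(1)
x <- rexp(1000, rate = 2)
loglik   <- function(rate) length(x) * log(rate) - rate * sum(x)
rate_opt <- optimize(loglik, interval = c(0.01, 10), maximum = TRUE)$maximum
```

Here rate_hat should agree with the fitdistr() estimate up to the rounding of the mean, and rate_opt should land on 1 / mean(x) up to the optimizer's tolerance.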

5th and 95th Percentiles

qexp(0.05, rate = lambda)
## [1] 9280.044
qexp(0.95, rate = lambda)
## [1] 541991.5
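
These percentiles follow from inverting the exponential CDF F(x) = 1 - exp(-λx), which gives x_p = -log(1 - p) / λ. A minimal sketch, using the rate printed by fitdistr() above and a hypothetical helper exp_quantile():

```r
lambda <- 5.527268e-06   # rate printed by fitdistr() above

# Inverting F(x) = 1 - exp(-lambda * x) gives x_p = -log(1 - p) / lambda.
exp_quantile <- function(p, rate) -log(1 - p) / rate

p05 <- exp_quantile(0.05, lambda)
p95 <- exp_quantile(0.95, lambda)

# The built-in quantile function should return the same values.
agree <- isTRUE(all.equal(c(p05, p95), qexp(c(0.05, 0.95), rate = lambda)))
```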

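95% Confidence Interval

The call below returns the 95th percentile of a normal distribution with the sample mean and standard deviation of SalePrice. This is a one-sided upper bound, not a two-sided 95% interval.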

qnorm(0.95, mean(train$SalePrice), sd(train$SalePrice))
## [1] 311592.5
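
A two-sided interval under the normality assumption can be sketched from the summary values already printed. Which interval the assignment intends is an assumption here, so both common readings are shown: an interval for the mean (using the standard error) and the central 95% range of a fitted normal (using the standard deviation). The names m, s, n, ci_mean and range_95 are illustrative.

```r
# Summary values printed earlier (the mean is rounded by summary()).
m <- 180921    # mean of SalePrice
s <- 79442.5   # standard deviation of SalePrice
n <- 1460      # number of rows in train

z <- qnorm(0.975)                          # two-sided 95% critical value

ci_mean  <- m + c(-1, 1) * z * s / sqrt(n) # 95% confidence interval for the mean
range_95 <- m + c(-1, 1) * z * s           # central 95% range of a fitted normal
```

The interval for the mean is far narrower than the range of individual prices because the standard error shrinks with the square root of n.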

5th & 95th Empirical

quantile(train$SalePrice, 0.05, names = FALSE)
## [1] 88000
quantile(train$SalePrice, 0.95, names=FALSE)
## [1] 326100

Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.

We are building a model for estimating SalePrice based on the OverallCond, OverallQual, LotArea and YearBuilt of the houses. This model is a multiple regression model with SalePrice as the response variable and OverallCond, OverallQual, LotArea and YearBuilt as the predictor variables.

Model <- lm(SalePrice ~ OverallCond + OverallQual + LotArea + YearBuilt, data = train)
summary(Model)
## 
## Call:
## lm(formula = SalePrice ~ OverallCond + OverallQual + LotArea + 
##     YearBuilt, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -268634  -26234   -3667   20004  393023 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.984e+05  1.032e+05  -7.733 1.94e-14 ***
## OverallCond  2.736e+03  1.178e+03   2.323   0.0203 *  
## OverallQual  4.003e+04  1.079e+03  37.107  < 2e-16 ***
## LotArea      1.500e+00  1.210e-01  12.397  < 2e-16 ***
## YearBuilt    3.572e+02  5.279e+01   6.767 1.90e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45770 on 1455 degrees of freedom
## Multiple R-squared:  0.6689, Adjusted R-squared:  0.668 
## F-statistic:   735 on 4 and 1455 DF,  p-value: < 2.2e-16
summary(Model)$coefficients
##                  Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept) -7.984136e+05 1.032419e+05 -7.733428  1.942605e-14
## OverallCond  2.736366e+03 1.177978e+03  2.322934  2.031999e-02
## OverallQual  4.002845e+04 1.078727e+03 37.107124 1.170950e-212
## LotArea      1.499481e+00 1.209568e-01 12.396829  1.269000e-33
## YearBuilt    3.572131e+02 5.278759e+01  6.766990  1.900296e-11
ModelTransform <- lm(SalePrice^(1/2) ~ OverallCond + OverallQual + LotArea + YearBuilt, data = train)
summary(ModelTransform)
## 
## Call:
## lm(formula = SalePrice^(1/2) ~ OverallCond + OverallQual + LotArea + 
##     YearBuilt, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -287.780  -27.326   -2.491   24.503  260.467 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.798e+02  1.025e+02  -9.559  < 2e-16 ***
## OverallCond  5.835e+00  1.170e+00   4.989 6.78e-07 ***
## OverallQual  4.294e+01  1.071e+00  40.093  < 2e-16 ***
## LotArea      1.627e-03  1.201e-04  13.546  < 2e-16 ***
## YearBuilt    5.503e-01  5.241e-02  10.501  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.44 on 1455 degrees of freedom
## Multiple R-squared:   0.72,  Adjusted R-squared:  0.7193 
## F-statistic: 935.5 on 4 and 1455 DF,  p-value: < 2.2e-16
summary(ModelTransform)$coefficients
##                  Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept) -9.797832e+02 1.025006e+02 -9.558808  4.866473e-21
## OverallCond  5.835237e+00 1.169520e+00  4.989429  6.784580e-07
## OverallQual  4.293847e+01 1.070981e+00 40.092644 2.142094e-237
## LotArea      1.626667e-03 1.200883e-04 13.545592  1.861360e-39
## YearBuilt    5.503384e-01 5.240855e-02 10.500926  6.484907e-25
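
The two R-squared values above are measured on different response scales (dollars versus square-root dollars), so they are not directly comparable, and predictions from the transformed model have to be squared before they can be submitted as prices. The sketch below illustrates the back-transformation on synthetic data only: toy is a simulated data frame whose column names mirror the Kaggle set, and fit_sqrt, pred_dollars and r2_dollars are illustrative names.

```r
# Synthetic stand-in data; column names mirror the Kaggle set but the
# values are simulated for illustration only.
set.seed(42)
n <- 200
toy <- data.frame(
  OverallQual = sample(1:10, n, replace = TRUE),
  LotArea     = round(runif(n, 2000, 20000))
)
toy$SalePrice <- (150 + 40 * toy$OverallQual + 0.002 * toy$LotArea +
                    rnorm(n, sd = 20))^2

fit_sqrt <- lm(sqrt(SalePrice) ~ OverallQual + LotArea, data = toy)

# predict() returns values on the square-root scale; square them for dollars.
pred_sqrt    <- predict(fit_sqrt, newdata = toy)
pred_dollars <- pred_sqrt^2

# R^2 on the original dollar scale, comparable with an untransformed model.
r2_dollars <- 1 - sum((toy$SalePrice - pred_dollars)^2) /
                  sum((toy$SalePrice - mean(toy$SalePrice))^2)
```

With the real data, the same pattern would be predict(ModelTransform, newdata = test)^2, assuming a test data frame with the same predictor columns has been loaded.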