Data 605 - Final Project
Hazal Gunduz
Problem 1.
Probability Density 1: X~Gamma. Using R, generate a random variable X that has 10,000 random Gamma pdf values. A Gamma pdf is completely described by n (a size parameter) and lambda (λ, a shape parameter). Choose any n greater than 3 and an expected value (λ) between 2 and 10 (you choose).
set.seed(0)
n <- 4
lambda <- 2
a <- 10000
X <- rgamma(a, n, lambda)
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07235 1.26099 1.83317 1.99745 2.56044 7.60100
hist(X, col = 'skyblue')
Probability Density 2: Y~Sum of Exponentials. Then generate 10,000 observations from the sum of n exponential pdfs with rate/shape parameter (λ). The n and λ must be the same as in the previous case. (e.g., mysum=rexp(10000, λ) + rexp(10000, λ) + …)
set.seed(0)
Y <- rexp(a, lambda) + rexp(a, lambda) + rexp(a, lambda) + rexp(a, lambda)
hist(Y, col = 'orange')
Probability Density 3: Z~Exponential. Then generate 10,000 observations from a single exponential pdf with rate/shape parameter (λ).
set.seed(0)
Z <- rexp(a, lambda)
hist(Z, col = 'green')
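A Gamma variable with integer shape n and rate λ has the same distribution as the sum of n independent exponentials with rate λ, so X and Y should look alike. The sketch below is one way to check this. It assumes a single seed for the whole run (rather than resetting it before each sample) and builds Y with a helper matrix, which is not how the report generates it.

```r
# Sketch: compare the Gamma sample with the sum of n exponentials.
set.seed(0)
n <- 4
lambda <- 2
a <- 10000

X <- rgamma(a, shape = n, rate = lambda)
# One row per observation, one column per exponential summand
Y <- rowSums(matrix(rexp(a * n, rate = lambda), nrow = a))

# Histogram of Y on the density scale with the Gamma(n, lambda) pdf on top
hist(Y, breaks = 50, freq = FALSE, col = "orange",
     main = "Sum of 4 exponentials vs Gamma(4, 2) density")
curve(dgamma(x, shape = n, rate = lambda), add = TRUE, lwd = 2)

# Two-sample Kolmogorov-Smirnov test: a large p-value is consistent with
# X and Y coming from the same distribution
ks <- ks.test(X, Y)
ks$p.value
```

A small p-value here would point to a mistake in the simulation rather than to a real difference between the two distributions.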
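1a. Calculate the empirical expected value (means) and variances of all three pdfs.
The values below are the theoretical moments implied by n and λ: n/λ and n/λ² for X and Y, and 1/λ and 1/λ² for Z. The empirical counterparts are the sample statistics mean() and var() of X, Y and Z.
n <- 4
lambda <- 2
meanX <- n / lambda
meanX
## [1] 2
varX <- n / lambda^2
varX
## [1] 1
meanY <- n / lambda
meanY
## [1] 2
varY <- n / lambda^2
varY
## [1] 1
meanZ <- 1 / lambda
meanZ
## [1] 0.5
varZ <- 1 / lambda^2
varZ
## [1] 0.25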
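Since the prompt asks for empirical moments, here is a minimal sketch that puts the sample means and variances next to their theoretical values. It regenerates the three samples from a single seed, so the numbers will not match the report's samples exactly. The `moments` data frame is a name introduced only for this illustration.

```r
# Sketch: empirical vs theoretical moments for X, Y and Z.
set.seed(0)
n <- 4
lambda <- 2
a <- 10000

X <- rgamma(a, shape = n, rate = lambda)
Y <- rowSums(matrix(rexp(a * n, rate = lambda), nrow = a))
Z <- rexp(a, rate = lambda)

moments <- data.frame(
  empirical_mean   = c(mean(X), mean(Y), mean(Z)),
  theoretical_mean = c(n / lambda, n / lambda, 1 / lambda),
  empirical_var    = c(var(X), var(Y), var(Z)),
  theoretical_var  = c(n / lambda^2, n / lambda^2, 1 / lambda^2),
  row.names = c("X", "Y", "Z")
)
print(round(moments, 4))
```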
1b. Using calculus, calculate the expected value and variance of the Gamma pdf (X). Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z) and the sum of exponentials (Y).
λ = 2
Exponential density: \(f(x) = λe^{-λx}\), \(x \ge 0\)
\(M(t) = \int_{0}^{\infty} λe^{-λx} e^{tx}\,dx = \frac{λ}{λ - t}\), for \(t < λ\)
\(M'(t) = \frac{λ}{(λ - t)^2}\)
\(M'(0) = \frac{1}{λ}\)
Expected Value : \(E[Z] = \frac{1}{λ} = \frac{1}{2}\)
\(M''(t) = \frac{2λ}{(λ - t)^3}\)
\(M''(0) = \frac{2}{λ^2} = \frac{2}{4} = \frac{1}{2}\)
Variance : \(M''(0) - [M'(0)]^2 = \frac{2}{λ^2} - \frac{1}{λ^2} = \frac{1}{λ^2} = \frac{1}{4}\)
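The prompt also asks for the Gamma moments by calculus and for the expected value of the sum of exponentials. One possible derivation follows. It assumes the rate parameterization \(f(x) = \frac{λ^n}{\Gamma(n)} x^{n-1} e^{-λx}\), which matches rgamma(a, n, lambda) with n = 4 and λ = 2, and it assumes the n exponentials in Y are independent.

```latex
\begin{align*}
E[X] &= \int_0^\infty x \,\frac{λ^n}{\Gamma(n)}\, x^{n-1} e^{-λx}\,dx
      = \frac{λ^n}{\Gamma(n)} \cdot \frac{\Gamma(n+1)}{λ^{n+1}}
      = \frac{n}{λ} = \frac{4}{2} = 2 \\
E[X^2] &= \frac{λ^n}{\Gamma(n)} \cdot \frac{\Gamma(n+2)}{λ^{n+2}}
      = \frac{n(n+1)}{λ^2} = \frac{20}{4} = 5 \\
\mathrm{Var}(X) &= E[X^2] - (E[X])^2 = 5 - 4 = 1 = \frac{n}{λ^2} \\
M_Y(t) &= \left(\frac{λ}{λ - t}\right)^{n}, \quad t < λ \\
M_Y'(t) &= \frac{nλ^n}{(λ - t)^{n+1}}, \qquad E[Y] = M_Y'(0) = \frac{n}{λ} = 2
\end{align*}
```

The MGF of Y is the n-th power of the exponential MGF because the MGF of a sum of independent variables is the product of their MGFs.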
1c-e. Probability. For pdf Z (the exponential), calculate empirically probabilities a through c. Then evaluate through calculus whether the memoryless property holds.
a. P(Z>λ | Z>λ/2) b. P(Z>2λ | Z>λ) c. P(Z>3λ | Z>λ)
\(P(A|B) = \frac{P(A \cap B)}{P(B)}\)
a <- mean(Z > lambda & Z > lambda / 2) / (mean(Z > lambda / 2))
a
## [1] 0.1543478
b <- mean(Z > 2 * lambda & Z > lambda) / (mean(Z > lambda ))
b
## [1] 0.01408451
c <- mean(Z > 3 * lambda & Z > lambda) / (mean(Z > lambda ))
c
## [1] 0
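For the calculus side of the question, the memoryless property gives P(Z > s | Z > t) = P(Z > s - t) = e^{-λ(s - t)}. With λ = 2 the three cases have theoretical values of e^{-2} ≈ 0.135, e^{-4} ≈ 0.018 and e^{-8} ≈ 0.0003. The sketch below sets empirical and theoretical values side by side. The helper names `cond_tail` and `theory` and the `cases` table are introduced here for illustration only.

```r
# Sketch: empirical conditional tail probabilities of Z vs the closed form
# implied by the memoryless property.
set.seed(0)
lambda <- 2
Z <- rexp(10000, rate = lambda)

# Empirical P(Z > s | Z > t) for s > t; {Z > s} already implies {Z > t}
cond_tail <- function(z, s, t) mean(z > s) / mean(z > t)

# Memoryless property: P(Z > s | Z > t) = P(Z > s - t)
theory <- function(s, t) pexp(s - t, rate = lambda, lower.tail = FALSE)

cases <- data.frame(
  s = c(lambda, 2 * lambda, 3 * lambda),
  t = c(lambda / 2, lambda, lambda),
  row.names = c("a", "b", "c")
)
cases$empirical   <- mapply(function(s, t) cond_tail(Z, s, t), cases$s, cases$t)
cases$theoretical <- mapply(theory, cases$s, cases$t)
print(cases)
```

Because events such as Z > 3λ are rare with only 10,000 draws, the empirical value for case c can easily be exactly zero even though the theoretical probability is positive.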
Loosely investigate whether P(YZ) = P(Y) P(Z) by building a table with quartiles and evaluating the marginal and joint probabilities.
quantY <- quantile(Y, probs = c(.25,.5,.75,1))
quantY <- as.matrix(t(quantY))
colnames(quantY) = c("25% Y", "50% Y", "75% Y", "100% Y")
quantZ = quantile(Z, probs = c(.25,.5,.75,1))
quantZ = as.matrix(quantZ)
rownames(quantZ) = c("25% Z", "50% Z", "75% Z", "100% Z")
P_YZ <- quantZ %*% quantY
P_Sum <- sum(P_YZ)
prob_YZ <- matrix(nrow = 4, ncol = 4)
for (row in 1:4) {
  for (col in 1:4) {
    prob_YZ[row,col] = P_YZ[row,col]/P_Sum
  }
}
colnames(prob_YZ) = c("25% Y", "50% Y", "75% Y", "100% Y")
rownames(prob_YZ) = c("25% Z", "50% Z", "75% Z", "100% Z")
prob_YZ
## 25% Y 50% Y 75% Y 100% Y
## 25% Z 0.002357123 0.003480398 0.004869254 0.01399171
## 50% Z 0.005767742 0.008516330 0.011914779 0.03423690
## 75% Z 0.011407022 0.016842981 0.023564187 0.06771126
## 100% Z 0.075904041 0.112075726 0.156799645 0.45056090
RSum <- rowSums(prob_YZ)
CSum <- colSums(prob_YZ)
Result <- cbind(prob_YZ, RSum)
# Append the grand total so the margin row has one entry per column
Result <- rbind(Result, CSum = c(CSum, sum(CSum)))
Result[4,4]
## [1] 0.4505609
Result
## 25% Y 50% Y 75% Y 100% Y RSum
## 25% Z 0.002357123 0.003480398 0.004869254 0.01399171 0.02469849
## 50% Z 0.005767742 0.008516330 0.011914779 0.03423690 0.06043575
## 75% Z 0.011407022 0.016842981 0.023564187 0.06771126 0.11952545
## 100% Z 0.075904041 0.112075726 0.156799645 0.45056090 0.79534032
## CSum 0.095435928 0.140915434 0.197147864 0.56650077 1.00000000
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
FisherTest <- prob_YZ * 10000
fisher.test(FisherTest, simulate.p.value = T)
## Warning in fisher.test(FisherTest, simulate.p.value = T): 'x' has been rounded
## to integer: Mean relative difference: 0.000366621
##
## Fisher's Exact Test for Count Data with simulated p-value (based on
## 2000 replicates)
##
## data: FisherTest
## p-value = 1
## alternative hypothesis: two.sided
chisq.test(FisherTest)
##
## Pearson's Chi-squared test
##
## data: FisherTest
## X-squared = 4.0282e-29, df = 9, p-value = 1
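Both Fisher’s Exact Test and the Chi-Square Test return a p-value of 1, so neither rejects the hypothesis that “Y” and “Z” are independent. This outcome is expected by construction: the table is the product of the two quantile vectors, so its rows are proportional to each other. Fisher’s test computes an exact p-value and suits small cell counts, while the Chi-Square Test relies on a large-sample approximation; with 10,000 observations the Chi-Square Test is the more appropriate choice.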
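An alternative way to build the quartile table is to count how many (Y, Z) pairs fall into each combination of quartile bins, so the joint and marginal probabilities come from the data themselves. The sketch below assumes Y and Z are drawn from one random stream. In the report the seed is reset to 0 before both samples, so Z appears to coincide with the first summand of Y, and a count-based table of those samples would likely show dependence. The helper `quartile_bin` is a name made up for this example.

```r
# Sketch: count-based quartile table for Y and Z, with independence tests.
set.seed(0)
lambda <- 2
a <- 10000
Y <- rowSums(matrix(rexp(a * 4, rate = lambda), nrow = a))
Z <- rexp(a, rate = lambda)  # drawn after Y, so not reused inside Y

# Assign each observation to its quartile bin
quartile_bin <- function(v) {
  cut(v, breaks = quantile(v, probs = seq(0, 1, 0.25)),
      include.lowest = TRUE, labels = c("Q1", "Q2", "Q3", "Q4"))
}

counts <- table(Z = quartile_bin(Z), Y = quartile_bin(Y))
joint  <- prop.table(counts)

# Joint probabilities with row and column margins; under independence each
# cell should be close to 0.25 * 0.25 = 0.0625
print(round(addmargins(joint), 4))

chisq.test(counts)
fisher.test(counts, simulate.p.value = TRUE, B = 2000)
```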
Problem 2.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
train <- read.csv(file = "~/Downloads/train.csv", header=TRUE)
head(train, 10)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## 7 7 20 RL 75 10084 Pave <NA> Reg Lvl
## 8 8 60 RL NA 10382 Pave <NA> IR1 Lvl
## 9 9 50 RM 51 6120 Pave <NA> Reg Lvl
## 10 10 190 RL 50 7420 Pave <NA> Reg Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## 7 AllPub Inside Gtl Somerst Norm Norm 1Fam
## 8 AllPub Corner Gtl NWAmes PosN Norm 1Fam
## 9 AllPub Inside Gtl OldTown Artery Norm 1Fam
## 10 AllPub Corner Gtl BrkSide Artery Artery 2fmCon
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## 7 1Story 8 5 2004 2005 Gable CompShg
## 8 2Story 7 6 1973 1973 Gable CompShg
## 9 1.5Fin 7 5 1931 1950 Gable CompShg
## 10 1.5Unf 5 6 1939 1950 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## 7 VinylSd VinylSd Stone 186 Gd TA PConc
## 8 HdBoard HdBoard Stone 240 TA TA CBlock
## 9 BrkFace Wd Shng None 0 TA TA BrkTil
## 10 MetalSd MetalSd None 0 TA TA BrkTil
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## 7 Ex TA Av GLQ 1369 Unf
## 8 Gd TA Mn ALQ 859 BLQ
## 9 TA TA No Unf 0 Unf
## 10 TA TA No GLQ 851 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## 7 0 317 1686 GasA Ex Y SBrkr
## 8 32 216 1107 GasA Ex Y SBrkr
## 9 0 952 952 GasA Gd Y FuseF
## 10 0 140 991 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## 1 856 854 0 1710 1 0
## 2 1262 0 0 1262 0 1
## 3 920 866 0 1786 1 0
## 4 961 756 0 1717 1 0
## 5 1145 1053 0 2198 1 0
## 6 796 566 0 1362 1 0
## 7 1694 0 0 1694 1 0
## 8 1107 983 0 2090 1 0
## 9 1022 752 0 1774 0 0
## 10 1077 0 0 1077 1 0
## FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 1 2 1 3 1 Gd 8
## 2 2 0 3 1 TA 6
## 3 2 1 3 1 Gd 6
## 4 1 0 3 1 Gd 7
## 5 2 1 4 1 Gd 9
## 6 1 1 1 1 TA 5
## 7 2 0 3 1 Gd 7
## 8 2 1 3 1 TA 7
## 9 2 0 2 2 TA 8
## 10 1 0 2 2 TA 5
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
## 1 Typ 0 <NA> Attchd 2003 RFn
## 2 Typ 1 TA Attchd 1976 RFn
## 3 Typ 1 TA Attchd 2001 RFn
## 4 Typ 1 Gd Detchd 1998 Unf
## 5 Typ 1 TA Attchd 2000 RFn
## 6 Typ 0 <NA> Attchd 1993 Unf
## 7 Typ 1 Gd Attchd 2004 RFn
## 8 Typ 2 TA Attchd 1973 RFn
## 9 Min1 2 TA Detchd 1931 Unf
## 10 Typ 2 TA Attchd 1939 RFn
## GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF
## 1 2 548 TA TA Y 0
## 2 2 460 TA TA Y 298
## 3 2 608 TA TA Y 0
## 4 3 642 TA TA Y 0
## 5 3 836 TA TA Y 192
## 6 2 480 TA TA Y 40
## 7 2 636 TA TA Y 255
## 8 2 484 TA TA Y 235
## 9 2 468 Fa TA Y 90
## 10 1 205 Gd TA Y 0
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence
## 1 61 0 0 0 0 <NA> <NA>
## 2 0 0 0 0 0 <NA> <NA>
## 3 42 0 0 0 0 <NA> <NA>
## 4 35 272 0 0 0 <NA> <NA>
## 5 84 0 0 0 0 <NA> <NA>
## 6 30 0 320 0 0 <NA> MnPrv
## 7 57 0 0 0 0 <NA> <NA>
## 8 204 228 0 0 0 <NA> <NA>
## 9 0 205 0 0 0 <NA> <NA>
## 10 4 0 0 0 0 <NA> <NA>
## MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 <NA> 0 2 2008 WD Normal 208500
## 2 <NA> 0 5 2007 WD Normal 181500
## 3 <NA> 0 9 2008 WD Normal 223500
## 4 <NA> 0 2 2006 WD Abnorml 140000
## 5 <NA> 0 12 2008 WD Normal 250000
## 6 Shed 700 10 2009 WD Normal 143000
## 7 <NA> 0 8 2007 WD Normal 307000
## 8 Shed 350 11 2009 WD Normal 200000
## 9 <NA> 0 4 2008 WD Abnorml 129900
## 10 <NA> 0 1 2008 WD Normal 118000
test <- read.csv(file = "~/Downloads/test.csv")
head(test, 10)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1461 20 RH 80 11622 Pave <NA> Reg
## 2 1462 20 RL 81 14267 Pave <NA> IR1
## 3 1463 60 RL 74 13830 Pave <NA> IR1
## 4 1464 60 RL 78 9978 Pave <NA> IR1
## 5 1465 120 RL 43 5005 Pave <NA> IR1
## 6 1466 60 RL 75 10000 Pave <NA> IR1
## 7 1467 20 RL NA 7980 Pave <NA> IR1
## 8 1468 60 RL 63 8402 Pave <NA> IR1
## 9 1469 20 RL 85 10176 Pave <NA> Reg
## 10 1470 20 RL 70 8400 Pave <NA> Reg
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
## 1 Lvl AllPub Inside Gtl NAmes Feedr Norm
## 2 Lvl AllPub Corner Gtl NAmes Norm Norm
## 3 Lvl AllPub Inside Gtl Gilbert Norm Norm
## 4 Lvl AllPub Inside Gtl Gilbert Norm Norm
## 5 HLS AllPub Inside Gtl StoneBr Norm Norm
## 6 Lvl AllPub Corner Gtl Gilbert Norm Norm
## 7 Lvl AllPub Inside Gtl Gilbert Norm Norm
## 8 Lvl AllPub Inside Gtl Gilbert Norm Norm
## 9 Lvl AllPub Inside Gtl Gilbert Norm Norm
## 10 Lvl AllPub Corner Gtl NAmes Norm Norm
## BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
## 1 1Fam 1Story 5 6 1961 1961 Gable
## 2 1Fam 1Story 6 6 1958 1958 Hip
## 3 1Fam 2Story 5 5 1997 1998 Gable
## 4 1Fam 2Story 6 6 1998 1998 Gable
## 5 TwnhsE 1Story 8 5 1992 1992 Gable
## 6 1Fam 2Story 6 5 1993 1994 Gable
## 7 1Fam 1Story 6 7 1992 2007 Gable
## 8 1Fam 2Story 6 5 1998 1998 Gable
## 9 1Fam 1Story 7 5 1990 1990 Gable
## 10 1Fam 1Story 4 5 1970 1970 Gable
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
## 1 CompShg VinylSd VinylSd None 0 TA TA
## 2 CompShg Wd Sdng Wd Sdng BrkFace 108 TA TA
## 3 CompShg VinylSd VinylSd None 0 TA TA
## 4 CompShg VinylSd VinylSd BrkFace 20 TA TA
## 5 CompShg HdBoard HdBoard None 0 Gd TA
## 6 CompShg HdBoard HdBoard None 0 TA TA
## 7 CompShg HdBoard HdBoard None 0 TA Gd
## 8 CompShg VinylSd VinylSd None 0 TA TA
## 9 CompShg HdBoard HdBoard None 0 TA TA
## 10 CompShg Plywood Plywood None 0 TA TA
## Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 1 CBlock TA TA No Rec 468
## 2 CBlock TA TA No ALQ 923
## 3 PConc Gd TA No GLQ 791
## 4 PConc TA TA No GLQ 602
## 5 PConc Gd TA No ALQ 263
## 6 PConc Gd TA No Unf 0
## 7 PConc Gd TA No ALQ 935
## 8 PConc Gd TA No Unf 0
## 9 PConc Gd TA Gd GLQ 637
## 10 CBlock TA TA No ALQ 804
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## 1 LwQ 144 270 882 GasA TA Y
## 2 Unf 0 406 1329 GasA TA Y
## 3 Unf 0 137 928 GasA Gd Y
## 4 Unf 0 324 926 GasA Ex Y
## 5 Unf 0 1017 1280 GasA Ex Y
## 6 Unf 0 763 763 GasA Gd Y
## 7 Unf 0 233 1168 GasA Ex Y
## 8 Unf 0 789 789 GasA Gd Y
## 9 Unf 0 663 1300 GasA Gd Y
## 10 Rec 78 0 882 GasA TA Y
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## 1 SBrkr 896 0 0 896 0
## 2 SBrkr 1329 0 0 1329 0
## 3 SBrkr 928 701 0 1629 0
## 4 SBrkr 926 678 0 1604 0
## 5 SBrkr 1280 0 0 1280 0
## 6 SBrkr 763 892 0 1655 0
## 7 SBrkr 1187 0 0 1187 1
## 8 SBrkr 789 676 0 1465 0
## 9 SBrkr 1341 0 0 1341 1
## 10 SBrkr 882 0 0 882 1
## BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## 1 0 1 0 2 1 TA
## 2 0 1 1 3 1 Gd
## 3 0 2 1 3 1 TA
## 4 0 2 1 3 1 Gd
## 5 0 2 0 2 1 Gd
## 6 0 2 1 3 1 TA
## 7 0 2 0 3 1 TA
## 8 0 2 1 3 1 TA
## 9 0 1 1 2 1 Gd
## 10 0 1 0 2 1 TA
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 1 5 Typ 0 <NA> Attchd 1961
## 2 6 Typ 0 <NA> Attchd 1958
## 3 6 Typ 1 TA Attchd 1997
## 4 7 Typ 1 Gd Attchd 1998
## 5 5 Typ 0 <NA> Attchd 1992
## 6 7 Typ 1 TA Attchd 1993
## 7 6 Typ 0 <NA> Attchd 1992
## 8 7 Typ 1 Gd Attchd 1998
## 9 5 Typ 1 Po Attchd 1990
## 10 4 Typ 0 <NA> Attchd 1970
## GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
## 1 Unf 1 730 TA TA Y
## 2 Unf 1 312 TA TA Y
## 3 Fin 2 482 TA TA Y
## 4 Fin 2 470 TA TA Y
## 5 RFn 2 506 TA TA Y
## 6 Fin 2 440 TA TA Y
## 7 Fin 2 420 TA TA Y
## 8 Fin 2 393 TA TA Y
## 9 Unf 2 506 TA TA Y
## 10 Fin 2 525 TA TA Y
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1 140 0 0 0 120 0 <NA>
## 2 393 36 0 0 0 0 <NA>
## 3 212 34 0 0 0 0 <NA>
## 4 360 36 0 0 0 0 <NA>
## 5 0 82 0 0 144 0 <NA>
## 6 157 84 0 0 0 0 <NA>
## 7 483 21 0 0 0 0 <NA>
## 8 0 75 0 0 0 0 <NA>
## 9 192 0 0 0 0 0 <NA>
## 10 240 0 0 0 0 0 <NA>
## Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
## 1 MnPrv <NA> 0 6 2010 WD Normal
## 2 <NA> Gar2 12500 6 2010 WD Normal
## 3 MnPrv <NA> 0 3 2010 WD Normal
## 4 <NA> <NA> 0 6 2010 WD Normal
## 5 <NA> <NA> 0 1 2010 WD Normal
## 6 <NA> <NA> 0 4 2010 WD Normal
## 7 GdPrv Shed 500 3 2010 WD Normal
## 8 <NA> <NA> 0 5 2010 WD Normal
## 9 <NA> <NA> 0 2 2010 WD Normal
## 10 MnPrv <NA> 0 4 2010 WD Normal
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
Univariate descriptive statistics describe a single variable in a dataset. We use summary() and sd() to obtain the quartiles, mean and standard deviation of the SalePrice variable.
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
sd(train$SalePrice)
## [1] 79442.5
hist(train$SalePrice, col = 'orange')
dotchart(train$SalePrice, col = 'green')
We use a density plot to describe the distribution of SalePrice.
plot(density(train$SalePrice))
Next we plot log “SalePrice” and compare the charts with the originals in the “Histogram and DotChart” section to see whether the log transform brings “SalePrice” closer to a normal distribution.
hist(log(train$SalePrice))
dotchart(log(train$SalePrice))
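The prompt asks for a scatterplot matrix of the dependent variable and at least two independent variables. A minimal sketch with pairs() follows. It reads the training file from the same path used earlier in the report and, as an assumption for illustration only, falls back to a small synthetic stand-in when that file is not available.

```r
# Sketch: scatterplot matrix of SalePrice against OverallQual and OverallCond.
path <- "~/Downloads/train.csv"  # same location the report reads from

if (file.exists(path)) {
  train <- read.csv(path, header = TRUE)
} else {
  # Hypothetical stand-in data so the sketch runs without the Kaggle file
  set.seed(1)
  qual <- sample(1:10, 200, replace = TRUE)
  train <- data.frame(
    OverallQual = qual,
    OverallCond = sample(1:9, 200, replace = TRUE),
    SalePrice   = 50000 + 40000 * qual + rnorm(200, mean = 0, sd = 20000)
  )
}

plot_vars <- c("SalePrice", "OverallQual", "OverallCond")
pairs(train[plot_vars], pch = 20, col = "steelblue",
      main = "Scatterplot matrix")
```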
Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Correlation
A correlation matrix is a numerical counterpart of a scatterplot matrix: it shows the pairwise correlations between several variables. The first matrix below covers five variables because it is instructive; we then focus on the two variables most and least correlated with SalePrice: OverallQual and OverallCond.
cor(train[c("SalePrice", "LotArea", "OverallQual", "OverallCond", "YearBuilt")])
## SalePrice LotArea OverallQual OverallCond YearBuilt
## SalePrice 1.00000000 0.26384335 0.79098160 -0.07785589 0.52289733
## LotArea 0.26384335 1.00000000 0.10580574 -0.00563627 0.01422765
## OverallQual 0.79098160 0.10580574 1.00000000 -0.09193234 0.57232277
## OverallCond -0.07785589 -0.00563627 -0.09193234 1.00000000 -0.37598320
## YearBuilt 0.52289733 0.01422765 0.57232277 -0.37598320 1.00000000
CorrelationMatrix <- cor(train[c("SalePrice", "OverallQual", "OverallCond")])
CorrelationMatrix
## SalePrice OverallQual OverallCond
## SalePrice 1.00000000 0.79098160 -0.07785589
## OverallQual 0.79098160 1.00000000 -0.09193234
## OverallCond -0.07785589 -0.09193234 1.00000000
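For each pair we test the null hypothesis that the true correlation is zero. If the p-value is less than 5% we reject the hypothesis that the tested correlation is zero. If the p-value is higher than 5%, we can’t reject it. Note that cor.test() takes the confidence level through its conf.level argument; the confidence = 0.80 argument used below is not recognized, so the output reports the default 95% interval rather than the requested 80% interval.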
cor.test(train$SalePrice, train$OverallQual, confidence = 0.80)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7709644 0.8094376
## sample estimates:
## cor
## 0.7909816
cor.test(train$SalePrice, train$OverallCond, confidence = 0.80)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$OverallCond
## t = -2.9819, df = 1458, p-value = 0.002912
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.12864437 -0.02666008
## sample estimates:
## cor
## -0.07785589
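The cor.test() output above reports 95% intervals. The sketch below shows how the 80% intervals could be recovered from the reported correlations and sample size (df = 1458, so n = 1460) using the Fisher z-transform. It also applies a Bonferroni adjustment as one way to address familywise error. The helper `fisher_ci` is a name made up for this example, and the adjustment assumes three pairwise tests in total even though only two are run above.

```r
# Sketch: 80% confidence intervals via the Fisher z-transform, plus a
# Bonferroni adjustment of the reported p-values.
r <- c(SalePrice_OverallQual = 0.7909816,
       SalePrice_OverallCond = -0.07785589)
n <- 1460

fisher_ci <- function(r, n, conf.level = 0.80) {
  z  <- atanh(r)                        # Fisher z-transform
  se <- 1 / sqrt(n - 3)                 # approximate standard error of z
  q  <- qnorm(1 - (1 - conf.level) / 2)
  tanh(cbind(lower = z - q * se, upper = z + q * se))
}

ci80 <- fisher_ci(r, n, conf.level = 0.80)
ci95 <- fisher_ci(r, n, conf.level = 0.95)  # should match the printed output
print(ci80)
print(ci95)

# Reported p-values; the first is only an upper bound (p < 2.2e-16)
p_raw <- c(2.2e-16, 0.002912)
p_adj <- p.adjust(p_raw, method = "bonferroni", n = 3)
print(p_adj)
```

With the data at hand, calling cor.test(train$SalePrice, train$OverallQual, conf.level = 0.80) would give the 80% interval directly. Both adjusted p-values stay below 0.05, so the familywise error rate is not a practical concern for these two tests.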
Precision Matrix
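The precision matrix is the inverse of the correlation matrix, obtained here with solve(). Multiplying the correlation matrix by the precision matrix returns the identity matrix, up to floating-point error.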
PrecisionMatrix <- solve(CorrelationMatrix)
PrecisionMatrix
## SalePrice OverallQual OverallCond
## SalePrice 2.67150050 -2.11183483 0.01384614
## OverallQual -2.11183483 2.67793985 0.08177049
## OverallCond 0.01384614 0.08177049 1.00859536
CorrelationMatrix %*% PrecisionMatrix
## SalePrice OverallQual OverallCond
## SalePrice 1.000000e+00 7.112366e-17 0
## OverallQual -1.994932e-17 1.000000e+00 0
## OverallCond -1.734723e-18 2.775558e-17 1
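The prompt asks for the product in both orders. The sketch below rebuilds the correlation matrix from the values printed above, so it runs without the data file, and checks that both products are the identity. It also reads the variance inflation factors off the diagonal of the precision matrix. The short names `C`, `P` and `vif` are introduced for this illustration.

```r
# Sketch: correlation matrix times precision matrix in both orders.
vars <- c("SalePrice", "OverallQual", "OverallCond")
C <- matrix(c( 1.00000000,  0.79098160, -0.07785589,
               0.79098160,  1.00000000, -0.09193234,
              -0.07785589, -0.09193234,  1.00000000),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

P <- solve(C)  # precision matrix

# Both products should equal the 3x3 identity up to floating-point error
left  <- C %*% P
right <- P %*% C
print(all.equal(left,  diag(3), check.attributes = FALSE))
print(all.equal(right, diag(3), check.attributes = FALSE))

# Diagonal of the precision matrix: variance inflation factors,
# VIF_j = 1 / (1 - R_j^2) for variable j regressed on the other two
vif <- diag(P)
print(round(vif, 4))
```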
LU decomposition factors an n×n matrix into two triangular matrices: a (L)ower triangular matrix and an (U)pper triangular matrix. The lower triangular matrix has zeros above the diagonal and the upper triangular matrix has zeros below it, such that multiplying lower by upper returns the original n×n matrix.
library(pracma)
LU <- lu(PrecisionMatrix)
LU$L
## SalePrice OverallQual OverallCond
## SalePrice 1.000000000 0.00000000 0
## OverallQual -0.790505124 1.00000000 0
## OverallCond 0.005182906 0.09193234 1
LU$U
## SalePrice OverallQual OverallCond
## SalePrice 2.671501 -2.111835 0.01384614
## OverallQual 0.000000 1.008524 0.09271594
## OverallCond 0.000000 0.000000 1.00000000
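To make the factorization concrete, here is a base-R sketch of a Doolittle-style LU decomposition without pivoting. It is applied to the precision matrix values printed above. It is a hand-written illustration, not pracma's implementation, and the function name `lu_doolittle` is made up for this example.

```r
# Sketch: Doolittle LU decomposition (unit diagonal in L, no pivoting).
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- matrix(0, n, n)
  for (i in seq_len(n)) {
    k <- seq_len(i - 1)  # indices of the pivots already processed
    # Row i of U
    for (j in i:n) {
      U[i, j] <- A[i, j] - sum(L[i, k] * U[k, j])
    }
    # Column i of L, below the diagonal
    if (i < n) {
      for (j in (i + 1):n) {
        L[j, i] <- (A[j, i] - sum(L[j, k] * U[k, i])) / U[i, i]
      }
    }
  }
  list(L = L, U = U)
}

# Precision matrix as printed in the report
P <- matrix(c( 2.67150050, -2.11183483, 0.01384614,
              -2.11183483,  2.67793985, 0.08177049,
               0.01384614,  0.08177049, 1.00859536),
            nrow = 3, byrow = TRUE)

fac <- lu_doolittle(P)
print(fac$L)
print(fac$U)

# Multiplying the factors should reproduce the original matrix
print(all.equal(fac$L %*% fac$U, P))
```

Skipping pivoting keeps the sketch short but makes it fail when a zero pivot appears on the diagonal of U. That does not happen for this matrix.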
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
library(moments)
skewness(train$SalePrice)
## [1] 1.880941
skewness(train$OverallQual)
## [1] 0.216721
skewness(train$OverallCond)
## [1] 0.6923552
skewness(train$LotArea)
## [1] 12.19514
skewness(train$YearBuilt)
## [1] -0.6128307
range(train$SalePrice)
## [1] 34900 755000
The lowest Sale Price is $34,900, so the variable is already strictly positive and no shift is needed before fitting.
library(MASS)
ExpProbability <- fitdistr(train$SalePrice, "exponential")
ExpProbability
## rate
## 5.527268e-06
## (1.446552e-07)
lambda <- ExpProbability$estimate
set.seed(1000)
GenerationData <- rexp(1000, rate = lambda)
lambda
## rate
## 5.527268e-06
hist(train$SalePrice, main = "Original Data", col = "lightgreen")
hist(GenerationData, main = "Generated Data", col = "lightblue")
5th and 95th Percentiles
qexp(0.05, rate = lambda)
## [1] 9280.044
qexp(0.95, rate = lambda)
## [1] 541991.5
95% Confidence Interval (assuming normality; the call below returns the 95th percentile of the fitted normal distribution)
qnorm(0.95, mean(train$SalePrice), sd(train$SalePrice))
## [1] 311592.5
5th & 95th Empirical
quantile(train$SalePrice, 0.05, names = FALSE)
## [1] 88000
quantile(train$SalePrice, 0.95, names=FALSE)
## [1] 326100
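The fitted exponential puts its 5th percentile (about 9,280) far below the empirical one (88,000) and its 95th percentile (about 541,992) far above the empirical one (326,100). This suggests the exponential is a poor fit for SalePrice. The sketch below reproduces the exponential percentiles from the inverse CDF, \(x_p = -\ln(1 - p)/λ\), and adds a normal-theory 95% confidence interval for the mean. It relies only on the summary figures printed above (rate, mean, standard deviation, n = 1460). The names `exp_quantile`, `ci_mean` and `normal_range` are introduced here for illustration.

```r
# Sketch: exponential percentiles from the inverse CDF, plus normal-theory
# intervals built from the summary statistics reported above.
rate <- 5.527268e-06  # fitted rate from fitdistr

# Inverse of the exponential CDF F(x) = 1 - exp(-rate * x)
exp_quantile <- function(p, rate) -log(1 - p) / rate
exp_pct <- exp_quantile(c(0.05, 0.95), rate)
print(exp_pct)  # agrees with qexp(c(0.05, 0.95), rate)

m <- 180921    # sample mean of SalePrice (rounded, from summary())
s <- 79442.5   # sample standard deviation of SalePrice
n <- 1460      # number of training observations

# 95% confidence interval for the mean, assuming normality
ci_mean <- m + c(-1, 1) * qnorm(0.975) * s / sqrt(n)
print(ci_mean)

# 5th and 95th percentiles of a normal with the same mean and sd
normal_range <- qnorm(c(0.05, 0.95), mean = m, sd = s)
print(normal_range)
```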
Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.
We are building a model for estimating SalePrice based on the OverallCond, OverallQual, LotArea and YearBuilt of the houses. This model is a multiple regression model with SalePrice as the response variable and OverallCond, OverallQual, LotArea and YearBuilt as the predictor variables.
Model <- lm(SalePrice ~ OverallCond + OverallQual + LotArea + YearBuilt, data = train)
summary(Model)
##
## Call:
## lm(formula = SalePrice ~ OverallCond + OverallQual + LotArea +
## YearBuilt, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -268634 -26234 -3667 20004 393023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.984e+05 1.032e+05 -7.733 1.94e-14 ***
## OverallCond 2.736e+03 1.178e+03 2.323 0.0203 *
## OverallQual 4.003e+04 1.079e+03 37.107 < 2e-16 ***
## LotArea 1.500e+00 1.210e-01 12.397 < 2e-16 ***
## YearBuilt 3.572e+02 5.279e+01 6.767 1.90e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45770 on 1455 degrees of freedom
## Multiple R-squared: 0.6689, Adjusted R-squared: 0.668
## F-statistic: 735 on 4 and 1455 DF, p-value: < 2.2e-16
summary(Model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.984136e+05 1.032419e+05 -7.733428 1.942605e-14
## OverallCond 2.736366e+03 1.177978e+03 2.322934 2.031999e-02
## OverallQual 4.002845e+04 1.078727e+03 37.107124 1.170950e-212
## LotArea 1.499481e+00 1.209568e-01 12.396829 1.269000e-33
## YearBuilt 3.572131e+02 5.278759e+01 6.766990 1.900296e-11
All four predictors are significant at the 5% level and the model explains about 67% of the variance in SalePrice (adjusted R-squared 0.668). A second model regresses the square root of SalePrice on the same predictors.
ModelTransform = lm(SalePrice^(1/2) ~ OverallCond + OverallQual + LotArea + YearBuilt, data = train)
summary(ModelTransform)
##
## Call:
## lm(formula = SalePrice^(1/2) ~ OverallCond + OverallQual + LotArea +
## YearBuilt, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -287.780 -27.326 -2.491 24.503 260.467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.798e+02 1.025e+02 -9.559 < 2e-16 ***
## OverallCond 5.835e+00 1.170e+00 4.989 6.78e-07 ***
## OverallQual 4.294e+01 1.071e+00 40.093 < 2e-16 ***
## LotArea 1.627e-03 1.201e-04 13.546 < 2e-16 ***
## YearBuilt 5.503e-01 5.241e-02 10.501 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.44 on 1455 degrees of freedom
## Multiple R-squared: 0.72, Adjusted R-squared: 0.7193
## F-statistic: 935.5 on 4 and 1455 DF, p-value: < 2.2e-16
summary(ModelTransform)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.797832e+02 1.025006e+02 -9.558808 4.866473e-21
## OverallCond 5.835237e+00 1.169520e+00 4.989429 6.784580e-07
## OverallQual 4.293847e+01 1.070981e+00 40.092644 2.142094e-237
## LotArea 1.626667e-03 1.200883e-04 13.545592 1.861360e-39
## YearBuilt 5.503384e-01 5.240855e-02 10.500926 6.484907e-25
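To turn the transformed model into a Kaggle submission, the predictions have to be squared to return to the SalePrice scale. The sketch below outlines that step. It assumes the competition expects a CSV with columns Id and SalePrice and that the four predictors have no missing values in the test set. It also falls back to synthetic stand-in data when the Kaggle files are not found. The output file name and the `fake_houses` helper are hypothetical.

```r
# Sketch: predict SalePrice for the test set and write a submission file.
train_path <- "~/Downloads/train.csv"
test_path  <- "~/Downloads/test.csv"

if (file.exists(train_path) && file.exists(test_path)) {
  train <- read.csv(train_path)
  test  <- read.csv(test_path)
} else {
  # Hypothetical stand-in data so the sketch runs without the Kaggle files
  set.seed(1)
  fake_houses <- function(ids) {
    k <- length(ids)
    data.frame(
      Id          = ids,
      OverallCond = sample(1:9, k, replace = TRUE),
      OverallQual = sample(1:10, k, replace = TRUE),
      LotArea     = round(runif(k, 1300, 20000)),
      YearBuilt   = sample(1872:2010, k, replace = TRUE)
    )
  }
  train <- fake_houses(1:300)
  train$SalePrice <- (150 + 40 * train$OverallQual + rnorm(300, sd = 20))^2
  test <- fake_houses(301:400)
}

# Same specification as ModelTransform: square root of SalePrice as response
fit <- lm(sqrt(SalePrice) ~ OverallCond + OverallQual + LotArea + YearBuilt,
          data = train)

# Predict on the square-root scale, then square to get back to dollars
pred_sqrt  <- predict(fit, newdata = test)
submission <- data.frame(Id = test$Id, SalePrice = pred_sqrt^2)

out_path <- "submission.csv"  # hypothetical output file name
write.csv(submission, out_path, row.names = FALSE)
head(submission)
```

Squaring the prediction recovers a value on the original scale, but it is not an unbiased estimate of the mean SalePrice. For a leaderboard score this is usually acceptable.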