Load required R Packages
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## Warning: package 'corrplot' was built under R version 3.4.2
Load train.csv as the training data set and check that the data loaded correctly.
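A quick way to confirm a load is to check dimensions, column types, and missing-value counts. The sketch below is only an illustration: it writes a tiny stand-in file to a temporary path (an assumption, so that it runs without train.csv) and reads it back. The name `train_demo` is hypothetical; with the real data the path would simply be train.csv.

```r
# Stand-in for train.csv: a few columns from the first rows of the data set.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Id,LotArea,GrLivArea,Alley,SalePrice",
  "1,8450,1710,NA,208500",
  "2,9600,1262,NA,181500",
  "3,11250,1786,NA,223500"
), csv_path)

# stringsAsFactors = TRUE mimics the R 3.x default, which turns text
# columns into factors (R >= 4.0 no longer does this by default).
train_demo <- read.csv(csv_path, stringsAsFactors = TRUE)

# Basic sanity checks on what was loaded.
dim_ok <- all(dim(train_demo) == c(3, 5))   # expected rows and columns
na_counts <- colSums(is.na(train_demo))     # missing values per column
str(train_demo)
print(na_counts)
```

Checking `na_counts` early is useful here because several columns (for example Alley) are mostly missing.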
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
Set Up:
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Pick the dependent variable and define it as Y.
For this project, I have chosen train$GrLivArea (above-ground living area) as my quantitative independent variable.
Check the skewness of train$GrLivArea to confirm that it is skewed to the right.
## [1] 1.365156
The positive value indicates that the variable is skewed to the right.
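For reference, a moment-based skewness can be computed in base R as follows. This is a minimal sketch on made-up vectors, not the function used in this report; packaged implementations may apply small-sample corrections, so their values can differ slightly. The helper name `skewness_moment` is hypothetical.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skewness_moment <- function(v) {
  v <- v[!is.na(v)]
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)  # long tail to the right
symmetric    <- c(-2, -1, 0, 1, 2)          # perfectly symmetric

skew_right <- skewness_moment(right_skewed)
skew_sym   <- skewness_moment(symmetric)
print(c(right = skew_right, symmetric = skew_sym))
```

A positive result corresponds to a right tail, and a value near zero to a symmetric distribution.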
My dependent variable for this project is train$SalePrice.
Summary of train$GrLivArea (X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
Summary for train$SalePrice (Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
Probability
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
Define \(x\) and \(y\)
First quartile of Above Ground Living Space: \(x = 1129.5\)
## 25%
## 1129.5
First quartile of Sale Price \(y = 129975\)
## 25%
## 129975
# P(X > x)
X_g_x <- subset(train, GrLivArea > x)
# P(X <= x)
X_le_x <- subset(train, GrLivArea <= x)
# P(Y > y)
Y_g_y <- subset(train, SalePrice > y)
# P(Y <= y)
Y_le_y <- subset(train, SalePrice <= y)
Calculating the joint/conditional probability assuming that the events are dependent.
- \(P(X > x | Y > y)\)
This is probability that the Above Ground Living Space (X) is greater than the 1st quartile of Above Ground Living Space given that the Sale Price (Y) is greater than the 1st quartile of Sale Price.
\(P(X > x | Y > y)\) is the probability of \(X > x\) given \(Y > y\). This is the area represented by the intersection of both events, divided by the total area of the given event.
Calculate the number of rows where \(Y > y\)
## [1] 1095
Calculate the total number of rows for train
## [1] 1460
Probability of \(P(X > x | Y > y)\)
## [1] 0.8712329
Given that a home's sale price is above the first quartile of Sale Price, the probability that its above-ground living space is above the first quartile of living space is about 87.12% (954 of the 1095 homes with \(Y > y\)).
- \(P(X > x, Y > y)\)
The probability that the Above Ground Living Space (X) is greater than the 1st quartile for Above Ground Living Space and the Sale Price (Y) is greater than the 1st quartile for Sale Price.
\(P(X > x, Y > y)\) is the number of rows where both \(X > x\) and \(Y > y\), divided by the total number of rows.
## [1] 0.6534247
The probability is about 65.34%.
- \(P(X < x | Y > y)\)
The probability that the Above Ground Living Space (X) is less than the 1st quartile for Above Ground Living Space given that the Sale Price (Y) is greater than the 1st quartile for Sale Price.
\(P(X < x | Y > y)\) is the number of rows where \(X < x\) and \(Y > y\), divided by the number of rows where \(Y > y\).
## [1] 0.1287671
The probability is about 12.88 percent.
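The three probabilities can be reproduced directly from the four cell counts reported in this document's table of counts. The sketch below hard-codes those counts rather than reading train.csv; the variable names are hypothetical.

```r
# Cell counts as reported in the table of counts.
n_low_low   <- 224  # X <= x and Y <= y
n_low_high  <- 141  # X <= x and Y >  y
n_high_low  <- 141  # X >  x and Y <= y
n_high_high <- 954  # X >  x and Y >  y

n_total  <- n_low_low + n_low_high + n_high_low + n_high_high  # 1460
n_y_high <- n_low_high + n_high_high                           # 1095

# Conditional probabilities divide by the count of the conditioning event;
# the joint probability divides by the total number of rows.
p_x_high_given_y_high <- n_high_high / n_y_high  # P(X > x | Y > y)
p_x_high_and_y_high   <- n_high_high / n_total   # P(X > x, Y > y)
p_x_low_given_y_high  <- n_low_high  / n_y_high  # P(X < x | Y > y)

print(c(p_x_high_given_y_high, p_x_high_and_y_high, p_x_low_given_y_high))
```

Note that the two conditional probabilities share the same conditioning event, so they sum to 1.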
Table of Counts
a <- sum(train$GrLivArea <= x & train$SalePrice <= y)
b <- sum(train$GrLivArea <= x & train$SalePrice > y)
c <- sum(train$GrLivArea > x & train$SalePrice <= y)
d <- sum(train$GrLivArea > x & train$SalePrice > y)
## <=1st quartile >1st quartile Total
## <=1st quartile 224 141 365
## >1st quartile 141 954 1095
## Total 365 1095 1460
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
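\(P(A) = 1095/1460 = 0.75\) and \(P(B) = 1095/1460 = 0.75\)
\(P(A)P(B) = 0.75 \times 0.75 = 0.5625\)
\(P(AB) = 954/1460 \approx 0.6534\)
Since \(P(AB) \neq P(A)P(B)\), A and B are not independent; splitting the data at the quartiles does not make the variables independent.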
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: matrix(c(a, b, c, d), ncol = 2)
## X-squared = 340.75, df = 1, p-value < 2.2e-16
Because the p-value (< 2.2e-16) is below the 0.05 significance level, we reject the null hypothesis that GrLivArea is independent of SalePrice.
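Both the arithmetic check and the Chi Square test can be reproduced from the reported counts alone. The sketch below assumes the same cell layout as `matrix(c(a, b, c, d), ncol = 2)`, so columns index X and rows index Y; the object names are hypothetical.

```r
# Counts filled column-wise: column 1 is X <= x, column 2 is X > x;
# row 1 is Y <= y, row 2 is Y > y.
counts <- matrix(c(224, 141, 141, 954), ncol = 2)
n <- sum(counts)

p_A  <- sum(counts[, 2]) / n   # P(A): X above its 1st quartile
p_B  <- sum(counts[2, ]) / n   # P(B): Y above its 1st quartile
p_AB <- counts[2, 2] / n       # P(AB): both above

# Under independence, P(AB) would equal P(A) * P(B).
print(c(product = p_A * p_B, joint = p_AB))

# For a 2 x 2 table chisq.test applies Yates' continuity correction by default.
chi <- chisq.test(counts)
print(chi)
```

The gap between the product (0.5625) and the joint probability (about 0.6534) is what the test statistic summarizes.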
Descriptive and Inferential Statistics
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y. Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 92% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Scatterplot
ggplot(train, aes(x = GrLivArea, y = SalePrice)) +
  geom_point() +
  geom_smooth(method = lm)
Histograms
Q-Q Plots
Above Ground Living Space
Sale Price
Box Plot
Above Ground Living Space
Sale Price
boxplot(train$SalePrice)
Density Plot
ggplot(data = train, aes(x = GrLivArea)) +
  geom_density(alpha = .2, fill = "gold1") +
  ggtitle("Above Ground Living Space in SQFT")
ggplot(data = train, aes(x = SalePrice)) +
  geom_density(alpha = .2, fill = "gold1") +
  ggtitle("Sale Price")
## vars n mean sd median trimmed mad
## MSSubClass 1 1460 56.90 42.30 50.0 49.15 44.48
## LotFrontage 2 1201 70.05 24.28 69.0 68.94 16.31
## LotArea 3 1460 10516.83 9981.26 9478.5 9563.28 2962.23
## OverallQual 4 1460 6.10 1.38 6.0 6.08 1.48
## OverallCond 5 1460 5.58 1.11 5.0 5.48 0.00
## YearBuilt 6 1460 1971.27 30.20 1973.0 1974.13 37.06
## YearRemodAdd 7 1460 1984.87 20.65 1994.0 1986.37 19.27
## MasVnrArea 8 1452 103.69 181.07 0.0 63.15 0.00
## BsmtFinSF1 9 1460 443.64 456.10 383.5 386.08 568.58
## BsmtFinSF2 10 1460 46.55 161.32 0.0 1.38 0.00
## BsmtUnfSF 11 1460 567.24 441.87 477.5 519.29 426.99
## TotalBsmtSF 12 1460 1057.43 438.71 991.5 1036.70 347.67
## X1stFlrSF 13 1460 1162.63 386.59 1087.0 1129.99 347.67
## X2ndFlrSF 14 1460 346.99 436.53 0.0 285.36 0.00
## LowQualFinSF 15 1460 5.84 48.62 0.0 0.00 0.00
## GrLivArea 16 1460 1515.46 525.48 1464.0 1467.67 483.33
## BsmtFullBath 17 1460 0.43 0.52 0.0 0.39 0.00
## BsmtHalfBath 18 1460 0.06 0.24 0.0 0.00 0.00
## FullBath 19 1460 1.57 0.55 2.0 1.56 0.00
## HalfBath 20 1460 0.38 0.50 0.0 0.34 0.00
## BedroomAbvGr 21 1460 2.87 0.82 3.0 2.85 0.00
## KitchenAbvGr 22 1460 1.05 0.22 1.0 1.00 0.00
## TotRmsAbvGrd 23 1460 6.52 1.63 6.0 6.41 1.48
## Fireplaces 24 1460 0.61 0.64 1.0 0.53 1.48
## GarageYrBlt 25 1379 1978.51 24.69 1980.0 1981.07 31.13
## GarageCars 26 1460 1.77 0.75 2.0 1.77 0.00
## GarageArea 27 1460 472.98 213.80 480.0 469.81 177.91
## WoodDeckSF 28 1460 94.24 125.34 0.0 71.76 0.00
## OpenPorchSF 29 1460 46.66 66.26 25.0 33.23 37.06
## EnclosedPorch 30 1460 21.95 61.12 0.0 3.87 0.00
## X3SsnPorch 31 1460 3.41 29.32 0.0 0.00 0.00
## ScreenPorch 32 1460 15.06 55.76 0.0 0.00 0.00
## PoolArea 33 1460 2.76 40.18 0.0 0.00 0.00
## MiscVal 34 1460 43.49 496.12 0.0 0.00 0.00
## MoSold 35 1460 6.32 2.70 6.0 6.25 2.97
## YrSold 36 1460 2007.82 1.33 2008.0 2007.77 1.48
## SalePrice 37 1460 180921.20 79442.50 163000.0 170783.29 56338.80
## min max range skew kurtosis se
## MSSubClass 20 190 170 1.40 1.56 1.11
## LotFrontage 21 313 292 2.16 17.34 0.70
## LotArea 1300 215245 213945 12.18 202.26 261.22
## OverallQual 1 10 9 0.22 0.09 0.04
## OverallCond 1 9 8 0.69 1.09 0.03
## YearBuilt 1872 2010 138 -0.61 -0.45 0.79
## YearRemodAdd 1950 2010 60 -0.50 -1.27 0.54
## MasVnrArea 0 1600 1600 2.66 10.03 4.75
## BsmtFinSF1 0 5644 5644 1.68 11.06 11.94
## BsmtFinSF2 0 1474 1474 4.25 20.01 4.22
## BsmtUnfSF 0 2336 2336 0.92 0.46 11.56
## TotalBsmtSF 0 6110 6110 1.52 13.18 11.48
## X1stFlrSF 334 4692 4358 1.37 5.71 10.12
## X2ndFlrSF 0 2065 2065 0.81 -0.56 11.42
## LowQualFinSF 0 572 572 8.99 82.83 1.27
## GrLivArea 334 5642 5308 1.36 4.86 13.75
## BsmtFullBath 0 3 3 0.59 -0.84 0.01
## BsmtHalfBath 0 2 2 4.09 16.31 0.01
## FullBath 0 3 3 0.04 -0.86 0.01
## HalfBath 0 2 2 0.67 -1.08 0.01
## BedroomAbvGr 0 8 8 0.21 2.21 0.02
## KitchenAbvGr 0 3 3 4.48 21.42 0.01
## TotRmsAbvGrd 2 14 12 0.67 0.87 0.04
## Fireplaces 0 3 3 0.65 -0.22 0.02
## GarageYrBlt 1900 2010 110 -0.65 -0.42 0.66
## GarageCars 0 4 4 -0.34 0.21 0.02
## GarageArea 0 1418 1418 0.18 0.90 5.60
## WoodDeckSF 0 857 857 1.54 2.97 3.28
## OpenPorchSF 0 547 547 2.36 8.44 1.73
## EnclosedPorch 0 552 552 3.08 10.37 1.60
## X3SsnPorch 0 508 508 10.28 123.06 0.77
## ScreenPorch 0 480 480 4.11 18.34 1.46
## PoolArea 0 738 738 14.80 222.19 1.05
## MiscVal 0 15500 15500 24.43 697.64 12.98
## MoSold 1 12 11 0.21 -0.41 0.07
## YrSold 2006 2010 4 0.10 -1.19 0.03
## SalePrice 34900 755000 720100 1.88 6.50 2079.11
There are 37 numeric variables in the train data set.
Histograms of the 36 numeric variables
## MSZoning Street Alley LotShape LandContour
## C (all): 10 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63
## FV : 65 Pave:1454 Pave: 41 IR2: 41 HLS: 50
## RH : 16 NA's:1369 IR3: 10 Low: 36
## RL :1151 Reg:925 Lvl:1311
## RM : 218
##
##
## Utilities LotConfig LandSlope Neighborhood Condition1
## AllPub:1459 Corner : 263 Gtl:1382 NAmes :225 Norm :1260
## NoSeWa: 1 CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48
## FR3 : 4 Edwards:100 RRAn : 26
## Inside :1052 Somerst: 86 PosN : 19
## Gilbert: 79 RRAe : 11
## (Other):707 (Other): 15
## Condition2 BldgType HouseStyle RoofStyle RoofMatl
## Norm :1445 1Fam :1220 1Story :726 Flat : 13 CompShg:1434
## Feedr : 6 2fmCon: 31 2Story :445 Gable :1141 Tar&Grv: 11
## Artery : 2 Duplex: 52 1.5Fin :154 Gambrel: 11 WdShngl: 6
## PosN : 2 Twnhs : 43 SLvl : 65 Hip : 286 WdShake: 5
## RRNn : 2 TwnhsE: 114 SFoyer : 37 Mansard: 7 ClyTile: 1
## PosA : 1 1.5Unf : 14 Shed : 2 Membran: 1
## (Other): 2 (Other): 19 (Other): 2
## Exterior1st Exterior2nd MasVnrType ExterQual ExterCond
## VinylSd:515 VinylSd:504 BrkCmn : 15 Ex: 52 Ex: 3
## HdBoard:222 MetalSd:214 BrkFace:445 Fa: 14 Fa: 28
## MetalSd:220 HdBoard:207 None :864 Gd:488 Gd: 146
## Wd Sdng:206 Wd Sdng:197 Stone :128 TA:906 Po: 1
## Plywood:108 Plywood:142 NA's : 8 TA:1282
## CemntBd: 61 CmentBd: 60
## (Other):128 (Other):136
## Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
## BrkTil:146 Ex :121 Fa : 45 Av :221 ALQ :220
## CBlock:634 Fa : 35 Gd : 65 Gd :134 BLQ :148
## PConc :647 Gd :618 Po : 2 Mn :114 GLQ :418
## Slab : 24 TA :649 TA :1311 No :953 LwQ : 74
## Stone : 6 NA's: 37 NA's: 37 NA's: 38 Rec :133
## Wood : 3 Unf :430
## NA's: 37
## BsmtFinType2 Heating HeatingQC CentralAir Electrical KitchenQual
## ALQ : 19 Floor: 1 Ex:741 N: 95 FuseA: 94 Ex:100
## BLQ : 33 GasA :1428 Fa: 49 Y:1365 FuseF: 27 Fa: 39
## GLQ : 14 GasW : 18 Gd:241 FuseP: 3 Gd:586
## LwQ : 46 Grav : 7 Po: 1 Mix : 1 TA:735
## Rec : 54 OthW : 2 TA:428 SBrkr:1334
## Unf :1256 Wall : 4 NA's : 1
## NA's: 38
## Functional FireplaceQu GarageType GarageFinish GarageQual
## Maj1: 14 Ex : 24 2Types : 6 Fin :352 Ex : 3
## Maj2: 5 Fa : 33 Attchd :870 RFn :422 Fa : 48
## Min1: 31 Gd :380 Basment: 19 Unf :605 Gd : 14
## Min2: 34 Po : 20 BuiltIn: 88 NA's: 81 Po : 3
## Mod : 15 TA :313 CarPort: 9 TA :1311
## Sev : 1 NA's:690 Detchd :387 NA's: 81
## Typ :1360 NA's : 81
## GarageCond PavedDrive PoolQC Fence MiscFeature
## Ex : 2 N: 90 Ex : 2 GdPrv: 59 Gar2: 2
## Fa : 35 P: 30 Fa : 2 GdWo : 54 Othr: 2
## Gd : 9 Y:1340 Gd : 3 MnPrv: 157 Shed: 49
## Po : 7 NA's:1453 MnWw : 11 TenC: 1
## TA :1326 NA's :1179 NA's:1406
## NA's: 81
##
## SaleType SaleCondition
## WD :1267 Abnorml: 101
## New : 122 AdjLand: 4
## COD : 43 Alloca : 12
## ConLD : 9 Family : 20
## ConLI : 5 Normal :1198
## ConLw : 5 Partial: 125
## (Other): 9
Bar charts of the 43 categorical variables
Correlation matrix for any THREE quantitative variables
## LotArea OverallQual GrLivArea SalePrice
## LotArea 1.0000000 0.1058057 0.2631162 0.2638434
## OverallQual 0.1058057 1.0000000 0.5930074 0.7909816
## GrLivArea 0.2631162 0.5930074 1.0000000 0.7086245
## SalePrice 0.2638434 0.7909816 0.7086245 1.0000000
These results show a very weak positive correlation between OverallQual and LotArea (0.11), a weak positive correlation between LotArea and GrLivArea (0.26), and a moderately strong correlation between GrLivArea and OverallQual (0.59).
There is a weak correlation between SalePrice and LotArea (0.26), but a strong correlation between SalePrice and both GrLivArea (0.71) and OverallQual (0.79).
corrplot(correlations, method = "square")
##
## Welch Two Sample t-test
##
## data: correlation$LotArea and correlation$SalePrice
## t = -81.321, df = 1505.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 92 percent confidence interval:
## -174075.3 -166733.4
## sample estimates:
## mean of x mean of y
## 10516.83 180921.20
In the house training data set, the mean LotArea is 10516.83 and the mean sale price of a house is 180921.20. The 92% confidence interval for the difference in means (LotArea minus SalePrice) runs from -174075.3 to -166733.4.
The p-value is very small (< 0.05), so we reject the null hypothesis that the two means are equal. Note that this Welch t-test compares the means of two variables measured in different units; it does not test whether their correlation is 0. The correlation between LotArea and SalePrice in the matrix above is 0.26, and a correlation test such as cor.test would be needed to test it against 0.
##
## Welch Two Sample t-test
##
## data: correlation$GrLivArea and correlation$SalePrice
## t = -86.288, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 92 percent confidence interval:
## -183048.2 -175763.3
## sample estimates:
## mean of x mean of y
## 1515.464 180921.196
In the house training data set, the mean Above Ground Living Area is 1515.464 and the mean sale price of a house is 180921.196. The 92% confidence interval for the difference in means (GrLivArea minus SalePrice) runs from -183048.2 to -175763.3.
The p-value is very small (< 0.05), so we reject the null hypothesis that the two means are equal. This test compares means, not the correlation; the correlation between GrLivArea and SalePrice in the matrix above is 0.71.
##
## Welch Two Sample t-test
##
## data: correlation$OverallQual and correlation$SalePrice
## t = -87.016, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 92 percent confidence interval:
## -184557.5 -177272.7
## sample estimates:
## mean of x mean of y
## 6.099315e+00 1.809212e+05
In the house training data set, the mean Overall Quality is 6.10 and the mean sale price of a house is 180921.196. The 92% confidence interval for the difference in means (OverallQual minus SalePrice) runs from -184557.5 to -177272.7.
The p-value is very small (< 0.05), so we reject the null hypothesis that the two means are equal. This test compares means, not the correlation; the correlation between OverallQual and SalePrice in the matrix above is 0.79.
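A test of the hypothesis that a correlation is 0, with a 92% confidence interval, can be sketched with `cor.test`. The example below uses simulated data (all variable names and parameters are assumptions, not values from train.csv) and adds a Bonferroni adjustment as one common way to account for familywise error across the three pairwise tests.

```r
set.seed(1)
n <- 200
# Simulated stand-ins for three quantitative variables.
lot_area    <- rnorm(n, mean = 10000, sd = 3000)
living_area <- rnorm(n, mean = 1500, sd = 500)
sale_price  <- 50000 + 100 * living_area + rnorm(n, sd = 40000)
d <- data.frame(lot_area, living_area, sale_price)

# All pairwise combinations of the three columns.
pair_idx <- combn(names(d), 2)

# For each pair: Pearson correlation, 92% CI, and p-value for H0: rho = 0.
results <- apply(pair_idx, 2, function(p) {
  ct <- cor.test(d[[p[1]]], d[[p[2]]], conf.level = 0.92)
  c(r = unname(ct$estimate),
    lower = ct$conf.int[1],
    upper = ct$conf.int[2],
    p = ct$p.value)
})
colnames(results) <- apply(pair_idx, 2, paste, collapse = " ~ ")

# Bonferroni adjustment: multiplies each p-value by the number of tests.
p_bonf <- p.adjust(results["p", ], method = "bonferroni")
print(round(results, 4))
print(p_bonf)
```

With only three tests the adjustment is mild, but it makes explicit how much evidence remains once multiple comparisons are taken into account.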
Linear Algebra and Correlation
Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
## LotArea OverallQual GrLivArea
## LotArea 1.0000000 0.1058057 0.2631162
## OverallQual 0.1058057 1.0000000 0.5930074
## GrLivArea 0.2631162 0.5930074 1.0000000
## LotArea OverallQual GrLivArea
## LotArea 1.0788892 0.0835766 -0.3334347
## OverallQual 0.0835766 1.5488697 -0.9404816
## GrLivArea -0.3334347 -0.9404816 1.6454446
## LotArea OverallQual GrLivArea
## LotArea 1.000000e+00 0.000000e+00 0.000000e+00
## OverallQual -2.775558e-17 1.000000e+00 1.110223e-16
## GrLivArea 5.551115e-17 1.110223e-16 1.000000e+00
precision %*% cor2
## LotArea OverallQual GrLivArea
## LotArea 1.000000e+00 2.775558e-17 1.110223e-16
## OverallQual 0.000000e+00 1.000000e+00 1.110223e-16
## GrLivArea -5.551115e-17 1.110223e-16 1.000000e+00
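The inversion and the two products can be reproduced from the printed correlation matrix. In this sketch the matrix is typed in from the values reported in this document; the object names `cor3` and `precision3` are hypothetical.

```r
vars <- c("LotArea", "OverallQual", "GrLivArea")
# Correlation matrix as printed in this report.
cor3 <- matrix(c(1.0000000, 0.1058057, 0.2631162,
                 0.1058057, 1.0000000, 0.5930074,
                 0.2631162, 0.5930074, 1.0000000),
               nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# The precision matrix is the inverse of the correlation matrix.
precision3 <- solve(cor3)

# Its diagonal holds the variance inflation factors.
vif <- diag(precision3)

# Both products should equal the identity up to floating-point error.
left  <- cor3 %*% precision3
right <- precision3 %*% cor3
is_identity <- isTRUE(all.equal(left, diag(3), check.attributes = FALSE)) &&
  isTRUE(all.equal(right, diag(3), check.attributes = FALSE))

print(round(precision3, 7))
print(is_identity)
```

The tiny off-diagonal values such as 1.110223e-16 in the printed products are floating-point rounding, not real departures from the identity matrix.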
Conduct LU decomposition on the matrix
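One way to carry out the LU decomposition is a plain Doolittle elimination without pivoting, sketched below in base R. This is an illustrative implementation, not necessarily the method used for this report; the function name `lu_doolittle` is hypothetical, and skipping pivoting is assumed to be safe here because a positive definite correlation matrix has nonzero pivots.

```r
# Doolittle LU decomposition without pivoting: A = L %*% U,
# with L unit lower triangular and U upper triangular.
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # elimination multiplier
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]   # zero out entry (i, k)
    }
  }
  list(L = L, U = U)
}

# Correlation matrix as printed in this report.
cor3 <- matrix(c(1.0000000, 0.1058057, 0.2631162,
                 0.1058057, 1.0000000, 0.5930074,
                 0.2631162, 0.5930074, 1.0000000),
               nrow = 3, byrow = TRUE)

lu <- lu_doolittle(cor3)
reconstructed <- lu$L %*% lu$U
print(lu$L)
print(lu$U)
```

Multiplying the two factors back together and comparing with the original matrix is a simple check that the decomposition is correct.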
Calculus-Based Probability & Statistics
Many times, it makes sense to fit a closed form distribution to data. For the first variable that you selected which is skewed to the right, shift it so that the minimum value is above zero as necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
## [1] 334
## rate
## 0.000659864
5th and 95th percentiles using the cumulative distribution function (CDF)
## 5% 95%
## 80.02991 4694.18765
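The percentiles of an exponential distribution follow in closed form from its CDF, \(F(q) = 1 - e^{-\lambda q}\), so \(q_p = -\ln(1 - p)/\lambda\). The sketch below plugs in the rate reported above and compares the closed-form values with quantiles of a simulated sample. The seed and object names are assumptions; the closed-form values differ somewhat from the figures printed above, which may be quantiles of simulated draws rather than CDF-based values.

```r
rate_hat <- 0.000659864   # rate reported by fitdistr in this report

# Closed-form percentiles from the exponential CDF.
pct_theory <- qexp(c(0.05, 0.95), rate = rate_hat)

# Percentiles of 1000 simulated draws vary with the random seed.
set.seed(42)
samples <- rexp(1000, rate = rate_hat)
pct_sample <- quantile(samples, c(0.05, 0.95))

# The maximum likelihood estimate of the rate is 1 / mean.
rate_from_mean <- 1 / mean(samples)

# If MASS is installed, fitdistr should return the same estimate.
if (requireNamespace("MASS", quietly = TRUE)) {
  fit <- MASS::fitdistr(samples, "exponential")
  print(fit$estimate)
}

print(pct_theory)
print(pct_sample)
```

Because the fitted rate equals 1/mean, a reported rate of 0.000659864 corresponds to a mean of about 1515.46, the mean of the unshifted GrLivArea.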
95% confidence interval from the empirical data, assuming normality
## upper mean lower
## 1542.440 1515.464 1488.487
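With 95% confidence, the mean of GrLivArea is between 1488.487 and 1542.440.
The exponential fit puts its 5th and 95th percentiles at about 80 and 4694, even though the smallest observed value is 334 and the third quartile is 1777. The exponential distribution is therefore a poor description of GrLivArea, and the empirical data describe this variable better.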
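The confidence interval for the mean can be reproduced from the summary statistics reported in this document (n = 1460, mean = 1515.464, sd = 525.48). The sketch assumes a t critical value was used, which matches the printed bounds closely; the rounded standard deviation limits the precision slightly.

```r
n_obs  <- 1460       # number of observations
mean_x <- 1515.464   # sample mean of GrLivArea
sd_x   <- 525.48     # sample standard deviation (rounded)

se <- sd_x / sqrt(n_obs)               # standard error of the mean
t_crit <- qt(0.975, df = n_obs - 1)    # two-sided 95% critical value

ci <- c(upper = mean_x + t_crit * se,
        mean  = mean_x,
        lower = mean_x - t_crit * se)
print(round(ci, 3))
```

This interval describes uncertainty about the mean, not the spread of individual homes, which is why it is so much narrower than the percentile ranges.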
Modeling
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
I’m going to build a multiple regression model.
For this multiple linear regression, I’ll use more than one predictor. The response variable will continue to be SalePrice, and the predictors are LotArea, TotalBsmtSF, GrLivArea (Above Ground Living Space) and GarageArea.
## LotArea TotalBsmtSF GrLivArea GarageArea
## Min. : 1300 Min. : 0.0 Min. : 334 Min. : 0.0
## 1st Qu.: 7554 1st Qu.: 795.8 1st Qu.:1130 1st Qu.: 334.5
## Median : 9478 Median : 991.5 Median :1464 Median : 480.0
## Mean : 10517 Mean :1057.4 Mean :1515 Mean : 473.0
## 3rd Qu.: 11602 3rd Qu.:1298.2 3rd Qu.:1777 3rd Qu.: 576.0
## Max. :215245 Max. :6110.0 Max. :5642 Max. :1418.0
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
The new dataset contains the five variables to be used in the model. The matrix plot above allows us to visualize the relationship among all variables in one single image. For example, we can see how Total Basement SQFT and Above Ground Living Space are related (see third column, second row graph).
I’ll start by fitting a linear regression on this dataset and see how well it models the observed data. I’ll add all other predictors and give each of them a separate slope coefficient.
For our multiple linear regression example, we want to solve the following equation:
SalePrice = B0 + B1 * LotArea + B2 * TotalBsmtSF + B3 * GrLivArea + B4 * GarageArea
The model will estimate the value of the intercept (B0) and each predictor’s slope: (B1) for LotArea, (B2) for TotalBsmtSF, (B3) for GrLivArea, and (B4) for GarageArea. The intercept is the expected Sale Price when every predictor equals zero. We want the model to fit the observed relationship so that the fitted values are as close as possible to all data points.
##
## Call:
## lm(formula = SalePrice ~ ., data = lmdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -654432 -19542 30 19320 277149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.430e+04 4.045e+03 -6.006 2.39e-09 ***
## LotArea 2.042e-01 1.273e-01 1.604 0.109
## TotalBsmtSF 4.834e+01 3.340e+00 14.474 < 2e-16 ***
## GrLivArea 6.806e+01 2.760e+00 24.656 < 2e-16 ***
## GarageArea 1.032e+02 6.831e+00 15.108 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46200 on 1455 degrees of freedom
## Multiple R-squared: 0.6627, Adjusted R-squared: 0.6618
## F-statistic: 714.8 on 4 and 1455 DF, p-value: < 2.2e-16
SalePrice <- -2.430e+04 + 2.042e-01 * LotArea + 4.834e+01 * TotalBsmtSF + 6.806e+01 * GrLivArea + 1.032e+02 * GarageArea
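All four slope estimates are positive, so holding the other predictors fixed, an increase in any of them is associated with a higher predicted SalePrice.
This model has a multiple R-squared of 0.6627 and an adjusted R-squared of 0.6618.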
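The fitting and inspection steps can be sketched on simulated data. Everything in the block below is an assumption made for illustration (the data frame `demo`, the simulated coefficients, the noise level); only the calls `lm`, `summary`, `coef`, and `predict` mirror what a model like the one above would use.

```r
set.seed(123)
n <- 300
# Simulated predictors on roughly the same scale as the housing data.
demo <- data.frame(
  TotalBsmtSF = runif(n, 0, 2000),
  GrLivArea   = runif(n, 500, 3000),
  GarageArea  = runif(n, 0, 900)
)
# Simulated response with known slopes plus noise.
demo$SalePrice <- -24000 + 49 * demo$TotalBsmtSF + 69 * demo$GrLivArea +
  103 * demo$GarageArea + rnorm(n, sd = 30000)

# "SalePrice ~ ." regresses SalePrice on every other column.
fit <- lm(SalePrice ~ ., data = demo)
s <- summary(fit)

coefs  <- coef(s)          # estimates, std. errors, t values, p-values
r2     <- s$r.squared      # multiple R-squared
adj_r2 <- s$adj.r.squared  # adjusted R-squared

# Prediction for one hypothetical house.
new_house <- data.frame(TotalBsmtSF = 1000, GrLivArea = 1500, GarageArea = 500)
pred <- predict(fit, newdata = new_house)

print(round(coefs, 4))
print(c(r2 = r2, adj_r2 = adj_r2, prediction = unname(pred)))
```

The adjusted R-squared penalizes extra predictors, so it is the fairer figure when comparing models with different numbers of terms.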
corr2 = cor(lmdata)
corrplot(corr2, method = "number")
Notice that the correlation between LotArea and GarageArea is very low at 0.18, so Garage Area does not track Lot Area closely. Lot Area’s high p-value (0.109) means that, once the other predictors are in the model, there is no evidence that Lot Area is associated with Sale Price, while the other three predictors remain strongly related to it.
The F-Statistic value from our model is 714.8 on 4 and 1455 degrees of freedom. So assuming that the number of data points is appropriate and given that the p-values returned are low, we have some evidence that at least one of the predictors is associated with SalePrice.
plot(model1, pch=16, which=1)
Given that we have indications that at least one of the predictors is associated with SalePrice, and based on the fact that LotArea here has a high p-value, we can consider removing LotArea from the model and see how the model fit changes (we are not going to run a variable selection procedure such as forward, backward or mixed selection in this example):
##
## Call:
## lm(formula = SalePrice ~ TotalBsmtSF + GrLivArea + GarageArea,
## data = lmdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -650577 -19502 -128 19408 276301
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -24105.471 4045.540 -5.959 3.19e-09 ***
## TotalBsmtSF 49.146 3.303 14.878 < 2e-16 ***
## GrLivArea 68.751 2.728 25.203 < 2e-16 ***
## GarageArea 103.321 6.835 15.117 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46220 on 1456 degrees of freedom
## Multiple R-squared: 0.6621, Adjusted R-squared: 0.6614
## F-statistic: 951.1 on 3 and 1456 DF, p-value: < 2.2e-16
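Removing LotArea raised the F-statistic from 714.8 to 951.1, but the fit itself is essentially unchanged: the adjusted R-squared went from 0.6618 to 0.6614 and the residual standard error from 46200 to 46220. The lack of improvement is possibly due to the presence of outlier points in the data.
plot(model2, pch=16, which=1)
Note how the residuals plot of this last model shows some important points still lying far away from the middle area of the graph.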
Let’s apply a logarithmic transformation to the SalePrice variable with the log function (log uses the natural logarithm; if base 10 is desired, log10 is the function to use). I’ll apply this transformation directly in the model formula and see what happens to both the model fit and the model accuracy.
##
## Call:
## lm(formula = log(SalePrice) ~ TotalBsmtSF + GrLivArea + GarageArea,
## data = lmdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.15585 -0.09757 0.03202 0.13550 0.74485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.099e+01 2.007e-02 547.64 <2e-16 ***
## TotalBsmtSF 2.356e-04 1.639e-05 14.37 <2e-16 ***
## GrLivArea 3.284e-04 1.353e-05 24.27 <2e-16 ***
## GarageArea 6.022e-04 3.391e-05 17.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2293 on 1456 degrees of freedom
## Multiple R-squared: 0.671, Adjusted R-squared: 0.6703
## F-statistic: 990 on 3 and 1456 DF, p-value: < 2.2e-16
log(SalePrice) = 1.099e+01 + 2.356e-04 * TotalBsmtSF + 3.284e-04 * GrLivArea + 6.022e-04 * GarageArea
plot(model3, pch=16, which=1)
A high F value means that our data do not support the null hypothesis that all slope coefficients are zero. In other words, the observed data are compatible with the alternative hypothesis that at least one predictor is associated with log(SalePrice).
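Because this model predicts log(SalePrice), its predictions have to be passed through `exp` to return to dollars. The sketch below illustrates this on simulated data; the data, coefficients, and object names are assumptions, and the simple back-transform shown is closer to a median than a mean prediction.

```r
set.seed(7)
n <- 300
GrLivArea <- runif(n, 500, 3000)
# Simulated prices that are linear in GrLivArea on the log scale.
SalePrice <- exp(11 + 0.0004 * GrLivArea + rnorm(n, sd = 0.2))
demo <- data.frame(GrLivArea, SalePrice)

# The transformation is applied inside the formula.
fit_log <- lm(log(SalePrice) ~ GrLivArea, data = demo)

# Predictions come back on the log scale and must be exponentiated.
pred_log   <- predict(fit_log, newdata = data.frame(GrLivArea = 1500))
pred_price <- exp(pred_log)

# On the log scale a slope is a relative effect: the approximate
# percent change in price for 100 extra square feet.
pct_per_100sqft <- (exp(100 * coef(fit_log)[["GrLivArea"]]) - 1) * 100

print(c(pred_log = unname(pred_log),
        pred_price = unname(pred_price),
        pct_per_100sqft = pct_per_100sqft))
```

Forgetting the `exp` step would produce predicted prices near 12, which would score very poorly against actual sale prices.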
Load the test data and try out the equation from model 1 on the test data.
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
## logical(0)
## [1] NA
## logical(0)
## [1] NA
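A prediction step for a held-out file can be sketched as follows. This is an illustration under stated assumptions, not the code that produced the output above: the six training rows are taken from the first rows of the data shown in this document, while the test rows, their Id values, the median fill for missing predictors, and all object names are hypothetical. Missing predictor values are one common reason predictions come back as NA.

```r
# Training rows taken from the first six rows shown in this report.
train_demo <- data.frame(
  GrLivArea  = c(1710, 1262, 1786, 1717, 2198, 1362),
  GarageArea = c(548, 460, 608, 642, 836, 480),
  SalePrice  = c(208500, 181500, 223500, 140000, 250000, 143000)
)

# Hypothetical test rows; one predictor value is missing on purpose.
test_demo <- data.frame(
  Id         = 1461:1464,
  GrLivArea  = c(896, 1329, 1629, 1604),
  GarageArea = c(730, NA, 482, 470)
)

fit <- lm(SalePrice ~ GrLivArea + GarageArea, data = train_demo)

# Fill missing predictors with the training median so predict()
# does not return NA for those rows.
for (col in c("GrLivArea", "GarageArea")) {
  miss <- is.na(test_demo[[col]])
  test_demo[[col]][miss] <- median(train_demo[[col]], na.rm = TRUE)
}

# One row per test Id with its predicted SalePrice.
submission <- data.frame(Id = test_demo$Id,
                         SalePrice = predict(fit, newdata = test_demo))

out_path <- tempfile(fileext = ".csv")
write.csv(submission, out_path, row.names = FALSE)
print(submission)
```

If the fitted model uses log(SalePrice), the predictions would additionally need `exp` applied before being written out.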
Score from Kaggle: 9.45827
Screen Name: nataliemollaghan