Kaggle Competition: House Prices: Advanced Regression Techniques
username: Emahayz
Score: 0.6294
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu = \sigma = (N+1)/2\).
R offers several ways to generate random numbers. One simple option is the runif function, which draws a stated number of values between two end points (but not the end points themselves). It samples from the continuous uniform distribution, so every value between the two end points is equally likely to be drawn.
set.seed(101)
N <- 6
M <- 10000
mu = sigma = (N + 1)/2
X <- runif(M, min = 1, max = N)
Y <- rnorm(M, mean = mu, sd = sigma)
numbers <- data.frame(X,Y)
Obtain Summaries for X and Y
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.001 2.254 3.487 3.505 4.760 5.999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.703 1.153 3.453 3.517 5.899 19.148
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
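The chunk that defines x and y is not shown. A minimal sketch, assuming they are taken with base R's median() and quantile() on the simulated vectors (the seed and N are copied from the code above):

```r
# Re-create the simulated data so the block runs on its own
set.seed(101)
N <- 6
M <- 10000
X <- runif(M, min = 1, max = N)
Y <- rnorm(M, mean = (N + 1)/2, sd = (N + 1)/2)

# Assumed definitions: x is the median of X, y is the first quartile of Y
x <- median(X)
y <- unname(quantile(Y, probs = 0.25))

c(x = x, y = y)
```

By construction, half of the X values lie above x and about three quarters of the Y values lie above y.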
a. P(X>x | X>y)
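This is a conditional probability: the probability that A occurs given that B has occurred, P(A | B) = P(A and B)/P(B).
library(dplyr)  # provides %>% and filter()
A = numbers %>% filter(X > x, X > y) %>% nrow()/M  # P(X > x and X > y)
B = numbers %>% filter(X > y) %>% nrow()/M  # P(X > y)
cond = A/B
cond
## [1] 0.5153577
Therefore, P(X>x | X>y) = 0.5154: given that X exceeds the first quartile of Y, there is about a 51.5% chance that it also exceeds its own median.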
b. P(X>x, Y>y)
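This is a joint probability: the probability that X exceeds x and Y exceeds y at the same time.
## [1] 0.3778
Therefore, P(X>x, Y>y) = 0.3778: about 37.8% of the simulated pairs have X above its median and Y above its first quartile.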
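The code for this step is not shown. One way to obtain a joint probability of this kind, sketched on freshly simulated data with assumed definitions of x and y, is to take the mean of a logical vector:

```r
set.seed(101)
N <- 6
M <- 10000
X <- runif(M, min = 1, max = N)
Y <- rnorm(M, mean = (N + 1)/2, sd = (N + 1)/2)
x <- median(X)                   # assumed definition of x
y <- unname(quantile(Y, 0.25))   # assumed definition of y

# Share of pairs in which both events hold at once
joint <- mean(X > x & Y > y)
joint
```

The mean of a logical vector is the proportion of TRUE values, so no explicit counting or division by M is needed.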
c. P(X<x | X>y)
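This is also a conditional probability, A occurs given B. It is the complement of part a, so the two results add up to 1.
A1 = numbers %>% filter(X < x, X > y) %>% nrow()/M  # P(X < x and X > y)
B2 = numbers %>% filter(X > y) %>% nrow()/M  # P(X > y)
cond1 = A1/B2
cond1
## [1] 0.4846423
Therefore, P(X<x | X>y) = 0.4846: given that X exceeds the first quartile of Y, there is about a 48.5% chance that it falls below its own median.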
5 points. Investigate whether P(X>x and Y>y) = P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
pX <- c(sum(X > x)/M, sum(X <= x)/M)  # marginal probabilities for X
pY <- c(sum(Y > y)/M, sum(Y <= y)/M)  # marginal probabilities for Y
cells <- outer(pY, pX)  # each cell is a product of two marginals
# Append the row totals as a third column and the column totals as a third row
table <- as.data.frame(rbind(cbind(cells, rowSums(cells)), c(colSums(cells), sum(cells))))
colnames(table) <- c("P(X>x)", "P(X<x)", "Marginal")
rownames(table) <- c("P(Y>y)", "P(Y<y)", "Marginal")
table
##          P(X>x) P(X<x) Marginal
## P(Y>y)    0.375  0.375     0.75
## P(Y<y)    0.125  0.125     0.25
## Marginal  0.500  0.500     1.00
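P(X>x) = 0.50 and P(Y>y) = 0.75, so the product P(X>x)P(Y>y) = 0.375. This is the value in the table because its cells are built from the marginals. The joint probability observed directly in part b is 0.3778, which is close to 0.375.
Therefore, P(X>x and Y>y) ≈ P(X>x)P(Y>y), which is consistent with X and Y being independent.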
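The table above is filled with products of the marginals. As a complementary check, the sketch below cross-tabulates the observed counts and compares the observed joint proportion with the product of its marginals. The names `counts` and `props` are illustrative, and x and y use assumed definitions:

```r
set.seed(101)
N <- 6
M <- 10000
X <- runif(M, min = 1, max = N)
Y <- rnorm(M, mean = (N + 1)/2, sd = (N + 1)/2)
x <- median(X)                   # assumed definition of x
y <- unname(quantile(Y, 0.25))   # assumed definition of y

# Observed counts for the four combinations of events
counts <- table(Y_gt_y = Y > y, X_gt_x = X > x)
props <- prop.table(counts)

joint_obs <- props["TRUE", "TRUE"]                             # observed P(X>x and Y>y)
joint_ind <- rowSums(props)["TRUE"] * colSums(props)["TRUE"]   # P(Y>y) * P(X>x)

c(observed = joint_obs, product = unname(joint_ind))
addmargins(props)  # joint proportions with marginal totals
```

If the two numbers are close, the data give little reason to doubt independence; a formal test makes the comparison precise.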
5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
## Warning in fisher.test(indp): 'x' has been rounded to integer: Mean
## relative difference: 0.75
##
## Fisher's Exact Test for Count Data
##
## data: indp
## p-value = 1
## alternative hypothesis: two.sided
## Warning in chisq.test(indp): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: indp
## X-squared = 0, df = 2, p-value = 1
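Both tests fail to reject independence, as the p-value = 1 in each case. Fisher's Exact Test computes an exact p-value and is suited to small samples, while the Chi Square Test relies on a large-sample approximation. With 10,000 observations the Chi Square Test is the more appropriate choice. Note that the Fisher warning reports that the input was rounded to integers, which suggests the tests were given proportions rather than counts, so these p-values should be read with caution.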
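Both tests expect integer counts. A sketch, assuming the same simulated X and Y and a 2 x 2 table of counts (this is not the `indp` object used above, whose construction is not shown):

```r
set.seed(101)
N <- 6
M <- 10000
X <- runif(M, min = 1, max = N)
Y <- rnorm(M, mean = (N + 1)/2, sd = (N + 1)/2)
x <- median(X)                   # assumed definition of x
y <- unname(quantile(Y, 0.25))   # assumed definition of y

# 2 x 2 contingency table of observed counts
counts <- table(Y > y, X > x)

chi <- chisq.test(counts)   # large-sample approximation (continuity-corrected for 2 x 2)
fis <- fisher.test(counts)  # exact test on the same table

c(chi_square_p = chi$p.value, fisher_p = fis$p.value)
```

With counts as input, neither call should trigger the rounding warning, and the Chi Square Test has 1 degree of freedom for a 2 x 2 table.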
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.
I will train the model on the training data, then prepare the test data in the same way and use it to generate predictions.
In this section, I will prepare the dataset for multiple regression modeling. I will consider the assumptions of linear regression based on ordinary least squares algorithms – particularly regarding linearity, normality, homoscedasticity, and measurement level.
Looking for missing values
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 0 259 0
## Street Alley LotShape LandContour Utilities
## 0 1369 0 0 0
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 0 0
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 8 8 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 37 37 38 37 0
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 38 0 0 0 0
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## 0 0 1 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 0 0 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 0 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 0 0 690 81 81
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 81 0 0 81 81
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 1453 1179 1406
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 0 0
## SalePrice
## 0
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 4 227 0
## Street Alley LotShape LandContour Utilities
## 0 1352 0 0 2
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 1 1
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 16 15 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 44 45 44 42 1
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 42 1 1 1 0
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## 0 0 0 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 2 2 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 1 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 2 0 730 76 78
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 78 1 1 78 78
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 1456 1169 1408
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 1 0
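Both data sets contain missing values. The first table, which includes SalePrice, is the training set and the second is the test set. In both, the gaps are concentrated in PoolQC, MiscFeature, Alley, Fence, FireplaceQu and LotFrontage.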
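Counts like these can be produced with colSums(is.na(...)). A minimal sketch on a small made-up data frame (the `toy` object and its values are illustrative, not the Ames data):

```r
# Hypothetical four-row data frame with a few missing entries
toy <- data.frame(
  LotFrontage = c(65, NA, 68, NA),
  Alley       = c(NA, NA, "Grvl", NA),
  SalePrice   = c(208500, 181500, 223500, 140000)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Keep only the columns with at least one missing value, largest first
sort(na_counts[na_counts > 0], decreasing = TRUE)
```

Applied to the real data, the same two lines would be run once on the training set and once on the test set.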
5 points. Descriptive and Inferential Statistics
Provide univariate descriptive statistics and appropriate plots for the training data set.
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1461 20 RH 80 11622 Pave <NA> Reg
## 2 1462 20 RL 81 14267 Pave <NA> IR1
## 3 1463 60 RL 74 13830 Pave <NA> IR1
## 4 1464 60 RL 78 9978 Pave <NA> IR1
## 5 1465 120 RL 43 5005 Pave <NA> IR1
## 6 1466 60 RL 75 10000 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl NAmes Feedr
## 2 Lvl AllPub Corner Gtl NAmes Norm
## 3 Lvl AllPub Inside Gtl Gilbert Norm
## 4 Lvl AllPub Inside Gtl Gilbert Norm
## 5 HLS AllPub Inside Gtl StoneBr Norm
## 6 Lvl AllPub Corner Gtl Gilbert Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 1Story 5 6 1961
## 2 Norm 1Fam 1Story 6 6 1958
## 3 Norm 1Fam 2Story 5 5 1997
## 4 Norm 1Fam 2Story 6 6 1998
## 5 Norm TwnhsE 1Story 8 5 1992
## 6 Norm 1Fam 2Story 6 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 1961 Gable CompShg VinylSd VinylSd None
## 2 1958 Hip CompShg Wd Sdng Wd Sdng BrkFace
## 3 1998 Gable CompShg VinylSd VinylSd None
## 4 1998 Gable CompShg VinylSd VinylSd BrkFace
## 5 1992 Gable CompShg HdBoard HdBoard None
## 6 1994 Gable CompShg HdBoard HdBoard None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 0 TA TA CBlock TA TA No
## 2 108 TA TA CBlock TA TA No
## 3 0 TA TA PConc Gd TA No
## 4 20 TA TA PConc TA TA No
## 5 0 Gd TA PConc Gd TA No
## 6 0 TA TA PConc Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 Rec 468 LwQ 144 270 882
## 2 ALQ 923 Unf 0 406 1329
## 3 GLQ 791 Unf 0 137 928
## 4 GLQ 602 Unf 0 324 926
## 5 ALQ 263 Unf 0 1017 1280
## 6 Unf 0 Unf 0 763 763
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA TA Y SBrkr 896 0 0
## 2 GasA TA Y SBrkr 1329 0 0
## 3 GasA Gd Y SBrkr 928 701 0
## 4 GasA Ex Y SBrkr 926 678 0
## 5 GasA Ex Y SBrkr 1280 0 0
## 6 GasA Gd Y SBrkr 763 892 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 896 0 0 1 0 2
## 2 1329 0 0 1 1 3
## 3 1629 0 0 2 1 3
## 4 1604 0 0 2 1 3
## 5 1280 0 0 2 0 2
## 6 1655 0 0 2 1 3
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 TA 5 Typ 0 <NA>
## 2 1 Gd 6 Typ 0 <NA>
## 3 1 TA 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 5 Typ 0 <NA>
## 6 1 TA 7 Typ 1 TA
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 1961 Unf 1 730 TA
## 2 Attchd 1958 Unf 1 312 TA
## 3 Attchd 1997 Fin 2 482 TA
## 4 Attchd 1998 Fin 2 470 TA
## 5 Attchd 1992 RFn 2 506 TA
## 6 Attchd 1993 Fin 2 440 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 140 0 0 0
## 2 TA Y 393 36 0 0
## 3 TA Y 212 34 0 0
## 4 TA Y 360 36 0 0
## 5 TA Y 0 82 0 0
## 6 TA Y 157 84 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 120 0 <NA> MnPrv <NA> 0 6 2010
## 2 0 0 <NA> <NA> Gar2 12500 6 2010
## 3 0 0 <NA> MnPrv <NA> 0 3 2010
## 4 0 0 <NA> <NA> <NA> 0 6 2010
## 5 144 0 <NA> <NA> <NA> 0 1 2010
## 6 0 0 <NA> <NA> <NA> 0 4 2010
## SaleType SaleCondition
## 1 WD Normal
## 2 WD Normal
## 3 WD Normal
## 4 WD Normal
## 5 WD Normal
## 6 WD Normal
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
## $ LotFrontage : num 65 80 68 60 84 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
## $ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
## $ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
## $ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
## $ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
## $ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
## $ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
## $ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
## $ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
## $ MasVnrArea : num 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
## $ ExterCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
## $ BsmtQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
## $ BsmtCond : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
## $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
## $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ HeatingQC : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
## $ GarageType : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
## $ GarageYrBlt : num 2003 1976 2001 1998 2000 ...
## $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
## $ GarageCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
## $ MiscFeature : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 795.8 991.5 1057.4 1298.2 6110.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1872 1954 1973 1971 2000 2010
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 334.5 480.0 473.0 576.0 1418.0
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
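A scatterplot matrix shows every pairwise plot at once. A sketch using base R's pairs(), run on simulated stand-in columns because the data loading step is not shown (the `demo` data frame and its coefficients are hypothetical):

```r
set.seed(1)
n <- 200

# Simulated stand-ins for two predictors and the response
demo <- data.frame(TotalBsmtSF = rexp(n, rate = 1/1000))
demo$GarageArea <- 0.3 * demo$TotalBsmtSF + rnorm(n, mean = 300, sd = 100)
demo$SalePrice <- 100 * demo$TotalBsmtSF + 150 * demo$GarageArea + rnorm(n, sd = 20000)

# One panel per pair of variables
pairs(demo[, c("SalePrice", "TotalBsmtSF", "GarageArea")],
      main = "Scatterplot matrix (simulated stand-in data)")
```

With the real data, the same call would take the corresponding columns of ames_train in place of `demo`.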
# Scatterplot of SalePrice against Total Basement SQF
plot(ames_train$SalePrice~ames_train$TotalBsmtSF, xlab = 'Total Basement SQF', ylab = 'Sale Price', main = 'Housing Sale Price and Total Basement SQF')
Derive a correlation matrix for any three quantitative variables in the dataset.
# Correlation matrix for three quantitative variables
quant <- data.frame(ames_train$TotalBsmtSF, ames_train$GarageArea, ames_train$OverallQual)
corrvalues <- as.matrix(quant)
corrmatrix <- round(cor(corrvalues),2)
corrmatrix
## ames_train.TotalBsmtSF ames_train.GarageArea
## ames_train.TotalBsmtSF 1.00 0.49
## ames_train.GarageArea 0.49 1.00
## ames_train.OverallQual 0.54 0.56
## ames_train.OverallQual
## ames_train.TotalBsmtSF 0.54
## ames_train.GarageArea 0.56
## ames_train.OverallQual 1.00
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis.
Would you be worried about familywise error? Why or why not?
basementCorr <- cor.test(ames_train$SalePrice, ames_train$TotalBsmtSF, conf.level = 0.8)
basementCorr
##
## Pearson's product-moment correlation
##
## data: ames_train$SalePrice and ames_train$TotalBsmtSF
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5922142 0.6340846
## sample estimates:
## cor
## 0.6135806
##
## Pearson's product-moment correlation
##
## data: ames_train$SalePrice and ames_train$GarageArea
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6024756 0.6435283
## sample estimates:
## cor
## 0.6234314
OverallQualCorr <- cor.test(ames_train$SalePrice, ames_train$OverallQual, conf.level = 0.8)
OverallQualCorr
##
## Pearson's product-moment correlation
##
## data: ames_train$SalePrice and ames_train$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
##
## Pearson's product-moment correlation
##
## data: ames_train$SalePrice and ames_train$LotArea
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
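On the familywise error question: if k tests were independent and each were run at level α, the chance of at least one false rejection would be 1 - (1 - α)^k. The sketch below uses α = 0.2 (matching the 80% intervals) and the four t statistics reported above. The Bonferroni adjustment via p.adjust() is one common correction, shown as an illustration rather than as part of the original analysis; the tests here all involve SalePrice, so the independence formula is only a rough guide.

```r
alpha <- 0.2  # 1 minus the 80% confidence level
t_stats <- c(TotalBsmtSF = 29.671, GarageArea = 30.446,
             OverallQual = 49.364, LotArea = 10.445)  # from the outputs above
df <- 1458
k <- length(t_stats)

# Familywise error rate for k independent tests at level alpha
fwer <- 1 - (1 - alpha)^k
fwer

# Two-sided p-values recomputed from the t statistics, then Bonferroni-adjusted
p_raw <- 2 * pt(-abs(t_stats), df)
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_bonf < alpha
```

Four tests at the 0.2 level give a familywise rate of about 0.59 under independence, which would be a concern in general. Here every adjusted p-value remains far below 0.2, so the conclusions do not change.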
5 points. Linear Algebra and Correlation
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Precision Matrix
precisionmatrix <- round(solve(corrmatrix), 2) # inverting the correlation matrix for precision matrix
precisionmatrix
## ames_train.TotalBsmtSF ames_train.GarageArea
## ames_train.TotalBsmtSF 1.52 -0.42
## ames_train.GarageArea -0.42 1.57
## ames_train.OverallQual -0.59 -0.65
## ames_train.OverallQual
## ames_train.TotalBsmtSF -0.59
## ames_train.GarageArea -0.65
## ames_train.OverallQual 1.68
Multiply correlation matrix by the precision matrix
## ames_train.TotalBsmtSF ames_train.GarageArea
## ames_train.TotalBsmtSF 1.00 0
## ames_train.GarageArea -0.01 1
## ames_train.OverallQual 0.00 0
## ames_train.OverallQual
## ames_train.TotalBsmtSF 0
## ames_train.GarageArea 0
## ames_train.OverallQual 1
Multiply precision matrix by the correlation matrix
## ames_train.TotalBsmtSF ames_train.GarageArea
## ames_train.TotalBsmtSF 1 -0.01
## ames_train.GarageArea 0 1.00
## ames_train.OverallQual 0 0.00
## ames_train.OverallQual
## ames_train.TotalBsmtSF 0
## ames_train.GarageArea 0
## ames_train.OverallQual 1
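A sketch of these steps using the rounded correlations printed above (the short row and column labels are my own). The inverse is left unrounded here, which differs from the round(solve(...), 2) call above; rounding the inverse before multiplying is what produces the small -0.01 entries.

```r
vars <- c("TotalBsmtSF", "GarageArea", "OverallQual")
corrmatrix <- matrix(c(1.00, 0.49, 0.54,
                       0.49, 1.00, 0.56,
                       0.54, 0.56, 1.00),
                     nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# Precision matrix: inverse of the correlation matrix
precisionmatrix <- solve(corrmatrix)

# Both products should be the identity matrix
round(corrmatrix %*% precisionmatrix, 2)
round(precisionmatrix %*% corrmatrix, 2)

# Diagonal of the precision matrix: variance inflation factors
diag(precisionmatrix)
```

A matrix multiplied by its inverse gives the identity in either order, so the two products agree.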
LU decomposition
## $L
## [,1] [,2] [,3]
## [1,] 1.00 0.0000000 0
## [2,] 0.49 1.0000000 0
## [3,] 0.54 0.3887354 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.4900 0.5400000
## [2,] 0 0.7599 0.2954000
## [3,] 0 0.0000 0.5935676
## ames_train.TotalBsmtSF ames_train.GarageArea
## ames_train.TotalBsmtSF TRUE TRUE
## ames_train.GarageArea TRUE TRUE
## ames_train.OverallQual TRUE TRUE
## ames_train.OverallQual
## ames_train.TotalBsmtSF TRUE
## ames_train.GarageArea TRUE
## ames_train.OverallQual TRUE
Multiplying L by U should reproduce the correlation matrix. The element-wise comparison above returns TRUE for every entry, which confirms the decomposition.
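The call that produced $L and $U is not shown. A base-R sketch of Doolittle LU decomposition without pivoting (the helper name `lu_doolittle` is my own), applied to the rounded correlation matrix:

```r
# Doolittle LU decomposition: A = L %*% U with a unit diagonal on L.
# No pivoting, so it assumes no zero pivot is encountered.
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- matrix(0, n, n)
  for (i in seq_len(n)) {
    prev <- seq_len(i - 1)  # indices already processed (empty when i = 1)
    # Fill row i of U
    for (j in i:n) {
      U[i, j] <- A[i, j] - sum(L[i, prev] * U[prev, j])
    }
    # Fill column i of L below the diagonal
    if (i < n) {
      for (j in (i + 1):n) {
        L[j, i] <- (A[j, i] - sum(L[j, prev] * U[prev, i])) / U[i, i]
      }
    }
  }
  list(L = L, U = U)
}

A <- matrix(c(1.00, 0.49, 0.54,
              0.49, 1.00, 0.56,
              0.54, 0.56, 1.00), nrow = 3, byrow = TRUE)
dec <- lu_doolittle(A)
dec$L
dec$U

# Check that the factors multiply back to the original matrix
all.equal(dec$L %*% dec$U, A)
```

For a correlation matrix, which is positive definite, skipping pivoting is safe; a general-purpose routine would add row exchanges.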
5 points. Calculus-Based Probability & Statistics
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
Total Basement SQF: the minimum value of this variable is 0, so I will shift the values by adding 0.05 to each one.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 795.8 991.5 1057.4 1298.2 6110.0
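# Caution: x[x] indexes the vector by its own values (zeros drop out, out-of-range positions become NA and are removed),
# so the result is a re-indexed copy of the column, which is why this summary differs from the one above
shiftTotalBsmtSF <- na.omit(ames_train$TotalBsmtSF[ames_train$TotalBsmtSF]) + 0.05
summary(shiftTotalBsmtSF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05 776.05 980.05 1038.61 1248.05 6110.05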
Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
## rate
## 9.628262e-04
## (2.806465e-05)
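For an exponential density f(t) = λ exp(-λ t), the maximum likelihood estimate of λ is the reciprocal of the sample mean, which is what fitdistr reports as `rate`. The estimate above agrees with this: 1/1038.61 ≈ 0.000963. A sketch on simulated data (the vector `sim` is a stand-in, not the housing variable):

```r
library(MASS)  # provides fitdistr()

set.seed(42)
sim <- rexp(500, rate = 0.001)  # simulated positive, right-skewed data

# Maximum likelihood fit of an exponential distribution
fit <- fitdistr(sim, densfun = "exponential")
fit$estimate

# Closed-form estimate for comparison
1 / mean(sim)
```

The two printed values should coincide, since the exponential fit has a closed-form solution.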
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).
## rate
## 0.0009628262
## [1] 1164.030014 473.083531 1734.518460 1885.914296 176.808906
## [6] 1375.058750 2274.311250 479.805898 207.997341 219.074635
## [11] 138.490080 1103.712174 806.322618 992.543874 363.963895
## [16] 99.165308 3249.643186 709.027571 1585.689398 2437.461409
## [21] 2201.470050 1317.416387 471.905873 2188.139443 838.108450
## [26] 2994.399718 1409.434020 263.749754 1552.992388 1261.283324
## [31] 180.625732 1218.331546 555.465123 61.972702 496.523336
## [36] 36.462263 12.131241 2098.520436 253.739401 766.981615
## [41] 1256.535027 537.481836 301.280870 3151.928358 3193.363037
## [46] 310.186137 486.256323 943.333397 65.807723 4485.353495
## [51] 437.927497 1247.149850 648.695238 338.166563 2239.021661
## [56] 101.038380 169.212871 844.923887 854.264831 280.214024
## [61] 3631.991670 960.768031 139.509752 788.272210 507.455789
## [66] 711.617264 1972.228959 3043.367833 1364.217375 231.376652
## [71] 909.608430 2145.931484 694.524938 922.895508 849.032208
## [76] 216.340771 285.264169 454.267218 316.514297 1862.911525
## [81] 1447.199131 904.133802 419.336042 5134.595134 774.574764
## [86] 356.698451 230.788271 1519.608649 428.005893 811.813490
## [91] 4630.977858 687.099207 139.245800 406.595563 1257.979505
## [96] 2676.175953 600.522066 218.150702 830.937972 27.331716
## [101] 1300.703023 379.417909 2378.334608 743.147101 62.460259
## [106] 224.930109 258.324597 148.720263 157.725925 3289.133541
## [111] 485.285823 572.913143 616.194026 1940.232998 223.979597
## [116] 1103.210551 2011.680811 2965.452698 498.126358 464.990019
## [121] 4118.183069 2076.736916 1101.263225 132.874511 604.125952
## [126] 57.947859 1871.829253 13.919136 355.051328 1810.945468
## [131] 2403.466997 4317.514571 521.671251 1058.418644 83.854896
## [136] 497.689766 1434.457806 1003.013433 564.563282 222.831149
## [141] 471.492924 957.159627 2677.877383 1155.017789 620.052270
## [146] 985.897260 1079.604570 1084.967278 196.418979 175.309635
## [151] 162.041090 108.170617 1862.360367 36.434622 1943.762373
## [156] 1157.180838 150.237337 722.298166 185.467673 2586.717259
## [161] 113.128584 1911.796207 75.423340 134.363758 472.959478
## [166] 3453.792629 539.443262 576.987961 1922.624106 817.487803
## [171] 1277.282419 471.176535 595.990974 1006.375572 4029.487265
## [176] 3057.632232 246.948595 367.784224 492.290360 233.829736
## [181] 715.534752 1976.595344 324.222208 2650.419688 3147.253246
## [186] 101.405281 574.651195 868.099130 324.630175 126.184590
## [191] 1454.953134 2126.391171 93.476747 910.801307 88.191517
## [196] 1555.960917 468.682362 1786.819881 1194.771905 185.229699
## [201] 1048.850542 803.886709 894.023209 914.109750 11.441039
## [206] 199.473531 614.334919 2685.136445 143.710164 178.186960
## [211] 201.337848 1920.157375 283.205049 1041.469659 280.049805
## [216] 372.868128 2514.939232 2244.731606 918.483438 378.443667
## [221] 1312.926393 90.588159 254.134377 319.092214 582.686488
## [226] 555.768556 6.236745 321.075661 763.348074 338.930264
## [231] 13.002116 489.013583 693.826186 290.126869 1481.298462
## [236] 175.975297 998.454270 1179.376379 599.077101 459.518684
## [241] 77.272435 1096.588228 533.095136 748.225508 1717.562066
## [246] 502.349170 755.359552 269.085535 182.569486 1071.238298
## [251] 2961.273645 490.688459 1214.815799 1026.693920 495.807860
## [256] 1017.259403 413.418578 2227.308915 362.720006 2626.812410
## [261] 1603.471128 4190.912531 1516.161549 2232.484651 1204.238699
## [266] 35.115204 2380.050502 1515.207969 1130.597908 151.322248
## [271] 540.581909 2762.038057 832.458857 252.866543 382.514708
## [276] 442.737044 716.605363 2125.177182 3150.763227 349.288264
## [281] 1073.039004 847.592758 2992.877006 1047.533282 1020.332892
## [286] 683.168107 2583.897223 764.162567 3215.710998 3256.315233
## [291] 2659.876097 340.473557 1491.691543 661.830171 869.946449
## [296] 1739.112090 1348.586235 389.583728 99.809596 1259.043871
## [301] 872.175039 2596.664736 3620.923770 162.711594 1076.942031
## [306] 994.154547 805.960112 502.572467 521.820087 498.579078
## [311] 868.153636 851.677064 771.268273 1439.637614 4115.676344
## [316] 1012.768600 644.653544 182.694140 864.752616 502.795621
## [321] 650.474109 34.374116 294.859975 1447.648040 967.802994
## [326] 485.468476 2108.750308 1542.738843 102.190934 1019.982306
## [331] 363.538797 879.123033 413.806492 2946.254365 194.884591
## [336] 613.758054 16.987603 359.694291 1142.699497 5.929488
## [341] 2530.204222 414.769610 594.612597 1804.141702 669.354986
## [346] 1813.475913 571.176149 108.178164 502.127551 779.867585
## [351] 1735.256342 193.184183 3208.155289 1543.908399 579.711254
## [356] 1254.249050 411.545657 603.920136 2570.296337 1314.079356
## [361] 246.623966 1895.712667 1143.634244 1872.325418 112.077332
## [366] 1297.443438 848.948756 948.842429 4407.158130 308.018420
## [371] 1059.450174 1449.317573 3419.150562 3651.714337 1379.294185
## [376] 3210.692594 954.125566 1205.185529 536.517602 476.619712
## [381] 442.425729 641.885206 1495.871779 385.853861 1908.547486
## [386] 152.869685 729.960606 1800.586856 613.117126 1052.911977
## [391] 243.885509 1641.074227 1187.867486 601.199982 1727.523909
## [396] 691.772684 1213.010450 1113.136270 128.573733 606.692996
## [401] 641.055320 386.188571 1315.826222 415.888859 2745.735712
## [406] 1419.936523 836.155902 1865.118957 2774.359111 101.634993
## [411] 15.146464 671.279227 570.836822 64.277455 1876.972713
## [416] 424.179488 634.556926 1022.765794 130.743309 166.987397
## [421] 2235.272689 394.255248 2830.749925 7.829786 3466.600167
## [426] 965.617117 769.071198 4960.775103 1223.440157 1177.322399
## [431] 518.588968 1795.496769 139.399962 1849.289512 1641.877393
## [436] 987.952879 5252.497737 2680.546674 3076.213289 646.257744
## [441] 452.076399 716.562180 269.163140 516.615830 590.325441
## [446] 468.029498 2.709556 1245.346891 2769.334784 2809.228412
## [451] 1204.275718 533.714456 805.526935 105.965643 397.037804
## [456] 439.625633 747.702914 1221.322378 343.043446 209.863500
## [461] 322.858774 1844.937289 1925.033336 255.593331 3739.133781
## [466] 1496.786492 155.425263 459.412196 204.567422 690.332708
## [471] 227.539530 549.371260 905.373827 279.259145 166.673487
## [476] 469.938284 1250.101344 2487.964852 81.128605 2549.999661
## [481] 1794.818733 367.557195 658.213094 529.898855 3568.797778
## [486] 881.096235 417.725696 22.017242 275.186483 53.726003
## [491] 147.776363 361.980240 52.369731 674.831088 364.196355
## [496] 2604.991628 509.112451 234.666229 1874.738369 188.459040
## [501] 617.257092 798.140473 552.021632 188.553533 490.497543
## [506] 534.407515 1007.532079 391.784344 135.364841 1520.791840
## [511] 745.444489 3528.006231 614.300479 302.776140 74.215362
## [516] 211.875656 549.206193 1766.659686 41.093638 537.091367
## [521] 903.501727 1829.732020 257.794354 1381.272345 293.232476
## [526] 1175.012253 2568.939034 711.309221 1184.552075 6.028655
## [531] 97.755660 1331.315997 308.688213 131.824678 146.365230
## [536] 781.410902 455.508114 379.826824 134.618136 695.152755
## [541] 1213.276931 532.410430 133.440591 2001.983321 320.404219
## [546] 1170.250051 2114.070397 684.930973 1124.618914 14.136345
## [551] 438.703737 124.806837 83.428616 4.996434 866.291304
## [556] 852.583846 21.091665 324.651706 515.202769 1400.718513
## [561] 75.582941 474.625933 658.144517 2893.491263 1472.177751
## [566] 192.182436 2138.151513 1433.221373 1000.057274 1019.809187
## [571] 508.329448 495.890608 348.470704 1623.210097 2242.110563
## [576] 520.933937 322.983721 1460.348413 637.030423 1307.875150
## [581] 1955.974481 443.830517 1093.425626 2761.306971 1679.944527
## [586] 773.322347 2333.112850 1773.178226 71.365088 901.405176
## [591] 669.761958 3311.668159 423.970767 126.043473 1204.298516
## [596] 3758.975342 1004.143271 6329.395185 530.595827 488.144963
## [601] 2624.274874 1767.452632 660.883535 933.553919 443.954031
## [606] 825.408897 291.575153 769.211852 579.813244 54.685450
## [611] 668.480658 400.075017 904.270746 2449.460180 305.695163
## [616] 1666.805777 301.186655 508.994847 243.238856 1010.774793
## [621] 2614.069609 1189.945310 183.245895 407.654469 798.554091
## [626] 428.677157 332.598491 241.625069 110.785092 413.476079
## [631] 1743.257446 611.276321 759.149751 77.914622 274.617138
## [636] 730.512527 878.776275 287.576882 4616.798276 363.623083
## [641] 1501.415535 790.470771 1021.304295 1077.533873 789.143486
## [646] 503.109558 97.085078 61.554852 69.226595 299.264283
## [651] 1058.193515 4140.787893 1746.133816 2094.500530 995.209842
## [656] 582.159667 46.517959 3.062376 494.028720 169.027709
## [661] 1231.541288 122.155521 3078.218801 148.671122 548.276686
## [666] 1024.716314 466.432748 803.908634 1284.697710 1510.614348
## [671] 267.395396 588.584514 141.409321 59.949127 300.459879
## [676] 129.248310 503.609348 750.504940 2934.327405 945.260709
## [681] 677.895728 1241.819802 62.569126 418.136920 227.511104
## [686] 1402.898390 490.632295 1979.523537 490.255911 2326.926769
## [691] 2931.626405 94.074324 299.778191 840.983081 2895.780080
## [696] 564.493851 933.601004 717.386430 1303.387271 538.258131
## [701] 643.308548 1581.823946 427.106128 1172.108242 1298.264288
## [706] 874.492184 448.781427 524.260449 871.703002 1303.975164
## [711] 1276.613373 1135.871720 375.877846 184.480558 314.512660
## [716] 685.330748 384.828349 624.468053 802.944603 1219.584149
## [721] 89.532431 424.069284 1447.806231 512.587531 33.887431
## [726] 1877.028057 257.975766 1229.575746 265.617213 2395.657700
## [731] 1077.080024 12.199835 316.828845 349.113570 896.602031
## [736] 1346.737723 1357.178606 940.667984 4460.991207 51.492421
## [741] 1231.302512 748.679345 131.634575 1750.192657 804.597404
## [746] 563.586466 1428.696593 1570.122562 1146.335772 844.540812
## [751] 216.656077 2118.486746 792.479468 432.951722 66.728168
## [756] 1759.389321 2430.506266 441.811062 289.869254 1808.086974
## [761] 1629.932306 2153.266469 612.353967 89.681275 1140.243092
## [766] 485.450221 242.312961 3961.129957 640.256936 1172.601415
## [771] 613.195182 873.487945 610.459662 657.266295 1942.151159
## [776] 577.239406 1959.773887 673.333865 187.870959 455.283718
## [781] 1510.953594 514.084782 1449.128793 289.532718 18.954552
## [786] 1043.291894 137.213476 267.481480 1078.199251 584.075609
## [791] 199.433034 713.474388 273.081952 880.309158 152.248331
## [796] 4793.373067 662.414213 2727.459362 790.611687 481.148146
## [801] 349.356319 460.626638 814.789354 28.036447 1895.149387
## [806] 1061.664929 156.678800 331.054175 866.835766 59.945978
## [811] 2426.824349 289.982203 3034.092615 71.842285 633.732436
## [816] 1026.455443 1212.459334 218.424117 427.708476 1282.553031
## [821] 111.663898 1188.969092 1367.325439 889.683458 1812.080871
## [826] 312.538848 1570.618464 1232.442461 495.047009 973.806097
## [831] 2845.799282 157.385979 1087.006554 240.904320 556.045572
## [836] 207.279993 458.459990 1484.103444 2211.373307 1411.397293
## [841] 392.859748 596.200354 231.888649 274.883743 731.252073
## [846] 463.966609 2732.645306 277.323967 99.057260 268.815436
## [851] 1424.147859 1909.222002 1001.454115 912.421325 650.195414
## [856] 1268.173041 345.098678 322.167581 254.013815 991.168024
## [861] 270.448161 573.602752 2199.359445 280.478811 966.706459
## [866] 25.590866 2120.358276 26.596344 667.027259 1005.681518
## [871] 481.281703 563.233363 1604.710494 918.574493 671.615231
## [876] 916.342299 411.540404 794.917576 973.419330 193.096411
## [881] 328.024576 56.184145 583.530971 604.358729 92.518111
## [886] 585.110440 1088.821961 837.076543 728.567421 296.302010
## [891] 791.009248 1909.895937 313.182673 1236.238426 285.762102
## [896] 35.910066 163.623480 1997.648711 2624.436186 4205.653771
## [901] 647.293877 3935.100697 1775.153017 409.380429 2329.177544
## [906] 239.379441 158.012775 2044.806972 1019.789001 204.226792
## [911] 1368.183551 1921.213196 706.169253 947.007529 224.895800
## [916] 1888.086757 1383.502758 909.449046 237.509935 253.734312
## [921] 1082.594779 238.526501 399.586274 536.102202 1132.248701
## [926] 406.574551 285.811300 1026.955612 150.607524 198.809950
## [931] 277.184799 1289.324321 386.360445 458.823898 403.853667
## [936] 945.436488 1439.138886 351.837311 1733.442901 695.886444
## [941] 221.127297 110.113167 1674.940643 3182.401448 86.290249
## [946] 254.031445 1455.589412 2205.059491 474.416758 39.430823
## [951] 226.983798 1310.964746 95.555971 371.952517 382.423173
## [956] 2004.724871 3932.676813 233.901112 1069.302632 143.817907
## [961] 61.810901 1967.666767 1887.551232 599.866539 2414.830088
## [966] 9.940301 400.920815 853.069228 190.924911 845.887877
## [971] 35.519694 162.896755 877.728322 1168.463498 323.549886
## [976] 1788.579785 360.189372 1164.298247 260.123163 1109.061155
## [981] 3110.758118 38.132411 2549.114873 2373.207455 2397.144076
## [986] 2100.687756 665.622277 916.416888 1972.688506 398.936647
## [991] 336.151156 1522.735783 1592.556400 1095.115284 1730.120959
## [996] 1737.159803 210.488797 188.552135 882.524874 616.501849
Plot a histogram and compare it with a histogram of your original variable.
par(mfrow = c(1, 2))
hist(ames_train$TotalBsmtSF, breaks = 50, main = "Original Variable - Total Basement Sqf", col = "pink")
hist(optimalval, breaks = 50, xlim = c(0, 6000), main = "Modified Variable - Total Basement Sqf",
     col = "blue")
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
## [1] 53.27368
## [1] 3111.395
Confidence Interval
## upper mean lower
## 1079.951 1057.429 1034.908
Empirical Percentile
## 5%
## 519.3
## 95%
## 1753
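The empirical 5th and 95th percentiles of the original data are 519.3 and 1753, respectively, a much narrower range than the 53.27 and 3111.40 obtained from the exponential distribution. Assuming normality, the 95% confidence interval for the mean of TotalBsmtSF is (1034.91, 1079.95).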
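The three quantities above can be reproduced along the following lines. This is a minimal sketch, not the original code: the vector `bsmt` is simulated stand-in data, and the rate is assumed to be the reciprocal of the sample mean, which may differ from the rate fitted in this report.

```r
# Stand-in data (hypothetical); in the report this would be ames_train$TotalBsmtSF
set.seed(42)
bsmt <- rexp(1460, rate = 1 / 1000)

# Exponential percentiles from the inverse CDF: x_p = -log(1 - p) / rate
rate <- 1 / mean(bsmt)                         # assumption: rate = 1 / sample mean
exp_pct <- qexp(c(0.05, 0.95), rate = rate)
manual_pct <- -log(1 - c(0.05, 0.95)) / rate   # same values, written out by hand

# 95% confidence interval for the mean, assuming normality
n <- length(bsmt)
se <- sd(bsmt) / sqrt(n)
ci <- mean(bsmt) + c(lower = -1, upper = 1) * qnorm(0.975) * se

# Empirical 5th and 95th percentiles of the data
emp_pct <- quantile(bsmt, probs = c(0.05, 0.95))

print(exp_pct)
print(ci)
print(emp_pct)
```

Note that the confidence interval describes the uncertainty in the mean, so it is expected to be far narrower than the spread between the 5th and 95th percentiles of the data.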
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Multilinear Regression Model - Stepwise All variables
ames <- data.frame(
  ID = ames_train$Id,
  MSSubClass = ames_train$MSSubClass,
  GrLivArea = ames_train$GrLivArea,
  BldgType = ames_train$BldgType,
  GarageArea = ames_train$GarageArea,
  LotArea = ames_train$LotArea,
  TotalBsmtSF = ames_train$TotalBsmtSF,
  BsmtFinSF1 = ames_train$BsmtFinSF1,
  Age = ames_train$YrSold - ames_train$YearBuilt,
  OverallQuality = ames_train$OverallQual,
  OverallCondition = ames_train$OverallCond,
  SalePrice = ames_train$SalePrice
)
head(ames)
## ID MSSubClass GrLivArea BldgType GarageArea LotArea TotalBsmtSF
## 1 1 60 1710 1Fam 548 8450 856
## 2 2 20 1262 1Fam 460 9600 1262
## 3 3 60 1786 1Fam 608 11250 920
## 4 4 70 1717 1Fam 642 9550 756
## 5 5 60 2198 1Fam 836 14260 1145
## 6 6 50 1362 1Fam 480 14115 796
## BsmtFinSF1 Age OverallQuality OverallCondition SalePrice
## 1 706 5 7 5 208500
## 2 978 31 6 8 181500
## 3 486 7 7 5 223500
## 4 216 91 7 5 140000
## 5 655 8 8 5 250000
## 6 732 16 5 5 143000
Model1 <- lm(SalePrice ~ ., data = ames, na.action = na.exclude)
summary(Model1)
##
## Call:
## lm(formula = SalePrice ~ ., data = ames, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -541628 -16397 -3502 13125 275070
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.013e+04 7.905e+03 -8.871 < 2e-16 ***
## ID -1.813e+00 2.259e+00 -0.802 0.4224
## MSSubClass -2.259e+02 5.717e+01 -3.951 8.15e-05 ***
## GrLivArea 5.620e+01 2.862e+00 19.639 < 2e-16 ***
## BldgType2fmCon 2.182e+04 1.033e+04 2.112 0.0349 *
## BldgTypeDuplex -8.960e+03 5.938e+03 -1.509 0.1315
## BldgTypeTwnhs -1.414e+03 8.675e+03 -0.163 0.8705
## BldgTypeTwnhsE 1.020e+04 6.722e+03 1.518 0.1292
## GarageArea 3.528e+01 5.926e+00 5.953 3.30e-09 ***
## LotArea 4.644e-01 1.035e-01 4.486 7.83e-06 ***
## TotalBsmtSF 8.185e+00 3.564e+00 2.297 0.0218 *
## BsmtFinSF1 1.841e+01 2.499e+00 7.367 2.92e-13 ***
## Age -4.740e+02 4.645e+01 -10.203 < 2e-16 ***
## OverallQuality 2.101e+04 1.164e+03 18.055 < 2e-16 ***
## OverallCondition 5.372e+03 9.582e+02 5.606 2.47e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36250 on 1445 degrees of freedom
## Multiple R-squared: 0.7938, Adjusted R-squared: 0.7918
## F-statistic: 397.3 on 14 and 1445 DF, p-value: < 2.2e-16
Metrics 1
library(modelr) # provides rsquare(), rmse() and mae()
Metrics1 <- data.frame(
  R2 = rsquare(Model1, data = ames),
  RMSE = rmse(Model1, data = ames),
  MAE = mae(Model1, data = ames)
)
print(Metrics1)
## R2 RMSE MAE
## 1 0.7938015 36061.76 21645.03
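For reference, the three metrics can also be computed directly from the residuals. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption, since the Ames data is not bundled with R) and follows the usual definitions. RMSE divides the residual sum of squares by n, whereas the residual standard error in `summary()` divides by the residual degrees of freedom, which is why the RMSE of 36061.76 is slightly smaller than the residual standard error of 36250 reported above.

```r
# Stand-in model on built-in data (hypothetical example, not the Ames model)
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# R-squared: 1 - residual sum of squares / total sum of squares
r2 <- 1 - sum(res^2) / sum((mtcars$mpg - mean(mtcars$mpg))^2)

# Root mean squared error and mean absolute error
rmse_val <- sqrt(mean(res^2))
mae_val <- mean(abs(res))

# Residual standard error, as printed by summary(): divides by residual df
rse <- sqrt(sum(res^2) / df.residual(fit))

print(data.frame(R2 = r2, RMSE = rmse_val, MAE = mae_val, RSE = rse))
```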
Multilinear Regression Model - Eight Significant Variables
Model2 <- lm(SalePrice ~ MSSubClass + GrLivArea + GarageArea + LotArea + BsmtFinSF1 + Age + OverallQuality + OverallCondition, data = ames, na.action = na.exclude)
summary(Model2)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + GrLivArea + GarageArea +
## LotArea + BsmtFinSF1 + Age + OverallQuality + OverallCondition,
## data = ames, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -526259 -17263 -3378 14118 278763
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.251e+04 7.467e+03 -9.711 < 2e-16 ***
## MSSubClass -1.826e+02 2.333e+01 -7.830 9.35e-15 ***
## GrLivArea 5.499e+01 2.499e+00 22.008 < 2e-16 ***
## GarageArea 3.751e+01 5.919e+00 6.338 3.10e-10 ***
## LotArea 5.238e-01 1.025e-01 5.109 3.66e-07 ***
## BsmtFinSF1 2.193e+01 2.257e+00 9.717 < 2e-16 ***
## Age -4.589e+02 4.536e+01 -10.118 < 2e-16 ***
## OverallQuality 2.245e+04 1.095e+03 20.502 < 2e-16 ***
## OverallCondition 4.906e+03 9.497e+02 5.166 2.73e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36490 on 1451 degrees of freedom
## Multiple R-squared: 0.7902, Adjusted R-squared: 0.789
## F-statistic: 683 on 8 and 1451 DF, p-value: < 2.2e-16
Metrics 2
Metrics2 <- data.frame(
R2 = rsquare(Model2, data = ames),
RMSE = rmse(Model2, data = ames),
MAE = mae(Model2, data = ames)
)
print(Metrics2)
## R2 RMSE MAE
## 1 0.7901543 36379.28 22198.86
Multilinear Regression Model - Six Positively Significant Variables
Model3 <- lm(SalePrice ~ GrLivArea + GarageArea + LotArea + BsmtFinSF1 + OverallQuality + OverallCondition, data = ames, na.action = na.exclude)
summary(Model3)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + GarageArea + LotArea + BsmtFinSF1 +
## OverallQuality + OverallCondition, data = ames, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -534778 -18236 -564 14485 284868
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.110e+05 7.142e+03 -15.544 < 2e-16 ***
## GrLivArea 4.529e+01 2.491e+00 18.182 < 2e-16 ***
## GarageArea 5.846e+01 5.965e+00 9.801 < 2e-16 ***
## LotArea 5.906e-01 1.066e-01 5.541 3.57e-08 ***
## BsmtFinSF1 2.551e+01 2.351e+00 10.850 < 2e-16 ***
## OverallQuality 2.780e+04 9.920e+02 28.023 < 2e-16 ***
## OverallCondition 1.537e+03 9.137e+02 1.683 0.0927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38370 on 1453 degrees of freedom
## Multiple R-squared: 0.7676, Adjusted R-squared: 0.7667
## F-statistic: 800.1 on 6 and 1453 DF, p-value: < 2.2e-16
Metrics 3
Metrics3 <- data.frame(
R2 = rsquare(Model3, data = ames),
RMSE = rmse(Model3, data = ames),
MAE = mae(Model3, data = ames)
)
print(Metrics3)
## R2 RMSE MAE
## 1 0.7676475 38280.52 23848.4
Comparing the OLS Model Fit
ModelName <- c("Model1", "Model2", "Model3")
Model_RSquared <- c("79%", "79%", "77%")
Model_RMSE <- c("36061.76", "36379.28", "38280.52")
Model_FStatistic <- c("397.3", "683", "800.1")
Model_Performance <- data.frame(ModelName, Model_RSquared, Model_RMSE, Model_FStatistic)
Model_Performance
## ModelName Model_RSquared Model_RMSE Model_FStatistic
## 1 Model1 79% 36061.76 397.3
## 2 Model2 79% 36379.28 683
## 3 Model3 77% 38280.52 800.1
R-squared measures the strength of the relationship between your model and the dependent variable. The F-test of overall significance is the hypothesis test for this relationship. If the overall F-test is significant, we can conclude that R-squared does not equal zero, and the correlation between the model and dependent variable is statistically significant.
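The link between the two statistics can be made explicit. For a model with k predictors and n observations, the overall F-statistic is \(F = \dfrac{R^2/k}{(1-R^2)/(n-k-1)}\). The short sketch below applies this standard identity to the \(R^2\) values and degrees of freedom reported in the model summaries above; the helper name `f_from_r2` is illustrative only.

```r
# Overall F-statistic from R-squared, number of predictors k, and residual df
f_from_r2 <- function(r2, k, df_resid) {
  (r2 / k) / ((1 - r2) / df_resid)
}

# Values copied from the three model summaries in this report
models <- data.frame(
  model = c("Model1", "Model2", "Model3"),
  r2 = c(0.7938015, 0.7901543, 0.7676475),
  k = c(14, 8, 6),
  df_resid = c(1445, 1451, 1453)
)
models$F <- with(models, f_from_r2(r2, k, df_resid))
print(models)
```

Because k divides \(R^2\) in the numerator, a model with fewer predictors can have a larger F-statistic even when its \(R^2\) is lower, which is the pattern seen for Model3.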
The F-statistics for all three models are significant. As noted by Sohrabi (2015), the higher the F value, the better the model. Although Model3 has the highest F-statistic, its \(R^2\) is lower than that of Model1 and Model2.
The \(R^2\) values for Model1 and Model2 are nearly identical (both about 79%), but Model2 has a higher F-statistic than Model1. The \(R^2\) of Model2 shows that about 79% of the variation in sale price is explained by this model.
Therefore, I select Model2, with eight variables, to predict housing sale prices for this task.
The prediction plot does not look bad, although these are not the best predictions I would expect from this model given the significance of its variables.
# Derive and rename variables to match the training data set, and add an empty SalePrice column
ames_test$Age <- ames_test$YrSold - ames_test$YearBuilt
ames_test$OverallQuality <- ames_test$OverallQual
ames_test$OverallCondition <- ames_test$OverallCond
ames_test <- ames_test %>% mutate(SalePrice = NA) # These are the values to be predicted.
Making predictions using the evaluation Dataset
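A minimal sketch of the prediction step is given below. It is not the original code: the small simulated data frames, the output file name, and the choice to fill missing predictions with the median are all assumptions made for illustration. The `Id` and `SalePrice` column names follow the competition's submission format.

```r
# Simulated stand-ins for the training and test sets (hypothetical data)
set.seed(7)
train <- data.frame(
  GrLivArea = runif(50, 800, 3000),
  Age = sample(0:100, 50, replace = TRUE)
)
train$SalePrice <- 50000 + 60 * train$GrLivArea - 400 * train$Age +
  rnorm(50, sd = 5000)
test <- data.frame(
  Id = 1461:1465,
  GrLivArea = c(900, 1500, NA, 2200, 1800),
  Age = c(10, 45, 30, 5, 80)
)

fit <- lm(SalePrice ~ GrLivArea + Age, data = train)

# predict() returns NA for rows where a predictor is missing
pred <- unname(predict(fit, newdata = test))

# Assumption: fill missing predictions with the median of the others
pred[is.na(pred)] <- median(pred, na.rm = TRUE)

# Write a two-column submission file
submission <- data.frame(Id = test$Id, SalePrice = pred)
out_file <- file.path(tempdir(), "submission.csv")
write.csv(submission, out_file, row.names = FALSE)
```

Checking the predictions for missing values before writing the file matters because any test row with a missing predictor would otherwise leave a blank in the submission.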
The data were not a good fit for a multiple linear regression model: the \(R^2\) did not exceed 79% for any of the three models I built. Another option would be to introduce a penalty, as in ridge regression, which shrinks the coefficients and may improve predictive performance.
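To illustrate what such a penalty does, here is a minimal base-R sketch of ridge regression in closed form, \(\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y\), on standardized predictors. It was not part of this analysis: the data are simulated, and the helper `ridge_fit` and the value of lambda are hypothetical choices. In practice lambda would be chosen by cross-validation.

```r
# Closed-form ridge coefficients on standardized predictors
ridge_fit <- function(X, y, lambda) {
  Xs <- scale(X)        # center and scale each predictor
  yc <- y - mean(y)     # center the response so the intercept is not penalized
  p <- ncol(Xs)
  beta <- solve(crossprod(Xs) + lambda * diag(p), crossprod(Xs, yc))
  drop(beta)
}

# Simulated data (hypothetical): two informative predictors and one noise column
set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
y <- 2 * X[, "a"] - 1 * X[, "b"] + rnorm(200)

ols_beta <- ridge_fit(X, y, lambda = 0)     # lambda = 0 reduces to least squares
ridge_beta <- ridge_fit(X, y, lambda = 50)  # coefficients shrink toward zero

print(rbind(ols = ols_beta, ridge = ridge_beta))
```

The penalty trades a small increase in bias for lower variance, so it tends to help most when predictors are correlated, as several of the area variables here are likely to be.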
I selected the second model to predict housing sale prices, using eight independent variables. Two of the eight variables (MSSubClass and Age) have negative coefficients; dropping these two variables did not improve the model, as the lower \(R^2\) of Model3 shows.
Using the test data provided, the model predicted new sale prices from the significant variables identified above. The predictions are saved as results in a CSV file and can be viewed directly.
Reference
Sohrabi, H. (2015). Re: Can I use the F-test value to determine the best model in regression? Retrieved from https://www.researchgate.net/post/Can_I_use_the_F-test_value_to_determine_the_best_model_in_regression/54b417aad4c11849278b4578/citation/download