Data 605 Final Exam: Computational Mathematics
Please refer to the Final Exam Document.
1 YouTube Presentation
2 Problem 1
2.1 Part 1
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu = \sigma =(N+1)/2\).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.002 3.186 5.489 5.476 7.718 9.999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -14.677 1.720 5.419 5.450 9.122 26.899
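The code that produced these two summaries is not shown. A minimal sketch follows, assuming N = 10 (consistent with the summaries above, where X ranges from about 1 to 10 and Y has a mean near 5.5) and a seed chosen here only for reproducibility; the original seed is unknown.

```r
set.seed(605)   # assumed seed, not taken from the original analysis
N <- 10         # assumed value, inferred from the summaries above
n <- 10000

# X: 10,000 uniform draws on [1, N]
X <- runif(n, min = 1, max = N)

# Y: 10,000 normal draws with mean = sd = (N + 1) / 2
mu <- (N + 1) / 2
Y <- rnorm(n, mean = mu, sd = mu)

summary(X)
summary(Y)
```

Because the seed is an assumption, the printed summaries will differ slightly from the ones above.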
2.2 Part 2
Probability: Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
5 points. a. P(X>x | X>y) b. P(X>x, Y>y) c. P(X<x | X>y)
5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
## [1] 5.489
## [1] 1.72
2.2.1 Part 2a
\[P(X>x|X>y) = \frac{P(X>x,X>y)}{P(X>y)}\]
## [1] 0.5453157
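The formula above can be evaluated by counting observations. This is a sketch under the same assumptions as before (N = 10 and an arbitrary seed), so the value will be close to, but not identical to, the one reported.

```r
set.seed(605)                          # assumed seed
N <- 10                                # assumed N
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)

x <- median(X)                         # "x" = median of X
y <- unname(quantile(Y, 0.25))         # "y" = 1st quartile of Y

# P(X > x | X > y) = P(X > x, X > y) / P(X > y); the 1/n factors cancel
p_a <- sum(X > x & X > y) / sum(X > y)
p_a
```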
2.2.2 Part 2b P(X>x, Y>y)
We know that x is the median of X, so P(X>x) is about 0.5.
Also, y is the 1st quartile of Y, so P(Y>y) is about 0.75.
Because X and Y are generated independently of each other, \(P(X>x, Y>y) = P(X>x)\cdot P(Y>y) = 0.5 \cdot 0.75 = 0.375\)
## [1] 0.375
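The product rule can also be compared with the empirical joint proportion. The sketch below again assumes N = 10 and an arbitrary seed, neither of which comes from the original code.

```r
set.seed(605)                          # assumed seed
N <- 10                                # assumed N
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)
y <- unname(quantile(Y, 0.25))

# Empirical joint probability: share of pairs with X > x and Y > y
p_joint <- mean(X > x & Y > y)

# Product of the empirical marginal probabilities
p_product <- mean(X > x) * mean(Y > y)

c(joint = p_joint, product = p_product)
```

The two numbers should agree to within sampling noise when X and Y are independent.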
2.2.3 Part 2c P(X<x | X>y)
\[P(X<x|X>y) = \frac{P(X<x,X>y)}{P(X>y)}\]
Because X is continuous, P(X=x) = 0, so Part c = 1 - Part a. The calculation below confirms this.
## [1] 0.4546843
## [1] 1
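An equivalent way to obtain both conditional probabilities is to subset X to the conditioning event first and then take proportions. This is a sketch with an assumed N = 10 and an assumed seed.

```r
set.seed(605)                          # assumed seed
N <- 10                                # assumed N
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)
y <- unname(quantile(Y, 0.25))

X_given <- X[X > y]                    # keep only observations with X > y

p_a <- mean(X_given > x)               # P(X > x | X > y)
p_c <- mean(X_given < x)               # P(X < x | X > y)

p_a + p_c                              # complements, so the sum should be 1
```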
2.2.4 Part 2d
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
table <- data.frame(matrix(
c(sum(X>x & Y>y)/10000, sum(X<x & Y>y)/10000, sum(Y>y)/10000,
sum(X>x & Y<y)/10000, sum(X<x & Y<y)/10000, sum(Y<y)/10000,
sum(X>x)/10000, sum(X<x)/10000, 1.00),
ncol=3, byrow=TRUE))
colnames(table) <- c("X>x", "X<x", "Total")
rownames(table) <- c("Y>y", "Y<y", "Total")
table
The joint probability P(X>x, Y>y) in the table is 0.3756, which is close to 0.375, the product of the marginal probabilities from Part 2b.
The difference of 0.0006 is small, so I conclude that \(P(X>x\;and\;Y>y) = P(X>x)P(Y>y)\) holds approximately, which is consistent with X and Y being independent.
2.2.5 Part 2e
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Fisher’s Exact Test is a statistical test used to determine whether there is a nonrandom association between two categorical variables. It computes an exact p-value and is preferred when the sample is small, e.g. when some cells of the contingency table have expected counts below 5.
The Chi-Square Test is also used to test for a relationship between categorical variables. It relies on a large-sample approximation, so it suits large samples and conventionally requires an expected count of at least 5 in each cell of the contingency table.
The null hypothesis for both tests is that no relationship exists between the categorical variables in the population, i.e. they are independent. If we reject the null hypothesis, we conclude that the variables are likely dependent.
table2 <- data.frame(matrix(
c(sum(X>x & Y>y), sum(X<x & Y>y),
sum(X>x & Y<y), sum(X<x & Y<y)),
ncol=2, byrow=TRUE))
colnames(table2) <- c("X>x", "X<x")
rownames(table2) <- c("Y>y", "Y<y")
table2
##
## Fisher's Exact Test for Count Data
##
## data: table2
## p-value = 0.7995
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9242273 1.1100187
## sample estimates:
## odds ratio
## 1.012883
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table2
## X-squared = 0.064533, df = 1, p-value = 0.7995
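The calls that produced the two test outputs above are not shown. A sketch of how they might look with base R's `fisher.test()` and `chisq.test()` follows, using simulated data with an assumed N = 10 and an assumed seed, so the p-values will not match the ones above.

```r
set.seed(605)                          # assumed seed
N <- 10                                # assumed N
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)
y <- unname(quantile(Y, 0.25))

# 2x2 contingency table of counts
counts <- table(Y_gt_y = Y > y, X_gt_x = X > x)

fisher_res <- fisher.test(counts)      # exact test
chisq_res  <- chisq.test(counts)       # Yates' correction is the default for 2x2 tables

c(fisher = fisher_res$p.value, chisq = chisq_res$p.value)
```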
The p-value from Fisher’s Exact Test is 0.7995, which is greater than 0.05, so we fail to reject the null hypothesis: there is no evidence that the variables are dependent.
The p-value from the Chi-Square Test is also 0.7995, which leads to the same conclusion.
With a large sample (10,000 observations) and more than 1,000 counts in every cell of table2, the large-sample approximation behind the Chi-Square Test is reliable, so the Chi-Square Test is the more appropriate choice for this question.
3 Problem 2
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. I want you to do the following.
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
3.1 Data Description
Please refer to the data description file.
3.2 Read Data
train <- read.csv("https://raw.githubusercontent.com/shirley-wong/Data-605/master/FinalExam/Problem%202/train.csv")
test <- read.csv("https://raw.githubusercontent.com/shirley-wong/Data-605/master/FinalExam/Problem%202/test.csv")
dim(train)
## [1] 1460 81
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 C (all): 10 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 FV : 65 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 RH : 16 Median : 69.00
## Mean : 730.5 Mean : 56.9 RL :1151 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 RM : 218 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape LandContour
## Min. : 1300 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63
## 1st Qu.: 7554 Pave:1454 Pave: 41 IR2: 41 HLS: 50
## Median : 9478 NA's:1369 IR3: 10 Low: 36
## Mean : 10517 Reg:925 Lvl:1311
## 3rd Qu.: 11602
## Max. :215245
##
## Utilities LotConfig LandSlope Neighborhood Condition1
## AllPub:1459 Corner : 263 Gtl:1382 NAmes :225 Norm :1260
## NoSeWa: 1 CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48
## FR3 : 4 Edwards:100 RRAn : 26
## Inside :1052 Somerst: 86 PosN : 19
## Gilbert: 79 RRAe : 11
## (Other):707 (Other): 15
## Condition2 BldgType HouseStyle OverallQual
## Norm :1445 1Fam :1220 1Story :726 Min. : 1.000
## Feedr : 6 2fmCon: 31 2Story :445 1st Qu.: 5.000
## Artery : 2 Duplex: 52 1.5Fin :154 Median : 6.000
## PosN : 2 Twnhs : 43 SLvl : 65 Mean : 6.099
## RRNn : 2 TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000
## PosA : 1 1.5Unf : 14 Max. :10.000
## (Other): 2 (Other): 19
## OverallCond YearBuilt YearRemodAdd RoofStyle
## Min. :1.000 Min. :1872 Min. :1950 Flat : 13
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 Gable :1141
## Median :5.000 Median :1973 Median :1994 Gambrel: 11
## Mean :5.575 Mean :1971 Mean :1985 Hip : 286
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 Mansard: 7
## Max. :9.000 Max. :2010 Max. :2010 Shed : 2
##
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea
## CompShg:1434 VinylSd:515 VinylSd:504 BrkCmn : 15 Min. : 0.0
## Tar&Grv: 11 HdBoard:222 MetalSd:214 BrkFace:445 1st Qu.: 0.0
## WdShngl: 6 MetalSd:220 HdBoard:207 None :864 Median : 0.0
## WdShake: 5 Wd Sdng:206 Wd Sdng:197 Stone :128 Mean : 103.7
## ClyTile: 1 Plywood:108 Plywood:142 NA's : 8 3rd Qu.: 166.0
## Membran: 1 CemntBd: 61 CmentBd: 60 Max. :1600.0
## (Other): 2 (Other):128 (Other):136 NA's :8
## ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## Ex: 52 Ex: 3 BrkTil:146 Ex :121 Fa : 45 Av :221
## Fa: 14 Fa: 28 CBlock:634 Fa : 35 Gd : 65 Gd :134
## Gd:488 Gd: 146 PConc :647 Gd :618 Po : 2 Mn :114
## TA:906 Po: 1 Slab : 24 TA :649 TA :1311 No :953
## TA:1282 Stone : 6 NA's: 37 NA's: 37 NA's: 38
## Wood : 3
##
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## ALQ :220 Min. : 0.0 ALQ : 19 Min. : 0.00
## BLQ :148 1st Qu.: 0.0 BLQ : 33 1st Qu.: 0.00
## GLQ :418 Median : 383.5 GLQ : 14 Median : 0.00
## LwQ : 74 Mean : 443.6 LwQ : 46 Mean : 46.55
## Rec :133 3rd Qu.: 712.2 Rec : 54 3rd Qu.: 0.00
## Unf :430 Max. :5644.0 Unf :1256 Max. :1474.00
## NA's: 37 NA's: 38
## BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## Min. : 0.0 Min. : 0.0 Floor: 1 Ex:741 N: 95
## 1st Qu.: 223.0 1st Qu.: 795.8 GasA :1428 Fa: 49 Y:1365
## Median : 477.5 Median : 991.5 GasW : 18 Gd:241
## Mean : 567.2 Mean :1057.4 Grav : 7 Po: 1
## 3rd Qu.: 808.0 3rd Qu.:1298.2 OthW : 2 TA:428
## Max. :2336.0 Max. :6110.0 Wall : 4
##
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## FuseA: 94 Min. : 334 Min. : 0 Min. : 0.000
## FuseF: 27 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## FuseP: 3 Median :1087 Median : 0 Median : 0.000
## Mix : 1 Mean :1163 Mean : 347 Mean : 5.845
## SBrkr:1334 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## NA's : 1 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## Min. :0.0000 Min. :0.000 Min. :0.000 Ex:100
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Fa: 39
## Median :0.0000 Median :3.000 Median :1.000 Gd:586
## Mean :0.3829 Mean :2.866 Mean :1.047 TA:735
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :2.0000 Max. :8.000 Max. :3.000
##
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType
## Min. : 2.000 Maj1: 14 Min. :0.000 Ex : 24 2Types : 6
## 1st Qu.: 5.000 Maj2: 5 1st Qu.:0.000 Fa : 33 Attchd :870
## Median : 6.000 Min1: 31 Median :1.000 Gd :380 Basment: 19
## Mean : 6.518 Min2: 34 Mean :0.613 Po : 20 BuiltIn: 88
## 3rd Qu.: 7.000 Mod : 15 3rd Qu.:1.000 TA :313 CarPort: 9
## Max. :14.000 Sev : 1 Max. :3.000 NA's:690 Detchd :387
## Typ :1360 NA's : 81
## GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## Min. :1900 Fin :352 Min. :0.000 Min. : 0.0 Ex : 3
## 1st Qu.:1961 RFn :422 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48
## Median :1980 Unf :605 Median :2.000 Median : 480.0 Gd : 14
## Mean :1979 NA's: 81 Mean :1.767 Mean : 473.0 Po : 3
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 TA :1311
## Max. :2010 Max. :4.000 Max. :1418.0 NA's: 81
## NA's :81
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch
## Ex : 2 N: 90 Min. : 0.00 Min. : 0.00 Min. : 0.00
## Fa : 35 P: 30 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Gd : 9 Y:1340 Median : 0.00 Median : 25.00 Median : 0.00
## Po : 7 Mean : 94.24 Mean : 46.66 Mean : 21.95
## TA :1326 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00
## NA's: 81 Max. :857.00 Max. :547.00 Max. :552.00
##
## X3SsnPorch ScreenPorch PoolArea PoolQC
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Ex : 2
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2
## Median : 0.00 Median : 0.00 Median : 0.000 Gd : 3
## Mean : 3.41 Mean : 15.06 Mean : 2.759 NA's:1453
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :508.00 Max. :480.00 Max. :738.000
##
## Fence MiscFeature MiscVal MoSold
## GdPrv: 59 Gar2: 2 Min. : 0.00 Min. : 1.000
## GdWo : 54 Othr: 2 1st Qu.: 0.00 1st Qu.: 5.000
## MnPrv: 157 Shed: 49 Median : 0.00 Median : 6.000
## MnWw : 11 TenC: 1 Mean : 43.49 Mean : 6.322
## NA's :1179 NA's:1406 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :15500.00 Max. :12.000
##
## YrSold SaleType SaleCondition SalePrice
## Min. :2006 WD :1267 Abnorml: 101 Min. : 34900
## 1st Qu.:2007 New : 122 AdjLand: 4 1st Qu.:129975
## Median :2008 COD : 43 Alloca : 12 Median :163000
## Mean :2008 ConLD : 9 Family : 20 Mean :180921
## 3rd Qu.:2009 ConLI : 5 Normal :1198 3rd Qu.:214000
## Max. :2010 ConLw : 5 Partial: 125 Max. :755000
## (Other): 9
We have 81 variables and 1460 observations in the training set, where SalePrice is the response variable.
Check Data Type
From the data description, we know that MSSubClass is a categorical variable that identifies the type of dwelling involved in the sale. We therefore convert it to a factor in both the training set and the testing set.
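The conversion code is not shown. A minimal sketch using `as.factor()` follows; the two small data frames are hypothetical stand-ins for `train` and `test`, with illustrative values only.

```r
# Hypothetical stand-ins for the Kaggle training and testing sets
train_toy <- data.frame(MSSubClass = c(20, 60, 20, 50),
                        SalePrice  = c(200000, 180000, 220000, 140000))
test_toy  <- data.frame(MSSubClass = c(20, 120))

# Treat the numeric dwelling codes as categories rather than quantities
train_toy$MSSubClass <- as.factor(train_toy$MSSubClass)
test_toy$MSSubClass  <- as.factor(test_toy$MSSubClass)

str(train_toy)
```

After the conversion, `summary()` reports counts per dwelling code instead of quartiles.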
3.3 Descriptive and Inferential Statistics
3.3.1 Plots
- summary of training set
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 20 :536 C (all): 10 Min. : 21.00
## 1st Qu.: 365.8 60 :299 FV : 65 1st Qu.: 59.00
## Median : 730.5 50 :144 RH : 16 Median : 69.00
## Mean : 730.5 120 : 87 RL :1151 Mean : 70.05
## 3rd Qu.:1095.2 30 : 69 RM : 218 3rd Qu.: 80.00
## Max. :1460.0 160 : 63 Max. :313.00
## (Other):262 NA's :259
SalePrice
SalePrice is our response variable. From the histogram, we can see that it is right skewed, with most houses sold below $200,000 and some between $200,000 and $400,000.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
BldgType
Most buildings are single-family detached homes, with only a few two-family conversions, duplexes, townhouse end units and townhouse inside units.
## 1Fam 2fmCon Duplex Twnhs TwnhsE
## 1220 31 52 43 114
HouseStyle
Most houses are either one-story or two-story.
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer SLvl
## 154 14 726 8 11 445 37 65
OverallQual
Most houses are rated between average and good in overall quality (ratings 5 to 7).
## 1 2 3 4 5 6 7 8 9 10
## 2 3 20 116 397 374 319 168 43 18
OverallCond
The overall condition of most houses is average (rating 5).
## 1 2 3 4 5 6 7 8 9
## 1 5 25 57 821 252 205 72 22
YearBuilt
Most of the houses were built after 1950; the median year built is 1973.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1872 1954 1973 1971 2000 2010
3.3.2 Scatterplot matrix and correlation matrix
From the plots below, OverallQual is highly correlated with our dependent variable SalePrice, with a correlation coefficient of 0.79. The independent variables GrLivArea and FullBath are also correlated with SalePrice, with correlation coefficients of 0.71 and 0.56 respectively. These values make sense: a larger above-ground living area and more full bathrooms indicate a bigger house, and bigger houses tend to sell for higher prices.
#scatterplot matrix
pairs(train[,c("LotArea", "OverallQual", "OverallCond", "YearBuilt", "GrLivArea", "FullBath", "SalePrice")])
#correlation matrix
pairs.panels(train[,c("LotArea", "OverallQual", "OverallCond", "YearBuilt", "GrLivArea", "FullBath", "SalePrice")])
o_g <- cor(train$OverallQual, train$GrLivArea)
g_s <- cor(train$GrLivArea, train$SalePrice)
o_s <- cor(train$SalePrice, train$OverallQual)
#correlation matrix of three quantitative variables
cm <- data.frame(matrix(
c(1.0, o_g, o_s, o_g, 1.0, g_s, o_s, g_s, 1.0),
ncol=3, byrow=TRUE))
colnames(cm) <- c("OverallQual", "GrLivArea", "SalePrice")
rownames(cm) <- c("OverallQual", "GrLivArea", "SalePrice")
cm
3.3.3 Test the hypothesis
Limit to three quantitative variables. Test the hypothesis that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
#limit to three quantitative variables
pairs.panels(train[,c("OverallQual", "GrLivArea", "SalePrice")])
SalePrice vs GrLivArea
The p-value is below 2.2e-16 and the 80% confidence interval does not include 0, so we reject the null hypothesis that the true correlation equals zero.
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
SalePrice vs OverallQual
The p-value is below 2.2e-16 and the 80% confidence interval does not include 0, so we reject the null hypothesis that the true correlation equals zero.
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
GrLivArea vs OverallQual
The p-value is below 2.2e-16 and the 80% confidence interval does not include 0, so we reject the null hypothesis that the true correlation equals zero.
##
## Pearson's product-moment correlation
##
## data: train$GrLivArea and train$OverallQual
## t = 28.121, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5708061 0.6143422
## sample estimates:
## cor
## 0.5930074
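The outputs above have the form produced by `cor.test()` with `conf.level = 0.8`. The sketch below shows that call on simulated data; the variables `area` and `price` are hypothetical stand-ins, not the Kaggle columns.

```r
set.seed(1)                                      # arbitrary seed for the illustration
area  <- rnorm(200, mean = 1500, sd = 500)       # hypothetical living area
price <- 100 * area + rnorm(200, sd = 40000)     # hypothetical sale price

# Pearson correlation test with an 80% confidence interval
res <- cor.test(price, area, method = "pearson", conf.level = 0.8)

res$estimate    # sample correlation
res$conf.int    # 80% confidence interval
res$p.value     # p-value for H0: true correlation = 0
```

For the real data, the same call would be applied to each pair among `train$SalePrice`, `train$GrLivArea` and `train$OverallQual`.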
Although these variables are strongly correlated with one another, correlation does not imply causation. Still, it is plausible that larger, higher-quality houses sell at higher prices, so we can expect these independent variables to help explain the response variable.
The familywise error rate (FWE or FWER) is the probability of reaching at least one false conclusion in a series of hypothesis tests, i.e. the probability of making at least one Type I error. It is also called alpha inflation or cumulative Type I error.
\(FWE \leq 1-(1-\alpha)^{c}\) where \(\alpha\) is the alpha level for an individual test (e.g. 0.05) and \(c\) is the number of comparisons/tests.
For this question, \(\alpha = 0.2\) for an 80% confidence interval and \(c=3\) for the three pairwise correlation tests.
\[FWE \leq 1-(1-0.2)^{3} = 0.488\]
This means that, if all three null hypotheses were true, we would have up to about a 48.8% chance of making at least one Type I error across the three tests.
I would not be worried here: all three p-values are below 2.2e-16, so the conclusions would not change even at a much smaller alpha. In general, we can reduce the alpha of each test, i.e. raise the confidence level, to lower the familywise error rate.
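The bound can be checked numerically. The sketch below also shows a Bonferroni-style adjustment (dividing alpha by the number of tests); that adjustment is added here for illustration and is not part of the original analysis.

```r
alpha   <- 0.2    # per-test alpha for an 80% confidence level
n_tests <- 3      # three pairwise correlation tests

# Familywise error bound from the formula above
fwer <- 1 - (1 - alpha)^n_tests
fwer              # 0.488

# Bonferroni-style adjustment: split alpha across the tests
alpha_bonf <- alpha / n_tests
fwer_bonf  <- 1 - (1 - alpha_bonf)^n_tests
fwer_bonf         # stays below the nominal 0.2
```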
3.4 Linear Algebra and Correlation
- Invert the correlation matrix from above. This is known as the precision matrix and contains variance inflation factors on the diagonal.
## OverallQual GrLivArea SalePrice
## OverallQual 2.6865350 -0.1753704 -2.000728
## GrLivArea -0.1753704 2.0200794 -1.292763
## SalePrice -2.0007280 -1.2927630 3.498623
- Multiply the correlation matrix by the precision matrix.
## OverallQual GrLivArea SalePrice
## OverallQual 1 0 0
## GrLivArea 0 1 0
## SalePrice 0 0 1
- Multiply the precision matrix by the correlation matrix.
## OverallQual GrLivArea SalePrice
## OverallQual 1 0 0
## GrLivArea 0 1 0
## SalePrice 0 0 1
- Conduct LU decomposition on the matrix
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.5930074 1.0000000 0
## [3,] 0.7909816 0.3695063 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 5.930074e-01 0.7909816
## [2,] 0 6.483422e-01 0.2395665
## [3,] 0 -2.775558e-17 0.2858268
check the result:
## OverallQual GrLivArea SalePrice
## OverallQual 0 0 0
## GrLivArea 0 0 0
## SalePrice 0 0 0
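The code for these steps is not shown. The sketch below rebuilds the correlation matrix from the three coefficients reported earlier, inverts it with `solve()`, and factors it with a small Doolittle routine. The helper `lu_doolittle()` is written here for illustration (no pivoting) and is an assumption, not necessarily the function used in the original.

```r
vars <- c("OverallQual", "GrLivArea", "SalePrice")
cm <- matrix(c(1,         0.5930074, 0.7909816,
               0.5930074, 1,         0.7086245,
               0.7909816, 0.7086245, 1),
             ncol = 3, byrow = TRUE, dimnames = list(vars, vars))

precision <- solve(cm)          # inverse of the correlation matrix
round(cm %*% precision, 10)     # identity matrix
round(precision %*% cm, 10)     # identity matrix

# Hypothetical helper: Doolittle LU factorization without pivoting
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # elimination multiplier
      U[i, ]  <- U[i, ] - L[i, k] * U[k, ]  # zero out the entry below the pivot
    }
  }
  list(L = L, U = U)
}

lu <- lu_doolittle(cm)
max(abs(lu$L %*% lu$U - cm))    # reconstruction error, should be near 0
```

Skipping pivoting is acceptable here because the leading pivots of this correlation matrix are all well away from zero.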
3.5 Calculus-Based Probability & Statistics
To find a right-skewed variable in the training set, first restrict attention to variables without NA values, then examine the skewness of each.
## [1] 12.18262
## [1] 1.682041
## [1] 4.246521
## [1] 0.9183784
## [1] 1.521124
## [1] 1.373929
## [1] 0.81136
## [1] 0.1796113
## [1] 1.53821
## [1] 2.359486
## [1] 3.083526
## [1] 10.28318
## [1] 4.113747
## [1] 14.79792
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 0 NA 0
## Street Alley LotShape LandContour Utilities
## 0 NA 0 0 0
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 0 0
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## NA NA 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## NA NA NA NA 467
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## NA 1293 118 37 0
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## 0 0 NA 0 829
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1434 0 856 1378 9
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 913 6 1 0 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 0 690 NA NA NA
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## NA 81 81 NA NA
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 0 761 656 1252 1436
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1344 1453 NA NA NA
## MiscVal MoSold YrSold SaleType SaleCondition
## 1408 0 0 0 0
## SalePrice
## 0
In the training set, PoolArea has high skewness, but almost all of its values are 0 (1453 of 1460). The same is true of many other variables.
Therefore, I pick a variable with fewer zeros and reasonable skewness, BsmtFinSF1, and add 1 to every data point so that the minimum value is strictly above zero.
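The screening described above can be sketched as follows. The `skew()` helper is a simple moment-based estimator written for this illustration (the original likely used a package function, which may apply a slightly different formula), and `toy` is a hypothetical data frame.

```r
# Hypothetical helper: moment-based sample skewness
skew <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Hypothetical data: a is right skewed with zeros, b is symmetric, c has an NA
toy <- data.frame(a = c(0, 0, 0, 5, 40),
                  b = c(1, 2, 3, 4, 5),
                  c = c(0, NA, 2, 3, 4))

# Zeros per column; columns containing NA return NA, as in the output above
zero_counts <- colSums(toy == 0)
zero_counts

# Skewness of the columns without missing values
sapply(toy[c("a", "b")], skew)
```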
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 383.5 443.6 712.2 5644.0
library(MASS)
mydata <- train$BsmtFinSF1 + 1
#fit an exponential probability density function
bsmtfinsf1_exp <- fitdistr(mydata, "exponential" )
#find the optimal value of lambda
lambda_est <- bsmtfinsf1_exp$estimate
#take 1000 samples from this exp dist using the lambda value
ExponentialSamples <- rexp(1000, lambda_est)
#plot a histogram and compare it with the original one
par(mfrow=c(2,1))
hist(ExponentialSamples, breaks=50)
hist(mydata, breaks=50)
Next, find the 5th and 95th percentiles using the cumulative distribution function.
## [1] 22.80704 1332.02158
The 5th percentile of the exponential distribution at the optimal lambda is 22.80704 and the 95th percentile is 1332.02158.
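The percentile code is not shown. For an exponential distribution the inverse CDF is \(-\ln(1-p)/\lambda\), which R provides as `qexp()`. The sketch below approximates the fitted rate as 1 / (mean + 1), using the rounded mean 443.6 from the summary above, so its results agree with the reported values only to about one decimal place.

```r
# Approximate rate: the maximum likelihood estimate for an exponential is 1 / mean,
# here using the rounded mean of BsmtFinSF1 plus the shift of 1
rate_est <- 1 / (443.6 + 1)

p <- c(0.05, 0.95)

q_builtin <- qexp(p, rate = rate_est)    # inverse CDF via the built-in function
q_closed  <- -log(1 - p) / rate_est      # same quantity from the closed form

rbind(q_builtin, q_closed)
```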
Also, generate a 95% confidence interval from the empirical data, assuming normality.
## [1] -449.2961 1338.5756
Assuming the data are normally distributed (which is clearly not the case), the 95% interval is (-449.2961, 1338.5756). The negative lower bound is impossible for this variable, whose values are all positive after the shift, which shows that the normal assumption is a poor fit.
Finally, provide the empirical 5th percentile and 95th percentile of the data.
## 5% 95%
## 1 1275
The empirical 5th and 95th percentiles are 1 and 1275 respectively. The 5th percentile is 1 because we added 1 to every data point so that the minimum value is strictly above zero. On the original scale, subtract 1 from each: 0 and 1274.
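Both intervals can be sketched on simulated data. The vector `sim` below is a hypothetical right-skewed stand-in for the shifted variable, and the normal-based interval is taken as mean ± z × sd, which is one plausible reading of "assuming normality" rather than a confirmed reproduction of the original code.

```r
set.seed(1)                                  # arbitrary seed for the illustration
sim <- rexp(1000, rate = 1 / 444) + 1        # hypothetical right-skewed data, minimum above 0

# Interval that would hold 95% of the data if it were normal
z <- qnorm(0.975)
normal_interval <- mean(sim) + c(-1, 1) * z * sd(sim)
normal_interval

# Empirical 5th and 95th percentiles, with no distributional assumption
empirical <- quantile(sim, c(0.05, 0.95))
empirical
```

On skewed data like this, the normal-based lower bound falls below zero while the empirical percentiles stay within the observed range.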
Discuss
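BsmtFinSF1 contains many zeros in the training set (467 observations), and these account for much of the distribution's asymmetry: a large mass sits at the minimum while a long tail extends to the right. The zeros probably correspond to houses that have no basement or no finished basement area. The exponential fit places the 95th percentile (about 1332) reasonably close to the empirical one (1275), but it cannot reproduce the spike at the minimum, where the empirical 5th percentile is 1 against about 22.8 for the fitted distribution.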
3.6 Modeling
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
train$LotFrontage[is.na(train$LotFrontage)] <- 0
train$MasVnrArea[is.na(train$MasVnrArea)] <- 0
train$GarageYrBlt[is.na(train$GarageYrBlt)] <- "0"
test$LotFrontage[is.na(test$LotFrontage)] <- 0
test$MasVnrArea[is.na(test$MasVnrArea)] <- 0
test$GarageYrBlt[is.na(test$GarageYrBlt)] <- "0"
1st model:
I created the 1st model by removing all categorical variables.
# 1st model: dropping categorical variables
# columns removed from both sets: identifiers and categorical variables
drop_cols <- c("Id", "MSSubClass", "MSZoning", "Street", "Alley", "LotShape",
               "LandContour", "Utilities", "LotConfig", "LandSlope", "Neighborhood",
               "Condition1", "Condition2", "BldgType", "HouseStyle", "RoofStyle",
               "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType", "ExterQual",
               "ExterCond", "Foundation", "BsmtQual", "BsmtCond", "BsmtExposure",
               "BsmtFinType1", "BsmtFinType2", "Heating", "HeatingQC", "CentralAir",
               "Electrical", "KitchenQual", "Functional", "FireplaceQu", "GarageType",
               "GarageFinish", "GarageQual", "GarageCond", "PavedDrive", "PoolQC",
               "Fence", "MiscFeature", "SaleType", "SaleCondition", "GarageYrBlt")
train1 <- train[, !(names(train) %in% drop_cols)]
test1 <- test[, !(names(test) %in% drop_cols)]
test1$BsmtFinSF1[is.na(test1$BsmtFinSF1)] <- 0
test1$BsmtFinSF2[is.na(test1$BsmtFinSF2)] <- 0
test1$BsmtUnfSF[is.na(test1$BsmtUnfSF)] <- 0
test1$TotalBsmtSF[is.na(test1$TotalBsmtSF)] <- 0
test1$BsmtFullBath[is.na(test1$BsmtFullBath)] <- 0
test1$BsmtHalfBath[is.na(test1$BsmtHalfBath)] <- 0
test1$GarageCars[is.na(test1$GarageCars)] <- 0
test1$GarageArea[is.na(test1$GarageArea)] <- 0
model1 <- lm(SalePrice ~ ., data = train1)
summary(model1)
##
## Call:
## lm(formula = SalePrice ~ ., data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -498568 -16541 -2102 13641 308685
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.617e+05 1.431e+06 0.113 0.910027
## LotFrontage 5.319e+01 2.858e+01 1.861 0.062979 .
## LotArea 4.523e-01 1.018e-01 4.444 9.53e-06 ***
## OverallQual 1.650e+04 1.199e+03 13.756 < 2e-16 ***
## OverallCond 4.792e+03 1.037e+03 4.621 4.16e-06 ***
## YearBuilt 3.150e+02 6.172e+01 5.104 3.78e-07 ***
## YearRemodAdd 1.616e+02 6.696e+01 2.414 0.015901 *
## MasVnrArea 2.928e+01 6.009e+00 4.873 1.22e-06 ***
## BsmtFinSF1 2.016e+01 4.731e+00 4.261 2.17e-05 ***
## BsmtFinSF2 9.052e+00 7.151e+00 1.266 0.205801
## BsmtUnfSF 1.089e+01 4.251e+00 2.562 0.010523 *
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 4.920e+01 5.846e+00 8.417 < 2e-16 ***
## X2ndFlrSF 4.149e+01 4.919e+00 8.436 < 2e-16 ***
## LowQualFinSF 1.853e+01 1.991e+01 0.930 0.352294
## GrLivArea NA NA NA NA
## BsmtFullBath 7.924e+03 2.637e+03 3.005 0.002706 **
## BsmtHalfBath 4.056e+02 4.139e+03 0.098 0.921947
## FullBath 3.128e+03 2.851e+03 1.097 0.272708
## HalfBath -1.417e+03 2.701e+03 -0.525 0.599930
## BedroomAbvGr -9.047e+03 1.703e+03 -5.312 1.26e-07 ***
## KitchenAbvGr -2.435e+04 4.931e+03 -4.939 8.77e-07 ***
## TotRmsAbvGrd 5.838e+03 1.249e+03 4.674 3.23e-06 ***
## Fireplaces 3.665e+03 1.795e+03 2.042 0.041355 *
## GarageCars 1.045e+04 2.894e+03 3.613 0.000313 ***
## GarageArea 2.527e+00 9.829e+00 0.257 0.797112
## WoodDeckSF 2.618e+01 8.102e+00 3.232 0.001258 **
## OpenPorchSF 1.074e+00 1.537e+01 0.070 0.944302
## EnclosedPorch 1.334e+01 1.709e+01 0.780 0.435294
## X3SsnPorch 2.378e+01 3.179e+01 0.748 0.454577
## ScreenPorch 5.475e+01 1.742e+01 3.142 0.001711 **
## PoolArea -4.099e+01 2.401e+01 -1.707 0.088025 .
## MiscVal -1.873e-01 1.885e+00 -0.099 0.920862
## MoSold 2.402e+01 3.494e+02 0.069 0.945196
## YrSold -5.818e+02 7.113e+02 -0.818 0.413563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35240 on 1427 degrees of freedom
## Multiple R-squared: 0.8075, Adjusted R-squared: 0.8032
## F-statistic: 187.1 on 32 and 1427 DF, p-value: < 2.2e-16
hist(resid(model1), breaks=35, prob=TRUE)
curve(dnorm(x, mean = mean(resid(model1)), sd = sd(resid(model1))), col="red", add=TRUE)
The R-squared of model1 is 0.8075, and the adjusted R-squared is 0.8032.
The residuals are randomly dispersed around y=0 with some outliers.
The QQ plot also shows some outliers at both ends of the graph.
The histogram is unimodal and fairly normal.
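The three diagnostics described here (residual plot, QQ plot, histogram) can be produced with base R graphics. The sketch below uses the built-in mtcars data and a hypothetical model `fit` rather than the housing model, so it only illustrates the calls, not the results reported in this section.

```r
# Fit a small linear model on a built-in dataset (mtcars), not the housing data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)

# Residuals vs fitted values: look for curvature or a funnel shape.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")

# Normal Q-Q plot: points far from the line at either end suggest heavy tails.
qqnorm(res)
qqline(res, col = "red")

# Histogram with a normal density overlay, as in the code above.
hist(res, breaks = 10, prob = TRUE)
curve(dnorm(x, mean = mean(res), sd = sd(res)), col = "red", add = TRUE)
```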
2nd model:
## Start: AIC=30604.98
## SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## GrLivArea + BsmtFullBath + BsmtHalfBath + FullBath + HalfBath +
## BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch +
## X3SsnPorch + ScreenPorch + PoolArea + MiscVal + MoSold +
## YrSold
##
##
## Step: AIC=30604.98
## SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr +
## KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + GarageArea +
## WoodDeckSF + OpenPorchSF + EnclosedPorch + X3SsnPorch + ScreenPorch +
## PoolArea + MiscVal + MoSold + YrSold
##
##
## Step: AIC=30604.98
## SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + Fireplaces + GarageCars + GarageArea + WoodDeckSF +
## OpenPorchSF + EnclosedPorch + X3SsnPorch + ScreenPorch +
## PoolArea + MiscVal + MoSold + YrSold
##
## Df Sum of Sq RSS AIC
## - MoSold 1 5.8708e+06 1.7723e+12 30603
## - OpenPorchSF 1 6.0643e+06 1.7723e+12 30603
## - BsmtHalfBath 1 1.1928e+07 1.7723e+12 30603
## - MiscVal 1 1.2263e+07 1.7723e+12 30603
## - GarageArea 1 8.2117e+07 1.7724e+12 30603
## - HalfBath 1 3.4182e+08 1.7726e+12 30603
## - X3SsnPorch 1 6.9491e+08 1.7730e+12 30604
## - EnclosedPorch 1 7.5636e+08 1.7731e+12 30604
## - YrSold 1 8.3079e+08 1.7731e+12 30604
## - LowQualFinSF 1 1.0752e+09 1.7734e+12 30604
## - FullBath 1 1.4953e+09 1.7738e+12 30604
## - BsmtFinSF2 1 1.9899e+09 1.7743e+12 30605
## <none> 1.7723e+12 30605
## - PoolArea 1 3.6193e+09 1.7759e+12 30606
## - LotFrontage 1 4.3004e+09 1.7766e+12 30607
## - Fireplaces 1 5.1777e+09 1.7775e+12 30607
## - YearRemodAdd 1 7.2378e+09 1.7795e+12 30609
## - BsmtUnfSF 1 8.1492e+09 1.7804e+12 30610
## - BsmtFullBath 1 1.1212e+10 1.7835e+12 30612
## - ScreenPorch 1 1.2263e+10 1.7846e+12 30613
## - WoodDeckSF 1 1.2972e+10 1.7853e+12 30614
## - GarageCars 1 1.6210e+10 1.7885e+12 30616
## - BsmtFinSF1 1 2.2551e+10 1.7948e+12 30621
## - LotArea 1 2.4522e+10 1.7968e+12 30623
## - OverallCond 1 2.6522e+10 1.7988e+12 30625
## - TotRmsAbvGrd 1 2.7132e+10 1.7994e+12 30625
## - MasVnrArea 1 2.9489e+10 1.8018e+12 30627
## - KitchenAbvGr 1 3.0299e+10 1.8026e+12 30628
## - YearBuilt 1 3.2352e+10 1.8047e+12 30629
## - BedroomAbvGr 1 3.5047e+10 1.8073e+12 30632
## - X1stFlrSF 1 8.7983e+10 1.8603e+12 30674
## - X2ndFlrSF 1 8.8390e+10 1.8607e+12 30674
## - OverallQual 1 2.3503e+11 2.0073e+12 30785
##
## Step: AIC=30602.98
## SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + Fireplaces + GarageCars + GarageArea + WoodDeckSF +
## OpenPorchSF + EnclosedPorch + X3SsnPorch + ScreenPorch +
## PoolArea + MiscVal + YrSold
##
## Df Sum of Sq RSS AIC
## - OpenPorchSF 1 6.7525e+06 1.7723e+12 30601
## - MiscVal 1 1.2276e+07 1.7723e+12 30601
## - BsmtHalfBath 1 1.2428e+07 1.7723e+12 30601
## - GarageArea 1 8.1743e+07 1.7724e+12 30601
## - HalfBath 1 3.4577e+08 1.7727e+12 30601
## - X3SsnPorch 1 6.9988e+08 1.7730e+12 30602
## - EnclosedPorch 1 7.5278e+08 1.7731e+12 30602
## - YrSold 1 8.6929e+08 1.7732e+12 30602
## - LowQualFinSF 1 1.0716e+09 1.7734e+12 30602
## - FullBath 1 1.4946e+09 1.7738e+12 30602
## - BsmtFinSF2 1 1.9862e+09 1.7743e+12 30603
## <none> 1.7723e+12 30603
## - PoolArea 1 3.6432e+09 1.7759e+12 30604
## - LotFrontage 1 4.3061e+09 1.7766e+12 30605
## + MoSold 1 5.8708e+06 1.7723e+12 30605
## - Fireplaces 1 5.1911e+09 1.7775e+12 30605
## - YearRemodAdd 1 7.2388e+09 1.7795e+12 30607
## - BsmtUnfSF 1 8.1433e+09 1.7804e+12 30608
## - BsmtFullBath 1 1.1213e+10 1.7835e+12 30610
## - ScreenPorch 1 1.2278e+10 1.7846e+12 30611
## - WoodDeckSF 1 1.2996e+10 1.7853e+12 30612
## - GarageCars 1 1.6217e+10 1.7885e+12 30614
## - BsmtFinSF1 1 2.2550e+10 1.7949e+12 30619
## - LotArea 1 2.4517e+10 1.7968e+12 30621
## - OverallCond 1 2.6517e+10 1.7988e+12 30623
## - TotRmsAbvGrd 1 2.7154e+10 1.7995e+12 30623
## - MasVnrArea 1 2.9496e+10 1.8018e+12 30625
## - KitchenAbvGr 1 3.0327e+10 1.8026e+12 30626
## - YearBuilt 1 3.2351e+10 1.8047e+12 30627
## - BedroomAbvGr 1 3.5081e+10 1.8074e+12 30630
## - X1stFlrSF 1 8.8045e+10 1.8604e+12 30672
## - X2ndFlrSF 1 8.8457e+10 1.8608e+12 30672
## - OverallQual 1 2.3637e+11 2.0087e+12 30784
##
## Step: AIC=30600.99
## SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + Fireplaces + GarageCars + GarageArea + WoodDeckSF +
## EnclosedPorch + X3SsnPorch + ScreenPorch + PoolArea + MiscVal +
## YrSold
##
## Df Sum of Sq RSS AIC
## - BsmtHalfBath 1 1.2451e+07 1.7723e+12 30599
## - MiscVal 1 1.2546e+07 1.7723e+12 30599
## - GarageArea 1 8.6407e+07 1.7724e+12 30599
## - HalfBath 1 3.4012e+08 1.7727e+12 30599
## - X3SsnPorch 1 6.9681e+08 1.7730e+12 30600
## - EnclosedPorch 1 7.4630e+08 1.7731e+12 30600
## - YrSold 1 8.8057e+08 1.7732e+12 30600
## - LowQualFinSF 1 1.0741e+09 1.7734e+12 30600
## - FullBath 1 1.5163e+09 1.7738e+12 30600
## - BsmtFinSF2 1 2.0023e+09 1.7743e+12 30601
## <none> 1.7723e+12 30601
## - PoolArea 1 3.6411e+09 1.7760e+12 30602
## - LotFrontage 1 4.3016e+09 1.7766e+12 30603
## + OpenPorchSF 1 6.7525e+06 1.7723e+12 30603
## + MoSold 1 6.5590e+06 1.7723e+12 30603
## - Fireplaces 1 5.1894e+09 1.7775e+12 30603
## - YearRemodAdd 1 7.2834e+09 1.7796e+12 30605
## - BsmtUnfSF 1 8.2551e+09 1.7806e+12 30606
## - BsmtFullBath 1 1.1258e+10 1.7836e+12 30608
## - ScreenPorch 1 1.2305e+10 1.7846e+12 30609
## - WoodDeckSF 1 1.3017e+10 1.7853e+12 30610
## - GarageCars 1 1.6262e+10 1.7886e+12 30612
## - BsmtFinSF1 1 2.2672e+10 1.7950e+12 30618
## - LotArea 1 2.4519e+10 1.7968e+12 30619
## - OverallCond 1 2.6517e+10 1.7988e+12 30621
## - TotRmsAbvGrd 1 2.7148e+10 1.7995e+12 30621
## - MasVnrArea 1 2.9541e+10 1.8019e+12 30623
## - KitchenAbvGr 1 3.0465e+10 1.8028e+12 30624
## - YearBuilt 1 3.2356e+10 1.8047e+12 30625
## - BedroomAbvGr 1 3.5173e+10 1.8075e+12 30628
## - X1stFlrSF 1 8.8301e+10 1.8606e+12 30670
## - X2ndFlrSF 1 8.9277e+10 1.8616e+12 30671
## - OverallQual 1 2.3682e+11 2.0091e+12 30782
##
## Step: AIC=30599
## SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd +
## Fireplaces + GarageCars + GarageArea + WoodDeckSF + EnclosedPorch +
## X3SsnPorch + ScreenPorch + PoolArea + MiscVal + YrSold
##
## Df Sum of Sq RSS AIC
## - MiscVal 1 1.3300e+07 1.7723e+12 30597
## - GarageArea 1 8.4752e+07 1.7724e+12 30597
## - HalfBath 1 3.4363e+08 1.7727e+12 30597
## - X3SsnPorch 1 7.0512e+08 1.7730e+12 30598
## - EnclosedPorch 1 7.4911e+08 1.7731e+12 30598
## - YrSold 1 8.9026e+08 1.7732e+12 30598
## - LowQualFinSF 1 1.0749e+09 1.7734e+12 30598
## - FullBath 1 1.5047e+09 1.7738e+12 30598
## - BsmtFinSF2 1 2.0608e+09 1.7744e+12 30599
## <none> 1.7723e+12 30599
## - PoolArea 1 3.6402e+09 1.7760e+12 30600
## - LotFrontage 1 4.2920e+09 1.7766e+12 30601
## + BsmtHalfBath 1 1.2451e+07 1.7723e+12 30601
## + MoSold 1 7.0879e+06 1.7723e+12 30601
## + OpenPorchSF 1 6.7752e+06 1.7723e+12 30601
## - Fireplaces 1 5.2060e+09 1.7775e+12 30601
## - YearRemodAdd 1 7.3280e+09 1.7797e+12 30603
## - BsmtUnfSF 1 8.2431e+09 1.7806e+12 30604
## - BsmtFullBath 1 1.2070e+10 1.7844e+12 30607
## - ScreenPorch 1 1.2321e+10 1.7846e+12 30607
## - WoodDeckSF 1 1.3086e+10 1.7854e+12 30608
## - GarageCars 1 1.6311e+10 1.7886e+12 30610
## - BsmtFinSF1 1 2.3280e+10 1.7956e+12 30616
## - LotArea 1 2.4599e+10 1.7969e+12 30617
## - OverallCond 1 2.6732e+10 1.7991e+12 30619
## - TotRmsAbvGrd 1 2.7136e+10 1.7995e+12 30619
## - MasVnrArea 1 2.9606e+10 1.8019e+12 30621
## - KitchenAbvGr 1 3.0454e+10 1.8028e+12 30622
## - YearBuilt 1 3.2400e+10 1.8047e+12 30623
## - BedroomAbvGr 1 3.5350e+10 1.8077e+12 30626
## - X1stFlrSF 1 8.8296e+10 1.8606e+12 30668
## - X2ndFlrSF 1 8.9270e+10 1.8616e+12 30669
## - OverallQual 1 2.3684e+11 2.0092e+12 30780
##
## Step: AIC=30597.01
## SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd +
## Fireplaces + GarageCars + GarageArea + WoodDeckSF + EnclosedPorch +
## X3SsnPorch + ScreenPorch + PoolArea + YrSold
##
## Df Sum of Sq RSS AIC
## - GarageArea 1 8.3027e+07 1.7724e+12 30595
## - HalfBath 1 3.4274e+08 1.7727e+12 30595
## - X3SsnPorch 1 7.0411e+08 1.7730e+12 30596
## - EnclosedPorch 1 7.4450e+08 1.7731e+12 30596
## - YrSold 1 8.9086e+08 1.7732e+12 30596
## - LowQualFinSF 1 1.0775e+09 1.7734e+12 30596
## - FullBath 1 1.5088e+09 1.7738e+12 30596
## - BsmtFinSF2 1 2.0559e+09 1.7744e+12 30597
## <none> 1.7723e+12 30597
## - PoolArea 1 3.6578e+09 1.7760e+12 30598
## - LotFrontage 1 4.3400e+09 1.7767e+12 30599
## + MiscVal 1 1.3300e+07 1.7723e+12 30599
## + BsmtHalfBath 1 1.3205e+07 1.7723e+12 30599
## + MoSold 1 7.1344e+06 1.7723e+12 30599
## + OpenPorchSF 1 7.0546e+06 1.7723e+12 30599
## - Fireplaces 1 5.2028e+09 1.7775e+12 30599
## - YearRemodAdd 1 7.3278e+09 1.7797e+12 30601
## - BsmtUnfSF 1 8.2335e+09 1.7806e+12 30602
## - BsmtFullBath 1 1.2127e+10 1.7845e+12 30605
## - ScreenPorch 1 1.2307e+10 1.7846e+12 30605
## - WoodDeckSF 1 1.3091e+10 1.7854e+12 30606
## - GarageCars 1 1.6373e+10 1.7887e+12 30608
## - BsmtFinSF1 1 2.3271e+10 1.7956e+12 30614
## - LotArea 1 2.4602e+10 1.7969e+12 30615
## - OverallCond 1 2.6761e+10 1.7991e+12 30617
## - TotRmsAbvGrd 1 2.7147e+10 1.7995e+12 30617
## - MasVnrArea 1 2.9644e+10 1.8020e+12 30619
## - KitchenAbvGr 1 3.0663e+10 1.8030e+12 30620
## - YearBuilt 1 3.2393e+10 1.8047e+12 30622
## - BedroomAbvGr 1 3.5340e+10 1.8077e+12 30624
## - X1stFlrSF 1 8.8501e+10 1.8608e+12 30666
## - X2ndFlrSF 1 8.9308e+10 1.8616e+12 30667
## - OverallQual 1 2.3691e+11 2.0092e+12 30778
##
## Step: AIC=30595.08
## SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd +
## Fireplaces + GarageCars + WoodDeckSF + EnclosedPorch + X3SsnPorch +
## ScreenPorch + PoolArea + YrSold
##
## Df Sum of Sq RSS AIC
## - HalfBath 1 3.6575e+08 1.7728e+12 30593
## - X3SsnPorch 1 7.0155e+08 1.7731e+12 30594
## - EnclosedPorch 1 7.5684e+08 1.7732e+12 30594
## - YrSold 1 8.8200e+08 1.7733e+12 30594
## - LowQualFinSF 1 1.1015e+09 1.7735e+12 30594
## - FullBath 1 1.4616e+09 1.7739e+12 30594
## - BsmtFinSF2 1 2.0774e+09 1.7745e+12 30595
## <none> 1.7724e+12 30595
## - PoolArea 1 3.6306e+09 1.7761e+12 30596
## - LotFrontage 1 4.4372e+09 1.7769e+12 30597
## + GarageArea 1 8.3027e+07 1.7723e+12 30597
## + OpenPorchSF 1 1.1678e+07 1.7724e+12 30597
## + MiscVal 1 1.1575e+07 1.7724e+12 30597
## + BsmtHalfBath 1 1.1466e+07 1.7724e+12 30597
## + MoSold 1 6.8991e+06 1.7724e+12 30597
## - Fireplaces 1 5.1203e+09 1.7775e+12 30597
## - YearRemodAdd 1 7.2728e+09 1.7797e+12 30599
## - BsmtUnfSF 1 8.3063e+09 1.7807e+12 30600
## - BsmtFullBath 1 1.2149e+10 1.7846e+12 30603
## - ScreenPorch 1 1.2304e+10 1.7847e+12 30603
## - WoodDeckSF 1 1.3092e+10 1.7855e+12 30604
## - BsmtFinSF1 1 2.3623e+10 1.7960e+12 30612
## - LotArea 1 2.4683e+10 1.7971e+12 30613
## - OverallCond 1 2.7012e+10 1.7994e+12 30615
## - TotRmsAbvGrd 1 2.7094e+10 1.7995e+12 30615
## - MasVnrArea 1 2.9856e+10 1.8023e+12 30618
## - KitchenAbvGr 1 3.0843e+10 1.8033e+12 30618
## - YearBuilt 1 3.2650e+10 1.8051e+12 30620
## - BedroomAbvGr 1 3.5599e+10 1.8080e+12 30622
## - GarageCars 1 5.1346e+10 1.8238e+12 30635
## - X1stFlrSF 1 9.0928e+10 1.8633e+12 30666
## - X2ndFlrSF 1 9.1217e+10 1.8636e+12 30666
## - OverallQual 1 2.3683e+11 2.0092e+12 30776
##
## Step: AIC=30593.38
## SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## FullBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageCars + WoodDeckSF + EnclosedPorch + X3SsnPorch + ScreenPorch +
## PoolArea + YrSold
##
## Df Sum of Sq RSS AIC
## - X3SsnPorch 1 6.8111e+08 1.7735e+12 30592
## - EnclosedPorch 1 7.8997e+08 1.7736e+12 30592
## - YrSold 1 8.9849e+08 1.7737e+12 30592
## - LowQualFinSF 1 1.0980e+09 1.7739e+12 30592
## - BsmtFinSF2 1 2.0078e+09 1.7748e+12 30593
## - FullBath 1 2.3840e+09 1.7752e+12 30593
## <none> 1.7728e+12 30593
## - PoolArea 1 3.5344e+09 1.7763e+12 30594
## + HalfBath 1 3.6575e+08 1.7724e+12 30595
## - LotFrontage 1 4.5395e+09 1.7773e+12 30595
## + GarageArea 1 1.0604e+08 1.7727e+12 30595
## + BsmtHalfBath 1 1.4680e+07 1.7728e+12 30595
## + MoSold 1 1.0763e+07 1.7728e+12 30595
## + MiscVal 1 1.0508e+07 1.7728e+12 30595
## + OpenPorchSF 1 3.6677e+06 1.7728e+12 30595
## - Fireplaces 1 4.9296e+09 1.7777e+12 30595
## - YearRemodAdd 1 7.2378e+09 1.7800e+12 30597
## - BsmtUnfSF 1 8.3074e+09 1.7811e+12 30598
## - ScreenPorch 1 1.2114e+10 1.7849e+12 30601
## - BsmtFullBath 1 1.2391e+10 1.7852e+12 30602
## - WoodDeckSF 1 1.3084e+10 1.7859e+12 30602
## - BsmtFinSF1 1 2.3521e+10 1.7963e+12 30611
## - LotArea 1 2.4822e+10 1.7976e+12 30612
## - TotRmsAbvGrd 1 2.6941e+10 1.7997e+12 30613
## - OverallCond 1 2.7073e+10 1.7999e+12 30614
## - MasVnrArea 1 2.9725e+10 1.8025e+12 30616
## - KitchenAbvGr 1 3.0965e+10 1.8038e+12 30617
## - YearBuilt 1 3.3491e+10 1.8063e+12 30619
## - BedroomAbvGr 1 3.5505e+10 1.8083e+12 30620
## - GarageCars 1 5.1180e+10 1.8240e+12 30633
## - X1stFlrSF 1 9.0697e+10 1.8635e+12 30664
## - X2ndFlrSF 1 1.1235e+11 1.8851e+12 30681
## - OverallQual 1 2.3743e+11 2.0102e+12 30775
##
## Step: AIC=30591.94
## SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## FullBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageCars + WoodDeckSF + EnclosedPorch + ScreenPorch + PoolArea +
## YrSold
##
## Df Sum of Sq RSS AIC
## - EnclosedPorch 1 7.4067e+08 1.7742e+12 30591
## - YrSold 1 8.6736e+08 1.7743e+12 30591
## - LowQualFinSF 1 1.1106e+09 1.7746e+12 30591
## - BsmtFinSF2 1 1.9338e+09 1.7754e+12 30592
## <none> 1.7735e+12 30592
## - FullBath 1 2.4463e+09 1.7759e+12 30592
## - PoolArea 1 3.5630e+09 1.7770e+12 30593
## + X3SsnPorch 1 6.8111e+08 1.7728e+12 30593
## + HalfBath 1 3.4531e+08 1.7731e+12 30594
## - LotFrontage 1 4.5872e+09 1.7781e+12 30594
## + GarageArea 1 1.0250e+08 1.7734e+12 30594
## + BsmtHalfBath 1 2.3307e+07 1.7734e+12 30594
## + MoSold 1 1.6912e+07 1.7735e+12 30594
## + MiscVal 1 9.6788e+06 1.7735e+12 30594
## + OpenPorchSF 1 1.6283e+06 1.7735e+12 30594
## - Fireplaces 1 4.9102e+09 1.7784e+12 30594
## - YearRemodAdd 1 7.2740e+09 1.7807e+12 30596
## - BsmtUnfSF 1 8.2272e+09 1.7817e+12 30597
## - ScreenPorch 1 1.1889e+10 1.7854e+12 30600
## - BsmtFullBath 1 1.2352e+10 1.7858e+12 30600
## - WoodDeckSF 1 1.2776e+10 1.7862e+12 30600
## - BsmtFinSF1 1 2.3418e+10 1.7969e+12 30609
## - LotArea 1 2.4934e+10 1.7984e+12 30610
## - TotRmsAbvGrd 1 2.6738e+10 1.8002e+12 30612
## - OverallCond 1 2.7424e+10 1.8009e+12 30612
## - MasVnrArea 1 2.9731e+10 1.8032e+12 30614
## - KitchenAbvGr 1 3.1236e+10 1.8047e+12 30615
## - YearBuilt 1 3.3503e+10 1.8070e+12 30617
## - BedroomAbvGr 1 3.5666e+10 1.8091e+12 30619
## - GarageCars 1 5.1331e+10 1.8248e+12 30632
## - X1stFlrSF 1 9.1812e+10 1.8653e+12 30664
## - X2ndFlrSF 1 1.1272e+11 1.8862e+12 30680
## - OverallQual 1 2.3711e+11 2.0106e+12 30773
##
## Step: AIC=30590.55
## SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## FullBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageCars + WoodDeckSF + ScreenPorch + PoolArea + YrSold
##
## Df Sum of Sq RSS AIC
## - YrSold 1 8.7772e+08 1.7751e+12 30589
## - LowQualFinSF 1 1.0525e+09 1.7753e+12 30589
## - BsmtFinSF2 1 2.0345e+09 1.7762e+12 30590
## <none> 1.7742e+12 30591
## - FullBath 1 2.5220e+09 1.7767e+12 30591
## - PoolArea 1 3.3994e+09 1.7776e+12 30591
## + EnclosedPorch 1 7.4067e+08 1.7735e+12 30592
## + X3SsnPorch 1 6.3182e+08 1.7736e+12 30592
## + HalfBath 1 3.7723e+08 1.7738e+12 30592
## - LotFrontage 1 4.6138e+09 1.7788e+12 30592
## + GarageArea 1 1.1724e+08 1.7741e+12 30593
## + BsmtHalfBath 1 2.6501e+07 1.7742e+12 30593
## + MoSold 1 9.5962e+06 1.7742e+12 30593
## + MiscVal 1 5.7320e+06 1.7742e+12 30593
## + OpenPorchSF 1 5.9028e+05 1.7742e+12 30593
## - Fireplaces 1 4.9216e+09 1.7791e+12 30593
## - YearRemodAdd 1 7.4372e+09 1.7816e+12 30595
## - BsmtUnfSF 1 8.3471e+09 1.7826e+12 30595
## - ScreenPorch 1 1.1297e+10 1.7855e+12 30598
## - WoodDeckSF 1 1.2373e+10 1.7866e+12 30599
## - BsmtFullBath 1 1.2706e+10 1.7869e+12 30599
## - BsmtFinSF1 1 2.3388e+10 1.7976e+12 30608
## - LotArea 1 2.4656e+10 1.7989e+12 30609
## - TotRmsAbvGrd 1 2.6364e+10 1.8006e+12 30610
## - OverallCond 1 2.6761e+10 1.8010e+12 30610
## - MasVnrArea 1 2.9539e+10 1.8037e+12 30613
## - KitchenAbvGr 1 3.1682e+10 1.8059e+12 30614
## - YearBuilt 1 3.4276e+10 1.8085e+12 30617
## - BedroomAbvGr 1 3.5592e+10 1.8098e+12 30618
## - GarageCars 1 5.1675e+10 1.8259e+12 30631
## - X1stFlrSF 1 9.2145e+10 1.8664e+12 30663
## - X2ndFlrSF 1 1.1396e+11 1.8882e+12 30679
## - OverallQual 1 2.4202e+11 2.0162e+12 30775
##
## Step: AIC=30589.27
## SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## FullBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageCars + WoodDeckSF + ScreenPorch + PoolArea
##
## Df Sum of Sq RSS AIC
## - LowQualFinSF 1 1.1002e+09 1.7762e+12 30588
## - BsmtFinSF2 1 1.9971e+09 1.7771e+12 30589
## <none> 1.7751e+12 30589
## - FullBath 1 2.4879e+09 1.7776e+12 30589
## - PoolArea 1 3.2131e+09 1.7783e+12 30590
## + YrSold 1 8.7772e+08 1.7742e+12 30591
## + EnclosedPorch 1 7.5103e+08 1.7743e+12 30591
## + X3SsnPorch 1 6.0137e+08 1.7745e+12 30591
## + HalfBath 1 3.9449e+08 1.7747e+12 30591
## - LotFrontage 1 4.5772e+09 1.7797e+12 30591
## + GarageArea 1 1.0745e+08 1.7750e+12 30591
## + MoSold 1 5.5582e+07 1.7750e+12 30591
## + BsmtHalfBath 1 3.9915e+07 1.7750e+12 30591
## + MiscVal 1 6.1631e+06 1.7751e+12 30591
## + OpenPorchSF 1 6.3444e+05 1.7751e+12 30591
## - Fireplaces 1 4.9470e+09 1.7800e+12 30591
## - YearRemodAdd 1 7.2099e+09 1.7823e+12 30593
## - BsmtUnfSF 1 8.3748e+09 1.7835e+12 30594
## - ScreenPorch 1 1.1162e+10 1.7862e+12 30596
## - WoodDeckSF 1 1.2196e+10 1.7873e+12 30597
## - BsmtFullBath 1 1.2351e+10 1.7874e+12 30597
## - BsmtFinSF1 1 2.3519e+10 1.7986e+12 30607
## - LotArea 1 2.4770e+10 1.7999e+12 30608
## - TotRmsAbvGrd 1 2.6471e+10 1.8016e+12 30609
## - OverallCond 1 2.6563e+10 1.8016e+12 30609
## - MasVnrArea 1 2.9383e+10 1.8045e+12 30611
## - KitchenAbvGr 1 3.2131e+10 1.8072e+12 30614
## - YearBuilt 1 3.4467e+10 1.8096e+12 30615
## - BedroomAbvGr 1 3.5476e+10 1.8106e+12 30616
## - GarageCars 1 5.2248e+10 1.8273e+12 30630
## - X1stFlrSF 1 9.2097e+10 1.8672e+12 30661
## - X2ndFlrSF 1 1.1394e+11 1.8890e+12 30678
## - OverallQual 1 2.4236e+11 2.0174e+12 30774
##
## Step: AIC=30588.17
## SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + BsmtFullBath + FullBath +
## BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageCars + WoodDeckSF + ScreenPorch + PoolArea
##
## Df Sum of Sq RSS AIC
## - BsmtFinSF2 1 2.0445e+09 1.7782e+12 30588
## <none> 1.7762e+12 30588
## - FullBath 1 2.6423e+09 1.7788e+12 30588
## - PoolArea 1 2.9956e+09 1.7792e+12 30589
## + LowQualFinSF 1 1.1002e+09 1.7751e+12 30589
## + GrLivArea 1 1.1002e+09 1.7751e+12 30589
## + YrSold 1 9.2539e+08 1.7753e+12 30589
## + EnclosedPorch 1 6.9156e+08 1.7755e+12 30590
## + X3SsnPorch 1 6.1423e+08 1.7756e+12 30590
## + HalfBath 1 3.8980e+08 1.7758e+12 30590
## - LotFrontage 1 4.6481e+09 1.7808e+12 30590
## + GarageArea 1 1.3326e+08 1.7761e+12 30590
## - Fireplaces 1 4.7892e+09 1.7810e+12 30590
## + MoSold 1 4.5034e+07 1.7761e+12 30590
## + BsmtHalfBath 1 4.1396e+07 1.7761e+12 30590
## + MiscVal 1 8.0154e+06 1.7762e+12 30590
## + OpenPorchSF 1 2.1766e+06 1.7762e+12 30590
## - YearRemodAdd 1 7.4504e+09 1.7836e+12 30592
## - BsmtUnfSF 1 8.4443e+09 1.7846e+12 30593
## - ScreenPorch 1 1.1308e+10 1.7875e+12 30595
## - WoodDeckSF 1 1.2252e+10 1.7884e+12 30596
## - BsmtFullBath 1 1.2430e+10 1.7886e+12 30596
## - BsmtFinSF1 1 2.3612e+10 1.7998e+12 30606
## - LotArea 1 2.4725e+10 1.8009e+12 30606
## - OverallCond 1 2.5897e+10 1.8021e+12 30607
## - TotRmsAbvGrd 1 2.8727e+10 1.8049e+12 30610
## - MasVnrArea 1 2.9054e+10 1.8052e+12 30610
## - YearBuilt 1 3.3369e+10 1.8096e+12 30613
## - KitchenAbvGr 1 3.3406e+10 1.8096e+12 30613
## - BedroomAbvGr 1 3.5675e+10 1.8119e+12 30615
## - GarageCars 1 5.1606e+10 1.8278e+12 30628
## - X1stFlrSF 1 9.1165e+10 1.8674e+12 30659
## - X2ndFlrSF 1 1.1285e+11 1.8890e+12 30676
## - OverallQual 1 2.4470e+11 2.0209e+12 30775
##
## Step: AIC=30587.85
## SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtUnfSF +
## X1stFlrSF + X2ndFlrSF + BsmtFullBath + FullBath + BedroomAbvGr +
## KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF +
## ScreenPorch + PoolArea
##
## Df Sum of Sq RSS AIC
## - FullBath 1 2.4245e+09 1.7807e+12 30588
## <none> 1.7782e+12 30588
## + BsmtFinSF2 1 2.0445e+09 1.7762e+12 30588
## + TotalBsmtSF 1 2.0445e+09 1.7762e+12 30588
## - PoolArea 1 2.8365e+09 1.7811e+12 30588
## + LowQualFinSF 1 1.1477e+09 1.7771e+12 30589
## + GrLivArea 1 1.1477e+09 1.7771e+12 30589
## + YrSold 1 8.8753e+08 1.7773e+12 30589
## + EnclosedPorch 1 7.8775e+08 1.7774e+12 30589
## + X3SsnPorch 1 5.4005e+08 1.7777e+12 30589
## + HalfBath 1 3.1990e+08 1.7779e+12 30590
## - Fireplaces 1 4.5743e+09 1.7828e+12 30590
## + GarageArea 1 1.5919e+08 1.7781e+12 30590
## + BsmtHalfBath 1 1.2425e+08 1.7781e+12 30590
## - LotFrontage 1 4.8147e+09 1.7830e+12 30590
## + MoSold 1 3.2650e+07 1.7782e+12 30590
## + OpenPorchSF 1 1.4392e+07 1.7782e+12 30590
## + MiscVal 1 4.1196e+06 1.7782e+12 30590
## - BsmtUnfSF 1 6.3998e+09 1.7846e+12 30591
## - YearRemodAdd 1 7.0167e+09 1.7852e+12 30592
## - ScreenPorch 1 1.2223e+10 1.7905e+12 30596
## - WoodDeckSF 1 1.2789e+10 1.7910e+12 30596
## - BsmtFullBath 1 1.5199e+10 1.7934e+12 30598
## - BsmtFinSF1 1 2.3214e+10 1.8014e+12 30605
## - LotArea 1 2.6093e+10 1.8043e+12 30607
## - OverallCond 1 2.6208e+10 1.8044e+12 30607
## - TotRmsAbvGrd 1 2.8059e+10 1.8063e+12 30609
## - MasVnrArea 1 2.8822e+10 1.8071e+12 30609
## - YearBuilt 1 3.4833e+10 1.8131e+12 30614
## - KitchenAbvGr 1 3.5145e+10 1.8134e+12 30614
## - BedroomAbvGr 1 3.5214e+10 1.8134e+12 30615
## - GarageCars 1 5.0929e+10 1.8292e+12 30627
## - X2ndFlrSF 1 1.1296e+11 1.8912e+12 30676
## - X1stFlrSF 1 1.1764e+11 1.8959e+12 30679
## - OverallQual 1 2.4878e+11 2.0270e+12 30777
##
## Step: AIC=30587.84
## SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtUnfSF +
## X1stFlrSF + X2ndFlrSF + BsmtFullBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF + ScreenPorch +
## PoolArea
##
## Df Sum of Sq RSS AIC
## <none> 1.7807e+12 30588
## + FullBath 1 2.4245e+09 1.7782e+12 30588
## - PoolArea 1 2.9855e+09 1.7836e+12 30588
## + BsmtFinSF2 1 1.8267e+09 1.7788e+12 30588
## + TotalBsmtSF 1 1.8267e+09 1.7788e+12 30588
## + LowQualFinSF 1 1.2958e+09 1.7794e+12 30589
## + GrLivArea 1 1.2958e+09 1.7794e+12 30589
## + HalfBath 1 1.2236e+09 1.7794e+12 30589
## + YrSold 1 8.5867e+08 1.7798e+12 30589
## + EnclosedPorch 1 8.5439e+08 1.7798e+12 30589
## + X3SsnPorch 1 5.9863e+08 1.7801e+12 30589
## - Fireplaces 1 4.3912e+09 1.7850e+12 30589
## - LotFrontage 1 4.5169e+09 1.7852e+12 30590
## + GarageArea 1 9.2015e+07 1.7806e+12 30590
## + BsmtHalfBath 1 6.3457e+07 1.7806e+12 30590
## + MoSold 1 3.8328e+07 1.7806e+12 30590
## + OpenPorchSF 1 2.9157e+07 1.7806e+12 30590
## + MiscVal 1 7.3356e+06 1.7806e+12 30590
## - BsmtUnfSF 1 6.2109e+09 1.7869e+12 30591
## - YearRemodAdd 1 8.4608e+09 1.7891e+12 30593
## - ScreenPorch 1 1.1864e+10 1.7925e+12 30596
## - WoodDeckSF 1 1.2769e+10 1.7934e+12 30596
## - BsmtFullBath 1 1.3851e+10 1.7945e+12 30597
## - BsmtFinSF1 1 2.2575e+10 1.8032e+12 30604
## - OverallCond 1 2.5229e+10 1.8059e+12 30606
## - LotArea 1 2.6602e+10 1.8073e+12 30608
## - MasVnrArea 1 2.7902e+10 1.8086e+12 30609
## - TotRmsAbvGrd 1 2.8036e+10 1.8087e+12 30609
## - KitchenAbvGr 1 3.2925e+10 1.8136e+12 30613
## - BedroomAbvGr 1 3.3424e+10 1.8141e+12 30613
## - YearBuilt 1 4.1970e+10 1.8226e+12 30620
## - GarageCars 1 5.1817e+10 1.8325e+12 30628
## - X2ndFlrSF 1 1.3184e+11 1.9125e+12 30690
## - X1stFlrSF 1 1.3286e+11 1.9135e+12 30691
## - OverallQual 1 2.5618e+11 2.0368e+12 30782
##
## Call:
## lm(formula = SalePrice ~ LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + BsmtFullBath + BedroomAbvGr +
## KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF +
## ScreenPorch + PoolArea, data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -500134 -16262 -1921 13800 305988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.007e+06 1.172e+05 -8.590 < 2e-16 ***
## LotFrontage 5.412e+01 2.833e+01 1.911 0.056259 .
## LotArea 4.676e-01 1.009e-01 4.637 3.86e-06 ***
## OverallQual 1.693e+04 1.176e+03 14.389 < 2e-16 ***
## OverallCond 4.586e+03 1.016e+03 4.515 6.84e-06 ***
## YearBuilt 3.058e+02 5.251e+01 5.824 7.08e-09 ***
## YearRemodAdd 1.715e+02 6.559e+01 2.615 0.009020 **
## MasVnrArea 2.828e+01 5.955e+00 4.748 2.25e-06 ***
## BsmtFinSF1 1.694e+01 3.965e+00 4.271 2.07e-05 ***
## BsmtUnfSF 8.228e+00 3.673e+00 2.240 0.025219 *
## X1stFlrSF 5.378e+01 5.190e+00 10.362 < 2e-16 ***
## X2ndFlrSF 4.188e+01 4.057e+00 10.322 < 2e-16 ***
## BsmtFullBath 8.118e+03 2.426e+03 3.346 0.000842 ***
## BedroomAbvGr -8.686e+03 1.671e+03 -5.197 2.31e-07 ***
## KitchenAbvGr -2.456e+04 4.761e+03 -5.158 2.84e-07 ***
## TotRmsAbvGrd 5.832e+03 1.225e+03 4.760 2.13e-06 ***
## Fireplaces 3.333e+03 1.769e+03 1.884 0.059795 .
## GarageCars 1.107e+04 1.710e+03 6.471 1.33e-10 ***
## WoodDeckSF 2.571e+01 8.002e+00 3.212 0.001346 **
## ScreenPorch 5.289e+01 1.708e+01 3.096 0.001997 **
## PoolArea -3.688e+01 2.375e+01 -1.553 0.120577
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35180 on 1439 degrees of freedom
## Multiple R-squared: 0.8066, Adjusted R-squared: 0.8039
## F-statistic: 300.1 on 20 and 1439 DF, p-value: < 2.2e-16
The backward stepwise function starts from model1. At each step, it removes the variable whose removal lowers the AIC the most, and it stops when no single change lowers the AIC any further.
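To make the AIC values in the trace concrete, the sketch below first recomputes the starting AIC from numbers printed in the output above, assuming the criterion `step()` uses for `lm` fits is the `extractAIC()` form n * log(RSS / n) + 2 * edf. It then runs a backward search on the built-in mtcars data; the names `full` and `reduced` are illustrative and are not the objects used in this report.

```r
# step() scores lm fits with extractAIC(): n * log(RSS / n) + 2 * edf.
# Values read from the output above: n = 1460 rows, RSS = 1.7723e12,
# and 33 estimated coefficients (32 predictors plus the intercept).
n <- 1460
rss <- 1.7723e12
edf <- 33
aic_start <- n * log(rss / n) + 2 * edf
aic_start  # close to the reported starting AIC of 30604.98

# Backward elimination on a built-in dataset (mtcars), not the housing data.
full <- lm(mpg ~ ., data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)
extractAIC(full)[2]
extractAIC(reduced)[2]
```

Because the RSS is read from rounded output, the recomputed value agrees with the trace only approximately.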
The resulting model has a multiple R-squared of 0.8066 and an adjusted R-squared of 0.8039.
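The two models can be compared on adjusted R-squared, which penalizes the number of predictors. As a small worked check, the helper `adj_r2` below (a hypothetical name) applies the standard formula to the R-squared values and predictor counts reported in the two summaries, with n = 1460 training rows.

```r
# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1),
# where p is the number of estimated slopes.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 1460
adj_r2(0.8075, n, 32)  # model1: about 0.8032
adj_r2(0.8066, n, 20)  # backward model: about 0.8039
```

The stepwise model gives up a little raw R-squared but scores slightly higher once the 12 dropped predictors are accounted for.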
hist(resid(backward), breaks=35, prob=TRUE)
curve(dnorm(x, mean = mean(resid(backward)), sd = sd(resid(backward))), col="red", add=TRUE)
The residuals are randomly dispersed around y=0 with some outliers.
The QQ plot also shows some outliers at both ends of the graph.
The histogram is unimodal and fairly normal.
Predict the SalePrice on the test dataset.
pred1 <- predict(model1, test1)
## Warning in predict.lm(model1, test1): prediction from a rank-deficient fit
## may be misleading
kaggle1 <- as.data.frame(cbind(test$Id, pred1))
colnames(kaggle1) <- c("Id", "SalePrice")
write.csv(kaggle1, file="Kaggle_Submission1.csv", quote=FALSE, row.names=FALSE)
Pred1 got 0.44206 on Kaggle.
pred2 <- predict(backward, test1)
kaggle2 <- as.data.frame(cbind(test$Id, pred2))
colnames(kaggle2) <- c("Id", "SalePrice")
write.csv(kaggle2, file="Kaggle_Submission2.csv", quote=FALSE, row.names=FALSE)
Pred2 got 0.44345 on Kaggle.
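To estimate a score locally before submitting, one option is to compute the error on a hold-out split. The sketch below assumes the competition metric is the RMSE between the logarithms of predicted and observed prices, which is worth confirming on the competition's evaluation page. The function name `rmsle` and all numbers are hypothetical.

```r
# RMSE on the log scale; assumes all prices are strictly positive.
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Hypothetical hold-out values, for illustration only.
actual <- c(200000, 150000, 310000)
predicted <- c(210000, 140000, 300000)
rmsle(actual, predicted)

# A linear model on raw prices can predict values at or below zero,
# where log() is undefined; counting them is a cheap sanity check.
sum(c(-5000, 120000, 98000) <= 0)
```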
Kaggle
Kaggle Username: sinyingwong
Kaggle Screenshot
Future improvement
Improve the model by using kNN, Decision Tree, Bootstrap, Bagging, and/or other combinations of the algorithms.
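As one illustration of the bagging idea, the sketch below averages the predictions of linear models fitted on bootstrap resamples. It uses the built-in mtcars data, and the split size, number of resamples `B`, and model formula are arbitrary assumptions. Bagging tends to help more with high-variance learners such as decision trees than with a plain linear model.

```r
set.seed(605)
# Split a built-in dataset (mtcars) into training and hold-out rows.
idx <- sample(nrow(mtcars), 24)
train_toy <- mtcars[idx, ]
test_toy <- mtcars[-idx, ]

# Fit the same linear model on B bootstrap resamples of the training rows.
B <- 50
preds <- sapply(seq_len(B), function(b) {
  boot <- train_toy[sample(nrow(train_toy), replace = TRUE), ]
  predict(lm(mpg ~ wt + hp, data = boot), newdata = test_toy)
})

# The bagged prediction is the average across the B fitted models.
bagged <- rowMeans(preds)
```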