Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.
library(dplyr)
library(kableExtra)
library(corrplot)
library(MASS)
library(ggplot2)
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu = \sigma=(N+1)/2\).
Generate random variable X:
N <- round(runif(1, 10, 100))
n <- 10000
X <- runif(n, min = 1, max = N)  # the task asks for uniform values from 1 to N
hist(X)
Generate random variable Y:
# Reuse N and n from the X step so that Y is built from the same N
Y <- rnorm(n,(N+1)/2,(N+1)/2)
hist(Y)
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
5 points. a. P(X>x | X>y) b. P(X>x, Y>y) c. P(X<x | X>y)
x <- median(X)
y <- quantile(Y, 0.25)
\(P(X>x|X>y)= \frac{P(X>x \&\ X>y)}{P(X>y)}\)
Probability that X is greater than its median given that X is greater than the first quartile of Y
p <- (length(X[X>x & X>y])/length(X)) /(length(X[X>y])/length(X))
round(p,2)
## [1] 0.61
\(P(X>x, Y>y) = P(X>x) \cdot P(Y>y)\), which holds because X and Y are generated independently
Probability that X is greater than its median and Y is greater than the first quartile of Y
p <- (length(X[X>x]) / length(X)) * (length(Y[Y>y]) / length(Y))
round(p,2)
## [1] 0.38
\(P(X<x | X>y) = \frac{P(X<x \&\ X>y)}{P(X>y)}\)
Probability that X is less than its median given that X is greater than the first quartile of Y
p = (length(X[X<x & X>y])/length(X)) / (length(X[X>y])/length(X))
round(p,2)
## [1] 0.39
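The three probabilities above can also be computed with `mean()` on logical vectors, which avoids the repeated `length()` calls. This is a minimal sketch rather than the original code: the seed and `N <- 20` are assumptions chosen only to make the example reproducible, and probability b is counted jointly instead of as a product.

```r
set.seed(42)   # assumed seed, only for reproducibility
N <- 20        # assumed value; any N >= 6 satisfies the task
n <- 10000
X <- runif(n, min = 1, max = N)
Y <- rnorm(n, mean = (N + 1) / 2, sd = (N + 1) / 2)

x <- median(X)
y <- quantile(Y, 0.25)

# mean() of a logical vector is the share of TRUE values
p_a <- mean(X > x & X > y) / mean(X > y)   # P(X>x | X>y)
p_b <- mean(X > x & Y > y)                 # P(X>x, Y>y), counted jointly
p_c <- mean(X < x & X > y) / mean(X > y)   # P(X<x | X>y)

round(c(a = p_a, b = p_b, c = p_c), 2)
```

Probabilities a and c condition on the same event and cover complementary outcomes, so they should sum to 1, as 0.61 and 0.39 do in the output above.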
5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
Xgx_Ygy <- length(X[X>x & Y>y])
Xgx_Yly <- length(X[X>x & Y<y])
Xlx_Ygy <- length(X[X<x & Y>y])
Xlx_Yly <- length(X[X<x & Y<y])
matrix <- matrix(c(Xgx_Ygy, Xgx_Yly, Xlx_Ygy, Xlx_Yly), nrow = 2, ncol = 2)
colnames(matrix) <- c("X>x","X<x")
rownames(matrix) <- c("Y>y","Y<y")
table <- as.data.frame(matrix)
kable(table) %>%
kable_styling("striped", full_width = FALSE,bootstrap_options = "bordered")
| | X>x | X<x |
|---|---|---|
| Y>y | 3776 | 3724 |
| Y<y | 1224 | 1276 |
Evaluate P(X>x and Y>y) using table:
table[1,1]/n
## [1] 0.3776
Evaluate P(X>x)P(Y>y) using table:
((table[1,1]/n) + (table[2,1]/n)) * ((table[1,1]/n) + (table[1,2]/n))
## [1] 0.375
The joint probability (0.3776) and the product of the marginal probabilities (0.375) differ by less than 0.003, so the data are consistent with P(X>x and Y>y)=P(X>x)P(Y>y).
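The comparison of joint and marginal probabilities can be laid out for all four cells at once. The sketch below retypes the counts from the table above and assumes that the first row holds the cases with Y>y (the row containing about 75% of the observations); the object names are illustrative only.

```r
# Counts taken from the table above; the first row is assumed to be Y > y
counts <- matrix(c(3776, 1224, 3724, 1276), nrow = 2,
                 dimnames = list(c("Y>y", "Y<y"), c("X>x", "X<x")))

joint <- prop.table(counts)      # joint probabilities (cell count / total)
p_Y <- rowSums(joint)            # marginal probabilities for Y
p_X <- colSums(joint)            # marginal probabilities for X
expected <- outer(p_Y, p_X)      # joint probabilities implied by independence

addmargins(joint)                # joint table with marginal totals
round(joint - expected, 4)       # deviation from independence in each cell
```

Under independence every cell of `joint - expected` should be close to zero, which is the same check as above extended to the whole table.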
5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Null Hypothesis: X>x and Y>y are independent events
Alternate Hypothesis: X>x and Y>y are dependent events
fisher.test(matrix)
##
## Fisher's Exact Test for Count Data
##
## data: matrix
## p-value = 0.2389
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.964496 1.158476
## sample estimates:
## odds ratio
## 1.057068
chisq.test(matrix)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: matrix
## X-squared = 1.3872, df = 1, p-value = 0.2389
The p-value is about the same for both tests (0.2389) and is greater than 0.05. Thus, we fail to reject the null hypothesis that the events X>x and Y>y are independent. Fisher’s Exact Test computes an exact p-value and is usually recommended when expected cell counts are small (less than 5). The Chi Square Test relies on a large-sample approximation and is used when cell counts are large. Here every cell count is above 1,000, so the Chi Square Test is appropriate; Fisher’s Exact Test is also valid for a 2×2 contingency table and gives essentially the same result.
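One way to decide between the two tests is to look at the expected cell counts that the Chi Square Test computes. The following sketch retypes the counts from the table above; the "at least 5" cutoff is the usual rule of thumb, not something fixed by R.

```r
# Counts retyped from the contingency table above
counts <- matrix(c(3776, 1224, 3724, 1276), nrow = 2)

chi <- chisq.test(counts)   # Yates' continuity correction is the default for 2x2
chi$expected                # expected counts under independence
all(chi$expected >= 5)      # rule-of-thumb check for the approximation

# Both tests lead to the same decision at the 0.05 level
c(chi_square = chi$p.value, fisher = fisher.test(counts)$p.value)
```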
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
Provide univariate descriptive statistics and appropriate plots for the training data set.
train <- read.csv("train.csv", header = TRUE)
head(train)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 C (all): 10 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 FV : 65 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 RH : 16 Median : 69.00
## Mean : 730.5 Mean : 56.9 RL :1151 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 RM : 218 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape LandContour
## Min. : 1300 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63
## 1st Qu.: 7554 Pave:1454 Pave: 41 IR2: 41 HLS: 50
## Median : 9478 NA's:1369 IR3: 10 Low: 36
## Mean : 10517 Reg:925 Lvl:1311
## 3rd Qu.: 11602
## Max. :215245
##
## Utilities LotConfig LandSlope Neighborhood Condition1
## AllPub:1459 Corner : 263 Gtl:1382 NAmes :225 Norm :1260
## NoSeWa: 1 CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48
## FR3 : 4 Edwards:100 RRAn : 26
## Inside :1052 Somerst: 86 PosN : 19
## Gilbert: 79 RRAe : 11
## (Other):707 (Other): 15
## Condition2 BldgType HouseStyle OverallQual
## Norm :1445 1Fam :1220 1Story :726 Min. : 1.000
## Feedr : 6 2fmCon: 31 2Story :445 1st Qu.: 5.000
## Artery : 2 Duplex: 52 1.5Fin :154 Median : 6.000
## PosN : 2 Twnhs : 43 SLvl : 65 Mean : 6.099
## RRNn : 2 TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000
## PosA : 1 1.5Unf : 14 Max. :10.000
## (Other): 2 (Other): 19
## OverallCond YearBuilt YearRemodAdd RoofStyle
## Min. :1.000 Min. :1872 Min. :1950 Flat : 13
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 Gable :1141
## Median :5.000 Median :1973 Median :1994 Gambrel: 11
## Mean :5.575 Mean :1971 Mean :1985 Hip : 286
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 Mansard: 7
## Max. :9.000 Max. :2010 Max. :2010 Shed : 2
##
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea
## CompShg:1434 VinylSd:515 VinylSd:504 BrkCmn : 15 Min. : 0.0
## Tar&Grv: 11 HdBoard:222 MetalSd:214 BrkFace:445 1st Qu.: 0.0
## WdShngl: 6 MetalSd:220 HdBoard:207 None :864 Median : 0.0
## WdShake: 5 Wd Sdng:206 Wd Sdng:197 Stone :128 Mean : 103.7
## ClyTile: 1 Plywood:108 Plywood:142 NA's : 8 3rd Qu.: 166.0
## Membran: 1 CemntBd: 61 CmentBd: 60 Max. :1600.0
## (Other): 2 (Other):128 (Other):136 NA's :8
## ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## Ex: 52 Ex: 3 BrkTil:146 Ex :121 Fa : 45 Av :221
## Fa: 14 Fa: 28 CBlock:634 Fa : 35 Gd : 65 Gd :134
## Gd:488 Gd: 146 PConc :647 Gd :618 Po : 2 Mn :114
## TA:906 Po: 1 Slab : 24 TA :649 TA :1311 No :953
## TA:1282 Stone : 6 NA's: 37 NA's: 37 NA's: 38
## Wood : 3
##
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## ALQ :220 Min. : 0.0 ALQ : 19 Min. : 0.00
## BLQ :148 1st Qu.: 0.0 BLQ : 33 1st Qu.: 0.00
## GLQ :418 Median : 383.5 GLQ : 14 Median : 0.00
## LwQ : 74 Mean : 443.6 LwQ : 46 Mean : 46.55
## Rec :133 3rd Qu.: 712.2 Rec : 54 3rd Qu.: 0.00
## Unf :430 Max. :5644.0 Unf :1256 Max. :1474.00
## NA's: 37 NA's: 38
## BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## Min. : 0.0 Min. : 0.0 Floor: 1 Ex:741 N: 95
## 1st Qu.: 223.0 1st Qu.: 795.8 GasA :1428 Fa: 49 Y:1365
## Median : 477.5 Median : 991.5 GasW : 18 Gd:241
## Mean : 567.2 Mean :1057.4 Grav : 7 Po: 1
## 3rd Qu.: 808.0 3rd Qu.:1298.2 OthW : 2 TA:428
## Max. :2336.0 Max. :6110.0 Wall : 4
##
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## FuseA: 94 Min. : 334 Min. : 0 Min. : 0.000
## FuseF: 27 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## FuseP: 3 Median :1087 Median : 0 Median : 0.000
## Mix : 1 Mean :1163 Mean : 347 Mean : 5.845
## SBrkr:1334 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## NA's : 1 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## Min. :0.0000 Min. :0.000 Min. :0.000 Ex:100
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Fa: 39
## Median :0.0000 Median :3.000 Median :1.000 Gd:586
## Mean :0.3829 Mean :2.866 Mean :1.047 TA:735
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :2.0000 Max. :8.000 Max. :3.000
##
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType
## Min. : 2.000 Maj1: 14 Min. :0.000 Ex : 24 2Types : 6
## 1st Qu.: 5.000 Maj2: 5 1st Qu.:0.000 Fa : 33 Attchd :870
## Median : 6.000 Min1: 31 Median :1.000 Gd :380 Basment: 19
## Mean : 6.518 Min2: 34 Mean :0.613 Po : 20 BuiltIn: 88
## 3rd Qu.: 7.000 Mod : 15 3rd Qu.:1.000 TA :313 CarPort: 9
## Max. :14.000 Sev : 1 Max. :3.000 NA's:690 Detchd :387
## Typ :1360 NA's : 81
## GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## Min. :1900 Fin :352 Min. :0.000 Min. : 0.0 Ex : 3
## 1st Qu.:1961 RFn :422 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48
## Median :1980 Unf :605 Median :2.000 Median : 480.0 Gd : 14
## Mean :1979 NA's: 81 Mean :1.767 Mean : 473.0 Po : 3
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 TA :1311
## Max. :2010 Max. :4.000 Max. :1418.0 NA's: 81
## NA's :81
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch
## Ex : 2 N: 90 Min. : 0.00 Min. : 0.00 Min. : 0.00
## Fa : 35 P: 30 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Gd : 9 Y:1340 Median : 0.00 Median : 25.00 Median : 0.00
## Po : 7 Mean : 94.24 Mean : 46.66 Mean : 21.95
## TA :1326 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00
## NA's: 81 Max. :857.00 Max. :547.00 Max. :552.00
##
## X3SsnPorch ScreenPorch PoolArea PoolQC
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Ex : 2
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2
## Median : 0.00 Median : 0.00 Median : 0.000 Gd : 3
## Mean : 3.41 Mean : 15.06 Mean : 2.759 NA's:1453
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :508.00 Max. :480.00 Max. :738.000
##
## Fence MiscFeature MiscVal MoSold
## GdPrv: 59 Gar2: 2 Min. : 0.00 Min. : 1.000
## GdWo : 54 Othr: 2 1st Qu.: 0.00 1st Qu.: 5.000
## MnPrv: 157 Shed: 49 Median : 0.00 Median : 6.000
## MnWw : 11 TenC: 1 Mean : 43.49 Mean : 6.322
## NA's :1179 NA's:1406 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :15500.00 Max. :12.000
##
## YrSold SaleType SaleCondition SalePrice
## Min. :2006 WD :1267 Abnorml: 101 Min. : 34900
## 1st Qu.:2007 New : 122 AdjLand: 4 1st Qu.:129975
## Median :2008 COD : 43 Alloca : 12 Median :163000
## Mean :2008 ConLD : 9 Family : 20 Mean :180921
## 3rd Qu.:2009 ConLI : 5 Normal :1198 3rd Qu.:214000
## Max. :2010 ConLw : 5 Partial: 125 Max. :755000
## (Other): 9
#Sample box plots
ggplot(train, aes(x=YearBuilt, y=SalePrice, fill=YearBuilt, group=YearBuilt)) + geom_boxplot()
train$OverallQual_factor <- as.factor(as.character(train$OverallQual))
ggplot(train, aes(x=OverallQual, y=SalePrice, fill=OverallQual_factor)) + geom_boxplot()
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
ggplot(train, aes(x = YearBuilt, y = SalePrice)) +
geom_point() +
labs(
title = "Year Built vs Sale Price"
)
ggplot(train, aes(x = TotalBsmtSF, y = SalePrice)) +
geom_point() +
labs(
title = "Total Bsmt SF vs Sale Price"
)
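The two plots above show each predictor against SalePrice separately. A single scatterplot matrix can be drawn with base R's `pairs()`. To keep the sketch self-contained it uses only the six rows printed by `head(train)` above; with the full data loaded, the commented call at the end would be the one to run.

```r
# First six rows of the training data, copied from the head(train) output
first_rows <- data.frame(
  YearBuilt   = c(2003, 1976, 2001, 1915, 2000, 1993),
  TotalBsmtSF = c(856, 1262, 920, 756, 1145, 796),
  SalePrice   = c(208500, 181500, 223500, 140000, 250000, 143000)
)

# pairs() draws every pairwise scatterplot in one grid
pairs(first_rows, main = "Scatterplot matrix (first six rows only)")

# With the full data set loaded, the equivalent call would be:
# pairs(train[, c("YearBuilt", "TotalBsmtSF", "SalePrice")])
```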
Derive a correlation matrix for any three quantitative variables in the dataset.
var <- dplyr::select(train, YearBuilt, TotalBsmtSF, OverallCond, SalePrice)
corr <- cor(var, method = "pearson", use = "complete.obs")
corr
## YearBuilt TotalBsmtSF OverallCond SalePrice
## YearBuilt 1.0000000 0.3914520 -0.37598320 0.52289733
## TotalBsmtSF 0.3914520 1.0000000 -0.17109751 0.61358055
## OverallCond -0.3759832 -0.1710975 1.00000000 -0.07785589
## SalePrice 0.5228973 0.6135806 -0.07785589 1.00000000
corrplot(corr,method ="color")
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
cor.test(var$SalePrice,var$YearBuilt, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: var$SalePrice and var$YearBuilt
## t = 23.424, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4980766 0.5468619
## sample estimates:
## cor
## 0.5228973
cor.test(var$SalePrice,var$TotalBsmtSF, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: var$SalePrice and var$TotalBsmtSF
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5922142 0.6340846
## sample estimates:
## cor
## 0.6135806
cor.test(var$SalePrice,var$OverallCond, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: var$SalePrice and var$OverallCond
## t = -2.9819, df = 1458, p-value = 0.002912
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.1111272 -0.0444103
## sample estimates:
## cor
## -0.07785589
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
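The p-value in all three tests is less than 0.05, so we reject the null hypothesis in favor of the alternative and conclude that each true correlation is not equal to 0. None of the 80% confidence intervals contains 0.
Familywise error is the probability of making at least one Type I error (a false positive) when performing multiple hypothesis tests. With three tests at the 0.05 level, that probability could be as high as about 14%. Even so, I would not be very worried in this case: the largest p-value above (0.0029) is still below the Bonferroni-adjusted threshold of 0.05/3 ≈ 0.0167, so all three correlations remain significant after correcting for multiple testing.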
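The familywise reasoning can be checked numerically with `p.adjust()`. This is a sketch using the p-values reported by `cor.test` above. Two of them are printed only as "< 2.2e-16", so 2.2e-16 is used as an upper bound, and the familywise rate shown assumes the three tests are independent.

```r
# p-values from the cor.test output; 2.2e-16 is an upper bound for the first two
p_values <- c(YearBuilt = 2.2e-16, TotalBsmtSF = 2.2e-16, OverallCond = 0.002912)

alpha <- 0.05
# Chance of at least one false positive if the tests were independent
fwer_upper <- 1 - (1 - alpha)^length(p_values)

# Bonferroni adjustment multiplies each p-value by the number of tests
p_bonf <- p.adjust(p_values, method = "bonferroni")

fwer_upper
p_bonf
all(p_bonf < alpha)
```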
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
precisionMatrix<-solve(corr)
round(precisionMatrix,2)
## YearBuilt TotalBsmtSF OverallCond SalePrice
## YearBuilt 1.63 -0.08 0.54 -0.76
## TotalBsmtSF -0.08 1.65 0.18 -0.96
## OverallCond 0.54 0.18 1.21 -0.30
## SalePrice -0.76 -0.96 -0.30 1.96
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
round(corr %*% precisionMatrix, 2)
## YearBuilt TotalBsmtSF OverallCond SalePrice
## YearBuilt 1 0 0 0
## TotalBsmtSF 0 1 0 0
## OverallCond 0 0 1 0
## SalePrice 0 0 0 1
round(precisionMatrix %*% corr, 2)
## YearBuilt TotalBsmtSF OverallCond SalePrice
## YearBuilt 1 0 0 0
## TotalBsmtSF 0 1 0 0
## OverallCond 0 0 1 0
## SalePrice 0 0 0 1
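The prompt also asks for an LU decomposition, which is not shown above. Below is a minimal sketch of a Doolittle factorization without pivoting, applied to the correlation matrix retyped from the printed output. The helper `lu_decompose` is a hypothetical function written for this illustration, not part of base R. Skipping pivoting is an assumption that is safe here because a correlation matrix of this kind is positive definite.

```r
# Correlation matrix copied from the output above
vars <- c("YearBuilt", "TotalBsmtSF", "OverallCond", "SalePrice")
corr <- matrix(c( 1.0000000,  0.3914520, -0.37598320,  0.52289733,
                  0.3914520,  1.0000000, -0.17109751,  0.61358055,
                 -0.3759832, -0.1710975,  1.00000000, -0.07785589,
                  0.5228973,  0.6135806, -0.07785589,  1.00000000),
               nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Doolittle LU without pivoting: A = L %*% U,
# with L unit lower triangular and U upper triangular
lu_decompose <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (j in seq_len(n - 1)) {
    for (i in (j + 1):n) {
      L[i, j] <- U[i, j] / U[j, j]          # elimination multiplier
      U[i, ] <- U[i, ] - L[i, j] * U[j, ]   # eliminate the entry below the pivot
    }
  }
  list(L = L, U = U)
}

lu <- lu_decompose(corr)
round(lu$L, 3)
round(lu$U, 3)
all.equal(lu$L %*% lu$U, corr, check.attributes = FALSE)
```

Multiplying the two factors back together should reproduce the correlation matrix, which is what the final line checks.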
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
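Right-skewed data typically has a mean greater than the median. LotArea qualifies (mean 10517, median 9478 in the summary below), and its minimum of 1300 is already above zero, so no shift is needed.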
train.num <- dplyr::select_if(train, is.numeric)
summary(train.num)
## Id MSSubClass LotFrontage LotArea
## Min. : 1.0 Min. : 20.0 Min. : 21.00 Min. : 1300
## 1st Qu.: 365.8 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554
## Median : 730.5 Median : 50.0 Median : 69.00 Median : 9478
## Mean : 730.5 Mean : 56.9 Mean : 70.05 Mean : 10517
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602
## Max. :1460.0 Max. :190.0 Max. :313.00 Max. :215245
## NA's :259
## OverallQual OverallCond YearBuilt YearRemodAdd
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967
## Median : 6.000 Median :5.000 Median :1973 Median :1994
## Mean : 6.099 Mean :5.575 Mean :1971 Mean :1985
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
##
## MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0
## Median : 0.0 Median : 383.5 Median : 0.00 Median : 477.5
## Mean : 103.7 Mean : 443.6 Mean : 46.55 Mean : 567.2
## 3rd Qu.: 166.0 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0
## Max. :1600.0 Max. :5644.0 Max. :1474.00 Max. :2336.0
## NA's :8
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF
## Min. : 0.0 Min. : 334 Min. : 0 Min. : 0.000
## 1st Qu.: 795.8 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## Median : 991.5 Median :1087 Median : 0 Median : 0.000
## Mean :1057.4 Mean :1163 Mean : 347 Mean : 5.845
## 3rd Qu.:1298.2 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## Max. :6110.0 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000
## Median :0.0000 Median :3.000 Median :1.000 Median : 6.000
## Mean :0.3829 Mean :2.866 Mean :1.047 Mean : 6.518
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :2.0000 Max. :8.000 Max. :3.000 Max. :14.000
##
## Fireplaces GarageYrBlt GarageCars GarageArea
## Min. :0.000 Min. :1900 Min. :0.000 Min. : 0.0
## 1st Qu.:0.000 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5
## Median :1.000 Median :1980 Median :2.000 Median : 480.0
## Mean :0.613 Mean :1979 Mean :1.767 Mean : 473.0
## 3rd Qu.:1.000 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :3.000 Max. :2010 Max. :4.000 Max. :1418.0
## NA's :81
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea MiscVal MoSold
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 1.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 6.000
## Mean : 15.06 Mean : 2.759 Mean : 43.49 Mean : 6.322
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :480.00 Max. :738.000 Max. :15500.00 Max. :12.000
##
## YrSold SalePrice
## Min. :2006 Min. : 34900
## 1st Qu.:2007 1st Qu.:129975
## Median :2008 Median :163000
## Mean :2008 Mean :180921
## 3rd Qu.:2009 3rd Qu.:214000
## Max. :2010 Max. :755000
##
hist(train.num$LotArea, col='blue', breaks=20)
Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
fit <- fitdistr(train.num$LotArea, densfun = "exponential")
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).
lamda <- fit$estimate
lamda
## rate
## 9.50857e-05
sample <- rexp(1000, lamda)
summary(sample)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.04 3143.47 7098.76 10424.38 14398.37 64966.61
Plot a histogram and compare it with a histogram of your original variable.
hist(sample, col = 'blue', breaks = 20)
The histogram of the new variable looks more like an exponential distribution.
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
qexp(c(0.05,0.95), lamda)
## [1] 539.4428 31505.6013
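The percentiles returned by `qexp` follow directly from inverting the exponential CDF \(F(q) = 1 - e^{-\lambda q}\), which gives \(q = -\ln(1-p)/\lambda\). The sketch below checks this with the rate printed above, retyped by hand, so the last digits may differ slightly from the output.

```r
lambda <- 9.50857e-05   # rate estimate printed by fitdistr above
p <- c(0.05, 0.95)

# Solve p = 1 - exp(-lambda * q) for q
q_manual <- -log(1 - p) / lambda
q_manual

# The closed form agrees with R's quantile function
all.equal(q_manual, qexp(p, rate = lambda))
```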
Also generate a 95% confidence interval from the empirical data, assuming normality.
qnorm(c(0.025, 0.975), mean = mean(train.num$LotArea), sd = sd(train.num$LotArea))
## [1] -9046.092 30079.748
Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
quantile(train.num$LotArea, c(0.05, 0.95))
## 5% 95%
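## 3311.70 17401.15
The exponential fit puts the 5th and 95th percentiles at about 539 and 31,506, far outside the empirical values of 3,312 and 17,401, so the exponential distribution is a poor description of LotArea. The normal-based interval is no better: its lower bound of -9,046 is impossible for an area.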
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
c <- cor(train.num, method = "pearson", use = "complete.obs")
lm <- lm(SalePrice ~ OverallQual+ YearBuilt+YearRemodAdd+MasVnrArea+X1stFlrSF+TotalBsmtSF+ GrLivArea+ FullBath+TotRmsAbvGrd+GarageCars+GarageArea+ BsmtFinSF1+ LotArea+Fireplaces+ BedroomAbvGr, data = train.num)
summary(lm)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + YearRemodAdd +
## MasVnrArea + X1stFlrSF + TotalBsmtSF + GrLivArea + FullBath +
## TotRmsAbvGrd + GarageCars + GarageArea + BsmtFinSF1 + LotArea +
## Fireplaces + BedroomAbvGr, data = train.num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531792 -16783 -1389 14782 286886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.130e+06 1.250e+05 -9.040 < 2e-16 ***
## OverallQual 1.841e+04 1.180e+03 15.605 < 2e-16 ***
## YearBuilt 2.015e+02 4.894e+01 4.116 4.07e-05 ***
## YearRemodAdd 3.407e+02 6.180e+01 5.513 4.17e-08 ***
## MasVnrArea 2.956e+01 6.079e+00 4.863 1.28e-06 ***
## X1stFlrSF 5.553e+00 4.794e+00 1.158 0.246885
## TotalBsmtSF 1.165e+01 4.250e+00 2.740 0.006218 **
## GrLivArea 4.006e+01 4.225e+00 9.484 < 2e-16 ***
## FullBath -6.172e+02 2.616e+03 -0.236 0.813537
## TotRmsAbvGrd 4.036e+03 1.219e+03 3.312 0.000948 ***
## GarageCars 9.847e+03 2.921e+03 3.372 0.000767 ***
## GarageArea 7.820e+00 9.929e+00 0.788 0.431071
## BsmtFinSF1 1.625e+01 2.575e+00 6.312 3.67e-10 ***
## LotArea 5.423e-01 1.027e-01 5.281 1.48e-07 ***
## Fireplaces 6.478e+03 1.784e+03 3.632 0.000292 ***
## BedroomAbvGr -7.114e+03 1.710e+03 -4.162 3.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36020 on 1436 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.7958, Adjusted R-squared: 0.7936
## F-statistic: 373 on 15 and 1436 DF, p-value: < 2.2e-16
lm1 <- lm(SalePrice ~ OverallQual+ YearBuilt+YearRemodAdd+MasVnrArea+TotalBsmtSF+ GrLivArea+TotRmsAbvGrd+GarageCars+ BsmtFinSF1+ LotArea+Fireplaces+ BedroomAbvGr, data = train.num)
summary(lm1)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + YearRemodAdd +
## MasVnrArea + TotalBsmtSF + GrLivArea + TotRmsAbvGrd + GarageCars +
## BsmtFinSF1 + LotArea + Fireplaces + BedroomAbvGr, data = train.num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -530612 -17143 -1214 14877 285552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.109e+06 1.173e+05 -9.452 < 2e-16 ***
## OverallQual 1.821e+04 1.167e+03 15.601 < 2e-16 ***
## YearBuilt 1.944e+02 4.633e+01 4.196 2.88e-05 ***
## YearRemodAdd 3.376e+02 6.142e+01 5.497 4.56e-08 ***
## MasVnrArea 2.977e+01 6.066e+00 4.907 1.03e-06 ***
## TotalBsmtSF 1.543e+01 3.029e+00 5.095 3.96e-07 ***
## GrLivArea 4.104e+01 3.957e+00 10.371 < 2e-16 ***
## TotRmsAbvGrd 4.093e+03 1.213e+03 3.375 0.000759 ***
## GarageCars 1.187e+04 1.734e+03 6.844 1.14e-11 ***
## BsmtFinSF1 1.662e+01 2.538e+00 6.549 8.06e-11 ***
## LotArea 5.512e-01 1.024e-01 5.380 8.67e-08 ***
## Fireplaces 6.581e+03 1.755e+03 3.750 0.000184 ***
## BedroomAbvGr -7.451e+03 1.686e+03 -4.418 1.07e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36010 on 1439 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.7955, Adjusted R-squared: 0.7938
## F-statistic: 466.4 on 12 and 1439 DF, p-value: < 2.2e-16
Let’s check whether data transformations can improve the model.
hist(train.num$YearBuilt, breaks=30) #left skewed
hist(train.num$MasVnrArea, breaks=30) #right skewed
hist(train.num$X1stFlrSF, breaks=30) #right skewed
hist(train.num$GrLivArea, breaks=30) #right skewed
yrblt2 <- (train.num$YearBuilt)^2
logMasVnrArea <- log1p(train.num$MasVnrArea) # log1p because MasVnrArea contains zeros and log(0) is -Inf
log1stFlr <- log(train.num$X1stFlrSF)
logGrLivArea <- log(train.num$GrLivArea)
lm2 <- lm(train.num$SalePrice ~ train.num$OverallQual+ yrblt2+train.num$YearRemodAdd+train.num$MasVnrArea+log1stFlr+train.num$TotalBsmtSF+ logGrLivArea+ train.num$FullBath+train.num$GarageCars+train.num$BsmtFinSF1+ train.num$LotArea+train.num$Fireplaces+ train.num$BedroomAbvGr)
summary(lm2)
##
## Call:
## lm(formula = train.num$SalePrice ~ train.num$OverallQual + yrblt2 +
## train.num$YearRemodAdd + train.num$MasVnrArea + log1stFlr +
## train.num$TotalBsmtSF + logGrLivArea + train.num$FullBath +
## train.num$GarageCars + train.num$BsmtFinSF1 + train.num$LotArea +
## train.num$Fireplaces + train.num$BedroomAbvGr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -434893 -18372 -2398 15465 335138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.362e+06 1.252e+05 -10.876 < 2e-16 ***
## train.num$OverallQual 1.954e+04 1.225e+03 15.956 < 2e-16 ***
## yrblt2 2.855e-02 1.261e-02 2.264 0.023701 *
## train.num$YearRemodAdd 3.649e+02 6.352e+01 5.744 1.13e-08 ***
## train.num$MasVnrArea 3.792e+01 6.192e+00 6.124 1.18e-09 ***
## log1stFlr 1.550e+04 5.408e+03 2.866 0.004220 **
## train.num$TotalBsmtSF 9.617e+00 4.091e+00 2.351 0.018875 *
## logGrLivArea 5.943e+04 5.876e+03 10.114 < 2e-16 ***
## train.num$FullBath 2.632e+03 2.674e+03 0.984 0.325241
## train.num$GarageCars 1.161e+04 1.806e+03 6.428 1.75e-10 ***
## train.num$BsmtFinSF1 1.858e+01 2.612e+00 7.114 1.78e-12 ***
## train.num$LotArea 5.800e-01 1.052e-01 5.513 4.17e-08 ***
## train.num$Fireplaces 6.457e+03 1.837e+03 3.514 0.000455 ***
## train.num$BedroomAbvGr -3.159e+03 1.582e+03 -1.997 0.045989 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37010 on 1438 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.784, Adjusted R-squared: 0.7821
## F-statistic: 401.6 on 13 and 1438 DF, p-value: < 2.2e-16
lm3 <- lm(train.num$SalePrice ~ train.num$OverallQual+ yrblt2+train.num$YearRemodAdd+train.num$MasVnrArea+log1stFlr+train.num$TotalBsmtSF+ logGrLivArea+train.num$GarageCars+train.num$BsmtFinSF1+ train.num$LotArea+train.num$Fireplace)
summary(lm3)
##
## Call:
## lm(formula = train.num$SalePrice ~ train.num$OverallQual + yrblt2 +
## train.num$YearRemodAdd + train.num$MasVnrArea + log1stFlr +
## train.num$TotalBsmtSF + logGrLivArea + train.num$GarageCars +
## train.num$BsmtFinSF1 + train.num$LotArea + train.num$Fireplace)
##
## Residuals:
## Min 1Q Median 3Q Max
## -433535 -18574 -2246 15207 336037
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.401e+06 1.183e+05 -11.845 < 2e-16 ***
## train.num$OverallQual 1.998e+04 1.208e+03 16.538 < 2e-16 ***
## yrblt2 3.150e-02 1.203e-02 2.618 0.008928 **
## train.num$YearRemodAdd 3.868e+02 6.269e+01 6.169 8.89e-10 ***
## train.num$MasVnrArea 3.777e+01 6.196e+00 6.096 1.40e-09 ***
## log1stFlr 1.623e+04 5.402e+03 3.004 0.002707 **
## train.num$TotalBsmtSF 9.428e+00 4.093e+00 2.304 0.021386 *
## logGrLivArea 5.539e+04 4.375e+03 12.659 < 2e-16 ***
## train.num$GarageCars 1.193e+04 1.802e+03 6.620 5.04e-11 ***
## train.num$BsmtFinSF1 1.889e+01 2.577e+00 7.333 3.74e-13 ***
## train.num$LotArea 5.762e-01 1.052e-01 5.475 5.15e-08 ***
## train.num$Fireplace 6.788e+03 1.817e+03 3.736 0.000194 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37040 on 1440 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.7833, Adjusted R-squared: 0.7817
## F-statistic: 473.3 on 11 and 1440 DF, p-value: < 2.2e-16
I chose lm1 as the best model because it had the greatest adjusted \(R^2\) (0.7938) while using fewer predictor variables than the full model (12 versus 15).
Now let’s evaluate the quality of the model.
plot(lm1$fitted.values, lm1$residuals, xlab='Fitted Values', ylab='Residuals')
abline(0,0)
hist(lm1$residuals)
qqnorm(lm1$residuals)
qqline(lm1$residuals)
The residuals appear to be approximately normally distributed around 0, except for some deviation in the upper tail.
To improve this model I would evaluate and remove outlier data points if appropriate by looking at Cook’s distance.
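The Cook's distance check mentioned above could look roughly like the sketch below. It uses the built-in `mtcars` data as a stand-in so that it runs without train.csv. The 4/n cutoff is a common rule of thumb and an assumption here, not a fixed standard.

```r
# mtcars is a stand-in data set so the example runs without train.csv
fit <- lm(mpg ~ wt + hp, data = mtcars)

cooks_d <- cooks.distance(fit)        # one value per observation
threshold <- 4 / nrow(mtcars)         # rule-of-thumb cutoff (assumption)
influential <- names(cooks_d)[cooks_d > threshold]
influential

plot(cooks_d, type = "h", ylab = "Cook's distance")
abline(h = threshold, lty = 2)

# For the house-price model the same steps would start from cooks.distance(lm1)
```

Flagged points are candidates for closer inspection rather than automatic removal.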
test <- read.csv("test.csv", header = TRUE)
head(test)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1461 20 RH 80 11622 Pave <NA> Reg
## 2 1462 20 RL 81 14267 Pave <NA> IR1
## 3 1463 60 RL 74 13830 Pave <NA> IR1
## 4 1464 60 RL 78 9978 Pave <NA> IR1
## 5 1465 120 RL 43 5005 Pave <NA> IR1
## 6 1466 60 RL 75 10000 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl NAmes Feedr
## 2 Lvl AllPub Corner Gtl NAmes Norm
## 3 Lvl AllPub Inside Gtl Gilbert Norm
## 4 Lvl AllPub Inside Gtl Gilbert Norm
## 5 HLS AllPub Inside Gtl StoneBr Norm
## 6 Lvl AllPub Corner Gtl Gilbert Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 1Story 5 6 1961
## 2 Norm 1Fam 1Story 6 6 1958
## 3 Norm 1Fam 2Story 5 5 1997
## 4 Norm 1Fam 2Story 6 6 1998
## 5 Norm TwnhsE 1Story 8 5 1992
## 6 Norm 1Fam 2Story 6 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 1961 Gable CompShg VinylSd VinylSd None
## 2 1958 Hip CompShg Wd Sdng Wd Sdng BrkFace
## 3 1998 Gable CompShg VinylSd VinylSd None
## 4 1998 Gable CompShg VinylSd VinylSd BrkFace
## 5 1992 Gable CompShg HdBoard HdBoard None
## 6 1994 Gable CompShg HdBoard HdBoard None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 0 TA TA CBlock TA TA No
## 2 108 TA TA CBlock TA TA No
## 3 0 TA TA PConc Gd TA No
## 4 20 TA TA PConc TA TA No
## 5 0 Gd TA PConc Gd TA No
## 6 0 TA TA PConc Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 Rec 468 LwQ 144 270 882
## 2 ALQ 923 Unf 0 406 1329
## 3 GLQ 791 Unf 0 137 928
## 4 GLQ 602 Unf 0 324 926
## 5 ALQ 263 Unf 0 1017 1280
## 6 Unf 0 Unf 0 763 763
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA TA Y SBrkr 896 0 0
## 2 GasA TA Y SBrkr 1329 0 0
## 3 GasA Gd Y SBrkr 928 701 0
## 4 GasA Ex Y SBrkr 926 678 0
## 5 GasA Ex Y SBrkr 1280 0 0
## 6 GasA Gd Y SBrkr 763 892 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 896 0 0 1 0 2
## 2 1329 0 0 1 1 3
## 3 1629 0 0 2 1 3
## 4 1604 0 0 2 1 3
## 5 1280 0 0 2 0 2
## 6 1655 0 0 2 1 3
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 TA 5 Typ 0 <NA>
## 2 1 Gd 6 Typ 0 <NA>
## 3 1 TA 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 5 Typ 0 <NA>
## 6 1 TA 7 Typ 1 TA
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 1961 Unf 1 730 TA
## 2 Attchd 1958 Unf 1 312 TA
## 3 Attchd 1997 Fin 2 482 TA
## 4 Attchd 1998 Fin 2 470 TA
## 5 Attchd 1992 RFn 2 506 TA
## 6 Attchd 1993 Fin 2 440 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 140 0 0 0
## 2 TA Y 393 36 0 0
## 3 TA Y 212 34 0 0
## 4 TA Y 360 36 0 0
## 5 TA Y 0 82 0 0
## 6 TA Y 157 84 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 120 0 <NA> MnPrv <NA> 0 6 2010
## 2 0 0 <NA> <NA> Gar2 12500 6 2010
## 3 0 0 <NA> MnPrv <NA> 0 3 2010
## 4 0 0 <NA> <NA> <NA> 0 6 2010
## 5 144 0 <NA> <NA> <NA> 0 1 2010
## 6 0 0 <NA> <NA> <NA> 0 4 2010
## SaleType SaleCondition
## 1 WD Normal
## 2 WD Normal
## 3 WD Normal
## 4 WD Normal
## 5 WD Normal
## 6 WD Normal
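# Clean data (intended: replace NA with 0).
# Each chained assignment below evaluates to 0 and then overwrites the whole column, not only the NA entries.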
test$LotFrontage <- test$LotFrontage[is.na(test$LotFrontage)] <- 0
test$Alley <- test$Alley[is.na(test$Alley)] <- 0
test$FireplaceQu <- test$FireplaceQu[is.na(test$FireplaceQu)] <- 0
test$PoolQC <- test$PoolQC[is.na(test$PoolQC)] <- 0
test$Fence <- test$Fence[is.na(test$Fence)] <- 0
test$MiscFeature <- test$MiscFeature[is.na(test$MiscFeature)] <- 0
head(test)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1461 20 RH 0 11622 Pave 0 Reg
## 2 1462 20 RL 0 14267 Pave 0 IR1
## 3 1463 60 RL 0 13830 Pave 0 IR1
## 4 1464 60 RL 0 9978 Pave 0 IR1
## 5 1465 120 RL 0 5005 Pave 0 IR1
## 6 1466 60 RL 0 10000 Pave 0 IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl NAmes Feedr
## 2 Lvl AllPub Corner Gtl NAmes Norm
## 3 Lvl AllPub Inside Gtl Gilbert Norm
## 4 Lvl AllPub Inside Gtl Gilbert Norm
## 5 HLS AllPub Inside Gtl StoneBr Norm
## 6 Lvl AllPub Corner Gtl Gilbert Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 1Story 5 6 1961
## 2 Norm 1Fam 1Story 6 6 1958
## 3 Norm 1Fam 2Story 5 5 1997
## 4 Norm 1Fam 2Story 6 6 1998
## 5 Norm TwnhsE 1Story 8 5 1992
## 6 Norm 1Fam 2Story 6 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 1961 Gable CompShg VinylSd VinylSd None
## 2 1958 Hip CompShg Wd Sdng Wd Sdng BrkFace
## 3 1998 Gable CompShg VinylSd VinylSd None
## 4 1998 Gable CompShg VinylSd VinylSd BrkFace
## 5 1992 Gable CompShg HdBoard HdBoard None
## 6 1994 Gable CompShg HdBoard HdBoard None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 0 TA TA CBlock TA TA No
## 2 108 TA TA CBlock TA TA No
## 3 0 TA TA PConc Gd TA No
## 4 20 TA TA PConc TA TA No
## 5 0 Gd TA PConc Gd TA No
## 6 0 TA TA PConc Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 Rec 468 LwQ 144 270 882
## 2 ALQ 923 Unf 0 406 1329
## 3 GLQ 791 Unf 0 137 928
## 4 GLQ 602 Unf 0 324 926
## 5 ALQ 263 Unf 0 1017 1280
## 6 Unf 0 Unf 0 763 763
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA TA Y SBrkr 896 0 0
## 2 GasA TA Y SBrkr 1329 0 0
## 3 GasA Gd Y SBrkr 928 701 0
## 4 GasA Ex Y SBrkr 926 678 0
## 5 GasA Ex Y SBrkr 1280 0 0
## 6 GasA Gd Y SBrkr 763 892 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 896 0 0 1 0 2
## 2 1329 0 0 1 1 3
## 3 1629 0 0 2 1 3
## 4 1604 0 0 2 1 3
## 5 1280 0 0 2 0 2
## 6 1655 0 0 2 1 3
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 TA 5 Typ 0 0
## 2 1 Gd 6 Typ 0 0
## 3 1 TA 6 Typ 1 0
## 4 1 Gd 7 Typ 1 0
## 5 1 Gd 5 Typ 0 0
## 6 1 TA 7 Typ 1 0
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 1961 Unf 1 730 TA
## 2 Attchd 1958 Unf 1 312 TA
## 3 Attchd 1997 Fin 2 482 TA
## 4 Attchd 1998 Fin 2 470 TA
## 5 Attchd 1992 RFn 2 506 TA
## 6 Attchd 1993 Fin 2 440 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 140 0 0 0
## 2 TA Y 393 36 0 0
## 3 TA Y 212 34 0 0
## 4 TA Y 360 36 0 0
## 5 TA Y 0 82 0 0
## 6 TA Y 157 84 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 120 0 0 0 0 0 6 2010
## 2 0 0 0 0 0 12500 6 2010
## 3 0 0 0 0 0 0 3 2010
## 4 0 0 0 0 0 0 6 2010
## 5 144 0 0 0 0 0 1 2010
## 6 0 0 0 0 0 0 4 2010
## SaleType SaleCondition
## 1 WD Normal
## 2 WD Normal
## 3 WD Normal
## 4 WD Normal
## 5 WD Normal
## 6 WD Normal
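The output above shows whole columns set to 0, including values such as LotFrontage that were not missing. The sketch below contrasts the chained assignment with a plain indexed assignment on a small hypothetical data frame `toy`. The replacement label "None" for the character column is an illustrative choice, not something taken from this analysis.

```r
# Toy data frame (hypothetical) with one numeric and one character column.
toy <- data.frame(
  LotFrontage = c(80, NA, 74),
  Fence = c("MnPrv", NA, NA),
  stringsAsFactors = FALSE
)

# Chained form: the inner assignment returns 0, which then overwrites the column.
chained <- toy
chained$LotFrontage <- chained$LotFrontage[is.na(chained$LotFrontage)] <- 0
chained$LotFrontage

# Indexed form: only the NA entries are replaced.
fixed <- toy
fixed$LotFrontage[is.na(fixed$LotFrontage)] <- 0
fixed$Fence[is.na(fixed$Fence)] <- "None"
fixed
```

Whether 0 is a sensible fill value for a numeric column such as LotFrontage is a separate modelling decision. The median is a common alternative.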
results <- predict(lm1, test)
resultsDf <- data.frame(cbind(test$Id, results))
colnames(resultsDf) = c('Id', 'SalePrice')
head(resultsDf, 10)
## Id SalePrice
## 1 1461 107770.3
## 2 1462 157928.8
## 3 1463 179728.0
## 4 1464 196500.4
## 5 1465 205609.9
## 6 1466 183168.0
## 7 1467 178157.0
## 8 1468 177213.7
## 9 1469 204793.9
## 10 1470 105355.7
#write.csv(resultsDf, file="kaggle_submission.csv", row.names=FALSE, na="0")
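One thing worth checking before submitting: coefficient names such as `train.num$TotalBsmtSF` suggest that the model formula referred to columns with `train.num$`. If so, `predict()` may not be able to map those terms onto the new data frame. The sketch below illustrates the difference on the built-in `mtcars` data. The names `train_df`, `test_df`, `fit_dollar` and `fit_data` are hypothetical.

```r
# Split built-in data into stand-ins for the train and test sets.
train_df <- mtcars[1:24, ]
test_df  <- mtcars[25:32, ]

# Formula with `df$col` terms: predict() keeps reading from train_df.
fit_dollar <- lm(train_df$mpg ~ train_df$wt + log(train_df$hp))
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = test_df))
length(pred_dollar)  # follows the training rows, not test_df

# Formula with bare column names and data=: the terms are evaluated in newdata.
fit_data <- lm(mpg ~ wt + log(hp), data = train_df)
pred_data <- predict(fit_data, newdata = test_df)
length(pred_data)    # one prediction per row of test_df
```

Writing transformations such as `log(hp)` inside the formula also means they are reapplied to the test set automatically, so no separately computed log columns are needed.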
Kaggle username: everska
Score: 1.13434
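To put the score in context, assume it is the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price. Lower is better. A minimal sketch of that calculation follows, using made-up prices that are purely illustrative.

```r
# Hypothetical observed and predicted prices (illustrative values only).
actual    <- c(200000, 150000, 320000)
predicted <- c(210000, 140000, 300000)

# Root mean squared error computed on the log scale.
rmsle <- sqrt(mean((log(predicted) - log(actual))^2))
rmsle
```

Because the error is taken on log prices, it measures relative rather than absolute differences, so cheap and expensive houses contribute comparably.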
YouTube presentation:
https://youtu.be/3G8eTTgMvJY