Q1
Reading the Datasets / Finding Attributes that would cause issues later on
require(dplyr)
prices <- read.csv("response_id-1.csv")
training <- read.csv("train_reg_features-1.csv", stringsAsFactors = T)
validation <- read.csv("test_reg_features-1.csv", stringsAsFactors = T)
comb_df <- inner_join(prices, validation)
housing_market <- rbind(training, comb_df) %>% select(-c(Id, Alley, FireplaceQu, PoolQC, Fence, MiscFeature, LotFrontage, Utilities))
all_levels <- housing_market %>% select_if(is.factor)
## The loop below finds, for each factor, the levels that occur two or fewer times in the data. These rare levels cause problems later during splitting (which is what happened with the original 50/50 split).
for (col in 1:ncol(all_levels)) {
print(colnames(all_levels[col]))
print(levels(all_levels[, col])[which(dplyr::count(all_levels, all_levels[, col])[, 2] <= 2)])
}
## [1] "MSZoning"
## character(0)
## [1] "Street"
## character(0)
## [1] "LotShape"
## character(0)
## [1] "LandContour"
## character(0)
## [1] "LotConfig"
## character(0)
## [1] "LandSlope"
## character(0)
## [1] "Neighborhood"
## character(0)
## [1] "Condition1"
## character(0)
## [1] "Condition2"
## [1] "RRAe" "RRAn" "RRNn"
## [1] "BldgType"
## character(0)
## [1] "HouseStyle"
## character(0)
## [1] "RoofStyle"
## character(0)
## [1] "RoofMatl"
## [1] "ClyTile" "Membran" "Metal" "Roll"
## [1] "Exterior1st"
## [1] "AsphShn" "CBlock" "ImStucc" "Stone" NA
## [1] "Exterior2nd"
## [1] "Other" NA
## [1] "MasVnrType"
## character(0)
## [1] "ExterQual"
## character(0)
## [1] "ExterCond"
## character(0)
## [1] "Foundation"
## character(0)
## [1] "BsmtQual"
## character(0)
## [1] "BsmtCond"
## character(0)
## [1] "BsmtExposure"
## character(0)
## [1] "BsmtFinType1"
## character(0)
## [1] "BsmtFinType2"
## character(0)
## [1] "Heating"
## [1] "Floor" "OthW"
## [1] "HeatingQC"
## character(0)
## [1] "CentralAir"
## character(0)
## [1] "Electrical"
## [1] "Mix" NA
## [1] "KitchenQual"
## [1] NA
## [1] "Functional"
## [1] "Sev" NA
## [1] "GarageType"
## character(0)
## [1] "GarageFinish"
## character(0)
## [1] "GarageQual"
## character(0)
## [1] "GarageCond"
## character(0)
## [1] "PavedDrive"
## character(0)
## [1] "SaleType"
## [1] NA
## [1] "SaleCondition"
## character(0)
All levels printed above that are not character(0) will be removed. During the original split of the data, some levels could not be divided between the training and test sets because they appear only once or twice in the entire dataset. The filtering below removes the observations carrying these levels.
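As a more general alternative to listing each rare level by hand (a sketch only, not part of the original workflow; drop_rare_levels and min_count are hypothetical names), rows belonging to any factor level that occurs at or below a chosen count could be dropped programmatically:
## Sketch: drop rows whose factor level occurs min_count times or fewer, then drop the empty levels.
drop_rare_levels <- function(df, min_count = 2) {
  for (col in names(df)[sapply(df, is.factor)]) {
    rare <- names(which(table(df[[col]]) <= min_count))
    if (length(rare) > 0) df <- df[!(df[[col]] %in% rare), ]
  }
  droplevels(df)
}
## Example usage (not run here): housing_market <- drop_rare_levels(housing_market)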
Data Cleaning / Preprocessing
## filter the levels that have only 1 or 2 counts
housing_market <- housing_market %>%
filter(RoofMatl != "ClyTile", RoofMatl != "Membran", RoofMatl != "Metal", RoofMatl != "Roll",
Functional != "Sev",
Condition2 != "RRAe", Condition2 != "RRNn", Condition2 != "RRAn",
Heating != "Floor", Heating != "OthW",
Exterior2nd != "Other",
Exterior1st != "AsphShn", Exterior1st != "CBlock", Exterior1st != "ImStucc", Exterior1st != "Stone",
Electrical != "Mix") %>%
droplevels() ## drop the now-empty factor levels in a single pass
Re-Splitting the Data
require(caret)
require(dplyr)
require(olsrr)
set.seed(100)
trainIndex <- createDataPartition(housing_market$SalePrice, p = 0.5, list = F, times = 1)
train <- housing_market[trainIndex,]
test <- housing_market[-trainIndex, ]
EDA After Cleaning the Data on the Training Set
The following variables were chosen for exploration because they are continuous; plotting them against SalePrice is more straightforward than it is for factors and other non-continuous variables.
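A quick way to list those candidates (a minimal sketch using the same dplyr helpers loaded above) is to pull the numeric columns of the training set:
## List the numeric variables in the training set that are candidates for these plots.
names(select_if(train, is.numeric))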
Sale Price and Lot Area
require(ggplot2)
require(ggpubr)
pl1 <- ggplot(data = train, aes(x = LotArea, y = SalePrice)) + geom_point() + theme_classic() + geom_smooth(method = 'lm', formula = y~x)
pl2 <- ggplot(data = train, aes(x = LotArea, y = SalePrice, color = HouseStyle)) + geom_point() + theme_classic()
## Investigating Sale Price by HouseStyle alone
pl3 <- ggplot(data = train, aes(x = HouseStyle, y = SalePrice, color = HouseStyle)) + geom_boxplot() +
theme(axis.title.x = element_blank(), axis.text.x = element_blank())
ggarrange(pl1, pl2, pl3, nrow = 3, heights = c(1.5, 1.5, 2), common.legend = T)
Most houses are either one- or two-story. Two-story houses show a higher potential sale price than one-story houses: they have the highest sale price overall, and several of their points sit above the one-story maximum. Lot area does appear to have some impact on sale price, but with considerable variability; some houses on lots of 20,000 square feet or less sold for more than houses on lots of 60,000 square feet or more. This is likely due to the influence of other variables.
Sale Price by Sale Condition
pl4 <- ggplot(data = train, aes(x = SaleCondition, y = SalePrice, color = SaleCondition)) + geom_boxplot() +
theme(axis.title.x = element_blank(), axis.text.x = element_blank())
pl5 <- ggplot(data = train, aes(x = X1stFlrSF, y = SalePrice, color = SaleCondition)) + geom_point() +
xlab("1st Floor Square Feet")
pl6 <- ggplot(data = train, aes(x = X2ndFlrSF, y = SalePrice, color = SaleCondition)) + geom_point() +
xlab("2nd Floor Square Feet")
ggarrange(pl4, pl5, pl6, nrow = 3, heights = c(2, 1.5, 1.5), common.legend = T)
Most sales were of Normal or Partial condition. The highest sale prices came from Normal-condition houses, with one Abnormal-condition outlier at nearly the same value. First- and second-floor square footage behave quite differently: the second-floor chart shows many homes with zero square feet because they are one-story homes, yet even one-story homes are contenders for some of the highest sale prices. The second-floor chart also appears more compressed and more linear, because the first-floor chart contains one observation above 5,000 square feet; despite its large footprint, that home's sale price is not among the highest. Once again, this variation is likely driven by other variables.
Sale Price based on Basement SF by Basement Quality
pl7 <- train %>% filter(!is.na(TotalBsmtSF)) %>% filter(!is.na(BsmtQual)) %>%
ggplot(aes(x = TotalBsmtSF, y = SalePrice, color = BsmtQual)) + geom_point() + xlab("Total Basement Square Feet")
pl7
A majority of basements are of typical quality, with many others rated good or excellent. Homes with excellent basement quality show potential for much higher sale values than the other categories. Good and typical quality appear roughly on par in price potential, although good-quality basements correspond to higher home prices overall than typical ones.
Sale Price Statistics by Different Groupings
Here we also did exploration on factored and non-continuous variables.
require(crayon)
table1 <- train %>% group_by(TotRmsAbvGrd) %>%
summarise(Mean.Price = round(mean(SalePrice)))
table1$Mean.Price <- "$" %+% as.character(table1$Mean.Price)
tb1 <- ggtexttable(table1)
table2 <- train %>% group_by(OverallCond) %>% summarise(Mean.Price = round(mean(SalePrice)))
table2$Mean.Price <- "$" %+% as.character(table2$Mean.Price)
tb2 <- ggtexttable(table2)
table3 <- train %>% group_by(LandContour) %>% summarise(Mean.Price = round(mean(SalePrice)))
table3$Mean.Price <- "$" %+% as.character(table3$Mean.Price)
tb3 <- ggtexttable(table3)
ggarrange(tb1, tb2, NULL, tb3, nrow = 2, ncol = 2)
Average sale price rises as the total number of rooms above ground increases. Overall condition, by contrast, does not appear to affect average sale price much: homes with the lowest score of one have a higher average sale price than homes with several higher scores, and the averages do not increase monotonically with the condition rating. Land contour also seems to have little effect; the differences in mean price are small, spanning a range of roughly ten thousand dollars.
Using Backward Selection to Identify Insignificant Variables
require(caTools)
model <- lm(SalePrice ~ ., data = train)
## The function below iterates through the models while removing the insignificant
## variables based on p-values. At first we did this by hand.... but this is much faster.
## We saved the backward selection as an object so we can select the names of the variables to remove.
varsToRemove <- ols_step_backward_p(model)
varsToRemove
##
##
## Elimination Summary
## -------------------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## -------------------------------------------------------------------------------------
## 1 GarageQual 0.6126 0.5424 -65.4674 31916.1865 40759.5553
## 2 BldgType 0.6122 0.5436 -66.4137 31909.4341 40705.9025
## 3 GarageCars 0.6122 0.544 -68.3910 31907.4611 40688.1425
## 4 Foundation 0.6117 0.545 -68.8928 31901.2331 40642.9910
## 5 YrSold 0.6116 0.5454 -70.8412 31899.2940 40625.8628
## 6 HeatingQC 0.6111 0.5463 -71.2517 31893.1713 40582.7221
## 7 BedroomAbvGr 0.6111 0.5467 -73.1989 31891.2337 40565.7213
## 8 MSSubClass 0.611 0.5471 -75.1185 31889.3286 40549.2432
## 9 Street 0.611 0.5474 -77.0392 31887.4221 40532.7675
## 10 EnclosedPorch 0.611 0.5478 -78.8987 31885.5878 40517.4233
## 11 KitchenQual 0.6105 0.5484 -79.4355 31881.3131 40490.3800
## 12 BsmtFinType1 0.6094 0.5492 -78.4360 31874.8426 40455.6987
## 13 Fireplaces 0.6094 0.5495 -80.2991 31873.0035 40440.4503
## 14 X3SsnPorch 0.6093 0.5498 -82.1290 31871.2034 40425.8214
## 15 LotConfig 0.6085 0.5504 -81.7344 31866.0136 40398.3155
## 16 LandSlope 0.6082 0.5509 -82.9910 31862.8849 40376.4835
## 17 GarageFinish 0.6078 0.5513 -83.9428 31860.1123 40360.1774
## 18 GarageType 0.6068 0.552 -82.8396 31853.7394 40328.2525
## 19 OpenPorchSF 0.6067 0.5523 -84.5320 31852.0984 40316.3106
## 20 RoofStyle 0.6052 0.5525 -82.3815 31846.9328 40303.3268
## 21 YearRemodAdd 0.6051 0.5528 -84.0274 31845.3444 40292.2867
## 22 CentralAir 0.6049 0.553 -85.5426 31843.9079 40283.5847
## 23 Heating 0.6045 0.5533 -86.2788 31841.3753 40271.4334
## 24 GarageYrBlt 0.6044 0.5535 -87.9322 31839.7774 40260.3169
## 25 GarageCond 0.5996 0.5522 -92.3903 33346.1095 40181.7801
## 26 Exterior2nd 0.5969 0.5532 -86.4783 33331.2107 40135.9130
## 27 Electrical 0.5964 0.5537 -86.9785 33326.9292 40112.5933
## 28 MSZoning 0.5955 0.5541 -86.1750 33322.1357 40094.9854
## 29 HalfBath 0.5954 0.5544 -87.9457 33320.3976 40082.7824
## 30 BsmtFullBath 0.5953 0.5546 -89.6193 33318.7703 40072.2049
## 31 BsmtHalfBath 0.5952 0.5549 -91.4199 33316.9981 40059.5403
## 32 WoodDeckSF 0.5951 0.5551 -92.9395 33315.5465 40051.5405
## 33 SaleCondition 0.5935 0.5551 -90.1750 33310.9733 40050.6725
## 34 GarageArea 0.5932 0.5551 -91.2362 33310.0401 40050.2681
## -------------------------------------------------------------------------------------
Removing Insignificant Variables
The variables removed below are those identified by the backward selection function in the code chunk above.
## Pull the names of the variables flagged for removal.
removed <- varsToRemove$removed
## Drop the variables with high (insignificant) p-values identified by the backward elimination above.
cleanTrain2 <- select(train, -all_of(removed))
cleanTest2 <- select(test, -all_of(removed))
Now, the new model without the insignificant variables is created.
model2 <- lm(SalePrice ~ ., data = cleanTrain2)
## The backwards selection cannot eliminate every variable that is insignificant
## due to computation constraints. We saved the model summary as an object to then filter
## the rownames of the variables by p-values less than 0.05
results <- summary(model2)
## We now look at the new model results and choose the variables with low p-values.
pvals <- data.frame(results$coefficients)
pvals <- filter(pvals, pvals$Pr...t.. < 0.05)
print(rownames(pvals))
## [1] "LotArea" "LandContourHLS" "NeighborhoodNoRidge"
## [4] "Condition1Norm" "Condition1RRAe" "Condition2PosA"
## [7] "HouseStyle2.5Fin" "HouseStyle2.5Unf" "HouseStyle2Story"
## [10] "OverallCond" "YearBuilt" "RoofMatlWdShngl"
## [13] "ExterQualGd" "ExterQualTA" "BsmtQualGd"
## [16] "BsmtQualTA" "BsmtFinType2BLQ" "BsmtFinType2LwQ"
## [19] "X1stFlrSF" "X2ndFlrSF" "KitchenAbvGr"
## [22] "TotRmsAbvGrd" "PavedDriveY" "ScreenPorch"
## [25] "PoolArea" "MiscVal" "SaleTypeConLI"
Final Model / Variable Selection
The final training and test set will be created by selecting only the important variables, based on the preceding steps.
cleanTrain3 <- select(cleanTrain2, c(LotArea, LandContour, Neighborhood, Condition1,
Condition2, HouseStyle, OverallCond, YearBuilt,
RoofMatl, ExterQual, BsmtQual, BsmtFinType2,
X1stFlrSF, X2ndFlrSF, KitchenAbvGr, TotRmsAbvGrd,
PavedDrive, ScreenPorch, PoolArea, MiscVal,
SaleType, SalePrice))
cleanTest3 <- select(cleanTest2, c(LotArea, LandContour, Neighborhood, Condition1,
Condition2, HouseStyle, OverallCond, YearBuilt,
RoofMatl, ExterQual, BsmtQual, BsmtFinType2,
X1stFlrSF, X2ndFlrSF, KitchenAbvGr, TotRmsAbvGrd,
PavedDrive, ScreenPorch, PoolArea, MiscVal,
SaleType, SalePrice))
model3 <- lm(SalePrice ~ ., data = cleanTrain3)
results2 <- summary(model3)
results2
##
## Call:
## lm(formula = SalePrice ~ ., data = cleanTrain3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -129475 -20692 -572 19877 290128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.696e+05 2.055e+05 -1.799 0.072301 .
## LotArea 1.499e+00 3.351e-01 4.473 8.37e-06 ***
## LandContourHLS -2.072e+04 8.754e+03 -2.367 0.018100 *
## LandContourLow 1.038e+04 1.131e+04 0.918 0.359001
## LandContourLvl 6.562e+03 6.206e+03 1.057 0.290577
## NeighborhoodBlueste 1.422e+04 2.012e+04 0.706 0.480044
## NeighborhoodBrDale -3.934e+03 1.700e+04 -0.231 0.816992
## NeighborhoodBrkSide -2.248e+03 1.480e+04 -0.152 0.879308
## NeighborhoodClearCr -1.325e+02 1.659e+04 -0.008 0.993625
## NeighborhoodCollgCr 9.820e+03 1.255e+04 0.782 0.434189
## NeighborhoodCrawfor 1.601e+04 1.455e+04 1.100 0.271362
## NeighborhoodEdwards -4.469e+03 1.363e+04 -0.328 0.743001
## NeighborhoodGilbert 9.967e+03 1.328e+04 0.750 0.453242
## NeighborhoodIDOTRR -1.822e+03 1.540e+04 -0.118 0.905867
## NeighborhoodMeadowV 1.104e+03 1.571e+04 0.070 0.943990
## NeighborhoodMitchel 3.099e+03 1.372e+04 0.226 0.821315
## NeighborhoodNAmes 3.020e+03 1.331e+04 0.227 0.820586
## NeighborhoodNoRidge 5.035e+04 1.453e+04 3.464 0.000549 ***
## NeighborhoodNPkVill 9.943e+03 1.625e+04 0.612 0.540612
## NeighborhoodNridgHt 1.797e+04 1.313e+04 1.369 0.171130
## NeighborhoodNWAmes -3.885e+03 1.387e+04 -0.280 0.779425
## NeighborhoodOldTown -2.690e+03 1.446e+04 -0.186 0.852419
## NeighborhoodSawyer 2.068e+03 1.381e+04 0.150 0.881006
## NeighborhoodSawyerW 5.603e+03 1.339e+04 0.418 0.675673
## NeighborhoodSomerst 8.097e+03 1.279e+04 0.633 0.526929
## NeighborhoodStoneBr 9.435e+03 1.488e+04 0.634 0.526017
## NeighborhoodSWISU -1.958e+03 1.683e+04 -0.116 0.907389
## NeighborhoodTimber 7.821e+03 1.501e+04 0.521 0.602315
## NeighborhoodVeenker 1.533e+04 1.719e+04 0.892 0.372560
## Condition1Feedr 6.197e+03 8.387e+03 0.739 0.460160
## Condition1Norm 1.532e+04 6.954e+03 2.204 0.027732 *
## Condition1PosA 1.544e+04 1.762e+04 0.876 0.381028
## Condition1PosN 1.288e+04 1.278e+04 1.008 0.313603
## Condition1RRAe 2.516e+04 1.328e+04 1.895 0.058295 .
## Condition1RRAn 4.295e+03 1.199e+04 0.358 0.720345
## Condition1RRNe 2.325e+04 2.570e+04 0.904 0.365969
## Condition1RRNn 1.728e+04 2.560e+04 0.675 0.499841
## Condition2Feedr -2.217e+04 3.412e+04 -0.650 0.516017
## Condition2Norm -1.165e+04 2.257e+04 -0.516 0.605813
## Condition2PosA -1.246e+05 4.058e+04 -3.071 0.002177 **
## Condition2PosN 5.303e+03 3.858e+04 0.137 0.890692
## HouseStyle1.5Unf -6.834e+03 1.306e+04 -0.523 0.600907
## HouseStyle1Story 8.009e+03 5.966e+03 1.342 0.179713
## HouseStyle2.5Fin -6.531e+04 2.229e+04 -2.930 0.003446 **
## HouseStyle2.5Unf -3.247e+04 1.413e+04 -2.297 0.021778 *
## HouseStyle2Story -1.241e+04 5.267e+03 -2.355 0.018644 *
## HouseStyleSFoyer 1.428e+04 8.686e+03 1.644 0.100395
## HouseStyleSLvl -2.890e+02 7.540e+03 -0.038 0.969433
## OverallCond 5.454e+03 1.167e+03 4.672 3.29e-06 ***
## YearBuilt 2.515e+02 1.013e+02 2.482 0.013187 *
## RoofMatlTar&Grv -1.445e+04 1.310e+04 -1.104 0.269965
## RoofMatlWdShake -7.370e+03 1.932e+04 -0.381 0.702912
## RoofMatlWdShngl 9.759e+04 2.153e+04 4.533 6.34e-06 ***
## ExterQualFa 2.397e+03 1.689e+04 0.142 0.887185
## ExterQualGd -3.078e+04 7.227e+03 -4.259 2.20e-05 ***
## ExterQualTA -2.917e+04 8.098e+03 -3.602 0.000327 ***
## BsmtQualFa -1.627e+04 9.674e+03 -1.681 0.092908 .
## BsmtQualGd -2.271e+04 5.411e+03 -4.197 2.88e-05 ***
## BsmtQualTA -2.279e+04 6.714e+03 -3.394 0.000709 ***
## BsmtFinType2BLQ -2.543e+04 1.050e+04 -2.422 0.015551 *
## BsmtFinType2GLQ -2.113e+04 1.341e+04 -1.575 0.115461
## BsmtFinType2LwQ -2.521e+04 1.036e+04 -2.434 0.015076 *
## BsmtFinType2Rec -2.033e+04 9.681e+03 -2.100 0.035925 *
## BsmtFinType2Unf -1.822e+04 7.927e+03 -2.298 0.021723 *
## X1stFlrSF 4.813e+01 5.763e+00 8.352 < 2e-16 ***
## X2ndFlrSF 5.751e+01 7.394e+00 7.778 1.48e-14 ***
## KitchenAbvGr -2.048e+04 6.949e+03 -2.948 0.003256 **
## TotRmsAbvGrd 3.276e+03 1.384e+03 2.366 0.018121 *
## PavedDriveP -1.151e+04 8.896e+03 -1.294 0.195857
## PavedDriveY -1.074e+04 5.711e+03 -1.881 0.060175 .
## ScreenPorch 4.904e+01 1.966e+01 2.495 0.012710 *
## PoolArea 1.505e+02 3.559e+01 4.229 2.51e-05 ***
## MiscVal -6.826e+00 1.744e+00 -3.913 9.56e-05 ***
## SaleTypeCon 3.957e+04 2.500e+04 1.583 0.113753
## SaleTypeConLD -2.354e+03 1.419e+04 -0.166 0.868289
## SaleTypeConLI 9.899e+04 3.015e+04 3.284 0.001052 **
## SaleTypeConLw 2.541e+04 2.210e+04 1.150 0.250493
## SaleTypeCWD 9.511e+03 1.706e+04 0.558 0.577220
## SaleTypeNew 2.583e+03 8.235e+03 0.314 0.753848
## SaleTypeOth 2.234e+03 2.171e+04 0.103 0.918036
## SaleTypeWD -3.569e+03 6.921e+03 -0.516 0.606161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40750 on 1316 degrees of freedom
## (51 observations deleted due to missingness)
## Multiple R-squared: 0.5682, Adjusted R-squared: 0.542
## F-statistic: 21.65 on 80 and 1316 DF, p-value: < 2.2e-16
Note:
All of the remaining insignificant terms above are individual dummy-variable levels. They cannot be removed on their own without dropping the entire factor, and forcing them out of the model would inherently cause it to produce NAs.
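As an illustration of the alternative (a hedged sketch, not part of the original workflow), the insignificant levels of a factor such as SaleType could be collapsed into a single "Other" level with fct_other() from the forcats package and the model refit, rather than dropping the whole factor:
## Sketch only: collapse the SaleType levels whose dummies were insignificant into "Other".
## The kept levels are taken from the coefficient table above (COD is the reference level).
require(forcats)
collapsedTrain <- cleanTrain3
collapsedTrain$SaleType <- fct_other(collapsedTrain$SaleType, keep = c("COD", "ConLI"))
## model3b <- lm(SalePrice ~ ., data = collapsedTrain)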
Building the final model / Testing Error (before cross-validation)
The SalePrice variable has a skewed distribution. To combat this, a log-transformed value of SalePrice will be used in the model. This is illustrated below using quantile-quantile plots and histograms, and further illustrated by the skewness of each distribution.
The skew of the SalePrice distribution is 2.5754382, and the skew of the log-transformed SalePrice distribution is -0.1327215. The first distribution is strongly positively skewed, whereas the transformed values are only slightly negatively skewed.
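The skewness values quoted above can be computed as shown below; this is a minimal sketch assuming the skewness() function from the e1071 package (the package actually used is not shown in this section).
## Skewness of SalePrice before and after the log transform (e1071 assumed).
require(e1071)
skewness(housing_market$SalePrice, na.rm = TRUE)
skewness(log(housing_market$SalePrice), na.rm = TRUE)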
require(ggplot2)
require(ggpubr)
q1 <- qplot(sample = SalePrice, data = housing_market) + theme_classic() + xlab("Theoretical Quantiles")
h1 <- ggplot(data = housing_market, aes(SalePrice)) + geom_histogram(fill = "Blue", color = "black", bins = 25)
q2 <- qplot(sample = log(SalePrice), data = housing_market) + theme_classic() + xlab("Theoretical Quantiles")
h2 <- ggplot(data = housing_market, aes(log(SalePrice))) + geom_histogram(fill = "Purple", color = "black", bins = 25)
ggarrange(q1, h1, q2, h2, nrow = 2, ncol = 2)
Predictions and Test Errors Using MLR
predictions <- predict(lm(log(SalePrice) ~ ., data = cleanTrain3), cleanTest3[-22])
MSE
mean((log(cleanTest3$SalePrice) - predictions)^2, na.rm = T)
## [1] 0.05118671
The MSE is slightly above 0.05, which is close enough to our target of 0.05 for our needs.
MAE
abs(mean(log(cleanTest3$SalePrice) - predictions, na.rm = T))
## [1] 0.01001231
The MAE for non log-transformed values is still a good representation of the average difference in predicted and actual values.
abs(mean(cleanTest3$SalePrice - predict(lm(SalePrice ~ ., data = cleanTrain3), cleanTest3[-22]), na.rm = T))
## [1] 1797.03
This means that, on average, the predicted values were off by about 1,800 dollars, which is roughly 1% of the price of a $180,000 home.
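One caveat: abs(mean(...)) above is the magnitude of the average signed error (the bias), whereas the mean absolute error averages the absolute residuals. The sketch below computes both on the log scale from the same predictions object for comparison; the figures reported in this section correspond to the former.
## Bias magnitude vs. mean absolute error on the log scale.
errors <- log(cleanTest3$SalePrice) - predictions
abs(mean(errors, na.rm = TRUE)) ## absolute value of the mean error (as reported above)
mean(abs(errors), na.rm = TRUE) ## mean absolute error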
Validation Set Approach Error Summary
require(kableExtra)
kbl(data.frame("MSE" = mean((log(cleanTest3$SalePrice) - predictions)^2, na.rm = T),
"MAE" = abs(mean(log(cleanTest3$SalePrice) - predictions, na.rm = T)
)))
| MSE | MAE |
|---|---|
| 0.0511867 | 0.0100123 |
Prediction and Test Error Using 5-Fold CV and LOOCV
5-Fold CV corresponds to a split ratio of 0.8: in each fold, 80% of the total observations make up the training set.
require(caret)
cvDF <- rbind(cleanTrain3, cleanTest3)
cv5 <- trainControl(method = "cv", number = 5)
loocv <- trainControl(method = "LOOCV")
modelcv5 <- train(log(SalePrice) ~ LotArea + LandContour + Neighborhood + Condition1 +
Condition2 + HouseStyle + OverallCond + YearBuilt + RoofMatl + ExterQual + BsmtQual +
BsmtFinType2 + X1stFlrSF + X2ndFlrSF + KitchenAbvGr +
TotRmsAbvGrd + PavedDrive + ScreenPorch + PoolArea + MiscVal +
SaleType, data = cvDF, method = "lm", trControl = cv5, na.action = na.pass)
modelloocv <- train(log(SalePrice) ~ LotArea + LandContour + Neighborhood + Condition1 +
Condition2 + HouseStyle + OverallCond + YearBuilt + RoofMatl + ExterQual + BsmtQual +
BsmtFinType2 + X1stFlrSF + X2ndFlrSF + KitchenAbvGr +
TotRmsAbvGrd + PavedDrive + ScreenPorch + PoolArea + MiscVal +
SaleType, data = cvDF, method = "lm", trControl = loocv, na.action = na.pass)cvresults <- modelcv5
loocvresults <- modelloocv
cvMSE <- (cvresults$results[, 2])^2
loocvMSE <- (loocvresults$results[, 2])^2
The 5-fold cross-validation, using a split ratio of 0.8, resulted in a slightly improved model: the MSE for the 5-fold CV MLR is 0.0439646. Using LOOCV, the MSE is 0.0254423, which is somewhat unexpected, as LOOCV estimates typically have greater variance because each held-out set consists of a single observation.
LOOCV and K-Fold CV Error Summary
kbl(data.frame(" " = "MSE", "5-Fold CV" = cvMSE,
"LOOCV" = loocvMSE),
col.names = c(" ", "5-Fold CV", "LOOCV"),
booktabs = T)
| | 5-Fold CV | LOOCV |
|---|---|---|
| MSE | 0.0439646 | 0.0254423 |
Using Random Forest
The random forest model will be run using 200 trees and all the variables.
require(randomForest)
rfpredictions <- predict(randomForest(log(SalePrice) ~ ., data = train, na.action = na.omit, ntree = 200), test[-73])
Random Forest MSE and MAE for Log-transformed Sale Price values
kbl(data.frame("MSE" = mean((log(test$SalePrice) - rfpredictions)^2, na.rm = T),
"MAE" = abs(mean(log(test$SalePrice) - rfpredictions, na.rm = T)
)))
| MSE | MAE |
|---|---|
| 0.0396786 | 0.0050124 |
Using a random forest with 200 trees and all variables also slightly improved the model: the MSE was 0.0396786, and the MAE dropped to 0.0050124.
Q2
require(dplyr)
q2train <- read.table("train_GE_LW-1.txt", header = T, stringsAsFactors = T) %>% filter(COS.Intensity != "NA")
q2test <- read.table("test_GE_LW-1.txt", header = T, stringsAsFactors = T) %>% filter(COS.Intensity != "NA")
EDA
Correlation Heatmap Between All Numeric Variables
require(plotly)
require(heatmaply)
heat <- heatmaply(cor(q2train[-c(25:29)]),
heatmap_layers = theme(axis.line = element_blank()))
heat
require(ggplot2)
require(ggpubr)
meanxy_COS <- ggplot(data = q2train, aes(x = mean.GE_LW_X, y = mean.GE_LW_Y, color = COS.Intensity)) + geom_point() + theme_classic()
meanxy_Sex <- ggplot(data = q2train, aes(x = mean.GE_LW_X, y = mean.GE_LW_Y, color = Sex)) + geom_point() + theme_classic()
ggarrange(meanxy_COS, meanxy_Sex, nrow = 2, ncol = 1)
It already looks like there is a clear separation between clusters.
meanxy_COSY90 <- ggplot(data = q2train, aes(x = mean.GE_LW_X, y = quarlite.GE_LW_Y90, color = COS.Intensity)) + geom_point() + theme_classic()
meanxy_SexY90 <- ggplot(data = q2train, aes(x = mean.GE_LW_X, y = quarlite.GE_LW_Y90, color = Sex)) + geom_point() + theme_classic()
ggarrange(meanxy_COSY90, meanxy_SexY90, nrow = 2, ncol = 1)
It is clear there is separation at 0.
3D Plot
The plots below account for height and weight, respectively.
require(plotly)
par(mfrow = c(1,1))
plot_ly(x = q2train$mean.GE_LW_X, y = q2train$mean.GE_LW_Y, z = q2train$Ht, type = "scatter3d", color = q2train$Sex)
plot_ly(x = q2train$mean.GE_LW_X, y = q2train$mean.GE_LW_Y, z = q2train$Wt, type = "scatter3d", color = q2train$Sex)
The boundary is still clearly visible, and the first plot reveals that observations around average height fall in the positive XY quadrant. In the second plot there are two groups where this occurs; similarly, observations with average weight tend to fall in the positive XY quadrant. There is also very little overlap between males and females in the positive XY quadrant in either plot.
Check the counts for each response level
require(kableExtra)
train_response_counts <- q2train %>% count(COS.Intensity)
test_response_counts <- q2test %>% count(COS.Intensity)
kable(list(train_response_counts, test_response_counts))
Baseline Multinomial Logistic Regression
The baseline multinomial logistic regression has an error rate of 0.3987943.
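The chunk that produced this baseline is not shown here; the sketch below illustrates one way it could have been fit, assuming multinom() from the nnet package, and is where an object like the error_rate value referenced later would come from.
## Hedged sketch of the baseline multinomial logistic regression (nnet::multinom assumed).
require(nnet)
q2mlr <- multinom(COS.Intensity ~ ., data = q2train, trace = FALSE)
mlr_pred <- predict(q2mlr, newdata = q2test[-29])
error_rate <- mean(mlr_pred != q2test$COS.Intensity)
error_rate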
LDA
require(MASS)
q2model1 <- lda(COS.Intensity ~ ., data = q2train)
q2predictionsLDA <- predict(q2model1, q2test[-29])
Interpreting the Results
require(caret)
require(yardstick)
results_tab1 <- table(q2predictionsLDA$class, q2test$COS.Intensity)
autoplot(conf_mat(results_tab1), type = "heatmap") + scale_fill_continuous() +
xlab("Predicted Values") +
ylab("Actual Values")Recall the count tables above for the response variable. This means LDA does not very accurately predict results. It was only successful in predicting values of SED. These values would likely change depending on adjusting the split ratio, but it is clear that another algorithm should be used.
Statistics Based on the confusion matrix
The statistics below summarize the confusion matrix shown above.
print(confusionMatrix(results_tab1))
## Confusion Matrix and Statistics
##
##
## LPA MPA SED VPA
## LPA 461 385 253 30
## MPA 174 660 44 136
## SED 472 203 1389 11
## VPA 1 8 1 85
##
## Overall Statistics
##
## Accuracy : 0.6017
## 95% CI : (0.5869, 0.6163)
## No Information Rate : 0.3911
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4097
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: LPA Class: MPA Class: SED Class: VPA
## Sensitivity 0.4161 0.5255 0.8234 0.32443
## Specificity 0.7916 0.8842 0.7388 0.99753
## Pos Pred Value 0.4083 0.6509 0.6694 0.89474
## Neg Pred Value 0.7968 0.8193 0.8668 0.95804
## Prevalence 0.2569 0.2912 0.3911 0.06075
## Detection Rate 0.1069 0.1530 0.3220 0.01971
## Detection Prevalence 0.2618 0.2351 0.4811 0.02203
## Balanced Accuracy 0.6038 0.7048 0.7811 0.66098
Consistent with what was said above, this model scored a 60.2% accuracy rating.
QDA
q2model2 <- qda(COS.Intensity ~ ., data = q2train)
q2predictionsQDA <- predict(q2model2, q2test[-29])
Confusion Matrix
results_tab2 <- table(q2predictionsQDA$class, q2test$COS.Intensity)
autoplot(conf_mat(results_tab2), type = "heatmap") + scale_fill_continuous() +
xlab("Predicted Values") +
ylab("Actual Values")Statistics Based on the confusion matrix
print(confusionMatrix(results_tab2))
## Confusion Matrix and Statistics
##
##
## LPA MPA SED VPA
## LPA 404 404 230 31
## MPA 110 482 28 132
## SED 594 357 1428 13
## VPA 0 13 1 86
##
## Overall Statistics
##
## Accuracy : 0.5565
## 95% CI : (0.5415, 0.5714)
## No Information Rate : 0.3911
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3352
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: LPA Class: MPA Class: SED Class: VPA
## Sensitivity 0.36462 0.3838 0.8465 0.32824
## Specificity 0.79251 0.9117 0.6329 0.99654
## Pos Pred Value 0.37792 0.6410 0.5970 0.86000
## Neg Pred Value 0.78298 0.7826 0.8652 0.95822
## Prevalence 0.25690 0.2912 0.3911 0.06075
## Detection Rate 0.09367 0.1118 0.3311 0.01994
## Detection Prevalence 0.24786 0.1744 0.5546 0.02319
## Balanced Accuracy 0.57857 0.6477 0.7397 0.66239
KNN
Since KNN computes distances between observations, all of the factor variables need to be converted to numeric codes.
require(class)
require(dplyr)
## Recode the factor levels as integer codes so every column is numeric.
temp_train <- q2train
levels(temp_train$COS.Intensity) <- as.integer(c(1, 2, 3, 4))
levels(temp_train$Sex) <- as.integer(c(1, 2))
temp_test <- q2test
levels(temp_test$COS.Intensity) <- as.integer(c(1, 2, 3, 4))
levels(temp_test$Sex) <- as.integer(c(1, 2))
## Note: the full data frames (including the COS.Intensity column) are passed to knn() as the feature matrices.
model_knn <- knn(temp_train, temp_test, cl = temp_train$COS.Intensity, k = 50)
results_tab3 <- table(model_knn, temp_test$COS.Intensity)
autoplot(conf_mat(results_tab3), type = "heatmap") + scale_fill_continuous() +
xlab("Predicted Values") +
ylab("Actual Values")print(confusionMatrix(results_tab3))## Confusion Matrix and Statistics
##
##
## model_knn 1 2 3 4
## 1 691 410 86 50
## 2 213 629 134 160
## 3 201 213 1465 28
## 4 3 4 2 24
##
## Overall Statistics
##
## Accuracy : 0.6513
## 95% CI : (0.6368, 0.6655)
## No Information Rate : 0.3911
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4843
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.6236 0.5008 0.8684 0.091603
## Specificity 0.8296 0.8342 0.8317 0.997778
## Pos Pred Value 0.5586 0.5537 0.7682 0.727273
## Neg Pred Value 0.8644 0.8026 0.9077 0.944393
## Prevalence 0.2569 0.2912 0.3911 0.060747
## Detection Rate 0.1602 0.1458 0.3397 0.005565
## Detection Prevalence 0.2868 0.2634 0.4422 0.007651
## Balanced Accuracy 0.7266 0.6675 0.8500 0.544691
KNN provided the most accurate baseline method.
Let’s see whether the accuracy changes depending on the value of K.
k <- c(5, 10, 25, 50, 100, 150, 200, 400)
## Compute the test error rate for each value of k.
knn_errors <- sapply(k, function(kk) {
1 - mean(knn(temp_train, temp_test, cl = temp_train$COS.Intensity, k = kk) == temp_test$COS.Intensity)
})
knndf <- data.frame("Error" = knn_errors, "K" = k)
knn_table <- ggtexttable(knndf)
knn_curve <- ggplot(data = knndf, aes(x = K, y = Error)) + geom_line(color = "blue", size = 1.5) + geom_point(shape = "triangle", size = 3) + theme_classic()
ggarrange(knn_curve, knn_table, nrow = 1, ncol = 2, widths = c(3, 2))
Each Model again with a 0.7 split
dataset <- rbind(q2train, q2test)
dim(dataset)
## [1] 8622 29
Using K-Fold Cross-Validation for each model
Using \(k=5\) gives training folds of size \(\frac{k-1}{k}n\); in this scenario, each training set contains roughly \(\frac{5-1}{5} \times 8622 \approx 6898\) observations.
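The five_fold control object used by the chunks below is not defined in this section; a minimal sketch of how it is presumably created (5-fold cross-validation in caret) follows.
## 5-fold cross-validation control object assumed by the chunks below.
require(caret)
five_fold <- trainControl(method = "cv", number = 5)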
LDA with Cross-Validation
cvLDA <- train(COS.Intensity ~ ., data = dataset, method = "lda", trControl = five_fold, metric = "Accuracy")
system.time({train(COS.Intensity ~ ., data = dataset, method = "lda", trControl = five_fold, metric = "Accuracy")})## user system elapsed
## 0.7 0.0 0.7
QDA with Cross-Validation
cvQDA <- train(COS.Intensity ~ ., data = dataset, method = "qda", trControl = five_fold, metric = "Accuracy")
system.time({train(COS.Intensity ~ ., data = dataset, method = "qda", trControl = five_fold, metric = "Accuracy")})## user system elapsed
## 0.63 0.03 0.66
KNN with Cross-Validation
temp_dataset <- rbind(temp_train, temp_test)
cvKNN <- train(COS.Intensity ~ ., data = temp_dataset, method = "knn", tuneGrid = expand.grid(k = k), trControl = five_fold,
metric = "Accuracy")
system.time({train(COS.Intensity ~ ., data = dataset, method = "knn", trControl = five_fold, metric = "Accuracy")})## user system elapsed
## 4.98 0.00 4.99
Overall Error Rates
finalResults <- data.frame("Method" = c("Logistic Regression", "LDA", "QDA", "KNN"),
"Error" = c(error_rate, 1- 0.6017, 1 - 0.5565, min(knndf$Error)),
"5-Fold CV Error" = c(1-max(kfcvMLR$results[, 2]),1- max(cvLDA$results[, 2]),
1-max(cvQDA$results[, 2]), 1-max(cvKNN$results[, 2])))
kbl(finalResults, col.names = c("Method", "Test Error", "K-Fold CV Test Error"))
| Method | Test Error | K-Fold CV Test Error |
|---|---|---|
| Logistic Regression | 0.3987943 | 0.3741543 |
| LDA | 0.3983000 | 0.3801926 |
| QDA | 0.4435000 | 0.4054735 |
| KNN | 0.2304660 | 0.2646639 |
Computational Expenses for Each 5-Fold Cross-Validation Method
Multinomial Logistic Regression
time1
## user system elapsed
## 14.89 0.00 14.89
LDA
system.time({train(COS.Intensity ~ ., data = dataset, method = "lda", trControl = five_fold, metric = "Accuracy")})## user system elapsed
## 0.72 0.00 0.72
QDA
system.time({train(COS.Intensity ~ ., data = dataset, method = "qda", trControl = five_fold, metric = "Accuracy")})## user system elapsed
## 0.65 0.00 0.65
KNN
system.time({train(COS.Intensity ~ ., data = temp_dataset, method = "knn", tuneGrid = expand.grid(k = k), trControl = five_fold, metric = "Accuracy")})
## user system elapsed
## 21.66 0.01 21.67
Group Work: Each member of the group contributed equally to the construction of this project. We initially began by doing research and discussing the project questions. We then created an RStudio Cloud session and were able to collaborate on programming simultaneously. Each member was able to utilize their skills in various areas such as data cleaning, charting, coding optimization, and rendering. This project helped our group build better synergy which will prepare us for future projects.