Q1
Reading the Datasets / Finding Attributes that would cause issues later on
require(dplyr)
prices <- read.csv("response_id-1.csv")
training <- read.csv("train_reg_features-1.csv", stringsAsFactors = T)
validation <- read.csv("test_reg_features-1.csv", stringsAsFactors = T)
comb_df <- inner_join(prices, validation)
housing_market <- rbind(training, comb_df) %>% select(-c(Id, Alley, FireplaceQu, PoolQC, Fence, MiscFeature, LotFrontage, Utilities))
all_levels <- housing_market %>% select_if(is.factor)
## The loop below finds, for each factor, the levels that occur two or fewer times in the data. These rare levels cause problems later during splitting (which is what happened with the original 50/50 split).
for (col in 1:ncol(all_levels)) {
print(colnames(all_levels[col]))
print(levels(all_levels[, col])[which(dplyr::count(all_levels, all_levels[, col])[, 2] <= 2)])
}
## [1] "MSZoning"
## character(0)
## [1] "Street"
## character(0)
## [1] "LotShape"
## character(0)
## [1] "LandContour"
## character(0)
## [1] "LotConfig"
## character(0)
## [1] "LandSlope"
## character(0)
## [1] "Neighborhood"
## character(0)
## [1] "Condition1"
## character(0)
## [1] "Condition2"
## [1] "RRAe" "RRAn" "RRNn"
## [1] "BldgType"
## character(0)
## [1] "HouseStyle"
## character(0)
## [1] "RoofStyle"
## character(0)
## [1] "RoofMatl"
## [1] "ClyTile" "Membran" "Metal" "Roll"
## [1] "Exterior1st"
## [1] "AsphShn" "CBlock" "ImStucc" "Stone" NA
## [1] "Exterior2nd"
## [1] "Other" NA
## [1] "MasVnrType"
## character(0)
## [1] "ExterQual"
## character(0)
## [1] "ExterCond"
## character(0)
## [1] "Foundation"
## character(0)
## [1] "BsmtQual"
## character(0)
## [1] "BsmtCond"
## character(0)
## [1] "BsmtExposure"
## character(0)
## [1] "BsmtFinType1"
## character(0)
## [1] "BsmtFinType2"
## character(0)
## [1] "Heating"
## [1] "Floor" "OthW"
## [1] "HeatingQC"
## character(0)
## [1] "CentralAir"
## character(0)
## [1] "Electrical"
## [1] "Mix" NA
## [1] "KitchenQual"
## [1] NA
## [1] "Functional"
## [1] "Sev" NA
## [1] "GarageType"
## character(0)
## [1] "GarageFinish"
## character(0)
## [1] "GarageQual"
## character(0)
## [1] "GarageCond"
## character(0)
## [1] "PavedDrive"
## character(0)
## [1] "SaleType"
## [1] NA
## [1] "SaleCondition"
## character(0)
All levels printed above that are not character(0) will be removed. During the original split of the data, some levels could not be divided between the training and test sets because they appear only once or twice in the entire dataset. The filtering below removes the observations carrying these levels.
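As a more general alternative to listing each rare level by hand (a sketch only, not part of the original workflow; drop_rare_levels and min_count are hypothetical names), rows belonging to any factor level that occurs at or below a chosen count could be dropped programmatically:
## Sketch: drop rows whose factor level occurs min_count times or fewer, then drop the empty levels.
drop_rare_levels <- function(df, min_count = 2) {
  for (col in names(df)[sapply(df, is.factor)]) {
    rare <- names(which(table(df[[col]]) <= min_count))
    if (length(rare) > 0) df <- df[!(df[[col]] %in% rare), ]
  }
  droplevels(df)
}
## Example usage (not run here): housing_market <- drop_rare_levels(housing_market)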
Data Cleaning / Preprocessing
## filter the levels that have only 1 or 2 counts
housing_market <- housing_market %>%
filter(RoofMatl != "ClyTile", RoofMatl != "Membran", RoofMatl != "Metal", RoofMatl != "Roll",
Functional != "Sev",
Condition2 != "RRAe", Condition2 != "RRNn", Condition2 != "RRAn",
Heating != "Floor", Heating != "OthW",
Exterior2nd != "Other",
Exterior1st != "AsphShn", Exterior1st != "CBlock", Exterior1st != "ImStucc", Exterior1st != "Stone",
Electrical != "Mix") %>%
droplevels() ## drop the now-empty factor levels in a single pass
Re-Splitting the Data
require(caret)
require(dplyr)
require(olsrr)
set.seed(100)
trainIndex <- createDataPartition(housing_market$SalePrice, p = 0.5, list = F, times = 1)
train <- housing_market[trainIndex,]
test <- housing_market[-trainIndex, ]
EDA After Cleaning the Data on the Training Set
The following variables were chosen for exploration because they are continuous; plotting them against SalePrice is more straightforward than it is for factors and other non-continuous variables.
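A quick way to list those candidates (a minimal sketch using the same dplyr helpers loaded above) is to pull the numeric columns of the training set:
## List the numeric variables in the training set that are candidates for these plots.
names(select_if(train, is.numeric))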
Sale Price and Lot Area
require(ggplot2)
require(ggpubr)
pl1 <- ggplot(data = train, aes(x = LotArea, y = SalePrice)) + geom_point() + theme_classic() + geom_smooth(method = 'lm', formula = y~x)
pl2 <- ggplot(data = train, aes(x = LotArea, y = SalePrice, color = HouseStyle)) + geom_point() + theme_classic()
## Investigating Sale Price by HouseStyle alone
pl3 <- ggplot(data = train, aes(x = HouseStyle, y = SalePrice, color = HouseStyle)) + geom_boxplot() +
theme(axis.title.x = element_blank(), axis.text.x = element_blank())
ggarrange(pl1, pl2, pl3, nrow = 3, heights = c(1.5, 1.5, 2), common.legend = T)
Most houses are either one- or two-story. Two-story houses show a higher potential sale price than one-story houses: they have the highest sale price overall, and several of their points sit above the one-story maximum. Lot area does appear to have some impact on sale price, but with considerable variability; some houses on lots of 20,000 square feet or less sold for more than houses on lots of 60,000 square feet or more. This is likely due to the influence of other variables.
Sale Price by Sale Condition
pl4 <- ggplot(data = train, aes(x = SaleCondition, y = SalePrice, color = SaleCondition)) + geom_boxplot() +
theme(axis.title.x = element_blank(), axis.text.x = element_blank())
pl5 <- ggplot(data = train, aes(x = X1stFlrSF, y = SalePrice, color = SaleCondition)) + geom_point() +
xlab("1st Floor Square Feet")
pl6 <- ggplot(data = train, aes(x = X2ndFlrSF, y = SalePrice, color = SaleCondition)) + geom_point() +
xlab("2nd Floor Square Feet")
ggarrange(pl4, pl5, pl6, nrow = 3, heights = c(2, 1.5, 1.5), common.legend = T)
Most sales were of Normal or Partial condition. The highest sale prices came from Normal-condition houses, with one Abnormal-condition outlier at nearly the same value. First- and second-floor square footage behave quite differently: the second-floor chart shows many homes with zero square feet because they are one-story homes, yet even one-story homes are contenders for some of the highest sale prices. The second-floor chart also appears more compressed and more linear, because the first-floor chart contains one observation above 5,000 square feet; despite its large footprint, that home's sale price is not among the highest. Once again, this variation is likely driven by other variables.
Sale Price based on Basement SF by Basement Quality
pl7 <- train %>% filter(!is.na(TotalBsmtSF)) %>% filter(!is.na(BsmtQual)) %>%
ggplot(aes(x = TotalBsmtSF, y = SalePrice, color = BsmtQual)) + geom_point() + xlab("Total Basement Square Feet")
pl7
A majority of basements are of typical quality, with many others rated good or excellent. Homes with excellent basement quality show potential for much higher sale values than the other categories. Good and typical quality appear roughly on par in price potential, although good-quality basements correspond to higher home prices overall than typical ones.
Sale Price Statistics by Different Groupings
Here we also did exploration on factored and non-continuous variables.
require(crayon)
table1 <- train %>% group_by(TotRmsAbvGrd) %>%
summarise(Mean.Price = round(mean(SalePrice)))
table1$Mean.Price <- "$" %+% as.character(table1$Mean.Price)
tb1 <- ggtexttable(table1)
table2 <- train %>% group_by(OverallCond) %>% summarise(Mean.Price = round(mean(SalePrice)))
table2$Mean.Price <- "$" %+% as.character(table2$Mean.Price)
tb2 <- ggtexttable(table2)
table3 <- train %>% group_by(LandContour) %>% summarise(Mean.Price = round(mean(SalePrice)))
table3$Mean.Price <- "$" %+% as.character(table3$Mean.Price)
tb3 <- ggtexttable(table3)
ggarrange(tb1, tb2, NULL, tb3, nrow = 2, ncol = 2)
Average sale price rises as the total number of rooms above ground increases. Overall condition, by contrast, does not appear to affect average sale price much: homes with the lowest score of one have a higher average sale price than homes with several higher scores, and the averages do not increase monotonically with the condition rating. Land contour also seems to have little effect; the differences in mean price are small, spanning a range of roughly ten thousand dollars.
Using Backward Selection to Identify Insignificant Variables
require(caTools)
model <- lm(SalePrice ~ ., data = train)
## The function below iterates through the models while removing the insignificant
## variables based on p-values. At first we did this by hand.... but this is much faster.
## We saved the backward selection as an object so we can select the names of the variables to remove.
varsToRemove <- ols_step_backward_p(model)
varsToRemove
##
##
## Elimination Summary
## -------------------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## -------------------------------------------------------------------------------------
## 1 GarageQual 0.6126 0.5424 -65.4674 31916.1865 40759.5553
## 2 BldgType 0.6122 0.5436 -66.4137 31909.4341 40705.9025
## 3 GarageCars 0.6122 0.544 -68.3910 31907.4611 40688.1425
## 4 Foundation 0.6117 0.545 -68.8928 31901.2331 40642.9910
## 5 YrSold 0.6116 0.5454 -70.8412 31899.2940 40625.8628
## 6 HeatingQC 0.6111 0.5463 -71.2517 31893.1713 40582.7221
## 7 BedroomAbvGr 0.6111 0.5467 -73.1989 31891.2337 40565.7213
## 8 MSSubClass 0.611 0.5471 -75.1185 31889.3286 40549.2432
## 9 Street 0.611 0.5474 -77.0392 31887.4221 40532.7675
## 10 EnclosedPorch 0.611 0.5478 -78.8987 31885.5878 40517.4233
## 11 KitchenQual 0.6105 0.5484 -79.4355 31881.3131 40490.3800
## 12 BsmtFinType1 0.6094 0.5492 -78.4360 31874.8426 40455.6987
## 13 Fireplaces 0.6094 0.5495 -80.2991 31873.0035 40440.4503
## 14 X3SsnPorch 0.6093 0.5498 -82.1290 31871.2034 40425.8214
## 15 LotConfig 0.6085 0.5504 -81.7344 31866.0136 40398.3155
## 16 LandSlope 0.6082 0.5509 -82.9910 31862.8849 40376.4835
## 17 GarageFinish 0.6078 0.5513 -83.9428 31860.1123 40360.1774
## 18 GarageType 0.6068 0.552 -82.8396 31853.7394 40328.2525
## 19 OpenPorchSF 0.6067 0.5523 -84.5320 31852.0984 40316.3106
## 20 RoofStyle 0.6052 0.5525 -82.3815 31846.9328 40303.3268
## 21 YearRemodAdd 0.6051 0.5528 -84.0274 31845.3444 40292.2867
## 22 CentralAir 0.6049 0.553 -85.5426 31843.9079 40283.5847
## 23 Heating 0.6045 0.5533 -86.2788 31841.3753 40271.4334
## 24 GarageYrBlt 0.6044 0.5535 -87.9322 31839.7774 40260.3169
## 25 GarageCond 0.5996 0.5522 -92.3903 33346.1095 40181.7801
## 26 Exterior2nd 0.5969 0.5532 -86.4783 33331.2107 40135.9130
## 27 Electrical 0.5964 0.5537 -86.9785 33326.9292 40112.5933
## 28 MSZoning 0.5955 0.5541 -86.1750 33322.1357 40094.9854
## 29 HalfBath 0.5954 0.5544 -87.9457 33320.3976 40082.7824
## 30 BsmtFullBath 0.5953 0.5546 -89.6193 33318.7703 40072.2049
## 31 BsmtHalfBath 0.5952 0.5549 -91.4199 33316.9981 40059.5403
## 32 WoodDeckSF 0.5951 0.5551 -92.9395 33315.5465 40051.5405
## 33 SaleCondition 0.5935 0.5551 -90.1750 33310.9733 40050.6725
## 34 GarageArea 0.5932 0.5551 -91.2362 33310.0401 40050.2681
## -------------------------------------------------------------------------------------
Removing Insignificant Variables
The variables removed below are those identified by the backward selection function in the code chunk above.
## Pull the names of the variables flagged for removal.
removed <- varsToRemove$removed
## Drop the variables with high (insignificant) p-values identified by the backward elimination above.
cleanTrain2 <- select(train, -all_of(removed))
cleanTest2 <- select(test, -all_of(removed))
Now, the new model without the insignificant variables is created.
model2 <- lm(SalePrice ~ ., data = cleanTrain2)
## The backwards selection cannot eliminate every variable that is insignificant
## due to computation constraints. We saved the model summary as an object to then filter
## the rownames of the variables by p-values less than 0.05
results <- summary(model2)
## We now look at the new model results and choose the variables with low p-values.
pvals <- data.frame(results$coefficients)
pvals <- filter(pvals, pvals$Pr...t.. < 0.05)
print(rownames(pvals))
## [1] "LotArea" "LandContourHLS" "NeighborhoodNoRidge"
## [4] "Condition1Norm" "Condition1RRAe" "Condition2PosA"
## [7] "HouseStyle2.5Fin" "HouseStyle2.5Unf" "HouseStyle2Story"
## [10] "OverallCond" "YearBuilt" "RoofMatlWdShngl"
## [13] "ExterQualGd" "ExterQualTA" "BsmtQualGd"
## [16] "BsmtQualTA" "BsmtFinType2BLQ" "BsmtFinType2LwQ"
## [19] "X1stFlrSF" "X2ndFlrSF" "KitchenAbvGr"
## [22] "TotRmsAbvGrd" "PavedDriveY" "ScreenPorch"
## [25] "PoolArea" "MiscVal" "SaleTypeConLI"
Final Model / Variable Selection
The final training and test set will be created by selecting only the important variables, based on the preceding steps.
cleanTrain3 <- select(cleanTrain2, c(LotArea, LandContour, Neighborhood, Condition1,
Condition2, HouseStyle, OverallCond, YearBuilt,
RoofMatl, ExterQual, BsmtQual, BsmtFinType2,
X1stFlrSF, X2ndFlrSF, KitchenAbvGr, TotRmsAbvGrd,
PavedDrive, ScreenPorch, PoolArea, MiscVal,
SaleType, SalePrice))
cleanTest3 <- select(cleanTest2, c(LotArea, LandContour, Neighborhood, Condition1,
Condition2, HouseStyle, OverallCond, YearBuilt,
RoofMatl, ExterQual, BsmtQual, BsmtFinType2,
X1stFlrSF, X2ndFlrSF, KitchenAbvGr, TotRmsAbvGrd,
PavedDrive, ScreenPorch, PoolArea, MiscVal,
SaleType, SalePrice))
model3 <- lm(SalePrice ~ ., data = cleanTrain3)
results2 <- summary(model3)
results2
##
## Call:
## lm(formula = SalePrice ~ ., data = cleanTrain3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -129475 -20692 -572 19877 290128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.696e+05 2.055e+05 -1.799 0.072301 .
## LotArea 1.499e+00 3.351e-01 4.473 8.37e-06 ***
## LandContourHLS -2.072e+04 8.754e+03 -2.367 0.018100 *
## LandContourLow 1.038e+04 1.131e+04 0.918 0.359001
## LandContourLvl 6.562e+03 6.206e+03 1.057 0.290577
## NeighborhoodBlueste 1.422e+04 2.012e+04 0.706 0.480044
## NeighborhoodBrDale -3.934e+03 1.700e+04 -0.231 0.816992
## NeighborhoodBrkSide -2.248e+03 1.480e+04 -0.152 0.879308
## NeighborhoodClearCr -1.325e+02 1.659e+04 -0.008 0.993625
## NeighborhoodCollgCr 9.820e+03 1.255e+04 0.782 0.434189
## NeighborhoodCrawfor 1.601e+04 1.455e+04 1.100 0.271362
## NeighborhoodEdwards -4.469e+03 1.363e+04 -0.328 0.743001
## NeighborhoodGilbert 9.967e+03 1.328e+04 0.750 0.453242
## NeighborhoodIDOTRR -1.822e+03 1.540e+04 -0.118 0.905867
## NeighborhoodMeadowV 1.104e+03 1.571e+04 0.070 0.943990
## NeighborhoodMitchel 3.099e+03 1.372e+04 0.226 0.821315
## NeighborhoodNAmes 3.020e+03 1.331e+04 0.227 0.820586
## NeighborhoodNoRidge 5.035e+04 1.453e+04 3.464 0.000549 ***
## NeighborhoodNPkVill 9.943e+03 1.625e+04 0.612 0.540612
## NeighborhoodNridgHt 1.797e+04 1.313e+04 1.369 0.171130
## NeighborhoodNWAmes -3.885e+03 1.387e+04 -0.280 0.779425
## NeighborhoodOldTown -2.690e+03 1.446e+04 -0.186 0.852419
## NeighborhoodSawyer 2.068e+03 1.381e+04 0.150 0.881006
## NeighborhoodSawyerW 5.603e+03 1.339e+04 0.418 0.675673
## NeighborhoodSomerst 8.097e+03 1.279e+04 0.633 0.526929
## NeighborhoodStoneBr 9.435e+03 1.488e+04 0.634 0.526017
## NeighborhoodSWISU -1.958e+03 1.683e+04 -0.116 0.907389
## NeighborhoodTimber 7.821e+03 1.501e+04 0.521 0.602315
## NeighborhoodVeenker 1.533e+04 1.719e+04 0.892 0.372560
## Condition1Feedr 6.197e+03 8.387e+03 0.739 0.460160
## Condition1Norm 1.532e+04 6.954e+03 2.204 0.027732 *
## Condition1PosA 1.544e+04 1.762e+04 0.876 0.381028
## Condition1PosN 1.288e+04 1.278e+04 1.008 0.313603
## Condition1RRAe 2.516e+04 1.328e+04 1.895 0.058295 .
## Condition1RRAn 4.295e+03 1.199e+04 0.358 0.720345
## Condition1RRNe 2.325e+04 2.570e+04 0.904 0.365969
## Condition1RRNn 1.728e+04 2.560e+04 0.675 0.499841
## Condition2Feedr -2.217e+04 3.412e+04 -0.650 0.516017
## Condition2Norm -1.165e+04 2.257e+04 -0.516 0.605813
## Condition2PosA -1.246e+05 4.058e+04 -3.071 0.002177 **
## Condition2PosN 5.303e+03 3.858e+04 0.137 0.890692
## HouseStyle1.5Unf -6.834e+03 1.306e+04 -0.523 0.600907
## HouseStyle1Story 8.009e+03 5.966e+03 1.342 0.179713
## HouseStyle2.5Fin -6.531e+04 2.229e+04 -2.930 0.003446 **
## HouseStyle2.5Unf -3.247e+04 1.413e+04 -2.297 0.021778 *
## HouseStyle2Story -1.241e+04 5.267e+03 -2.355 0.018644 *
## HouseStyleSFoyer 1.428e+04 8.686e+03 1.644 0.100395
## HouseStyleSLvl -2.890e+02 7.540e+03 -0.038 0.969433
## OverallCond 5.454e+03 1.167e+03 4.672 3.29e-06 ***
## YearBuilt 2.515e+02 1.013e+02 2.482 0.013187 *
## RoofMatlTar&Grv -1.445e+04 1.310e+04 -1.104 0.269965
## RoofMatlWdShake -7.370e+03 1.932e+04 -0.381 0.702912
## RoofMatlWdShngl 9.759e+04 2.153e+04 4.533 6.34e-06 ***
## ExterQualFa 2.397e+03 1.689e+04 0.142 0.887185
## ExterQualGd -3.078e+04 7.227e+03 -4.259 2.20e-05 ***
## ExterQualTA -2.917e+04 8.098e+03 -3.602 0.000327 ***
## BsmtQualFa -1.627e+04 9.674e+03 -1.681 0.092908 .
## BsmtQualGd -2.271e+04 5.411e+03 -4.197 2.88e-05 ***
## BsmtQualTA -2.279e+04 6.714e+03 -3.394 0.000709 ***
## BsmtFinType2BLQ -2.543e+04 1.050e+04 -2.422 0.015551 *
## BsmtFinType2GLQ -2.113e+04 1.341e+04 -1.575 0.115461
## BsmtFinType2LwQ -2.521e+04 1.036e+04 -2.434 0.015076 *
## BsmtFinType2Rec -2.033e+04 9.681e+03 -2.100 0.035925 *
## BsmtFinType2Unf -1.822e+04 7.927e+03 -2.298 0.021723 *
## X1stFlrSF 4.813e+01 5.763e+00 8.352 < 2e-16 ***
## X2ndFlrSF 5.751e+01 7.394e+00 7.778 1.48e-14 ***
## KitchenAbvGr -2.048e+04 6.949e+03 -2.948 0.003256 **
## TotRmsAbvGrd 3.276e+03 1.384e+03 2.366 0.018121 *
## PavedDriveP -1.151e+04 8.896e+03 -1.294 0.195857
## PavedDriveY -1.074e+04 5.711e+03 -1.881 0.060175 .
## ScreenPorch 4.904e+01 1.966e+01 2.495 0.012710 *
## PoolArea 1.505e+02 3.559e+01 4.229 2.51e-05 ***
## MiscVal -6.826e+00 1.744e+00 -3.913 9.56e-05 ***
## SaleTypeCon 3.957e+04 2.500e+04 1.583 0.113753
## SaleTypeConLD -2.354e+03 1.419e+04 -0.166 0.868289
## SaleTypeConLI 9.899e+04 3.015e+04 3.284 0.001052 **
## SaleTypeConLw 2.541e+04 2.210e+04 1.150 0.250493
## SaleTypeCWD 9.511e+03 1.706e+04 0.558 0.577220
## SaleTypeNew 2.583e+03 8.235e+03 0.314 0.753848
## SaleTypeOth 2.234e+03 2.171e+04 0.103 0.918036
## SaleTypeWD -3.569e+03 6.921e+03 -0.516 0.606161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40750 on 1316 degrees of freedom
## (51 observations deleted due to missingness)
## Multiple R-squared: 0.5682, Adjusted R-squared: 0.542
## F-statistic: 21.65 on 80 and 1316 DF, p-value: < 2.2e-16
Note:
All of the remaining insignificant terms above are individual dummy-variable levels. They cannot be removed on their own without dropping the entire factor, and forcing them out of the model would inherently cause it to produce NAs.
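As an illustration of the alternative (a hedged sketch, not part of the original workflow), the insignificant levels of a factor such as SaleType could be collapsed into a single "Other" level with fct_other() from the forcats package and the model refit, rather than dropping the whole factor:
## Sketch only: collapse the SaleType levels whose dummies were insignificant into "Other".
## The kept levels are taken from the coefficient table above (COD is the reference level).
require(forcats)
collapsedTrain <- cleanTrain3
collapsedTrain$SaleType <- fct_other(collapsedTrain$SaleType, keep = c("COD", "ConLI"))
## model3b <- lm(SalePrice ~ ., data = collapsedTrain)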
Building the final model / Testing Error (before cross-validation)
The SalePrice variable has a skewed distribution. To combat this, a log-transformed value of SalePrice will be used in the model. This is illustrated below using quantile-quantile plots and histograms, and further illustrated by the skewness of each distribution.
The skew of the SalePrice distribution is 2.5754382, and the skew of the log-transformed SalePrice distribution is -0.1327215. The first distribution is strongly positively skewed, whereas the transformed values are only slightly negatively skewed.
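The skewness values quoted above can be computed as shown below; this is a minimal sketch assuming the skewness() function from the e1071 package (the package actually used is not shown in this section).
## Skewness of SalePrice before and after the log transform (e1071 assumed).
require(e1071)
skewness(housing_market$SalePrice, na.rm = TRUE)
skewness(log(housing_market$SalePrice), na.rm = TRUE)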
require(ggplot2)
require(ggpubr)
q1 <- qplot(sample = SalePrice, data = housing_market) + theme_classic() + xlab("Theoretical Quantiles")
h1 <- ggplot(data = housing_market, aes(SalePrice)) + geom_histogram(fill = "Blue", color = "black", bins = 25)
q2 <- qplot(sample = log(SalePrice), data = housing_market) + theme_classic() + xlab("Theoretical Quantiles")
h2 <- ggplot(data = housing_market, aes(log(SalePrice))) + geom_histogram(fill = "Purple", color = "black", bins = 25)
ggarrange(q1, h1, q2, h2, nrow = 2, ncol = 2)
Predictions and Test Errors Using MLR
predictions <- predict(lm(log(SalePrice) ~ ., data = cleanTrain3), cleanTest3[-22])
MSE
mean((log(cleanTest3$SalePrice) - predictions)^2, na.rm = T)
## [1] 0.05118671
The MSE is slightly above 0.05, which is close enough to our target of 0.05 for our needs.
MAE
abs(mean(log(cleanTest3$SalePrice) - predictions, na.rm = T))
## [1] 0.01001231
The MAE for non log-transformed values is still a good representation of the average difference in predicted and actual values.
abs(mean(cleanTest3$SalePrice - predict(lm(SalePrice ~ ., data = cleanTrain3), cleanTest3[-22]), na.rm = T))
## [1] 1797.03
This means that, on average, the predicted values were off by about 1,800 dollars, which is roughly 1% of the price of a $180,000 home.
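One caveat: abs(mean(...)) above is the magnitude of the average signed error (the bias), whereas the mean absolute error averages the absolute residuals. The sketch below computes both on the log scale from the same predictions object for comparison; the figures reported in this section correspond to the former.
## Bias magnitude vs. mean absolute error on the log scale.
errors <- log(cleanTest3$SalePrice) - predictions
abs(mean(errors, na.rm = TRUE)) ## absolute value of the mean error (as reported above)
mean(abs(errors), na.rm = TRUE) ## mean absolute error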
Validation Set Approach Error Summary
require(kableExtra)
kbl(data.frame("MSE" = mean((log(cleanTest3$SalePrice) - predictions)^2, na.rm = T),
"MAE" = abs(mean(log(cleanTest3$SalePrice) - predictions, na.rm = T)
)))
| MSE | MAE |
|---|---|
| 0.0511867 | 0.0100123 |
Prediction and Test Error Using 5-Fold CV and LOOCV
5-Fold CV corresponds to a split ratio of 0.8: in each fold, 80% of the total observations make up the training set.
require(caret)
cvDF <- rbind(cleanTrain3, cleanTest3)
cv5 <- trainControl(method = "cv", number = 5)
loocv <- trainControl(method = "LOOCV")
modelcv5 <- train(log(SalePrice) ~ LotArea + LandContour + Neighborhood + Condition1 +
Condition2 + HouseStyle + OverallCond + YearBuilt + RoofMatl + ExterQual + BsmtQual +
BsmtFinType2 + X1stFlrSF + X2ndFlrSF + KitchenAbvGr +
TotRmsAbvGrd + PavedDrive + ScreenPorch + PoolArea + MiscVal +
SaleType, data = cvDF, method = "lm", trControl = cv5, na.action = na.pass)
modelloocv <- train(log(SalePrice) ~ LotArea + LandContour + Neighborhood + Condition1 +
Condition2 + HouseStyle + OverallCond + YearBuilt + RoofMatl + ExterQual + BsmtQual +
BsmtFinType2 + X1stFlrSF + X2ndFlrSF + KitchenAbvGr +
TotRmsAbvGrd + PavedDrive + ScreenPorch + PoolArea + MiscVal +
SaleType, data = cvDF, method = "lm", trControl = loocv, na.action = na.pass)cvresults <- modelcv5
loocvresults <- modelloocv
cvMSE <- (cvresults$results[, 2])^2
loocvMSE <- (loocvresults$results[, 2])^2
The 5-fold cross-validation, using a split ratio of 0.8, resulted in a slightly improved model: the MSE for the 5-fold CV MLR is 0.0439646. Using LOOCV, the MSE is 0.0254423, which is somewhat unexpected, as LOOCV estimates typically have greater variance because each held-out set consists of a single observation.
LOOCV and K-Fold CV Error Summary
kbl(data.frame(" " = "MSE", "5-Fold CV" = cvMSE,
"LOOCV" = loocvMSE),
col.names = c(" ", "5-Fold CV", "LOOCV"),
booktabs = T)
| | 5-Fold CV | LOOCV |
|---|---|---|
| MSE | 0.0439646 | 0.0254423 |
Using Random Forest
The random forest model will be run using 200 trees and all the variables.
require(randomForest)
rfpredictions <- predict(randomForest(log(SalePrice) ~ ., data = train, na.action = na.omit, ntree = 200), test[-73])
Random Forest MSE and MAE for Log-transformed Sale Price values
kbl(data.frame("MSE" = mean((log(test$SalePrice) - rfpredictions)^2, na.rm = T),
"MAE" = abs(mean(log(test$SalePrice) - rfpredictions, na.rm = T)
)))
| MSE | MAE |
|---|---|
| 0.0396786 | 0.0050124 |
Using a random forest with 200 trees and all variables also slightly improved the model: the MSE was 0.0396786, and the MAE dropped to 0.0050124.
Q2
require(dplyr)
q2train <- read.table("train_GE_LW-1.txt", header = T, stringsAsFactors = T) %>% filter(COS.Intensity != "NA")
q2test <- read.table("test_GE_LW-1.txt", header = T, stringsAsFactors = T) %>% filter(COS.Intensity != "NA")
EDA
Correlation Heatmap Between All Numeric Variables
require(plotly)
require(heatmaply)
heat <- heatmaply(cor(q2train[-c(25:29)]),
heatmap_layers = theme(axis.line = element_blank()))
heat
require(ggplot2)
require(ggpubr)
meanxy_COS <- ggplot(data = q2train, aes(x = mean.GE_LW_X, y = mean.GE_LW_Y, color = COS.Intensity)) + geom_point() + theme_classic()
meanxy_Sex <- ggplot(data = q2train, aes(x = mean.GE_LW_X, y = mean.GE_LW_Y, color = Sex)) + geom_point() + theme_classic()
ggarrange(meanxy_COS, meanxy_Sex, nrow = 2, ncol = 1)
It already looks like there is a clear separation between clusters.
meanxy_COSY90 <- ggplot(data = q2train, aes(x = mean.GE_LW_X, y = quarlite.GE_LW_Y90, color = COS.Intensity)) + geom_point() + theme_classic()
meanxy_SexY90 <- ggplot(data = q2train, aes(x = mean.GE_LW_X, y = quarlite.GE_LW_Y90, color = Sex)) + geom_point() + theme_classic()
ggarrange(meanxy_COSY90, meanxy_SexY90, nrow = 2, ncol = 1)
It is clear there is separation at 0.
3D Plot
The plots below account for height and weight, respectively.
require(plotly)
par(mfrow = c(1,1))
plot_ly(x = q2train$mean.GE_LW_X, y = q2train$mean.GE_LW_Y, z = q2train$Ht, type = "scatter3d", color = q2train$Sex)
plot_ly(x = q2train$mean.GE_LW_X, y = q2train$mean.GE_LW_Y, z = q2train$Wt, type = "scatter3d", color = q2train$Sex)
The boundary is still clearly visible, and the first plot reveals that observations around average height fall in the positive XY quadrant. In the second plot there are two groups where this occurs; similarly, observations with average weight tend to fall in the positive XY quadrant. There is also very little overlap between males and females in the positive XY quadrant in either plot.
Check the counts for each response level
require(kableExtra)
train_response_counts <- q2train %>% count(COS.Intensity)
test_response_counts <- q2test %>% count(COS.Intensity)
kable(list(train_response_counts, test_response_counts))
Baseline Multinomial Logistic Regression
The baseline multinomial logistic regression has an error rate of 0.3987943.
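The chunk that produced this baseline is not shown here; the sketch below illustrates one way it could have been fit, assuming multinom() from the nnet package, and is where an object like the error_rate value referenced later would come from.
## Hedged sketch of the baseline multinomial logistic regression (nnet::multinom assumed).
require(nnet)
q2mlr <- multinom(COS.Intensity ~ ., data = q2train, trace = FALSE)
mlr_pred <- predict(q2mlr, newdata = q2test[-29])
error_rate <- mean(mlr_pred != q2test$COS.Intensity)
error_rate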
LDA
require(MASS)
q2model1 <- lda(COS.Intensity ~ ., data = q2train)
q2predictionsLDA <- predict(q2model1, q2test[-29])
Interpreting the Results
require(caret)
require(yardstick)
results_tab1 <- table(q2predictionsLDA$class, q2test$COS.Intensity)
autoplot(conf_mat(results_tab1), type = "heatmap") + scale_fill_continuous() +
xlab("Predicted Values") +
ylab("Actual Values")Recall the count tables above for the response variable. This means LDA does not very accurately predict results. It was only successful in predicting values of SED. These values would likely change depending on adjusting the split ratio, but it is clear that another algorithm should be used.
Statistics Based on the confusion matrix
The statistics below summarize the confusion matrix shown above.
print(confusionMatrix(results_tab1))
## Confusion Matrix and Statistics
##
##
## LPA MPA SED VPA
## LPA 461 385 253 30
## MPA 174 660 44 136
## SED 472 203 1389 11
## VPA 1 8 1 85
##
## Overall Statistics
##
## Accuracy : 0.6017
## 95% CI : (0.5869, 0.6163)
## No Information Rate : 0.3911
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4097
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: LPA Class: MPA Class: SED Class: VPA
## Sensitivity 0.4161 0.5255 0.8234 0.32443
## Specificity 0.7916 0.8842 0.7388 0.99753
## Pos Pred Value 0.4083 0.6509 0.6694 0.89474
## Neg Pred Value 0.7968 0.8193 0.8668 0.95804
## Prevalence 0.2569 0.2912 0.3911 0.06075
## Detection Rate 0.1069 0.1530 0.3220 0.01971
## Detection Prevalence 0.2618 0.2351 0.4811 0.02203
## Balanced Accuracy 0.6038 0.7048 0.7811 0.66098
Consistent with what was said above, this model scored a 60.2% accuracy rating.
QDA
q2model2 <- qda(COS.Intensity ~ ., data = q2train)
q2predictionsQDA <- predict(q2model2, q2test[-29])
Confusion Matrix
results_tab2 <- table(q2predictionsQDA$class, q2test$COS.Intensity)
autoplot(conf_mat(results_tab2), type = "heatmap") + scale_fill_continuous() +
xlab("Predicted Values") +
ylab("Actual Values")Statistics Based on the confusion matrix
print(confusionMatrix(results_tab2))
## Confusion Matrix and Statistics
##
##
## LPA MPA SED VPA
## LPA 404 404 230 31
## MPA 110 482 28 132
## SED 594 357 1428 13
## VPA 0 13 1 86
##
## Overall Statistics
##
## Accuracy : 0.5565
## 95% CI : (0.5415, 0.5714)
## No Information Rate : 0.3911
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3352
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: LPA Class: MPA Class: SED Class: VPA
## Sensitivity 0.36462 0.3838 0.8465 0.32824
## Specificity 0.79251 0.9117 0.6329 0.99654
## Pos Pred Value 0.37792 0.6410 0.5970 0.86000
## Neg Pred Value 0.78298 0.7826 0.8652 0.95822
## Prevalence 0.25690 0.2912 0.3911 0.06075
## Detection Rate 0.09367 0.1118 0.3311 0.01994
## Detection Prevalence 0.24786 0.1744 0.5546 0.02319
## Balanced Accuracy 0.57857 0.6477 0.7397 0.66239
KNN
Since KNN computes distances between observations, all of the factor variables need to be converted to numeric codes.
require(class)
require(dplyr)
## Recode the factor levels as integer codes so every column is numeric.
temp_train <- q2train
levels(temp_train$COS.Intensity) <- as.integer(c(1, 2, 3, 4))
levels(temp_train$Sex) <- as.integer(c(1, 2))
temp_test <- q2test
levels(temp_test$COS.Intensity) <- as.integer(c(1, 2, 3, 4))
levels(temp_test$Sex) <- as.integer(c(1, 2))
## Note: the full data frames (including the COS.Intensity column) are passed to knn() as the feature matrices.
model_knn <- knn(temp_train, temp_test, cl = temp_train$COS.Intensity, k = 50)
results_tab3 <- table(model_knn, temp_test$COS.Intensity)
autoplot(conf_mat(results_tab3), type = "heatmap") + scale_fill_continuous() +
xlab("Predicted Values") +
ylab("Actual Values")print(confusionMatrix(results_tab3))## Confusion Matrix and Statistics
##
##
## model_knn 1 2 3 4
## 1 691 410 86 50
## 2 213 629 134 160
## 3 201 213 1465 28
## 4 3 4 2 24
##
## Overall Statistics
##
## Accuracy : 0.6513
## 95% CI : (0.6368, 0.6655)
## No Information Rate : 0.3911
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4843
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.6236 0.5008 0.8684 0.091603
## Specificity 0.8296 0.8342 0.8317 0.997778
## Pos Pred Value 0.5586 0.5537 0.7682 0.727273
## Neg Pred Value 0.8644 0.8026 0.9077 0.944393
## Prevalence 0.2569 0.2912 0.3911 0.060747
## Detection Rate 0.1602 0.1458 0.3397 0.005565
## Detection Prevalence 0.2868 0.2634 0.4422 0.007651
## Balanced Accuracy 0.7266 0.6675 0.8500 0.544691
KNN provided the most accurate baseline method.
Let’s see whether the accuracy changes depending on the value of K.
k <- c(5, 10, 25, 50, 100, 150, 200, 400)
## Compute the test error rate for each value of k.
knn_errors <- sapply(k, function(kk) {
1 - mean(knn(temp_train, temp_test, cl = temp_train$COS.Intensity, k = kk) == temp_test$COS.Intensity)
})
knndf <- data.frame("Error" = knn_errors, "K" = k)
knn_table <- ggtexttable(knndf)
knn_curve <- ggplot(data = knndf, aes(x = K, y = Error)) + geom_line(color = "blue", size = 1.5) + geom_point(shape = "triangle", size = 3) + theme_classic()
ggarrange(knn_curve, knn_table, nrow = 1, ncol = 2, widths = c(3, 2))
Each Model again with a 0.7 split
dataset <- rbind(q2train, q2test)
dim(dataset)
## [1] 8622 29
Using K-Fold Cross-Validation for each model
Using \(k=5\) gives training folds of size \(\frac{k-1}{k}n\); in this scenario, each training set contains roughly \(\frac{5-1}{5} \times 8622 \approx 6898\) observations.
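The five_fold control object used by the chunks below is not defined in this section; a minimal sketch of how it is presumably created (5-fold cross-validation in caret) follows.
## 5-fold cross-validation control object assumed by the chunks below.
require(caret)
five_fold <- trainControl(method = "cv", number = 5)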
LDA with Cross-Validation
cvLDA <- train(COS.Intensity ~ ., data = dataset, method = "lda", trControl = five_fold, metric = "Accuracy")
system.time({train(COS.Intensity ~ ., data = dataset, method = "lda", trControl = five_fold, metric = "Accuracy")})## user system elapsed
## 0.7 0.0 0.7
QDA with Cross-Validation
cvQDA <- train(COS.Intensity ~ ., data = dataset, method = "qda", trControl = five_fold, metric = "Accuracy")
system.time({train(COS.Intensity ~ ., data = dataset, method = "qda", trControl = five_fold, metric = "Accuracy")})## user system elapsed
## 0.63 0.03 0.66
KNN with Cross-Validation
temp_dataset <- rbind(temp_train, temp_test)
cvKNN <- train(COS.Intensity ~ ., data = temp_dataset, method = "knn", tuneGrid = expand.grid(k = k), trControl = five_fold,
metric = "Accuracy")
system.time({train(COS.Intensity ~ ., data = dataset, method = "knn", trControl = five_fold, metric = "Accuracy")})## user system elapsed
## 4.98 0.00 4.99
Overall Error Rates
finalResults <- data.frame("Method" = c("Logistic Regression", "LDA", "QDA", "KNN"),
"Error" = c(error_rate, 1- 0.6017, 1 - 0.5565, min(knndf$Error)),
"5-Fold CV Error" = c(1-max(kfcvMLR$results[, 2]),1- max(cvLDA$results[, 2]),
1-max(cvQDA$results[, 2]), 1-max(cvKNN$results[, 2])))
kbl(finalResults, col.names = c("Method", "Test Error", "K-Fold CV Test Error"))
| Method | Test Error | K-Fold CV Test Error |
|---|---|---|
| Logistic Regression | 0.3987943 | 0.3741543 |
| LDA | 0.3983000 | 0.3801926 |
| QDA | 0.4435000 | 0.4054735 |
| KNN | 0.2304660 | 0.2646639 |
Computational Expenses for Each 5-Fold Cross-Validation Method
Multinomial Logistic Regression
time1
## user system elapsed
## 14.89 0.00 14.89
LDA
system.time({train(COS.Intensity ~ ., data = dataset, method = "lda", trControl = five_fold, metric = "Accuracy")})## user system elapsed
## 0.72 0.00 0.72
QDA
system.time({train(COS.Intensity ~ ., data = dataset, method = "qda", trControl = five_fold, metric = "Accuracy")})## user system elapsed
## 0.65 0.00 0.65
KNN
system.time({train(COS.Intensity ~ ., data = temp_dataset, method = "knn", tuneGrid = expand.grid(k = k), trControl = five_fold, metric = "Accuracy")})
## user system elapsed
## 21.66 0.01 21.67
Group Work: Each member of the group contributed equally to the construction of this project. We initially began by doing research and discussing the project questions. We then created an RStudio Cloud session and were able to collaborate on programming simultaneously. Each member was able to utilize their skills in various areas such as data cleaning, charting, coding optimization, and rendering. This project helped our group build better synergy which will prepare us for future projects.