————————————————————————————————————————–

Part 1: Data Exploration

Data Summary

This data set explores the extent to which 79 housing attributes explain housing prices in Ames, Iowa. There are 1460 rows of data, each representing a distinct house recently sold in the Ames, Iowa metropolitan area. The attributes can all potentially be used as predictor variables for the one response variable (“SalePrice”), which is the price the house actually sold for in that market. A model is fit to a training data set drawn from this superset of housing data and then used to predict sale prices for test data from the same superset. The predictions will ultimately be compared against actual sales data to determine their accuracy.

————————————————————————————————————————–

Part 2: Data Preparation

Metadata was provided with the data sets for this exercise, including a data dictionary, “data_description.txt”. So that the model could be fit on complete cases, missing values were imputed for the attributes listed below. The primary step was converting “NA” to “None” for categorical attributes and “NaN” to 0 for numeric attributes. Doing so stays consistent with the original data and keeps every case in the model, avoiding the skew and bias that missing data would otherwise introduce.

Note that substituting “None” for “NA” simply indicates the absence of that attribute, e.g. a garage attribute when no garage exists, or a masonry veneer type when the facade is not masonry. Similarly, 0 is substituted for “NaN” when a numeric attribute has no value, e.g. Lot Frontage in linear feet when there is no lot frontage.
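The substitution rule described above can be sketched in base R. This is a minimal illustration on a made-up three-row data frame; the helper name `impute_missing` is hypothetical and is not taken from the original analysis.

```r
# Toy data frame standing in for the housing data (values are made up).
houses <- data.frame(
  GarageType  = c("Attchd", NA, "Detchd"),
  LotFrontage = c(65, NA, 80),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "None" for missing text values, 0 for missing numbers.
impute_missing <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      # is.na() is TRUE for both NA and NaN in numeric vectors
      df[[col]][is.na(df[[col]])] <- 0
    } else {
      df[[col]] <- as.character(df[[col]])
      df[[col]][is.na(df[[col]])] <- "None"
    }
  }
  df
}

imputed <- impute_missing(houses)
print(imputed)
```

Converting to character before filling avoids the problem that a factor column rejects a value ("None") that is not already one of its levels.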

Ordinal Imputations

Each entry lists the attribute, its column index in parentheses, and the number of values imputed.

LotFrontage(4) 259, GarageYrBlt(60) 81, MasVnrArea(27) 8

Factor Imputations

Electrical(43) 1, MasVnrType(26) 8, BsmtQual(31) 57, BsmtCond(32) 57, BsmtFinType1(34) 57, BsmtExposure(33) 58, BsmtFinType2(36) 58, GarageType(59) 81, GarageFinish(61) 81, GarageQual(64) 81, GarageCond(65) 81, FireplaceQu(58) 690, Fence(74) 1179, Alley(7) 1369, MiscFeature(75) 1406, PoolQC(73) 1453
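Counts like the ones listed above can be obtained by tallying missing values per column. The following is a small sketch on made-up data, not the original code; only the column names are borrowed from the data set.

```r
# Made-up rows; only the column names come from the housing data.
houses <- data.frame(
  Alley      = c(NA, "Pave", NA, NA),
  Electrical = c("SBrkr", "SBrkr", NA, "FuseA"),
  LotArea    = c(8450, 9600, 11250, 9550),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(houses))

# Keep only columns that need imputation, smallest count first
na_counts <- sort(na_counts[na_counts > 0])
print(na_counts)
```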

Descriptive Statistics

Significant skew and kurtosis exist for the attributes LotArea, LowQualFinSF, X3SsnPorch, ScreenPorch, PoolArea and MiscVal. These attributes are indicative of uncommon occurrences in the data set. Modeling will determine whether these values truly represent “outliers” or are uncommon but valuable to include in the model.

vars           id     n    mean     sd  median   min     max   range  skew  kurtosis    se
MSSubClass      2  1460      57     42      50    20     190     170     1         2     1
LotFrontage     4  1201      70     24      69    21     313     292     2        17     1
LotArea         5  1460   10517   9981    9478  1300  215245  213945    12       202   261
OverallQual    18  1460       6      1       6     1      10       9     0         0     0
OverallCond    19  1460       6      1       5     1       9       8     1         1     0
YearBuilt      20  1460    1971     30    1973  1872    2010     138    -1         0     1
YearRemodAdd   21  1460    1985     21    1994  1950    2010      60    -1        -1     1
MasVnrArea     27  1452     104    181       0     0    1600    1600     3        10     5
BsmtFinSF1     35  1460     444    456     384     0    5644    5644     2        11    12
BsmtFinSF2     37  1460      47    161       0     0    1474    1474     4        20     4
BsmtUnfSF      38  1460     567    442     478     0    2336    2336     1         0    12
TotalBsmtSF    39  1460    1057    439     992     0    6110    6110     2        13    11
X1stFlrSF      44  1460    1163    387    1087   334    4692    4358     1         6    10
X2ndFlrSF      45  1460     347    437       0     0    2065    2065     1        -1    11
LowQualFinSF   46  1460       6     49       0     0     572     572     9        83     1
GrLivArea      47  1460    1515    525    1464   334    5642    5308     1         5    14
BsmtFullBath   48  1460       0      1       0     0       3       3     1        -1     0
BsmtHalfBath   49  1460       0      0       0     0       2       2     4        16     0
FullBath       50  1460       2      1       2     0       3       3     0        -1     0
HalfBath       51  1460       0      1       0     0       2       2     1        -1     0
BedroomAbvGr   52  1460       3      1       3     0       8       8     0         2     0
KitchenAbvGr   53  1460       1      0       1     0       3       3     4        21     0
TotRmsAbvGrd   55  1460       7      2       6     2      14      12     1         1     0
Fireplaces     57  1460       1      1       1     0       3       3     1         0     0
GarageYrBlt    60  1379    1979     25    1980  1900    2010     110    -1         0     1
GarageCars     62  1460       2      1       2     0       4       4     0         0     0
GarageArea     63  1460     473    214     480     0    1418    1418     0         1     6
WoodDeckSF     67  1460      94    125       0     0     857     857     2         3     3
OpenPorchSF    68  1460      47     66      25     0     547     547     2         8     2
EnclosedPorch  69  1460      22     61       0     0     552     552     3        10     2
X3SsnPorch     70  1460       3     29       0     0     508     508    10       123     1
ScreenPorch    71  1460      15     56       0     0     480     480     4        18     1
PoolArea       72  1460       3     40       0     0     738     738    15       222     1
MiscVal        76  1460      43    496       0     0   15500   15500    24       698    13
MoSold         77  1460       6      3       6     1      12      11     0         0     0
YrSold         78  1460    2008      1    2008  2006    2010       4     0        -1     0
SalePrice      81  1460  180921  79443  163000 34900  755000  720100     2         6  2079
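The skew and kurtosis columns in the table above summarize how lopsided and heavy-tailed each attribute is. As a rough sketch, both can be computed from sample moments in base R. The function names and the sample vectors below are illustrative assumptions, and packaged implementations may apply small-sample corrections that shift the values slightly.

```r
# Moment-based skewness: third central moment over the variance^(3/2).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Excess kurtosis: fourth central moment over variance^2, minus 3
# so that a normal distribution scores about 0.
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# Made-up sample: mostly zeros with one large value, like a rare feature.
rare_feature <- c(rep(0, 99), 500)
symmetric    <- c(-2, -1, 0, 1, 2)

print(c(skew = skewness(rare_feature), kurtosis = excess_kurtosis(rare_feature)))
print(c(skew = skewness(symmetric), kurtosis = excess_kurtosis(symmetric)))
```

An attribute that is zero for almost every house and large for a handful, as in the made-up `rare_feature` vector, produces the same pattern of very high skew and kurtosis seen for PoolArea and MiscVal.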

Barplots & Categorical Data

Categorical data is converted to factor data for inclusion in the model and represented below as proportional bar plots to indicate each category's relation to housing price.
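A minimal sketch of the factor conversion and a proportional breakdown follows. How the original plots tie categories to price is not shown in this report, so the two price bands used here are an assumption, and the six rows are made up.

```r
# Made-up rows; MSZoning and SalePrice are column names from the data set.
houses <- data.frame(
  MSZoning  = c("RL", "RL", "RM", "RL", "FV", "RM"),
  SalePrice = c(208500, 181500, 120000, 250000, 230000, 98000),
  stringsAsFactors = FALSE
)

# Convert the categorical attribute to a factor for use in lm()
houses$MSZoning <- factor(houses$MSZoning)

# Assumed price bands (hypothetical cut point at 150,000)
houses$PriceBand <- cut(houses$SalePrice,
                        breaks = c(0, 150000, Inf),
                        labels = c("<=150k", ">150k"))

# Share of each price band within each zoning class (columns sum to 1)
props <- prop.table(table(houses$PriceBand, houses$MSZoning), margin = 2)
print(props)

# Draw the proportional bar plot only in an interactive session
if (interactive()) {
  barplot(props, legend.text = TRUE, main = "Zoning Proportions")
}
```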

Histograms for Ordinal Data

Ordinal attributes are shown as histograms. Most of these attributes contain enough cases that, even when significant skew occurs, they are still useful to an ordinary least squares model. The significance of each attribute will be tested by p-value, AIC and VIF using the lm model in Part 3.
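The three diagnostics named above can be illustrated on simulated data. In this sketch the variance inflation factor is computed by hand as 1 / (1 - R²), where R² comes from regressing one predictor on the others; the helper `vif_manual` and the simulated values are assumptions for illustration, and packages such as car provide a ready-made `vif()`.

```r
set.seed(1)
n <- 200

# Simulated predictors; TotRmsAbvGrd is deliberately tied to GrLivArea.
GrLivArea    <- rnorm(n, 1500, 500)
TotRmsAbvGrd <- GrLivArea / 250 + rnorm(n, 0, 0.5)
YearBuilt    <- round(runif(n, 1900, 2010))
SalePrice    <- 50 * GrLivArea + 2000 * TotRmsAbvGrd +
                500 * (YearBuilt - 1900) + rnorm(n, 0, 20000)
houses <- data.frame(SalePrice, GrLivArea, TotRmsAbvGrd, YearBuilt)

fit <- lm(SalePrice ~ GrLivArea + TotRmsAbvGrd + YearBuilt, data = houses)

# Per-coefficient p-values and the model's AIC
p_values  <- summary(fit)$coefficients[, "Pr(>|t|)"]
model_aic <- AIC(fit)

# Hypothetical helper: VIF of each predictor from an auxiliary regression
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(houses, c("GrLivArea", "TotRmsAbvGrd", "YearBuilt"))
print(round(vifs, 1))
```

A large VIF flags a predictor that the other predictors largely explain, which is what happens to the two correlated size variables here while YearBuilt stays close to 1.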

————————————————————————————————————————–

Part 3: Build Models

Model 1: Use the stepAIC function (direction = "both") to build a model
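A minimal sketch of stepwise selection in both directions with `MASS::stepAIC` is given here. The data are simulated and the starting formula is an assumption; the report does not show which formula the actual model started from.

```r
library(MASS)  # provides stepAIC()

set.seed(42)
n <- 300

# Simulated data: two real predictors and one pure-noise column.
houses <- data.frame(
  GrLivArea   = rnorm(n, 1500, 400),
  OverallQual = sample(1:10, n, replace = TRUE),
  Noise       = rnorm(n)
)
houses$SalePrice <- 20000 + 60 * houses$GrLivArea +
  15000 * houses$OverallQual + rnorm(n, 0, 15000)

# Start from the full model and let stepAIC add or drop terms
full_fit <- lm(SalePrice ~ ., data = houses)
step_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)

print(formula(step_fit))
print(c(full = AIC(full_fit), stepwise = AIC(step_fit)))
```

With `direction = "both"`, each step considers dropping a term as well as re-adding one that was removed earlier, and the search stops when no single change lowers the AIC.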

————————————————————————————————————————–

Part 4: Generate Test Predictions from Model
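Generating predictions from a fitted lm model comes down to a call to `predict()` with the test data as `newdata`. The sketch below uses simulated training rows and a three-row test frame; the `Id` column and all values are hypothetical.

```r
set.seed(7)

# Simulated training data with a single predictor
train <- data.frame(GrLivArea = runif(100, 800, 3000))
train$SalePrice <- 30000 + 90 * train$GrLivArea + rnorm(100, 0, 10000)

# Hypothetical test rows: same predictor columns, no SalePrice
test <- data.frame(Id = 1:3, GrLivArea = c(1000, 1500, 2000))

fit <- lm(SalePrice ~ GrLivArea, data = train)

# One predicted sale price per test row
predictions <- data.frame(
  Id        = test$Id,
  SalePrice = predict(fit, newdata = test)
)
print(predictions)
```

The test data need the same imputation and factor conversion as the training data, because `predict()` stops with an error if a factor in `newdata` contains a level the model never saw.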

Barplots & Categorical Data

[Figures: proportional bar plots titled "Zoning Proportions" and "Exterior1st Proportions". One row containing a non-finite value was removed when plotting.]

————————————————————————————————————————–

Part 5: Output file

https://raw.githubusercontent.com/scottkarr/IS605-scottkarr-final/master/predictions.csv
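Writing the predictions to a CSV file and scoring them against actual prices might look like the sketch below. The two-column layout, the temporary output path, and all values are assumptions; the structure of the linked predictions.csv is not described in this report.

```r
# Made-up predictions; the Id / SalePrice layout is an assumption.
predictions <- data.frame(
  Id        = 1:3,
  SalePrice = c(120000, 165000, 210000)
)

# Write to a temporary location and read back to confirm the layout
out_path <- file.path(tempdir(), "predictions.csv")
write.csv(predictions, out_path, row.names = FALSE)
round_trip <- read.csv(out_path)

# Compare against made-up actual sale prices with root mean squared error
actual <- c(125000, 160000, 215000)
rmse <- sqrt(mean((predictions$SalePrice - actual)^2))
print(rmse)
```

Root mean squared error is only one possible accuracy measure; the report does not state which metric was used for the final comparison.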