Study Objective

Hello, college graduate! Are you interested in buying your first home? Would you like to stay here in Cincinnati, Ohio? At this point in time the housing market is considered a Seller’s Market, meaning there are more buyers than sellers, so sellers have the upper hand. Now more than ever you want to ensure you are paying no more than similar homes in the area.

For all of my home owners out there who are looking to sell without an agent, this study is for you as well. Read below to ensure you are an informed seller and are not over or under pricing the value of your home.

We performed a study of recently sold homes in the Cincinnati area in an attempt to model the factors that contribute to a home’s value. Our data was pulled from Zillow, an online real estate database for the United States. We only used data on homes that were recently sold to ensure we captured finalized sales price data.

link to zillow maps

Data Preparation

We first want to acquaint you with the data used in this study. Should you like to reperform our analysis, see below.

Required Packages

The packages required to run the code in this study are the following:

library(readxl)      # To read in the data
library(tidyverse)   # For general data manipulation and regression analysis
library(knitr)       # To format tables
library(DAAG)
library(ggplot2)     # To visualize data
library(psych)
library(qwraps2)     # To make summary tables
library(naniar)      # To replace missing values
library(formattable) # To format tables into currency
library(MASS)        # To calculate VIF

Data Wrangling

Reading in Observations and Clean Missing Values

The following code illustrates how to load in the housing data obtained from Zillow and get it in a proper/workable format. This includes handling missing values in our data.

# Read in the data
data <- read_xlsx("zillow data.xlsx")

# Remove the column that includes a link to the listing
zillow <- data[,-1]
# Coerse 'N/A' in the data to be read as NAs
zillow <- zillow %>%
  replace_with_na_all(condition = ~.x %in% common_na_strings)
# Coerse variables to their proper data type
# str(zillow)
zillow$zip <- as.factor(zillow$zip)
zillow$heating <- as.factor(zillow$heating)
zillow$cooling <- as.factor(zillow$cooling)
zillow$garage <- as.factor(zillow$garage)
zillow$basement_status <- as.factor(zillow$basement_status)
zillow$lot_size_ft <- as.numeric(zillow$lot_size_ft)

# Correct typo in a header
names(zillow)[15]<-"num_stories"
# Impute NA in Square Footage with the Average Square Footage in that Zip Code
zillow %>% filter(zip == 45202) %>% 
  summarise(mean(sqr_ft, na.rm = TRUE))
## # A tibble: 1 x 1
##   `mean(sqr_ft, na.rm = TRUE)`
##                          <dbl>
## 1                        2602.
  # 2566.32 
zillow$sqr_ft[is.na(zillow$sqr_ft)] <- 2566.32

# Impute NA in Lot Size with the Average Lot Size in that Zip Code
zillow %>% filter(zip == 45202) %>% 
  summarise(mean(lot_size_ft, na.rm = TRUE))
## # A tibble: 1 x 1
##   `mean(lot_size_ft, na.rm = TRUE)`
##                               <dbl>
## 1                             5066.
  # 13233.12    
zillow$lot_size_ft[is.na(zillow$lot_size_ft)] <- 13233.12

# Impute 0 in the Fireplace option, because we assume if it wasn't listed, they don't have it
zillow$Fireplace[is.na(zillow$Fireplace)] <- 0

# Confirm all missing values are taken care of
colSums(is.na(zillow))     
##         Address             zip      sold_price       sold_date 
##               0               0               0               0 
##         num_bed        num_bath          sqr_ft      year_built 
##               0               0               0               0 
##     lot_size_ft         heating         cooling  parking_spaces 
##               0               0               0               0 
##          garage basement_status     num_stories           Pool? 
##               0               0               0               0 
##  type_and_style       Fireplace 
##               0               0


Zillow Home Value Time Series by Zip Code

To enhance this analysis, we additionally obtained Median Home Values (per Square Foot) by Zip Code from Zillow’s Research website, found here. To upload this data into our workspace, please do the following:

home_val <- read_csv("Zip_MedianValuePerSqft_AllHomes.csv")

We then want to filter this dataset for only zip codes of interest, which are the zipcodes included in the zillow dataset (which are Cincinnati zip codes).

home_val_f <- filter(home_val, RegionName %in% unique(zillow$zip))

With this new information, we can calculate an average of the median prices (which are given by month) from the last 12 months for each zip code. This new variable will be shown in the preview on the next tab.

home_val_avgmed <- data.frame(ID=home_val_f[,2], Means=rowMeans(home_val_f[,17:28]))


Merging Datasets + Variable Transformation

To simplify this analysis, we can merge the two datasets together. The only important information in the Home Value dataset is the median value of homes for each zip code. The average of those median values for each zipcode was calculated and stored in the home_val_avgmed data frame.

zillow$zip <- as.character(zillow$zip)
home_val_avgmed$RegionName <- as.character(home_val_avgmed$RegionName)
zillow1 <- left_join(x = zillow, y = home_val_avgmed, by = c("zip" = "RegionName"))

# Convert the Median value per square foot to median value for similar homes.
zillow2 <- zillow1 %>% mutate(med_val = sqr_ft * Means)
zillow <- zillow2[,-19]
zillow$zip <- as.factor(zillow$zip)


Another point - For analysis purposes, it may be easier to transform the year_built variable into an age variable. In other words, how many years old is the building as of 2019?

zillow$age <- 2019 - zillow$year_built


Code for formatting data

In the background, we can make sure our data output reads well to the viewer.

#assign a class    
class(zillow$sold_price) <- c("money", class(zillow$sold_price))

#S3 print method for the class    
print.money <- function(x, ...) {
  print.default(paste0("$", formatC(as.numeric(x), format="f", digits=0, big.mark=",")))
}

#format method, which is necessary for formating in a data.frame   
format.money  <- function(x, ...) {
  paste0("$", formatC(as.numeric(x), format="f", digits=0, big.mark=","))
}

Data Preview

Data Dictionary This dataset includes the following variables:

Variable Type Description
Zip Factor Zip Code in which the Home is located
Price Double Final Sales Price of the Home
# Bed Double Number of Bedrooms
# Bath Double Number of Bathrooms (full- or half-bath)
Square Footage Double Square Footage of the Home
Year Built Character Year in which the Home was first built
Lot Size Double Square Footage of the Property on which the Home sits
Heating Factor Method of Heating, such as Gas, Electric, Baseboard, None, or Other
Cooling Factor Method of Cooling, such as Central, Wall Units, None, or Other
Parking Spaces Double Number of Parking Spaces in the Garage. Zero if no garage.
Garage Factor Type of Garage: Attached, Detatched, Carport, or None
Basement Factor Status of the Basement: Finished, Partially Finished, Unfinished, or None
Stories Double Number of Floors in the Home
Pool Boolean/Numeric Does the Home have a private Pool; 1 if Yes, 0 if no
Style Factor Style of Home, such as Historical, Traditional, Transitional, Contemporary, etc.
Fireplace Double How many fireplaces are in the home, working or otherwise?
Average Median Home Value in Zip Double The Average of the Median Home Values for each Zip Code. Taken from Zillow’s monthly data
Years Old Double Number of Years since being built. [ 2019 - Year Built ]


Below is a brief preview of the data:

Zip Price # Bed # Bath Square Footage Year Built Lot Size Heating Cooling Parking Spaces Garage Basement Stories Pool Style Fireplace Average Median Home Value in Zip Years Old
45204 $51,000 4 2.0 2000 1930 4095.00 gas central 1 detached Unfinished 3 0 other 0 $153,667 89
45205 $20,000 2 1.0 1320 1892 130680.00 gas central 0 none Partially Finished 2 0 traditional 0 $54,670 127
45205 $16,000 2 1.0 1200 1892 1263.00 gas central 0 none Partially Finished 2 0 traditional 1 $49,700 127
45238 $124,900 2 2.0 1731 1950 8276.00 gas central 1 none Finished 1 0 ranch 1 $151,463 69
45208 $399,900 3 3.0 1653 1916 9147.00 gas central 2 detached Partially Finished 2 0 traditional 1 $362,971 103
45208 $414,250 3 2.0 1754 1959 9583.00 gas central 1 attached Finished 4 0 transitional 1 $385,149 60
45225 $80,025 5 2.0 2189 1890 2613.00 gas wall units 1 none Partially Finished 2 0 traditional 3 $144,292 129
45230 $170,000 3 2.0 1291 1960 6098.00 gas central 1 attached Partially Finished 1 0 ranch 1 $155,458 59
45237 $145,000 4 2.0 1779 1937 5924.00 gas central 1 none Finished 1 0 cape cod 1 $127,643 82
45237 $158,000 4 2.0 1472 1925 5488.00 gas central 1 detached Partially Finished 1 0 cape cod 1 $105,616 94
45224 $122,900 2 2.0 1051 1922 10628.00 gas central 2 none Finished 1 0 traditional 1 $83,817 97
45237 $195,000 5 3.0 2542 1936 13939.20 gas central 2 attached Partially Finished 3 0 colonial 1 $182,389 83
45206 $219,000 2 2.0 1432 1895 3659.00 gas central 0 none Partially Finished 2 0 traditional 0 $206,805 124
45237 $133,400 3 2.0 1470 1934 4443.00 gas central 1 attached Partially Finished 1 0 cape cod 1 $105,473 85
45237 $59,000 4 2.0 1680 1948 4791.00 gas none 2 detached Partially Finished 1 0 multifamily 0 $120,540 71
45223 $325,000 4 3.0 1925 1918 4791.00 gas central 0 none Partially Finished 3 0 traditional 0 $160,738 101
45230 $75,000 2 2.0 1473 1895 7710.00 gas central 0 none Unfinished 1 0 cape cod 0 $177,374 124
45208 $142,000 1 1.0 1135 1900 2047.00 gas central 0 none Unfinished 2 0 cape cod 0 $249,227 119
45217 $205,000 4 2.0 1920 1930 11325.60 gas central 1 attached Unfinished 2 0 traditional 1 $133,600 89
45206 $204,900 2 3.0 1284 1985 1655.00 gas central 1 attached Partially Finished 2 0 transitional 1 $185,431 34
45230 $154,900 3 1.0 1456 1951 17424.00 Baseboard central 1 attached Partially Finished 1 0 other 2 $175,327 68
45211 $198,000 3 2.0 1276 2007 6843.00 gas central 2 attached Partially Finished 1 0 bungalow 1 $96,125 12
45208 $975,000 5 7.0 4190 1941 15246.00 gas central 3 attached Finished 2 0 traditional 1 $920,054 78
45208 $835,000 4 4.0 3947 1907 14374.80 Baseboard central 2 detached Unfinished 3 0 other 1 $866,695 112
45238 $70,000 4 1.0 1373 1948 7710.00 gas central 2 none Partially Finished 1 0 cape cod 1 $120,138 71
45211 $195,000 4 4.0 2348 1976 14374.80 electric central 2 attached Finished 2 0 traditional 1 $176,883 43
45239 $138,000 4 2.0 1558 1951 7884.00 gas central 1 attached Finished 1 0 cape cod 0 $133,988 68
45211 $139,000 4 3.0 2270 1929 9147.00 electric wall units 2 detached Partially Finished 2 0 tudor 0 $171,007 90
45208 $465,000 3 3.0 1976 2010 3049.00 gas central 1 attached Partially Finished 2 0 traditional 1 $433,897 9
45239 $183,000 4 2.0 2263 1948 10149.00 gas wall units 3 attached Partially Finished 2 0 multifamily 0 $194,618 71
45213 $184,900 2 2.0 1308 1920 3484.00 gas central 3 none Finished 1 0 ranch 0 $161,865 99
45208 $378,000 4 3.0 2011 1925 7535.00 gas wall units 3 detached Partially Finished 2 0 traditional 1 $441,582 94
45220 $125,000 2 2.0 1632 1912 4704.00 gas central 1 none Unfinished 2 0 victorian 2 $186,184 107
45206 $445,250 5 4.0 3318 1905 5314.00 forced air central 0 none Partially Finished 2 0 historical 1 $479,174 114
45238 $138,900 3 3.0 1740 1964 13068.00 gas central 2 attached Finished 1 0 ranch 1 $152,250 55
45211 $108,000 3 2.0 1279 1937 8712.00 gas central 1 detached Partially Finished 1 0 cape cod 1 $96,351 82
45238 $92,000 3 1.0 1267 1954 8058.00 gas central 1 attached Partially Finished 1 0 cape cod 0 $110,863 65
45209 $230,000 2 1.0 1100 1935 3781.00 gas central 1 attached Partially Finished 1 0 historical 1 $217,433 84
45211 $75,000 4 1.0 1221 1953 13503.60 gas central 1 none Unfinished 1 0 cape cod 0 $91,982 66
45205 $121,500 4 2.0 1500 1910 8407.00 gas central 2 attached Partially Finished 2 0 traditional 1 $62,125 109
45208 $243,000 4 2.0 1565 1925 4573.00 gas central 1 attached Partially Finished 2 0 tudor 1 $343,648 94
45205 $55,000 3 1.0 1200 1915 4225.00 gas central 0 none Unfinished 2 0 traditional 0 $49,700 104
45208 $799,000 5 3.0 2370 1908 6577.00 Baseboard central 2 detached Unfinished 2 0 other 0 $520,413 111
45208 $595,000 4 4.0 2280 1986 11325.60 other central 2 detached Partially Finished 2 1 other 1 $500,650 33
45226 $285,000 3 3.0 1916 1890 5314.00 gas central 0 none Partially Finished 2 0 victorian 0 $369,469 129
45208 $545,000 4 4.0 1932 1993 8276.00 gas central 2 attached Finished 2 0 traditional 1 $424,235 26
45227 $162,630 3 1.0 1403 1940 24393.60 oil central 1 attached Unfinished 1 0 cape cod 1 $177,830 79
45205 $78,000 2 2.0 1198 1929 5625.00 forced air central 2 detached Partially Finished 1 0 other 1 $49,617 90
45209 $307,000 2 3.0 1600 1904 4181.00 gas central 5 detached Finished 1 0 craftsman 1 $316,267 115
45209 $622,000 5 5.0 2505 1913 8494.00 gas central 2 detached Partially Finished 2 0 victorian 0 $495,155 106
45245 $141,000 3 1.0 1572 1972 10497.00 gas central 1 attached Partially Finished 2 0 split level 0 $183,138 47
45245 $80,000 3 1.0 1050 1964 10019.00 gas none 1 attached none 1 0 ranch 0 $122,325 55
45226 $692,500 4 4.0 3483 1941 9408.00 Baseboard central 2 attached Finished 2 0 other 1 $671,639 78
45209 $202,000 2 2.0 1776 1923 9975.00 gas wall units 1 detached Partially Finished 2 0 multifamily 0 $351,056 96
45209 $651,815 2 1.0 1240 1912 5662.00 gas wall units 1 detached Unfinished 2 0 new traditional 0 $245,107 107
45223 $160,000 2 2.0 1094 1922 3310.00 gas central 0 none Partially Finished 1 0 transitional 1 $91,349 97
45245 $190,000 3 2.0 1828 1993 65775.60 electric central 2 attached none 1 0 ranch 1 $212,962 26
45226 $570,000 3 4.0 2204 2014 1306.00 gas central 2 attached Finished 2 0 contemporary 1 $425,005 5
45245 $173,000 2 2.0 1336 2010 3759.00 gas central 2 attached none 1 0 ranch 0 $155,644 9
45245 $173,000 2 3.0 1336 2009 3759.00 gas central 2 attached none 1 0 ranch 0 $155,644 10
45209 $380,600 3 3.0 1950 1914 5576.00 gas central 2 none Unfinished 2 0 traditional 0 $385,450 105
45245 $175,000 3 3.0 1570 1979 10018.00 electric central 1 attached Finished 1 1 split level 0 $182,905 40
45245 $154,900 3 2.0 1032 1952 27878.40 gas central 1 detached none 1 0 ranch 0 $120,228 67
45223 $265,000 3 3.0 1592 1910 3649.00 gas central 2 none Partially Finished 2 0 multifamily 2 $132,932 109
45223 $225,000 3 3.0 1905 1900 2613.00 electric central 1 detached Partially Finished 1 0 traditional 0 $159,068 119
45223 $152,500 3 2.0 1344 1914 2352.00 gas central 1 none Unfinished 2 0 traditional 0 $112,224 105
45209 $155,000 2 1.0 966 1940 4965.00 Baseboard central 1 attached Partially Finished 1 0 other 0 $190,946 79
45206 $145,000 2 3.0 1989 1993 10018.80 gas central 2 attached Finished 2 0 other 0 $287,245 26
45202 $284,000 3 2.0 1904 1890 21780.00 gas central 0 none Partially Finished 2 0 historical 0 $437,127 129
45238 $118,500 4 3.0 1314 1950 7623.00 gas central 1 attached Partially Finished 1 0 cape cod 0 $114,975 69
45205 $51,000 3 1.0 1208 1935 5837.00 gas central 1 attached Partially Finished 2 0 cape cod 1 $50,031 84
45223 $97,000 3 1.0 1348 1920 17424.00 gas central 1 none Partially Finished 1 0 cape cod 0 $112,558 99
45227 $35,000 3 1.0 1338 1911 5662.00 gas none 0 none Partially Finished 1 0 cape cod 0 $169,592 108
45255 $200,000 3 2.0 1584 2004 19602.00 gas central 2 attached none 1 0 ranch 1 $198,264 15
45202 $757,000 5 5.0 3678 1865 9147.00 gas central 3 detached Partially Finished 3 0 historical 0 $844,408 154
45239 $138,000 4 2.0 1915 1961 6403.00 gas central 1 attached Finished 2 0 transitional 0 $164,690 58
45227 $199,000 2 2.0 1096 1909 7274.00 gas central 2 none Partially Finished 2 0 traditional 0 $138,918 110
45239 $77,500 3 2.0 1231 1949 10890.00 gas central 1 attached Partially Finished 2 0 cape cod 0 $105,866 70
45230 $165,000 4 2.0 2173 1960 14810.40 Baseboard central 2 attached Partially Finished 2 0 other 1 $261,665 59
45230 $135,100 2 1.0 1216 1919 11325.60 gas central 1 attached Partially Finished 2 0 traditional 0 $146,427 100
45202 $240,000 2 2.0 2105 1880 2308.00 electric, gas central 2 detached Partially Finished 2 0 historical 1 $483,273 139
45227 $200,000 3 2.0 2086 1910 7448.00 Baseboard none 0 none Partially Finished 2 0 other 0 $264,401 109
45233 $130,000 3 2.0 1227 1959 7623.00 gas central 1 detached Partially Finished 1 0 ranch 0 $133,027 60
45230 $150,000 3 2.0 1580 1937 23958.00 Baseboard central 1 attached Partially Finished 1 0 other 1 $190,258 82
45213 $380,000 3 2.0 1756 1927 11761.20 gas central 2 detached Partially Finished 2 0 transitional 1 $217,305 92
45224 $265,000 3 4.0 2250 1948 23958.00 gas central 2 attached Finished 2 0 traditional 1 $179,438 71
45224 $197,250 5 2.0 2430 1963 9100.00 gas central 2 attached Unfinished 1 0 other 1 $193,793 56
45227 $83,000 2 1.0 1210 1915 4356.00 gas central 1 none Partially Finished 2 0 traditional 1 $153,368 104
45227 $159,900 2 2.0 1278 1938 4356.00 gas central 1 attached Finished 2 0 cape cod 1 $161,987 81
45230 $150,000 3 2.0 1322 1951 6316.00 electric, gas central 3 detached Partially Finished 1 0 cape cod 0 $159,191 68
45227 $80,000 2 1.0 958 1928 4791.00 gas central 1 detached Partially Finished 1 0 ranch 0 $121,427 91
45233 $91,000 2 1.0 1000 1960 7840.00 gas central 1 none Partially Finished 1 0 ranch 0 $108,417 59
45245 $184,900 3 3.0 1128 1970 20037.60 gas central 2 attached Finished 1 0 traditional 1 $131,412 49
45216 $103,500 4 4.0 2408 1898 6969.00 gas central 1 none Partially Finished 2 0 traditional 4 $136,052 121
45204 $159,000 3 2.0 1422 1913 4356.00 gas central 0 none Partially Finished 2 0 colonial 1 $109,257 106
45220 $675,000 5 5.0 4630 1942 93654.00 Baseboard central 3 attached Partially Finished 2 1 other 1 $528,206 77
45204 $200,000 2 1.0 2280 1900 8407.00 gas central 1 none Unfinished 2 0 multifamily 0 $175,180 119
45204 $105,000 3 2.0 1352 1900 1655.00 gas none 0 none Partially Finished 2 0 transitional 0 $103,879 119
45238 $70,000 2 2.0 916 1939 5488.00 gas central 1 attached Partially Finished 1 0 cape cod 0 $80,150 80
45245 $213,000 3 3.0 1680 2011 10018.00 gas central 2 attached Unfinished 2 0 colonial 0 $195,720 8
45208 $2,632,000 6 7.0 7900 1986 104544.00 Baseboard central 2 attached Finished 2 0 other 0 $1,734,708 33
45227 $455,000 2 3.0 2416 1929 5885.00 gas central 1 detached Finished 2 0 tudor 0 $306,228 90
45205 $99,900 3 1.0 1794 1922 7274.00 Baseboard central 0 none Partially Finished 1 0 other 1 $74,302 97
45238 $110,000 4 2.0 1534 1949 6098.00 gas central 1 attached Partially Finished 1 0 cape cod 0 $134,225 70
45208 $500,000 3 4.0 1788 1957 6534.00 gas central 2 attached Finished 2 0 traditional 1 $392,615 62
45202 $320,000 4 2.0 2424 1880 3920.00 gas central 1 detached Unfinished 3 0 traditional 1 $556,510 139
45202 $1,000,000 3 4.0 3848 1996 2178.00 gas central 2 attached Finished 3 0 other 1 $883,437 23
45240 $173,655 4 3.0 2250 1974 9191.16 gas central 2 attached Partially Finished 2 0 traditional 1 $193,875 45
45249 $187,000 3 3.0 1504 1988 11761.00 electric central 2 attached Finished 1 0 other 1 $217,203 31
45202 $562,500 3 5.0 3304 2005 2470.00 gas central 2 attached Finished 2 0 contemporary 1 $758,543 14
45214 $298,000 3 3.0 2162 2008 1991.00 gas central 2 attached Finished 3 0 traditional 1 $142,512 11
45202 $581,000 3 2.0 3208 1875 2250.00 gas central 0 none Partially Finished 2 0 historical 5 $736,503 144
45206 $357,700 4 3.0 1883 1916 3920.00 gas central 2 none Unfinished 2 0 traditional 1 $271,937 103
45202 $1,300,000 3 4.0 3582 1900 3005.00 gas central 2 attached Unfinished 3 0 other 1 $822,368 119
45202 $582,500 2 3.0 2226 1872 1385.00 gas central 2 attached Unfinished 3 0 historical 2 $511,053 147
45202 $300,000 4 4.0 2800 1952 7927.00 Baseboard central 1 attached Unfinished 3 0 modern 1 $642,833 67
45212 $80,000 2 1.0 1065 1926 2744.28 gas none 0 none Unfinished 2 0 traditional 0 $109,784 93
45202 $645,000 3 3.0 2000 2015 1002.00 gas central 2 attached none 3 0 loft 0 $459,167 4
45202 $475,000 2 3.0 2100 1881 1655.00 gas central 2 attached None 4 0 contemporary 0 $482,125 138
45202 $500,000 3 5.0 2291 2018 13233.12 gas central 1 attached None 4 0 contemporary 0 $525,975 1
45202 $530,000 3 4.0 2300 2018 13233.12 gas central 1 attached none 4 0 contemporary 0 $528,042 1
45208 $2,700,000 4 5.0 5080 1941 53579.00 Baseboard central 1 attached Finished 2 0 other 1 $1,115,483 78
45215 $419,000 4 3.0 2305 1988 20908.80 gas central 2 attached Finished 2 0 other 1 $236,263 31
45211 $148,000 3 2.0 1767 1926 8537.76 Baseboard wall units 2 detached Unfinished 1 0 traditional 1 $133,114 93
45206 $637,000 4 4.0 4748 1890 11935.44 gas central 1 none Partially Finished 2 0 victorian 6 $685,690 129
45202 $575,000 3 3.0 2810 1890 2178.00 gas central 1 attached Unfinished 3 0 traditional 0 $645,129 129
45202 $254,000 2 2.0 1008 1881 784.00 gas central 0 none Finished 3 0 historical 0 $231,420 138
45202 $270,000 3 3.0 2000 1869 6098.00 gas other 0 none Finished 3 0 other 1 $459,167 150
45208 $595,000 5 3.0 3190 1914 10018.00 gas central 3 detached Unfinished 3 0 traditional 3 $700,471 105
45208 $2,300,000 3 4.0 3253 1952 30492.00 Baseboard central 3 Carport Finished 1 1 other 1 $714,305 67
45202 $787,000 3 4.0 3506 2018 5140.00 gas central 2 attached none 3 0 contemporary 1 $804,919 1
45202 $1,890,000 3 3.0 3840 1960 27878.00 Baseboard central 2 attached Partially Finished 1 0 other 0 $881,600 59
45238 $88,000 4 2.0 1821 1948 10018.00 gas central 1 attached Finished 1 0 cape cod 1 $159,338 71
45244 $600,000 3 3.0 4147 2004 17424.00 gas central 3 attached Finished 2 0 other 1 $548,441 15
45208 $967,000 4 3.0 3908 2005 12196.80 gas central 2 attached Finished 2 0 TRANSITIONAL 1 $858,132 14
45206 $145,000 2 3.0 1989 1993 12196.80 gas central 2 attached Finished 2 0 other 0 $287,245 26
45238 $100,000 4 1.0 1394 1941 5227.00 Baseboard none 1 attached Partially Finished 1 0 cape cod 1 $121,975 78
45208 $645,000 4 4.0 3777 1990 8319.00 gas central 2 attached Finished 2 0 traditional 1 $829,366 29
45243 $568,500 4 3.0 2719 1948 48787.20 gas central 3 detached Finished 2 0 traditional 2 $564,193 71
45202 $125,000 2 2.0 1230 1870 4138.00 gas central 2 none Unfinished 2 0 historical 0 $282,388 149
45202 $371,000 4 2.0 2400 1887 3136.00 forced air central 0 none Partially Finished 3 0 other 2 $551,000 132
45244 $147,000 2 2.0 937 1921 27007.20 gas central 2 detached Unfinished 2 0 cape cod 0 $123,918 98
45243 $1,852,040 4 5.0 5172 2016 95832.00 gas central 1 attached Partially Finished 2 0 TRANSITIONAL 1 $1,073,190 3
45213 $189,370 4 2.0 1232 1930 8233.00 gas central 1 attached Finished 1 0 bungalow 0 $152,460 89
45202 $240,000 2 2.0 2105 1880 2308.00 gas central 2 detached Unfinished 2 0 historical 1 $483,273 139
45245 $184,900 3 3.0 1128 1970 20037.60 gas central 2 attached Finished 1 0 traditional 1 $131,412 49
45248 $244,000 3 2.0 1521 1985 17293.32 gas central 2 attached Partially Finished 1 0 other 1 $160,973 34
45255 $259,000 3 4.0 1678 1941 48917.88 gas central 2 detached Partially Finished 1 0 traditional 1 $210,030 78
45243 $1,230,000 4 5.0 6096 2013 40075.00 other central 3 attached Partially Finished 2 0 other 2 $1,264,920 6
45245 $170,000 4 3.0 1898 1974 14810.40 electric central 2 attached Finished 1 0 other 1 $221,117 45
45249 $467,500 4 3.0 3318 1987 39465.36 gas central 2 attached Finished 2 0 other 1 $479,174 32
45202 $435,000 3 3.0 2053 1920 1306.00 gas central 0 none Partially Finished 3 0 traditional 4 $471,335 99
45231 $94,000 2 2.0 792 1959 6011.28 gas central 2 Carport Partially Finished 1 0 other 0 $66,990 60
45202 $1,223,000 2 5.0 4295 2007 2090.00 gas central 2 attached none 4 0 contemporary 1 $986,060 12
45202 $550,000 3 3.0 1920 1986 5227.00 gas central 2 attached Finished 1 0 contemporary 1 $440,800 33
45202 $625,000 4 4.0 2727 2002 2265.00 Baseboard central 2 attached Partially Finished 2 0 TRANSITIONAL 1 $626,074 17
45230 $109,900 2 1.0 925 1954 5401.00 gas central 1 detached Partially Finished 1 0 other 0 $111,385 65
45227 $695,000 4 4.0 3254 1962 9234.00 gas central 1 detached Finished 2 0 colonial 2 $412,445 57
45230 $153,500 3 1.5 1548 1970 9539.00 Baseboard central 1 attached Partially Finished 2 0 split level 0 $186,405 49
45237 $55,000 2 1.0 1627 1936 7753.00 Baseboard none 1 attached Partially Finished 1 0 other 0 $116,737 83
45204 $66,500 5 2.0 1840 1890 7405.00 Baseboard none 1 detached Partially Finished 2 0 other 0 $141,373 129
45227 $525,000 4 4.0 2915 1951 8480.00 gas central 2 attached Finished 2 0 colonial 1 $369,476 68
45208 $949,000 4 4.0 4500 2013 12197.00 electric central 3 attached Finished 2 0 traditional 1 $988,125 6
45245 $275,000 4 3.0 2903 2002 13939.00 gas central 2 attached Unfinished 2 0 transitional 1 $338,200 17
45206 $490,924 4 3.0 5090 1865 12632.00 gas central 0 none Unfinished 3 0 historical 3 $735,081 154
45208 $280,000 2 1.0 1060 1921 4791.00 gas central 1 detached Partially Finished 1 0 bungalow 1 $232,758 98
45227 $313,000 3 2.0 1922 1948 4725.00 gas central 1 attached Finished 1 0 cape cod 1 $243,614 71
45230 $189,000 4 2.0 1385 1949 7579.00 electric, gas central 1 attached Unfinished 1 0 cape cod 1 $166,777 70
45255 $237,500 3 3.0 1261 1957 43996.00 electric, gas central 2 attached Finished 1 0 traditional 1 $157,835 62
45219 $169,530 5 3.0 1936 1912 6359.00 gas wall units 2 Carport Finished 2 0 traditional 1 $176,821 107
45209 $215,000 2 2.0 855 1940 5009.00 gas central 1 attached Partially Finished 1 0 bungalow 1 $169,005 79
45226 $627,000 3 4.0 2834 2012 2613.00 gas central 2 attached Finished 4 0 traditional 1 $546,490 7
45212 $135,000 2 1.0 1152 1925 2874.00 Baseboard central 1 detached Unfinished 2 0 traditional 1 $118,752 94
45233 $100,000 3 2.0 1053 1957 7405.00 gas central 1 attached Partially Finished 1 0 other 0 $114,163 62
45220 $560,000 4 3.0 3071 1970 27442.80 electric, gas central 2 attached Partially Finished 3 0 contemporary 1 $350,350 49
45229 $220,000 3 4.0 1719 1924 6250.00 gas central 1 attached Unfinished 2 0 other 1 $140,099 95
45229 $239,900 3 2.0 2080 1928 7971.00 gas central 2 detached Unfinished 3 0 traditional 1 $169,520 91
45229 $169,900 4 3.0 2372 1930 7753.00 gas central 2 attached Partially Finished 1 0 tudor 1 $193,318 89
45216 $189,000 4 3.0 2079 1882 14810.40 Baseboard central 1 detached Unfinished 2 0 traditional 0 $117,464 137
45216 $74,860 2 1.0 1119 1942 4356.00 Baseboard other 1 none Partially Finished 1 0 other 0 $63,224 77
45216 $101,000 4 2.0 1543 1945 4356.00 Baseboard central 1 attached Finished 1 0 other 0 $87,180 74
45216 $113,500 4 2.0 1772 1946 4830.00 gas central 1 attached Finished 1 0 cape cod 0 $100,118 73
45219 $189,000 3 2.0 1346 1916 2439.00 forced air other 1 none Finished 2 0 other 0 $122,935 103
45232 $52,000 3 2.0 960 1960 6098.00 electric other 1 attached Partially Finished 1 0 other 0 $59,120 59
45232 $184,000 4 2.0 2046 1909 8276.00 gas central 2 detached Unfinished 3 0 traditional 1 $126,000 110
45232 $139,900 3 2.0 1442 1925 6490.00 electric, gas central 1 attached Partially Finished 2 0 tudor 1 $88,803 94
45232 $120,000 2 2.0 832 1926 3746.00 gas central 1 none Partially Finished 1 0 traditional 0 $51,237 93
45219 $189,000 2 2.0 1234 1905 4500.00 Baseboard none 1 detached Partially Finished 2 0 other 0 $112,705 114
45219 $215,000 4 3.0 2355 1897 4704.00 gas central 0 none Partially Finished 2 0 traditional 1 $215,090 122
45243 $248,000 3 2.0 1496 1956 9600.00 gas central 1 attached Partially Finished 3 0 other 1 $310,420 63
45243 $493,000 5 4.0 3061 2014 11325.60 electric, gas central 2 attached Finished 2 0 traditional 1 $635,158 5

Summary Statistics

The data captured is from the sell dates of 2017-10-25 to 2019-11-22. There are 191 observations in the dataset that range over 35 zipcodes.

The following characteristics that were captured for each sold home are repeated below, and explained in the prior tab:

##  [1] "Address"         "zip"             "sold_price"     
##  [4] "sold_date"       "num_bed"         "num_bath"       
##  [7] "sqr_ft"          "year_built"      "lot_size_ft"    
## [10] "heating"         "cooling"         "parking_spaces" 
## [13] "garage"          "basement_status" "num_stories"    
## [16] "Pool?"           "type_and_style"  "Fireplace"      
## [19] "med_val"         "age"

Below are some summary statistics of the houses’ sales price and pysical size:

Summary Statistics Table of Home Prices and Size zillow (N = 191)
Final Sale Price   
   min 16000
   median (Q1, Q3) 197,250.00 (136,550.00, 450,125.00)
   mean (sd) 344,773.82 ± 404,522.37
   max 2700000
   Missing 0
Square Footage of Home   
   min 792
   median (Q1, Q3) 1,776.00 (1,321.00, 2,302.50)
   mean (sd) 2,043.58 ± 1,058.74
   max 7900
   Missing 0
Square Footage of Lot   
   min 784
   median (Q1, Q3) 7,535.00 (4,399.50, 11,325.60)
   mean (sd) 11,809.30 ± 17,006.72
   max 130680
   Missing 0

The average price of homes in Cincinnati is $344,774, but this price ranges from $16,000 to $2,700,000.

The average size of homes in Cincinnati is 2,043 square feet, which sit on an average lot size of 11,809 square feet. (This is approximately 0.2711043 acres.)

The same data but by zip code:

Home Sizes and Values by Zip Code
Zip Code Number Sold in ZIP Median Home Value Avg Closing Price Avg Square Footage Avg Lot Size (Sq Ft)
45202 26 $597,482 $593,154 2,602 5,694
45204 5 $136,671 $116,300 1,779 5,184
45205 7 $55,735 $63,057 1,346 23,330
45206 8 $392,326 $330,597 2,717 7,666
45208 20 $650,724 $867,958 2,963 16,720
45209 8 $296,302 $345,427 1,499 5,955
45211 6 $127,577 $143,833 1,694 10,186
45212 2 $114,268 $107,500 1,108 2,809
45213 3 $177,210 $251,423 1,432 7,826
45214 1 $142,512 $298,000 2,162 1,991
45215 1 $236,263 $419,000 2,305 20,909
45216 5 $100,807 $116,372 1,784 7,064
45217 1 $133,600 $205,000 1,920 11,326
45219 4 $156,888 $190,633 1,718 4,500
45220 3 $354,913 $453,333 3,111 41,934
45223 6 $128,145 $204,083 1,535 5,690
45224 3 $152,349 $195,050 1,910 14,562
45225 1 $144,292 $80,025 2,189 2,613
45226 4 $503,150 $543,625 2,609 4,660
45227 11 $229,026 $264,321 1,807 7,873
45229 3 $167,646 $209,933 2,057 7,325
45230 10 $173,027 $145,240 1,437 11,016
45231 1 $66,990 $94,000 792 6,011
45232 4 $81,290 $123,975 1,320 6,152
45233 3 $118,536 $107,000 1,093 7,623
45237 6 $126,400 $124,233 1,762 7,056
45238 9 $127,264 $101,367 1,454 7,952
45239 4 $149,791 $134,125 1,742 8,832
45240 1 $193,875 $173,655 2,250 9,191
45243 5 $769,576 $878,308 3,709 41,124
45244 2 $336,180 $373,500 2,542 22,216
45245 12 $179,226 $176,225 1,538 17,546
45248 1 $160,973 $244,000 1,521 17,293
45249 2 $348,189 $327,250 2,411 25,613
45255 3 $188,710 $232,167 1,508 37,505

Data Visualization

The center, shape, and spread of the data:

Shape

As shown in the histogram and boxplots below, our data is highly skewed to the right. This is intuitive, as all values must be nonnegative and therefore have a lower bound.

Spread

Below are some boxplots to see the spread of the data. It’s interesting to note that the lot size on which homes are built are rather standard/highly concentrated around 7,535 square feet, causing a few outliers of homes that are on large plots of land.

This bar graph shows the number of homes built throughout the years.

By Zip Code

Here we can see the average home price by zip code to identify the general price of homes by neighborhood.

Correlation

Here we can look for correlation between covariates by plotting a pairplot.

Modeling

The process of determining an appropriate model to represent home values in the Cincinnati is two-fold. First, we need to perform a residual analysis. Through this analysis, we can ensure that the full model - that is all the possible covariates along with the response variable - meets the following criteria for linear regression:

  • The relationship between the regressors and the response variable is approximately linear

  • Errors are independent

  • Errors are normally distributed

  • Error term has an equal/constant variance

If all of these assumptions are not met, variables in the model must be transformed and checked in a process called Model Adequacy Checking, which involves Transformation and Residual Analysis.


Part I: Residual Analysis

Let’s begin with a full model. We can create a dataset on which to fit a model, shown below. Note that we are removing Address, sold_date, year_built, and type_and_style from this dataset, as they are not covariates we want to include in our model. Address and sold_date are not generalizable, year_built is represented in the age variable, and type_and_style has too many categories and its contribution to the model may not be worth the work of including it.

zillow_slim <- zillow[,-c(1,4,8,17)]

We can start our modeling process by using all covariates available.


Full Model
zillow_fit_full <- lm(sold_price ~ ., data = zillow_slim)

Observations regarding our first model:

  • The residuals vs Fitted Values graph indicates a nonconstant variance.
  • The normal Q-Q plot indicates that the distribution has heavy tails, but otherwise is symmetrical. It is not normally distributed.
  • Observations 130 and 170 seem to be the two observations with the highest leverage, according to Residuals vs Leverage graph (Cook’s Distance). However I would not consider these observations to be outliers because they are not close to 1.
zillow_slim[c(130,170),]
##       zip sold_price num_bed num_bath sqr_ft lot_size_ft   heating
## 130 45208    2300000       3        4   3253       30492 Baseboard
## 170 45219     169530       5        3   1936        6359       gas
##        cooling parking_spaces  garage basement_status num_stories Pool?
## 130    central              3 Carport        Finished           1     1
## 170 wall units              2 Carport        Finished           2     0
##     Fireplace  med_val age
## 130         1 714304.6  67
## 170         1 176821.3 107


Response Variable Transformation - Box-Cox Procedure

Because we identified a nonconstant variance, we can utilize the box-cox procedure to determine the appropriate transformation for the response variable to linearize the model.

bc <-  MASS::boxcox(sold_price ~ . , data=zillow_slim)

(lambda <- bc$x[which.max(bc$y)])
## [1] 0.1010101
zillow_slim$sold_price2 <- (zillow_slim$sold_price ^ lambda - 1) / lambda
zillow_fit_bc <- lm(sold_price2 ~ . - sold_price, data = zillow_slim)

# obtain MSRes
( MSRes=summary(zillow_fit_bc)$sigma^2 )
## [1] 1.613653


Covariate Transformation

After correcting for the nonconstant variance, we can now work on improving the normality of the model. There were countless models I used to come to this final one. For a concise report, the final product is below.

zillow_fit <- lm(sold_price2 ~ zip + num_bed + num_bath + sqr_ft + lot_size_ft + heating + cooling + parking_spaces + garage + basement_status + age + I(age^2) + num_stories + `Pool?` + Fireplace + med_val + sqr_ft:num_bath, data = zillow_slim)

# obtain MSRes
( MSRes=summary(zillow_fit)$sigma^2 )
## [1] 1.498322

Create the final Dataset

# new dataset called zillow_transformed
zillow_transformed <- zillow_slim %>% dplyr::select(sold_price2, zip, num_bed, num_bath, sqr_ft, lot_size_ft, heating, cooling, parking_spaces, garage, basement_status, age, num_stories, `Pool?`, Fireplace, med_val)

Create the final full Working Model

zillow_fit_full <- lm(sold_price2 ~ zip + num_bed + num_bath + sqr_ft + lot_size_ft + heating + cooling + parking_spaces + garage + basement_status + age + I(age^2) + num_stories + `Pool?` + Fireplace + med_val + sqr_ft:num_bath, data = zillow_transformed)


Part II: Variable Selection

We have satisfied the linear regression assumptions by transforming the variables accordingly. Now we can move into the phase of selecting the variables from this full model to determine the optimal combination/selection of regressors to best model their relationship with Final Sold Price.


Forward Selection + Backward Elimination

When attempting to use the step function, I keep getting a message that says “attempting model selection on an essentially perfect fit is nonsense”… but when tested, we get a predicted R squared of around 50%. My hypothesis is that this is due to varying levels of zip code. Therefore I would be more beneficial to manually select variables through forward selection and backward elimination.

null= lm(sold_price2 ~ 1, data=zillow_transformed)  #This is why we use zillow_transformed, so it only includes the columns we want.
full= zillow_fit_full

add1(null, full, test="F")
add1(lm(sold_price2 ~ med_val, data = zillow_transformed), full, test="F")
add1(lm(sold_price2 ~ med_val + num_bath, data = zillow_transformed), full, test="F")
add1(lm(sold_price2 ~ med_val + num_bath + zip, data = zillow_transformed), full, test="F")
add1(lm(sold_price2 ~ med_val + num_bath + zip + sqr_ft, data = zillow_transformed), full, test="F")
add1(lm(sold_price2 ~ med_val + num_bath + zip + sqr_ft + num_bath:sqr_ft, data = zillow_transformed), full, test="F")
add1(lm(sold_price2 ~ med_val + num_bath + zip + sqr_ft + num_bath:sqr_ft + cooling, data = zillow_transformed), full, test="F")
add1(lm(sold_price2 ~ med_val + num_bath + zip + sqr_ft + num_bath:sqr_ft + cooling + heating, data = zillow_transformed), full, test="F")
add1(lm(sold_price2 ~ med_val + num_bath + zip + sqr_ft + num_bath:sqr_ft + cooling + heating + `Pool?`, data = zillow_transformed), full, test="F")
add1(lm(sold_price2 ~ med_val + num_bath + zip + sqr_ft + num_bath:sqr_ft + cooling + heating + `Pool?` + age + I(age^2), data = zillow_transformed), full, test="F")

drop1(lm(sold_price2 ~ med_val + num_bath + zip + sqr_ft + num_bath:sqr_ft + cooling + heating + `Pool?` + age + I(age^2), data = zillow_transformed), test="F")
#age doesn't seem to add much to the model, because it tells us to drop it

We found the best model to be:

test_fit <- lm(sold_price2 ~ med_val + num_bath + num_bath:sqr_ft + sqr_ft + cooling + `Pool?` + zip + heating, data = zillow_transformed)
## 
## Call:
## lm(formula = sold_price2 ~ med_val + num_bath + num_bath:sqr_ft + 
##     sqr_ft + cooling + `Pool?` + zip + heating, data = zillow_transformed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2112 -0.6239  0.0000  0.5206  4.0747 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.020e+01  9.495e-01  21.271  < 2e-16 ***
## med_val               5.929e-06  2.597e-06   2.283 0.023937 *  
## num_bath              1.026e+00  1.931e-01   5.315 4.04e-07 ***
## sqr_ft                1.193e-03  4.326e-04   2.758 0.006578 ** 
## coolingnone          -1.595e+00  4.805e-01  -3.320 0.001145 ** 
## coolingother         -6.881e-01  7.123e-01  -0.966 0.335648    
## coolingwall units     2.922e-02  5.334e-01   0.055 0.956387    
## `Pool?`               1.579e+00  7.478e-01   2.111 0.036487 *  
## zip45204             -1.165e+00  1.037e+00  -1.123 0.263143    
## zip45205             -2.817e+00  1.039e+00  -2.711 0.007527 ** 
## zip45206             -8.035e-01  7.573e-01  -1.061 0.290436    
## zip45208              7.523e-01  3.860e-01   1.949 0.053230 .  
## zip45209              1.065e+00  6.101e-01   1.746 0.082904 .  
## zip45211             -6.103e-01  1.012e+00  -0.603 0.547474    
## zip45212              7.697e-02  1.154e+00   0.067 0.946904    
## zip45213              1.017e+00  9.455e-01   1.075 0.284016    
## zip45214              1.005e+00  1.561e+00   0.644 0.520934    
## zip45215              1.622e+00  1.456e+00   1.114 0.267130    
## zip45216             -1.709e+00  1.138e+00  -1.501 0.135489    
## zip45217              4.110e-01  1.515e+00   0.271 0.786595    
## zip45219              1.555e-01  1.038e+00   0.150 0.881138    
## zip45220             -4.337e-01  1.173e+00  -0.370 0.712059    
## zip45223              4.723e-01  9.371e-01   0.504 0.615041    
## zip45224             -3.266e-01  1.116e+00  -0.293 0.770153    
## zip45225             -2.957e+00  1.619e+00  -1.826 0.069932 .  
## zip45226              2.901e-01  6.997e-01   0.415 0.679071    
## zip45227              1.463e-01  7.806e-01   0.187 0.851605    
## zip45229             -3.877e-01  1.124e+00  -0.345 0.730685    
## zip45230             -7.899e-01  7.889e-01  -1.001 0.318390    
## zip45231             -9.818e-01  1.393e+00  -0.705 0.482012    
## zip45232             -4.447e-01  1.062e+00  -0.419 0.676047    
## zip45233             -8.579e-01  9.765e-01  -0.879 0.381141    
## zip45237             -9.627e-01  1.029e+00  -0.936 0.350980    
## zip45238             -1.444e+00  8.731e-01  -1.654 0.100402    
## zip45239             -1.128e+00  1.011e+00  -1.116 0.266187    
## zip45240             -1.217e+00  1.504e+00  -0.809 0.419718    
## zip45243              4.189e-01  6.546e-01   0.640 0.523228    
## zip45244              1.583e-01  1.096e+00   0.144 0.885335    
## zip45245             -1.819e-01  7.614e-01  -0.239 0.811536    
## zip45248              1.133e+00  1.398e+00   0.811 0.418891    
## zip45249              2.254e-01  1.065e+00   0.212 0.832631    
## zip45255              5.032e-02  9.507e-01   0.053 0.957860    
## heatingelectric      -1.732e+00  5.575e-01  -3.106 0.002287 ** 
## heatingelectric, gas -3.797e-01  5.825e-01  -0.652 0.515479    
## heatingforced air    -3.490e-02  7.199e-01  -0.048 0.961405    
## heatinggas           -6.442e-01  3.162e-01  -2.037 0.043485 *  
## heatingoil           -3.847e-01  1.314e+00  -0.293 0.770160    
## heatingother         -2.087e+00  1.004e+00  -2.079 0.039429 *  
## num_bath:sqr_ft      -2.457e-04  6.586e-05  -3.731 0.000275 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.205 on 142 degrees of freedom
## Multiple R-squared:  0.8869, Adjusted R-squared:  0.8486 
## F-statistic:  23.2 on 48 and 142 DF,  p-value: < 2.2e-16


Bin Categorical Covariates

However, this model is not easy to interpret. There are too many zip code categories that make our model less clear. A solution to this is manually creating dummy variables for the zip code variable and group the nonsignificant zip codes into a single “other zipcode” column. Below, we only kept the one zip code that was highly significant: 45205.

zillow_transformed <- zillow_transformed %>% 
  mutate(zip_group = case_when(
    zip == 45205 ~ "45205",
    TRUE         ~ "other"
))
zillow_transformed$zip_group <- as.factor(zillow_transformed$zip_group)
zillow_transformed$zip_group <- relevel(x = zillow_transformed$zip_group, ref="other")

We can do the same for heating and cooling types and see how it effects the model.

zillow_transformed <- zillow_transformed %>% 
  mutate(heating_group = case_when(
    heating == "electric" ~ "electric",
    TRUE             ~ "other"
))
zillow_transformed$heating_group <- as.factor(zillow_transformed$heating_group)
zillow_transformed$heating_group <- relevel(x = zillow_transformed$heating_group, ref="other")
zillow_transformed <- zillow_transformed %>% 
  mutate(cooling_group = case_when(
    cooling == "none" ~ "none",
    TRUE             ~ "yes"
))
zillow_transformed$cooling_group <- as.factor(zillow_transformed$cooling_group)
zillow_transformed$cooling_group <- relevel(x = zillow_transformed$cooling_group, ref="yes")


Updated Model

Below is the updated model with zip code, heating, and cooling in groups.

#test_fit <- lm(sold_price2 ~ zip_group + med_val + num_bath + garage, data = zillow_transformed)
test_fit <- lm(sold_price2 ~ med_val + num_bath + num_bath:sqr_ft + sqr_ft + cooling_group + `Pool?` + zip_group + heating_group, data = zillow_transformed)
## 
## Call:
## lm(formula = sold_price2 ~ med_val + num_bath + num_bath:sqr_ft + 
##     sqr_ft + cooling_group + `Pool?` + zip_group + heating_group, 
##     data = zillow_transformed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6399 -0.7158  0.0042  0.6728  5.1095 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.947e+01  4.759e-01  40.913  < 2e-16 ***
## med_val                8.996e-06  8.183e-07  10.994  < 2e-16 ***
## num_bath               1.084e+00  1.667e-01   6.502 7.40e-10 ***
## sqr_ft                 6.630e-04  2.849e-04   2.327 0.021072 *  
## cooling_groupnone     -1.441e+00  4.273e-01  -3.373 0.000908 ***
## `Pool?`                1.776e+00  6.566e-01   2.705 0.007490 ** 
## zip_group45205        -2.073e+00  5.122e-01  -4.048 7.63e-05 ***
## heating_groupelectric -1.010e+00  4.427e-01  -2.282 0.023624 *  
## num_bath:sqr_ft       -2.539e-04  5.411e-05  -4.691 5.31e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.265 on 182 degrees of freedom
## Multiple R-squared:  0.8402, Adjusted R-squared:  0.8332 
## F-statistic: 119.6 on 8 and 182 DF,  p-value: < 2.2e-16

All regressors are significant. The Adjusted R-squared is greater than .80, which indicates a strong model. It is also significant, with a p-value less than .01.


Ridge Regression

We know for a fact that there exists multicollinearity in this model. Below, the VIF’s for sqr_ft and its interaction term num_bath:sqr_ft show undeniable multicollinearity.

vif(test_fit)
##         med_val        num_bath          sqr_ft   cooling_group 
##        5.795886        4.399345       10.797945        1.080441 
##         `Pool?`       zip_group   heating_group num_bath:sqr_ft 
##        1.054497        1.105089        1.049527       14.629011

We cannot perform Ridge Regression to fix sqr_ft and num_bath:sqr_ft because our model contains more than just numeric variables. While we ideally would keep these in our model, the VIFs are too great to do so. Fortunately, square footage is a direct contributer to med_val. Let us remove them and test the performance of this final model.

test_fit <- lm(sold_price2 ~ med_val + num_bath + cooling_group + `Pool?` + zip_group + heating_group, data = zillow_transformed)
summary(test_fit)
## 
## Call:
## lm(formula = sold_price2 ~ med_val + num_bath + cooling_group + 
##     `Pool?` + zip_group + heating_group, data = zillow_transformed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8886 -0.7910  0.0647  0.7590  4.8564 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.112e+01  2.682e-01  78.736  < 2e-16 ***
## med_val                7.586e-06  5.335e-07  14.217  < 2e-16 ***
## num_bath               5.401e-01  1.289e-01   4.190 4.32e-05 ***
## cooling_groupnone     -1.688e+00  4.474e-01  -3.774 0.000217 ***
## `Pool?`                1.741e+00  6.932e-01   2.512 0.012861 *  
## zip_group45205        -2.493e+00  5.305e-01  -4.699 5.10e-06 ***
## heating_groupelectric -9.717e-01  4.667e-01  -2.082 0.038733 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.337 on 184 degrees of freedom
## Multiple R-squared:  0.8197, Adjusted R-squared:  0.8139 
## F-statistic: 139.5 on 6 and 184 DF,  p-value: < 2.2e-16


Part III: Model Validation

The Final Model to be Validated:

\[ sold.price2 = 21.12 -2.49*zip.group.45205 + 0.0000076*med.val +0.54*num.bath -1.69*cooling.group.none + 1.74*pool-0.97*heating.group.electric \]


Fit against New Test Data

Fourteen new observations were collected to test our model. Below we read in that test set and prepare it to be used in the model.


Prepare Test Data
zillow_test <- read_csv("zillow_data_new.csv")

We need to calculate the median value of homes in the zipcodes included in the test data, as well as prepare the dataset for prediction in the model.

home_val_f <- filter(home_val, RegionName %in% unique(zillow_test$zip))
home_val_avgmed <- data.frame(ID=home_val_f[,2], Means=rowMeans(home_val_f[,17:28]))
home_val_avgmed$Means <- currency(home_val_avgmed$Means, digits = 2L)
zillow_test$zip <- as.factor(zillow_test$zip)
home_val_avgmed$RegionName <- as.factor(home_val_avgmed$RegionName)

zillow_test <- left_join(x = zillow_test, y = home_val_avgmed, by = c("zip" = "RegionName"))

# Convert the Median value per square foot to median value for similar homes. Also trasnform the sold price.
zillow2 <- zillow_test %>% mutate(med_val = sqr_ft * Means,
                                  sold_price2 = (sold_price ^ lambda - 1) / lambda)

We can also remove columns that are not needed for the model.

# Remove columns not used in model
zillow_test <- zillow2[,-c(1:2,4:5,9,18,21)]
zillow_test <- zillow_test %>% dplyr::select(sold_price2, everything())

Finally, we need to create the category dummy variables for zip code and change the garage levels.

# Add zip level to test data
zillow_test <- zillow_test %>% 
  mutate(zip_group = case_when(
    zip == 45205 ~ "45205",
    TRUE         ~ "other"
))
zillow_test$zip_group <- as.factor(zillow_test$zip_group)
zillow_test$zip_group <- relevel(x = zillow_test$zip_group, ref="other")
zillow_test <- zillow_test %>% 
  mutate(heating_group = case_when(
    heating == "electric" ~ "electric",
    TRUE             ~ "other"
))
zillow_test$heating_group <- as.factor(zillow_test$heating_group)
zillow_test$heating_group <- relevel(x = zillow_test$heating_group, ref="other")
zillow_test <- zillow_test %>% 
  mutate(cooling_group = case_when(
    cooling == "none" ~ "none",
    TRUE             ~ "yes"
))
zillow_test$cooling_group <- as.factor(zillow_test$cooling_group)
zillow_test$cooling_group <- relevel(x = zillow_test$cooling_group, ref="yes")


Run the Test Data against the model

We run the test data into the model and get the following predictions and confidence bands:

Actual Price Predicted Price Lower Estimate Upper Estimate
$149,900.00 $337,414.13 $304,452.31 $373,549.74
$75,000.00 $119,854.92 $107,244.13 $133,783.34
$267,500.00 $285,695.60 $251,870.64 $323,550.24
$895,000.00 $722,723.48 $658,902.47 $792,048.37
$130,000.00 $155,121.45 $143,798.75 $167,239.37
$205,000.00 $155,323.09 $143,996.43 $167,444.44
$285,000.00 $330,644.35 $295,956.49 $368,944.85
$278,000.00 $230,973.95 $214,569.62 $248,497.17
$86,000.00 $135,734.65 $121,179.10 $151,843.32
$235,000.00 $174,866.76 $133,859.02 $226,838.21
$685,000.00 $799,558.79 $724,281.96 $881,796.71
$208,000.00 $167,423.04 $155,777.09 $179,845.86
$195,000.00 $176,839.60 $132,792.23 $233,607.95
$124,900.00 $140,698.09 $129,563.50 $152,685.56
$239,000.00 $177,530.01 $165,481.92 $190,360.96
$335,000.00 $276,666.69 $259,404.91 $294,954.26
$535,000.00 $561,683.05 $516,304.10 $610,616.27
$118,500.00 $150,735.36 $139,489.06 $162,790.26
Determine Performance

Calculate the mean squared prediction error.

MSPE = sum( (zillow_test$sold_price2 - prediction[,1])^2 ) / dim(zillow_test)[1]
MSPE
## [1] 1.079118

Calculate the PRESS statistic.

PRESS = sum( (zillow_test$sold_price2 - prediction[,1])^2 )
PRESS
## [1] 19.42413

Calculate the predicted R-squared.

pred_Rsq = 1 - PRESS/sum((zillow_test$sold_price2 - mean(zillow_test$sold_price2))^2)
pred_Rsq
## [1] 0.7938061

Compare the predicted R-squared with the model’s R-squared and Adjusted R-squared to see how comparable they are.

summary(test_fit)$r.squared
## [1] 0.8197372
summary(test_fit)$adj.r
## [1] 0.813859

There are many other validation techniques and comparison charts that cannot be made automatically due to the nature of the categorical variables. The statistics above for our final model are satisfactory for our purposes.


Final Model + Conclusion

Again, the final model chosen to predict the price of a home in Cincinnati is below:

summary(test_fit)
## 
## Call:
## lm(formula = sold_price2 ~ med_val + num_bath + cooling_group + 
##     `Pool?` + zip_group + heating_group, data = zillow_transformed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8886 -0.7910  0.0647  0.7590  4.8564 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.112e+01  2.682e-01  78.736  < 2e-16 ***
## med_val                7.586e-06  5.335e-07  14.217  < 2e-16 ***
## num_bath               5.401e-01  1.289e-01   4.190 4.32e-05 ***
## cooling_groupnone     -1.688e+00  4.474e-01  -3.774 0.000217 ***
## `Pool?`                1.741e+00  6.932e-01   2.512 0.012861 *  
## zip_group45205        -2.493e+00  5.305e-01  -4.699 5.10e-06 ***
## heating_groupelectric -9.717e-01  4.667e-01  -2.082 0.038733 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.337 on 184 degrees of freedom
## Multiple R-squared:  0.8197, Adjusted R-squared:  0.8139 
## F-statistic: 139.5 on 6 and 184 DF,  p-value: < 2.2e-16

We hope this helps you gain a general expectation of the price you should be willing to pay given the type of home and amenities you are looking for. Please reach out with your dream ideas and we can help you determine must-haves vs would-be-nice to find a home solution that meets your budget.

We wish you the best of luck on your house-hunting journey!



disclaimers

With the housing market constantly fluctuating, this analysis should not be extrapolated into 2020 without re-validating the model on new data and median housing prices.

The number of observations used in this analysis is limited and could affect the accuracy of this model