Nathan Chan
2023-09-13
This data set includes the following variables:
id: unique IDdate: date of the saleprice: selling price of the housebedrooms: number of bedroomsbathrooms: number of bathroomsliving: size of the interior living space (in square
feet)lot: size of the lot (in square feet)floors: number of floorswaterfront: is the housing located on the waterfront?
(yes or no)view: a numeric rating of 1 to 5 for the
quality of the view (higher = better)condition: the condition of the house (neutral, good,
or great)above: size of the interior living space above ground
level (in square feet)basement: size of the interior living space below
ground level (in square feet)year: original year of constructionzipcode: zip codeliving15: average size of the interior living space for
the 15 nearest neighbors (in square feet)lot15: average size of the lot for the 15 nearest
neighbors (in square feet)renovated: was the house renovated in the previous 25
years? (yes or no)library("car")
library("ggmap")
library("ggplot2")
library("mapdata")
library("maps")
library("dplyr")
library("ggthemes")
library("knitr")
library("kableExtra")
library("cowplot")King County is a prominent and populous county located in the Pacific Northwest region of the United States. It is situated in the western part of the state and encompasses the city of Seattle, the largest city in Washington. The county is also known for its stunning natural beauty, including Puget Sound, Cascade Range, and Mount Rainier. On this map, King County is highlighted in purple.
Let’s take a look at some different scatterplots that relate some of our variables to price, so we can get a high level understanding of some of the potential relationships we may have here.
I think that it is interesting that the number of bathrooms is not as clear cut as the number of bedrooms. This may indicate some half bathrooms and even quarter bathrooms (which I didn’t know was even a thing!).
I think that it is interesting that the number of bathrooms is not as clear cut as the number of bedrooms. This may indicate some half bathrooms and even quarter bathrooms (which I didn’t know was even a thing!). When it comes to the views and presence of a waterfront, it appears that a higher rated view and being on a waterfront raises the price ceiling.
The above variable also aligns with the “living” variable given that not many of these housing units have basements and that they could be smaller apartments. What is surprising is that the condition of the housing unit as well as if the housing unit has been renovated appears to not make a difference in price at first. I think this may be because a lot of the buildings in this county are much newer. This is something we will check on later.
Something that I want to look at are the average home prices within each zipcode, as real estate markets are highly localized. Different zipcodes within a city or region can exhibit varying property values due to factors like neighborhood amenities, school quality, proximity to job centers, and overall desirability.
## # A tibble: 6 × 2
## Zipcode `Average House Price`
## <int> <dbl>
## 1 98039 3640000
## 2 98040 2290000
## 3 98004 2112143.
## 4 98033 2002150
## 5 98008 1689758.
## 6 98105 1598429.
## # A tibble: 6 × 2
## Zipcode `Average House Price`
## <int> <dbl>
## 1 98032 283500
## 2 98003 281071.
## 3 98188 266362.
## 4 98001 239075
## 5 98002 232151.
## 6 98168 231983.
Taking a look at the top 6 most expensive zipcodes in this dataset, we see the areas: Medina, Mercer Island, and Kirkland. All of these areas have a high cost of living, with Medina even having a cost of living that is 120% greater than the national average! They all have desirable qualities such as being near water and having a close proximity to Seattle for those who want easy access to the city. On the other hand the 6 least expensive zipcodes in this dataset host cities such as Tukwila, Burien, and SeaTac where these cities are actually slightly below the national average cost of living. When looking at location, these cities are not near the water as well as being much further from Seattle which could explain the urban effect not being as prominent as the top 6.
I wanted to take a closer look at some of the frequencies we see for some of our variables, so I constructed a few histograms:
Some takeaways I have from these histograms is that the houses in this dataset are fairly evenly distributed amongst the zipcodes which could indicate this being an accurate representation. Also, many of the houses are built fairly similarily with the lot sizes being much smaller and the living spaces taking up a large portion of the lot. This indicates that there is not much space for a backyard or frontyard and usage of parks is probably standard which could lead to an increase cost of living due to taxes for upkeep of public spaces. While speaking about lot sizes, it also noted that there are a few outliers here that have a square footage of around half a million. The price of houses in Washington are also significantly more expensive than my hometown as majority of the homes are in the 7 figure range.
I am thinking about tossing out the basement variable entirely after also doing a quick search about houses not having basements on the West Coast (which is somewhat unusual coming from a Midwest perspective). Nonetheless, just to confirm I wanted to look at its relationship with price.
As seen here, there is very little correlation between the two due to there actually being no basement in many homes, as stated before. I think we can rule this variable out.
It appears that price and the living room size of the nearest 15 neighbors have a little bit of a quadratic relationship which we will account for in the model.
Also noted earlier, the histogram of lot size was extremely right skewed. I conducted a log transformation to make the distribution appear more normal. This will also be accounted for in the model.
It appears that there is a positive relationship with the number of bathrooms and price of a house, but for the bedrooms it appears to not have that much of a relationship, which makes me wonder if there is an interaction between these two. Why would you want to have a house with 6 bedrooms when you only have 2 bathrooms?
##
## Call:
## lm(formula = price ~ bedrooms * bathrooms, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2400317 -385775 -120061 231973 4643665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 381317 214389 1.779 0.07585 .
## bedrooms -269166 61446 -4.380 1.42e-05 ***
## bathrooms 247263 91906 2.690 0.00735 **
## bedrooms:bathrooms 100807 22258 4.529 7.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 627100 on 556 degrees of freedom
## Multiple R-squared: 0.4339, Adjusted R-squared: 0.4309
## F-statistic: 142.1 on 3 and 556 DF, p-value: < 2.2e-16
The summary shows that there is a possible interaction between number of bedrooms and bathrooms, as the statistically significant demonstrates as proof.
## # A tibble: 16 × 2
## zipcode avg_price_no_water_over_million
## <int> <dbl>
## 1 98006 1650000
## 2 98004 1527500
## 3 98027 1520000
## 4 98105 1490000
## 5 98144 1400000
## 6 98033 1386667.
## 7 98040 1320000
## 8 98112 1294000
## 9 98109 1260000
## 10 98177 1200000
## 11 98199 1180000
## 12 98053 1160000
## 13 98074 1130000
## 14 98103 1130000
## 15 98117 1080000
## 16 98005 1040000
As seen earlier with the zipcodes, the zipcodes where there was a possibility for owning a house on a waterfront were much more expensive than ones that weren’t. This boxplot also confirms that statement due to the median price being much higher for homes on a waterfront.
Renovations appear to not make a difference in price. Maybe the houses are relatively new. Let’s check.
Looking from the 1950s and onwards, it appears that there has been an increase in construction. However, it appears that houses in genereal haven’t been renovated as seen with the homes built before 1950.
It appears that the view rating has a slightly positive correlation with the price of homes. I wanted to group the zipcodes into two different areas: Downtown Seattle and Other. I made a new column called Zipcode Area
## id date price bedrooms bathrooms living lot floors
## 1 65000085 7/8/2014 430000 3 2.00 1550 6039 1.0
## 2 87000283 11/18/2014 359950 3 2.50 1980 7800 2.0
## 3 98030660 3/11/2015 815000 4 2.50 3880 7208 2.0
## 4 109210280 11/11/2014 220000 4 2.25 1950 7280 2.0
## 5 121029034 6/24/2014 549000 2 1.00 2034 13392 1.0
## 6 121039042 3/13/2015 425000 3 2.75 3610 107386 1.5
## waterfront view condition above basement year zipcode living15 lot15
## 1 no 1 great 830 721 1942 98126 1330 6042
## 2 no 1 neutral 1980 1 1999 98055 1700 20580
## 3 no 1 neutral 3880 1 2006 98075 3280 7221
## 4 no 1 good 1950 1 1979 98023 1910 7280
## 5 yes 5 great 1159 876 1947 98070 1156 15961
## 6 yes 4 neutral 3130 481 1918 98023 2630 42126
## renovated zipcodeArea
## 1 no Downtown Seattle
## 2 no Other
## 3 no Other
## 4 no Other
## 5 no Other
## 6 no Other
It looks like there is a small difference between Downtown Seattle and other areas when looking at lot size. There are no observations in Downtown Seattle that are above 100,000 square feet. This could be most likely due to construction guidelines.
After some exploratory data analysis, I am ready to construct a model:
model1 <-lm(price ~ living + waterfront + zipcodeArea + log(lot) + I(living15^2) + year + bathrooms*bedrooms, data = housing3)
summary(model1)##
## Call:
## lm(formula = price ~ living + waterfront + zipcodeArea + log(lot) +
## I(living15^2) + year + bathrooms * bedrooms, data = housing3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1240980 -191427 2325 178602 2179257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.595e+06 1.406e+06 3.978 7.86e-05 ***
## living 4.096e+02 2.654e+01 15.433 < 2e-16 ***
## waterfrontyes 5.792e+05 4.303e+04 13.462 < 2e-16 ***
## zipcodeAreaOther -1.018e+05 3.993e+04 -2.550 0.01105 *
## log(lot) -1.008e+05 2.000e+04 -5.040 6.32e-07 ***
## I(living15^2) 3.276e-02 5.950e-03 5.505 5.66e-08 ***
## year -2.290e+03 7.110e+02 -3.221 0.00135 **
## bathrooms -1.714e+05 6.077e+04 -2.820 0.00497 **
## bedrooms -2.151e+05 3.754e+04 -5.728 1.67e-08 ***
## bathrooms:bedrooms 6.845e+04 1.359e+04 5.037 6.41e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369600 on 550 degrees of freedom
## Multiple R-squared: 0.8055, Adjusted R-squared: 0.8023
## F-statistic: 253.1 on 9 and 550 DF, p-value: < 2.2e-16
An adjusted R-squared value of 0.8023 (or 80.23%) suggests that approximately 80.23% of the variability in the dependent variable is explained by the independent variables included in the regression model. In other words, the model accounts for about 80.23% of the variance in the data. Not bad. The p-value is also very small (2.2e16) which indicates statistical significance.
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## living waterfront zipcodeArea log(lot)
## 4.246220 1.554670 1.441908 1.493667
## I(living15^2) year bathrooms bedrooms
## 2.323150 1.707737 11.433322 5.058192
## bathrooms:bedrooms
## 18.510285
Using the vif function, I can see that there is limited multicollinearity except the variables regarding bathrooms and bedrooms. Logically this makes sense, as a home with more bedrooms is more likely to have more bathrooms to match. Let’s see if removing the interaction or one of the variables makes a better model.
##
## Call:
## lm(formula = price ~ living + waterfront + zipcodeArea + log(lot) +
## I(living15^2) + year + bathrooms + bedrooms, data = housing3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1340617 -185399 4878 174594 2214697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.315e+06 1.430e+06 4.417 1.20e-05 ***
## living 4.274e+02 2.688e+01 15.898 < 2e-16 ***
## waterfrontyes 5.748e+05 4.396e+04 13.077 < 2e-16 ***
## zipcodeAreaOther -1.006e+05 4.080e+04 -2.466 0.01395 *
## log(lot) -1.016e+05 2.044e+04 -4.972 8.86e-07 ***
## I(living15^2) 2.858e-02 6.021e-03 4.747 2.63e-06 ***
## year -2.954e+03 7.139e+02 -4.138 4.04e-05 ***
## bathrooms 9.242e+04 3.149e+04 2.935 0.00348 **
## bedrooms -5.808e+04 2.140e+04 -2.715 0.00684 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 377700 on 551 degrees of freedom
## Multiple R-squared: 0.7966, Adjusted R-squared: 0.7936
## F-statistic: 269.7 on 8 and 551 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = price ~ living + waterfront + zipcodeArea + log(lot) +
## I(living15^2) + year + bathrooms, data = housing3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1305098 -193209 5474 179435 2275013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.820e+06 1.426e+06 4.081 5.14e-05 ***
## living 4.112e+02 2.636e+01 15.597 < 2e-16 ***
## waterfrontyes 6.144e+05 4.171e+04 14.730 < 2e-16 ***
## zipcodeAreaOther -9.606e+04 4.100e+04 -2.343 0.019484 *
## log(lot) -1.029e+05 2.055e+04 -5.009 7.39e-07 ***
## I(living15^2) 2.808e-02 6.053e-03 4.639 4.38e-06 ***
## year -2.757e+03 7.143e+02 -3.860 0.000127 ***
## bathrooms 7.051e+04 3.061e+04 2.303 0.021651 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 379800 on 552 degrees of freedom
## Multiple R-squared: 0.7938, Adjusted R-squared: 0.7912
## F-statistic: 303.6 on 7 and 552 DF, p-value: < 2.2e-16
Looking at the adjusted R-squared value we see for the removal of the interaction, it is 0.7936 and for removal of the bedrooms variable it is 0.7912. Both are slightly lower than the original model, but nonetheless lower. I am going to make the decision to keep the interaction in the model as it has a high adjusted R-squared, and I believe it shows a difference in price with number of bedrooms that would not have been uncovered before.