Part 1 Examination of the Variables

houses <- read_csv("/Users/oliviacochran/Desktop/Grad School/Spring 2021/Strategy Analytics II/Problem Statement/b Excel Copy of Group 2 Regression Dataset.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   date = col_datetime(format = "")
## )
## ℹ Use `spec()` for the full column specifications.
# number of bedrooms
ggplot(houses, aes(bedrooms, price)) + geom_point()

houses.1 <-
  filter(houses, bedrooms < 15)
ggplot(houses.1, aes(bedrooms, price)) + geom_point()

ggplot(houses.1, aes(bedrooms)) + geom_histogram(binwidth = .5) 

After running the first ggplot, we found that there was an extreme outlier of 30+ bedrooms, so in our second graph, we removed it and condensed our data to only include 15 bedrooms or less.

The histogram displays that the most common number of bedrooms in our data is 3 bedrooms, followed by 4 bedrooms.

# number of bathrooms
ggplot(houses, aes(bathrooms, price)) + geom_point()

houses.2 <- filter(houses.1, bathrooms < 7)
ggplot(houses.2, aes(bathrooms, price)) + geom_point()

ggplot(houses.2, aes(bathrooms)) + geom_histogram(binwidth = .2) 

After evaluating the number of bathrooms scatterplot, we decided to remove any data points where the number of bathrooms was greater than 7. This eliminated a number of data points.

The histogram indicates that the most common number of bathrooms in a house is 2.5 (i.e. a full bath and a half bath)

# square feet (living)

ggplot(houses.2, aes(sqft_living, price)) + geom_point()

houses.3 <- filter(houses.2, sqft_living < 15000)
ggplot(houses.3, aes(sqft_living, price)) + geom_point() # remove outlier sqft > 15000

ggplot(houses.3, aes(sqft_living)) + geom_histogram(binwidth = 1) 

Data points where the sqft_living was greater than 15,000 were removed after looking at the original scatterplot. The histogram displays that the majority of homes have a square footage of between 750 and 2500.

An Examination of the Correlations of all Continuous Variables

library(corrplot)
## corrplot 0.84 loaded
houses2 <- select(houses.3, price, bedrooms, bathrooms, sqft_living, sqft_lot, sqft_above, sqft_basement, sqft_living15, sqft_lot15)
corrplot(cor(houses2))

Based on the correlation plot, it appears that sqft_living, sqft_above, and sqft_living15 have the highest correlation to price, followed by bathrooms, bedrooms, and sqft_basement.

cor(houses2)
##                    price   bedrooms  bathrooms sqft_living   sqft_lot
## price         1.00000000 0.31536685 0.52195073   0.6998023 0.08907970
## bedrooms      0.31536685 1.00000000 0.52727931   0.5917330 0.03155474
## bathrooms     0.52195073 0.52727931 1.00000000   0.7527634 0.08559870
## sqft_living   0.69980235 0.59173297 0.75276342   1.0000000 0.17003716
## sqft_lot      0.08907970 0.03155474 0.08559870   0.1700372 1.00000000
## sqft_above    0.60243094 0.48930797 0.68254153   0.8753405 0.18144130
## sqft_basement 0.32028178 0.30832246 0.28008772   0.4306427 0.01237187
## sqft_living15 0.58909708 0.40223227 0.56855179   0.7580559 0.14342026
## sqft_lot15    0.08150328 0.02906542 0.08484357   0.1803249 0.71787253
##                sqft_above sqft_basement sqft_living15 sqft_lot15
## price          0.60243094    0.32028178     0.5890971 0.08150328
## bedrooms       0.48930797    0.30832246     0.4022323 0.02906542
## bathrooms      0.68254153    0.28008772     0.5685518 0.08484357
## sqft_living    0.87534047    0.43064270     0.7580559 0.18032494
## sqft_lot       0.18144130    0.01237187     0.1434203 0.71787253
## sqft_above     1.00000000   -0.05941699     0.7322469 0.19179385
## sqft_basement -0.05941699    1.00000000     0.1982335 0.01428746
## sqft_living15  0.73224693    0.19823350     1.0000000 0.18196101
## sqft_lot15     0.19179385    0.01428746     0.1819610 1.00000000

The correlation plot confirms that the following have the highest correlation to price: sqft_living (0.6998), sqft_above (0.6024), sqft_living15 (0.5891), bathrooms (0.5220), sqft_basement (0.3203), bedrooms (0.3154), sqft_lot (0.0891), sqft_lot15 (0.0815).

As discussed in our problem statement, we have chosen number of bedrooms, number of bathrooms, and sqft_living as our variables. Based on the correlation matrix, sqft_living has the highest correlation and the number of bedrooms and the number of bathrooms provide an additional level of context to understanding price.

Identification and Evaluation of Regression Models

library(caTools)
smp_siz = floor(0.75*nrow(houses2))
smp_siz
## [1] 16206
set.seed(1600)
train_ind = sample(seq_len(nrow(houses2)),size = smp_siz)

train = houses2[train_ind,]
test = houses2[-train_ind,]

fit.houses <- lm(price~bedrooms+bathrooms+sqft_living, train)
summary(fit.houses)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1100882  -144841   -23464   101417  3093905 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  86299.929   7945.569   10.86   <2e-16 ***
## bedrooms    -57665.989   2742.066  -21.03   <2e-16 ***
## bathrooms     9793.845   3981.816    2.46   0.0139 *  
## sqft_living    301.144      3.516   85.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 251700 on 16202 degrees of freedom
## Multiple R-squared:  0.502,  Adjusted R-squared:  0.5019 
## F-statistic:  5443 on 3 and 16202 DF,  p-value: < 2.2e-16

The first step is setting our train and test data, followed by creating the first regression model using the variables sqft_living, number of bedrooms, and number of bathrooms.

The F-Statistic of this model is 5443, which is significant as it is a large value, which therefore creates a p-value that is less than 0.05. The p-value for each variable in this regression is less than 0.05, indicating that all variables are significant in this model.

The parameter estimates make sense for the number of bathrooms and the sqft_living, as they are both positive (i.e. as bathrooms increase, price increases). However, the parameter estimate for bedrooms is negative, which indicates that as price increases, the number of bedrooms decreases. One hypothesis for this is that the data set includes both houses in the city and in the suburbs and houses in the city could be more expensive, but have less bedrooms.

fit.houses.df <- augment(fit.houses)
ggplot(fit.houses.df, aes(x=.fitted, y=.resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(x= "Fitted Values", y = "Residuals")

qqPlot(fit.houses, pch=16)

## [1]  6944 13810
ncvTest(fit.houses)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 16710, Df = 1, p = < 2.22e-16
library(nortest)
ad.test(fit.houses$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  fit.houses$residuals
## A = 335.13, p-value < 2.2e-16
# Anderson-Darling normality test

Above is an examination of the residual plot, along with tests for any indication of non-normality or non-constant variance, including the Anderson-Darling Normality Test. The residuals seems to be normal for the majority of the data points, before skewing off once the quantile reaches 2.

vif(fit.houses)
##    bedrooms   bathrooms sqft_living 
##    1.568025    2.380840    2.621093

Using the Variance Inflation Factor test, it was found that all VIFs were less than 10, indicating that multicollinearity is not present.

summary(fit.houses)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1100882  -144841   -23464   101417  3093905 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  86299.929   7945.569   10.86   <2e-16 ***
## bedrooms    -57665.989   2742.066  -21.03   <2e-16 ***
## bathrooms     9793.845   3981.816    2.46   0.0139 *  
## sqft_living    301.144      3.516   85.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 251700 on 16202 degrees of freedom
## Multiple R-squared:  0.502,  Adjusted R-squared:  0.5019 
## F-statistic:  5443 on 3 and 16202 DF,  p-value: < 2.2e-16

The adjusted r-squared of this regression is 0.5019. In general, the higher the r-squared value is, the better, so this value is relatively moderate.

Regression Model - Round Two

Based on the Residuals Plot and the negative parameter estimate for the bedrooms, it was decided to run a second regression model using bathrooms and sqft_living as the variables.

fit.houses.2 <- lm(price~bathrooms+sqft_living, train)
summary(fit.houses.2)
## 
## Call:
## lm(formula = price ~ bathrooms + sqft_living, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1016564  -147518   -24777   104907  3186857 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -27125.427   5913.514  -4.587 4.53e-06 ***
## bathrooms    -3644.869   3983.362  -0.915     0.36    
## sqft_living    276.062      3.353  82.344  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 255100 on 16203 degrees of freedom
## Multiple R-squared:  0.4884, Adjusted R-squared:  0.4883 
## F-statistic:  7733 on 2 and 16203 DF,  p-value: < 2.2e-16

In this model, the F-Statistic is 7733, which is significant as it is large.

The p-values in this regression are only significant for sqft_living as the value is less than 0.05. The p-value for bathrooms is 0.36 which is greater than 0.05.

The parameter estimate makes sense for sqft_living, as it is positive, but it negative for bathrooms, which doesn’t make sense.

fit.houses.2.df <- augment(fit.houses.2)
ggplot(fit.houses.2.df, aes(x=.fitted, y=.resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(x= "Fitted Values", y = "Residuals")

qqPlot(fit.houses.2, pch=16)

## [1]  6944 13810
ncvTest(fit.houses.2)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 16186.01, Df = 1, p = < 2.22e-16
library(nortest)
ad.test(fit.houses.2$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  fit.houses.2$residuals
## A = 319.57, p-value < 2.2e-16
# Anderson-Darling normality test

Similar to the first model, the regression appears normal until the quantile reaches 2, when it skews off. The other, formal tests show that it is normal.

vif(fit.houses.2)

The VIF is not above 10, therefore this is evidence of not having multicollinearity in this dataset.

summary(fit.houses.2)
## 
## Call:
## lm(formula = price ~ bathrooms + sqft_living, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1016564  -147518   -24777   104907  3186857 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -27125.427   5913.514  -4.587 4.53e-06 ***
## bathrooms    -3644.869   3983.362  -0.915     0.36    
## sqft_living    276.062      3.353  82.344  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 255100 on 16203 degrees of freedom
## Multiple R-squared:  0.4884, Adjusted R-squared:  0.4883 
## F-statistic:  7733 on 2 and 16203 DF,  p-value: < 2.2e-16

The adjusted r-squared of this model is 0.4883, which is lower than the first regression model. Based on this information, the first model is more accurate.

library(Metrics)
fit.houses.3 <- lm(price~bedrooms+bathrooms+sqft_living, test)
summary(fit.houses.3)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living, data = test)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1283271  -141296   -21602   103519  3993306 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  92605.003  14261.880   6.493 9.16e-11 ***
## bedrooms    -73491.531   4912.625 -14.960  < 2e-16 ***
## bathrooms     4588.200   7062.807   0.650    0.516    
## sqft_living    330.767      6.338  52.192  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 261400 on 5398 degrees of freedom
## Multiple R-squared:  0.5148, Adjusted R-squared:  0.5146 
## F-statistic:  1909 on 3 and 5398 DF,  p-value: < 2.2e-16
RMSE <- function(error) { sqrt(mean(error^2)) }
RMSE(fit.houses$residuals)
## [1] 251703.6
RMSE(fit.houses.3$residuals)
## [1] 261343.3

The RMSE is the standard deviation of the residuals. Calculating it for both the test data and the training data, the difference is not too drastic. In fact, the adjusted r-squared value for the test data is larger than the training data, indicating that it is a good fit.

Regression Model - Round Three

After the second iteration of our logic model, we thought that the issue of our model might be with the sqft_living variable. We decided to look at what our original model would look like if we used the sqft_lot variable (which includes the land that came with the purchase of the home) in place of the sqft_living variable.

fit.houses.4 <- lm(price~bathrooms+bedrooms+sqft_lot, train)
summary(fit.houses.4)
## 
## Call:
## lm(formula = price ~ bathrooms + bedrooms + sqft_lot, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1008496  -182229   -44313   110549  4158461 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.394e+04  9.449e+03  -2.533   0.0113 *  
## bathrooms    2.276e+05  3.673e+03  61.986  < 2e-16 ***
## bedrooms     2.250e+04  3.105e+03   7.248 4.43e-13 ***
## sqft_lot     3.989e-01  5.453e-02   7.315 2.70e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 302900 on 16202 degrees of freedom
## Multiple R-squared:  0.2789, Adjusted R-squared:  0.2787 
## F-statistic:  2089 on 3 and 16202 DF,  p-value: < 2.2e-16

The F-statistic of this model is 2089 which, although lower in comparison to our first two models, is still a significant value. The p-value for each variable is well below .05 with one of the values as close to zero as R will recognize indicating that all of our variables are significant.

This model appears to solve the problem we ran into with our first two models which was that we kept getting one variable with a negative parameter estimate (even after removing that variable in the second model). In this iteration, every variable has a positive parameter estimate solving an “oddity” that was present in our first two models.

fit.houses.4.df <- augment(fit.houses.4)
ggplot(fit.houses.4.df, aes(x=.fitted, y=.resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(x= "Fitted Values", y = "Residuals")

qqPlot(fit.houses.4, pch=16)

## [1]  6944 13810
ncvTest(fit.houses.4)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 9378.129, Df = 1, p = < 2.22e-16
library(nortest)
ad.test(fit.houses.4$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  fit.houses.4$residuals
## A = 467.2, p-value < 2.2e-16
# Anderson-Darling normality test

Similar to the first two models, the regression appears normal until the quantile reaches 2, when it skews off. The other, formal tests show that it is normal.

vif(fit.houses.4)
## bathrooms  bedrooms  sqft_lot 
##  1.398825  1.388316  1.008948

The VIF in this model is again lower than 10, and even lower than the values in each of the first two models. Again, this is evidence of not having multicollinearity in this dataset.

summary(fit.houses.4)
## 
## Call:
## lm(formula = price ~ bathrooms + bedrooms + sqft_lot, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1008496  -182229   -44313   110549  4158461 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.394e+04  9.449e+03  -2.533   0.0113 *  
## bathrooms    2.276e+05  3.673e+03  61.986  < 2e-16 ***
## bedrooms     2.250e+04  3.105e+03   7.248 4.43e-13 ***
## sqft_lot     3.989e-01  5.453e-02   7.315 2.70e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 302900 on 16202 degrees of freedom
## Multiple R-squared:  0.2789, Adjusted R-squared:  0.2787 
## F-statistic:  2089 on 3 and 16202 DF,  p-value: < 2.2e-16

The adjusted r-squared of this model is 0.2787, which is the lowest of the three models and means there is a moderate relationship. We believe that the adjustment in the parameter estimates and p-values for each of the values indicates that this model is the most accurate even though it has a smaller adjusted r-squared.

library(Metrics)
fit.houses.4 <- lm(price~bedrooms+bathrooms+sqft_lot, test)
summary(fit.houses.4)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_lot, data = test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -937037 -184001  -43192  109314 5905460 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.444e+04  1.722e+04  -2.580  0.00990 ** 
## bedrooms     2.248e+04  5.582e+03   4.027 5.74e-05 ***
## bathrooms    2.381e+05  6.686e+03  35.617  < 2e-16 ***
## sqft_lot     4.016e-01  1.327e-01   3.025  0.00249 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 320400 on 5398 degrees of freedom
## Multiple R-squared:  0.2712, Adjusted R-squared:  0.2708 
## F-statistic: 669.7 on 3 and 5398 DF,  p-value: < 2.2e-16
RMSE <- function(error) { sqrt(mean(error^2)) }
RMSE(fit.houses$residuals)
## [1] 251703.6
RMSE(fit.houses.4$residuals)
## [1] 320300.6

After running the RMSE, we notice that there are some changes but none that feel significant enough to adjust the model. All of the values previously noted as significant remain that way after the test with no new suspicious values introduced.

Commentary

Technical

We found that best model comprised of the sqft_lot, bedrooms and bathrooms variables. Even though the relationship was moderate as opposed to strong, our results were less suspicious than the first two iterations.

The overview of our approach was to start with what we thought would have the strongest correlation which was sqft_living, bedrooms and bathrooms as these are often listed as the most important qualities when buying a house. As we began building our models we noticed some suspicious results like negative parameter estimates for bedrooms (in the first model) and bathrooms (in the second model) meaning that there was a negative correlation with price and that price went down with an increase in those variables. These results were very surprising because we’d expect the price of a house to go up with an increase in bedrooms and bathrooms and not the inverse. After updating the model and receiving the same results, we decided we’d do a third iteration where we substituted sqft_living with sqft_lot. We thought there may be a significant amount of value in the land that comes with the home that is skewing the results of our models. After running this model, we were able to get rid of some of the suspicious results we received in the first two models and received more significant parameter estimates and p-values.

One variable that would be helpful (although extremely hard if not impossible to calculate) is to have a variable that represents the impact of location on the price of a home. When we kept getting negative parameter estimates for bedrooms and bathrooms, one thought that we had that could be the cause of those results is that buying a smaller house or apartment in a better location would be worth more than a house with more bedrooms or bathrooms in a poorer location. Zip code is probably the closest fit to this, but even within zip codes there can be a disparity between home values. If there was a variable that could represent a true “location value” for each home, it could help us build a more accurate model.

Practical

The original question our group sought to answer was “How does the number of bedrooms, bathrooms, and number of square-feet predict the selling price of a home in Kings County, Washington? As a result, which of these has the biggest impact?” A few key assumptions included was that the data collected was during a relatively stable housing market, and the economic landscape at the time did not have major impacts on the dataset. Secondly, we assumed that a bedroom in this data is not a shared common space like a multipurpose game room, or home office, but specifically includes a door, closet, and bed. Most importantly, we assumed that the sellers priced their house at a reasonable rate for their neighborhood, not significantly overselling or underselling their house.

After running our regression models, square-feet of living resulted in the smallest p-value and in turn, showing that its effect on house price is larger than bedrooms and bathrooms. Applying our model to a real-word situation of a person looking to buy a home, that person would learn that the house price will be most closely tied to the number of square-feet, followed by the number of bathrooms. Interestingly, the number of bedrooms had a significant effect on price, but in the negative direction, indicating that as the number of bedrooms increase, the price decreases.

This finding was quite confusing at first, as we originally thought the number of bedrooms would increase house price. A possible explanation for this result could be the inclusion of house data from both in the city and in rural parts of Washington. For example, a house in the city could be more expensive, but with less bedrooms. This would make sense because the dataset includes the city of Seattle. Today, we continue to see how the square footage of a home truly does play as the most important and relevant factor in the pricing of a home but it also matters where it is located.

We believe that our model was suitable for the time period and city the data was collected, 2014-2015 in Kings Country, Washington. We think that if our model and regression was available in the 2-3 years following the collection of data, it could have been used. Applying our model to the real world, the subsets of stakeholders that would be interested in the data include homeowners wanting to sell, homebuyers looking to buy, realtors identifying homes to show, and appraisers determining mortgage rates. These stakeholders are all directly impacted by the final model and prediction of housing cost. Secondly, this data indirectly impacts neighborhoods, where a person pricing their house to sell can increase or decrease the value of the neighborhood. However, since we analyzed the data 7 years later, it is a lot less useful due to the changes in that time. The constraints are further explained below.

A constraint to the usability of this dataset is the constantly changing and geographically specific housing market. This data was obtained from 2014-2015. Since then, the world has experienced a global pandemic, causing an economic recession. The increased mandates in several states are causing people to move rapidly, which has a direct impact on the housing market in those places, of both the locations people are leaving from and where people are seeking to live. One great example is the city of Nashville, at one point one could purchase a property in the heart of the city for 50+% less than what they can get a house for today. These prices per square footage changed so rapidly as a result of popularity the city continued to receive due to its friendly and safe atmosphere in addition to great tax laws. This is a trend we have also seen in states like Florida and Texas.

Secondly, none of the data variables included a rating of appliances or newness. Often, these factors like the refrigerator, stove, ovens, and washing machines play a large role in housing prices. It is important to note that our analysis simply looked at the number of bedrooms, bathrooms, and square-feet, but not the quality, condition or architectural style of these items. A house with 4 bedrooms and 5 bathrooms could be very inexpensive because it is a “fixer-upper”, but the data did not include any ranking of this, and therefore we could not account for it in the regression. In a similar way, a 2 bedroom, 1 bathroom house could be super expensive, because of its renovations and high-end appliances.

Lastly, we wonder how state laws and reputation of the state could have impacted the price of the homes. Like mentioned above, the housing market is constantly changing as a result of the economy as whole but also as a result of the politics and laws that affect the state. We wonder how this information could have impacted the housing market for Kings Country, Washington.

Overall, we found this to be a very interesting study that only confirmed that square footage has been and will continue to be the leading factor in home pricing. We also understand from this study that rural versus city impact the price per square foot just as much as demand and availability in the housing market for the given town and city. While the number of bedrooms and bathrooms can make or break a deal for home buyers it still is not the main proponent in the pricing of a home.