GEOG 5680 Final Project: House Prices

Introduction

The data set we are using contains information about homes that sold in a town in New Jersey in 2001. We are interested in understanding the relationship between house characteristics and the sale price.

Setting up

First, I need to load the library used for all of the plotting below.

library(ggplot2)

Now, I need to read in the data and see how it is structured.

homes = read.csv("homeprice.csv")
str(homes)

## 'data.frame':    29 obs. of  7 variables:
##  $ list        : num  80 151 310 295 339 ...
##  $ sale        : num  118 151 300 275 340 ...
##  $ full        : int  1 1 2 2 2 1 3 1 1 1 ...
##  $ half        : int  0 0 1 1 0 1 0 1 2 0 ...
##  $ bedrooms    : int  3 4 4 4 3 4 3 3 3 1 ...
##  $ rooms       : int  6 7 9 8 7 8 7 7 7 3 ...
##  $ neighborhood: int  1 1 3 3 4 3 2 2 3 2 ...

Question 1

Explore the relationship between the sale price and the other variables using plots

Using the ggplot() function, I am going to create a simple line or point graph for each variable to show how it relates to the sale price.

ggplot(homes, aes(x = list, y = sale)) + ggtitle("Sales Price v Listing Price") + geom_line()

ggplot(homes, aes(x = full, y = sale)) + ggtitle("Sales Price v Full Baths") + geom_point()

ggplot(homes, aes(x = half, y = sale)) + ggtitle("Sales Price v Half Baths") + geom_point()

ggplot(homes, aes(x = bedrooms, y = sale)) + ggtitle("Sales Price v Bedrooms") + geom_point()

ggplot(homes, aes(x = rooms, y = sale)) + ggtitle("Sales Price v Rooms") + geom_point()

ggplot(homes, aes(x = neighborhood, y = sale)) + ggtitle("Sales Price v Neighborhood Rank") + geom_point()

Identify those variables that appear to have the strongest relationship with sale price

The variable that has the strongest relationship with the sale price, outside of the listing price, is the neighborhood rank. The baths and rooms do have a positive correlation, but it is not nearly as strong.
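
A rough numeric check on these visual impressions is to compute each variable's correlation with sale. The sketch below is illustrative only: homes_head is a hypothetical name for the first five rows as printed by str() above (the printed values may be rounded), and sale_correlations() is a helper introduced here, not part of the original analysis. Five rows are far too few to draw conclusions from; with the full data, the same helper could be applied to homes.

```r
# First five rows as printed by str(homes) above (values may be rounded).
homes_head <- data.frame(
  list         = c(80, 151, 310, 295, 339),
  sale         = c(118, 151, 300, 275, 340),
  full         = c(1, 1, 2, 2, 2),
  half         = c(0, 0, 1, 1, 0),
  bedrooms     = c(3, 4, 4, 4, 3),
  rooms        = c(6, 7, 9, 8, 7),
  neighborhood = c(1, 1, 3, 3, 4)
)

# Hypothetical helper: Pearson correlation of every other column with `sale`,
# sorted from the largest to the smallest value.
sale_correlations <- function(df) {
  cors <- cor(df)[, "sale"]
  sort(cors[names(cors) != "sale"], decreasing = TRUE)
}

print(round(sale_correlations(homes_head), 2))
```

Because neighborhood is an ordinal rank rather than a measured quantity, a Pearson correlation is only a rough guide for that column.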

Question 2

Use these variables to build a multiple linear regression model to explain the sale price.

First, using the lm() function, I am going to fit a model called sale_model and print its coefficients.

sale_model = lm(sale ~ list + full + half + bedrooms + rooms + neighborhood, data = homes)
sale_model

## 
## Call:
## lm(formula = sale ~ list + full + half + bedrooms + rooms + neighborhood, 
##     data = homes)
## 
## Coefficients:
##  (Intercept)          list          full          half      bedrooms  
##       5.1336        0.9713       -4.9776       -1.0064        2.4922  
##        rooms  neighborhood  
##      -0.4341        2.0343

Then, using the summary() function, I will get the coefficients along with their standard errors, t values, p-values, and goodness-of-fit measures.

summary(sale_model)

## 
## Call:
## lm(formula = sale ~ list + full + half + bedrooms + rooms + neighborhood, 
##     data = homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.807  -6.626  -0.270   5.580  32.933 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.13359   17.15496   0.299    0.768    
## list          0.97131    0.07616  12.754 1.22e-11 ***
## full         -4.97759    5.48033  -0.908    0.374    
## half         -1.00644    5.70418  -0.176    0.862    
## bedrooms      2.49224    6.43616   0.387    0.702    
## rooms        -0.43411    3.70424  -0.117    0.908    
## neighborhood  2.03434    6.88609   0.295    0.770    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.87 on 22 degrees of freedom
## Multiple R-squared:  0.989,  Adjusted R-squared:  0.986 
## F-statistic: 330.5 on 6 and 22 DF,  p-value: < 2.2e-16

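In this table, a larger absolute “t value” (the estimate divided by its “Std. Error”) and a smaller “Pr(>|t|)” indicate stronger evidence that a variable is related to the sale price; a higher “Std. Error” only means the estimate is less precise. Here, only list has a significant coefficient (t = 12.754, p = 1.22e-11), and the model as a whole explains about 99% of the variation in sale price (Multiple R-squared: 0.989).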
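
To make the relationship between the columns of the coefficient table concrete, the sketch below recomputes the t value and p-value for list from the rounded numbers printed above. The variable names are introduced here for illustration, and because the inputs are rounded, the results agree with the printed table only approximately.

```r
# Values copied from the summary(sale_model) output above (rounded as printed).
estimate  <- 0.97131   # Estimate for `list`
std_error <- 0.07616   # Std. Error for `list`
df_resid  <- 22        # residual degrees of freedom

# The t value is the estimate divided by its standard error.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with the residual degrees of freedom.
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

print(round(t_value, 3))
print(signif(p_value, 3))
```

A large standard error relative to the estimate therefore produces a small t value and a large p-value, which is what happens for every variable except list.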

Lastly, I am going to use the anova() function to identify which variable appears to have the greatest effect on sale prices.

anova(sale_model)

## Analysis of Variance Table
## 
## Response: sale
##              Df Sum Sq Mean Sq   F value Pr(>F)    
## list          1 381050  381050 1981.6252 <2e-16 ***
## full          1    156     156    0.8116 0.3774    
## half          1     21      21    0.1092 0.7441    
## bedrooms      1     25      25    0.1314 0.7204    
## rooms         1      3       3    0.0141 0.9065    
## neighborhood  1     17      17    0.0873 0.7704    
## Residuals    22   4230     192                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

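These results are similar to those from summary(): a higher “F value” corresponds to a lower p-value (“Pr(>F)”), and the lower the p-value, the stronger the evidence that the variable helps explain the sale price. Only list is significant here. Note that anova() tests the terms sequentially, in the order they appear in the formula.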
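
The columns of the ANOVA table can be reconstructed from the sums of squares, which also shows where the R-squared value reported by summary() comes from. This is a sketch using the rounded values printed above, with variable names introduced here for illustration, so the results match the table only approximately.

```r
# Sum Sq values copied from the anova(sale_model) output above (rounded as printed).
ss_terms <- c(list = 381050, full = 156, half = 21,
              bedrooms = 25, rooms = 3, neighborhood = 17)
ss_resid <- 4230
df_resid <- 22

# Each term has 1 degree of freedom, so its Mean Sq equals its Sum Sq.
ms_resid <- ss_resid / df_resid

# F value = term mean square / residual mean square.
f_values <- ss_terms / ms_resid

# p-value from the F distribution with (1, df_resid) degrees of freedom.
p_values <- pf(f_values, df1 = 1, df2 = df_resid, lower.tail = FALSE)

# R-squared = 1 - residual sum of squares / total sum of squares.
r_squared <- 1 - ss_resid / (sum(ss_terms) + ss_resid)

print(round(f_values, 2))
print(round(p_values, 4))
print(round(r_squared, 3))
```

Almost all of the explained sum of squares belongs to list, which is why the remaining terms add so little once it is in the model.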

Question 3

Build a second model using the same variables to explain the list price.

First, using the lm() function, I am going to fit a second model, list_model, with the list price as the response.

list_model = lm(list ~ sale + full + half + bedrooms + rooms + neighborhood, data = homes)
list_model

## 
## Call:
## lm(formula = list ~ sale + full + half + bedrooms + rooms + neighborhood, 
##     data = homes)
## 
## Coefficients:
##  (Intercept)          sale          full          half      bedrooms  
##     -21.8752        0.9069        8.3411        6.3398       -0.0627  
##        rooms  neighborhood  
##       1.2426        7.3793

Now I am going to use the anova() function to see which variables have the greatest effect on the list price.

anova(list_model)

## Analysis of Variance Table
## 
## Response: list
##              Df Sum Sq Mean Sq   F value Pr(>F)    
## sale          1 401374  401374 2235.5702 <2e-16 ***
## full          1    346     346    1.9259 0.1791    
## half          1    134     134    0.7440 0.3977    
## bedrooms      1      4       4    0.0209 0.8864    
## rooms         1     24      24    0.1326 0.7192    
## neighborhood  1    233     233    1.2973 0.2670    
## Residuals    22   3950     180                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Are there differences from the sale price?

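The biggest differences are in the neighborhood rank and the full and half bathrooms: their p-values all dropped considerably (neighborhood from 0.7704 to 0.2670, full from 0.3774 to 0.1791, and half from 0.7441 to 0.3977), although none of them falls below the 0.05 significance level.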

Could you use this information to recommend which characteristic of a house a real estate agent should concentrate on?

Yes, I would recommend the real estate agent focus more on the neighborhood the house is located in and the number of bathrooms in the house.

Question 4

What is the effect of neighborhood on the difference between sale price and list price? Do richer neighborhoods mean it is more likely to have a house go over the asking price?

Houses in richer neighborhoods are listed at higher prices, and their sale prices also tend to be higher. However, the neighborhood’s influence is more significant on the list price than on the sale price. This does not imply that these homes are more likely to sell above the asking price.
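
One way to examine this question directly is to compute the gap between sale and list price for each home and summarise it by neighborhood. The sketch below is an assumption about how that could be done, not part of the original analysis: homes_head holds only the first five rows as printed by str() above (possibly rounded), and price_gap_by_neighborhood() is a hypothetical helper. The output for five rows is illustrative and should not be read as a result.

```r
# First five rows as printed by str(homes) above (values may be rounded).
homes_head <- data.frame(
  list         = c(80, 151, 310, 295, 339),
  sale         = c(118, 151, 300, 275, 340),
  neighborhood = c(1, 1, 3, 3, 4)
)

# Hypothetical helper: mean (sale - list) gap and share of homes sold
# above the asking price, for each neighborhood rank.
price_gap_by_neighborhood <- function(df) {
  df$gap <- df$sale - df$list          # positive means sold over asking
  df$over_asking <- df$gap > 0         # TRUE when sale exceeds list
  aggregate(cbind(gap, over_asking) ~ neighborhood, data = df, FUN = mean)
}

by_nb <- price_gap_by_neighborhood(homes_head)
print(by_nb)
```

With the full data, calling the helper on homes, or fitting lm(I(sale - list) ~ neighborhood, data = homes), would give a more direct test of whether the gap changes with neighborhood rank.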

Summary

What do these results mean?

The data indicates that the number of bedrooms and other rooms has the least correlation with the sale price. Full and half bathrooms show a higher correlation than rooms, but not nearly as significant as neighborhood ranking. Overall, the list price is the most correlated, with neighborhood ranking following in its effect on the sale price of the house.