Introduction:

Using a housing price data sheet, this study assesses the statistical significance and effect of several variables on both the sale price and the listing price of 29 homes. Specifically, it analyzes the importance of neighborhood, bedrooms, rooms, half bathrooms, and full bathrooms. Each variable is first plotted against the sale price to highlight a positive, negative, or absent correlation, then analyzed using a correlation test, and finally examined using the mean sale price at each value of the variable. Next, multiple regression models (with the anova and summary functions) are used to assess the significance of each variable and the fit of the model. Lastly, the study asks whether the neighborhood of a house influences the sale price relative to the listing price.

Part 1: Explore the relationship between the sale price and the other variables using graphs and further identify variables that appear to have the strongest relationship with sale price.

knitr::opts_chunk$set(warning= FALSE)
setwd("~/Desktop/GEOG 5680/Final Project")
list.files()
## [1] "homeprice.csv"             "Nevins_Final_Project.html"
## [3] "Nevins_Final_Project.R"    "Nevins_Final_Project.Rmd" 
## [5] "Untitled.html"             "Untitled.Rmd"
home <- read.csv("homeprice.csv")

Neighborhood vs Sale Price

require("ggplot2")
## Loading required package: ggplot2
library(ggplot2)
neighborhood = ggplot(home, aes(x=neighborhood, y=sale)) + ggtitle("Sale Price Based on Neighborhood")
neighborhood + geom_point() + geom_smooth() + xlab("Neighborhood (1: Low Income; 5: High Income)") + ylab ("Sale Price of House")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Correlation test of ‘Neighborhood’ and ‘Sale Price’

cor.test(home$sale, home$neighborhood)
## 
##  Pearson's product-moment correlation
## 
## data:  home$sale and home$neighborhood
## t = 9.4853, df = 27, p-value = 4.357e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7523501 0.9410457
## sample estimates:
##       cor 
## 0.8770245

Mean Sale Price by Neighborhood

tapply(home$sale, home$neighborhood, mean)
##        1        2        3        4        5 
## 134.3500 173.9375 275.8333 379.8000 531.5000
plot(tapply(home$sale, home$neighborhood, mean ), xlab = "Neighborhood", ylab = "Mean Sale Price of Home", pch = 16, col = 6)

Interpretation of Data:

The first graph shows a strong positive relationship between ‘Sale Price’ and ‘Neighborhood’: the nicer the neighborhood, the higher the sale price. The plot of mean sale price by neighborhood makes the same point, with an almost linear rise from neighborhood 1 to neighborhood 5. Lastly, the correlation test gives a correlation of about 0.88 and an extremely small P-Value (4.357e-10). A small p-value (typically ≤ 0.05) is strong evidence against the null hypothesis of no correlation, so it is highly unlikely that a correlation this strong occurred by chance.
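As a check on how the correlation test arrives at its P-Value, the t statistic can be recomputed by hand from the reported correlation and degrees of freedom. The sketch below uses only the two numbers printed in the cor.test output above; the variable names are illustrative and not part of the original script.

```r
# Values copied from the cor.test() output for sale vs neighborhood
r  <- 0.8770245   # sample correlation
df <- 27          # degrees of freedom (n - 2)

# t statistic for testing H0: true correlation = 0
t.stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with df degrees of freedom
p.value <- 2 * pt(-abs(t.stat), df)

print(c(t = t.stat, p = p.value))
```

The same two lines apply to every correlation test in Part 1, since each one reports its own correlation with df = 27.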

Bedrooms vs Sale Price

library(ggplot2)
bedrooms = ggplot(home, aes(x=bedrooms, y=sale)) + ggtitle("Sale Price Based on Number of Bedrooms")
bedrooms + geom_point() + xlab("Number of Bedrooms") + ylab ("Sale Price of House") + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Correlation test of ‘Number of Bedrooms’ and ‘Sale Price’

cor.test(home$sale, home$bedrooms)
## 
##  Pearson's product-moment correlation
## 
## data:  home$sale and home$bedrooms
## t = 2.8932, df = 27, p-value = 0.007452
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1460026 0.7239114
## sample estimates:
##       cor 
## 0.4864766

Mean Sale Price by Number of Bedrooms

tapply(home$sale, home$bedrooms, mean)
##        1        2        3        4        5 
##  48.0000 224.8333 265.2625 313.3125 459.0000
plot(tapply(home$sale, home$bedrooms, mean ), xlab = "Number of Bedrooms", ylab = "Mean Sale Price of Home", pch = 16, col = 2)

Interpretation of Data:

The first graph reveals a positive relationship between ‘Sale Price’ and ‘Number of Bedrooms’: on average, the more bedrooms in the house, the higher the sale price. The mean sale price by number of bedrooms supports this. There is a large jump in price from 1 bedroom to 2, a more gradual increase from 2 to 4, and then a larger jump from 4 to 5. In other words, sale prices differ comparatively little between houses with 2 bedrooms and houses with 4 bedrooms. Lastly, the correlation test gives a correlation of about 0.49 and a small P-Value (0.007452). A small p-value (typically ≤ 0.05) is strong evidence against the null hypothesis of no correlation, so it is unlikely that the correlation occurred by chance.

Rooms vs Sale Price

library(ggplot2)
rooms = ggplot(home, aes(x=rooms, y=sale)) + ggtitle("Sale Price Based on Number of Rooms")
rooms + geom_point() +scale_x_continuous(name="Number of Rooms", limits=c(2, 11)) + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Correlation test of ‘Number of Rooms’ and ‘Sale Price’

cor.test(home$sale, home$rooms)
## 
##  Pearson's product-moment correlation
## 
## data:  home$sale and home$rooms
## t = 4.1973, df = 27, p-value = 0.0002621
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3402261 0.8086477
## sample estimates:
##       cor 
## 0.6283765

Mean Sale Price by Number of Rooms

tapply(home$sale, home$rooms, mean)
##        3        4        6        7        8        9       10       11 
##  48.0000 249.0000 216.3000 257.3333 288.7500 300.0000 613.0000 459.0000
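rooms.mean <- tapply(home$sale, home$rooms, mean)
# Plot against the actual room counts (the names of the result) rather than the positions 1..8
plot(as.numeric(names(rooms.mean)), rooms.mean, xlab = "Number of Rooms", ylab = "Mean Sale Price of Home", pch = 16, col = 10)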

Interpretation of Data:

The first graph reveals a moderate positive relationship between ‘Sale Price’ and ‘Number of Rooms’. For the most part, the more rooms in the house, the higher the sale price on average. The mean sale price by number of rooms shows this in more detail. There is a large difference in mean price between 3 rooms and 4 rooms, but much less change from 4 rooms to 9 rooms. Further, the number of rooms does not solely determine the sale price: the mean for 4 rooms (249.0) is higher than the mean for 6 rooms (216.3), and the mean for 10 rooms (613.0) is higher than the mean for 11 rooms (459.0). Lastly, the correlation test gives a correlation of about 0.63 and a small P-Value (0.0002621). A small p-value (typically ≤ 0.05) is strong evidence against the null hypothesis of no correlation, so it is highly unlikely that the correlation occurred by chance.

Full Bathrooms vs Sale Price

library(ggplot2)
full.bath = ggplot(home, aes(x=full, y=sale)) + ggtitle("Sale Price Based on Number of Full Bathrooms")
full.bath + geom_point() + scale_x_continuous(limits=c(1,3))+ geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Correlation test of ‘Full Bathrooms’ and ‘Sale Price’

cor.test(home$sale, home$full)
## 
##  Pearson's product-moment correlation
## 
## data:  home$sale and home$full
## t = 4.184, df = 27, p-value = 0.0002716
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3384567 0.8079545
## sample estimates:
##       cor 
## 0.6271649

Mean Sale Price by Number of Full Bathrooms

tapply(home$sale, home$full, mean)
##        1        2        3 
## 211.3615 279.7727 421.4000
plot(tapply(home$sale, home$full, mean ), xlab = "Number of Full Bathrooms", ylab = "Mean Sale Price of Home", pch = 16, col = 5)

Interpretation of Data:

The first graph reveals a moderate positive relationship between ‘Sale Price’ and ‘Number of Full Bathrooms’, and the trend line shows that, on average, the more full bathrooms in the house, the higher the sale price. The mean sale price rises from 211.4 with 1 full bathroom to 421.4 with 3. Even so, the number of full bathrooms does not solely determine the sale price, as some houses with fewer full bathrooms sold for more than houses with more full bathrooms. Lastly, the correlation test gives a correlation of about 0.63 and a small P-Value (0.0002716). A small p-value (typically ≤ 0.05) is strong evidence against the null hypothesis of no correlation, so it is highly unlikely that the correlation occurred by chance.

Half Bathrooms vs Sale Price

library(ggplot2)
half.bath = ggplot(home, aes(x=half, y=sale)) + ggtitle("Sale Price Based on Number of Half Bathrooms")
half.bath + geom_point() + scale_x_continuous(name="Number of Half Bathrooms", limits=c(0,2)) + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Correlation test of ‘Half Bathrooms’ and ‘Sale Price’

cor.test(home$sale, home$half)
## 
##  Pearson's product-moment correlation
## 
## data:  home$sale and home$half
## t = 2.2285, df = 27, p-value = 0.03436
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0323264 0.6646506
## sample estimates:
##       cor 
## 0.3941621

Mean Sale Price by Number of Half Bathrooms

tapply(home$sale, home$half, mean)
##        0        1        2 
## 237.5923 278.6923 406.8333
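half.mean <- tapply(home$sale, home$half, mean)
# Plot against the actual half-bathroom counts (0, 1, 2) rather than the positions 1..3
plot(as.numeric(names(half.mean)), half.mean, xlab = "Number of Half Bathrooms", ylab = "Mean Sale Price of Home", pch = 16, col = 20)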

Interpretation of Data:

The first graph reveals a weak positive relationship between ‘Sale Price’ and ‘Number of Half Bathrooms’. The trend line shows that, on average, the more half bathrooms in the house, the higher the sale price, and the mean sale price rises from 237.6 with no half bathrooms to 406.8 with 2. The number of half bathrooms does not solely determine the sale price, however, as some houses with fewer half bathrooms sold for more than houses with more half bathrooms. Lastly, the correlation test gives a correlation of about 0.39 and a P-Value of 0.03436. This is below the usual 0.05 threshold, so there is evidence against the null hypothesis of no correlation, although it is the weakest evidence of the five variables.
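To compare the five variables side by side, the correlations reported by the cor.test calls above can be collected and ranked. The sketch below only re-uses the printed correlations; the object names are illustrative. Squaring each correlation gives the share of the variance in sale price that is linearly associated with that variable on its own.

```r
# Correlations with sale price, copied from the cor.test() outputs above
r <- c(bedrooms     = 0.4864766,
       rooms        = 0.6283765,
       full         = 0.6271649,
       half         = 0.3941621,
       neighborhood = 0.8770245)

# Rank from strongest to weakest linear relationship
ranked <- sort(r, decreasing = TRUE)

# r squared: share of variance in sale price associated with each variable alone
print(data.frame(r = round(ranked, 3), r.squared = round(ranked^2, 3)))
```

On these numbers, neighborhood has the strongest relationship with sale price, rooms and full bathrooms are nearly tied, and half bathrooms is the weakest.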

List Price vs Sale Price

library(ggplot2)
list = ggplot(home, aes(x=list, y=sale)) + ggtitle("Sale Price Based on Listing Price") + xlab("Listing Price") +ylab("Sale Price")
list + geom_line() + geom_point(col=2) 

Interpretation of Data:

Comparing ‘Sale Price’ with ‘Listing Price’ shows that most houses sold close to their listing price. There are no major outliers, which indicates that list price and sale price are consistent with each other.

Part 2: Make a multiple linear regression model of variables to ‘Sale Price’:

sale.reg.mod = lm(sale ~ full + half + bedrooms + rooms + neighborhood, data = home)
sale.reg.mod
## 
## Call:
## lm(formula = sale ~ full + half + bedrooms + rooms + neighborhood, 
##     data = home)
## 
## Coefficients:
##  (Intercept)          full          half      bedrooms         rooms  
##     -135.263        26.225        43.242        20.409         6.488  
## neighborhood  
##       77.243
summary(sale.reg.mod)
## 
## Call:
## lm(formula = sale ~ full + half + bedrooms + rooms + neighborhood, 
##     data = home)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.31 -34.06   7.20  21.32  55.93 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -135.263     37.283  -3.628  0.00141 ** 
## full           26.225     13.896   1.887  0.07181 .  
## half           43.242     12.830   3.370  0.00264 ** 
## bedrooms       20.409     17.798   1.147  0.26329    
## rooms           6.488     10.383   0.625  0.53823    
## neighborhood   77.243     10.077   7.665 8.86e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.29 on 23 degrees of freedom
## Multiple R-squared:  0.9079, Adjusted R-squared:  0.8879 
## F-statistic: 45.34 on 5 and 23 DF,  p-value: 3.686e-11
anova(sale.reg.mod)
## Analysis of Variance Table
## 
## Response: sale
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## full          1 151632  151632 98.2101 9.062e-10 ***
## half          1  87430   87430 56.6271 1.206e-07 ***
## bedrooms      1  10581   10581  6.8530   0.01538 *  
## rooms         1   9632    9632  6.2387   0.02009 *  
## neighborhood  1  90717   90717 58.7562 8.859e-08 ***
## Residuals    23  35511    1544                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sale Price Regression Model Analysis

After running the regression model with all of the variables, there are several key conclusions. Most importantly, ‘Neighborhood’ has the smallest P-Value in the summary output (8.86e-08). This is far below the 0.05 threshold, so it is strong evidence against the null hypothesis that neighborhood has no effect once the other variables are accounted for. In addition to ‘Neighborhood’, ‘Half’ bathrooms is significant, with a P-Value of 0.00264. It is the only other variable in the summary output that rejects the null at the 0.05 level; ‘Full’ bathrooms comes close (0.07181), while ‘Bedrooms’ and ‘Rooms’ do not. The anova table shows every variable as significant because it tests the variables sequentially, each one added after those listed above it, whereas the summary tests each variable with all the others already in the model. Further, roughly 91% of the variance in sale price is explained by the variables, as shown by the ‘R-Squared’ value of 0.9079 (adjusted: 0.8879). As 0.9079 is close to 1, the model fits well.
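The ‘R-Squared’ values can be recovered from the anova table, which makes the link between the two outputs explicit. The sketch below copies the sums of squares printed above; n = 29 is inferred from the 23 residual degrees of freedom plus the 6 estimated coefficients, and the object names are illustrative.

```r
# Sums of squares copied from anova(sale.reg.mod)
ss <- c(full = 151632, half = 87430, bedrooms = 10581,
        rooms = 9632, neighborhood = 90717)
ss.resid <- 35511

n <- 29   # inferred: 23 residual df + 6 coefficients
p <- 5    # number of predictors

# R-squared: share of total variation not left in the residuals
ss.total <- sum(ss) + ss.resid
r2 <- 1 - ss.resid / ss.total

# Adjusted R-squared penalizes the number of predictors
adj.r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(c(r.squared = r2, adj.r.squared = adj.r2), 4))
```

Both values agree with the summary output to the rounding shown in the anova table.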

Part 3: Create a second model using the variables to explain the ‘List price’:

lm(list ~ ., data = home)
## 
## Call:
## lm(formula = list ~ ., data = home)
## 
## Coefficients:
##  (Intercept)          sale          full          half      bedrooms  
##     -21.8752        0.9069        8.3411        6.3398       -0.0627  
##        rooms  neighborhood  
##       1.2426        7.3793
list.reg.mod = lm(list ~ full + half + bedrooms + rooms + neighborhood, data = home)
list.reg.mod
## 
## Call:
## lm(formula = list ~ full + half + bedrooms + rooms + neighborhood, 
##     data = home)
## 
## Coefficients:
##  (Intercept)          full          half      bedrooms         rooms  
##     -144.544        32.125        45.556        18.446         7.126  
## neighborhood  
##       77.430

Use summary to find the P-Values and Significance of each variable within the model

summary(list.reg.mod)
## 
## Call:
## lm(formula = list ~ full + half + bedrooms + rooms + neighborhood, 
##     data = home)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.788 -28.776   4.351  23.859  62.720 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -144.544     36.026  -4.012 0.000546 ***
## full           32.125     13.427   2.392 0.025293 *  
## half           45.556     12.397   3.675 0.001257 ** 
## bedrooms       18.446     17.197   1.073 0.294572    
## rooms           7.126     10.033   0.710 0.484661    
## neighborhood   77.430      9.737   7.952 4.75e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.97 on 23 degrees of freedom
## Multiple R-squared:  0.9183, Adjusted R-squared:  0.9006 
## F-statistic: 51.74 on 5 and 23 DF,  p-value: 9.358e-12
anova(list.reg.mod)
## Analysis of Variance Table
## 
## Response: list
##              Df Sum Sq Mean Sq  F value    Pr(>F)    
## full          1 169594  169594 117.6457 1.615e-10 ***
## half          1  92249   92249  63.9922 4.294e-08 ***
## bedrooms      1   9745    9745   6.7597   0.01601 *  
## rooms         1  10162   10162   7.0494   0.01415 *  
## neighborhood  1  91158   91158  63.2352 4.754e-08 ***
## Residuals    23  33156    1442                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Are there differences from the sale price? Could you use this information to recommend which characteristic of a house a real estate agent should concentrate on?

The regression model for list price is fairly similar to the model for sale price. There is one main difference: the ‘full’ bathroom variable is now significant (P-Value 0.025293), in addition to ‘half’ bathrooms and ‘neighborhood’. Again, ‘neighborhood’ has the smallest P-Value (4.75e-08), followed by ‘half’ (0.001257) and then ‘full’. These results indicate that ‘neighborhood’ has the most important effect on both list price and sale price. The model also fits well: roughly 92% of the variance is explained by the variables, as shown by the ‘R-Squared’ value of 0.9183, which is close to 1. A real estate agent should therefore focus on neighborhood as the key characteristic when selling a house, as it is the most significant variable in both regression models.
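To see how close the two models are in practice, the fitted coefficients can be applied to the same house. The sketch below copies the coefficients from the two outputs above; the example house (2 full bathrooms, 1 half bathroom, 3 bedrooms, 7 rooms, neighborhood 3) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from sale.reg.mod and list.reg.mod
sale.coef <- c(intercept = -135.263, full = 26.225, half = 43.242,
               bedrooms = 20.409, rooms = 6.488, neighborhood = 77.243)
list.coef <- c(intercept = -144.544, full = 32.125, half = 45.556,
               bedrooms = 18.446, rooms = 7.126, neighborhood = 77.430)

# Hypothetical house; the leading 1 multiplies the intercept
house <- c(intercept = 1, full = 2, half = 1,
           bedrooms = 3, rooms = 7, neighborhood = 3)

# Predicted price = sum of coefficient * value
pred.sale <- sum(sale.coef * house)
pred.list <- sum(list.coef * house)

print(c(sale = pred.sale, list = pred.list, gap = pred.list - pred.sale))
```

With the fitted models still in the session, predict(sale.reg.mod, newdata = ...) and predict(list.reg.mod, newdata = ...) would give the same kind of comparison without copying coefficients by hand.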

Part 4: What is the effect of neighborhood on the difference between ‘Sale Price’ and ‘List Price’?

neigh.reg.mod = lm(neighborhood ~ sale + list, data = home)
neigh.reg.mod
## 
## Call:
## lm(formula = neighborhood ~ sale + list, data = home)
## 
## Coefficients:
## (Intercept)         sale         list  
##   0.8551268    0.0008349    0.0065966
summary(neigh.reg.mod)
## 
## Call:
## lm(formula = neighborhood ~ sale + list, data = home)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.97992 -0.31827 -0.01618  0.33585  0.84921 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 0.8551268  0.2395232   3.570  0.00142 **
## sale        0.0008349  0.0074462   0.112  0.91159   
## list        0.0065966  0.0072552   0.909  0.37158   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4968 on 26 degrees of freedom
## Multiple R-squared:  0.7763, Adjusted R-squared:  0.7591 
## F-statistic: 45.11 on 2 and 26 DF,  p-value: 3.516e-09
anova(neigh.reg.mod)
## Analysis of Variance Table
## 
## Response: neighborhood
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## sale       1 22.0673 22.0673 89.3927 6.724e-10 ***
## list       1  0.2041  0.2041  0.8267    0.3716    
## Residuals 26  6.4183  0.2469                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Do nicer neighborhoods mean it is more likely to have a house go over the asking price?

In the anova table, ‘sale’ has a high F-value (89.39) and an extremely small p-value (6.724e-10), so sale price is strongly related to neighborhood. In contrast, ‘list’ has a small F-value (0.83) and a p-value of 0.3716, so listing price adds little once sale price is already in the model. Looking at the data and the ‘list vs sale’ graph in Part 1, the majority of houses in the nicer neighborhoods (4-5) did not sell for more than their asking price. Specifically, only 3/6 (50%) of the homes in the nicer neighborhoods sold for more than the asking price. That said, this percentage is higher than in neighborhoods 1-3, where ~29% of houses sold for more than the asking price. Although a somewhat larger share of homes in the nicer neighborhoods sold above the asking price, more data would be needed to show that the difference is statistically significant and did not occur by chance. Because of this, the data do not show that a nicer neighborhood makes it more likely for the sale price to exceed the listing price.

Further, as shown above in Parts 1-3 of the study, homes in the nicer neighborhoods sell for more, and neighborhood has the strongest correlation with sale price.
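One alternative way to approach the Part 4 question is to model the difference between sale price and list price directly as a function of neighborhood, and to tabulate the share of homes that sold above asking. The sketch below is not the analysis run in this study: the data frame toy contains made-up values so that the code runs on its own, and the names toy, diff.mod, and over.asking are hypothetical. In the real analysis, the home data frame would take the place of toy.

```r
# Made-up illustrative data (NOT the homeprice data)
toy <- data.frame(
  sale         = c(130, 142, 171, 180, 270, 281, 375, 390, 525, 540),
  list         = c(135, 140, 175, 178, 275, 279, 370, 395, 520, 545),
  neighborhood = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
)

# Response: how far the sale price landed above (+) or below (-) the list price
toy$diff <- toy$sale - toy$list

# Does the sale-minus-list difference change with neighborhood?
diff.mod <- lm(diff ~ neighborhood, data = toy)
print(summary(diff.mod))

# Share of homes sold above asking, nicer (4-5) vs other (1-3) neighborhoods
over.asking <- toy$sale > toy$list
print(tapply(over.asking, toy$neighborhood >= 4, mean))
```

In this setup, the coefficient on neighborhood and its p-value address the question directly: a positive, significant slope would suggest that homes in nicer neighborhoods tend to sell further above their list price.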