The data set we are using contains information about homes that sold in a New Jersey town in the year 2001. We are interested in understanding the relationship between house characteristics and the sale price.
First, I need to load the ggplot2 library, which is used for all of the plots below.
library(ggplot2)
Now, I need to read in the data and see how it is structured.
homes = read.csv("homeprice.csv")
str(homes)
## 'data.frame': 29 obs. of 7 variables:
## $ list : num 80 151 310 295 339 ...
## $ sale : num 118 151 300 275 340 ...
## $ full : int 1 1 2 2 2 1 3 1 1 1 ...
## $ half : int 0 0 1 1 0 1 0 1 2 0 ...
## $ bedrooms : int 3 4 4 4 3 4 3 3 3 1 ...
## $ rooms : int 6 7 9 8 7 8 7 7 7 3 ...
## $ neighborhood: int 1 1 3 3 4 3 2 2 3 2 ...
Explore the relationship between the sale price and the other variables using plots
Using the ggplot() function, I am going to create a simple line or point graph for each variable to show how it relates to the sale price.
ggplot(homes, aes(x = list, y = sale)) + ggtitle("Sales Price v Listing Price") + geom_line()
ggplot(homes, aes(x = full, y = sale)) + ggtitle("Sales Price v Full Baths") + geom_point()
ggplot(homes, aes(x = half, y = sale)) + ggtitle("Sales Price v Half Baths") + geom_point()
ggplot(homes, aes(x = bedrooms, y = sale)) + ggtitle("Sales Price v Bedrooms") + geom_point()
ggplot(homes, aes(x = rooms, y = sale)) + ggtitle("Sales Price v Rooms") + geom_point()
ggplot(homes, aes(x = neighborhood, y = sale)) + ggtitle("Sales Price v Neighborhood Safety") + geom_point()
Identify those variables that appear to have the strongest relationship with sale price
The variable that has the strongest relationship with the sale price, outside of the listing price, is the neighborhood rank. The baths and rooms do have a positive correlation, but not nearly as strong.
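The visual impression from the plots can be cross-checked numerically with cor(). The sketch below is only an illustration: it assumes a small data frame, here called homes_head (a hypothetical name), built from the first five homes as they are displayed (and possibly rounded) in the str() output, so the numbers will differ from those for the full 29 homes.

```r
# First five homes as displayed (possibly rounded) in the str() output.
# homes_head is a hypothetical name used only for this illustration.
homes_head <- data.frame(
  list         = c(80, 151, 310, 295, 339),
  sale         = c(118, 151, 300, 275, 340),
  full         = c(1, 1, 2, 2, 2),
  half         = c(0, 0, 1, 1, 0),
  bedrooms     = c(3, 4, 4, 4, 3),
  rooms        = c(6, 7, 9, 8, 7),
  neighborhood = c(1, 1, 3, 3, 4)
)

# Pearson correlation of each predictor with the sale price.
predictors <- setdiff(names(homes_head), "sale")
sale_cor <- sapply(homes_head[predictors], function(x) cor(x, homes_head$sale))

# Strongest positive correlation first.
print(round(sort(sale_cor, decreasing = TRUE), 3))
```

With the full data, the same sapply() call can be run on homes instead of homes_head.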
Use these variables to build a multiple linear regression model to explain the sale price.
First, using the lm() function, I am going to fit a model, stored as sale_model, and print its coefficients.
sale_model = lm(sale ~ list + full + half + bedrooms + rooms + neighborhood, data = homes)
sale_model
##
## Call:
## lm(formula = sale ~ list + full + half + bedrooms + rooms + neighborhood,
## data = homes)
##
## Coefficients:
## (Intercept) list full half bedrooms
## 5.1336 0.9713 -4.9776 -1.0064 2.4922
## rooms neighborhood
## -0.4341 2.0343
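To see what these coefficients mean, the fitted equation can be applied by hand to a single home. This is a sketch using the rounded coefficients printed above and the first home as displayed by str() (list 80, 1 full bath, 0 half baths, 3 bedrooms, 6 rooms, neighborhood 1); the names coefs and first_home are only for illustration.

```r
# Coefficients copied from the printed sale_model output (rounded to 4 decimals).
coefs <- c(
  intercept = 5.1336, list = 0.9713, full = -4.9776, half = -1.0064,
  bedrooms = 2.4922, rooms = -0.4341, neighborhood = 2.0343
)

# First home as displayed by str(). The leading 1 multiplies the intercept;
# the remaining values follow the same order as coefs.
first_home <- c(1, 80, 1, 0, 3, 6, 1)

# Predicted sale price = intercept + sum of (coefficient * value).
predicted_sale <- sum(coefs * first_home)
print(predicted_sale)
```

With the fitted model available, predict(sale_model, newdata = homes[1, ]) should give essentially the same number, up to the rounding of the printed values.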
Then, using the summary() function, I will get the coefficients along with their standard errors, t values, p-values and the goodness-of-fit measures for the model.
summary(sale_model)
##
## Call:
## lm(formula = sale ~ list + full + half + bedrooms + rooms + neighborhood,
## data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.807 -6.626 -0.270 5.580 32.933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.13359 17.15496 0.299 0.768
## list 0.97131 0.07616 12.754 1.22e-11 ***
## full -4.97759 5.48033 -0.908 0.374
## half -1.00644 5.70418 -0.176 0.862
## bedrooms 2.49224 6.43616 0.387 0.702
## rooms -0.43411 3.70424 -0.117 0.908
## neighborhood 2.03434 6.88609 0.295 0.770
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.87 on 22 degrees of freedom
## Multiple R-squared: 0.989, Adjusted R-squared: 0.986
## F-statistic: 330.5 on 6 and 22 DF, p-value: < 2.2e-16
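In this table a larger “Std. Error” is not better: the standard error measures the uncertainty in each coefficient estimate. What matters is a large absolute “t value” and a correspondingly small “Pr(>|t|)”. Here only list meets that standard (t = 12.754, p = 1.22e-11); every other variable has a p-value above 0.3. The multiple R-squared of 0.989 shows that the model as a whole explains nearly all of the variation in sale price.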
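The “t value” and “Pr(>|t|)” columns can be reproduced from the “Estimate” and “Std. Error” columns. The sketch below copies three rows from the summary table, so the results match the printed output only up to rounding; est, se, t_val and p_val are illustrative names.

```r
# Estimates and standard errors copied from the summary(sale_model) table.
est <- c(list = 0.97131, full = -4.97759, neighborhood = 2.03434)
se  <- c(list = 0.07616, full = 5.48033, neighborhood = 6.88609)

# t value = estimate / standard error.
t_val <- est / se

# Two-sided p-value from the t distribution with the model's
# 22 residual degrees of freedom.
p_val <- 2 * pt(-abs(t_val), df = 22)

print(round(t_val, 3))
print(signif(p_val, 3))
```

A large standard error relative to the estimate gives a t value near zero and a p-value near one, which is why full and neighborhood are not significant here.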
Lastly, I am going to use the anova() function to identify which variable appears to have the greatest effect on sale prices.
anova(sale_model)
## Analysis of Variance Table
##
## Response: sale
## Df Sum Sq Mean Sq F value Pr(>F)
## list 1 381050 381050 1981.6252 <2e-16 ***
## full 1 156 156 0.8116 0.3774
## half 1 21 21 0.1092 0.7441
## bedrooms 1 25 25 0.1314 0.7204
## rooms 1 3 3 0.0141 0.9065
## neighborhood 1 17 17 0.0873 0.7704
## Residuals 22 4230 192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
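These results are similar to the summary() output: a higher “F value” corresponds to a lower p-value (“Pr(>F)”), and the lower the p-value, the stronger the evidence that the variable helps explain the sale price. Again, only list is significant.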
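The ANOVA table also ties back to the goodness-of-fit numbers reported by summary(). As a rough check, using the rounded “Sum Sq” values printed above (so agreement is only approximate), the R-squared, residual standard error and overall F-statistic can be recovered like this:

```r
# Sum Sq column copied from the anova(sale_model) table.
sum_sq <- c(list = 381050, full = 156, half = 21, bedrooms = 25,
            rooms = 3, neighborhood = 17)
ss_resid <- 4230   # residual sum of squares
df_resid <- 22     # residual degrees of freedom

ss_model <- sum(sum_sq)          # variation explained by the model
ss_total <- ss_model + ss_resid  # total variation in sale

r_squared <- 1 - ss_resid / ss_total
resid_se  <- sqrt(ss_resid / df_resid)
f_stat    <- (ss_model / length(sum_sq)) / (ss_resid / df_resid)

print(round(c(r_squared = r_squared, resid_se = resid_se, f_stat = f_stat), 3))
```

These come out close to the 0.989, 13.87 and 330.5 shown in the summary() output, and they make it clear that almost all of the explained variation is attributed to list.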
Build a second model using the same variables to explain the list price.
First, using the lm() function again, I am going to fit a second model, stored as list_model, with the list price as the response.
list_model = lm(list ~ sale + full + half + bedrooms + rooms + neighborhood, data = homes)
list_model
##
## Call:
## lm(formula = list ~ sale + full + half + bedrooms + rooms + neighborhood,
## data = homes)
##
## Coefficients:
## (Intercept) sale full half bedrooms
## -21.8752 0.9069 8.3411 6.3398 -0.0627
## rooms neighborhood
## 1.2426 7.3793
Now I am going to use the anova() function to see which variables have the greatest effect on the list price.
anova(list_model)
## Analysis of Variance Table
##
## Response: list
## Df Sum Sq Mean Sq F value Pr(>F)
## sale 1 401374 401374 2235.5702 <2e-16 ***
## full 1 346 346 1.9259 0.1791
## half 1 134 134 0.7440 0.3977
## bedrooms 1 4 4 0.0209 0.8864
## rooms 1 24 24 0.1326 0.7192
## neighborhood 1 233 233 1.2973 0.2670
## Residuals 22 3950 180
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
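One caveat when reading the anova() tables: for an lm fit, anova() tests the terms sequentially, in the order they appear in the formula, so a variable entered first gets credit for any variation it shares with later variables. The sketch below illustrates this with only two predictors and the first five homes as displayed by str(); homes_head and the fit names are hypothetical, and five rows are far too few to draw conclusions from.

```r
# First five homes as displayed (possibly rounded) in the str() output.
homes_head <- data.frame(
  list         = c(80, 151, 310, 295, 339),
  sale         = c(118, 151, 300, 275, 340),
  neighborhood = c(1, 1, 3, 3, 4)
)

# Same two predictors, entered in opposite orders.
fit_list_first <- lm(sale ~ list + neighborhood, data = homes_head)
fit_nbhd_first <- lm(sale ~ neighborhood + list, data = homes_head)

# Sequential sum of squares credited to list in each ordering.
ss_list_first <- anova(fit_list_first)["list", "Sum Sq"]
ss_list_last  <- anova(fit_nbhd_first)["list", "Sum Sq"]

print(c(list_entered_first = ss_list_first, list_entered_last = ss_list_last))
```

The total explained sum of squares is the same in both orderings; only the split between the variables changes.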
Are there differences from the sale price?
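The biggest differences are in the neighborhood rank and the full and half bathrooms. Their p-values all dropped noticeably in the list price model (for example, neighborhood went from 0.7704 to 0.2670), although none of them falls below 0.05.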
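To make the comparison easier to read, the “Pr(>F)” values from the two ANOVA tables can be placed side by side. The values below are copied from the output above; p_compare is an illustrative name.

```r
# Pr(>F) values copied from anova(sale_model) and anova(list_model).
p_compare <- data.frame(
  term       = c("full", "half", "bedrooms", "rooms", "neighborhood"),
  sale_model = c(0.3774, 0.7441, 0.7204, 0.9065, 0.7704),
  list_model = c(0.1791, 0.3977, 0.8864, 0.7192, 0.2670)
)

# Negative change = lower p-value in the list price model.
p_compare$change <- p_compare$list_model - p_compare$sale_model

# Flag any term that reaches the usual 0.05 threshold in the list price model.
p_compare$below_05 <- p_compare$list_model < 0.05

print(p_compare)
```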
Could you use this information to recommend which characteristic of a house a real estate agent should concentrate on?
Yes, I would recommend that the real estate agent focus more on the neighborhood the house is located in and on the number of bathrooms in the house.
What is the effect of neighborhood on the difference between sale price and list price? Do richer neighborhoods mean it is more likely to have a house go over the asking price?
Houses in richer neighborhoods are listed at higher prices, and their sales prices also tend to be higher. However, the neighborhood’s influence is more significant on the list price than on the sales price. This does not imply that these homes are more likely to sell above the asking price.
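One direct way to look at this question is to compute the difference between sale and list price for each home and summarize it by neighborhood. The following is a minimal sketch on the first five homes as displayed by str() (homes_head and over_list are hypothetical names); with so few rows it only shows the mechanics and says nothing about the full data.

```r
# First five homes as displayed (possibly rounded) in the str() output.
homes_head <- data.frame(
  list         = c(80, 151, 310, 295, 339),
  sale         = c(118, 151, 300, 275, 340),
  neighborhood = c(1, 1, 3, 3, 4)
)

# Positive values mean the home sold above its asking price.
homes_head$over_list <- homes_head$sale - homes_head$list

# Average difference and share of homes sold above list, per neighborhood.
mean_diff  <- tapply(homes_head$over_list, homes_head$neighborhood, mean)
share_over <- tapply(homes_head$over_list > 0, homes_head$neighborhood, mean)

print(mean_diff)
print(share_over)
```

Running the same steps on homes would show whether the higher-ranked neighborhoods actually sell above the asking price more often.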
What do these results mean?
The data indicate that the numbers of bedrooms and total rooms have the least correlation with the sale price. Full and half bathrooms show a higher correlation than rooms, but not nearly as strong as the neighborhood ranking. Overall, the list price is the most strongly correlated with the sale price of the house, with the neighborhood ranking following in its effect.