Project Intro
For this project we look at the prices of homes sold in a New Jersey town in 2001. We will examine several variables to understand the relationship between house characteristics and sale price.
Let's start by loading the necessary libraries.
library(ggplot2)
Then we will read in the data and inspect its structure.
homes = read.csv("homeprice.csv")
str(homes)
## 'data.frame': 29 obs. of 7 variables:
## $ list : num 80 151 310 295 339 ...
## $ sale : num 118 151 300 275 340 ...
## $ full : int 1 1 2 2 2 1 3 1 1 1 ...
## $ half : int 0 0 1 1 0 1 0 1 2 0 ...
## $ bedrooms : int 3 4 4 4 3 4 3 3 3 1 ...
## $ rooms : int 6 7 9 8 7 8 7 7 7 3 ...
## $ neighborhood: int 1 1 3 3 4 3 2 2 3 2 ...
Part 1:
Explore the relationship between sale price and other variables.
List Price vs. Sale Price
ggplot(homes, aes(x = list, y = sale)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Scatter Plot of List Price vs Sale Price") +
xlab("List Price") +
ylab("Sale Price") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
Number of Full Bathrooms vs. Sale Price
ggplot(homes, aes(x = full, y = sale)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Scatter Plot of Number of Full Bathrooms vs Sale Price") +
xlab("Number of Full Bathrooms") +
ylab("Sale Price") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
Number of Half Bathrooms vs. Sale Price
ggplot(homes, aes(x = half, y = sale)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Scatter Plot of Number of Half Bathrooms vs Sale Price") +
xlab("Number of Half Bathrooms") +
ylab("Sale Price") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
Number of Bedrooms vs. Sale Price
ggplot(homes, aes(x = bedrooms, y = sale)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Number of Bedrooms vs Sale Price") +
xlab("Number of Bedrooms") +
ylab("Sale Price") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
Number of Non-bedrooms vs. Sale Price
ggplot(homes, aes(x = rooms, y = sale)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Number of non-bedrooms vs Sale Price") +
xlab("Number of non-bedrooms") +
ylab("Sale Price") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
Neighborhood Rank vs. Sale Price
ggplot(homes, aes(x = neighborhood, y = sale)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Neighborhood Rank vs Sale Price") +
xlab("Neighborhood Rank") +
ylab("Sale Price") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
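The six plots above share the same structure, so they could also be produced by one small helper. The following is only a sketch: the function name `plot_vs_sale` and the `axis_labels` vector are hypothetical and not part of the original analysis, and the loop assumes `homes` has been loaded as shown earlier.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of one predictor against sale price
# with a least-squares line, styled like the plots above.
plot_vs_sale <- function(df, var, label) {
  ggplot(df, aes(x = .data[[var]], y = .data[["sale"]])) +
    geom_point() +
    # Giving the formula explicitly avoids the `geom_smooth()` message.
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    ggtitle(paste(label, "vs Sale Price")) +
    xlab(label) +
    ylab("Sale Price") +
    theme_bw()
}

# Assumes `homes` was loaded with read.csv() as shown earlier.
if (exists("homes")) {
  axis_labels <- c(
    list         = "List Price",
    full         = "Number of Full Bathrooms",
    half         = "Number of Half Bathrooms",
    bedrooms     = "Number of Bedrooms",
    rooms        = "Number of non-bedrooms",
    neighborhood = "Neighborhood Rank"
  )
  for (v in names(axis_labels)) {
    print(plot_vs_sale(homes, v, axis_labels[[v]]))
  }
}
```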
Identify the variables that appear to have the strongest relationship with Sale Price
Judging from the plots and their lines of best fit, list price and neighborhood rank are the variables with the strongest relationships with sale price: as the list price or the neighborhood rank increases, the sale price tends to increase as well. In these pairwise plots, every variable shows a positive relationship with sale price.
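Reading strength off a scatter plot is subjective, so one way to back up the visual impression is to rank the predictors by their Pearson correlation with sale price. This is a minimal sketch: the helper name `cor_with_response` is hypothetical, and the final call assumes `homes` is loaded as above.

```r
# Hypothetical helper: Pearson correlation of every other column with the
# response, sorted from strongest to weakest by absolute value.
cor_with_response <- function(df, response = "sale") {
  predictors <- setdiff(names(df), response)
  r <- sapply(predictors, function(v) cor(df[[v]], df[[response]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Assumes `homes` was loaded with read.csv() as shown earlier.
if (exists("homes")) print(round(cor_with_response(homes), 3))
```

Note that `neighborhood` is an ordinal rank, so a rank-based measure such as `cor(..., method = "spearman")` may be a more appropriate choice for that column.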
Part 2:
Build a multiple linear regression model using variables to explain the sale price
sale_m = lm(sale ~ list + full + half + bedrooms + rooms + neighborhood, data = homes)
sale_m
##
## Call:
## lm(formula = sale ~ list + full + half + bedrooms + rooms + neighborhood,
## data = homes)
##
## Coefficients:
## (Intercept) list full half bedrooms
## 5.1336 0.9713 -4.9776 -1.0064 2.4922
## rooms neighborhood
## -0.4341 2.0343
Use the summary() function to find the coefficients and goodness of fit of the model.
summary(sale_m)
##
## Call:
## lm(formula = sale ~ list + full + half + bedrooms + rooms + neighborhood,
## data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.807 -6.626 -0.270 5.580 32.933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.13359 17.15496 0.299 0.768
## list 0.97131 0.07616 12.754 1.22e-11 ***
## full -4.97759 5.48033 -0.908 0.374
## half -1.00644 5.70418 -0.176 0.862
## bedrooms 2.49224 6.43616 0.387 0.702
## rooms -0.43411 3.70424 -0.117 0.908
## neighborhood 2.03434 6.88609 0.295 0.770
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.87 on 22 degrees of freedom
## Multiple R-squared: 0.989, Adjusted R-squared: 0.986
## F-statistic: 330.5 on 6 and 22 DF, p-value: < 2.2e-16
Use the anova() function to identify which variable appears to have the greatest effect on sale price.
anova(sale_m)
## Analysis of Variance Table
##
## Response: sale
## Df Sum Sq Mean Sq F value Pr(>F)
## list 1 381050 381050 1981.6252 <2e-16 ***
## full 1 156 156 0.8116 0.3774
## half 1 21 21 0.1092 0.7441
## bedrooms 1 25 25 0.1314 0.7204
## rooms 1 3 3 0.0141 0.9065
## neighborhood 1 17 17 0.0873 0.7704
## Residuals 22 4230 192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA table, the variable with the greatest effect on sale price is the list price. It has by far the largest F value (1981.6) and a very low p-value (< 2e-16), while none of the other variables is statistically significant.
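One caveat: anova() on a single lm fit reports sequential sums of squares, so each row is tested after the terms listed before it and the result depends on the order of the formula. As a rough cross-check, drop1() tests each term after all the others. The sketch below places the two sets of p-values side by side. The helper name `compare_term_tests` is hypothetical, and the final call assumes `sale_m` has been fitted as above.

```r
# Hypothetical helper: compare sequential p-values from anova() with the
# order-independent single-term deletion p-values from drop1().
compare_term_tests <- function(model) {
  term_names <- attr(terms(model), "term.labels")
  seq_tab    <- anova(model)
  drop_tab   <- drop1(model, test = "F")
  data.frame(
    term         = term_names,
    sequential_p = seq_tab[term_names, "Pr(>F)"],
    drop1_p      = drop_tab[term_names, "Pr(>F)"]
  )
}

# Assumes `sale_m` was fitted with lm() as shown earlier.
if (exists("sale_m")) print(compare_term_tests(sale_m))
```

For terms with one degree of freedom, the drop1() F-test p-values match the t-test p-values in the summary() output, and the last term in the formula has the same p-value in both columns.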
Part 3:
Build a second model using the same variables to explain the list price.
list_m = lm(list ~ sale + full + half + bedrooms + rooms + neighborhood, data = homes)
list_m
##
## Call:
## lm(formula = list ~ sale + full + half + bedrooms + rooms + neighborhood,
## data = homes)
##
## Coefficients:
## (Intercept) sale full half bedrooms
## -21.8752 0.9069 8.3411 6.3398 -0.0627
## rooms neighborhood
## 1.2426 7.3793
Use the anova() function to identify which variable has the greatest effect on list price.
anova(list_m)
## Analysis of Variance Table
##
## Response: list
## Df Sum Sq Mean Sq F value Pr(>F)
## sale 1 401374 401374 2235.5702 <2e-16 ***
## full 1 346 346 1.9259 0.1791
## half 1 134 134 0.7440 0.3977
## bedrooms 1 4 4 0.0209 0.8864
## rooms 1 24 24 0.1326 0.7192
## neighborhood 1 233 233 1.2973 0.2670
## Residuals 22 3950 180
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA table, sale price appears to have the greatest effect on list price. However, because the sale price follows from the list price rather than the other way around, we look at the next variable instead. That is the number of full bathrooms, which has the lowest p-value of the remaining variables (0.1791). Neighborhood has the next-lowest p-value (0.2670). Neither is statistically significant at the 0.05 level, so these effects are suggestive at best.
Are there differences from the sale price model? Yes. For list price, the number of full bathrooms and the neighborhood rank account for more of the variation than they do for sale price (sums of squares of 346 and 233, versus 156 and 17), although neither is statistically significant.
Could you use this information to recommend which characteristics of a house a real estate agent should concentrate on? Yes. My recommendation would be to focus mainly on the number of full bathrooms and the rank of the neighborhood.
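Because the sale price is only known after a home has been listed, one possible follow-up is to refit the list price model with house characteristics only. This is a sketch of that idea, not part of the original analysis. The helper name `fit_list_without_sale` and the object `list_m2` are hypothetical, and the call assumes `homes` is loaded as above.

```r
# Hypothetical helper: model list price from house characteristics only,
# leaving out sale price, which is not known at listing time.
fit_list_without_sale <- function(df) {
  lm(list ~ full + half + bedrooms + rooms + neighborhood, data = df)
}

# Assumes `homes` was loaded with read.csv() as shown earlier.
if (exists("homes")) {
  list_m2 <- fit_list_without_sale(homes)
  print(summary(list_m2))
  # Order-independent F test for each characteristic.
  print(drop1(list_m2, test = "F"))
}
```

With sale price removed, the remaining coefficients show how each characteristic relates to the asking price on its own terms, which is closer to the question a real estate agent would ask.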
Part 4:
What is the effect of neighborhood on the difference between sale price and list price? Do richer neighborhoods make it more likely that a house sells over the asking price? Typically, when a neighborhood is ranked higher, both the sale price and the list price are higher. I think this suggests that richer neighborhoods are more likely to have a house go over the asking price, although confirming that would require comparing the difference between sale and list price across neighborhoods directly. The quality or rank of a neighborhood definitely plays a role in the list price, but it is important to consider other variables. While bedrooms did not have a large effect in this example, that may not always be the case. It would also be interesting to look at other variables, such as house size, lot size, and other amenities that add value to a house.
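The question about going over the asking price can be checked directly by computing the difference between sale and list price and summarizing it per neighborhood rank. The following is a minimal sketch. The helper name `premium_by_neighborhood` and its output column names are hypothetical, and the call assumes `homes` is loaded as above.

```r
# Hypothetical helper: per neighborhood rank, report the number of homes,
# the mean of (sale - list), and the share of homes that sold over list.
premium_by_neighborhood <- function(df) {
  df$diff <- df$sale - df$list
  data.frame(
    neighborhood = sort(unique(df$neighborhood)),
    n            = as.vector(tapply(df$diff, df$neighborhood, length)),
    mean_diff    = as.vector(tapply(df$diff, df$neighborhood, mean)),
    share_over   = as.vector(tapply(df$diff > 0, df$neighborhood, mean))
  )
}

# Assumes `homes` was loaded with read.csv() as shown earlier.
if (exists("homes")) print(premium_by_neighborhood(homes))
```

If `share_over` rises with neighborhood rank, that would support the idea that homes in richer neighborhoods sell over asking more often. With only 29 homes, each group is small, so any pattern should be read cautiously.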