Project Intro

For this project we are looking at the prices of homes sold in a New Jersey town in 2001. We will examine several house characteristics and their relationship to the sale price.

Let's start by loading the necessary libraries.

library(ggplot2)

Then we will read in the data and inspect its structure.

homes = read.csv("homeprice.csv")
str(homes)
## 'data.frame':    29 obs. of  7 variables:
##  $ list        : num  80 151 310 295 339 ...
##  $ sale        : num  118 151 300 275 340 ...
##  $ full        : int  1 1 2 2 2 1 3 1 1 1 ...
##  $ half        : int  0 0 1 1 0 1 0 1 2 0 ...
##  $ bedrooms    : int  3 4 4 4 3 4 3 3 3 1 ...
##  $ rooms       : int  6 7 9 8 7 8 7 7 7 3 ...
##  $ neighborhood: int  1 1 3 3 4 3 2 2 3 2 ...

Part 1:

Explore the relationship between sale price and other variables.

List Price vs. Sale Price

ggplot(homes, aes(x = list, y = sale)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Scatter Plot of List Price vs Sale Price") +
  xlab("List Price") +
  ylab("Sale Price") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

Number of Full Bathrooms vs. Sale Price

ggplot(homes, aes(x = full, y = sale)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Scatter Plot of Number of Full Bathrooms vs Sale Price") +
  xlab("Number of Full Bathrooms") +
  ylab("Sale Price") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

Number of Half Bathrooms vs. Sale Price

ggplot(homes, aes(x = half, y = sale)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Scatter Plot of Number of Half Bathrooms vs Sale Price") +
  xlab("Number of Half Bathrooms") +
  ylab("Sale Price") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

Number of Bedrooms vs. Sale Price

ggplot(homes, aes(x = bedrooms, y = sale)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Number of Bedrooms vs Sale Price") +
  xlab("Number of Bedrooms") +
  ylab("Sale Price") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

Number of Non-bedrooms vs. Sale Price

ggplot(homes, aes(x = rooms, y = sale)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Number of non-bedrooms vs Sale Price") +
  xlab("Number of non-bedrooms") +
  ylab("Sale Price") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

Neighborhood Rank vs. Sale Price

ggplot(homes, aes(x = neighborhood, y = sale)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Neighborhood Rank vs Sale Price") +
  xlab("Neighborhood Rank") +
  ylab("Sale Price") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

Identify the variables that appear to have the strongest relationship with Sale Price

Looking at the graphs and their lines of best fit, neighborhood rank and list price appear to have the strongest relationships with sale price: as the list price goes up and the neighborhood rank increases, the sale price tends to increase as well. Every variable shows a positive relationship with sale price, judging by the slope of its line of best fit.
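To put a number on the visual impression, one option is to compute the correlation of each variable with sale price. The sketch below is illustrative only: demo is a hypothetical stand-in built from the first five rows as displayed (and possibly rounded) in the str() output above, so its values say little about the full data set. In the real analysis, homes would take its place.

```r
# Hypothetical stand-in: first five rows as displayed by str(homes) above
# (values may be rounded by str(); use the full `homes` data frame in practice).
demo <- data.frame(
  list         = c(80, 151, 310, 295, 339),
  sale         = c(118, 151, 300, 275, 340),
  full         = c(1, 1, 2, 2, 2),
  half         = c(0, 0, 1, 1, 0),
  bedrooms     = c(3, 4, 4, 4, 3),
  rooms        = c(6, 7, 9, 8, 7),
  neighborhood = c(1, 1, 3, 3, 4)
)

# Pearson correlation of every column with sale price
sale_cor <- cor(demo)["sale", ]

# Drop the trivial correlation of sale with itself
sale_cor <- sale_cor[names(sale_cor) != "sale"]

# Rank the predictors by the strength of their linear association
ranked <- sort(abs(sale_cor), decreasing = TRUE)
print(round(ranked, 3))
```

A ranking like this only captures linear, one-variable-at-a-time association, so it complements the scatter plots rather than replacing them.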

Part 2:

Build a multiple linear regression model using variables to explain the sale price

sale_m = lm(sale ~ list + full + half + bedrooms + rooms + neighborhood, data = homes)
sale_m
## 
## Call:
## lm(formula = sale ~ list + full + half + bedrooms + rooms + neighborhood, 
##     data = homes)
## 
## Coefficients:
##  (Intercept)          list          full          half      bedrooms  
##       5.1336        0.9713       -4.9776       -1.0064        2.4922  
##        rooms  neighborhood  
##      -0.4341        2.0343

Use the summary() function to find the coefficients and goodness of fit of the model.

summary(sale_m)
## 
## Call:
## lm(formula = sale ~ list + full + half + bedrooms + rooms + neighborhood, 
##     data = homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.807  -6.626  -0.270   5.580  32.933 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.13359   17.15496   0.299    0.768    
## list          0.97131    0.07616  12.754 1.22e-11 ***
## full         -4.97759    5.48033  -0.908    0.374    
## half         -1.00644    5.70418  -0.176    0.862    
## bedrooms      2.49224    6.43616   0.387    0.702    
## rooms        -0.43411    3.70424  -0.117    0.908    
## neighborhood  2.03434    6.88609   0.295    0.770    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.87 on 22 degrees of freedom
## Multiple R-squared:  0.989,  Adjusted R-squared:  0.986 
## F-statistic: 330.5 on 6 and 22 DF,  p-value: < 2.2e-16
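The numbers printed by summary() can also be pulled out programmatically, which helps when comparing models. This is a minimal sketch, assuming a simplified one-predictor model fitted to demo, a hypothetical five-row stand-in taken from the str() output above (five rows cannot support the full six-predictor model, so only list is used here).

```r
# Hypothetical stand-in: first five rows as displayed by str(homes) above.
demo <- data.frame(
  list = c(80, 151, 310, 295, 339),
  sale = c(118, 151, 300, 275, 340)
)

# Simplified model for illustration; the report's model uses six predictors
fit <- lm(sale ~ list, data = demo)
fit_summary <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(fit_summary)
p_values <- coef_table[, "Pr(>|t|)"]

# Goodness-of-fit measures
r2 <- fit_summary$r.squared
adj_r2 <- fit_summary$adj.r.squared

print(round(coef_table, 4))
print(c(r.squared = r2, adj.r.squared = adj_r2))
```

Applied to sale_m, the same calls would return the values shown in the summary output above.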

Use the anova() function to identify which variable appears to have the greatest effect on sale price.

anova(sale_m)
## Analysis of Variance Table
## 
## Response: sale
##              Df Sum Sq Mean Sq   F value Pr(>F)    
## list          1 381050  381050 1981.6252 <2e-16 ***
## full          1    156     156    0.8116 0.3774    
## half          1     21      21    0.1092 0.7441    
## bedrooms      1     25      25    0.1314 0.7204    
## rooms         1      3       3    0.0141 0.9065    
## neighborhood  1     17      17    0.0873 0.7704    
## Residuals    22   4230     192                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the anova results, the variable with the greatest effect on sale price is the list price. It has by far the largest F value (1981.63) and a very low p-value (< 2e-16), while none of the other variables is significant at the 0.05 level.
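One caveat when reading this table: anova() on a single lm fit reports sequential (Type I) sums of squares, so each row measures what a variable adds after the variables entered before it. Because list is entered first, it absorbs any variation it shares with the other predictors. The sketch below illustrates this order dependence on demo, a hypothetical five-row stand-in taken from the str() output above, using only two predictors because five rows cannot support more.

```r
# Hypothetical stand-in: first five rows as displayed by str(homes) above.
demo <- data.frame(
  list         = c(80, 151, 310, 295, 339),
  sale         = c(118, 151, 300, 275, 340),
  neighborhood = c(1, 1, 3, 3, 4)
)

# Same two predictors, entered in opposite orders
list_first <- anova(lm(sale ~ list + neighborhood, data = demo))
nbhd_first <- anova(lm(sale ~ neighborhood + list, data = demo))

# Sum of squares credited to neighborhood under each ordering
ss_after_list    <- list_first["neighborhood", "Sum Sq"]
ss_entered_first <- nbhd_first["neighborhood", "Sum Sq"]

print(c(after_list = ss_after_list, entered_first = ss_entered_first))
```

The total sum of squares is the same in both tables; only its attribution changes. An order-independent alternative is drop1(sale_m, test = "F"), which tests each term given all the others.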

Part 3:

Build a second model using the same variables to explain the list price.

list_m = lm(list ~ sale + full + half + bedrooms + rooms + neighborhood, data = homes)
list_m
## 
## Call:
## lm(formula = list ~ sale + full + half + bedrooms + rooms + neighborhood, 
##     data = homes)
## 
## Coefficients:
##  (Intercept)          sale          full          half      bedrooms  
##     -21.8752        0.9069        8.3411        6.3398       -0.0627  
##        rooms  neighborhood  
##       1.2426        7.3793

Use the anova() function to identify which variable has the greatest effect on list price.

anova(list_m)
## Analysis of Variance Table
## 
## Response: list
##              Df Sum Sq Mean Sq   F value Pr(>F)    
## sale          1 401374  401374 2235.5702 <2e-16 ***
## full          1    346     346    1.9259 0.1791    
## half          1    134     134    0.7440 0.3977    
## bedrooms      1      4       4    0.0209 0.8864    
## rooms         1     24      24    0.1326 0.7192    
## neighborhood  1    233     233    1.2973 0.2670    
## Residuals    22   3950     180                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the anova test, sale price appears to have the greatest effect on list price. However, because the sale price comes after the list price and cannot explain it, we look at the next variable instead: the number of full bathrooms has the greatest remaining effect on list price, with the largest F value (1.93) and the lowest p-value (0.179) among the other variables. Neighborhood rank comes next, with a p-value of 0.267. Neither is statistically significant at the 0.05 level, so this ranking is only suggestive.

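Are there differences from the sale price? Yes. In the sale price model, nothing besides list price comes close to mattering (full bathrooms: p = 0.377; neighborhood rank: p = 0.770). In the list price model, the number of full bathrooms (p = 0.179) and neighborhood rank (p = 0.267) explain more of the variation, although they still fall short of significance.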

Could you use this info to recommend which characteristic of a house a real estate agent should concentrate on? Yes, with caution. My recommendation would be to focus mainly on the number of full bathrooms and the rank of the neighborhood, since these are the characteristics most associated with list price once sale price is accounted for. With only 29 homes and no significant effects beyond sale price, the recommendation is tentative.

Part 4:

What is the effect of neighborhood on the difference between sale price and list price? Do richer neighborhoods mean a house is more likely to go over the asking price? Typically, when the neighborhood is ranked higher, the sale price and list price are both higher. I think this suggests that richer neighborhoods are more likely to have a house go over the asking price, although the models above do not test this directly; that would require looking at the difference between sale price and list price within each neighborhood. The quality/rank of a neighborhood appears to play a role in the list price, but it is important to consider other variables. While bedrooms did not have a large effect in this example, this may not always be the case. It would also be interesting to look at other variables, such as house size, lot size, and other amenities that add value to the house.
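One way to check the over-asking question directly is to compute the difference between sale price and list price and summarize it by neighborhood rank. The sketch below does this on demo, a hypothetical five-row stand-in taken from the str() output above; the column names diff and over_asking are introduced here for illustration. Five rows are far too few to support a conclusion, so the same steps would need to be run on the full homes data.

```r
# Hypothetical stand-in: first five rows as displayed by str(homes) above.
demo <- data.frame(
  list         = c(80, 151, 310, 295, 339),
  sale         = c(118, 151, 300, 275, 340),
  neighborhood = c(1, 1, 3, 3, 4)
)

# Positive diff means the house sold above its asking price
demo$diff <- demo$sale - demo$list
demo$over_asking <- demo$diff > 0

# Average premium (or discount) per neighborhood rank
mean_diff <- aggregate(diff ~ neighborhood, data = demo, FUN = mean)

# Share of homes that sold above asking per neighborhood rank
share_over <- aggregate(over_asking ~ neighborhood, data = demo, FUN = mean)

print(merge(mean_diff, share_over, by = "neighborhood"))
```

If the mean difference and the over-asking share both rise with neighborhood rank on the full data, that would support the answer given above; if not, the answer should be revised.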