`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
I then made five scatter plots of the variables List Price, Number of Bedrooms, Number of Non-Bedrooms, Bathrooms and Half Bathrooms against the sale price. I did this using ggplot by setting the dataset to home, the x variable to sale and the y variable to the variable being tested. I set the type of plot to scatter using “geom_point()” then added a smoothed trend line using “geom_smooth()”, which defaults to a loess fit here rather than a straight regression line.
ggplot(home, aes(x = sale, y = list)) + geom_point() + geom_smooth() + ggtitle("Sale vs. List Price")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(home, aes(x = sale, y = bedrooms)) + geom_point() + geom_smooth() + ggtitle("Sale vs. Number of Bedrooms")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(home, aes(x = sale, y = rooms)) + geom_point() + geom_smooth() + ggtitle("Sale vs. Number of Non-Bedrooms")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(home, aes(x = sale, y = full)) + geom_point() + geom_smooth() + ggtitle("Sale vs. Bathrooms")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(home, aes(x = sale, y = half)) + geom_point() + geom_smooth() + ggtitle("Sale vs. Half Bathrooms")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
For the neighborhood variable I made a box plot instead, which required converting neighborhood into a factor.
ggplot(home, aes(x = factor(neighborhood), y = sale)) + geom_boxplot() + ggtitle("Sale and Neighborhood Wealth")
Based on the plots, sale and list price are the most strongly correlated, which is expected and obvious; there is no need to test for a relationship between those variables in a real situation. While the other plots all technically showed a positive correlation, half bathrooms was the least linear. Full bathrooms also appeared slightly weaker than bedrooms and non-bedroom rooms, which themselves were not strong because of outliers at the mid-range sale prices. The box plot of neighborhood against price seemed stronger than the room plots, based on the lack of overlap between the boxes that contain the bulk of the data. Overall, the relationships between the variables were positive and roughly linear.
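The visual comparison of correlation strength could also be checked numerically. The following is a minimal sketch using “cor()” on simulated data; the data frame “home_sim” and all of its values are hypothetical stand-ins, since the real home dataset is not reproduced here.

```r
# Hypothetical stand-in for the `home` data (29 simulated rows);
# these are NOT the real values from the report.
set.seed(1)
n <- 29
home_sim <- data.frame(
  bedrooms = sample(2:5, n, replace = TRUE),
  rooms    = sample(3:7, n, replace = TRUE),
  full     = sample(1:3, n, replace = TRUE),
  half     = sample(0:2, n, replace = TRUE)
)
home_sim$list <- 50 + 30 * home_sim$bedrooms + 20 * home_sim$full + rnorm(n, sd = 25)
home_sim$sale <- 0.97 * home_sim$list + rnorm(n, sd = 13)

# Pearson correlation of each predictor with sale price, strongest first
predictors <- c("list", "bedrooms", "rooms", "full", "half")
cors <- sapply(predictors, function(v) cor(home_sim$sale, home_sim[[v]]))
print(round(sort(cors, decreasing = TRUE), 3))
```

With the real data, replacing “home_sim” with home would give one correlation coefficient per scatter plot, which makes the ranking of the plots less dependent on visual judgment.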
Multiple Linear Regression for Sale Price
modelforsale <- lm(sale ~ full + half + rooms + bedrooms + neighborhood + list, data = home)
Using the linear model function in R (“lm()”), I created a model object for the sale price and set the formula so that sale is explained by all the other variables.
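Written out, the formula passed to “lm()” corresponds to the following regression equation. This is a sketch of the standard multiple regression form; the symbols are notation introduced here, and neighborhood is treated as a single numeric term, which is what the one coefficient it receives in the output suggests.

```latex
\text{sale}_i = \beta_0 + \beta_1\,\text{full}_i + \beta_2\,\text{half}_i
  + \beta_3\,\text{rooms}_i + \beta_4\,\text{bedrooms}_i
  + \beta_5\,\text{neighborhood}_i + \beta_6\,\text{list}_i + \varepsilon_i
```

Each estimate in the summary output is the fitted value of one of the coefficients, and the residuals are the fitted values of the error term.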
summary(modelforsale)
Call:
lm(formula = sale ~ full + half + rooms + bedrooms + neighborhood +
list, data = home)
Residuals:
    Min      1Q  Median      3Q     Max
-28.807  -6.626  -0.270   5.580  32.933
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.13359   17.15496   0.299    0.768
full         -4.97759    5.48033  -0.908    0.374
half         -1.00644    5.70418  -0.176    0.862
rooms        -0.43411    3.70424  -0.117    0.908
bedrooms      2.49224    6.43616   0.387    0.702
neighborhood  2.03434    6.88609   0.295    0.770
list          0.97131    0.07616  12.754 1.22e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.87 on 22 degrees of freedom
Multiple R-squared: 0.989, Adjusted R-squared: 0.986
F-statistic: 330.5 on 6 and 22 DF, p-value: < 2.2e-16
The summary function returns the coefficients, residuals and significance levels.
None of the variables besides the list price were statistically significant; all the others had p-values above 0.1. Of those other variables, full bathrooms had the lowest p-value (0.374), the second lowest overall. The adjusted R-squared of 0.986 was close enough to one to justify the model as an explanation of the sale price.
anova(modelforsale)
Analysis of Variance Table
Response: sale
             Df Sum Sq Mean Sq  F value    Pr(>F)
full          1 151632  151632 788.5513 < 2.2e-16 ***
half          1  87430   87430 454.6719 3.479e-16 ***
rooms         1  19851   19851 103.2341 9.030e-10 ***
bedrooms      1    362     362   1.8827    0.1839
neighborhood  1  90717   90717 471.7666 2.359e-16 ***
list          1  31280   31280 162.6723 1.222e-11 ***
Residuals    22   4230     192
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
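The ANOVA test for the model returns the significance of the predictors in the multiple regression model. The p-value is used to interpret the significance. Unlike the summary, “anova()” tests each variable in the order it was entered, after only the variables listed before it, which is why these p-values differ from those in the summary.
The results showed that all variables besides bedrooms were significant predictors.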
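Because “anova()” on a single “lm” fit reports sequential tests, its p-values depend on the order of the terms, while the t-tests in “summary()” assess each term after all the others. Only the last row, list, agrees between the two outputs (p = 1.22e-11 in both). The sketch below illustrates this on simulated data; the variables “x1”, “x2” and “y” are hypothetical and not taken from the home dataset.

```r
set.seed(2)
n <- 29
# Hypothetical, simulated data: x1 and x2 are strongly related,
# and y depends directly on x2 only.
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
y  <- 2 * x2 + rnorm(n)
sim <- data.frame(y, x1, x2)

fit_a <- lm(y ~ x1 + x2, data = sim)  # x1 entered first
fit_b <- lm(y ~ x2 + x1, data = sim)  # x1 entered last

# Sequential tests: each row is tested after only the terms listed above it,
# so the row for x1 changes with the order of the formula.
print(anova(fit_a))
print(anova(fit_b))

# Marginal tests: each term is tested after all the other terms,
# so the result does not depend on the order.
print(drop1(fit_a, test = "F"))
```

Applying “drop1(modelforsale, test = "F")” to the real model would be one way to obtain order-independent F-tests that line up with the p-values in the summary.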
ggplot was then used to plot the residuals in order to assess the validity of the model. The residuals should appear normally distributed.
To do this, the fitted values were plotted against the residuals using a scatter plot.
ggplot(modelforsale, aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept = 0) + ggtitle("Residual for Sale Model")
The residuals do show that the model is valid based on the distribution.
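A residuals-versus-fitted plot mainly shows whether the residuals are centered on zero with an even spread, so normality could be checked more directly as a supplement. The following is a minimal sketch using the Shapiro-Wilk test and Q-Q coordinates from base R; the data and the model “fit” are simulated and hypothetical, standing in for modelforsale.

```r
set.seed(3)
n <- 29
# Hypothetical, simulated data standing in for the fitted sale model
sim <- data.frame(list = runif(n, 100, 600))
sim$sale <- 5 + 0.97 * sim$list + rnorm(n, sd = 14)
fit <- lm(sale ~ list, data = sim)

res <- resid(fit)

# Shapiro-Wilk test: a small p-value would be evidence against normality
sw <- shapiro.test(res)
print(sw)

# Q-Q coordinates without drawing a plot; points close to a straight line
# (correlation near 1) are consistent with normally distributed residuals
qq <- qqnorm(res, plot.it = FALSE)
print(cor(qq$x, qq$y))
```

With the real model, “resid(modelforsale)” would take the place of “res”.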
Multiple Linear Regression for List Price
The same process was repeated to create the model for the list price.
modelforlist <- lm(list ~ full + half + rooms + bedrooms + neighborhood + sale, data = home)
summary(modelforlist)
Call:
lm(formula = list ~ full + half + rooms + bedrooms + neighborhood +
sale, data = home)
Residuals:
     Min       1Q   Median       3Q      Max
-27.8544  -6.7013  -0.7265   6.7894  31.3427
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -21.8752    15.9419  -1.372    0.184
full           8.3411     5.0923   1.638    0.116
half           6.3398     5.3475   1.186    0.248
rooms          1.2426     3.5706   0.348    0.731
bedrooms      -0.0627     6.2402  -0.010    0.992
neighborhood   7.3793     6.4787   1.139    0.267
sale           0.9069     0.0711  12.754 1.22e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.4 on 22 degrees of freedom
Multiple R-squared: 0.9903, Adjusted R-squared: 0.9876
F-statistic: 373.3 on 6 and 22 DF, p-value: < 2.2e-16
The summary of the multiple regression for list price returned generally lower p-values than those found in the sale price model. Once again, the only variable deemed significant by its p-value was the other price variable, in this case the sale price. Bedrooms had the highest p-value (0.992). The adjusted R-squared was 0.9876, which is close enough to 1 to justify the model.
anova(modelforlist)
Analysis of Variance Table
Response: list
             Df Sum Sq Mean Sq  F value    Pr(>F)
full          1 169594  169594 944.6042 < 2.2e-16 ***
half          1  92249   92249 513.8081 < 2.2e-16 ***
rooms         1  19349   19349 107.7693 6.085e-10 ***
bedrooms      1    558     558   3.1071   0.09184 .
neighborhood  1  91158   91158 507.7299 < 2.2e-16 ***
sale          1  29206   29206 162.6723 1.222e-11 ***
Residuals    22   3950     180
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(modelforlist, aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept = 0) + ggtitle("Residual for List Model")
The ANOVA table led to the same conclusions as the one for the sale price model: all variables except the number of bedrooms were significant predictors. The residuals were also normally distributed.
Overall, comparing the results for the sale price and the list price, it makes practical sense from a human perspective that the predictors are more accurate for the list price. A house having more rooms means the seller expects people to pay more, but the combination of quality, size and other factors determines what the house is actually worth to a buyer. One might think that bedrooms would also be a main predictor of price. However, if household size stays consistent, a family does not need more bedrooms and may instead pay extra for a different type of room. Outside of the list or sale price, the number of full bathrooms consistently had the highest significance level in both models, so the advice to a real estate agent would be to focus on full bathrooms first, then half bathrooms.
Neighborhoods on Sale vs. List Price
To explore the connection between neighborhood wealth and the gap between sale and list price, a regression model was created with the difference between sale and list price as the response.
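The code that built this model is not included with the output. A minimal sketch of how it might look is given here, assuming the difference is defined as sale minus list (as the later plot title “Sale - List Price for Neighborhoods” suggests) and keeping the column name as it is spelled in the output. The data frame “home_sim”, its values and the 1 to 4 neighborhood rank are hypothetical.

```r
set.seed(4)
n <- 29
# Hypothetical, simulated stand-in for `home`; the 1-4 wealth rank is an assumption
home_sim <- data.frame(
  neighborhood = sample(1:4, n, replace = TRUE),
  list         = runif(n, 100, 600)
)
home_sim$sale <- home_sim$list + rnorm(n, sd = 13)

# Assumed definition: sale price minus list price
# (column name spelled as it appears in the model output)
home_sim$differnce <- home_sim$sale - home_sim$list

# Regress the difference on the numeric neighborhood rank
n_diff <- lm(differnce ~ neighborhood, data = home_sim)
print(summary(n_diff))
```

With 29 rows and two estimated coefficients, this leaves 27 residual degrees of freedom, which matches the reported output.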
Call:
lm(formula = differnce ~ neighborhood, data = home)
Residuals:
   Min     1Q Median     3Q    Max
-30.05  -7.50  -0.85   5.80  33.05
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)     7.800      7.435   1.049    0.303
neighborhood   -3.150      2.428  -1.298    0.205
Residual standard error: 13 on 27 degrees of freedom
Multiple R-squared: 0.0587, Adjusted R-squared: 0.02383
F-statistic: 1.684 on 1 and 27 DF, p-value: 0.2054
Then another box plot was created to visualize how the difference between sale and list price varies with the rank of neighborhood wealth, as well as a plot of the residuals to test the validity of the model.
ggplot(home, aes(x = factor(neighborhood), y = differnce)) + geom_boxplot() + geom_hline(yintercept = 0) + ggtitle("Sale - List Price for Neighborhoods")
ggplot(n_diff, aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept = 0) + ggtitle("Residual for Neighborhood")
While the p-value of the regression model (0.205) was not low enough to suggest a correlation between neighborhood and the difference between list and sale price, it was lower than the p-values obtained earlier for the bedroom and non-bedroom predictors.
Looking at the residuals, there was not enough data to support the model, as the richest and poorest neighborhoods had only four data points in total.
The correlation between neighborhood wealth and the difference between sale and list price was predicted to be positive; however, the visual trend of the box plot is the opposite. Tying this to a human perspective, as property values in the housing market are increasing faster than income levels, there is more interest in houses that are affordable for more people. These houses would likely be in poorer neighborhoods, which creates more demand and therefore more buyers offering above the listing price. By contrast, a house with little demand might sell for less than the listing price.