This project uses a dataset containing home list and sale prices.
home = read.csv("homeprice.csv")
str(home)
## 'data.frame': 29 obs. of 7 variables:
## $ list : num 80 151 310 295 339 ...
## $ sale : num 118 151 300 275 340 ...
## $ full : int 1 1 2 2 2 1 3 1 1 1 ...
## $ half : int 0 0 1 1 0 1 0 1 2 0 ...
## $ bedrooms : int 3 4 4 4 3 4 3 3 3 1 ...
## $ rooms : int 6 7 9 8 7 8 7 7 7 3 ...
## $ neighborhood: int 1 1 3 3 4 3 2 2 3 2 ...
names(home)
## [1] "list" "sale" "full" "half" "bedrooms"
## [6] "rooms" "neighborhood"
library(ggplot2)
# Sale price and number of full bathrooms
ggplot(home, aes(x=sale, y=full)) + geom_point()
full2 = factor(home$full)
ggplot(home, aes(x=full2, y=sale)) + geom_boxplot()
# Sale price and number of half bathrooms
ggplot(home, aes(x=sale, y=half)) + geom_point()
half2 = factor(home$half)
ggplot(home, aes(x=half2, y=sale)) + geom_boxplot()
# Sale price and number of bedrooms
ggplot(home, aes(x=sale, y=bedrooms)) + geom_point()
bedrooms2 = factor(home$bedrooms)
ggplot(home, aes(x=bedrooms2, y=sale)) + geom_boxplot()
# Sale price and neighborhood of home
neighborhood2 = factor(home$neighborhood)
ggplot(home, aes(x=sale, y=neighborhood, col=neighborhood2)) + geom_point()
ggplot(home, aes(x=neighborhood2, y=sale)) + geom_boxplot()
The variables that appear to have the strongest relationship with sale price are neighborhood and the number of full bathrooms.
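As a rough numeric companion to the plots, the sketch below ranks predictors by their Pearson correlation with a response column. The helper name `cor_with_response` is hypothetical, and it assumes every column passed in is numeric; treating the neighborhood code as a number is itself an assumption about how the codes are ordered.

```r
# Hypothetical helper: Pearson correlation of each column with the response.
# Assumes all columns of `data` are numeric.
cor_with_response <- function(data, response = "sale") {
  predictors <- setdiff(names(data), response)
  r <- sapply(predictors, function(v) cor(data[[v]], data[[response]]))
  # Strongest positive correlation first
  sort(r, decreasing = TRUE)
}
```

For this dataset it could be called as `cor_with_response(home[, c("sale", "full", "half", "bedrooms", "neighborhood")])`, leaving out `list` because it is a price rather than a characteristic of the house.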
lm_sale = lm(sale ~ full + half + bedrooms + rooms + neighborhood,
data = home)
summary(lm_sale)
##
## Call:
## lm(formula = sale ~ full + half + bedrooms + rooms + neighborhood,
## data = home)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.31 -34.06 7.20 21.32 55.93
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -135.263 37.283 -3.628 0.00141 **
## full 26.225 13.896 1.887 0.07181 .
## half 43.242 12.830 3.370 0.00264 **
## bedrooms 20.409 17.798 1.147 0.26329
## rooms 6.488 10.383 0.625 0.53823
## neighborhood 77.243 10.077 7.665 8.86e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.29 on 23 degrees of freedom
## Multiple R-squared: 0.9079, Adjusted R-squared: 0.8879
## F-statistic: 45.34 on 5 and 23 DF, p-value: 3.686e-11
The coefficients can be seen under the ‘Coefficients:’ header of the summary output. Neighborhood has the largest estimate (77.24) and the smallest p-value (8.86e-08), so it is the variable with the greatest effect on sale price in this model; the number of half bathrooms is also significant (p = 0.0026). The goodness of fit of the model can be read from the multiple R-squared (0.9079), the F-statistic (45.34 on 5 and 23 DF), and its p-value (3.686e-11), all of which indicate that the model explains most of the variation in sale price.
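The fit statistics discussed above can also be pulled out of the summary object instead of being read off the printout. This is a minimal sketch; `fit_stats` is a hypothetical helper name, and it only relies on the `r.squared`, `adj.r.squared`, and `fstatistic` components that `summary()` returns for an `lm` fit.

```r
# Hypothetical helper: collect goodness-of-fit statistics from a fitted lm object
fit_stats <- function(model) {
  s <- summary(model)
  f <- s$fstatistic  # named vector: value, numdf, dendf
  c(
    r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    f_statistic   = unname(f["value"]),
    # Overall p-value: upper tail of the F distribution
    p_value       = unname(pf(f["value"], f["numdf"], f["dendf"],
                              lower.tail = FALSE))
  )
}
```

Calling `fit_stats(lm_sale)` should return the same R-squared, adjusted R-squared, F-statistic, and p-value that `summary(lm_sale)` prints.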
plot(lm_sale, which=1)
hist(residuals(lm_sale))
In the histogram, the residuals do not appear to be symmetrically distributed around zero, but the residuals-versus-fitted plot shows only minimal patterns. Based on the residuals, the model has for the most part captured the relationship between the variables.
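The visual impression of asymmetry can be backed up with a few numbers. The sketch below is one possible check, not part of the original analysis: `residual_checks` is a hypothetical helper that reports the mean and median of the residuals together with the p-value of a Shapiro-Wilk normality test.

```r
# Hypothetical helper: simple numeric checks on the residuals of a fitted lm
residual_checks <- function(model) {
  r <- residuals(model)
  list(
    mean      = mean(r),    # essentially zero for a least-squares fit with an intercept
    median    = median(r),  # a median far from zero suggests skewness
    shapiro_p = shapiro.test(r)$p.value  # small values suggest non-normal residuals
  )
}
```

It could be called as `residual_checks(lm_sale)`, and `plot(lm_sale, which = 2)` draws the normal Q-Q plot as a graphical counterpart. With only 29 observations, the test has limited power, so it is best read alongside the plots.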
anova(lm_sale)
## Analysis of Variance Table
##
## Response: sale
## Df Sum Sq Mean Sq F value Pr(>F)
## full 1 151632 151632 98.2101 9.062e-10 ***
## half 1 87430 87430 56.6271 1.206e-07 ***
## bedrooms 1 10581 10581 6.8530 0.01538 *
## rooms 1 9632 9632 6.2387 0.02009 *
## neighborhood 1 90717 90717 58.7562 8.859e-08 ***
## Residuals 23 35511 1544
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
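Based on the ANOVA of the model, the number of full bathrooms (F = 98.21) and the neighborhood (F = 58.76) are the variables with the greatest effect on the sale price, closely followed by the number of half bathrooms (F = 56.63). Note that anova() reports sequential sums of squares, so each F value depends on the order in which the terms enter the formula; full bathrooms is tested first, before any other variable is accounted for, which is why it looks far more significant here than in the coefficient table (p = 0.072).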
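To see how much the order of terms matters, the sequential F values from `anova()` can be placed next to marginal F values from `drop1()`, which tests each term as if it were added last. This is a sketch under the assumption that the model contains only main effects; `compare_f_tests` is a hypothetical helper name.

```r
# Hypothetical helper: sequential vs. marginal F statistics for each term.
# Assumes a main-effects-only lm (no interaction terms).
compare_f_tests <- function(model) {
  term_labels <- attr(terms(model), "term.labels")
  seq_tab  <- anova(model)              # sequential: terms tested in formula order
  marg_tab <- drop1(model, test = "F")  # marginal: each term dropped from the full model
  data.frame(
    term         = term_labels,
    sequential_F = seq_tab[term_labels, "F value"],
    marginal_F   = marg_tab[term_labels, "F value"]
  )
}
```

For single-degree-of-freedom terms the marginal F equals the squared t value from `summary()`, and the two columns agree for the last term in the formula. That is consistent with the output above, where the neighborhood F of 58.76 is the square of its t value of 7.665.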
lm_list = lm(list ~ full + half + bedrooms + rooms + neighborhood,
data = home)
anova(lm_list)
## Analysis of Variance Table
##
## Response: list
## Df Sum Sq Mean Sq F value Pr(>F)
## full 1 169594 169594 117.6457 1.615e-10 ***
## half 1 92249 92249 63.9922 4.294e-08 ***
## bedrooms 1 9745 9745 6.7597 0.01601 *
## rooms 1 10162 10162 7.0494 0.01415 *
## neighborhood 1 91158 91158 63.2352 4.754e-08 ***
## Residuals 23 33156 1442
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA of the model, the number of full bathrooms is the variable with the greatest effect on list price (F = 117.65), followed by the number of half bathrooms (F = 63.99) and the neighborhood (F = 63.24), which are nearly tied. This is very similar to the pattern for sale price. This information could likely be used to recommend the characteristics of a house that a real estate agent should concentrate on.
lm_hoodsale = lm(sale ~ neighborhood, home)
summary(lm_hoodsale)
##
## Call:
## lm(formula = sale ~ neighborhood, data = home)
##
## Residuals:
## Min 1Q Median 3Q Max
## -134.378 -35.041 -9.041 36.985 125.633
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.95 32.82 -0.638 0.529
## neighborhood 101.66 10.72 9.485 4.36e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.41 on 27 degrees of freedom
## Multiple R-squared: 0.7692, Adjusted R-squared: 0.7606
## F-statistic: 89.97 on 1 and 27 DF, p-value: 4.357e-10
lm_hoodlist = lm(list ~ neighborhood, home)
summary(lm_hoodlist)
##
## Call:
## lm(formula = list ~ neighborhood, data = home)
##
## Residuals:
## Min 1Q Median 3Q Max
## -137.878 -31.504 -2.878 47.822 103.683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28.75 33.17 -0.867 0.394
## neighborhood 104.81 10.83 9.676 2.86e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.02 on 27 degrees of freedom
## Multiple R-squared: 0.7762, Adjusted R-squared: 0.7679
## F-statistic: 93.63 on 1 and 27 DF, p-value: 2.863e-10
ggplot(home, aes(x=neighborhood, y=list)) + geom_point()
ggplot(home, aes(x=neighborhood, y=sale)) + geom_point()
The neighborhood seems to affect the list price and the sale price in almost the same way: the slopes (104.81 for list price, 101.66 for sale price) and the R-squared values (0.7762 and 0.7692) are nearly identical. There are very few instances where the sale price was over the list price, and this does not seem to differ between richer and poorer neighborhoods.
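The comparison of list and sale prices can be made more direct by summarizing the gap between them within each neighborhood. The sketch below assumes a data frame with `sale` and `list` columns, as in `home`; `price_gap_by_group` is a hypothetical helper name.

```r
# Hypothetical helper: summarize sale price minus list price within each group
price_gap_by_group <- function(data, group = "neighborhood") {
  gap <- data$sale - data$list  # positive means the home sold above its list price
  g <- data[[group]]
  data.frame(
    group        = sort(unique(g)),
    mean_gap     = as.numeric(tapply(gap, g, mean)),
    n_above_list = as.numeric(tapply(gap > 0, g, sum)),
    n            = as.numeric(tapply(gap, g, length))
  )
}
```

Calling `price_gap_by_group(home)` gives one row per neighborhood. If the mean gap and the count of homes sold above list price look similar across rows, that would support the reading that the neighborhood has little influence on the difference between the two prices.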