This project uses a dataset containing home list and sale prices.

home = read.csv("homeprice.csv")
str(home)
## 'data.frame':    29 obs. of  7 variables:
##  $ list        : num  80 151 310 295 339 ...
##  $ sale        : num  118 151 300 275 340 ...
##  $ full        : int  1 1 2 2 2 1 3 1 1 1 ...
##  $ half        : int  0 0 1 1 0 1 0 1 2 0 ...
##  $ bedrooms    : int  3 4 4 4 3 4 3 3 3 1 ...
##  $ rooms       : int  6 7 9 8 7 8 7 7 7 3 ...
##  $ neighborhood: int  1 1 3 3 4 3 2 2 3 2 ...
names(home)
## [1] "list"         "sale"         "full"         "half"         "bedrooms"    
## [6] "rooms"        "neighborhood"
library(ggplot2)

Use this file to explore the relationship between the sale price and the other variables using scatterplots, histograms, and/or boxplots. Identify the variables that appear to have the strongest relationship with sale price.

# Sale price and number of full bathrooms
ggplot(home, aes(x=full, y=sale)) + geom_point()  # response (sale) on the y-axis

full2 = factor(home$full)
ggplot(home, aes(x=full2, y=sale)) + geom_boxplot()

# Sale price and number of half bathrooms
ggplot(home, aes(x=half, y=sale)) + geom_point()  # response (sale) on the y-axis

half2 = factor(home$half)
ggplot(home, aes(x=half2, y=sale)) + geom_boxplot()

# Sale price and number of bedrooms
ggplot(home, aes(x=bedrooms, y=sale)) + geom_point()  # response (sale) on the y-axis

bedrooms2 = factor(home$bedrooms)
ggplot(home, aes(x=bedrooms2, y=sale)) + geom_boxplot()

# Sale price and neighborhood of home
neighborhood2 = factor(home$neighborhood)
ggplot(home, aes(x=neighborhood, y=sale, col=neighborhood2)) + geom_point()

ggplot(home, aes(x=neighborhood2, y=sale)) + geom_boxplot()

The variables that appear to have the strongest relationship with sale price are neighborhood and the number of full bathrooms.
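
As a rough numeric companion to the plots, the strength of each relationship can be summarized with a correlation coefficient. The sketch below is only an illustration: `home_head` is a hypothetical name holding the first five rows as printed (and rounded) by str(home) above, so the numbers it prints are not results for the full 29-row dataset. On the real data, `home` would take the place of `home_head`.

```r
# First five rows as shown by str(home); str() rounds, so values are approximate.
home_head <- data.frame(
  list         = c(80, 151, 310, 295, 339),
  sale         = c(118, 151, 300, 275, 340),
  full         = c(1, 1, 2, 2, 2),
  half         = c(0, 0, 1, 1, 0),
  bedrooms     = c(3, 4, 4, 4, 3),
  rooms        = c(6, 7, 9, 8, 7),
  neighborhood = c(1, 1, 3, 3, 4)
)

predictors <- c("full", "half", "bedrooms", "rooms", "neighborhood")

# Pearson correlation of each predictor with sale price
sale_cor <- sapply(home_head[predictors], function(x) cor(x, home_head$sale))

# Order from strongest to weakest absolute correlation
sale_cor <- sale_cor[order(abs(sale_cor), decreasing = TRUE)]
print(round(sale_cor, 2))
```

Correlation treats the neighborhood code as a number, which is only sensible if the codes are ordered (for example from least to most expensive).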

Now use these variables to build a multiple linear regression model to explain the sale price. Use the summary() function to find the coefficients and goodness-of-fit of the model. Use the anova() function to identify which variable appears to have the greatest effect on sale price. Look at the distribution of residuals.

lm_sale = lm(sale ~ full + half + bedrooms + rooms + neighborhood,
             data = home)
summary(lm_sale)
## 
## Call:
## lm(formula = sale ~ full + half + bedrooms + rooms + neighborhood, 
##     data = home)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.31 -34.06   7.20  21.32  55.93 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -135.263     37.283  -3.628  0.00141 ** 
## full           26.225     13.896   1.887  0.07181 .  
## half           43.242     12.830   3.370  0.00264 ** 
## bedrooms       20.409     17.798   1.147  0.26329    
## rooms           6.488     10.383   0.625  0.53823    
## neighborhood   77.243     10.077   7.665 8.86e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.29 on 23 degrees of freedom
## Multiple R-squared:  0.9079, Adjusted R-squared:  0.8879 
## F-statistic: 45.34 on 5 and 23 DF,  p-value: 3.686e-11

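The coefficient estimates appear under the ‘Coefficients:’ header of the summary() output. Neighborhood has the largest estimate (77.24) and by far the smallest p-value (8.86e-08), so it is the variable with the greatest effect on sale price. The number of half bathrooms is also significant at the 0.01 level (p = 0.0026), while full bathrooms, bedrooms, and rooms are not significant at the 0.05 level once the other variables are in the model. The goodness-of-fit of the model can be read from the multiple R-squared (0.9079), the adjusted R-squared (0.8879), and the overall F-statistic (45.34 on 5 and 23 DF, p-value 3.686e-11), all of which indicate that the model explains most of the variation in sale price (about 91%).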
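
The fit statistics in the summary are tied together by standard formulas, which can be checked by hand. The sketch below uses only numbers copied from the summary(lm_sale) output above. Small differences from the printed values come from R-squared being rounded to four digits.

```r
# Values copied from the summary(lm_sale) output
r2 <- 0.9079   # Multiple R-squared
n  <- 29       # number of observations
p  <- 5        # number of predictors (excluding the intercept)

df_resid <- n - p - 1                           # residual degrees of freedom
adj_r2   <- 1 - (1 - r2) * (n - 1) / df_resid   # adjusted R-squared
f_stat   <- (r2 / p) / ((1 - r2) / df_resid)    # overall F-statistic
p_value  <- pf(f_stat, p, df_resid, lower.tail = FALSE)

print(round(c(adj_r2 = adj_r2, f_stat = f_stat), 4))
print(p_value)
```

The adjusted R-squared is lower than the multiple R-squared because it penalizes the five predictors relative to only 29 observations.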

plot(lm_sale, which=1)

hist(residuals(lm_sale))

The histogram suggests that the residuals are not perfectly symmetric around zero (the median residual is 7.20 rather than 0), but the residuals-versus-fitted plot shows only minimal patterns. Based on the residuals, the model has for the most part captured the relationship between the variables.

anova(lm_sale)
## Analysis of Variance Table
## 
## Response: sale
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## full          1 151632  151632 98.2101 9.062e-10 ***
## half          1  87430   87430 56.6271 1.206e-07 ***
## bedrooms      1  10581   10581  6.8530   0.01538 *  
## rooms         1   9632    9632  6.2387   0.02009 *  
## neighborhood  1  90717   90717 58.7562 8.859e-08 ***
## Residuals    23  35511    1544                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

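Based on the ANOVA of the model, the number of full bathrooms (Sum Sq 151,632) and neighborhood (90,717) are the variables that account for the most variation in sale price, with half bathrooms (87,430) close behind neighborhood. Note that anova() on a single lm fit reports sequential sums of squares: each row measures what a variable adds after the variables listed before it. Full bathrooms is entered first and gets credit for any variation it shares with the later variables, which is why it ranks first here even though its coefficient in summary() is only marginally significant (p = 0.072).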
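
The rows of an anova() table for a single lm fit depend on the order of the terms in the formula. The sketch below shows this on simulated data, not on the home price data: `sim` and its columns are made up for illustration, with `neighborhood` deliberately correlated with `full`. It also shows drop1(), which tests each term given all the others and so does not depend on order.

```r
set.seed(1)
n <- 200

# Simulated (hypothetical) data with two correlated predictors
full         <- rnorm(n)
neighborhood <- 0.8 * full + rnorm(n, sd = 0.6)
sale         <- 2 * full + 3 * neighborhood + rnorm(n)
sim <- data.frame(sale, full, neighborhood)

# Same model, two term orders
fit_a <- lm(sale ~ full + neighborhood, data = sim)
fit_b <- lm(sale ~ neighborhood + full, data = sim)

# Sequential sum of squares credited to `full` in each order
ss_full_first <- anova(fit_a)["full", "Sum Sq"]
ss_full_last  <- anova(fit_b)["full", "Sum Sq"]
print(c(full_first = ss_full_first, full_last = ss_full_last))

# Order-independent alternative: each term tested after all the others
print(drop1(fit_a, test = "F"))
```

On the home price data, the analogous check would be to refit lm_sale with neighborhood listed first and compare the two tables.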

Build a second model using the same variables to explain the list price. Use the anova() function to identify which variable appears to have the greatest effect on list price. Are there differences from the sale price? Could you use this information to recommend which characteristics of a house a real estate agent should concentrate on?

lm_list = lm(list ~ full + half + bedrooms + rooms + neighborhood,
             data = home)
anova(lm_list)
## Analysis of Variance Table
## 
## Response: list
##              Df Sum Sq Mean Sq  F value    Pr(>F)    
## full          1 169594  169594 117.6457 1.615e-10 ***
## half          1  92249   92249  63.9922 4.294e-08 ***
## bedrooms      1   9745    9745   6.7597   0.01601 *  
## rooms         1  10162   10162   7.0494   0.01415 *  
## neighborhood  1  91158   91158  63.2352 4.754e-08 ***
## Residuals    23  33156    1442                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

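Based on the ANOVA of the model, the number of full bathrooms is the variable with the greatest effect on list price (Sum Sq 169,594), followed by the number of half bathrooms (92,249) and the neighborhood (91,158). This is very similar to the sale price model: the same three variables dominate, and only the order of half bathrooms and neighborhood is swapped. Because list and sale prices respond to the same characteristics, a real estate agent could reasonably concentrate on the number of bathrooms and the neighborhood when pricing a house.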
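
To compare the two ANOVA tables on a common scale, each sum of squares can be expressed as a share of the total. The sketch below uses only the Sum Sq columns copied from the two tables printed above; `term_names`, `ss_sale`, `ss_list`, and `shares` are names chosen for this illustration. The shares are sequential, so they reflect the order in which the variables enter the model.

```r
# Sum Sq columns copied from anova(lm_sale) and anova(lm_list)
term_names <- c("full", "half", "bedrooms", "rooms", "neighborhood", "Residuals")
ss_sale <- c(151632, 87430, 10581, 9632, 90717, 35511)
ss_list <- c(169594, 92249, 9745, 10162, 91158, 33156)

# Percentage of the total sum of squares attributed to each row
shares <- data.frame(
  term     = term_names,
  sale_pct = round(100 * ss_sale / sum(ss_sale), 1),
  list_pct = round(100 * ss_list / sum(ss_list), 1)
)
print(shares)

# One minus the residual share reproduces the model's R-squared
r2_sale <- 1 - ss_sale[6] / sum(ss_sale)
print(round(r2_sale, 4))
```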

What is the effect of neighborhood on the difference between sale price and list price? Do richer neighborhoods mean it is more likely to have a house go over the asking price?

lm_hoodsale = lm(sale ~ neighborhood, home)
summary(lm_hoodsale)
## 
## Call:
## lm(formula = sale ~ neighborhood, data = home)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -134.378  -35.041   -9.041   36.985  125.633 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -20.95      32.82  -0.638    0.529    
## neighborhood   101.66      10.72   9.485 4.36e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57.41 on 27 degrees of freedom
## Multiple R-squared:  0.7692, Adjusted R-squared:  0.7606 
## F-statistic: 89.97 on 1 and 27 DF,  p-value: 4.357e-10
lm_hoodlist = lm(list ~ neighborhood, home)
summary(lm_hoodlist)
## 
## Call:
## lm(formula = list ~ neighborhood, data = home)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -137.878  -31.504   -2.878   47.822  103.683 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -28.75      33.17  -0.867    0.394    
## neighborhood   104.81      10.83   9.676 2.86e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.02 on 27 degrees of freedom
## Multiple R-squared:  0.7762, Adjusted R-squared:  0.7679 
## F-statistic: 93.63 on 1 and 27 DF,  p-value: 2.863e-10
ggplot(home, aes(x=neighborhood, y=list)) + geom_point()

ggplot(home, aes(x=neighborhood, y=sale)) + geom_point()

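Neighborhood seems to have very little effect on the gap between the list price and the sale price. The two fitted slopes are nearly identical (101.66 for sale and 104.81 for list), so both prices rise by about the same amount from one neighborhood level to the next. There are very few instances where the sale price was over the list price, and there does not seem to be an effect from richer or poorer neighborhoods.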
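
A more direct way to approach this question is to model the gap sale - list itself. Because least squares is linear in the response, the neighborhood slope for the gap equals the sale slope minus the list slope, which for the two models printed above is 101.66 - 104.81 = -3.15. The sketch below checks that identity. It is only an illustration: `home_head` is a hypothetical name for the first five rows as printed (and rounded) by str(home), so its output says nothing reliable about the full dataset, and `over_list` is a column name introduced here.

```r
# First five rows as shown by str(home); values are rounded by str().
home_head <- data.frame(
  list         = c(80, 151, 310, 295, 339),
  sale         = c(118, 151, 300, 275, 340),
  neighborhood = c(1, 1, 3, 3, 4)
)

# Positive values mean the house sold above its list price
home_head$over_list <- home_head$sale - home_head$list

fit_gap  <- lm(over_list ~ neighborhood, data = home_head)
fit_sale <- lm(sale ~ neighborhood, data = home_head)
fit_list <- lm(list ~ neighborhood, data = home_head)

# The gap slope equals the difference between the two separate slopes
slope_gap  <- coef(fit_gap)[["neighborhood"]]
slope_diff <- coef(fit_sale)[["neighborhood"]] - coef(fit_list)[["neighborhood"]]
print(c(slope_gap = slope_gap, slope_diff = slope_diff))

# Share of these rows that sold above list price
print(mean(home_head$over_list > 0))
```

On the full data, summary(lm(I(sale - list) ~ neighborhood, data = home)) would also give a standard error and p-value for the gap slope, which the two separate models cannot provide.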