Analysis of Housing Prices

Author

Doahn Lee

Variables

The “homeprice.csv” file is read into a data frame named “hprice”, and str() is used to show the structure of the data frame. The ggplot2 package is used for plotting the figures.

hprice = read.csv("homeprice.csv", stringsAsFactors = TRUE)
str(hprice)
'data.frame':   29 obs. of  7 variables:
 $ list        : num  80 151 310 295 339 ...
 $ sale        : num  118 151 300 275 340 ...
 $ full        : int  1 1 2 2 2 1 3 1 1 1 ...
 $ half        : int  0 0 1 1 0 1 0 1 2 0 ...
 $ bedrooms    : int  3 4 4 4 3 4 3 3 3 1 ...
 $ rooms       : int  6 7 9 8 7 8 7 7 7 3 ...
 $ neighborhood: int  1 1 3 3 4 3 2 2 3 2 ...
library(ggplot2)

Relationship Between Sale Price and Other Variables

Sale Price vs List Price

This scatter plot shows the relationship between sale price and list price.

qplot(sale, list, data = hprice, xlab = "Sale Price", ylab = "List Price",
      main = "Sale Price vs List Price")
Warning: `qplot()` was deprecated in ggplot2 3.4.0.
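
Because of the deprecation warning above, the same plot can also be written with ggplot() directly. The following is a minimal sketch and not part of the original analysis. hprice_sample is a hypothetical five-row stand-in for hprice, built from the first values displayed by str(hprice), so that the block runs without the CSV file.

```r
library(ggplot2)

# Hypothetical stand-in for hprice: the first five list and sale values
# as displayed (rounded) by str(hprice) above.
hprice_sample <- data.frame(
  list = c(80, 151, 310, 295, 339),
  sale = c(118, 151, 300, 275, 340)
)

# Same mapping as qplot(sale, list, ...): sale on the x-axis, list on the y-axis.
p <- ggplot(hprice_sample, aes(x = sale, y = list)) +
  geom_point() +
  labs(x = "Sale Price", y = "List Price", title = "Sale Price vs List Price")

print(p)
```

With the full data, replacing hprice_sample with hprice should reproduce the figure above. The other scatter plots in this section follow the same pattern with a different y variable and label.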

Sale Price vs Number of Full Bathrooms

This scatter plot shows the relationship between sale price and the number of full bathrooms.

qplot(sale, full, data = hprice, xlab = "Sale Price", ylab = "Full Bathroom",
      main = "Sale Price vs Full Bathroom")

Sale Price vs Number of Half Bathrooms

This scatter plot shows the relationship between sale price and the number of half bathrooms.

qplot(sale, half, data = hprice, xlab = "Sale Price", ylab = "Half Bathroom",
      main = "Sale Price vs Half Bathroom")

Sale Price vs Number of Bedrooms

This scatter plot shows the relationship between sale price and the number of bedrooms.

qplot(sale, bedrooms, data = hprice, xlab = "Sale Price", ylab = "Bedrooms",
      main = "Sale Price vs Bedrooms")

Sale Price vs Number of Non-Bedrooms

This scatter plot shows the relationship between sale price and the number of non-bedrooms.

qplot(sale, rooms, data = hprice, xlab = "Sale Price", ylab = "Non-Bedrooms",
      main = "Sale Price vs Non-Bedrooms")

Sale Price vs Neighborhood Rank

This scatter plot shows the relationship between sale price and neighborhood rank.

qplot(sale, neighborhood, data = hprice, xlab = "Sale Price", ylab = "Neighborhood Rank",
      main = "Sale Price vs Neighborhood Rank")

Result

After reviewing the scatter plots of sale price against the other variables, list price appears to have the strongest relationship with sale price.
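
The visual impression can be backed up with correlation coefficients. The following is a minimal sketch, not part of the original analysis. sale_correlations() is a hypothetical helper, and hprice_sample is a hypothetical stand-in that holds only the first five rows as displayed by str(hprice), so the numbers it prints are for illustration only.

```r
# Correlation of sale price with every other numeric column of a data frame,
# sorted from strongest to weakest absolute correlation.
sale_correlations <- function(df) {
  others <- setdiff(names(df), "sale")
  r <- sapply(df[others], function(x) cor(df$sale, x))
  r[order(abs(r), decreasing = TRUE)]
}

# Hypothetical stand-in for hprice: first five rows as displayed by str(hprice).
hprice_sample <- data.frame(
  list = c(80, 151, 310, 295, 339),
  sale = c(118, 151, 300, 275, 340),
  full = c(1, 1, 2, 2, 2),
  half = c(0, 0, 1, 1, 0),
  bedrooms = c(3, 4, 4, 4, 3),
  rooms = c(6, 7, 9, 8, 7),
  neighborhood = c(1, 1, 3, 3, 4)
)

print(round(sale_correlations(hprice_sample), 3))
```

Calling sale_correlations(hprice) on the full data frame would rank all six variables by how strongly they correlate with sale price.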

Linear Regression Model

This section fits multiple linear regression models to examine how sale price and list price relate to the other variables. The summary() and anova() functions are used to interpret the results of the models. A histogram and diagnostic plots are used to examine the residuals of each model.

Sale Price vs Other Variables

Linear Regression of Sale Price and Other Variables

lin1 = lm(sale ~ list + full + half + bedrooms + rooms + neighborhood, data = hprice)
summary(lin1)

Call:
lm(formula = sale ~ list + full + half + bedrooms + rooms + neighborhood, 
    data = hprice)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.807  -6.626  -0.270   5.580  32.933 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.13359   17.15496   0.299    0.768    
list          0.97131    0.07616  12.754 1.22e-11 ***
full         -4.97759    5.48033  -0.908    0.374    
half         -1.00644    5.70418  -0.176    0.862    
bedrooms      2.49224    6.43616   0.387    0.702    
rooms        -0.43411    3.70424  -0.117    0.908    
neighborhood  2.03434    6.88609   0.295    0.770    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.87 on 22 degrees of freedom
Multiple R-squared:  0.989, Adjusted R-squared:  0.986 
F-statistic: 330.5 on 6 and 22 DF,  p-value: < 2.2e-16

For this linear regression model, the multiple R-squared value is 0.989 and the adjusted R-squared value is 0.986. Both values are close to 1, which means that the model explains almost all of the variation in sale price and has a good fit.
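
The adjusted R-squared value can be checked by hand from the multiple R-squared value, the number of observations, and the number of predictors. The following is a minimal sketch. adjusted_r2() is a hypothetical helper that applies the standard adjustment formula to the numbers reported in the summary above.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values reported above: R-squared 0.989, 29 observations, 6 predictors.
adj <- adjusted_r2(0.989, n = 29, p = 6)
print(round(adj, 3))
```

The result agrees with the reported adjusted R-squared of 0.986. The small gap between the two values reflects the penalty for using six predictors on only 29 observations.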

ANOVA of Sale Price and Other Variables

anova(lin1)
Analysis of Variance Table

Response: sale
             Df Sum Sq Mean Sq   F value Pr(>F)    
list          1 381050  381050 1981.6252 <2e-16 ***
full          1    156     156    0.8116 0.3774    
half          1     21      21    0.1092 0.7441    
bedrooms      1     25      25    0.1314 0.7204    
rooms         1      3       3    0.0141 0.9065    
neighborhood  1     17      17    0.0873 0.7704    
Residuals    22   4230     192                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The linear regression and ANOVA results show that the list price has by far the largest effect on the sale price. The F-value of list price is 1981.6252, which indicates a strong relationship between sale price and list price. The F-values of the other variables are all less than 1, which means that they explain little additional variation in sale price once list price is in the model, so the null hypothesis of no effect cannot be rejected for them.

The p-value of list price is below 2e-16, far smaller than 0.05, so the relationship between sale price and list price is statistically significant. The p-values of the other variables are all larger than 0.05, so none of them has a statistically significant relationship with sale price in this model.

Since list price is the only variable with an F-value larger than 1 and a p-value less than 0.05, list price affects sale price the most among all variables.
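
The F-values and p-values in the ANOVA table can be reproduced from its sums of squares. The following is a minimal sketch that uses only the rounded numbers printed by anova(lin1), so the results match the table only approximately.

```r
# Sums of squares and degrees of freedom copied from the anova(lin1) table.
ss_list <- 381050
ss_resid <- 4230
df_resid <- 22

ms_resid <- ss_resid / df_resid      # residual mean square
f_list <- (ss_list / 1) / ms_resid   # F-value for list price (1 df)

# p-value for the reported F-value of full bathrooms (0.8116 on 1 and 22 df).
p_full <- pf(0.8116, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(c(ms_resid = ms_resid, f_list = f_list, p_full = p_full))
```

Each F-value is the mean square of a term divided by the residual mean square, and the p-value is the upper tail of the F distribution at that value.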

Residuals of Sale Price and Other Variables Model

The residuals of this model are not evenly distributed, so the results of the model might be biased.

Histogram of Residuals

hist(residuals(lin1))

Residuals Against the Fitted Values

plot(lin1, which = 1)

Cook’s Distance for Each Observation

plot(lin1, which = 4)
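
The plots can be complemented with numeric checks. The following is a minimal sketch, not part of the original analysis. residual_checks() is a hypothetical helper, and toy_model is fitted on a hypothetical five-row stand-in for hprice only so that the block runs without the CSV file.

```r
# Summarise residual diagnostics for a fitted lm model.
residual_checks <- function(model) {
  res <- residuals(model)
  cd <- cooks.distance(model)
  list(
    shapiro_p = shapiro.test(res)$p.value,    # small p suggests non-normal residuals
    max_cook = max(cd),                       # largest Cook's distance
    most_influential = unname(which.max(cd))  # row with the largest Cook's distance
  )
}

# Hypothetical stand-in for hprice: first five list and sale values from str(hprice).
hprice_sample <- data.frame(
  list = c(80, 151, 310, 295, 339),
  sale = c(118, 151, 300, 275, 340)
)

toy_model <- lm(sale ~ list, data = hprice_sample)
print(residual_checks(toy_model))
```

Calling residual_checks(lin1) on the model above would return the Shapiro-Wilk p-value for its residuals and identify the observation with the largest Cook's distance.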

List Price vs Other Variables

Linear Regression of List Price and Other Variables

lin2 = lm(list ~ sale + full + half + bedrooms + rooms + neighborhood, data = hprice)
summary(lin2)

Call:
lm(formula = list ~ sale + full + half + bedrooms + rooms + neighborhood, 
    data = hprice)

Residuals:
     Min       1Q   Median       3Q      Max 
-27.8544  -6.7013  -0.7265   6.7894  31.3427 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -21.8752    15.9419  -1.372    0.184    
sale           0.9069     0.0711  12.754 1.22e-11 ***
full           8.3411     5.0923   1.638    0.116    
half           6.3398     5.3475   1.186    0.248    
bedrooms      -0.0627     6.2402  -0.010    0.992    
rooms          1.2426     3.5706   0.348    0.731    
neighborhood   7.3793     6.4787   1.139    0.267    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.4 on 22 degrees of freedom
Multiple R-squared:  0.9903,    Adjusted R-squared:  0.9876 
F-statistic: 373.3 on 6 and 22 DF,  p-value: < 2.2e-16

For this linear regression model, the multiple R-squared value is 0.9903 and the adjusted R-squared value is 0.9876. Both values are close to 1, which means that the model explains almost all of the variation in list price and has a good fit.

ANOVA of List Price and Other Variables

anova(lin2)
Analysis of Variance Table

Response: list
             Df Sum Sq Mean Sq   F value Pr(>F)    
sale          1 401374  401374 2235.5702 <2e-16 ***
full          1    346     346    1.9259 0.1791    
half          1    134     134    0.7440 0.3977    
bedrooms      1      4       4    0.0209 0.8864    
rooms         1     24      24    0.1326 0.7192    
neighborhood  1    233     233    1.2973 0.2670    
Residuals    22   3950     180                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The linear regression and ANOVA results show that the sale price has by far the largest effect on the list price. The F-value of sale price is 2235.5702, which indicates a strong relationship between list price and sale price. The F-values for number of full bathrooms and neighborhood rank are 1.9259 and 1.2973, respectively. Both are above 1, which suggests that these two variables have some relationship with list price, although a much weaker one than sale price has. The F-values of the other variables are all less than 1, which means that they have a weak relationship with list price, so the null hypothesis of no effect cannot be rejected for them.

The p-value of sale price is below 2e-16, far smaller than 0.05, so the relationship between list price and sale price is statistically significant. The p-values of the other variables are all larger than 0.05, so none of them has a statistically significant relationship with list price in this model.

Number of full bathrooms and neighborhood rank have F-values larger than 1 but p-values larger than 0.05, whereas sale price has both an F-value larger than 1 and a p-value less than 0.05. This means that sale price affects list price the most among all variables. A real estate agent could use this information to make recommendations based on the sale price, number of full bathrooms, and neighborhood rank of a house.
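
The F-values printed by anova() for a single lm fit are sequential: each term is tested after the terms listed before it in the formula, so the order of the predictors can change the table. The following is a minimal sketch of that behavior, not part of the original analysis. hprice_sample is a hypothetical five-row stand-in for hprice, so the printed numbers are for illustration only.

```r
# Hypothetical stand-in for hprice: first five rows as displayed by str(hprice).
hprice_sample <- data.frame(
  list = c(80, 151, 310, 295, 339),
  sale = c(118, 151, 300, 275, 340),
  full = c(1, 1, 2, 2, 2)
)

# The same two predictors, entered in a different order.
m_sale_first <- lm(list ~ sale + full, data = hprice_sample)
m_full_first <- lm(list ~ full + sale, data = hprice_sample)

a_sale_first <- anova(m_sale_first)
a_full_first <- anova(m_full_first)

# The sums of squares credited to each term depend on the order,
# while the residual sum of squares stays the same.
print(a_sale_first)
print(a_full_first)
```

Applied to lin2, this means that the F-values of the variables after sale measure only what each one adds beyond the terms entered before it.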

Residuals of List Price and Other Variables Model

The residuals of this model are not evenly distributed, so the results of the model might be biased.

Histogram of Residuals

hist(residuals(lin2))

Residuals Against the Fitted Values

plot(lin2, which = 1)

Cook’s Distance for Each Observation

plot(lin2, which = 4)

Effect of Neighborhood on the Difference between Sale Price and List Price

This section fits a multiple linear regression model to examine how neighborhood rank relates to sale price and list price. The summary() and anova() functions are used to interpret the results of the model. A histogram and diagnostic plots are used to examine the residuals of the model.

Linear Regression Model and ANOVA

lin3 = lm(neighborhood ~ sale + list, data = hprice)
summary(lin3)

Call:
lm(formula = neighborhood ~ sale + list, data = hprice)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.97992 -0.31827 -0.01618  0.33585  0.84921 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 0.8551268  0.2395232   3.570  0.00142 **
sale        0.0008349  0.0074462   0.112  0.91159   
list        0.0065966  0.0072552   0.909  0.37158   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4968 on 26 degrees of freedom
Multiple R-squared:  0.7763,    Adjusted R-squared:  0.7591 
F-statistic: 45.11 on 2 and 26 DF,  p-value: 3.516e-09
anova(lin3)
Analysis of Variance Table

Response: neighborhood
          Df  Sum Sq Mean Sq F value    Pr(>F)    
sale       1 22.0673 22.0673 89.3927 6.724e-10 ***
list       1  0.2041  0.2041  0.8267    0.3716    
Residuals 26  6.4183  0.2469                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results suggest that neighborhood is more closely tied to sale price than to list price. In the ANOVA table, sale price has an F-value of 89.3927 and a p-value of 6.724e-10, while list price has an F-value of 0.8267 and a p-value of 0.3716. Sale price has both an F-value over 1 and a p-value less than 0.05, but list price does not. Because anova() tests the terms sequentially, the F-value of list price measures only what it adds after sale price is already in the model. Taken at face value, this suggests that neighborhood rank affects the sale price rather than the list price, which could mean that houses in richer neighborhoods are more likely to sell above the asking price.

For this linear regression model, the multiple R-squared value is 0.7763 and the adjusted R-squared value is 0.7591. These values are lower than in the previous two models but still fairly high: the model explains about 78% of the variation in neighborhood rank, which indicates a reasonably good fit.
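
A more direct way to look at the question in this section's title would be to model the difference between sale price and list price as a function of neighborhood rank. The following is an alternative sketch and not the model fitted above. price_diff and diff_model are hypothetical names, and hprice_sample is a hypothetical five-row stand-in for hprice, so the coefficients it prints are for illustration only.

```r
# Hypothetical stand-in for hprice: first five rows as displayed by str(hprice).
hprice_sample <- data.frame(
  list = c(80, 151, 310, 295, 339),
  sale = c(118, 151, 300, 275, 340),
  neighborhood = c(1, 1, 3, 3, 4)
)

# Positive values mean the house sold above its list price.
hprice_sample$price_diff <- hprice_sample$sale - hprice_sample$list

# Regress the price difference on neighborhood rank.
diff_model <- lm(price_diff ~ neighborhood, data = hprice_sample)
print(coef(diff_model))
```

With the full data, a positive and significant neighborhood coefficient in this model would support the idea that houses in higher-ranked neighborhoods tend to sell above the asking price.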

Residuals of the Model

The residuals of this model are evenly distributed, so the results of the model do not appear to be biased.

Histogram of Residuals

hist(residuals(lin3))

Residuals Against the Fitted Values

plot(lin3, which = 1)

Cook’s Distance for Each Observation

plot(lin3, which = 4)