Introduction

I will be working with the Housing Prices Dataset from Kaggle. The goal is to fit a simple linear model that explains housing price as a function of area.

Visualize the Data

houses <- read.csv("https://raw.githubusercontent.com/kristinlussi/DATA605/main/WEEK11/Housing.csv")
head(houses)
##      price area bedrooms bathrooms stories mainroad guestroom basement
## 1 13300000 7420        4         2       3      yes        no       no
## 2 12250000 8960        4         4       4      yes        no       no
## 3 12250000 9960        3         2       2      yes        no      yes
## 4 12215000 7500        4         2       2      yes        no      yes
## 5 11410000 7420        4         1       2      yes       yes      yes
## 6 10850000 7500        3         3       1      yes        no      yes
##   hotwaterheating airconditioning parking prefarea furnishingstatus
## 1              no             yes       2      yes        furnished
## 2              no             yes       3       no        furnished
## 3              no              no       2      yes   semi-furnished
## 4              no             yes       3      yes        furnished
## 5              no             yes       2       no        furnished
## 6              no             yes       2      yes   semi-furnished

The plot below shows a positive, though fairly noisy, relationship between area and housing price.

plot(houses[,"area"],houses[,"price"], main="Area vs Housing Price",
    xlab="Area", ylab="Housing Price")

The Linear Model

houses.lm <- lm(price~area, data = houses)
houses.lm
## 
## Call:
## lm(formula = price ~ area, data = houses)
## 
## Coefficients:
## (Intercept)         area  
##     2387308          462

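The y-intercept is 2387308 and the slope is 462, so the fitted line is \(\widehat{price} = 2387308 + 462 \cdot area\). Each additional unit of area is associated with an increase of about 462 in the predicted price.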
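As a quick illustration of how the fitted line is used, here is a minimal sketch that plugs the two coefficients from the output above into the line equation. The helper name `predict_price` and the example area of 5000 are assumptions made for illustration; they are not part of the original analysis.

```r
# Coefficients copied from the printed lm() output above (rounded as shown there).
intercept <- 2387308
slope <- 462

# Hypothetical helper: fitted price for a given area under the simple model.
predict_price <- function(area) {
  intercept + slope * area
}

# Predicted price for an illustrative area of 5000.
predict_price(5000)
```

With the fitted object in memory, `predict(houses.lm, newdata = data.frame(area = 5000))` should give essentially the same value, differing only by the rounding of the printed coefficients.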

plot(price~area, data = houses)
abline(houses.lm)

Evaluating the Quality of the Model

summary(houses.lm)
## 
## Call:
## lm(formula = price ~ area, data = houses)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4867112 -1022228  -200135   683027  7484838 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.387e+06  1.745e+05   13.68   <2e-16 ***
## area        4.620e+02  3.123e+01   14.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1581000 on 543 degrees of freedom
## Multiple R-squared:  0.2873, Adjusted R-squared:  0.286 
## F-statistic: 218.9 on 1 and 543 DF,  p-value: < 2.2e-16

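For a “good” model, the residuals would have a median around 0, min and max values of about the same magnitude, and first and third quartiles of about the same magnitude. In the summary above, the median residual is -200135 and the largest positive residual (7484838) is noticeably bigger than the largest negative one (-4867112), so the residuals are not perfectly symmetric. The values are not the best, but we will keep going.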
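The symmetry check described above can be made concrete with a few ratios. This is a rough sketch that uses only the numbers printed by `summary(houses.lm)`; the ratio names are my own, and dividing the median by the residual standard error is just one possible way to judge whether it is "around 0".

```r
# Residual five-number summary copied from summary(houses.lm).
res <- c(min = -4867112, q1 = -1022228, median = -200135,
         q3 = 683027, max = 7484838)
resid_se <- 1581000  # residual standard error from the same output

# Ratios near 1 would indicate symmetric tails and symmetric quartiles.
tail_ratio <- unname(res["max"] / abs(res["min"]))
quartile_ratio <- unname(res["q3"] / abs(res["q1"]))

# Median relative to the typical residual size; a value near 0 is ideal.
median_scaled <- unname(res["median"] / resid_se)

round(c(tail_ratio = tail_ratio,
        quartile_ratio = quartile_ratio,
        median_scaled = median_scaled), 2)
```

The tail ratio comes out at about 1.54 and the quartile ratio at about 0.67, which is consistent with a right-skewed set of residuals.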

We want to see a standard error that is at least 5-10 times smaller than the corresponding coefficient. For area, the coefficient (462.0) is 14.79 times larger than its standard error (31.23). This large ratio means that there is relatively little variability in the slope estimate.
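The ratio quoted above is the same quantity that R reports in the `t value` column. A small sketch, using only the estimates and standard errors printed in the summary, reproduces it:

```r
# Estimates and standard errors copied from the coefficient table above.
est <- c(intercept = 2.387e+06, area = 4.620e+02)
se  <- c(intercept = 1.745e+05, area = 3.123e+01)

# Coefficient divided by its standard error; this is the t statistic.
ratio <- est / se
round(ratio, 2)
```

Both ratios are well above the 5-10 rule of thumb and agree with the printed t values of 13.68 and 14.79.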

The p-value for area has three asterisks (***), which means that it falls in the range \(0 < p \leq 0.001\). Since this value is so small, there is strong evidence that the slope is not zero, that is, strong evidence of a linear relationship between area and housing price.
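To see where the p-value comes from, the sketch below recomputes the two-sided p-value for the slope from the printed t statistic and the residual degrees of freedom. It assumes the usual t-test for a regression coefficient; the variable names are illustrative.

```r
# t statistic for area and residual degrees of freedom, from the summary above.
t_area <- 14.79
df_resid <- 543

# Two-sided p-value: probability of a t value at least this extreme
# in either tail if the true slope were zero.
p_area <- 2 * pt(-abs(t_area), df = df_resid)

# TRUE means the p-value is in the three-asterisk range.
p_area <= 0.001
```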

The multiple R-squared is 0.2873, which means that about 28.7% of the variability in price can be explained by the variation in area.
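For a model with a single predictor, the other summary statistics follow from \(R^2\) and the degrees of freedom. The sketch below recomputes them from the printed values as a consistency check. The sample size of 545 is inferred from the 543 residual degrees of freedom plus two estimated coefficients, and the variable names are my own.

```r
r2 <- 0.2873     # multiple R-squared from the summary
df_resid <- 543  # residual degrees of freedom
n <- 545         # assumed: df_resid + 2 estimated coefficients

# Correlation between area and price (positive, because the slope is positive).
r <- sqrt(r2)

# Adjusted R-squared penalizes R-squared for the number of parameters.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# F statistic for a model with one predictor.
f_stat <- (r2 / (1 - r2)) * df_resid

round(c(r = r, adj_r2 = adj_r2, f_stat = f_stat), 3)
```

The results (a correlation of about 0.536, an adjusted \(R^2\) of about 0.286 and an F statistic of about 218.9) match the summary output up to rounding.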

Residual Analysis

In the following Q-Q plot, the residuals mostly fall along the reference line, which suggests that the residuals are approximately normally distributed.

qqnorm(resid(houses.lm))
qqline(resid(houses.lm))

In the plot below, the residuals are scattered about 0 with no obvious curved pattern, which suggests that a linear form is reasonable. However, most of the points are concentrated on the left side of the graph, at the lower fitted values. In a better model, the residuals would be spread more evenly about 0 across the whole range of fitted values.

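# Residuals vs. fitted values, with a dashed reference line at 0
plot(fitted(houses.lm), resid(houses.lm),
    xlab="Fitted values", ylab="Residuals")
abline(h=0, lty=2)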

par(mfrow=c(2,2))
plot(houses.lm)

Conclusion

In conclusion, the \(R^2\) of 28.7% and the Residuals vs. Fitted plot suggest that there are likely other variables that contribute to housing prices that we are not accounting for in our linear model. The Q-Q plot does show evidence that the residuals are approximately normal. This is an appropriate linear model as a first step, but I think further analysis using multiple linear regression would be appropriate to determine which combination of variables in the dataset best explains the variability in housing price.
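As a possible starting point for that follow-up, the sketch below fits a multiple regression on the same CSV and compares it with the area-only model. The choice of predictors (`bedrooms`, `bathrooms`, `stories`) is an assumption made purely for illustration, based on the numeric columns shown by `head(houses)`. It is not a result of this analysis, and no output is claimed here.

```r
# Same CSV as in the analysis above.
houses <- read.csv("https://raw.githubusercontent.com/kristinlussi/DATA605/main/WEEK11/Housing.csv")

# Area-only model from this report.
houses.lm <- lm(price ~ area, data = houses)

# Illustrative multiple regression; the predictor set is an assumption.
houses.multi <- lm(price ~ area + bedrooms + bathrooms + stories, data = houses)
summary(houses.multi)

# F-test on the nested models: do the extra predictors
# explain significantly more variability than area alone?
anova(houses.lm, houses.multi)
```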

Sources

Housing Prices Dataset
Textbook