The dataset I will be working with is the Housing Prices Dataset from Kaggle. I will be creating a linear model that explains the relationship between area and housing price.
houses <- read.csv("https://raw.githubusercontent.com/kristinlussi/DATA605/main/WEEK11/Housing.csv")
head(houses)
## price area bedrooms bathrooms stories mainroad guestroom basement
## 1 13300000 7420 4 2 3 yes no no
## 2 12250000 8960 4 4 4 yes no no
## 3 12250000 9960 3 2 2 yes no yes
## 4 12215000 7500 4 2 2 yes no yes
## 5 11410000 7420 4 1 2 yes yes yes
## 6 10850000 7500 3 3 1 yes no yes
## hotwaterheating airconditioning parking prefarea furnishingstatus
## 1 no yes 2 yes furnished
## 2 no yes 3 no furnished
## 3 no no 2 yes semi-furnished
## 4 no yes 3 yes furnished
## 5 no yes 2 no furnished
## 6 no yes 2 yes semi-furnished
The plot below shows a moderate positive correlation between area and housing price, with considerable scatter around the trend.
plot(houses[,"area"],houses[,"price"], main="Area vs Housing Price",
xlab="Area", ylab="Housing Price")
houses.lm <- lm(price~area, data = houses)
houses.lm
##
## Call:
## lm(formula = price ~ area, data = houses)
##
## Coefficients:
## (Intercept) area
## 2387308 462
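The y-intercept is 2387308 and the slope is 462, so the fitted line is \(\widehat{\text{price}} = 2387308 + 462 \times \text{area}\). Each additional unit of area is associated with an increase of about 462 in predicted price.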
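To make the intercept and slope concrete, here is a minimal sketch that evaluates the fitted line by hand. The helper `predict_price` and the example area of 5000 are assumptions chosen for illustration, and the coefficients are the rounded values printed above, so results may differ slightly from what the fitted model object would return.

```r
# Coefficients copied from the printed houses.lm output (rounded as displayed)
intercept <- 2387308
slope <- 462

# Hypothetical helper, not part of the original analysis:
# evaluates the fitted line price = intercept + slope * area
predict_price <- function(area) intercept + slope * area

# Predicted price for an illustrative area of 5000
predicted <- predict_price(5000)
print(predicted)
```

With the model object in the session, `predict(houses.lm, newdata = data.frame(area = 5000))` would do the same calculation using the unrounded coefficients.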
plot(price~area, data = houses)
abline(houses.lm)
summary(houses.lm)
##
## Call:
## lm(formula = price ~ area, data = houses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4867112 -1022228 -200135 683027 7484838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.387e+06 1.745e+05 13.68 <2e-16 ***
## area 4.620e+02 3.123e+01 14.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1581000 on 543 degrees of freedom
## Multiple R-squared: 0.2873, Adjusted R-squared: 0.286
## F-statistic: 218.9 on 1 and 543 DF, p-value: < 2.2e-16
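For a “good” model, the residuals would have a median around 0, min and max values of about the same magnitude, and first and third quartiles of about the same magnitude. Here the median is -200135 and the max (7484838) is noticeably larger in magnitude than the min (-4867112), which suggests the residuals are skewed to the right. The values in the summary above are not the best, but we will keep going.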
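The symmetry checks described above can be put into numbers. This is a rough sketch using only the five residual quantiles printed in the summary. The ratio names are assumptions for illustration, and ratios near 1 would indicate a roughly symmetric residual distribution.

```r
# Residual quantiles copied from the summary(houses.lm) output
res_summary <- c(min = -4867112, q1 = -1022228, median = -200135,
                 q3 = 683027, max = 7484838)

# Compare the magnitude of the upper and lower tails (1 means symmetric)
tail_ratio <- unname(abs(res_summary["max"]) / abs(res_summary["min"]))

# Compare the magnitude of the third and first quartiles (1 means symmetric)
quartile_ratio <- unname(abs(res_summary["q3"]) / abs(res_summary["q1"]))

round(c(tail_ratio = tail_ratio, quartile_ratio = quartile_ratio), 2)
```

Neither ratio is close to 1, which is consistent with the residuals being less symmetric than we would like.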
We want to see a standard error that is at least 5-10 times smaller than the corresponding coefficient. For area, the standard error (31.23) is 14.79 times smaller than the coefficient estimate (462.0). This large ratio means that there is little variability in the slope estimate.
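The ratio of each coefficient to its standard error is the t value reported in the coefficient table. As a quick check, the sketch below recomputes it from the rounded estimates and standard errors in the summary. The variable names are assumptions for illustration.

```r
# Estimates and standard errors copied from the summary(houses.lm) table
est <- c(intercept = 2.387e+06, area = 4.620e+02)
se  <- c(intercept = 1.745e+05, area = 3.123e+01)

# Coefficient divided by its standard error gives the t value
ratio <- est / se
round(ratio, 2)
```

Both ratios are well above the 5-10 rule of thumb and agree with the `t value` column up to rounding.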
The p-value for area has three asterisks (***), which means that the p-value is in the range \(0 < p \leq 0.001\); the summary reports it as less than 2e-16. Since this value is so small, there is strong evidence that the slope is not zero, that is, of a linear relationship between area and housing price.
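For reference, the p-value in the coefficient table is a two-sided tail probability from a t distribution with the residual degrees of freedom. The sketch below reproduces it approximately from the rounded t value and the 543 degrees of freedom shown in the summary. The variable names are assumptions for illustration.

```r
# t value for area and residual degrees of freedom, copied from the summary
t_area <- 14.79
df_resid <- 543

# Two-sided p-value: probability of a t statistic at least this extreme
p_area <- 2 * pt(-abs(t_area), df = df_resid)

# Check it against the threshold for three asterisks
p_area < 0.001
```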
The multiple R-squared is 0.2873, which means that about 28.7% of the variability in price can be explained by the variation in area.
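In a simple linear regression, R-squared is the square of the correlation between the predictor and the response, and the adjusted R-squared follows from R-squared and the sample size. The sketch below works both out from the summary values. The sample size of 545 is inferred from the 543 residual degrees of freedom plus the two estimated coefficients, and the variable names are assumptions for illustration.

```r
# Values copied from the summary(houses.lm) output
r_squared <- 0.2873
n <- 545  # inferred: 543 residual degrees of freedom + 2 coefficients
p <- 1    # number of predictors (area only)

# Correlation between area and price; positive because the slope is positive
r <- sqrt(r_squared)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(c(r = r, adj_r_squared = adj_r_squared), 3)
```

The adjusted value matches the 0.286 reported in the summary, and the implied correlation is about 0.54.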
In the following Q-Q plot, the residuals mostly fall along the reference line, which suggests that the normality assumption for the residuals is reasonable.
qqnorm(resid(houses.lm))
qqline(resid(houses.lm))
In the plot below, the residuals are scattered about 0 with no obvious curved pattern, which suggests that a linear form is reasonable. However, most of the residuals are concentrated towards the left side of the graph. In a better model, the residuals would be spread more evenly across the range of fitted values.
plot(fitted(houses.lm),resid(houses.lm))
par(mfrow=c(2,2))
plot(houses.lm)
In conclusion, with an \(R^2\) of 28.7% and the Residuals vs. Fitted plot, we can see that there are likely other variables that contribute to housing prices that we are not accounting for in our linear model. The Q-Q plot does show evidence of normality. This is an appropriate linear model, but I think further analysis using multiple linear regression would be appropriate to determine which combination of variables in the dataset best explains the variability in housing price.