#Introduction
This report builds a linear regression model to predict the sale price of residential properties in Ames, Iowa. The analysis covers data cleaning, transformation, modeling, and performance evaluation.
Bathrooms are not the only feature that contributes significantly to higher sale prices; overall quality (OverallQual), above-ground living area (GrLivArea), and neighborhood (Neighborhood) do as well.
#Data Description
The data cover residential property sales in Ames, Iowa, from 2006 to 2010 and come from the Ames Assessor’s Office. Rows with a missing “SalePrice” were removed.
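library(dplyr)      # provides %>% and filter()
library(ggplot2)    # provides ggplot() for the residual plot
library(knitr)      # provides kable() for the results table
ames <- read.csv("ames.csv")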
#Data Cleaning and Transformation
ames <- ames %>%
  filter(!is.na(SalePrice))                                   # drop rows without a sale price
ames$Neighborhood <- as.factor(ames$Neighborhood)            # treat neighborhood as categorical
ames$TotalBathrooms <- ames$FullBath + 0.5 * ames$HalfBath   # a half bath counts as 0.5
ames$LogSalePrice <- log(ames$SalePrice)                     # model price on the log scale
model <- lm(LogSalePrice ~ GrLivArea + OverallQual + Neighborhood, data = ames)
prediction <- predict(model, newdata = ames)                 # in-sample fitted values
rmse <- sqrt(mean((prediction - ames$LogSalePrice)^2))       # RMSE on the log scale
r2 <- summary(model)$r.squared
#Key Findings
- GrLivArea: larger above-ground living areas are associated with higher sale prices, as expected.
- OverallQual: higher quality ratings are strongly associated with higher prices.
- Neighborhood: certain neighborhoods are associated with higher prices.
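Because the response is the log of the sale price, each coefficient can be read as an approximate percentage effect: a one-unit increase in a predictor multiplies the predicted price by `exp(coefficient)`. The sketch below illustrates that reading on synthetic data. The data frame `toy`, its neighborhood labels, and the coefficients used to generate it are all hypothetical stand-ins, not values from the Ames data or from the fitted model in this report.

```r
# Illustrative only: synthetic data standing in for the Ames columns.
set.seed(1)
n <- 200
toy <- data.frame(
  GrLivArea    = round(runif(n, 800, 3000)),
  OverallQual  = sample(1:10, n, replace = TRUE),
  Neighborhood = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)

# Hypothetical data-generating process on the log scale (made-up coefficients).
nbhd_shift <- unname(c(A = 0, B = 0.10, C = 0.20)[as.character(toy$Neighborhood)])
toy$LogSalePrice <- 10.5 + 0.0003 * toy$GrLivArea + 0.10 * toy$OverallQual +
  nbhd_shift + rnorm(n, sd = 0.15)

# Same model formula as in the report, fitted to the toy data.
toy_model <- lm(LogSalePrice ~ GrLivArea + OverallQual + Neighborhood, data = toy)

# Percent change in predicted price per one-unit increase in each predictor
# (intercept dropped); neighborhood terms are relative to the baseline level "A".
pct_effect <- 100 * (exp(coef(toy_model)[-1]) - 1)
print(round(pct_effect, 2))
```

Applying the same last two lines to `model` instead of `toy_model` would give the corresponding percentages for the Ames fit, with each neighborhood compared against the first factor level.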
#Model Performance
kable(data.frame(RMSE = rmse, R_squared = r2))
| RMSE | R_squared |
|---|---|
| 0.1664275 | 0.8332147 |
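For reference, both metrics in the table are computed on the log scale and on the same data used to fit the model. Writing $y_i = \log(\text{SalePrice}_i)$, $\hat{y}_i$ for the fitted value, and $\bar{y}$ for the mean of the $y_i$ over the $n$ sales (notation introduced here for illustration, not taken from the code), they are:

$$
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

Since the response is logged, an RMSE of about 0.166 corresponds to a multiplicative error factor of about $e^{0.166} \approx 1.18$, a typical error on the order of 15 to 18 percent of the sale price. Because no data were held out, both figures likely understate the error on unseen sales.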
#Residual Plot
ggplot(data.frame(predicted = prediction, residual = ames$LogSalePrice - prediction),   # residual = observed - predicted
       aes(x = predicted, y = residual)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted Values", y = "Residuals", title = "Residuals vs. Predicted Values")