Housing prices are influenced by a wide range of socioeconomic, environmental, and structural factors. Understanding how these variables interact is critical for economists, policymakers, real estate professionals, and prospective homebuyers. This project applies statistical modeling and exploratory data analysis techniques to investigate the drivers of housing prices using two real-world datasets: the Boston Housing dataset and King County housing data.
In Part 1, we analyze the Boston Housing dataset to explore relationships between median home values and key covariates such as pollution levels, number of rooms, distance to employment centers, pupil-teacher ratios, and socioeconomic status indicators. We begin with exploratory data analysis and summary statistics, followed by bootstrap resampling to estimate the variability of the median home price. Correlation analysis and a multivariate linear regression model are then used to identify statistically significant predictors and evaluate model assumptions.
In Part 2, we extend this analysis to a larger and more modern dataset from King County, Washington. Using a random subsample of housing records, we explore price distributions, correlations among structural features, and construct a multivariate regression model to predict housing prices. Diagnostic plots and performance metrics such as Root Mean Squared Error (RMSE) are used to assess model fit and highlight limitations of linear regression in the presence of skewed and heteroscedastic data.
Overall, this project demonstrates how statistical inference, regression modeling, and diagnostic evaluation can be used to extract meaningful insights from housing data, while also illustrating the practical challenges of model assumptions when working with real-world datasets.
We will extract the Boston dataset from the MASS library in R.
We are interested in the variables:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data(Boston)
nox<-Boston[,5]
rm<-Boston[,6]
dis<-Boston[,8]
pr<-Boston[,11]
ls<-Boston[,13]
medv<-Boston[,14]
data.boston<- data.frame(medv, nox, rm, dis, pr, ls)
head(data.boston)
## medv nox rm dis pr ls
## 1 24.0 0.538 6.575 4.0900 15.3 4.98
## 2 21.6 0.469 6.421 4.9671 17.8 9.14
## 3 34.7 0.469 7.185 4.9671 17.8 4.03
## 4 33.4 0.458 6.998 6.0622 18.7 2.94
## 5 36.2 0.458 7.147 6.0622 18.7 5.33
## 6 28.7 0.458 6.430 6.0622 18.7 5.21
par(mfrow = c(2, 3))
hist(data.boston$medv, main = "Median Value of Homes",
xlab = "Median Value in $1000", col = "lightblue", border = "black")
hist(data.boston$nox, main = "Nitric Oxide Concentration",
xlab = "Nox parts per 10 million", col = "lightblue", border = "black")
hist(data.boston$rm, main = "Average Number of Rooms",
xlab = "Average Number of Rooms Per Dwelling", col = "lightblue", border = "black")
hist(data.boston$dis, main = "Weighted Distances",
xlab = "Distances to Employment Centers", col = "lightblue", border = "black")
hist(data.boston$pr, main = "Pupil-Teacher Ratio",
xlab = "Pupil-Teacher Ratio", col = "lightblue", border = "black")
hist(data.boston$ls, main = "Percentage of Lower Status in Populaton",
xlab = "Percentage of Lower Status", col = "lightblue", border = "black")
Let’s look at the summary statistics to see if they align with what we could conclude from the graphs.
You can also embed plots, for example:
summary(data.boston)
## medv nox rm dis
## Min. : 5.00 Min. :0.3850 Min. :3.561 Min. : 1.130
## 1st Qu.:17.02 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 2.100
## Median :21.20 Median :0.5380 Median :6.208 Median : 3.207
## Mean :22.53 Mean :0.5547 Mean :6.285 Mean : 3.795
## 3rd Qu.:25.00 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 5.188
## Max. :50.00 Max. :0.8710 Max. :8.780 Max. :12.127
## pr ls
## Min. :12.60 Min. : 1.73
## 1st Qu.:17.40 1st Qu.: 6.95
## Median :19.05 Median :11.36
## Mean :18.46 Mean :12.65
## 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :22.00 Max. :37.97
Based on the histogram of target variable medv, a median statistic should be used for a bootstrap simulation of the data because it is skewed, so the mean will be heavily affected by potential outliers while the median is less affected.
set.seed(123)
B = 3000
np_boot = replicate(B, {
boot_samp = sample(data.boston$medv, size = length(data.boston$medv), replace = TRUE)
median(boot_samp)
})
sd(np_boot)
## [1] 0.3794048
summary(np_boot)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.10 20.90 21.20 21.18 21.40 22.40
hist(np_boot,
main = "Histogram of Bootstrap Medians",
xlab = "Bootstrap Medians",
col = "lightblue",
border = "black",
breaks = 30)
Based on the histogram of the bootstrap medians, the sampling distribution of the median appears approximately normal.
With 3,000 bootstrap resamples, the distribution of the median appears approximately normal, consistent with the Central Limit Theorem.
Although the original median home value data is skewed, the bootstrap distribution of the median appears approximately normal, allowing normal-based inference.
cor_matrix = cor(data.boston)
print(cor_matrix)
## medv nox rm dis pr ls
## medv 1.0000000 -0.4273208 0.6953599 0.2499287 -0.5077867 -0.7376627
## nox -0.4273208 1.0000000 -0.3021882 -0.7692301 0.1889327 0.5908789
## rm 0.6953599 -0.3021882 1.0000000 0.2052462 -0.3555015 -0.6138083
## dis 0.2499287 -0.7692301 0.2052462 1.0000000 -0.2324705 -0.4969958
## pr -0.5077867 0.1889327 -0.3555015 -0.2324705 1.0000000 0.3740443
## ls -0.7376627 0.5908789 -0.6138083 -0.4969958 0.3740443 1.0000000
# We can just look at the "lower half" of the correlation matrix since it is symmetric.
corrplot(cor(data.boston[, 1:6]), type = "lower")
# What if we don't care about negative and positive and just want the magnitude of correlation?
corrplot(abs(cor(data.boston[,1:6])), method = "shade", type = "lower", col.lim = c(0, 1))
With the target variable being median value of homes
(medv), we only care about the first column of the
correlation plot.
The variable with the least interactions/correlation for
median value of homes (medv) is the weighted distances to
five Boston employment centers (dis). This may be because
much of Boston’s population lives outside of the city and more towards
suburban areas, so the demand for living close to employment centers is
not that extreme to where it heavily impacts the median home value.
The variables with the greatest interactions/correlation for
median value of homes (medv) include: percentage of lower
status in population (ls) and average number of rooms
(rm) where ls is negatively correlated to
medv while rm is positively correlated. This
means that if the percentage of lower status in the population is
greater, then the median value of homes would most likely be
significantly less. On the other hand, if the average number of rooms is
greater, then the median value of homes would most likely be
significantly more. Thus, there is a strong correlation between
ls and rm covariates with medv.
This could support how many Bostonians would prefer to live in places
that are more “safe” and are larger with more rooms as it seems to be a
strict demand.
In terms of correlation between covariates themselves, it can be seen
that the weighted distances to five Boston employment centers
(dis) and the nitric oxide concentration has a very strong
correlation. This may be because the urban areas of Boston have very
high levels of pollution while places farther from the city have less
pollution. This does make sense as downtown and other very urban areas
have greater population density and production of nitric oxides.
For least correlation between the covariates, pupil-teacher ratio
(pr) and nox has very little correlation
probably because it does not directly affect each other. rm
and dis also have low correlation, but still positive which
could signify that farther away from urban areas mean more land and more
rooms for the house although it isn’t direct.
Because several covariates are correlated with one another, a multivariate linear regression model allows us to evaluate the effect of each predictor on median home value while controlling for the others. This approach reduces omitted variable bias and provides a clearer understanding of how multiple socioeconomic and environmental factors jointly influence housing prices.
Recall that the equation for a multivariate linear regression for the Boston data we have is: \[\text{medv}_i = \beta_0 + \beta_1\text{nox}_i+ \beta_2\text{rm}_i + \beta_3 \text{dis}_i + \beta_4 \text{pr}_i + \beta_5 \text{ls}_i + \epsilon_i \quad i = 1, 2, ..., n\] * \(\beta_0 = \text{intercept}\) * \(\beta_1, ..., \beta_5 = \text{slopes (effect of each predictor holding others constant)}\) * \(\epsilon_i = \text{error term (often assumed mean 0, constant variance)}\)
# Randomly split the data into two parts
# 80% for training and 20% for testing
split = sample.split(data.boston$medv, SplitRatio = 0.8)
train_split = data.boston[split == TRUE, ]
test_split = data.boston[split == FALSE, ]
boston_lm = lm(medv ~ nox + rm + dis + pr + ls, data = train_split)
plot(boston_lm)
For the residuals vs fitted plot, it seems like the residuals are randomly scattered around the horizontal line suggesting that linear assumption is possible. There are a couple of outliers that are very far from the horizontal line which could signify some errors made by the model, but there seems to be an equal number of points above and below the line which supports linearity.
Overall, the QQ residuals plot does tend to align with the dashed line meaning that overall it supports the idea that the residuals are normally distributed. Towards the ends (< -2 & > 2 Theoretical Quantiles) is when it deviates from the line which may suggest that there may be a few extreme values. This means that outside of the second standard deviations of the normal curve, is when data starts to be “less normal” and could limit our regression model in terms of accuracy.
The scale-location plot shows have points relatively scattered and the line is close to horizontal but not exactly which could suggest that it would probably be mostly homoscedasticity. In terms of the residuals vs leverage plot, there are some points that are in the far right, so removing some of these points could critically affect the linear regression model. One thing to be a bit worried about is that the points do seem to cluster at certain points as it suggests that the variance of the residuals is not constant across all levels of the fitted values. This can potentially violate the assumption of homoscedasticity which we want for the model.
The residuals vs leverage plot does have some outliers in the far right which could mean that these data points (if removed) could significantly impact our regression model. However, none of the points are beyond the Cook’s distance so this is a beneficial thing to take about the model as no points are incredibly influential.
summary(boston_lm)
##
## Call:
## lm(formula = medv ~ nox + rm + dis + pr + ls, data = train_split)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4827 -2.9234 -0.8236 1.8485 27.8618
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.74529 5.10764 7.390 7.77e-13 ***
## nox -18.87990 3.50365 -5.389 1.17e-07 ***
## rm 4.31837 0.44408 9.724 < 2e-16 ***
## dis -1.28107 0.19348 -6.621 1.07e-10 ***
## pr -1.06806 0.12113 -8.818 < 2e-16 ***
## ls -0.57497 0.05084 -11.310 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.985 on 428 degrees of freedom
## Multiple R-squared: 0.7243, Adjusted R-squared: 0.7211
## F-statistic: 224.9 on 5 and 428 DF, p-value: < 2.2e-16
It can be seen that average number of rooms (rm) and
percentage of low status population (ls) are the covariates
which have the greatest predictive power on the median home price in
Boston because of its high absolute test statistic and its very low
p-value which signifies statistical significance.
This finding does back our hypothesis on how rm and
ls are the covariates which have the greatest correlation
for medv using the correlation plot from earlier.
This makes sense as more rooms typically indicate larger and more expensive houses, which is valuable information for homebuyers. The information about low status population can be important for economists analyzing socioeconomic factors in housing markets.
In addition, the overall residuals do look pretty good where the median is close to 0 meaning that overall, the differences between the predicted values and the actual values are not extreme. It is a quite skewed, however, where the quantiles and ranges are not symmetrical which is something to note about the validity and accuracy of our linear model.
prediction = predict(boston_lm, newdata = test_split)
data.frame(Actual = test_split$medv, Predicted = prediction)
## Actual Predicted
## 7 22.9 23.309550
## 11 15.0 19.267334
## 25 15.6 15.732544
## 31 12.7 11.412493
## 36 18.9 23.567936
## 39 24.7 22.827825
## 59 23.3 20.732762
## 60 19.6 19.565360
## 61 18.7 16.126652
## 67 19.4 23.668308
## 69 17.4 18.143362
## 72 21.7 22.728907
## 77 20.0 24.563627
## 82 23.9 26.922711
## 87 22.5 22.418506
## 92 22.0 28.484824
## 104 19.3 20.901885
## 107 19.5 17.245996
## 108 20.4 21.243750
## 123 20.5 19.135331
## 126 21.4 21.157611
## 133 23.0 21.462860
## 134 18.4 16.657076
## 138 17.1 20.433666
## 159 24.3 30.869474
## 163 50.0 40.596221
## 164 50.0 42.110299
## 166 25.0 28.403253
## 190 34.9 35.338457
## 216 25.0 24.844034
## 218 28.7 28.573712
## 239 23.7 28.335808
## 246 18.5 12.607924
## 249 24.5 21.488242
## 251 24.4 24.353423
## 252 24.8 25.469999
## 287 20.1 18.863092
## 298 20.3 20.205362
## 309 22.8 29.614451
## 311 16.1 18.802495
## 315 23.8 26.230604
## 319 23.1 24.900158
## 320 21.0 21.774506
## 328 22.2 19.480229
## 330 22.6 27.458073
## 331 19.8 24.791481
## 346 17.5 18.971176
## 350 26.6 23.920397
## 352 24.1 21.973542
## 354 30.1 22.779303
## 355 18.2 12.716039
## 362 19.9 17.527643
## 365 21.9 35.048422
## 370 50.0 29.234323
## 374 13.8 3.248799
## 378 13.3 18.889107
## 380 10.2 16.076835
## 392 23.2 15.521549
## 396 13.1 18.976734
## 402 7.2 16.778294
## 409 17.2 12.112779
## 410 27.5 21.238424
## 423 20.8 18.361748
## 431 14.5 19.785590
## 432 14.1 20.655838
## 447 14.9 16.699127
## 451 13.4 18.849702
## 454 17.8 21.867826
## 468 19.1 15.555219
## 485 20.6 18.075918
## 486 21.2 21.224290
## 489 15.2 15.615239
ggplot(test_split, aes(x = medv, y = prediction)) +
geom_point(alpha = 0.5) +
geom_abline(intercept = 0, slope = 1, color = "red") +
labs(title = "Actual vs Predicted Prices", x = "Actual Prices", y = "Predicted Prices")
# Root mean squared error:
paste("The root mean squared error:", sqrt(mean((test_split$medv - prediction)^2)))
## [1] "The root mean squared error: 5.06851220452036"
The \(y=x\) line is the benchmark
where it represents perfect prediction. This is where the predicted
price is exactly the same as the actual price. While testing, the closer
our points are to the red line means the better our linear regression
model does in predicting the most accurate medv prices.
The smaller the root mean squared error is, the closer on average the expected values are to the actual values. In this case, the root mean squared error is about 5. As this Boston linear model has a reasonable root mean squared error close to 0, it suggests that this model performs reasonably well in predicting the median home values. Because this model is good for predicting home values as seen with the relatively low root mean squared error, it can potentially be used by market analysts, real estate agents, and home buyers. This model can really help home buyers see what covariates they should consider the most and how it effects the overall value of the house(s) they might buy.
We will now look at market housing trends in King County, Washington. This is because we can see similarities and differences between what influences housing prices in King County versus Boston. Recall that Boston is in the east coast while King County is in the west coast, so we can expect that there will be some differences that may be present.
Furthermore, this King County data that we are importing has a lot more variables which will make trying to predict home prices even more interesting.
king_county_housing <- read.csv("kc_house_data.csv")
head(king_county_housing)
## id date price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000 221900 3 1.00 1180 5650
## 2 6414100192 20141209T000000 538000 3 2.25 2570 7242
## 3 5631500400 20150225T000000 180000 2 1.00 770 10000
## 4 2487200875 20141209T000000 604000 4 3.00 1960 5000
## 5 1954400510 20150218T000000 510000 3 2.00 1680 8080
## 6 7237550310 20140512T000000 1225000 4 4.50 5420 101930
## floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1 1 0 0 3 7 1180 0 1955
## 2 2 0 0 3 7 2170 400 1951
## 3 1 0 0 3 6 770 0 1933
## 4 1 0 0 5 7 1050 910 1965
## 5 1 0 0 3 8 1680 0 1987
## 6 1 0 0 3 11 3890 1530 2001
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98178 47.5112 -122.257 1340 5650
## 2 1991 98125 47.7210 -122.319 1690 7639
## 3 0 98028 47.7379 -122.233 2720 8062
## 4 0 98136 47.5208 -122.393 1360 5000
## 5 0 98074 47.6168 -122.045 1800 7503
## 6 0 98053 47.6561 -122.005 4760 101930
set.seed(123)
king_county_housing <- slice_sample(king_county_housing, n = 3000)
head(king_county_housing)
## id date price bedrooms bathrooms sqft_living sqft_lot
## 1 4254000220 20150307T000000 475000 4 2.50 2040 16200
## 2 7302000610 20150508T000000 316000 4 1.50 2120 46173
## 3 7855600820 20140904T000000 802000 4 2.25 2130 8734
## 4 4139460200 20150325T000000 905000 4 2.50 3330 9557
## 5 8901500178 20141030T000000 700000 4 2.25 2440 9450
## 6 2730000270 20150212T000000 178500 3 1.00 900 10511
## floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1 2.0 0 0 3 8 2040 0 1997
## 2 2.0 0 0 3 7 2120 0 1974
## 3 2.0 0 2 4 8 2130 0 1961
## 4 2.0 0 0 3 10 3330 0 1995
## 5 1.5 0 0 3 7 2440 0 1947
## 6 1.0 0 0 4 6 900 0 1961
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98019 47.7366 -121.958 2530 15389
## 2 0 98053 47.6503 -121.968 2000 46173
## 3 0 98006 47.5672 -122.161 2550 8800
## 4 0 98006 47.5526 -122.102 3360 9755
## 5 2014 98125 47.7061 -122.307 1720 7503
## 6 0 98001 47.2883 -122.272 1460 10643
# Time to plot the histogram:
ggplot(king_county_housing, aes(x = price)) +
geom_histogram(binwidth = 50000, fill = "lightblue", color = "gray") +
labs(title = "Histogram of House Prices", x = "Price", y = "Frequency")
summary(king_county_housing)
## id date price bedrooms
## Min. :1.200e+06 Length:3000 Min. : 82000 Min. :0.000
## 1st Qu.:2.131e+09 Class :character 1st Qu.: 324666 1st Qu.:3.000
## Median :3.905e+09 Mode :character Median : 460000 Median :3.000
## Mean :4.614e+09 Mean : 542444 Mean :3.344
## 3rd Qu.:7.305e+09 3rd Qu.: 649960 3rd Qu.:4.000
## Max. :9.842e+09 Max. :5300000 Max. :8.000
## bathrooms sqft_living sqft_lot floors waterfront
## Min. :0.000 Min. : 390 Min. : 690 Min. :1.000 Min. :0.00
## 1st Qu.:1.500 1st Qu.:1420 1st Qu.: 5000 1st Qu.:1.000 1st Qu.:0.00
## Median :2.250 Median :1880 Median : 7530 Median :1.000 Median :0.00
## Mean :2.097 Mean :2058 Mean : 14654 Mean :1.487 Mean :0.01
## 3rd Qu.:2.500 3rd Qu.:2552 3rd Qu.: 10464 3rd Qu.:2.000 3rd Qu.:0.00
## Max. :6.000 Max. :8010 Max. :920423 Max. :3.500 Max. :1.00
## view condition grade sqft_above
## Min. :0.0000 Min. :1.000 Min. : 4.000 Min. : 390
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.: 7.000 1st Qu.:1180
## Median :0.0000 Median :3.000 Median : 7.000 Median :1550
## Mean :0.2357 Mean :3.417 Mean : 7.658 Mean :1769
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.: 8.000 3rd Qu.:2180
## Max. :4.0000 Max. :5.000 Max. :13.000 Max. :6220
## sqft_basement yr_built yr_renovated zipcode
## Min. : 0.0 Min. :1900 Min. : 0.0 Min. :98001
## 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.0 1st Qu.:98033
## Median : 0.0 Median :1975 Median : 0.0 Median :98065
## Mean : 289.5 Mean :1971 Mean : 87.8 Mean :98078
## 3rd Qu.: 550.0 3rd Qu.:1996 3rd Qu.: 0.0 3rd Qu.:98117
## Max. :3500.0 Max. :2015 Max. :2015.0 Max. :98199
## lat long sqft_living15 sqft_lot15
## Min. :47.18 Min. :-122.5 Min. : 399 Min. : 659
## 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1480 1st Qu.: 5040
## Median :47.58 Median :-122.2 Median :1840 Median : 7560
## Mean :47.56 Mean :-122.2 Mean :1983 Mean : 12589
## 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2370 3rd Qu.: 10008
## Max. :47.78 Max. :-121.3 Max. :5790 Max. :434728
# Calculate the correlation matrix
cor_matrix <- cor(king_county_housing %>% dplyr::select(price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated))
# Plot the correlation matrix
corrplot(cor_matrix, type = "upper", tl.cex = 0.8, tl.srt = 45)
From the correlation plot, it seems like the variables with the lowest interactions are waterfront, condition, year built especially with the price. This may be because there are new homes that are not that high demand and homes being based on condition may have some subjectivity as different buyers might perceive the home’s looks and conditions differently.
Variables with the highest interactions are living square footage
(sqft_living) and sqft_above,
grade, and bathrooms especially with price.
This is because a larger house with greater space is more needed for the
typical buyer as it grants more things to do and furniture which is the
ultimate goal especially for a larger family. Furthermore, grade is the
ultimate measurable factor on if a house is “good” (like a rating for
any subject) which many buyers heavily consider when trying to purchase
a home.
From interpreting this correlation plot, we hypothesize that the most
important covariate predictors of King County home prices during 2014 to
2015 include sqft_living, sqft_above,
sqft_basement, grade, and
bathrooms. We can also predict that
sqft_living and grade will have the greatest
positive correlation with the price.
It can also be seen that every covariate has a positive correlation with the price. This means that the greater value of each covariate will theoretically result in higher prices.
Similar to the Boston analysis, there are several covariates are
correlated with one another. As we identified, bathrooms,
sqft_living, sqft_above, and
grade have relatively strong correlation with other
covaraites. Therefore, a multivariate linear regression model
is a good idea for a prediction model because it allows us to evaluate
the effect of each predictor on median home value while controlling for
the others. This approach reduces omitted variable bias and provides a
clearer understanding of how multiple socioeconomic and environmental
factors jointly influence housing prices.
set.seed(123)
split <- sample.split(king_county_housing$price, SplitRatio = 0.7)
training_data <- subset(king_county_housing, split == TRUE)
testing_data <- subset(king_county_housing, split == FALSE)
kc_model <- lm(price ~ sqft_living + grade + bathrooms + sqft_above
+ sqft_basement, data = training_data)
plot(kc_model)
The diagnostic plots for this linear regression model on King County housing data is worse than the Boston one in terms of normality of variance and to have a linear model that properly and accurate predicts prices based on covariates.
The residuals vs fitted plot does show a trend where there is a huge cluster to the left of the plot and then it suddenly starts to scatter towards the right. The point are not really random around the red line of the graph, so this suggests that the residuals are not evenly distributed which could indicate a potential non-linearity relationship between the covariates and the housing prices.
The Q-Q plot does have greater deviation from the dashed line after the second standard deviations suggesting that the residuals are skewed rather than normal which raises a red flag about our linear regression model and how the residuals may have different values of certain data.
The scale-location plot has extreme clustering towards the left of the graph rather than having points equally spaced out around the whole graph. This violates the assumption of homoscedasticity. This suggests that the variance of the residuals changes with the level of the predictors. We want our residuals to be evenly distributed throughout all of our predictive data, so that our model can properly predict the prices for all potential covariate values. However, the clustering here can potentially limit the accuracy of our model’s predictions.
The residuals vs leverage plot looks much worse than the plot of the regression model for the Boston dataset. Not only does the red line not stay around 0, but there are points in the graph that lie to the far right and even outside of Cook’s distance. Therefore, if we were to remove some points that are the outliers, this will heavily affect how our linear regression model.
summary(kc_model)
##
## Call:
## lm(formula = price ~ sqft_living + grade + bathrooms + sqft_above +
## sqft_basement, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -816582 -146507 -20762 103551 3184689
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -712467.63 44213.82 -16.114 < 2e-16 ***
## sqft_living 272.11 14.06 19.358 < 2e-16 ***
## grade 120465.47 7768.32 15.507 < 2e-16 ***
## bathrooms -33735.69 11245.68 -3.000 0.00273 **
## sqft_above -85.25 14.16 -6.021 2.02e-09 ***
## sqft_basement NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 260000 on 2259 degrees of freedom
## Multiple R-squared: 0.5657, Adjusted R-squared: 0.5649
## F-statistic: 735.5 on 4 and 2259 DF, p-value: < 2.2e-16
Based on the summary statistics: All of the five covariates chosen
for the linear model may have significant predictive power on the price
of homes. In fact, living square footage (sqft_living) and
grade (grade) have the greatest predictive power due to
their high test statistic and also their low p-values which stands for
greater statistical significance. This does back our hypothesis that
sqft_living and grade have some of the
greatest correlation to the price of all the covariates. This may be due
to the fact that the price of a home depends on how large it is as it is
what the buyer is ultimately getting from their purchase. Therefore,
with a constant distance and other variables, a very large house with
large living room space for a large family should cost more than a
smaller house with a small living room for an individual.
Furthermore, many buyers and real estate professionals look for the
grade of the house and whether it is “good” or “bad”. Therefore the
grade plays a significant role on the price of the home. In conclusion,
homebuyers would greatly benefit from this model focusing on the
predictive power of sqft_living and grade
because they are critical factors influencing the price of a home. By
understanding how these variables impact pricing, homebuyers can make
more informed decisions about which properties offer the best value for
their needs and budget.
prediction <- predict(kc_model, newdata = testing_data)
ggplot(testing_data, aes(x = price, y = prediction)) +
geom_point(alpha = 0.5) +
geom_abline(intercept = 0, slope = 1, color = "red") +
labs(title = "Actual vs Predicted Prices", x = "Actual Prices", y = "Predicted Prices")
paste("Average housing prices of King County dataset:", mean(king_county_housing$price))
## [1] "Average housing prices of King County dataset: 542444.273"
# RMSE:
paste("The root mean squared error:", sqrt(mean((testing_data$price - prediction)^2)))
## [1] "The root mean squared error: 198072.992144565"
This is quite a high root mean squared error of about 198,073 dollars. Because our mean house prices based on our randomly subsetted data set of King County homes is around 542,000 dollars, the root mean squared error of 198,073 dollars is very significant to how accurate our model is in predictive home prices based on the covariates. This may be due to the fact that our data is quite skewed particularly with the residuals as seen from the diagnostics which will impact our RMSE.
One way we can work towards being better at predicting the prices and seeing the correlations of price and covariates is to potentially choose a different model. Maybe a linear regression model is not fit for this specific scenario as correlations may not be linear. We can also try to find data with lower covariate correlation to improve the accuracy of our model. By either improving our linear regression model or choosing a better model for this data set, we can try to make the overall residuals lower and more normal without huge trends and that will benefit our model and reduce the RMSE.